NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.

Integration Vignette 010


People

Aim(s)

  • list and count types of genes and transcripts
  • explain types and additional remarks used
  • give other useful stats

Datasets

Data          Build           Location   Notes
Gencode v3c   GRCh37 (hg19)   gtf file   format description
 Please note that the GENCODE release we agreed to use for all analyses is 3c.
 This data is displayed in the UCSC browser as the "Oct 2009" track and corresponds to Ensembl version 56,
 which can be queried here.
 Differences between the GENCODE data (as shown in the UCSC browser and the GTF file) and Ensembl are:
   - only the main 24 chromosomes are used for the GENCODE file, omitting lower-level and haplotype annotation
   - the genes in the PAR region are listed with their locations on the X as well as on the Y chromosome.

Method

Command-line utilities, Excel, etc. are used to dissect the GENCODE GTF file in various ways.

Perl scripts are used to analyze genes, transcripts, exons, etc. using the GTF file or the Ensembl 56 database.
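
As a rough illustration of this kind of dissection, the following Python sketch tallies gene and transcript types from GTF lines. It is not the Perl code referred to above. It assumes that column 9 carries the attribute keys gene_id, transcript_id, gene_type and transcript_type, and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: column 9 holds pairs of the form  key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(line):
    """Return the column-9 attributes of one GTF line as a dict (empty for comments)."""
    if line.startswith("#") or not line.strip():
        return {}
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 9:
        return {}
    return dict(ATTR_RE.findall(cols[8]))


def count_types(lines):
    """Count distinct genes and transcripts per type, keyed by their IDs."""
    gene_type = {}
    transcript_type = {}
    for line in lines:
        attrs = parse_attributes(line)
        if "gene_id" in attrs and "gene_type" in attrs:
            gene_type[attrs["gene_id"]] = attrs["gene_type"]
        if "transcript_id" in attrs and "transcript_type" in attrs:
            transcript_type[attrs["transcript_id"]] = attrs["transcript_type"]
    return Counter(gene_type.values()), Counter(transcript_type.values())
```

Keying on the IDs rather than on a particular feature row means each gene and transcript is counted once, however many exon or CDS lines mention it. The function accepts any iterable of lines, for example an open file handle on a local copy of the GTF file.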

This explains the content of the GTF file in detail.

This shows a description of gene and transcript types with stats for 3c as well as the numbers of genes and transcripts of each type.


  • Can the analysis be repeated on a new freeze rapidly?

YES

What does this show?

Counts

Gene level

All genes: 47553

Protein-coding genes: 22550

Genes containing manually annotated Havana transcripts: 29210

Protein-coding genes containing manually annotated Havana transcripts: 14348

-> extrapolation: 63.6% of protein-coding genes (14348 of 22550) have been manually annotated

Genes containing additional Ensembl transcripts: 9101

Genes containing only Ensembl transcripts: 18343
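
The gene-level counts can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers listed above; the variable names are made up for the example.

```python
# Gene-level counts as listed above.
all_genes = 47553
protein_coding = 22550
havana_genes = 29210
havana_protein_coding = 14348
ensembl_only_genes = 18343

# Share of protein-coding genes that contain Havana transcripts.
havana_pc_percent = 100 * havana_protein_coding / protein_coding

# Share of all genes that contain Havana transcripts.
havana_all_percent = 100 * havana_genes / all_genes

# Genes with Havana transcripts and Ensembl-only genes should add up to all genes.
partition_ok = havana_genes + ensembl_only_genes == all_genes

print(round(havana_pc_percent, 1), round(havana_all_percent, 1), partition_ok)
```

The 63.6% figure corresponds to the protein-coding ratio; taken over all genes the share is slightly lower. The two gene sets add up exactly to the total.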

"Genic regions" (exons, all genes): 61,010,745 bp (1.97%)

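One way to obtain a "genic regions" figure of this kind is to merge overlapping exons so that each base is counted once. The sketch below shows that approach for a single chromosome; whether the number above was computed exactly this way is not stated, and the genome length used for the percentage is an assumption (the 24 main hg19 chromosomes, gaps included).

```python
def merged_length(intervals):
    """Bases covered by 1-based inclusive intervals, counting overlaps once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # Gap before this interval: close the current run and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or directly adjacent: extend the current run.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


# Assumption: total length of chr1-22, X and Y in hg19, including gaps.
ASSUMED_GENOME_BP = 3_095_677_412
GENIC_BP = 61_010_745  # figure reported above
genic_percent = 100 * GENIC_BP / ASSUMED_GENOME_BP
```

For a whole-genome total, merged_length would be applied per chromosome and the results summed. With the assumed genome length, the reported base count comes out close to 1.97%.
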
Transcript level

All transcripts: 132067

Protein-coding transcripts: 68880

Manually annotated Havana transcripts: 87940

Ensembl-only transcripts: 44127

Distinct peptide variants: 63014

Transcripts/Gene (genome-wide, protein-coding genes with > 1 transcript):

Mean: 4.9

Maximum: 8.4 (chromosome 3; presumably the highest per-chromosome mean)
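
A transcripts-per-gene mean restricted to genes with more than one transcript could be computed as sketched below. The input mapping and the function name are hypothetical; in practice the mapping would be built from the transcript_id and gene_id attributes of the GTF file.

```python
from collections import Counter
from statistics import mean


def mean_transcripts_per_gene(transcript_to_gene, min_transcripts=2):
    """Mean transcript count over genes having at least min_transcripts transcripts."""
    per_gene = Counter(transcript_to_gene.values())
    kept = [n for n in per_gene.values() if n >= min_transcripts]
    return mean(kept) if kept else 0.0
```

Restricting the same mapping to one chromosome at a time would give per-chromosome values.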

Lengths

A. All genes & transcripts:

Element                        mean       median     std.dev
Gene lengths (genomic)         30822.35   2996.00    87995.03
Transcript lengths (genomic)   38254.67   10651.00   87897.11
Transcript lengths (cDNA)      1713.26    955.00     1960.13
Exon lengths                   244.75     129.00     497.78
Intron lengths                 6090.21    1545.00    19184.72
Number of exons / transcript   7.00       4.00       8.05
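
Summary statistics of this kind can be reproduced from feature coordinates along the following lines. This is a minimal sketch: it assumes 1-based inclusive GTF coordinates and uses the population standard deviation, since the wiki does not say which estimator was used.

```python
from statistics import mean, median, pstdev


def feature_length(start, end):
    """Length of a feature given 1-based inclusive start and end coordinates."""
    return end - start + 1


def summarize(lengths):
    """Mean, median and (population) standard deviation of a list of lengths."""
    return {
        "mean": mean(lengths),
        "median": median(lengths),
        "std.dev": pstdev(lengths),
    }


# Toy coordinates, not taken from the annotation.
example = [feature_length(s, e) for s, e in [(1, 100), (201, 300), (1001, 1400)]]
stats = summarize(example)
```

Length distributions are strongly right-skewed, so the mean is expected to sit well above the median, as in the toy example.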


B. Protein-coding genes & transcripts only:

Element                               mean       median     std.dev
Gene lengths                          56869.16   20055.50   117499.21
Transcript lengths (protein_coding)   2462.10    1910.00    2240.73
Exon lengths                          246.72     129.00     529.29
Intron lengths                        6102.83    1597.00    19131.92
5' UTR lengths                        220.71     130        196.03
3' UTR lengths                        833.31     367        834.09
Number of exons / transcript          9.98       7.00       9.51


Types

Suggested super-grouping of genes if needed:

  • protein-coding
  • processed-transcripts
  • miRNAs
  • other RNAs
  • RNA-pseudogenes
  • pseudogenes
  • IG genes
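
If the super-grouping is applied programmatically, a small lookup function may be enough. The sketch below is illustrative only: the gene type names and the rules that assign them to groups are assumptions, not a mapping taken from this page, and unknown types fall through to "other RNAs".

```python
# Assumption: RNA classes whose "<class>_pseudogene" types count as RNA pseudogenes.
RNA_CLASSES = ("miRNA", "snRNA", "snoRNA", "rRNA", "tRNA",
               "scRNA", "misc_RNA", "Mt_tRNA", "Mt_rRNA")


def super_group(gene_type):
    """Map a gene type string to one of the suggested super-groups (hypothetical rules)."""
    if gene_type == "protein_coding":
        return "protein-coding"
    if gene_type == "processed_transcript":
        return "processed-transcripts"
    if gene_type == "miRNA":
        return "miRNAs"
    if gene_type.endswith("_pseudogene"):
        prefix = gene_type[: -len("_pseudogene")]
        return "RNA-pseudogenes" if prefix in RNA_CLASSES else "pseudogenes"
    if gene_type == "pseudogene":
        return "pseudogenes"
    if gene_type.startswith("IG_"):
        return "IG genes"
    # Fallback: everything else is treated as another RNA class.
    return "other RNAs"
```

The pseudogene checks run before the IG check, so a type such as IG_pseudogene would land in "pseudogenes" under these rules; swapping the order would change that.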


[Figure: Gene types.3c.png] [Figure: Gene types by level.3c.png]

Other

[Figure: Gencode 3c utr lengths.png] [Figure: Gene density.png]


  • Is this likely to be a main analysis or supplementary information?

Key numbers in text, graphs and tables in supplement

References

Pilot GENCODE publication: PMID 16925838