NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.
Integration Vignette 010
- list and count types of genes and transcripts
- explain types and additional remarks used
- give other useful stats
|Gencode v3c||GRCh37 (hg19)||gtf file||format description|
Please note that the GENCODE release we agreed to use for all analyses is 3c. This data is displayed in the UCSC browser as the "Oct 2009" track and corresponds to Ensembl version 56, which can be queried here. Differences between the GENCODE data (as shown in the UCSC browser and the GTF file) and Ensembl are:
- only the main 24 chromosomes are used for the GENCODE file, omitting lower-level and haplotype annotation
- the genes in the PAR region are listed with their locations on the X as well as on the Y chromosome
Command-line utilities, Excel, etc. are used to dissect the GENCODE GTF file in various ways.
Perl scripts are used to analyze genes, transcripts, exons, etc. using the GTF file or the Ensembl 56 database.
This explains the content of the GTF file in detail.
This shows a description of gene and transcript types with stats for 3c as well as the numbers of genes and transcripts of each type.
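As an illustration of the kind of dissection described above, here is a minimal Python sketch that tallies gene and transcript types from GTF lines. It assumes the usual nine tab-separated GTF columns and attribute keys named `gene_type` and `transcript_type`; the sample records are made up for the example and are not taken from the 3c file. The original analysis used Perl and command-line tools, so this is only one possible way to reproduce the counts.

```python
import re
from collections import Counter

# Matches attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Turn the 9th GTF column into a dict of quoted key/value pairs."""
    return dict(ATTR_RE.findall(field))


def count_types(lines):
    """Count gene_type on 'gene' rows and transcript_type on 'transcript' rows."""
    gene_types = Counter()
    transcript_types = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a well-formed GTF record
        feature = cols[2]
        attrs = parse_attributes(cols[8])
        if feature == "gene":
            gene_types[attrs.get("gene_type", "unknown")] += 1
        elif feature == "transcript":
            transcript_types[attrs.get("transcript_type", "unknown")] += 1
    return gene_types, transcript_types


def make_line(source, feature, start, end, attrs):
    """Build a hypothetical GTF record on chr1 for the demo below."""
    attr_text = " ".join(f'{k} "{v}";' for k, v in attrs.items())
    return "\t".join(["chr1", source, feature, str(start), str(end), ".", "+", ".", attr_text])


# Hypothetical records, for illustration only.
SAMPLE = [
    "##example header line",
    make_line("HAVANA", "gene", 100, 900, {"gene_id": "G1", "gene_type": "protein_coding"}),
    make_line("HAVANA", "transcript", 100, 900, {"gene_id": "G1", "transcript_id": "T1", "transcript_type": "protein_coding"}),
    make_line("ENSEMBL", "transcript", 150, 900, {"gene_id": "G1", "transcript_id": "T2", "transcript_type": "protein_coding"}),
    make_line("HAVANA", "exon", 100, 200, {"gene_id": "G1", "transcript_id": "T1"}),
    make_line("HAVANA", "gene", 2000, 2500, {"gene_id": "G2", "gene_type": "pseudogene"}),
    make_line("HAVANA", "transcript", 2000, 2500, {"gene_id": "G2", "transcript_id": "T3", "transcript_type": "pseudogene"}),
]

gene_types, transcript_types = count_types(SAMPLE)
for name, n in gene_types.most_common():
    print("gene", name, n)
for name, n in transcript_types.most_common():
    print("transcript", name, n)
```

Because `count_types` only iterates over lines, an open file handle for the GTF file can be passed in place of `SAMPLE`.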
- Can the analysis be repeated on a new freeze rapidly?
What does this show?
All genes: 47553
Protein-coding genes: 22550
Genes containing manually annotated Havana transcripts: 29210
Protein-coding genes containing manually annotated Havana transcripts: 14348
-> extrapolation: 63.6% of protein-coding genes (14348 of 22550) have been manually annotated
Genes containing additional Ensembl transcripts: 9101
Genes containing only Ensembl transcripts: 18343
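The gene counts above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers listed on this page; the variable names are chosen for the example.

```python
# Gene counts as listed on this page.
all_genes = 47553
protein_coding_genes = 22550
genes_with_havana = 29210
protein_coding_with_havana = 14348
genes_ensembl_only = 18343

# Genes with Havana transcripts plus Ensembl-only genes should cover all genes.
covers_all = genes_with_havana + genes_ensembl_only == all_genes

# Fraction of protein-coding genes with manual (Havana) annotation.
pc_manual_fraction = protein_coding_with_havana / protein_coding_genes

# Fraction of all genes with manual (Havana) annotation.
all_manual_fraction = genes_with_havana / all_genes

print("Havana + Ensembl-only equals all genes:", covers_all)
print(f"Protein-coding genes manually annotated: {pc_manual_fraction:.1%}")
print(f"All genes manually annotated: {all_manual_fraction:.1%}")
```

The 63.6% figure corresponds to the protein-coding ratio (14348 / 22550); the same ratio over all genes comes out slightly lower.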
"Genic regions" (exons, all genes): 61,010,745 bp (1.97%)
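A base-pair total like the one above is typically obtained by merging overlapping exons so that no base is counted twice. The page does not say how the figure was computed, so the following is a sketch under that assumption, with GTF-style 1-based inclusive coordinates and made-up intervals.

```python
from collections import defaultdict


def merged_length(intervals):
    """Total bases covered by a list of (start, end) intervals, 1-based inclusive."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # Gap before this interval: close the previous merged block.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or directly adjacent: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def genic_bp(exons):
    """Sum merged exon coverage per chromosome; exons are (chrom, start, end)."""
    by_chrom = defaultdict(list)
    for chrom, start, end in exons:
        by_chrom[chrom].append((start, end))
    return sum(merged_length(iv) for iv in by_chrom.values())


# Hypothetical exons, for illustration only.
example_exons = [
    ("chr1", 100, 200),
    ("chr1", 150, 250),  # overlaps the previous exon
    ("chr1", 300, 310),
    ("chrX", 1, 10),
]
print(genic_bp(example_exons))
```

Dividing the result by the total length of the chromosomes considered gives a percentage; the denominator behind the 1.97% is not stated on this page.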
All transcripts: 132067
Protein-coding transcripts: 68880
Manually annotated Havana transcripts: 87940
Ensembl-only transcripts: 44127
Distinct peptide variants: 63014
Transcripts/Gene (genome-wide, protein-coding genes with > 1 transcript):
Maximum: 8.4 (chromosome 3)
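A per-chromosome transcripts-per-gene average of this kind could be computed roughly as follows. This is a sketch with hypothetical records; it assumes each record supplies a chromosome, gene ID and transcript ID, and it keys genes by chromosome and ID so that PAR genes listed on both X and Y are counted once per chromosome. Filtering to protein-coding genes is assumed to happen before the records are passed in.

```python
from collections import defaultdict


def mean_transcripts_per_gene(records, min_transcripts=2):
    """Mean transcripts per gene for each chromosome.

    records: iterable of (chrom, gene_id, transcript_id).
    Only genes with at least min_transcripts transcripts are included.
    """
    transcripts = defaultdict(set)
    for chrom, gene_id, transcript_id in records:
        transcripts[(chrom, gene_id)].add(transcript_id)

    per_chrom = defaultdict(list)
    for (chrom, _gene_id), ids in transcripts.items():
        if len(ids) >= min_transcripts:
            per_chrom[chrom].append(len(ids))

    return {chrom: sum(counts) / len(counts) for chrom, counts in per_chrom.items()}


# Hypothetical records, for illustration only.
example_records = [
    ("chr3", "g1", "t1"),
    ("chr3", "g1", "t2"),
    ("chr3", "g1", "t3"),
    ("chr3", "g2", "t4"),  # single-transcript gene, excluded
    ("chr1", "g3", "t5"),
    ("chr1", "g3", "t6"),
]
means = mean_transcripts_per_gene(example_records)
top_chrom = max(means, key=means.get)
print(means, top_chrom)
```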
A. All genes & transcripts:
|Gene lengths (genomic)||30822.35||87995.03||2996.00|
|Transcript lengths (genomic)||38254.67||87897.11||10651.00|
|Transcript lengths (cDNA)||1713.26||1960.13||955.00|
|Number of exons / transcript||7.00||8.05||4.00|
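The difference between the genomic and cDNA transcript lengths in the table is that the genomic length spans introns, while the cDNA length is the sum of the exon lengths. The sketch below shows both, plus a helper for common summary statistics. The columns of the table are not labeled in this copy, so the choice of mean, sample standard deviation and median here is an assumption, and the coordinates are made up.

```python
import statistics


def transcript_lengths(exons):
    """Return (genomic_length, cdna_length) for one transcript.

    exons: list of (start, end) pairs, 1-based inclusive as in GTF.
    """
    genomic = max(end for _, end in exons) - min(start for start, _ in exons) + 1
    cdna = sum(end - start + 1 for start, end in exons)
    return genomic, cdna


def summarize(values):
    """Mean, sample standard deviation and median of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample SD; needs at least 2 values
        "median": statistics.median(values),
    }


# Hypothetical two-exon transcript, for illustration only.
genomic_len, cdna_len = transcript_lengths([(100, 199), (300, 349)])
print(genomic_len, cdna_len)

summary = summarize([1, 2, 3, 4])
print(summary)
```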
B. Protein-coding genes & transcripts only:
|Transcript lengths (protein_coding)||2462.10||2240.73||1910.00|
|5' UTR lengths||220.71||130||196.03|
|3' UTR lengths||833.31||367||834.09|
|Number of exons / transcript||9.98||9.51||7.00|
Suggested super-grouping of genes if needed:
- other RNAs
- IG genes
- Is this likely to be a main analysis or supplementary information?
Key numbers in text, graphs and tables in supplement
Pilot GENCODE publication: PMID 16925838