NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.
Defined Human Gene Annotation Set
We aim to annotate all evidence-based gene features at high accuracy on the human reference sequence. This includes identifying all protein-coding loci with associated alternative variants, non-coding loci which have transcript evidence, and pseudogenes. We integrate computational approaches (including comparative methods), manual annotation and targeted experimental verification. Loci will go through a workflow from being predicted to being verified, or being manually annotated with selected experimental verifications. They should end up with a status tag "Gencode" or be dismissed.
Analysis Hit List/Deliverables
For the third freeze (July 2009) we are providing a deeper merge with Ensembl loci. The notes below have been updated accordingly; there is an overview here.
For data freezes we supply genome-wide features at three confidence levels. To get annotation close to the GENCODE gene set we are aiming for, use levels 1 and 2 together:
- Level 1: validated
- At this time, only pseudogene loci that were predicted both by the analysis pipelines from YALE and UCSC and by HAVANA manual annotation from WTSI.
- Level 2: manual annotation
- HAVANA manual annotation from WTSI.
- Level 3: automated annotation
- ENSEMBL annotation.
Havana and Ensembl transcripts can be found in shared loci and are merged into one model where possible.
These data are supplied in a single file in GTF2.2 format as defined here, with the following tags added to the attributes column where appropriate:
- gene and transcript level types, status, version, ids and names.
- level [1,2,3]: validation status as described.
- tag "pseudo_consens": member of the pseudogene set predicted by YALE, UCSC and HAVANA.
- tag "CCDS": member of the consensus CDS gene set, confirming coding regions between ENSEMBL, UCSC, NCBI and HAVANA.
Please note: if start codons are split between two exons, two start-codon features will be listed.
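Because a split start codon appears as two start_codon features, code that expects one feature per transcript needs to group the pieces. A minimal Python sketch, assuming the usual nine tab-separated GTF columns and a transcript_id attribute written as transcript_id "..."; the sample lines, coordinates and IDs are invented for illustration:

```python
import re
from collections import defaultdict

# Hypothetical GTF lines (tab-separated); coordinates and IDs are invented.
SAMPLE = "\n".join([
    'chr1\tHAVANA\tstart_codon\t100\t101\t.\t+\t0\tgene_id "G1"; transcript_id "T1";',
    'chr1\tHAVANA\tstart_codon\t201\t201\t.\t+\t1\tgene_id "G1"; transcript_id "T1";',
    'chr1\tHAVANA\tstart_codon\t500\t502\t.\t+\t0\tgene_id "G2"; transcript_id "T2";',
])

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def start_codon_parts(gtf_text):
    """Map transcript_id -> list of (start, end) start_codon pieces."""
    parts = defaultdict(list)
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        if len(cols) < 9 or cols[2] != "start_codon":
            continue
        match = TRANSCRIPT_ID.search(cols[8])
        if match:
            parts[match.group(1)].append((int(cols[3]), int(cols[4])))
    return dict(parts)


def codon_length(pieces):
    """Total number of bases covered by the pieces (GTF ends are inclusive)."""
    return sum(end - start + 1 for start, end in pieces)


codons = start_codon_parts(SAMPLE)
# Transcripts whose start codon is listed as more than one feature.
split_transcripts = sorted(t for t, p in codons.items() if len(p) > 1)
```

In this toy input only transcript T1 has a split start codon; its two pieces together still cover three bases.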
More details about the format and how to use it are listed here.
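To work with the level 1 + 2 subset recommended above, the level attribute can be read from the ninth column of each line. The following Python sketch assumes attributes are written as key "value"; or key value; (so either level 2; or level "2"; would be accepted); the sample lines and the helper names parse_attributes and keep_levels are illustrative and not part of the GENCODE distribution:

```python
import re

# Matches GTF attributes of the form: key "value"; or key value;
ATTRIBUTE = re.compile(r'(\S+)\s+(?:"([^"]*)"|([^;\s]+))\s*;')


def parse_attributes(column):
    """Return {key: [values]}; a key such as "tag" may occur more than once."""
    attrs = {}
    for key, quoted, bare in ATTRIBUTE.findall(column):
        attrs.setdefault(key, []).append(quoted if quoted else bare)
    return attrs


def keep_levels(lines, levels=("1", "2")):
    """Yield GTF lines whose level attribute is one of the given levels."""
    for line in lines:
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        attrs = parse_attributes(cols[8])
        if any(level in levels for level in attrs.get("level", [])):
            yield line


# Hypothetical records; IDs and coordinates are invented.
SAMPLE = [
    'chr1\tHAVANA\texon\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; level 2; tag "CCDS";',
    'chr1\tENSEMBL\texon\t2000\t2900\t.\t-\t.\tgene_id "G2"; transcript_id "T2"; level 3;',
    'chr1\tHAVANA\texon\t5000\t5400\t.\t+\t.\tgene_id "G3"; transcript_id "T3"; level 1; tag "pseudo_consens";',
]

kept = list(keep_levels(SAMPLE))
kept_gene_ids = [parse_attributes(l.split("\t")[8])["gene_id"][0] for l in kept]
```

Collecting attribute values in lists keeps repeated keys intact, which matters if a record carries more than one tag.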
The raw data of the freezes can be fetched from our FTP site.
Stats for merged gene sets
Version 3 (July 09 freeze)
All genes: 46875, genes containing Havana transcripts: 28480, genes containing Ensembl transcripts: 27496
Prot-cod. genes: 22276
All transcripts: 127705, Havana transcripts: 85051, Ensembl transcripts: 42654
Prot-cod. transcripts: 66169
All genes: 46766, genes containing Havana transcripts: 28426, genes containing Ensembl transcripts: 27393
Prot-cod. genes: 22194
All transcripts: 127296, Havana transcripts: 84824, Ensembl transcripts: 42472
Prot-cod. transcripts: 65911
99.6% (19807/19891) of current CCDS ids are covered (GRCh37, CCDS release: 6.8.09).
QC pipeline results
A comparison to all RefSeq proteins shows:
Alignment for 99.12%
>= 100.00% for 93.52%
perfect match for 91.51%
A comparison to all SwissProt proteins shows:
Alignment for 97.24%
>= 100.00% for 80.76%
perfect match for 72.35%
All Genes: Havana genes: 28557, Ensembl genes: 10647, Ratio: 72.8%
Prot-cod. Genes: Havana genes: 13527, Ensembl genes: 6194, Ratio: 68.6%
This is the data freeze that should be used for the analysis paper. It's available here.
All Genes: Havana genes: 24362, Ensembl genes: 11884, Ratio: 67.2%
Prot-cod. Genes: Havana genes: 12335, Ensembl genes: 7617, Ratio: 61.8%
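The "Ratio" figures are not defined in the text. The published numbers are consistent with reading the ratio as Havana / (Havana + Ensembl), and the short Python check below works under that assumption; the function name havana_share and the row labels are illustrative:

```python
def havana_share(havana, ensembl):
    """Havana count as a percentage of Havana + Ensembl, to one decimal place."""
    return round(100.0 * havana / (havana + ensembl), 1)


# (Havana genes, Ensembl genes) as listed in the two stat blocks above.
rows = {
    "all genes, first block": (28557, 10647),
    "prot-cod. genes, first block": (13527, 6194),
    "all genes, second block": (24362, 11884),
    "prot-cod. genes, second block": (12335, 7617),
}

shares = {name: havana_share(h, e) for name, (h, e) in rows.items()}
for name, share in shares.items():
    print(f"{name}: {share}%")
```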
Details for merged gene set of 02.02.09 (outdated)
Current stats of HAVANA set as of 11.02.09
Mean number of transcripts / locus 2.9
Median number of transcripts / locus 1
Mean number of exons / transcript 6.4
Median number of exons / transcript 3
Mean number of transcripts / protein_coding locus 4.7
Median number of transcripts / protein_coding locus 2
Mean number of exons / protein_coding transcript 7.4
Median number of exons / protein_coding transcript 31
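Summary figures of this kind can be derived from per-locus counts. The sketch below uses Python's statistics module on invented (gene, transcript) pairs, not HAVANA data, to show how transcripts per locus could be tallied and why a few loci with many transcripts pull the mean above the median:

```python
from collections import Counter
from statistics import mean, median

# Hypothetical (gene_id, transcript_id) pairs; not taken from any freeze.
pairs = [
    ("G1", "T1"), ("G1", "T2"), ("G1", "T3"),
    ("G1", "T4"), ("G1", "T5"), ("G1", "T6"),
    ("G2", "T7"),
    ("G3", "T8"),
    ("G4", "T9"), ("G4", "T10"),
    ("G5", "T11"),
]

# Count distinct transcripts per locus.
transcripts_per_locus = Counter(gene for gene, _ in set(pairs))
counts = sorted(transcripts_per_locus.values())

mean_count = mean(counts)      # pulled up by the one locus with six transcripts
median_count = median(counts)  # most loci here have a single transcript
print(f"mean {mean_count:.1f}, median {median_count}")
```

The same tally applied to exons per transcript would give the exon statistics.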