NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.

Defined Human Gene Annotation Set


Introduction

We aim to annotate all evidence-based gene features at high accuracy on the human reference sequence. This includes identifying all protein-coding loci with associated alternative variants, non-coding loci which have transcript evidence, and pseudogenes. We integrate computational approaches (including comparative methods), manual annotation and targeted experimental verification. Loci will go through a workflow from being predicted to being verified, or being manually annotated with selected experimental verifications. They should end up with a status tag "Gencode" or be dismissed.

-> main wiki project page.

Analysis Summary

Analysis Hit List/Deliverables

For the third freeze (July 2009) we are providing a deeper merge with Ensembl loci. The notes below have been updated accordingly; there is an overview here.

For data freezes we supply genome-wide features on three different confidence levels. To get annotation close to the GENCODE gene set we are aiming for, levels 1 and 2 should be used:

  • Level 1: validated
At this time, only pseudogene loci that were predicted by the analysis pipelines from YALE and UCSC as well as by HAVANA manual annotation from WTSI.
  • Level 2: manual annotation
HAVANA manual annotation from WTSI.
  • Level 3: automated annotation
ENSEMBL annotation.

Havana and Ensembl transcripts can be found in shared loci and are merged into one model where possible.

This data is supplied in a single file in GTF2.2 format as defined here, with the following tags added to the attributes column where appropriate:

  • gene and transcript level types, status, version, ids and names.
  • level [1,2,3]: validation status as described.
  • tag "pseudo_consens": member of the pseudogene set predicted by YALE, UCSC and HAVANA.
  • tag "CCDS": member of the consensus CDS gene set, confirming coding regions between ENSEMBL, UCSC, NCBI and HAVANA.
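
As a minimal sketch of how the level and tag attributes might be read, the Python below parses the attributes column of a GTF line and keeps only level 1 and 2 features. The function names are hypothetical, and it assumes attributes are written as `key value;` pairs (values quoted or bare, with `tag` possibly repeated); this should be checked against the format description before use.

```python
import re
from collections import defaultdict

# One attribute: a key, then a quoted or bare value, terminated by ';'.
_ATTR_RE = re.compile(r'\s*(\S+)\s+(?:"([^"]*)"|([^\s;]+))\s*;')


def parse_attributes(column):
    """Map each attribute key to the list of its values (keys may repeat)."""
    attrs = defaultdict(list)
    for key, quoted, bare in _ATTR_RE.findall(column):
        attrs[key].append(quoted if quoted else bare)
    return dict(attrs)


def keep_levels(gtf_lines, levels=("1", "2")):
    """Yield (fields, attrs) for features whose level is in `levels`."""
    wanted = set(levels)
    for line in gtf_lines:
        # Skip comments and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # A GTF feature line has nine tab-separated columns.
        if len(fields) != 9:
            continue
        attrs = parse_attributes(fields[8])
        if wanted & set(attrs.get("level", [])):
            yield fields, attrs
```

Filtering on `attrs.get("tag", [])` in the same way would select, for example, only members of the "CCDS" or "pseudo_consens" sets.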

Please note: if start codons are split between two exons, two start-codon features will be listed.
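
To illustrate the note on split start codons, the sketch below groups start-codon features by transcript so that a codon spread over two exons is treated as one unit. The tuple layout and function names are hypothetical; the only format assumptions are the `start_codon` feature type and the 1-based, inclusive coordinates used by GTF.

```python
from collections import defaultdict


def start_codon_parts(features):
    """Group start_codon segments by transcript.

    `features` is an iterable of (transcript_id, feature_type, start, end)
    tuples with 1-based inclusive coordinates.
    """
    parts = defaultdict(list)
    for transcript_id, feature_type, start, end in features:
        if feature_type == "start_codon":
            parts[transcript_id].append((start, end))
    # Sort the segments of each transcript by coordinate.
    return {tid: sorted(segs) for tid, segs in parts.items()}


def codon_length(segments):
    """Total number of bases covered by the given segments."""
    return sum(end - start + 1 for start, end in segments)
```

Under these assumptions, a transcript with two start-codon features should still have segments that add up to three bases.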

More details about the format and how to use it are listed here.

The raw data of the freezes can be fetched from our FTP site.

For questions, please contact Felix Kokocinski or Gencode.

Results

Stats for merged gene sets

Version 3 (July 09 freeze)

GRCh37:

All genes: 46875, genes containing Havana transcripts: 28480, genes containing Ensembl transcripts: 27496

Prot-cod. genes: 22276

All transcripts: 127705, Havana transcripts: 85051, Ensembl transcripts: 42654

Prot-cod. transcripts: 66169

NCBI36:

All genes: 46766, genes containing Havana transcripts: 28426, genes containing Ensembl transcripts: 27393

Prot-cod. genes: 22194

All transcripts: 127296, Havana transcripts: 84824, Ensembl transcripts: 42472

Prot-cod. transcripts: 65911


Other stats

99.6% (19807/19891) of current CCDS ids are covered (GRCh37, CCDS release: 6.8.09).


QC pipeline results

A comparison to all RefSeq proteins shows:

Alignment for 99.12%

>= 100.00% for 93.52%

perfect match for 91.51%


A comparison to all SwissProt proteins shows:

Alignment for 97.24%

>= 100.00% for 80.76%

perfect match for 72.35%



06.02.2009

All Genes: Havana genes: 28557, Ensembl genes: 10647, Ratio: 72.8%

Prot-cod. Genes: Havana genes: 13527, Ensembl genes: 6194, Ratio: 68.6%

This is the data freeze that should be used for the analysis paper. It's available here.


01.10.08

All Genes: Havana genes: 24362, Ensembl genes: 11884, Ratio: 67.2%

Prot-cod. Genes: Havana genes: 12335, Ensembl genes: 7617, Ratio: 61.8%
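
The page does not define "Ratio" explicitly. The listed figures are consistent with reading it as the Havana share of the combined count, Havana / (Havana + Ensembl), and the short sketch below computes it under that assumption. The function name is hypothetical.

```python
def merge_ratio(havana, ensembl):
    """Havana count as a percentage of Havana + Ensembl, to one decimal."""
    total = havana + ensembl
    if total == 0:
        raise ValueError("no loci to compare")
    return round(100.0 * havana / total, 1)


# Counts taken from the 06.02.2009 and 01.10.08 entries above.
for label, havana, ensembl in [
    ("06.02.2009 all genes", 28557, 10647),
    ("06.02.2009 prot-cod. genes", 13527, 6194),
    ("01.10.08 all genes", 24362, 11884),
    ("01.10.08 prot-cod. genes", 12335, 7617),
]:
    print(f"{label}: {merge_ratio(havana, ensembl)}%")
```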

Sandbox (Ideas)

Details for merged gene set of 02.02.09 (outdated)

All loci:
Chromosome  HAVANA-loci  ENSEMBL-loci  Merge-Ratio
1           4226         0             100.0%
2           3074         0             100.0%
3           1027         966           51.5%
4           455          737           38.2%
5           627          695           47.4%
6           2284         0             100.0%
7           3493         237           93.6%
8           216          977           18.1%
9           1933         0             100.0%
10          1790         0             100.0%
11          463          1438          24.4%
12          150          1338          10.1%
13          924          0             100.0%
14          486          726           40.1%
15          408          557           42.3%
16          632          461           57.8%
17          854          861           49.8%
18          194          233           45.4%
19          376          1421          20.9%
20          1005         0             100.0%
21          597          0             100.0%
22          1037         0             100.0%
X           1861         0             100.0%
Y           445          0             100.0%


Protein-coding loci:
Chromosome  HAVANA-loci  ENSEMBL-loci  Merge-Ratio
1           1968         0             100.0%
2           1221         0             100.0%
3           527          528           50.0%
4           416          323           56.3%
5           569          271           67.7%
6           985          0             100.0%
7           1202         16            98.7%
8           168          514           24.6%
9           771          0             100.0%
10          741          0             100.0%
11          269          1033          20.7%
12          113          892           11.2%
13          312          0             100.0%
14          267          343           43.8%
15          370          246           60.1%
16          573          247           69.9%
17          571          551           50.9%
18          180          93            65.9%
19          241          1137          17.5%
20          529          0             100.0%
21          225          0             100.0%
22          435          0             100.0%
X           823          0             100.0%
Y           51           0             100.0%


Current stats of HAVANA set as of 11.02.09

Mean number of transcripts / locus 2.9

Median number of transcripts / locus 1


Mean number of exons / transcript 6.4

Median number of exons / transcript 3


Mean number of transcripts / protein_coding locus 4.7

Median number of transcripts / protein_coding locus 2


Mean number of exons / protein_coding transcript 7.4

Median number of exons / protein_coding transcript 31