NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.

Gene annotation project (Hubbard)

From Encode2 Wiki
Revision as of 08:38, 3 December 2012 by Sg10 (talk | contribs) (GENCODE Publications)

GENCODE Introduction

The ENCODE gene annotation scale-up project started on 1 October 2007. The project uses the name 'GENCODE', the same name used by the ENCODE pilot project on which it builds. The aim of GENCODE, as a sub-project of the ENCODE scale-up project, is to annotate all evidence-based gene features in the entire human genome with high accuracy. The result will be a set of annotations including all protein-coding loci with alternatively transcribed variants, non-coding loci with transcript evidence, and pseudogenes. The annotation is produced by a combination of manual curation, computational analyses and targeted experimental approaches: putative loci can be verified by wet-lab experiments, and computational predictions are analysed manually. The international team working on the GENCODE project is headed by Tim Hubbard at the Wellcome Trust Sanger Institute. A list of the other PIs involved in this project can be found at the following link: GENCODE co-PIs.

A public project page at the Wellcome Trust Sanger Institute can be found here.

GENCODE Overview

GENCODE Publications

Links to the papers can be found on the website.


Frenkel-Morgenstern M, Lacroix V, Ezkurdia I, Levin Y, Gabashvili A, Prilusky J, Del Pozo A, Tress M, Johnson R, Guigo R, Valencia A. Chimeras taking shape: Potential functions of proteins encoded by chimeric RNA transcripts. Genome Res. 2012 Jul;22(7):1231-42. Epub 2012 May 15. PubMed PMID: 22588898; PubMed Central PMCID: PMC3396365.

Ezkurdia I, del Pozo A, Frankish A, Rodriguez JM, Harrow J, Ashman K, Valencia A, Tress ML. Comparative proteomics reveals a significant bias towards alternative protein isoforms with conserved structure and function. Mol Biol Evol. 2012 Apr 17. [Epub ahead of print]. PubMed PMID: 22446687.

Frankish A, Mudge JM, Thomas M, Harrow J. The importance of identifying alternative splicing in vertebrate genome annotation. Database (Oxford). 2012 Mar 20. PubMed PMID: 22434846; PubMed Central PMCID: PMC3308168.

Harte RA, Farrell CM, Loveland JE, Suner MM, Wilming L, Aken B, Barrell D, Frankish A, Wallin C, Searle S, Diekhans M, Harrow J, Pruitt KD. Tracking and coordinating an international curation effort for the CCDS Project. Database (Oxford). 2012 Mar 20. PubMed PMID: 22434842; PubMed Central PMCID: PMC3308164.

MacArthur DG, Balasubramanian S, Frankish A, Huang N, Morris J, Walter K, Jostins L, Habegger L, Pickrell JK, Montgomery SB, Albers CA, Zhang ZD, Conrad DF, Lunter G, Zheng H, Ayub Q, DePristo MA, Banks E, Hu M, Handsaker RE, Rosenfeld JA, Fromer M, Jin M, Mu XJ, Khurana E, Ye K, Kay M, Saunders GI, Suner MM, Hunt T, Barnes IH, Amid C, Carvalho-Silva DR, Bignell AH, Snow C, Yngvadottir B, Bumpstead S, Cooper DN, Xue Y, Romero IG; 1000 Genomes Project Consortium, Wang J, Li Y, Gibbs RA, McCarroll SA, Dermitzakis ET, Pritchard JK, Barrett JC, Harrow J, Hurles ME, Gerstein MB, Tyler-Smith C. A systematic survey of loss-of-function variants in human protein-coding genes. Science. 2012 Feb 17;335(6070):823-8. PubMed PMID: 22344438; PubMed Central PMCID: PMC3299548.

Djebali S, Lagarde J, Kapranov P, Lacroix V, Borel C, Mudge JM, Howald C, Foissac S, Ucla C, Chrast J, Ribeca P, Martin D, Murray RR, Yang X, Ghamsari L, Lin C, Bell I, Dumais E, Drenkow J, Tress ML, Gelpí JL, Orozco M, Valencia A, van Berkum NL, Lajoie BR, Vidal M, Stamatoyannopoulos J, Batut P, Dobin A, Harrow J, Hubbard T, Dekker J, Frankish A, Salehi-Ashtiani K, Reymond A, Antonarakis SE, Guigó R, Gingeras TR. Evidence for transcript networks composed of chimeric RNAs in human cells. PLoS One. 2012;7(1):e28213. Epub 2012 Jan 4. PubMed PMID: 22238572; PubMed Central PMCID: PMC3251577.


Lin MF, Jungreis I, Kellis M. PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions. Bioinformatics. 2011 Jul 1;27(13):i275-82. PubMed PMID: 21685081; PubMed Central PMCID: PMC3117341.

Balasubramanian S, Habegger L, Frankish A, MacArthur DG, Harte R, Tyler-Smith C, Harrow J, Gerstein M. Gene inactivation and its implications for annotation in the era of personal genomics. Genes Dev. 2011 Jan 1;25(1):1-10. PubMed PMID: 21205862; PubMed Central PMCID: PMC3012931.

Coffey AJ, Kokocinski F, Calafato MS, Scott CE, Palta P, Drury E, Joyce CJ, Leproust EM, Harrow J, Hunt S, Lehesjoki AE, Turner DJ, Hubbard TJ, Palotie A. The GENCODE exome: sequencing the complete human exome. Eur J Hum Genet. 2011 Mar 2. PubMed PMID: 21364695; PubMed Central PMCID: PMC3137498.


Khurana E, Lam HY, Cheng C, Carriero N, Cayting P, Gerstein MB. Segmental duplications in the human genome reveal details of pseudogene formation. Nucleic Acids Res. 2010 Nov 1;38(20):6997-7007. PubMed PMID: 20615899; PubMed Central PMCID: PMC2978362.

Kokocinski F, Harrow J, Hubbard T. AnnoTrack--a tracking system for genome annotation. BMC Genomics. 2010 Oct 5;11:538. PubMed PMID: 20923551; PubMed Central PMCID: PMC3091687.

Ørom UA, Derrien T, Beringer M, Gumireddy K, Gardini A, Bussotti G, Lai F, Zytnicki M, Notredame C, Huang Q, Guigo R, Shiekhattar R. Long noncoding RNAs with enhancer-like function in human cells. Cell. 2010 Oct 1;143(1):46-58. PubMed PMID: 20887892.

Poliseno L, Salmena L, Zhang J, Carver B, Haveman WJ, Pandolfi PP. A coding-independent function of gene and pseudogene mRNAs regulates tumour biology. Nature. 2010 Jun 24;465(7301):1033-8. PubMed PMID: 20577206.

Rhead B, Karolchik D, Kuhn RM, Hinrichs AS, Zweig AS, Fujita PA, Diekhans M, Smith KE, Rosenbloom KR, Raney BJ, Pohl A, Pheasant M, Meyer LR, Learned K, Hsu F, Hillman-Jackson J, Harte RA, Giardine B, Dreszer TR, Clawson H, Barber GP, Haussler D, Kent WJ. The UCSC Genome Browser database: update 2010. Nucleic Acids Res. 2010 Jan;38(Database issue):D613-9. PubMed PMID: 19906737; PubMed Central PMCID: PMC2808870.

Rosenbloom KR, Dreszer TR, Pheasant M, Barber GP, Meyer LR, Pohl A, Raney BJ, Wang T, Hinrichs AS, Zweig AS, Fujita PA, Learned K, Rhead B, Smith KE, Kuhn RM, Karolchik D, Haussler D, Kent WJ. ENCODE whole-genome data in the UCSC Genome Browser. Nucleic Acids Res. 2010 Jan;38(Database issue):D620-5. PubMed PMID: 19920125; PubMed Central PMCID: PMC2808953.

Madupu R, Brinkac LM, Harrow J, Wilming LG, Böhme U, Lamesch P, Hannick LI. Meeting report: a workshop on Best Practices in Genome Annotation. Database (Oxford). 2010;2010:baq001. PubMed PMID: 20428316; PubMed Central PMCID: PMC2860899.

Zhang ZD, Frankish A, Hunt T, Harrow J, Gerstein M. Identification and analysis of unitary pseudogenes: historic and contemporary gene losses in humans and other primates. Genome Biol. 2010;11(3):R26. PubMed PMID: 20210993; PubMed Central PMCID: PMC2864566.


Amid C, Rehaume LM, Brown KL, Gilbert JG, Dougan G, Hancock RE, Harrow JL. Manual annotation and analysis of the defensin gene cluster in the C57BL/6J mouse reference genome. BMC Genomics. 2009 Dec 15;10:606. PubMed PMID: 20003482; PubMed Central PMCID: PMC2807441.

Boles MK, Wilkinson BM, Wilming LG, Liu B, Probst FJ, Harrow J, Grafham D, Hentges KE, Woodward LP, Maxwell A, Mitchell K, Risley MD, Johnson R, Hirschi K, Lupski JR, Funato Y, Miki H, Marin-Garcia P, Matthews L, Coffey AJ, Parker A, Hubbard TJ, Rogers J, Bradley A, Adams DJ, Justice MJ. Discovery of candidate disease genes in ENU-induced mouse mutants by large-scale sequencing, including a splice-site mutation in nucleoredoxin. PLoS Genet. 2009 Dec;5(12):e1000759. PubMed PMID: 20011118; PubMed Central PMCID: PMC2782131.

Liu YJ, Zheng D, Balasubramanian S, Carriero N, Khurana E, Robilotto R, Gerstein MB. Comprehensive analysis of the pseudogenes of glycolytic enzymes in vertebrates: the anomalously high number of GAPDH pseudogenes highlights a recent burst of retrotrans-positional activity. BMC Genomics. 2009 Oct 16;10:480. PubMed PMID: 19835609; PubMed Central PMCID: PMC2770531.

Pruitt KD, Harrow J, Harte RA, Wallin C, Diekhans M, Maglott DR, Searle S, Farrell CM, Loveland JE, Ruef BJ, Hart E, Suner MM, Landrum MJ, Aken B, Ayling S, Baertsch R, Fernandez-Banet J, Cherry JL, Curwen V, Dicuccio M, Kellis M, Lee J, Lin MF, Schuster M, Shkeda A, Amid C, Brown G, Dukhanina O, Frankish A, Hart J, Maidak BL, Mudge J, Murphy MR, Murphy T, Rajan J, Rajput B, Riddick LD, Snow C, Steward C, Webb D, Weber JA, Wilming L, Wu W, Birney E, Haussler D, Hubbard T, Ostell J, Durbin R, Lipman D. The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes. Genome Res. 2009 Jul;19(7):1316-23. PubMed PMID: 19498102; PubMed Central PMCID: PMC2704439.

Guo X, Zhang Z, Gerstein MB, Zheng D. Small RNAs originated from pseudogenes: cis- or trans-acting? PLoS Comput Biol. 2009 Jul;5(7):e1000449. PubMed PMID: 19649160; PubMed Central PMCID: PMC2708354.

Lu DV, Brown RH, Arumugam M, Brent MR. Pairagon: a highly accurate, HMM-based cDNA-to-genome aligner. Bioinformatics. 2009 Jul 1;25(13):1587-93. Epub 2009 May 4. PubMed PMID: 19414532; PubMed Central PMCID: PMC2732315.

Kuhn RM, Karolchik D, Zweig AS, Wang T, Smith KE, Rosenbloom KR, Rhead B, Raney BJ, Pohl A, Pheasant M, Meyer L, Hsu F, Hinrichs AS, Harte RA, Giardine B, Fujita P, Diekhans M, Dreszer T, Clawson H, Barber GP, Haussler D, Kent WJ. The UCSC Genome Browser Database: update 2009. Nucleic Acids Res. 2009 Jan;37(Database issue):D755-61. Epub 2008 Nov 7. PubMed PMID: 18996895; PubMed Central PMCID: PMC2686463.

Balasubramanian S, Zheng D, Liu YJ, Fang G, Frankish A, Carriero N, Robilotto R, Cayting P, Gerstein M. Comparative analysis of processed ribosomal protein pseudogenes in four mammalian genomes. Genome Biol. 2009;10(1):R2. Epub 2009 Jan 5. PubMed PMID: 19123937; PubMed Central PMCID: PMC2687790.

Lam HY, Khurana E, Fang G, Cayting P, Carriero N, Cheung KH, Gerstein MB. Pseudofam: the pseudogene families database. Nucleic Acids Res. 2009 Jan;37(Database issue):D738-43. PubMed PMID: 18957444; PubMed Central PMCID: PMC2686518.

Harrow J, Nagy A, Reymond A, Alioto T, Patthy L, Antonarakis SE, Guigó R. Identifying protein-coding genes in genomic sequences. Genome Biol. 2009;10(1):201. PubMed PMID: 19226436; PubMed Central PMCID: PMC2687780.


Djebali S, Kapranov P, Foissac S, Lagarde J, Reymond A, Ucla C, Wyss C, Drenkow J, Dumais E, Murray RR, Lin C, Szeto D, Denoeud F, Calvo M, Frankish A, Harrow J, Makrythanasis P, Vidal M, Salehi-Ashtiani K, Antonarakis SE, Gingeras TR, Guigó R. Efficient targeted transcript discovery via array-based normalization of RACE libraries. Nat Methods. 2008 Jul;5(7):629-35. PubMed PMID: 18500348; PubMed Central PMCID: PMC2713501.

Czech B, Malone CD, Zhou R, Stark A, Schlingeheyde C, Dus M, Perrimon N, Kellis M, Wohlschlegel JA, Sachidanandam R, Hannon GJ, Brennecke J. An endogenous small interfering RNA pathway in Drosophila. Nature. 2008 Jun 5;453(7196):798-802. PubMed PMID: 18463631; PubMed Central PMCID: PMC2895258.

Horton R, Gibson R, Coggill P, Miretti M, Allcock RJ, Almeida J, Forbes S, Gilbert JG, Halls K, Harrow JL, Hart E, Howe K, Jackson DK, Palmer S, Roberts AN, Sims S, Stewart CA, Traherne JA, Trevanion S, Wilming L, Rogers J, de Jong PJ, Elliott JF, Sawcer S, Todd JA, Trowsdale J, Beck S. Variation analysis and gene annotation of eight MHC haplotypes: the MHC Haplotype Project. Immunogenetics. 2008 Jan;60(1):1-18. PubMed PMID: 18193213; PubMed Central PMCID: PMC2206249.

Mudge JM, Armstrong SD, McLaren K, Beynon RJ, Hurst JL, Nicholson C, Robertson DH, Wilming LG, Harrow JL. Dynamic instability of the major urinary protein gene family revealed by genomic and phenotypic comparisons between C57 and 129 strain mice. Genome Biol. 2008;9(5):R91. PubMed PMID: 18507838; PubMed Central PMCID: PMC2441477.

Tress ML, Wesselink JJ, Frankish A, López G, Goldman N, Löytynoja A, Massingham T, Pardi F, Whelan S, Harrow J, Valencia A. Determination and validation of principal gene products. Bioinformatics. 2008 Jan 1;24(1):11-7. PubMed PMID: 18006548; PubMed Central PMCID: PMC2734078.

Zhang ZD, Cayting P, Weinstock G, Gerstein M. Analysis of nuclear receptor pseudogenes in vertebrates: how the silent tell their stories. Mol Biol Evol. 2008 Jan;25(1):131-43. PubMed PMID: 18065488.

Karolchik D, Kuhn RM, Baertsch R, Barber GP, Clawson H, Diekhans M, Giardine B, Harte RA, Hinrichs AS, Hsu F, Kober KM, Miller W, Pedersen JS, Pohl A, Raney BJ, Rhead B, Rosenbloom KR, Smith KE, Stanke M, Thakkapallayil A, Trumbower H, Wang T, Zweig AS, Haussler D, Kent WJ. The UCSC Genome Browser Database: 2008 update. Nucleic Acids Res. 2008 Jan;36(Database issue):D773-9. PubMed PMID: 18086701; PubMed Central PMCID: PMC2238835.

Wilming LG, Gilbert JG, Howe K, Trevanion S, Hubbard T, Harrow JL. The vertebrate genome annotation (Vega) database. Nucleic Acids Res. 2008 Jan;36(Database issue):D753-60. PubMed PMID: 18003653; PubMed Central PMCID: PMC2238886.


Kuhn RM, Karolchik D, Zweig AS, Trumbower H, Thomas DJ, Thakkapallayil A, Sugnet CW, Stanke M, Smith KE, Siepel A, Rosenbloom KR, Rhead B, Raney BJ, Pohl A, Pedersen JS, Hsu F, Hinrichs AS, Harte RA, Diekhans M, Clawson H, Bejerano G, Barber GP, Baertsch R, Haussler D, Kent WJ. The UCSC genome browser database: update 2007. Nucleic Acids Res. 2007 Jan;35(Database issue):D668-73. PubMed PMID: 17142222; PubMed Central PMCID: PMC1669757.

GENCODE Project Documents

NHGRI Quarterly Reports

GENCODE Conference Calls

Experimental Validation Subgroup

Project page: Experimental Validation Pipeline

Pseudogene Subgroup

lncRNA Subgroup

Data Coordination

The GENCODE project's data release page and a further data access point, the FTP site, are linked here.

The GENCODE annotation is released regularly to the DCC for display as tracks in the UCSC Genome Browser (level 1/2 and level 3) and is also available as GTF files.

The GTF format used is described here.
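
As a rough illustration of working with the GTF files, the sketch below splits one record into its nine tab-separated columns and turns the attribute column into a dictionary. It assumes only the general GTF layout (seqname, source, feature, start, end, score, strand, frame, attributes); the sample record and its identifiers are made up for illustration, and the format description linked above remains the authority on which attributes a GENCODE file actually carries.

```python
# Minimal GTF record parser (a sketch, not the GENCODE reference parser).
GTF_COLUMNS = [
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame", "attributes",
]


def parse_gtf_line(line):
    """Return a dict for one GTF record, or None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Attributes look like: key "value"; key value;
    # Simplifications: values containing ';' are not handled, and a key that
    # occurs more than once keeps only its last value.
    attributes = {}
    for chunk in record["attributes"].split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        key, _, value = chunk.partition(" ")
        attributes[key] = value.strip().strip('"')
    record["attributes"] = attributes
    return record


# Hypothetical record; the identifiers are placeholders, not real annotation.
SAMPLE = (
    "chr1\tHAVANA\texon\t1000\t1200\t.\t+\t.\t"
    'gene_id "GENE0001"; transcript_id "TRANS0001"; '
    'gene_type "protein_coding"; level 2;'
)

if __name__ == "__main__":
    rec = parse_gtf_line(SAMPLE)
    print(rec["seqname"], rec["feature"], rec["start"], rec["end"])
    print(rec["attributes"])
```

A parser of this kind makes it straightforward to select records by an attribute, for example keeping only those whose `level` value matches the annotation level of interest.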

The GENCODE tracking system AnnoTrack can be found here.

Statistics on release 3c are given in Integration_Vignette_010, and some more statistics can be found here.


To share, distribute and integrate the data from the different groups, we use the Distributed Annotation System (DAS). Every data producer has set up a server that displays its data live. For the 2008 DAS workshop we prepared a number of DAS tutorials covering how to set up DAS servers and how to script against them. There are also tutorials for the Perl ProServer here and here, and for the Java Dazzle server here.

Details of presentations and tutorials from the 2009 DAS workshop can be found here.

A list of DAS servers that are set up as part of the GENCODE project is available from the DAS Registry.
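
To give a flavour of scripting against a DAS server, here is a minimal sketch that builds a `features` request for a genomic segment and parses the XML that comes back. It assumes the servers follow the DAS 1.5-style `features` command and DASGFF response layout; the server address `http://das.example.org/das` and the source name `example_source` are placeholders, not real GENCODE endpoints, and the embedded response is a hand-written example rather than real data. The DAS Registry lists the actual servers.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def das_features_url(base, source, segment_id, start, stop):
    """Build a DAS 'features' request URL for one segment.

    `base` is the server prefix ending in /das (placeholder in this sketch).
    """
    query = urlencode({"segment": "%s:%d,%d" % (segment_id, start, stop)},
                      safe=":,")
    return "%s/%s/features?%s" % (base.rstrip("/"), source, query)


def parse_das_features(xml_text):
    """Extract a few fields from each FEATURE element of a DASGFF document."""
    root = ET.fromstring(xml_text)
    features = []
    for segment in root.iter("SEGMENT"):
        for feat in segment.iter("FEATURE"):
            features.append({
                "segment": segment.get("id"),
                "id": feat.get("id"),
                "type": (feat.findtext("TYPE") or "").strip(),
                "start": int(feat.findtext("START")),
                "end": int(feat.findtext("END")),
                "orientation": (feat.findtext("ORIENTATION") or "").strip(),
            })
    return features


# Hand-written example response (illustrative only).
SAMPLE_RESPONSE = """<DASGFF>
  <GFF version="1.0" href="http://das.example.org/das/example_source/features">
    <SEGMENT id="1" start="1000" stop="2000">
      <FEATURE id="feat1" label="feat1">
        <TYPE id="exon">exon</TYPE>
        <METHOD id="manual">manual</METHOD>
        <START>1100</START>
        <END>1250</END>
        <ORIENTATION>+</ORIENTATION>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>"""

if __name__ == "__main__":
    print(das_features_url("http://das.example.org/das", "example_source",
                           "1", 1000, 2000))
    for feature in parse_das_features(SAMPLE_RESPONSE):
        print(feature)
```

Against a live server, the URL from `das_features_url` would be fetched with any HTTP client (for example `urllib.request.urlopen`) and the response body passed to `parse_das_features`.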

Format specifications

Server details

More detailed documentation for some of the DAS servers for the GENCODE project:

Release schedule

  • ENCODE analysis release: version 7
  • Previous ENCODE analysis release: version 3c
  • Current release (May 2011 Freeze): version 9
  • Upcoming release: version 13

GENCODE Gene Annotation at Sanger

Gene / Transcript Types & Status Definitions

Information on the nomenclature used in the GENCODE gene set can be found here and here with stats.


Otterlace is the Sanger Institute's interactive annotation viewing and curation interface. Documentation explaining how to run Otterlace and ZMap (Otterlace's genomic sequence browser), as well as two other annotation tools, Blixem and Dotter, can be downloaded as a PDF from here.


If you are interested in using "otterlace" (the Sanger Institute's interactive annotation interface) to view our live annotation data, you can download our latest client:

  • otterlace_49-05.dmg (Macintosh Universal)

They all have a README file which gives some basic instructions on installation.

The Mac version depends on a copy of It works on both Mac OS X 10.4 (Tiger) and Mac OS X 10.5 (Leopard). For Leopard, we recommend installing the latest XQuartz release.

Email if you have difficulty getting it to work, or if you would like a distribution for Linux. Linux installation will usually involve installing some CPAN modules.

Users are authenticated using our SingleSignOn system, which you'll need to sign up to. Send an email with the email address you sign up with to mentioning ENCODE and we'll add you to our list of authorized users.

otterlace runs in read-only mode by default. It takes a long time to open clones containing all of the evidence used to build our transcript models, so if you only want to see our annotation you can choose not to load this "pipeline" data with a checkbox in the interface.

GENCODE Meetings

GENCODE Winter Meeting 2009

Hinxton Hall, 20th and 21st January 2009

GENCODE Spring Meeting 2010

Washington, 9th March 2010

GENCODE Summer Meeting 2010

Hinxton Hall, 30th June and 1st July 2010

GENCODE Autumn Meeting 2011

Clare College, 15th and 16th September 2011