NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.

GENCODE genetypes


Descriptions and gene and transcript counts for each biotype in GENCODE release 3c (the release used for the integrative analysis).


Biotype | Current gene number | Current transcript number | Description
IG_C_gene 21 41 Immunoglobulin (Ig) variable chain and T-cell receptor (TcR) genes imported or annotated according to the IMGT database (imgt.cines.fr).
IG_D_gene 30 30
IG_J_gene 93 93
IG_V_gene 226 228
IG_pseudogene 0 161 Inactivated immunoglobulin gene.
Mt_rRNA 2 2 Non-coding RNA predicted by the Ensembl pipeline using sequences from RFAM and miRBase.
Mt_tRNA 22 22
miRNA 1660 1662
misc_RNA 1561 1561
rRNA 455 455
snRNA 1436 1436
snoRNA 1217 1217
Mt_tRNA_pseudogene 580 580 Non-coding RNAs predicted to be pseudogenes by the Ensembl pipeline.
tRNA_pseudogene 128 128
snoRNA_pseudogene 484 484
snRNA_pseudogene 494 494
scRNA_pseudogene 839 839
rRNA_pseudogene 338 338
misc_RNA_pseudogene 7 7
miRNA_pseudogene 20 20
TR_pseudogene
TEC 0 44 "To be Experimentally Confirmed". This is used for non-spliced EST clusters that have polyA features. This category has been specifically created for the ENCODE project to highlight regions that could indicate the presence of protein coding genes that require experimental validation, either by 5' RACE or RT-PCR to extend the transcripts, or by confirming expression of the putatively-encoded peptide with specific antibodies. Should not be used in any analysis.
nonsense_mediated_decay 0 4703 If the coding sequence (following the appropriate reference) of a transcript finishes more than 50 bp from a downstream splice site, then it is tagged as NMD. If the variant does not cover the full reference coding sequence, then it is annotated as NMD if NMD is unavoidable, i.e. no matter what the exon structure of the missing portion is, the transcript will be subject to NMD.
retained_intron 0 8984 Alternatively spliced transcript believed to contain intronic sequence relative to other, coding, variants.
protein_coding 22550 68880 Contains an open reading frame (ORF).
processed_transcript 6496 28921 Does not contain an ORF; similar to long non-coding RNAs (lincRNA).
non_coding 0 161 Transcript which is known from the literature to not be protein coding.
ambiguous_orf 0 67 Transcript believed to be protein coding, but with more than one possible open reading frame.
antisense 0 10 Transcript believed to be an antisense product used in the regulation of the gene to which it belongs.
pseudogene 8894 2179 Has homology to proteins but generally suffers from a disrupted coding sequence; an active homologous gene can be found at another locus. Sometimes these entries have an intact coding sequence or an open but truncated ORF, in which case other evidence (for example genomic polyA stretches at the 3' end) is used to classify them as pseudogenes. Can be further classified as one of the following.
processed_pseudogene 0 6368 Pseudogene that lacks introns and is thought to arise from reverse transcription of mRNA followed by reinsertion of DNA into the genome.
polymorphic_pseudogene 0 33 Pseudogene owing to a SNP/DIP but in other individuals/haplotypes/strains the gene is translated.
retrotransposed 0 290 Pseudogene owing to a reverse transcribed and re-inserted sequence.
transcribed_processed_pseudogene 0 62 Pseudogene where protein homology or genomic structure indicates a pseudogene, but the presence of locus-specific transcripts indicates expression.
transcribed_unprocessed_pseudogene 0 148
unitary_pseudogene 0 123 A species-specific unprocessed pseudogene without a parent gene, as it has an active orthologue in another species.
unprocessed_pseudogene 0 1277 Pseudogene that can contain introns, since it is produced by gene duplication.
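
The 50 bp rule quoted in the nonsense_mediated_decay row can be illustrated with a short Python sketch. This is not the GENCODE/HAVANA annotation code. The function names (`splice_junctions`, `is_nmd_candidate`) and the input layout (exon lengths in transcript order, plus the transcript offset at which the coding sequence ends) are hypothetical and chosen only for illustration. The sketch assumes that the distance is measured from the end of the coding sequence to an exon-exon junction in transcript coordinates, and that any downstream junction more than 50 bp away is enough to tag the transcript. It does not cover the second case in the description, where the variant lacks part of the reference coding sequence.

```python
from itertools import accumulate

# Threshold quoted in the nonsense_mediated_decay description.
NMD_DISTANCE_BP = 50


def splice_junctions(exon_lengths):
    """Return exon-exon junction offsets in transcript coordinates.

    exon_lengths lists exon sizes in 5'-to-3' transcript order
    (hypothetical input layout). The end of the last exon is not a junction.
    """
    return list(accumulate(exon_lengths))[:-1]


def is_nmd_candidate(exon_lengths, cds_end, threshold=NMD_DISTANCE_BP):
    """Return True if the CDS ends more than `threshold` bp before a junction.

    cds_end is the number of transcript bases up to and including the last
    base of the coding sequence (an assumption of this sketch).
    """
    if not 0 < cds_end <= sum(exon_lengths):
        raise ValueError("cds_end must lie within the transcript")
    # Junctions upstream of the CDS end give a negative distance and never match.
    return any(j - cds_end > threshold for j in splice_junctions(exon_lengths))


if __name__ == "__main__":
    exons = [100, 200, 150]  # junctions fall at offsets 100 and 300
    for end in (240, 250, 400):
        print(end, is_nmd_candidate(exons, end))
```

With these inputs, a coding sequence ending at offset 240 lies 60 bp before the junction at 300 and is flagged. One ending at 250 lies exactly 50 bp away and is not flagged, because the rule is strictly "more than 50 bp". One ending in the last exon has no downstream junction at all.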