NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.

DAS format specifications

From Encode2 Wiki

To make data exchange and integration easy, DAS sources should return data in an unambiguous, agreed-upon format. Below is a list of the tags and how we could use them within GENCODE. These are only suggestions at this point; comments are welcome, either here or, preferably, on the mailing list. Once agreed upon, please update your own DAS server accordingly.

We are trying to use definitions from the GFF3 format specification and the Sequence Ontology (SO) project. Please note the following differences:

  • Orientation (strand) has values "+", "-" or "0"
  • The main ID to use must be in the Feature Id tag

Please include timestamps in the correct format for every feature, indicating when this feature was created or changed.

Information on how to encode alignments can be found here

A program to verify your DAS source and to identify problems can be found here.

List of tags, their meaning within GENCODE and an example tag:

1. Feature (id):

Required: Unique id of features. ID attribute of column 9 of GFF3.

  <FEATURE id="OTTHUMT00000157419.1">

2. Feature (label):

Optional: label to show for this feature

  <FEATURE id="OTTHUMT00000157419.1" label="OTTHUMT00000157419.1">

3. Type (id):

Required: Type of this feature, as defined in the SO. Must be one of the following; please check whether you need additional terms.

exon: SO:0000147
 "A region of the transcript sequence within a gene which is not removed from the primary RNA transcript by RNA splicing."
intron: SO:0000188
 "A segment of DNA that is transcribed, but removed from within the transcript by splicing together the sequences (exons) on either side of it."
UTR: SO:0000203
 "Messenger RNA sequences that are untranslated and lie five prime and three prime to sequences which are translated."
CDS: SO:0000316
 "A contiguous (coding) sequence which begins with, and includes, a start codon and ends with, and includes, a stop codon."
CDS_fragment: SO:0001384
 "A contiguous (coding) sequence which misses start or stop codon."
coding_start: SO:0000323
 "The first base to be translated into protein."
coding_end: SO:0000327
 "The last base to be translated into protein. It does not include the stop codon."
sequence_variant: SO:0001060
 "A sequence_variant is a non exact copy of a sequence_feature or genome exhibiting one or more sequence_alteration."
transcript: SO:0000673
 "An RNA synthesized on a DNA or RNA template by an RNA polymerase."

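The allowed type ids above can be kept in a small lookup table, so that a server rejects unknown types before emitting a feature. The following is a minimal sketch in Python; the names `SO_TYPES` and `so_accession` are illustrative only and not part of any DAS library.

```python
# SO terms permitted as TYPE ids, copied from the list above.
SO_TYPES = {
    "exon": "SO:0000147",
    "intron": "SO:0000188",
    "UTR": "SO:0000203",
    "CDS": "SO:0000316",
    "CDS_fragment": "SO:0001384",
    "coding_start": "SO:0000323",
    "coding_end": "SO:0000327",
    "sequence_variant": "SO:0001060",
    "transcript": "SO:0000673",
}


def so_accession(type_id):
    """Return the SO accession for an allowed type id, or raise ValueError."""
    try:
        return SO_TYPES[type_id]
    except KeyError:
        raise ValueError(
            f"type id {type_id!r} is not in the agreed SO list"
        ) from None
```

A lookup is case-sensitive here, so "cds" would be rejected while "CDS" is accepted; whether that is desirable is a choice for the server author.
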
4. Type (category):

Required: Origin of the annotation: ECO code describing the type of method

  <TYPE id="exon" category="inferred from RT-PCR experiment (ECO:0000109)">exon</TYPE>

These codes can be queried with the OboEdit tool and this data file (loaded with "File -> Load Terms"). Some examples that might be useful:

  • id: ECO:0000109; name: inferred from RT-PCR experiment
  • id: ECO:0000067; name: inferred from electronic annotation
  • id: ECO:0000053; name: inferred from reviewed computational analysis
  • id: ECO:0000028; name: inferred from motif similarity
  • id: ECO:0000044; name: inferred from sequence similarity
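
As an illustration of the category convention, the sketch below builds a TYPE element whose category attribute combines the ECO name and id in the "name (id)" form used in the example above. It uses Python's standard `xml.etree.ElementTree`; the table `ECO_NAMES` (a subset of the codes listed above) and the helper `make_type_element` are hypothetical names chosen for this example.

```python
import xml.etree.ElementTree as ET

# A subset of the ECO codes listed above, keyed by id.
ECO_NAMES = {
    "ECO:0000109": "inferred from RT-PCR experiment",
    "ECO:0000053": "inferred from reviewed computational analysis",
    "ECO:0000028": "inferred from motif similarity",
    "ECO:0000044": "inferred from sequence similarity",
}


def make_type_element(type_id, eco_id):
    """Build a TYPE element with category formatted as 'name (ECO id)'."""
    name = ECO_NAMES[eco_id]  # raises KeyError for codes not in the table
    elem = ET.Element("TYPE", {"id": type_id, "category": f"{name} ({eco_id})"})
    elem.text = type_id
    return elem


type_elem = make_type_element("exon", "ECO:0000109")
print(ET.tostring(type_elem, encoding="unicode"))
```
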

5. Start:

Required: Start location of this feature in absolute NCBI36 genomic coordinates


6. End:

Required: End location of this feature in absolute NCBI36 genomic coordinates


7. Orientation:

Required: "+", "-" for stranded features, "0" for strand-independent features


8. Phase:

Required: "0","1","2" for coding features, "-" for others


9. Score:

Required: analysis-specific score (floating point number only) if applicable (multiple scores in Note), "-" otherwise
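
The rules for Start, End, Orientation, Phase and Score (items 5 to 9) are simple enough to check mechanically. The sketch below takes the five values as strings, as they would appear in the XML, and returns a list of problems. The function name `check_positional_fields` is made up for this example, and the check that Start does not exceed End is an assumption that this page does not state.

```python
def _is_int(text):
    """True if text parses as an integer."""
    try:
        int(text)
        return True
    except ValueError:
        return False


def check_positional_fields(start, end, orientation, phase, score):
    """Return a list of problems in items 5 to 9; an empty list means none."""
    problems = []
    if not (_is_int(start) and _is_int(end)):
        problems.append("START and END must be integers")
    elif int(start) > int(end):
        # Assumption: start <= end; not stated explicitly on this page.
        problems.append("START is greater than END")
    if orientation not in {"+", "-", "0"}:
        problems.append('ORIENTATION must be "+", "-" or "0"')
    if phase not in {"0", "1", "2", "-"}:
        problems.append('PHASE must be "0", "1", "2" or "-"')
    if score != "-":
        try:
            float(score)
        except ValueError:
            problems.append('SCORE must be a floating point number or "-"')
    return problems
```

For example, `check_positional_fields("9884493", "9884938", "+", "-", "-")` returns an empty list, while a GFF-style orientation of "." is reported as a problem.
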


10. Note:

Required: List of "Type=Value" pairs for information for this feature as found in column 9 of the GFF specs, plus additions.

     important to easily identify an updated data set

  • Created (optional): datestamp as before indicating when this feature first appeared
  • Name (optional): The official name of the feature, should also go into Feature label/id
  • Alias (optional): A secondary name for the feature (can be split up into Genealias and Transcriptalias if necessary)
  • Parent (optional): Indicates the parent of the feature; GROUP-IDs still need to be given to comply with DAS specs.
  • Genetype, Genestatus, Transcripttype, Transcriptstatus (optional): types and status as applicable
  • Description (optional): official gene description or similar
  • Note (optional): free text remarks
 <NOTE>NOTE=contains no frame-shifts or stop codons</NOTE>
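
A client reading these notes needs to split each one into its type and value. The sketch below does this with Python's standard `xml.etree.ElementTree`, assuming that each NOTE element carries exactly one "Type=Value" pair and that the value may itself contain "=" characters; both are assumptions, and `parse_notes` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def parse_notes(feature_elem):
    """Return (type, value) tuples for the NOTE children of a FEATURE element.

    Assumes one "Type=Value" pair per NOTE; text without "=" is returned
    with a type of None.
    """
    pairs = []
    for note in feature_elem.findall("NOTE"):
        text = (note.text or "").strip()
        key, sep, value = text.partition("=")  # split on the first "=" only
        if sep:
            pairs.append((key.strip(), value.strip()))
        else:
            pairs.append((None, text))
    return pairs


feature = ET.fromstring(
    '<FEATURE id="OTTHUMT00000157419.1">'
    "<NOTE>NOTE=contains no frame-shifts or stop codons</NOTE>"
    "</FEATURE>"
)
print(parse_notes(feature))
```
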

11. Method:

Required: Name of the method generating this data

  <METHOD id="havana_manual_annotation">havana_manual_annotation</METHOD>

12. Target:

Optional: for additional positional features (zero or more).

  <TARGET id="AF254982.1-001" start="9884493" stop="9884938">GENE|9884493|9884938</TARGET>

13. Link (zero or more):

Optional: http link for this feature

    <LINK href="">
      show in vega transcript view</LINK>

14. Group:

Optional: Joins features together, e.g. exons and introns into transcripts. Attributes: id (required), label (optional), type (optional)

  <GROUP id="OTTHUMT00000157419" type="OTTHUMG00000074130">

14.a. Note:

Optional: list of "Type=Value" pairs for information for this GROUP as in 10.

   <NOTE>DESCR=novel protein containing an Immunoglobulin V-set domain</NOTE>

14.b. Link (zero or more):

Optional: http link for this GROUP.

    <LINK href="">
      show in vega transcript view</LINK>
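
To show how the GROUP id joins features, the sketch below collects the feature ids that share each group id, using Python's standard `xml.etree.ElementTree`. The wrapping SEGMENT element, the feature ids `exon_1` and `intron_1`, and the function name `features_by_group` are made up for this example; only the GROUP attributes come from the example in item 14.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def features_by_group(features_xml):
    """Map each GROUP id to the ids of the FEATURE elements that carry it."""
    root = ET.fromstring(features_xml)
    groups = defaultdict(list)
    for feature in root.iter("FEATURE"):
        for group in feature.findall("GROUP"):
            groups[group.get("id")].append(feature.get("id"))
    return dict(groups)


# Hypothetical input: two features that belong to the same transcript.
SAMPLE = """
<SEGMENT>
  <FEATURE id="exon_1">
    <GROUP id="OTTHUMT00000157419" type="OTTHUMG00000074130"/>
  </FEATURE>
  <FEATURE id="intron_1">
    <GROUP id="OTTHUMT00000157419" type="OTTHUMG00000074130"/>
  </FEATURE>
</SEGMENT>
"""

print(features_by_group(SAMPLE))
```
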