NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.
Data submission packaging
- Before packaging, test your data files as custom tracks
- The UCSC data wrangler will provide you with tab-separated text metadata files with the correct extensions (.DAF, .DDF). You will need to modify the DDF to supply the file names and other details of your experimental data. The DDF is a tab-delimited file; when saving an Excel spreadsheet as tab-delimited text, it may be necessary to remove the .txt extension from the saved file.
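As an illustration of the extension clean-up mentioned above, here is a minimal Python sketch. The helper names `strip_txt_suffix` and `rename_without_txt` are hypothetical, and the file name in the usage comment is only borrowed from the archive example on this page.

```python
from pathlib import Path


def strip_txt_suffix(path: Path) -> Path:
    """Return the path without a trailing .txt (any case); otherwise unchanged."""
    if path.suffix.lower() == ".txt":
        return path.with_name(path.stem)
    return path


def rename_without_txt(path: Path) -> Path:
    """Rename the file on disk if it carries a .txt suffix; return the final path."""
    target = strip_txt_suffix(path)
    if target != path:
        path.rename(target)
    return target


# Usage (hypothetical file name):
#   rename_without_txt(Path("StanfordChip.June08.DDF.txt"))
```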
DAF: The Data Agreement File
The Data Agreement File is a metadata file that specifies the track types and submission file formats for a data type. It is created by the DCC as part of the data agreement with the data-providing lab, and should not be changed unilaterally. All fields not marked optional are required. NOTE: This file was originally called the Project Information File (PIF), but the name was changed to avoid a conflict with the Windows .pif executable format.
- DAF header field definitions:
|compositeSuffix||Basis for table and filenames. Usually this will be <lab><dataType>, e.g. UwDnaseSeq|
|dataAgreementSuffix||Additional basis for table and filenames. Optional -- use when there are multiple data types or labs within a composite.|
|group||Track group in the browser (e.g. map, expression, regulation)|
|variables||Comma delimited list of variable columns used in the DDF. This should match the expVars found in the mdb for that composite.|
|assembly||Genome assembly used to map data (hg18, hg19, mm9)|
|medianFragmentLength||Median length of the fragments size-selected for sequencing. OBSOLETE|
|fragmentLengthRange||Range of fragment lengths selected for sequencing. OBSOLETE|
|dafVersion||This is used to make sure older DAFs aren't used if the DAF format changes|
|dataVersion||If data already released to the public needs to be resubmitted, this term should be included and should contain a simple integer: 2,3,4... (1 is implied).|
|validationSettings||default settings for the validator, link to the settings|
- DAF view list field definitions:
|view||Each file type has its own view (validated).|
|shortLabelPrefix||Optional; used to construct short labels of subtracks. OBSOLETE|
|longLabelPrefix||Used to construct long labels of subtracks. OBSOLETE|
|type||Data format; specifies processing, display and storage (see File Formats; e.g. bed, bigWig, bam, fastq)|
|hasReplicates||value: "yes" or "no"; Specifies whether this track/view includes a replicate number|
|required||value: "yes" or "no"; Specifies whether this track/view is required|
    # Data Agreement File for Stanford ChIP-Seq project

    # This file specifies the data agreement between your lab and
    # the DCC. It should not be changed unilaterally.

    # Lab and general info
    grant                 Myers
    lab                   HudsonAlpha
    dataType              ChipSeq
    compositeSuffix       HudsonAlphaChipSeq
    variables             cell, antibody
    assembly              hg18
    medianFragmentLength  225
    fragmentLengthRange   150-300
    dafVersion            1.0

    # Track/view definition
    view                  Alignments
    longLabelPrefix       HudsonAlpha ChIP-Seq Alignments
    type                  tagAlign
    hasReplicates         yes
    required              yes

    view                  RawSignal
    longLabelPrefix       HudsonAlpha ChIP-Seq Raw Signal
    type                  wig
    hasReplicates         yes
    required              no

    view                  Signal
    longLabelPrefix       HudsonAlpha ChIP-Seq Signal
    type                  wig
    hasReplicates         no
    required              yes

    view                  Peaks
    longLabelPrefix       HudsonAlpha ChIP-Seq Peaks
    type                  narrowPeak
    hasReplicates         no
    required              yes

    view                  RawData
    type                  fastq
    hasReplicates         yes
    required              yes
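To make the DAF layout concrete, the following Python sketch reads a DAF into a header dictionary and a list of view blocks. It is not the DCC's parser. It assumes, based only on the example above, that each setting is a field name followed by whitespace and a value, that `#` starts a comment line, and that every `view` line opens a new view block. The names `parse_daf`, `split_variables` and `DAF_EXAMPLE` are hypothetical, and the sample text is an abbreviated copy of the example.

```python
from typing import Dict, List, Tuple

# Abbreviated copy of the DAF example, for illustration only.
DAF_EXAMPLE = """\
# Lab and general info
grant            Myers
lab              HudsonAlpha
dataType         ChipSeq
compositeSuffix  HudsonAlphaChipSeq
variables        cell, antibody
assembly         hg18
dafVersion       1.0

# Track/view definition
view             Alignments
longLabelPrefix  HudsonAlpha ChIP-Seq Alignments
type             tagAlign
hasReplicates    yes
required         yes

view             Peaks
longLabelPrefix  HudsonAlpha ChIP-Seq Peaks
type             narrowPeak
hasReplicates    no
required         yes
"""


def parse_daf(text: str) -> Tuple[Dict[str, str], List[Dict[str, str]]]:
    """Split DAF text into header fields and a list of view blocks."""
    header: Dict[str, str] = {}
    views: List[Dict[str, str]] = []
    current = header  # fields belong to the header until the first "view" line
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)  # field name, then the rest of the line
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if key == "view":
            current = {}  # assumption: each "view" line starts a new block
            views.append(current)
        current[key] = value
    return header, views


def split_variables(header: Dict[str, str]) -> List[str]:
    """Turn the comma-delimited 'variables' field into a list of column names."""
    return [v.strip() for v in header.get("variables", "").split(",") if v.strip()]


header, views = parse_daf(DAF_EXAMPLE)
print(header["compositeSuffix"], split_variables(header))
print([(v["view"], v["type"]) for v in views])
```

Keeping the views as a list preserves the order in which they appear in the file, which is convenient when the result is later compared against DDF rows.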
DDF: The Data Definition File
The DDF is a tab-delimited columnar file that contains entries for the files which you are providing, along with required metadata so that the files can be interpreted by the DCC. The columns are defined by the DCC as part of the data agreement, and should not be changed unilaterally. Some of the fields (e.g. cell, antibody) require controlled terms which must be registered to the EncodeWiki before data submission. The view column contains one of the data representations listed in the DAF file for the project. This is a tab-delimited file, so blank fields should be entered with zero characters (i.e. not "NA").
The File Formats page contains details about the required and optional files for each experiment type.
- Field definitions:
|files||Comma-separated list of names of files with track data; should match whatever is in the tar file (e.g. may include leading subdirectory names). Filenames may include the '?' and '*' wildcard characters.|
|view||Matches entry in DAF file. Defines how data is loaded and displayed.|
|cell||Required if specified in the variables list in the DAF. The value is validated against the controlled vocabulary.|
|antibody||Required if specified in the variables list in the DAF. The value is validated against the controlled vocabulary, or it may be "input" or "control" for input data.|
|replicate||Required if any view has "hasReplicates yes"; integer replicate version; only used for views that have "hasReplicates yes" in DAF.|
|accession||Optional comma-separated list of accession numbers.|
|labVersion||Optional; free text; useful for lab specific characterization of datasets and inclusion of metadata not formalized already.|
|softwareVersion||Free text; required for Peaks view to identify peak callers; can also be used to specify software used for other views.|
- NOTE: Terms entered into the DDF will be saved as metadata and will be viewable in the public browser as well as on the downloads page.
- An Example (fields are tab-separated in the actual file; they are aligned with spaces here for readability, and blank cells stand for empty fields):
    files                        view        cell     antibody  accession  softwareVersion  replicate
    data/SK_N_MC_1_chr?.bed      Alignments  SK-N-MC  FOXP2                                 1
    data/Raw_SK_N_MC_1_chr*.wig  RawSignal   SK-N-MC  FOXP2                                 1
    data/SK_N_MC_2_chr?.bed      Alignments  SK-N-MC  FOXP2                                 2
    data/Raw_SK_N_MC_2_chr*.wig  RawSignal   SK-N-MC  FOXP2                                 2
    data/SK_N_MC_chr*.wig        Signal      SK-N-MC  FOXP2
    data/SK_N_MC_chr*.pk         Peaks       SK-N-MC  FOXP2                1.297
    data/SK_N_MC_1_chr1.fastq    RawData     SK-N-MC  FOXP2     GEO666                      1
    data/SK_N_MC_2_chr1.fastq    RawData     SK-N-MC  FOXP2     GEO666                      2
    data/K562.bed                Alignments  K562     GABP                                  1
    data/K562.wig                Signal      K562     GABP
    data/K562.pk                 Peaks       K562     GABP                 1.297
    data/K562.fastq              RawData     K562     GABP      GEO999                      1
Create an Upload Archive
Create a gzipped tar archive with the metadata files at the top level of the archive.
- An Example:
tar cvzf stanfordChip.Jun08.tar.gz StanfordChip.DAF StanfordChip.June08.DDF data/*
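For labs that script their submissions, the same archive can be built and sanity-checked from Python with the standard `tarfile` module. This is a sketch under assumptions rather than a DCC tool: the helper names `build_archive`, `member_names` and `unmatched_patterns` are hypothetical, and using `fnmatch` to mimic the '?' and '*' wildcards of the DDF `files` column is an assumption about how those patterns are matched.

```python
import fnmatch
import tarfile
from pathlib import Path
from typing import Iterable, List


def build_archive(out_path: Path, metadata_files: Iterable[Path], data_dir: Path) -> None:
    """Write a gzipped tar with the metadata files at the top level."""
    with tarfile.open(str(out_path), "w:gz") as tar:
        for meta in metadata_files:
            # arcname drops any leading directories, so the DAF and DDF
            # land at the top level of the archive.
            tar.add(str(meta), arcname=Path(meta).name)
        # The data directory is added recursively under its own name, e.g. data/...
        tar.add(str(data_dir), arcname=Path(data_dir).name)


def member_names(archive: Path) -> List[str]:
    """List the regular files stored in the archive."""
    with tarfile.open(str(archive), "r:gz") as tar:
        return [m.name for m in tar.getmembers() if m.isfile()]


def unmatched_patterns(patterns: Iterable[str], names: List[str]) -> List[str]:
    """Return the DDF file patterns that match no archive member."""
    return [p for p in patterns if not fnmatch.filter(names, p)]


# Usage (file names taken from the example above):
#   build_archive(Path("stanfordChip.Jun08.tar.gz"),
#                 [Path("StanfordChip.DAF"), Path("StanfordChip.June08.DDF")],
#                 Path("data"))
#   names = member_names(Path("stanfordChip.Jun08.tar.gz"))
#   print(unmatched_patterns(["data/K562.bed", "data/SK_N_MC_chr*.wig"], names))
```

Checking the DDF patterns against the member list catches the common case where a file named in the DDF was left out of the archive or stored under a different subdirectory.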