NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.

Data submission packaging

Getting Started

  • Before packaging, test your data files as custom tracks
  • The UCSC data wrangler will provide you with tab-separated text metadata files with the correct extensions (.DAF, .DDF). You will need to modify the DDF to provide file names and other details from your experimental data. The DDF is a tab-delimited file. When saving Excel spreadsheets as tab-delimited text files, it may be necessary to remove the .txt extension.

DAF: The Data Agreement File

The Data Agreement File is a metadata file that specifies track types and submission file formats for a data type. It is created by the DCC as part of the data agreement with the data-providing lab, and should not be changed unilaterally. All fields not marked optional are required. NOTE: This file was originally called the Project Information File (PIF), but the name was changed to avoid conflict with the Windows .pif executable format.

  • DAF header field definitions:
    grant                 validated
    lab                   validated
    dataType              validated
    compositeSuffix       Basis for table and filenames. Usually this will be <lab><dataType>, e.g. UwDnaseSeq.
    dataAgreementSuffix   Additional basis for table and filenames. Optional; use when there are multiple data types or labs within a composite.
    group                 Track group in the browser (e.g. map, expression, regulation).
    variables             Comma-delimited list of variable columns used in the DDF. This should match the expVars found in the mdb for that composite.
    assembly              Genome assembly used to map data (hg18, hg19, mm9).
    medianFragmentLength  Median length of fragments size-selected for sequencing. OBSOLETE
    fragmentLengthRange   Range of fragment lengths size-selected for sequencing. OBSOLETE
    dafVersion            Used to make sure older DAFs aren't used if the DAF format changes.
    dataVersion           If data already released to the public needs to be resubmitted, include this term with a simple integer: 2, 3, 4... (1 is implied).
    validationSettings    Default settings for the validator; link to the settings.
  • DAF view list field definitions:
    view              Each file type has its own view (validated).
    shortLabelPrefix  Optional; used to construct short labels of subtracks. OBSOLETE
    longLabelPrefix   Used to construct long labels of subtracks. OBSOLETE
    type              Data format; specifies processing, display and storage (see the File Formats page; e.g. bed, bigWig, bam, fastq).
    hasReplicates     "yes" or "no"; specifies whether this track/view includes a replicate number.
    required          "yes" or "no"; specifies whether this track/view is required.
  • Example
# Data Agreement File for Stanford ChIP-Seq project

# This file specifies the data agreement between your lab and 
# the DCC.  It should not be changed unilaterally.

 # Lab and general info
 grant             Myers
 lab               HudsonAlpha
 dataType          ChipSeq
 compositeSuffix   HudsonAlphaChipSeq
 variables         cell, antibody
 assembly          hg18
 medianFragmentLength 225
 fragmentLengthRange  150-300
 dafVersion        1.0

 # Track/view definition
 view             Alignments
 longLabelPrefix  HudsonAlpha ChIP-Seq Alignments
 type             tagAlign
 hasReplicates    yes
 required         yes

 view             RawSignal
 longLabelPrefix  HudsonAlpha ChIP-Seq Raw Signal
 type             wig
 hasReplicates    yes
 required         no

 view             Signal
 longLabelPrefix  HudsonAlpha ChIP-Seq Signal
 type             wig
 hasReplicates    no
 required         yes

 view             Peaks
 longLabelPrefix  HudsonAlpha ChIP-Seq Peaks
 type             narrowPeak
 hasReplicates    no
 required         yes

 view             RawData
 type             fastq
 hasReplicates    yes 
 required         yes
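
The layout of the example above (header settings first, then one block of settings per view) can be read with a few lines of Python. This is only a minimal sketch and not the DCC's own parser: the function name parse_daf is made up for illustration, and it assumes that each setting is a name followed by whitespace and a value, that lines starting with # are comments, and that every "view" line starts a new view block.

```python
def parse_daf(text):
    """Split DAF text into header settings and a list of view blocks."""
    header, views = {}, []
    current = header  # settings go to the header until the first "view" line
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if key == "view":
            current = {}  # assumption: each "view" line opens a new block
            views.append(current)
        current[key] = value
    return header, views


# Shortened version of the example DAF above.
example = """\
# Lab and general info
grant             Myers
lab               HudsonAlpha
variables         cell, antibody
assembly          hg18

view             Alignments
type             tagAlign
hasReplicates    yes
required         yes

view             Peaks
type             narrowPeak
hasReplicates    no
required         yes
"""

header, views = parse_daf(example)
variables = [v.strip() for v in header["variables"].split(",")]
required_views = [v["view"] for v in views if v["required"] == "yes"]
```

Here variables holds the column names that the DDF is expected to carry, and required_views lists the views marked "required yes".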

DDF: The Data Definition File

The DDF is a tab-delimited columnar file that contains entries for the files which you are providing, along with the metadata required for the DCC to interpret them. The columns are defined by the DCC as part of the data agreement, and should not be changed unilaterally. Some of the fields (e.g. cell, antibody) require controlled terms which must be registered to the EncodeWiki before data submission. The view column contains one of the data representations listed in the DAF for the project. Because the file is tab-delimited, leave blank fields empty (zero characters) rather than entering a placeholder such as "NA".

The File Formats page contains details about the required and optional files for each experiment type.

  • Field definitions:
    files            Comma-separated list of names of files containing track data; each name should match the path in the tar file (e.g. it may include leading subdirectory names). Filenames may include the '?' and '*' wildcard characters.
    view             Matches an entry in the DAF. Defines how data is loaded and displayed.
    cell             Required if specified in the variables list in the DAF. The value is validated against the controlled vocabulary.
    antibody         Required if specified in the variables list in the DAF. The value is validated against the controlled vocabulary, or it may be "input" or "control" for input data.
    replicate        Integer replicate number. The column is required if any view has "hasReplicates yes" in the DAF; values are only used for views with that setting.
    accession        Optional comma-separated list of accession numbers.
    labVersion       Optional; free text; useful for lab-specific characterization of datasets and for metadata not yet formalized.
    softwareVersion  Free text; required for the Peaks view to identify peak callers; can also be used to specify software used for other views.
  • NOTE: Terms entered into the DDF will be saved as metadata and will be viewable in the public browser as well as on the downloads page.
  • An Example:
files                       view            cell     antibody   accession  softwareVersion replicate
data/SK_N_MC_1_chr?.bed     Alignments      SK-N-MC  FOXP2                                 1
data/Raw_SK_N_MC_1_chr*.wig RawSignal       SK-N-MC  FOXP2                                 1
data/SK_N_MC_2_chr?.bed     Alignments      SK-N-MC  FOXP2                                 2
data/Raw_SK_N_MC_2_chr*.wig RawSignal       SK-N-MC  FOXP2                                 2
data/SK_N_MC_chr*.wig       Signal          SK-N-MC  FOXP2
data/SK_N_MC_chr*.pk        Peaks           SK-N-MC  FOXP2                 1.297
data/SK_N_MC_1_chr1.fastq   RawData         SK-N-MC  FOXP2      GEO666                     1
data/SK_N_MC_2_chr1.fastq   RawData         SK-N-MC  FOXP2      GEO666                     2
data/K562.bed               Alignments      K562     GABP                                  1
data/K562.wig               Signal          K562     GABP                                    
data/                Peaks           K562     GABP                  1.297           
data/K562.fastq             RawData         K562     GABP       GEO999                     1
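
Before packaging, it can help to check a DDF against the rules described in this section. The following is a rough sketch, not the DCC validator: check_ddf is a hypothetical helper, the data/demo.* file names are made up, and the only rules it encodes are the ones stated on this page (the view must appear in the DAF, the columns named in "variables" must be filled in, views with "hasReplicates yes" need an integer replicate, and blank fields must be empty rather than "NA"). It does not check terms against the controlled vocabulary.

```python
import csv
import io


def check_ddf(ddf_text, views, variables):
    """Return a list of problems found in a tab-delimited DDF.

    views maps each DAF view name to True if it has "hasReplicates yes";
    variables is the list of column names from the DAF "variables" setting.
    """
    problems = []
    reader = csv.DictReader(io.StringIO(ddf_text), delimiter="\t")
    for n, row in enumerate(reader, start=2):  # line 1 is the column header
        view = (row.get("view") or "").strip()
        if view not in views:
            problems.append(f"line {n}: view '{view}' is not in the DAF")
            continue
        for var in variables:
            if not (row.get(var) or "").strip():
                problems.append(f"line {n}: missing value for '{var}'")
        replicate = (row.get("replicate") or "").strip()
        if views[view] and not replicate.isdigit():
            problems.append(f"line {n}: view '{view}' needs an integer replicate")
        values = [(row.get(name) or "").strip() for name in reader.fieldnames]
        if "NA" in values:
            problems.append(f"line {n}: use an empty field, not 'NA'")
    return problems


# Hypothetical DDF with two deliberate mistakes (lines 4 and 5).
ddf = "\n".join([
    "files\tview\tcell\tantibody\taccession\tsoftwareVersion\treplicate",
    "data/demo.bed\tAlignments\tK562\tGABP\t\t\t1",
    "data/demo.wig\tSignal\tK562\tGABP\t\t\t",
    "data/demo.fastq\tRawData\tK562\tGABP\tGEO999\t\t",
    "data/demo.pk\tPeaks\tK562\tNA\t\t1.297\t",
])
views = {"Alignments": True, "Signal": False, "Peaks": False, "RawData": True}
problems = check_ddf(ddf, views, ["cell", "antibody"])
for problem in problems:
    print(problem)
```

The views mapping would normally be built from the DAF for the project; it is written out by hand here to keep the sketch self-contained.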

Create an Upload Archive

Create a gzipped tar archive with the metadata files at the top level of the archive.

  • An Example:
     tar cvzf stanfordChip.Jun08.tar.gz StanfordChip.DAF StanfordChip.June08.DDF data/*
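
For labs that script their submissions, the same archive layout can be produced with Python's standard tarfile module. This is a sketch under stated assumptions rather than a required procedure: build_archive and top_level_metadata are hypothetical helper names, and the Demo.DAF, Demo.DDF and data/demo.bed files are placeholders created only for the demonstration. The point it illustrates is that the metadata files are stored under their bare names so that they sit at the top level of the archive, while data files keep their subdirectory.

```python
import os
import tarfile
import tempfile


def build_archive(archive_path, metadata_files, data_dir):
    """Write a gzipped tar with metadata files at the top level."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in metadata_files:
            # arcname drops any leading directories from the stored name
            tar.add(path, arcname=os.path.basename(path))
        # the data directory is stored under its own name, e.g. data/...
        tar.add(data_dir, arcname=os.path.basename(os.path.normpath(data_dir)))


def top_level_metadata(archive_path):
    """List .DAF/.DDF members that sit at the top level of the archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(
            name for name in tar.getnames()
            if "/" not in name and name.upper().endswith((".DAF", ".DDF"))
        )


# Demonstration with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    daf = os.path.join(tmp, "Demo.DAF")
    ddf = os.path.join(tmp, "Demo.DDF")
    data_dir = os.path.join(tmp, "data")
    os.mkdir(data_dir)
    for path in (daf, ddf, os.path.join(data_dir, "demo.bed")):
        with open(path, "w") as handle:
            handle.write("placeholder\n")

    archive = os.path.join(tmp, "demo.tar.gz")
    build_archive(archive, [daf, ddf], data_dir)
    with tarfile.open(archive, "r:gz") as tar:
        names = sorted(tar.getnames())
    metadata = top_level_metadata(archive)
```

Listing the archive members afterwards, as top_level_metadata does, is a quick way to confirm that both metadata files ended up at the top level before uploading.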