NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.

Data Submission Instructions

From Encode2 Wiki
Revision as of 08:50, 4 May 2012 by Vsmalladi (talk | contribs) (Additional validation experiments)

Step 1: Establish a data agreement for each data type produced by your project.

  • Data agreements are created with the DCC engineer (data wrangler) for your project, as listed at Data Plans.
  • Each data type (e.g. 'ChIP-seq', 'RNA-chip') will usually correspond to a composite track (containing many subtracks) in the browser.
  • The data agreement negotiated with your data wrangler will include:
  1. Documentation of experimental methods
  2. Experimental variables that will become metadata terms (e.g. cell types for human and mouse, antibodies, etc.)
  3. Format for data files
  • Upon completion of the data agreement, your data wrangler will create two metadata files:
  1. The Data Agreement File or DAF describes constants of the data agreement (data types, file formats, and experimental variables), and is used as a reference for each data submission. This file should not be changed unilaterally.
  2. A sample Data Definition File or DDF is filled in for each data submission by the submitter, using the DAF file as reference.
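As an illustration only, a DDF is typically a tab-separated text file whose columns are dictated by your DAF. The column names and values below are hypothetical, not a template to copy; always use the sample DDF provided by your wrangler:

```text
files	view	cell	antibody	replicate
rep1.bigWig	Signal	GM12878	CTCF	1
rep2.bigWig	Signal	GM12878	CTCF	2
```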

Step 2: Ensure any needed metadata terms are registered with the DCC.

  • Check the appropriate wiki pages for terms needed and add any that are not found. These terms must be registered with the DCC before submissions can be successfully validated.
    • To register new tier 3 cell lines, add them to the wiki and submit the cell growth protocol to the Resources Working Group for approval. Upon approval, the cell line will be registered with the DCC.
    • Antibodies can be added to the wiki. Contact your wrangler if these antibodies do not appear in the Registered list by the time you are ready to submit data.
    • Any additional terms that may be needed should also be added to the wiki. If this requirement applies to your data, it should have become clear during step 1 above.

Step 3: Submit processed data to the DCC using the Data Submission Pipeline website.

  • Create a Data Submission Package:
    • Fill out the sample DDF file including references to all files to be submitted.
    • Create an archive file (tar.gz, tgz, tar.bz2, or zip) containing the DAF and DDF in the top-level directory along with all data files. The DDF should give paths to data files relative to the root of the archive.
  • Create a user account and log in to the Data Submission Pipeline website.
  • Create a new submission directory on the website. A submission directory will hold a single submission for a unique set of tracks/tables displayed in the browser. (help on website)
  • Submit the archive file to that submission directory. Submissions can be made via URL, local file upload, or FTP.
  • The data wrangler will be automatically notified once your data has been successfully submitted.
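The packaging step above can be sketched in shell. This is a minimal illustration with hypothetical file names (myLab.daf, myLab.ddf, rep1.bigWig); your real DAF, DDF, and data files come from steps 1–3, and the placeholder files here exist only so the sketch runs end to end:

```shell
#!/bin/sh
# Build a submission directory with the DAF and DDF at the top level
# and data files at paths matching those listed in the DDF.
mkdir -p submission/signal
: > submission/myLab.daf           # placeholder for the DAF from your wrangler
: > submission/myLab.ddf           # placeholder for the DDF you filled in
: > submission/signal/rep1.bigWig  # placeholder data file referenced by the DDF

# Archive from inside the directory (-C) so paths inside the tarball
# are relative to the archive root, as the DDF expects.
tar -czf myLab_submission.tar.gz -C submission .

# Verify the archive contents before uploading.
tar -tzf myLab_submission.tar.gz
```

The key detail is the `-C submission .` form, which keeps the DAF and DDF at the top level of the archive rather than nesting them under a `submission/` prefix.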

Step 4: Review and approve the results of the data submission.

  • Once data is submitted, validated, and loaded into the database by the automated process, the data wrangler for your project will configure the resulting tracks on the test browser.
  • You will be contacted by your data wrangler to review the tracks when they are available.
  • Once you have approved them, they will undergo our standard QA process and then be released on the public browser. You will be notified by email when public release occurs.

Submitting corrected sets of data.

  • If your original submission failed, a list of errors can be found on the Data Submission Pipeline website by clicking on the submission status (e.g. "validate failed"). Your data wrangler can help you understand and resolve any problems listed, such as "invalid file format".
  • When errors have been corrected, create a new submission directory and submit your corrected archive containing the full set of files to this new directory. By submitting to a new directory each time, you are assured that none of your data is overwritten. Submission directories containing failed submissions will be cleaned up by your data wrangler, once the full submission is successful.

Submitting new sets of data.

  • When you are ready with a new set of data, start again at either step 1 or step 2 above.
    • If the new set of data is of a type you have previously submitted, reuse the DAF file for that type and start at step 2 above.
    • If the new set of data is of a new data type (e.g. ChIP-seq, DNase-seq), start at step 1 above. Contact your data wrangler to work out a new DAF and example DDF appropriate for the new data type.

Submitting additional data.

Additional non-displayed files per experiment

Files such as individual protocol documents or additional analysis files that either can't (e.g. fastq files) or won't (e.g. peak files generated by an alternative analysis) be displayed in a browser track can be submitted as downloads only. These files will have metadata and will be searchable and filterable. If they are of a displayable file type, they can be downloaded and displayed as a custom track in the browser. These downloads-only files are considered to be in the "attic." This requires a DAF that specifies downloads only for that view.
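For illustration, a downloads-only view in a DAF might look like the fragment below. The field names and values shown are assumptions for the sketch; take the exact fields and views from the DAF your wrangler prepares:

```text
view	RawData
type	fastq
required	no
downloadOnly	yes
```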

Additional experiments

The ENCODE policy is to display two replicates per experiment and not to display non-production data that varies only in experimental or technical details (such as donor, read length, or small protocol differences). The ENCODE display is intended to publish production data, not to work out experimental variation. However, additional experiments of interest can be submitted as downloads only ("attic"). They will not be displayed in the track or appear in the Table Browser, but they will have metadata, including a UCSC accession, and will be available via file search and on the downloads pages. These experiments can be downloaded and used with our custom track feature. If the lab considers a significant number of these experiments worth displaying, a track hub should be considered.

Additional non-displayed files per composite

If there is data at the composite level, rather than the experiment level, that should be submitted and made available with the main data, it can be placed in the supplemental directory. These files can be images, Word documents, or spike-in sequences. They will not have metadata, but they will be linked from the track description page and will be available on the downloads pages. The lab is responsible for clear naming and for providing any information files required to interpret the data.

Additional validation experiments

Additional validation data or experiments can be linked to the browser track in several ways: they can be submitted directly to GEO, with a link to the GEO submission placed in the track description and the composite metadata; they can be placed in the supplemental directory of the composite; or, if the data would benefit greatly from display, a track hub can be used.

If there are any questions, contact your wrangler for assistance.