NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.

Data submission status

From Encode2 Wiki
Revision as of 11:51, 3 October 2012 by Cricket (talk | contribs) (Pipeline)

The data submission pipeline is open for submissions:


Automated reporting tools (beta):


The latest ENCODE file validator is available here:

More information on SAM/BAM format is here: SAM

The hg19 reference sequences are here: [1]

The mm9 reference sequences are here: [2]

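As an example, the following commands sort, index, and validate a BAM file. With samtools 0.1.x, 'sort' takes an output prefix and appends '.bam' itself (samtools 1.x instead uses 'samtools sort -o file.bam unsorted_file.bam'):

 % samtools sort unsorted_file.bam file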
 % samtools index file.bam
 % validateFiles -type=BAM -chromInfo=male.hg19.chrom.sizes -genome=hg19.2bit -mismatches=6 -nMatch -isSort file.bam
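
The three steps can also be wrapped in one shell function so that a failure in any step stops the chain. This is only a sketch: 'prepare_bam' and the 'RUN' switch are hypothetical names, the samtools 0.1.x prefix convention is assumed, and the validateFiles options are copied unchanged from the example above.

```shell
# Sketch: sort, index, and validate one BAM file.
# RUN is a hypothetical switch: set RUN=echo to print the commands
# instead of running them.
RUN=${RUN:-}

prepare_bam() {
    in_bam=$1    # unsorted input BAM
    prefix=$2    # output prefix; samtools 0.1.x writes "$prefix.bam"
    # Assumption: samtools 0.1.x. With samtools 1.x the sort step would be:
    #   samtools sort -o "$prefix.bam" "$in_bam"
    $RUN samtools sort "$in_bam" "$prefix" &&
    $RUN samtools index "$prefix.bam" &&
    $RUN validateFiles -type=BAM -chromInfo=male.hg19.chrom.sizes \
        -genome=hg19.2bit -mismatches=6 -nMatch -isSort "$prefix.bam"
}
```

For example, 'RUN=echo; prepare_bam unsorted_file.bam file' prints the three commands without running them, which is a cheap way to check the file names before starting a long sort.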

Freeze history

Dates of previous freezes are here:

Information about the mid-course freeze is:

Previous spreadsheets are here:


To submit data to GEO, follow the instructions here:

You will use the "GEOarchive spreadsheet format".

- Be sure to look at the example tabs in the spreadsheet.
- If you are adding data to an existing series on GEO, you may leave the "SERIES" section in the spreadsheet blank.
- You shouldn't send tarred data. Untar all data. Fastqs and other large files can and should be gzipped.
- If you are submitting fastq raw data, you do not need to submit alignment bams.
- For a new series, you'll need to include the bioproject ID. Put this below the "SRA_center_name_code" field in the SERIES header of the spreadsheet. The ID numbers are below:

human genomic: 63443
human transcriptomic: 30709
human protein: 63447

mouse genomic: 63471
mouse transcriptomic: 66167
mouse protein: 63475
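
If the spreadsheets are filled in by script, the ID list can be kept as a small lookup so the numbers are not retyped by hand. The function name 'bioproject_id' is hypothetical; the IDs are exactly the ones listed above.

```shell
# Sketch: look up the bioproject ID for an organism and data type.
# bioproject_id is a hypothetical helper, not part of any GEO or ENCODE tool.
bioproject_id() {
    case "$1 $2" in
        "human genomic")        echo 63443 ;;
        "human transcriptomic") echo 30709 ;;
        "human protein")        echo 63447 ;;
        "mouse genomic")        echo 63471 ;;
        "mouse transcriptomic") echo 66167 ;;
        "mouse protein")        echo 63475 ;;
        *) echo "unknown organism/type: $1 $2" >&2; return 1 ;;
    esac
}
```

Usage: 'bioproject_id mouse transcriptomic' prints 66167, and an unrecognized combination returns a non-zero status instead of printing a wrong ID.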

Once you have prepared your metadata and your processed and raw files, use the FTP upload instructions:

- GEO indicates that you can zip/tar your entire submission or send an untarred directory. We recommend just sending a directory, which is much easier (tips for using the FTP are included at the bottom).
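
Before uploading, the directory can be checked against the packaging advice on this page (no tar archives, fastq files gzipped). The following is a minimal sketch: 'stage_for_geo' is a hypothetical helper, and the file extensions it looks for ('.fastq', '.fq', '.tar', '.tar.gz', '.tgz') are assumptions about how your files are named.

```shell
# Sketch: prepare a directory for upload.
# stage_for_geo is a hypothetical helper, not a GEO or ENCODE tool.
stage_for_geo() {
    dir=$1
    # Stop if any tar archive is present: the data should be sent untarred.
    if find "$dir" -type f \( -name '*.tar' -o -name '*.tar.gz' -o -name '*.tgz' \) | grep -q .; then
        echo "tar archive found in $dir; untar it first" >&2
        return 1
    fi
    # Gzip fastq files that are not already compressed (replaces each file
    # with a .gz copy, as gzip does by default).
    find "$dir" -type f \( -name '*.fastq' -o -name '*.fq' \) -exec gzip {} \;
}
```

Usage: 'stage_for_geo your_directory'. Note that gzip removes the uncompressed originals, so run it on the copy you intend to upload.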

After uploading files, email

- Identify your submission as ENCODE project data.
- Indicate that you'd like to submit to the "ucsc_encode_dcc" account.
- Give the names (or location) of the uploaded files.
- If you are adding data to an existing series on GEO, include in your email that this data is to be added to GSE#####.

FTP Tips:

For Linux/Unix, we recommend that you try 'ncftp' with optimized settings as detailed in:

Here is a typical 'ncftp' session:

1. Connect to the server:


2. Set buffer size (optional):

set so-bufsize 33554432

3. Transfer your archive, or an entire directory plus content using:

put archive_name.tar.gz


put -R directory_name
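
The choice between the two 'put' forms depends only on whether the local path is a file or a directory. As a small sketch, the hypothetical helper 'ncftp_put_command' below prints the command to type at the ncftp prompt; it does not connect to any server.

```shell
# Sketch: choose the ncftp transfer command for a local path.
# ncftp_put_command is a hypothetical helper; it only prints the command.
ncftp_put_command() {
    path=$1
    if [ -d "$path" ]; then
        echo "put -R $path"    # directory plus content
    elif [ -f "$path" ]; then
        echo "put $path"       # single file or archive
    else
        echo "no such file or directory: $path" >&2
        return 1
    fi
}
```

Usage: 'ncftp_put_command directory_name' prints 'put -R directory_name' when directory_name exists as a directory.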

In Windows and Mac OS X we recommend the free client software, FileZilla:

1. Once the FileZilla client is installed, connect using:

host:
username: geo
password: D0gDAzr0va

2. Drag and drop your file(s) or directory into the /fasp directory on the FTP server. When transferring multiple files, we prefer that the files be dropped into a directory (e.g. fasp/your_directory).