NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.
Data submission status
The data submission pipeline is open for submissions: http://encodesubmit.ucsc.edu.
Automated reporting tools (beta):
- Human ENCODE status summary by project (clickable bar chart)
- Mouse ENCODE status summary by project (clickable bar chart)
- ENCODE status detail by experiment (spreadsheet)
The latest ENCODE file validator is available here:
More information on SAM/BAM format is here: SAM
The hg19 reference sequences are here: 
The mm9 reference sequences are here: 
As an example, the following commands sort, index, and validate a BAM file:
% samtools sort unsorted_file.bam file.bam
% samtools index file.bam
% validateFiles -type=BAM -chromInfo=male.hg19.chrom.sizes -genome=hg19.2bit -mismatches=6 -nMatch -isSort file.bam
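The three commands above can be chained so that a failure in one step stops the rest. The following is a minimal sketch, not part of the ENCODE tooling: the function names `run` and `prepare_bam` and the `DRY_RUN` variable are hypothetical, and the command lines are copied from the example above. Note that the argument form of `samtools sort` differs between samtools releases, so check the usage message of your installed version before relying on it.

```shell
# Sketch: sort, index, and validate one BAM file, stopping at the first failure.
# DRY_RUN, run, and prepare_bam are illustrative names, not ENCODE tools.

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = unsorted input BAM, $2 = sorted output BAM (as in the example above).
prepare_bam() {
    unsorted=$1
    sorted=$2
    run samtools sort "$unsorted" "$sorted" &&
    run samtools index "$sorted" &&
    run validateFiles -type=BAM -chromInfo=male.hg19.chrom.sizes \
        -genome=hg19.2bit -mismatches=6 -nMatch -isSort "$sorted"
}
```

Setting `DRY_RUN=1` before calling `prepare_bam unsorted_file.bam file.bam` prints the three commands without running them, which is a quick way to check the file names first.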
Dates of previous freezes are here:
Information about the mid-course freeze is:
Previous spreadsheets are here:
GEO SUBMISSION INSTRUCTIONS
To submit data to GEO, follow the instructions here: http://www.ncbi.nlm.nih.gov/geo/info/seq.html#seqdeposit
You will use the "GEOarchive spreadsheet format".
- Be sure to look at the example tabs in the spreadsheet
- If you are adding data to an existing series on GEO, you may leave the "SERIES" section in the spreadsheet blank
- You shouldn't send tarred data. Untar all data. Fastqs and other large files can and should be gzipped
- If you are submitting fastq raw data, you do not need to submit alignment bams
- For a new series, you'll need to include the bioproject ID. Put this below the "SRA_center_name_code" field in the SERIES header of the spreadsheet. The ID numbers are below:
human genomic: 63443
human transcriptomic: 30709
human protein: 63447
mouse genomic: 63471
mouse transcriptomic: 66167
mouse protein: 63475
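If you prepare several spreadsheets, the ID numbers listed above can be kept in a small lookup so the right value is pasted each time. This is only a sketch: the function name `bioproject_id` and its two arguments (organism, data type) are hypothetical, and the numbers are the ones given above.

```shell
# Sketch: look up the bioproject ID for an organism and data type.
# bioproject_id is an illustrative helper, not part of any GEO or ENCODE tool.
bioproject_id() {
    case "$1/$2" in
        human/genomic)         echo 63443 ;;
        human/transcriptomic)  echo 30709 ;;
        human/protein)         echo 63447 ;;
        mouse/genomic)         echo 63471 ;;
        mouse/transcriptomic)  echo 66167 ;;
        mouse/protein)         echo 63475 ;;
        *)
            # Anything else is not covered by the list above.
            echo "unknown organism/data type: $1 $2" >&2
            return 1
            ;;
    esac
}
```

For example, `bioproject_id mouse transcriptomic` prints 66167.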
Once you have prepared your metadata and your processed and raw files, upload them following the FTP instructions: http://www.ncbi.nlm.nih.gov/geo/info/seq.html#FTP
- GEO indicates that you can zip/tar your entire submission or send an untarred directory. We recommend sending a directory, which is much easier (tips for using the FTP are included at the bottom).
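The packaging rules given earlier (no tarred data, gzipped fastq files) can be checked on the upload directory before transfer. The following is a rough sketch under stated assumptions: the function name `check_submission_dir` is hypothetical, and the file name patterns (`*.tar`, `*.tar.gz`, `*.tgz`, `*.fastq`, `*.fq`) are guesses at common naming, not a GEO requirement.

```shell
# Sketch: flag tar archives and uncompressed FASTQ files in a submission directory.
# check_submission_dir is an illustrative helper; the patterns are assumptions.
check_submission_dir() {
    dir=$1
    status=0

    # Tar archives should be unpacked before upload.
    tars=$(find "$dir" -type f \( -name '*.tar' -o -name '*.tar.gz' -o -name '*.tgz' \))
    if [ -n "$tars" ]; then
        echo "tar archives found (untar before upload):"
        echo "$tars"
        status=1
    fi

    # FASTQ files should be gzipped.
    plain=$(find "$dir" -type f \( -name '*.fastq' -o -name '*.fq' \))
    if [ -n "$plain" ]; then
        echo "uncompressed FASTQ files found (gzip before upload):"
        echo "$plain"
        status=1
    fi

    return $status
}
```

Calling `check_submission_dir directory_name` returns a non-zero status and lists the offending files if anything needs to be unpacked or compressed.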
After uploading files, email firstname.lastname@example.org:
- Identify your submission as ENCODE project data
- Indicate that you'd like to submit to the "ucsc_encode_dcc" account
- Give the names (or location) of the uploaded files
- If you are adding data to an existing series on GEO, include in your email that this data is to be added to GSE#####.
For Linux/Unix, we recommend that you try 'ncftp' with optimized settings as detailed in: ftp://ftp.ncbi.nih.gov/README.ftp
Here is a typical 'ncftp' session:
1. Connect to the server:
2. Set buffer size (optional):
set so-bufsize 33554432
3. Transfer your archive, or an entire directory plus content using:
put -R directory_name
On Windows and Mac OS X, we recommend the free client software FileZilla: http://filezilla-project.org/download.php?type=client
1. Once the FileZilla client is installed, connect using:
host ftp-private.ncbi.nih.gov
username geo
password D0gDAzr0va
2. Drag and drop your file(s) or directory into the /fasp directory on the FTP server. When transferring multiple files, we prefer that they be placed in a directory of their own (e.g. fasp/your_directory).