NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.

Data submission status

The data submission pipeline is open for submissions: http://encodesubmit.ucsc.edu.

Submissions

Automated reporting tools (beta):

Tools

The latest ENCODE file validator is available here:

More information on SAM/BAM format is here: SAM

The hg19 reference sequences are here: [1]

The mm9 reference sequences are here: [2]

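As an example, the following commands sort, index, and validate a BAM file. With samtools 0.1.x the second argument to sort is an output prefix, so the prefix "file" produces file.bam; with samtools 1.3 and later, use "samtools sort -o file.bam unsorted_file.bam" instead:

 % samtools sort unsorted_file.bam file
 % samtools index file.bam
 % validateFiles -type=BAM -chromInfo=male.hg19.chrom.sizes -genome=hg19.2bit -mismatches=6 -nMatch -isSort file.bam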
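
The sort, index, and validate steps can be sketched as one shell function that prints the commands for a given file, so they can be reviewed before running. This is a minimal sketch, not part of the ENCODE tools: the name bam_check_cmds is hypothetical, the chrom.sizes and 2bit file names are copied from the example above, and it assumes samtools 0.1.x, where sort takes an output prefix rather than an output file name.

```shell
# Print the sort, index, and validate commands for one BAM file.
# $1 = unsorted input BAM, $2 = output prefix (without ".bam").
# Assumption: samtools 0.1.x, where "sort" takes an output prefix.
# bam_check_cmds is a hypothetical helper name.
bam_check_cmds() (
    in_bam=$1
    prefix=$2
    printf 'samtools sort %s %s\n' "$in_bam" "$prefix"
    printf 'samtools index %s.bam\n' "$prefix"
    printf 'validateFiles -type=BAM -chromInfo=male.hg19.chrom.sizes -genome=hg19.2bit -mismatches=6 -nMatch -isSort %s.bam\n' "$prefix"
)
```

For example, "bam_check_cmds unsorted_file.bam file" prints the three commands for file.bam. File names containing spaces would need extra quoting before the printed commands are run.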


Freeze history

Dates of previous freezes are here:

Information about the mid-course freeze is:

Previous spreadsheets are here:

GEO submission instructions

To submit data to GEO, follow the instructions here: http://www.ncbi.nlm.nih.gov/geo/info/seq.html#seqdeposit

You will use the "GEOarchive spreadsheet format".

- Be sure to look at the example tabs in the spreadsheet
- If you are adding data to an existing series on GEO, you may leave the "SERIES" section in the spreadsheet blank
- You shouldn't send tarred data. Untar all data. Fastqs and other large files can and should be gzipped
- If you are submitting fastq raw data, you do not need to submit alignment bams
- For a new series, you'll need to include the bioproject ID. Put this below the "SRA_center_name_code" field in the SERIES header of the spreadsheet. The ID numbers are below:

human genomic: 63443
human transcriptomic: 30709
human protein: 63447

mouse genomic: 63471
mouse transcriptomic: 66167
mouse protein: 63475
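
When preparing many spreadsheets, the ID table above can be kept as a small lookup function so the number is not retyped by hand. This is only a sketch: the function name bioproject_id and the lowercase organism and type keywords are hypothetical, while the ID values are the ones listed above.

```shell
# Print the bioproject ID for an organism and data type.
# $1 = human | mouse, $2 = genomic | transcriptomic | protein.
# bioproject_id is a hypothetical helper; IDs are taken from the list above.
bioproject_id() {
    case "$1:$2" in
        human:genomic)        echo 63443 ;;
        human:transcriptomic) echo 30709 ;;
        human:protein)        echo 63447 ;;
        mouse:genomic)        echo 63471 ;;
        mouse:transcriptomic) echo 66167 ;;
        mouse:protein)        echo 63475 ;;
        *) echo "unknown organism/type: $1 $2" >&2; return 1 ;;
    esac
}
```

Any other combination prints an error to stderr and returns a non-zero status, so a typo does not silently produce an empty ID.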

Once you have prepared your metadata and your processed and raw files, follow the FTP upload instructions: http://www.ncbi.nlm.nih.gov/geo/info/seq.html#FTP

- GEO indicates that you can zip/tar your entire submission or send an untarred directory. We recommend sending a directory, which is much easier (tips for using the FTP are included at the bottom)
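
Before uploading a directory, it can help to scan it for files that conflict with the guidance above: tar archives, and FASTQ files that have not been gzipped. The function below is a rough sketch; the name check_submission_dir and the file extensions it looks for (.tar, .tar.gz, .tgz, .fastq, .fq) are assumptions, not GEO or ENCODE requirements.

```shell
# List files in a submission directory that conflict with the guidance:
# tar archives, and FASTQ files that are not gzipped.
# $1 = submission directory (scanned recursively).
# check_submission_dir and the extension list are assumptions.
check_submission_dir() (
    dir=$1
    find "$dir" -type f \( -name '*.tar' -o -name '*.tar.gz' -o -name '*.tgz' \) \
        | sed 's/^/tarred: /'
    find "$dir" -type f \( -name '*.fastq' -o -name '*.fq' \) \
        | sed 's/^/not gzipped: /'
)
```

The function prints nothing when no such files are found. Files reported as "not gzipped" can be compressed in place with gzip before the upload.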

After uploading files, email geo@ncbi.nlm.nih.gov:

- Identify your submission as ENCODE project data
- Indicate that you'd like to submit to the "ucsc_encode_dcc" account
- Give the names (or location) of the uploaded files
- If you are adding data to an existing series on GEO, include in your email that this data is to be added to GSE#####.
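
The points above can be sketched as a small function that prints a draft email body, which is then pasted into a mail client and edited as needed. The function name geo_email_body and the exact wording of each line are hypothetical; only the four points themselves come from the list above.

```shell
# Print a draft notification email body for GEO.
# $1 = names or location of the uploaded files.
# $2 = optional accession of an existing series to add the data to.
# geo_email_body and the sentence wording are hypothetical.
geo_email_body() (
    uploaded=$1
    series=${2:-}
    echo "This submission is ENCODE project data."
    echo "Please submit it to the \"ucsc_encode_dcc\" account."
    echo "Uploaded files: $uploaded"
    if [ -n "$series" ]; then
        echo "Please add this data to the existing series $series."
    fi
)
```

For example, geo_email_body your_directory "GSE#####" prints all four lines, and omitting the second argument drops the line about the existing series.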

FTP Tips:

For Linux/Unix, we recommend that you try 'ncftp' with optimized settings as detailed in: ftp://ftp.ncbi.nih.gov/README.ftp

Here is a typical 'ncftp' session:

1. Connect to the server:

ncftp ftp://geo:D0gDAzr0va@ftp-private.ncbi.nih.gov/fasp

2. Set buffer size (optional):

set so-bufsize 33554432

3. Transfer your archive, or an entire directory plus content using:

put archive_name.tar.gz

or

put -R directory_name
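
For unattended uploads, the interactive session above can be approximated with ncftpput, the batch upload tool that ships in the same NcFTP package. The sketch below only prints the command so it can be checked first. It is not from the GEO instructions: the function name geo_upload_cmd and the environment variable GEO_FTP_PASSWORD are hypothetical, and the remote directory "fasp" is taken from the connection URL in step 1.

```shell
# Print an ncftpput command that uploads one directory recursively.
# $1 = local directory to upload.
# GEO_FTP_PASSWORD is a hypothetical environment variable; it is left
# unexpanded in the printed command so the password is not echoed.
geo_upload_cmd() {
    printf 'ncftpput -R -u geo -p "$GEO_FTP_PASSWORD" ftp-private.ncbi.nih.gov fasp %s\n' "$1"
}
```

Calling geo_upload_cmd directory_name prints a single ncftpput command line. The optional buffer-size setting from step 2 is not reproduced here.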

On Windows and Mac OS X, we recommend the free client software FileZilla: http://filezilla-project.org/download.php?type=client

1. Once the FileZilla client is installed, connect using:

host: ftp-private.ncbi.nih.gov
username: geo
password: D0gDAzr0va

2. Drag-n-drop your file(s) or directory into the /fasp directory on the FTP server. When transferring multiple files we prefer that the files be dropped into a directory (i.e. fasp/your_directory).