NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.
This page describes use of the Sequence Alignment/Map (SAM) format for reporting short read alignments to the ENCODE DCC. This format was developed by the 1000 Genomes project for capturing short sequence read alignments in a platform-independent, complete, and efficient manner. The modENCODE DCC has been accepting SAM format during 2009, and the following page is largely based on their work. The ENCODE DCC will begin accepting SAM (or BAM, the binary encoding of SAM files) in January 2010; SAM/BAM will replace the UCSC tagAlign and pairedTagAlign formats.
The "samtools" toolkit (http://samtools.sourceforge.net) provides tools for conversion from standard sequencing pipeline formats, as well as a viewer and other utilities.
The spec for SAM format can be found at: http://samtools.sourceforge.net/SAM1.pdf
This format supports a user-defined header for metadata. The ENCODE DCC will reuse the header convention defined by the modENCODE DCC.
SAM header spec from modENCODE
These headers will not be present in the files that are converted with the samtools converters, so you will need to add a header to the top of the file to pass our vetting process.
At the top of the file, you need to include a header so that we know:
- what chromosomes are referred to in the file, and their lengths
- the genome build from which your reads are mapped
- the organism that this is for
This header must be tab-delimited.
@HD VN:1.0 SO:sorted (SAM/BAM format version and sort tag)
@PG ID=<program> VN=<version> CL="<command line>" (alignment program, version, and command line with all flags)
From the SAM spec:
Each header line begins with character '@' followed by a two-letter record type code. In the header, each line is TAB-delimited and each data field has an explicit field tag, which is represented using two ASCII characters.
@HD VN:1.0 SO:coordinate
@PG ID=Bowtie VN=0.11.3 CL="/amber1/archive/sgseq/workspace/software/bowtie/bowtie-0.11.3/bowtie /amber1/archive/sgseq/workspace/SZ/aligned/bowtie_indexes/h_sapiens_36.3 --solexa1.3-quals -m 1 -n 2 -e 150 -X 50 -S -1 /amber1/archive/sgseq/workspace/SZ/aligned/fastq/dv-index1/s_4_1_fastq.txt -2 /amber1/archive/sgseq/workspace/mSZ/aligned/fastq/dv-index1/s_4_2_fastq.txt"
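A header like the one above can be assembled and prepended to a converted SAM file with a short script. The following is a minimal sketch, not the DCC's own tooling. The function names `build_header` and `prepend_header` are hypothetical. The @HD and @PG lines mirror the example on this page. The @SQ lines use the SN, LN, AS, and SP tags from the SAM spec to carry chromosome names and lengths, genome build, and organism; this page does not show the exact @SQ layout the DCC expects, so treat that part as an assumption.

```python
def build_header(chrom_sizes, assembly, species, program, version, command_line):
    """Return a list of tab-delimited SAM header lines (no trailing newlines).

    chrom_sizes is a list of (name, length) pairs. The @SQ tag choice
    (SN, LN, AS, SP) follows the SAM spec and is an assumption about
    what the DCC vetting process expects.
    """
    lines = ["\t".join(["@HD", "VN:1.0", "SO:coordinate"])]
    for name, length in chrom_sizes:
        lines.append("\t".join([
            "@SQ",
            f"SN:{name}",      # reference sequence name
            f"LN:{length}",    # reference sequence length
            f"AS:{assembly}",  # genome build
            f"SP:{species}",   # organism
        ]))
    # The @PG fields follow the ID=/VN=/CL= form used in this page's example.
    lines.append("\t".join([
        "@PG",
        f"ID={program}",
        f"VN={version}",
        f'CL="{command_line}"',
    ]))
    return lines


def prepend_header(header_lines, sam_text):
    """Put the header lines at the top of headerless SAM text."""
    return "\n".join(header_lines) + "\n" + sam_text


if __name__ == "__main__":
    # Placeholder chromosome name, length, and command line for illustration only.
    header = build_header(
        [("chrA", 1000)], "hg19", "Homo sapiens",
        "Bowtie", "0.11.3", "bowtie -S",
    )
    print(prepend_header(header, "read1\t0\tchrA\t1\t255\t4M\t*\t0\t0\tACGT\tIIII\n"), end="")
```

Joining fields with an explicit "\t" avoids the common mistake of pasting a header whose tabs have been converted to spaces by an editor.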
The ENCODE DCC validates that sequences match the reference genome at the location specified, allowing a maximum of X mismatches within the first Y bases from the beginning of the sequence and discounting any bases marked 'N'. The values of X and Y are negotiated with each lab. We also validate that no reads map to chrY in cell lines known to be female.
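To check a submission before sending it in, the two validation rules can be approximated locally. This is a rough sketch under stated assumptions, not the DCC's validator: the function names are hypothetical, the alignment is assumed to be ungapped so that read and reference bases line up position by position, and an 'N' in either the read or the reference is discounted (the page does not say which side it means). The chrY check reads the reference name from the third tab-delimited column of each alignment line, as laid out in the SAM spec.

```python
def count_mismatches(read_seq, ref_seq, window):
    """Count mismatches within the first `window` bases of an ungapped alignment.

    Positions where either base is 'N' are skipped (an assumption about
    how "discounting N" is interpreted).
    """
    mismatches = 0
    pairs = zip(read_seq[:window].upper(), ref_seq[:window].upper())
    for read_base, ref_base in pairs:
        if read_base == "N" or ref_base == "N":
            continue
        if read_base != ref_base:
            mismatches += 1
    return mismatches


def passes_mismatch_check(read_seq, ref_seq, max_mismatches, window):
    """True if the read has at most `max_mismatches` (X) in the first `window` (Y) bases."""
    return count_mismatches(read_seq, ref_seq, window) <= max_mismatches


def has_chry_alignments(sam_lines):
    """True if any alignment line names chrY as its reference sequence."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line, not an alignment
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 2 and fields[2] == "chrY":
            return True
    return False


if __name__ == "__main__":
    print(passes_mismatch_check("ACGTN", "ACCTA", max_mismatches=1, window=5))
    print(has_chry_alignments(["r1\t0\tchrY\t100\t255\t4M\t*\t0\t0\tACGT\tIIII"]))
```

The values passed for `max_mismatches` and `window` should be the X and Y agreed with the DCC for your lab.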
There are several converters that come with the samtools toolkit. These are under active development, but many work. The converters that come with samtools include:
Note that the Bowtie-to-SAM converter that comes standard in samtools does not properly pair paired-end reads. The DCC has a conversion script to accommodate these kinds of reads, available from the svn repository below.
The DCC has written some additional converters to deal with other file formats. You can download these converters (and modify them as needed) from:
Please contact your wrangler or email@example.com if you are experiencing trouble converting your sequencing reads to SAM format.