NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.

SAM

Background

This page describes use of the Sequence Alignment/Map (SAM) format for reporting short read alignments to the ENCODE DCC. This format was developed by the 1000 Genomes project for capturing short sequence read alignments in a platform-independent, complete, and efficient manner. The modENCODE DCC has been accepting SAM format during 2009, and this page is largely based on their work. The ENCODE DCC will begin accepting SAM (or BAM, the binary encoding of SAM files) in January 2010; these formats will replace the UCSC tagAlign and pairedTagAlign formats.

The "samtools" toolkit (http://samtools.sourceforge.net) provides tools for conversion from standard sequencing pipeline formats, as well as a viewer and other utilities.

File specification

The spec for SAM format can be found at: http://samtools.sourceforge.net/SAM1.pdf

This format supports a user-defined header for metadata. The ENCODE DCC will reuse the header convention defined by the modENCODE DCC.

SAM header spec from modENCODE

These headers will not be present in the files that are converted with the samtools converters, so you will need to add a header to the top of the file to pass our vetting process.

At the top of the file, you need to include a header so that we know:

  1. what chromosomes are referred to in the file, and their lengths
  2. the genome build to which your reads are mapped
  3. the organism the data come from

This header must be tab-delimited.
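As an illustration of the three requirements above, here is a minimal Python sketch that builds one tab-delimited @SQ line per chromosome. It assumes the information is carried in the SAM spec's SN (sequence name), LN (length), AS (assembly), and SP (species) tags; this page does not list the exact tags, so confirm them with your wrangler. The function name build_sq_header and the chromosome lengths are illustrative only; take real lengths from the chromosome sizes of your own build.

```python
def build_sq_header(chrom_sizes, assembly, species):
    """Return tab-delimited @SQ header lines, one per chromosome.

    chrom_sizes: iterable of (name, length) pairs.
    assembly, species: written to the AS and SP tags (assumed convention).
    """
    lines = []
    for name, length in chrom_sizes:
        fields = [
            "@SQ",
            f"SN:{name}",
            f"LN:{length}",
            f"AS:{assembly}",
            f"SP:{species}",
        ]
        # Fields are joined with TAB characters, never spaces.
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


# Illustrative input; substitute the chromosomes and lengths of your build.
header = build_sq_header(
    [("chr1", 247249719), ("chrY", 57772954)],
    assembly="hg18",
    species="Homo sapiens",
)
```

The resulting text can be written to the output file first, followed by the alignment lines produced by a converter.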

Additional header tags recommended for ENCODE (may later be required)

@HD     VN:1.0 SO:<sort order>                                SAM/BAM format version and sort order (unsorted, queryname, or coordinate)
@PG     ID:<program> VN:<version> CL:"<command line>"        Alignment program, version, and command line with all flags

From the SAM spec:

Each header line begins with character '@' followed by a two-letter record type code. In the header, each line is TAB-delimited and each data field has an explicit field tag, which is represented using two ASCII characters.

For example:

 @HD      VN:1.0 SO:coordinate
 @PG      ID:Bowtie VN:0.11.3 CL:"/amber1/archive/sgseq/workspace/software/bowtie/bowtie-0.11.3/bowtie /amber1/archive/sgseq/workspace/SZ/aligned/bowtie_indexes/h_sapiens_36.3 --solexa1.3-quals -m 1 -n 2 -e 150 -X 50 -S -1 /amber1/archive/sgseq/workspace/SZ/aligned/fastq/dv-index1/s_4_1_fastq.txt -2 /amber1/archive/sgseq/workspace/mSZ/aligned/fastq/dv-index1/s_4_2_fastq.txt"
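Before submitting, it can help to check header lines mechanically. The Python sketch below is one possible check, not the DCC's vetting code. It assumes the TAG:VALUE field layout used by the SAM spec and tests only the shape of each line (leading '@', two-letter record type, TAB-delimited fields, two-character tags). The name parse_header_line is hypothetical, and free-text @CO comment lines are not handled.

```python
import re

# A data field is a two-character tag, a colon, then a non-empty value.
TAG_FIELD = re.compile(r"^[A-Za-z][A-Za-z0-9]:.+$")


def parse_header_line(line):
    """Split one SAM header line into (record_type, {tag: value}).

    Raises ValueError if the line is not TAB-delimited TAG:VALUE fields.
    """
    line = line.rstrip("\n")
    if not line.startswith("@"):
        raise ValueError("header lines must start with '@'")
    record_type, *fields = line.split("\t")
    if len(record_type) != 3:
        # A space-delimited line ends up here, because nothing was split on TAB.
        raise ValueError(f"bad record type or missing TABs: {record_type!r}")
    tags = {}
    for field in fields:
        if not TAG_FIELD.match(field):
            raise ValueError(f"field is not TAG:VALUE: {field!r}")
        tag, value = field.split(":", 1)
        tags[tag] = value
    return record_type[1:], tags


record, tags = parse_header_line("@HD\tVN:1.0\tSO:coordinate")
```

A line that was typed with spaces instead of TABs raises ValueError, which is the most common reason a hand-written header fails this kind of check.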

Validation

The ENCODE DCC validates that each sequence matches the reference genome at the specified location, allowing at most X mismatches within the first Y bases of the sequence and discounting any bases marked 'N'. The values of X and Y are negotiated with each lab. We also validate that no reads map to chrY for cell lines known to be female.
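The two checks described above can be sketched in Python as follows. This is a rough approximation under stated assumptions, not the DCC's validator: it assumes an ungapped, forward-strand alignment in which the read and the reference slice are already the same orientation, it discounts only read bases that are 'N', and it treats a record as a chrY mapping when its mapped flag is set and its reference name is exactly "chrY". All function names are hypothetical.

```python
def count_mismatches(read, reference, window):
    """Count mismatches in the first `window` bases, skipping read bases 'N'."""
    mismatches = 0
    for read_base, ref_base in zip(read[:window].upper(), reference[:window].upper()):
        if read_base == "N":
            continue  # bases marked 'N' are discounted
        if read_base != ref_base:
            mismatches += 1
    return mismatches


def passes_mismatch_check(read, reference, max_mismatches, window):
    """True if the read has at most `max_mismatches` (X) in the first `window` (Y) bases."""
    return count_mismatches(read, reference, window) <= max_mismatches


def has_chr_y_alignments(sam_lines):
    """True if any mapped alignment record names chrY as its reference."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not a complete alignment record
        is_unmapped = int(fields[1]) & 0x4  # FLAG bit 0x4 marks an unmapped read
        if not is_unmapped and fields[2] == "chrY":
            return True
    return False


n_mismatches = count_mismatches("ACGTN", "ACCTA", 5)
```

Here the read differs from the reference at the third base and its final 'N' is skipped, so one mismatch is counted.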

Converters

There are several converters that come with the samtools toolkit. These are under active development, but many work. The converters that come with samtools include:

  • maq2sam
  • blast2sam
  • bowtie2sam
  • soap2sam
  • psl2sam

Note that the bowtie-to-SAM converter that comes standard with samtools does not properly pair paired-end reads. The DCC has a conversion script that handles paired-end reads, available from the svn repository below.

The DCC has also written additional converters for some other file formats. You can download these converters (and modify them as needed) from:

 svn co svn://public-svn.modencode.org/modencode/private/tools/converters

Please contact your wrangler or encode@soe.ucsc.edu if you are experiencing trouble converting your sequencing reads to SAM format.