NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.

File Formats


This page describes the file formats for ENCODE data. Many of these are variations of the UCSC BED format. When BED formats are extended, they are referred to as BEDn+, where n is the number of fields that are standard BED. In the descriptions below, the notation BEDn+m indicates that the format has n standard BED fields followed by m non-standard fields. For example, a BED file with 9 fields, of which the first 6 are standard, is referred to as BED6+3.

Note that BED chromosomal positions are defined on a zero-based half-open interval, so that the first position is 0, and chromEnd - chromStart is the length of the aligning region of the genome. (E.g. a 30nt sequence aligning to the first 30 bases on a chromosome would have chromStart=0 and chromEnd=30).
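As a quick illustration of the zero-based half-open convention, the following Python sketch computes a feature's length and converts the interval to one-based closed coordinates. The function names are illustrative only and are not part of any ENCODE or UCSC tool.

```python
def bed_length(chrom_start, chrom_end):
    """Length of a zero-based, half-open BED interval."""
    return chrom_end - chrom_start


def bed_to_one_based(chrom_start, chrom_end):
    """Convert a BED interval to one-based, fully closed coordinates."""
    return chrom_start + 1, chrom_end


# The 30nt example from the text: the first 30 bases of a chromosome.
length = bed_length(0, 30)
one_based = bed_to_one_based(0, 30)
print(length, one_based)
```

Because the end coordinate is excluded, adjacent features can share a boundary value (one feature's chromEnd equals the next feature's chromStart) without overlapping.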


BigWig: Format

See http://genome.ucsc.edu/FAQ/FAQformat#format3


BAM: Format

See http://genome.ucsc.edu/FAQ/FAQformat#format3


ENCODE RNA Elements: BED6 + 3 Scores Format

field type description
chrom string Chromosome (or contig, scaffold, etc.)
chromStart int The starting position of the feature in the chromosome. The first base in a chromosome is numbered 0.
chromEnd int The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
name string Name given to a region (preferably unique). Use '.' if no name is assigned.
score int Indicates how dark the peak will be displayed in the browser (1-1000). Ideally average signalValue per base spread between 100-1000.
strand char +/- to denote strand or orientation (whenever applicable). Use '.' if no orientation is assigned.
level float Expression level such as RPKM or FPKM.
signif float Statistical significance such as IDR. Use a period (".") when a significance score is not applicable.
score2 int Additional measurement/count e.g. number of reads.

SAM: Sequence Alignment Map Format

This format is used to provide genomic mapping of short sequence tags. It supersedes the tagAlign formats used initially for ENCODE submissions (through the mid-course evaluation data freeze, January 15, 2010). See the SAM/BAM wiki page for more information.


narrowPeak: Narrow (or Point-Source) Peaks Format

This format is used to provide called peaks of signal enrichment based on pooled, normalized (interpreted) data. It is a BED6+4 format.

field type description
chrom string Name of the chromosome
chromStart int The starting position of the feature in the chromosome. The first base in a chromosome is numbered 0.
chromEnd int The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
name string Name given to a region (preferably unique). Use '.' if no name is assigned.
score int Indicates how dark the peak will be displayed in the browser (1-1000). If '0', the DCC will assign this based on signal value. Ideally average signalValue per base spread between 100-1000.
strand char +/- to denote strand or orientation (whenever applicable). Use '.' if no orientation is assigned.
signalValue float Measurement of overall (usually, average) enrichment for the region.
pValue float Measurement of statistical significance (-log10). Use -1 if no pValue is assigned.
qValue float Measurement of statistical significance using false discovery rate. Use -1 if no qValue is assigned.
peak int Point-source called for this peak; 0-based offset from chromStart. Use -1 if no point-source called.

e.g.

chrX    9091548 9091648 .       0       .       182     5.0945  -1  50
chrX    9358722 9358822 .       0       .       91      4.6052  -1  40
chrX    9391082 9391182 .       0       .       182     9.2103  -1  75
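The field table above can be turned into a small parser. The following is a minimal Python sketch, assuming whitespace-separated fields; the class and function names are hypothetical and chosen only for illustration. It also shows how the peak offset combines with chromStart to give an absolute summit position.

```python
from typing import NamedTuple


class NarrowPeak(NamedTuple):
    chrom: str
    chrom_start: int
    chrom_end: int
    name: str
    score: int
    strand: str
    signal_value: float
    p_value: float
    q_value: float
    peak: int


def parse_narrow_peak(line):
    """Parse one whitespace-separated narrowPeak line (BED6+4)."""
    f = line.split()
    if len(f) != 10:
        raise ValueError(f"expected 10 fields, got {len(f)}")
    return NarrowPeak(f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5],
                      float(f[6]), float(f[7]), float(f[8]), int(f[9]))


def summit_position(rec):
    """Absolute zero-based summit position, or None when peak is -1."""
    return None if rec.peak == -1 else rec.chrom_start + rec.peak


# First example line from above.
rec = parse_narrow_peak("chrX 9091548 9091648 . 0 . 182 5.0945 -1 50")
print(rec.chrom, summit_position(rec))
```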


broadPeak: Broad Peaks (or Regions) Format

This format is used to provide called regions of signal enrichment based on pooled, normalized (interpreted) data. It is a BED6+3 format.

field type description
chrom string Name of the chromosome.
chromStart int The starting position of the feature in the chromosome. The first base in a chromosome is numbered 0.
chromEnd int The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
name string Name given to a region (preferably unique). Use '.' if no name is assigned.
score int Indicates how dark the peak will be displayed in the browser (1-1000). If '0', the DCC will assign this based on signal value. Ideally average signalValue per base spread between 100-1000.
strand char +/- to denote strand or orientation (whenever applicable). Use '.' if no orientation is assigned.
signalValue float Measurement of overall (usually, average) enrichment for the region.
pValue float Measurement of statistical significance (-log10). Use -1 if no pValue is assigned.
qValue float Measurement of statistical significance using false discovery rate. Use -1 if no qValue is assigned.
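The score field in the peak formats is meant to spread between 100 and 1000 in proportion to signalValue. This page does not describe how the DCC assigns scores, so the following Python sketch is only one plausible approach: a linear min-max rescaling. The function name and the choice of linear scaling are assumptions, not the DCC's documented method.

```python
def scale_scores(signal_values, lo=100, hi=1000):
    """Linearly rescale signal values into the integer range [lo, hi].

    Assumption: simple min-max scaling; the DCC's actual rule is not
    specified on this page.
    """
    s_min, s_max = min(signal_values), max(signal_values)
    if s_max == s_min:
        # All values equal: give every item the same (top) score.
        return [hi for _ in signal_values]
    span = s_max - s_min
    return [int(round(lo + (v - s_min) * (hi - lo) / span))
            for v in signal_values]


scores = scale_scores([91.0, 182.0, 136.5])
print(scores)
```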


gappedPeak: Gapped Peaks (or Regions) Format

This format is used to provide called regions of signal enrichment based on pooled, normalized (interpreted) data where the regions may be spliced or incorporate gaps in the genomic sequence. It is a BED12+3 format.

field type description
chrom string Name of the chromosome.
chromStart int The starting position of the feature in the chromosome. The first base in a chromosome is numbered 0.
chromEnd int The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
name string Name given to a region (preferably unique). Use '.' if no name is assigned.
score int Indicates how dark the peak will be displayed in the browser (1-1000). If '0', the DCC will assign this based on signal value. Ideally average signalValue per base spread between 100-1000.
strand char +/- to denote strand or orientation (whenever applicable). Use '.' if no orientation is assigned.
thickStart int The starting position at which the feature is drawn thickly.
thickEnd int The ending position at which the feature is drawn thickly.
itemRgb string An RGB value of the form R,G,B (e.g. 255,0,0).
blockCount int The number of blocks (exons) in the BED line.
blockSizes string A comma-separated list of the block sizes.
blockStarts string A comma-separated list of block starts.
signalValue float Measurement of overall (usually, average) enrichment for the region.
pValue float Measurement of statistical significance (-log10). Use -1 if no pValue is assigned.
qValue float Measurement of statistical significance using false discovery rate. Use -1 if no qValue is assigned.
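The blockCount, blockSizes, and blockStarts fields together describe the pieces of a gapped region. The Python sketch below expands them into absolute intervals. It assumes the standard UCSC BED12 convention that block starts are offsets relative to chromStart, which this page does not state explicitly; the function name and sample values are hypothetical.

```python
def block_intervals(chrom_start, block_count, block_sizes, block_starts):
    """Expand BED12 block fields into absolute half-open intervals.

    Assumption: blockStarts are offsets from chromStart (UCSC BED12).
    A trailing comma in either list is tolerated.
    """
    sizes = [int(x) for x in block_sizes.rstrip(",").split(",")]
    starts = [int(x) for x in block_starts.rstrip(",").split(",")]
    if len(sizes) != block_count or len(starts) != block_count:
        raise ValueError("blockCount does not match the block lists")
    return [(chrom_start + s, chrom_start + s + n)
            for s, n in zip(starts, sizes)]


# Hypothetical region starting at 1000 with two blocks.
blocks = block_intervals(1000, 2, "100,200,", "0,400,")
print(blocks)
```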


encodePeak Tracks

The narrowPeak, broadPeak, and gappedPeak track types are all part of the encodePeak family of track types. See that page for details on how to create both normal and custom encodePeak tracks.


FastQ Format

This format is used to provide short sequence reads together with quality scores. FastQ is a standard format, and is described in the SourceForge FASTQ Format Specification. Note that there are two methods of specifying the quality scores (Sanger and Solexa). We prefer that you deliver files with the Sanger style quality scores, but we will accept files with Solexa quality scores. Each base and each quality score is represented by a single character, with the punctuation characters '@' and '+' used to separate them; briefly:

@sequence_name
sequence_bases
+
quality_scores
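The four-line record layout can be read with a short loop. This Python sketch assumes every record occupies exactly four lines (no wrapped sequences) and that qualities use the Sanger encoding, in which each character's ASCII code minus 33 is the Phred score. The function names and the sample read are illustrative.

```python
import io


def read_fastq(handle):
    """Yield (name, bases, qualities) from a four-line-per-record stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        bases = handle.readline().rstrip("\n")
        plus = handle.readline()
        quals = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(bases) != len(quals):
            raise ValueError("bases and qualities differ in length")
        yield header[1:].rstrip("\n"), bases, quals


def sanger_scores(quals):
    """Decode Sanger-style quality characters (ASCII offset 33)."""
    return [ord(c) - 33 for c in quals]


# Hypothetical single-read input.
records = list(read_fastq(io.StringIO("@read1\nACGT\n+\nIIII\n")))
print(records, sanger_scores(records[0][2]))
```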

If you are using maq, you can use the following command to convert Solexa to Sanger quality scores:

maq sol2sanger solexa.fastq sanger.fastq
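For readers without maq, the conversion can be sketched in Python. This assumes the commonly used definitions: Solexa quality characters carry an ASCII offset of 64, Sanger characters an offset of 33, and the two scales are related by Q_phred = 10 * log10(10^(Q_solexa / 10) + 1). The function names are hypothetical, and the code is not derived from the maq source.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert one Solexa-scaled score to the nearest Phred score."""
    return int(round(10 * math.log10(10 ** (q_solexa / 10) + 1)))


def sol2sanger_quals(quals):
    """Re-encode a Solexa quality string (offset 64) as Sanger (offset 33)."""
    return "".join(chr(solexa_to_phred(ord(c) - 64) + 33) for c in quals)


# 'h' encodes Solexa 40 and '@' encodes Solexa 0.
converted = sol2sanger_quals("h@")
print(converted)
```

The two scales agree closely at high quality and diverge at low quality, which is why a simple offset shift is not sufficient.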


nameValue Format

(Is this still accepted?)

This format is used to provide (name,value) pairs. Examples of this are Gene Chip and Exon Chip microarray experiments where results comprise (probe,expression) pair data. The format is simply a tab-separated file of (name,value) pairs one pair per line.

name    value
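A file in this layout can be loaded into a dictionary in a few lines. The Python sketch below assumes that values are numeric and that names are unique; the probe identifiers are made up for illustration.

```python
def parse_name_value(lines):
    """Parse tab-separated (name, value) lines into a dict of floats."""
    pairs = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        name, value = line.split("\t")
        pairs[name] = float(value)
    return pairs


# Hypothetical probe identifiers and expression values.
expression = parse_name_value(["probe_001\t7.25\n", "probe_002\t0.5\n"])
print(expression)
```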


BedGraph Format

This format is an alternative to wiggle format, used to provide real-valued data in discrete genomic regions. BedGraph provides higher-resolution output when data is extracted with the Table Browser, but lacks the display efficiency of wiggle format, so it is restricted to smaller datasets.

This format is a UCSC standard described in the BedGraph Track Format specification. The track line is useful for previewing the data as custom tracks in the genome browser, but is not required for data submission.


NRE: Bed6 Format

This format is to be used by Elnitski's lab to submit NRE data (Negative Regulatory Elements: Enhancer Blockers and Silencers). It is the typical UCSC BED format with 6 fields declared for NREs:

field type description
chrom string Name of the chromosome.
chromStart int The starting position of the feature in the chromosome. The first base in a chromosome is numbered 0.
chromEnd int The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
name string Name given to the NRE (preferably unique and expected to include a promoter name).
score int A score between 0 and 1000. For NRE data, these scores will be used to quantify strength.
strand char Defines the strand - either '+' or '-'. For NRE data this will be equivalent to the forward or backward declaration.


BiP: Bed8 Format

This format is to be used by Elnitski's lab to submit BiP data (Bi-directional Promoters). It is the typical UCSC BED format with 8 fields. By using the 7th and 8th fields to declare the promoter position within a larger region, the promoter will be displayed as a thick region between two genes:

field type description
chrom string Name of the chromosome.
chromStart int The starting position of the upstream gene in the chromosome. The first base in a chromosome is numbered 0.
chromEnd int The ending position of the downstream gene in the chromosome. The chromEnd base is not included in the display.
name string Name given to the BiP (must be unique and could include gene names).
score int A score between 0 and 1000.
strand char Not applicable for bi-directional promoters. Therefore, always '.' (period).
promoterStart int The starting position of the bi-directional promoter.
promoterEnd int The ending position of the bi-directional promoter.
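The constraints in the table above lend themselves to a simple pre-submission check. The following Python sketch is a hypothetical validator, not an official DCC tool: it checks the field count, the score range, the fixed '.' strand, and that the promoter lies inside the chromStart..chromEnd span. The gene names and coordinates are invented.

```python
def check_bip(line):
    """Return a list of problems found in one BiP (BED8) line."""
    fields = line.split()
    if len(fields) != 8:
        return [f"expected 8 fields, got {len(fields)}"]
    start, end = int(fields[1]), int(fields[2])
    score, strand = int(fields[4]), fields[5]
    promoter_start, promoter_end = int(fields[6]), int(fields[7])
    problems = []
    if not 0 <= score <= 1000:
        problems.append("score outside 0-1000")
    if strand != ".":
        problems.append("strand must be '.'")
    if not start <= promoter_start <= promoter_end <= end:
        problems.append("promoter not contained in chromStart..chromEnd")
    return problems


# Hypothetical lines: one valid, one with two problems.
good = check_bip("chr1 1000 9000 GENEA_GENEB 500 . 4000 4600")
bad = check_bip("chr1 1000 9000 GENEA_GENEB 500 + 500 4600")
print(good, bad)
```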


pairedInteraction: Paired Interaction Format

This format is used to link chromosomal regions. It is used in conjunction with a BED6 (chrom, start, end, name, score, strand) file containing the regions.

field type description
name1 string Unique name given to a region (from BED6 file)
name2 string Unique name given to a region (from BED6 file)
score int Indicates how dark the interaction will be displayed in the browser (1-1000). Items intended to be visible should be assigned a score >=300. If '0', the DCC will assign this based on signal value.
strand char +/- to denote strand or orientation (whenever applicable). Use '.' if no orientation is assigned.
signalValue float Measurement of overall (usually, average) enrichment for the region.
pValue float Measurement of statistical significance (-log10). Use -1 if no pValue is assigned.
qValue float Measurement of statistical significance using false discovery rate. Use -1 if no qValue is assigned.

e.g.

chr7.1  chr3.2        300       .       182     5.0945  -1 
chr7.2  chr5.8        900       .       91      4.6052  -1
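Since each interaction refers to regions by name, using the format means joining it against the companion BED6 file. The Python sketch below shows one way to do that lookup. The BED6 coordinates are invented for illustration, and the function names are hypothetical.

```python
def load_regions(bed6_lines):
    """Map region name -> (chrom, chromStart, chromEnd) from BED6 lines."""
    regions = {}
    for line in bed6_lines:
        chrom, start, end, name = line.split()[:4]
        regions[name] = (chrom, int(start), int(end))
    return regions


def resolve_interactions(pair_lines, regions):
    """Replace the two region names in each interaction with coordinates."""
    resolved = []
    for line in pair_lines:
        name1, name2, score = line.split()[:3]
        resolved.append((regions[name1], regions[name2], int(score)))
    return resolved


# Hypothetical BED6 coordinates for the regions named in the example above.
bed6 = ["chr7 100 200 chr7.1 0 .", "chr3 500 650 chr3.2 0 ."]
pairs = ["chr7.1  chr3.2        300       .       182     5.0945  -1"]
links = resolve_interactions(pairs, load_regions(bed6))
print(links)
```

A name missing from the BED6 file raises a KeyError here, which is usually the desired behavior when the two files are expected to be consistent.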


gcf: Genomic Coverage File

This file gives the portion of a genome covered by a given assay. For ChIP-chip or other array-based assays, this is the covered portion of the genome. For ChIP-seq, this is the "mappable" portion of the genome. The format is (chromosome, start, stop).


field type description
chr string name of chromosome
start int start coordinates of contiguous covered region
stop int stop coordinates of contiguous covered region

e.g.

chr1   15000   5600000
chr1   5800000 5900000
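A common use of such a file is to total the covered bases. The Python sketch below does this for the two example lines. The page does not state the coordinate convention for gcf files, so the code assumes zero-based half-open intervals as in BED; with closed coordinates each region would be one base longer.

```python
def covered_bases(gcf_lines):
    """Sum region lengths, assuming half-open (start, stop) intervals."""
    total = 0
    for line in gcf_lines:
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields, got {len(fields)}")
        total += int(fields[2]) - int(fields[1])
    return total


total = covered_bases(["chr1   15000   5600000", "chr1   5800000 5900000"])
print(total)
```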


snpcov: SNP Coverage File

chr string name of chromosome
pos int 1-based start coordinate
snp ib string
A cnts unsigned ints A counts by position
C cnts unsigned ints C counts by position
T cnts unsigned ints T counts by position
G cnts unsigned ints G counts by position
N cnts unsigned ints N counts by position
allele string Reference allele
A string Genotype of the "A" Allele
B string Genotype of the "B" Allele
State A P,M,A,H,L,R Paternal, Maternal, Ambiguous, Heterozygous, Left, Right
State B P,M,A,H,L,R Paternal, Maternal, Ambiguous, Heterozygous, Left, Right
Count A unsigned int Counts of Allele A
Count B unsigned int Counts of Allele B
Assay string Assay (eg, H3K4me3)
CellLine string Cell line/Sample
Processing string Processing pipeline used


Data Submission Formats Which Used to be Accepted

tagAlign: BED3+3 Format

Tag Alignment is used to provide genomic mapping of short sequence tags. It is a BED3+3 format. It is no longer accepted for submission.

field type description
chrom string Name of the chromosome
chromStart int The starting position of the feature in the chromosome. The first base in a chromosome is numbered 0.
chromEnd int The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
sequence string Sequence of this read
score int Indicates uniqueness or quality (preferably 1000/alignmentCount).
strand char Orientation of this read (+ or -)

e.g.

chrX 8823384 8823409 AGAAGGAAAATGATGTGAAGACATA 1000 +
chrX 8823387 8823412 TCTTATGTCTTCACATCATTTTCCT 500  -
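The score convention (1000 divided by the number of alignments) and the relation between coordinates and read length can be checked mechanically. In the Python sketch below, integer division is an assumption, since the page only says "preferably 1000/alignmentCount"; the function names are hypothetical.

```python
def tag_align_score(alignment_count):
    """Score for a read that aligns to alignment_count locations."""
    if alignment_count < 1:
        raise ValueError("alignment_count must be at least 1")
    return 1000 // alignment_count  # assumption: round down to an int


def parse_tag_align(line):
    """Parse one tagAlign line and check it against the read length."""
    chrom, start, end, sequence, score, strand = line.split()
    start, end, score = int(start), int(end), int(score)
    if end - start != len(sequence):
        raise ValueError("interval length does not match sequence length")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    return chrom, start, end, sequence, score, strand


# Second example line from above: score 500 corresponds to two alignments.
tag = parse_tag_align("chrX 8823387 8823412 TCTTATGTCTTCACATCATTTTCCT 500  -")
print(tag[4] == tag_align_score(2))
```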

pairedTagAlign: BED6+2 Format

Tag Alignment Format for Paired Reads is used to provide genomic mapping of paired-read short sequence tags. It is a BED6+2 format. It is no longer accepted for submission.

field type description
chrom string Name of the chromosome
chromStart int The starting position of the feature in the chromosome. The first base in a chromosome is numbered 0.
chromEnd int The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
name string Identifier of paired-read
score int Indicates uniqueness or quality (preferably 1000/alignmentCount).
strand char Orientation of this read (+ or -)
seq1 string Sequence of first read
seq2 string Sequence of second read

Wiggle Format

This format is used to provide continuous real-valued data across genomic regions, for example signal or enrichment graphs. This format is a UCSC standard described in the Wiggle Track Format specification. The track line is useful for previewing the data as custom tracks in the genome browser, but is not required for data submission.

There are three variations of wiggle file format: variableStep, fixedStep and BED. Most ENCODE data will be best represented by variableStep format: two-column data that begins with a declaration line, followed by chromosome positions and data values:

 variableStep  chrom=chrN  [span=windowSize]
 chromStartA  dataValueA
 chromStartB  dataValueB
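The variableStep layout above can be read with a small generator. This Python sketch assumes the UCSC convention that wiggle positions are 1-based and that span defaults to 1 when omitted; it converts each data line to a BED-style zero-based half-open interval. The function name and the sample values are illustrative.

```python
def parse_variable_step(lines):
    """Yield (chrom, start, end, value) tuples from variableStep data.

    Output intervals are zero-based and half-open, as in BED.
    """
    chrom, span = None, 1
    for line in lines:
        line = line.strip()
        if not line or line.startswith("track"):
            continue  # skip blank lines and optional track lines
        if line.startswith("variableStep"):
            params = dict(tok.split("=", 1) for tok in line.split()[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            continue
        if chrom is None:
            raise ValueError("data line before variableStep declaration")
        pos, value = line.split()
        start = int(pos) - 1  # wiggle positions are 1-based
        yield chrom, start, start + span, float(value)


# Hypothetical data values.
wig = ["variableStep chrom=chr19 span=150",
       "49304701 10.0",
       "49304901 12.5"]
intervals = list(parse_variable_step(wig))
print(intervals)
```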