NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.

Software resources for ENCODE data management and analysis


Software developed for ENCODE analysis

Program Description Authors Availability Contact
Signal generation
signalmap in uw_utils Remap signal using one of several functions (max, median, mean, min, sum, var)

Input: desired output regions (BED), signal data (BED). Example in example/example.sh shows how to start with tagAlign instead
Output: Signal (BED)
Richard Sandstrom, Scott Kuehn, Bill Noble, Shane Neph encodestatistics.org Richard Sandstrom, sull at u dot washington dot edu
tagalign2rawSignal Generate ERPB or ERPKM rawSignal from tagAlign files using one of several normalizations

Input: tag alignment files (tagAlign)
Output: rawSignal (bedGraph)
Anshul Kundaje encodeftp.cse.ucsc.edu (use global username and password for this wiki) Anshul Kundaje
Signal validation
ACTpy Computes a histogram (fixed bin size or count) of signal intensity around annotated features

Input: signal files, annotations (file types are custom, but described in the header of ACTnew.py)
Output: location and signal level of each bin
Robert Bjornson, Joel Rozowsky, Zhiping Weng, Yutao Fu act.gersteinlab.org Robert Bjornson, bjornson at yale dot edu
ACT correlation Analyzes the correlation between (two signals?)

Input: ???
Output: ????
Robert Bjornson, Joel Rozowsky, Zhiping Weng, Yutao Fu act.gersteinlab.org Joel Rozowsky, joel.rozowsky at yale dot edu
GSA Aggregates a point or segment signal around a set of anchors within a given range

Input: list of anchors, signal file, genomic scope (all can be in BED, GFF, or simple coordinate formats)
Output: clickable GnuPlot/R aggregation plots, textual aggregation signal output
Joel Rozowsky, Zhiping Weng, Yutao Fu http://zlab.bu.edu/GSA/ Yutao Fu, bibin at bu dot edu
Peak calling
QuEST Quantitative Enrichment of Sequence Tags.
Input: ???
Output: ???
Arend Sidow's Lab, Stanford http://mendel.stanford.edu/sidowlab/downloads/quest/  ???
Segmentation
Segway A way to segment the genome: a dynamic Bayesian network approach to segmentation

Input: signal tracks and sequence (genomedata)
Output: Segmentation (BED), Posterior (wigFix), model (GMTKL), discovered parameters (GMTK input master)
Michael M. Hoffman and Bill Noble E-mail mmh1 at washington dot edu Michael Hoffman, mmh1 at washington dot edu
segway-layer Transform more conventional segmentations with one segment per BED row into a layered thick/thin representation with one BED row per (chromosome, segment label) pair

Input: Segmentation (BED)
Output: Segmentation thick/thin (BED)
Michael M. Hoffman and Bill Noble E-mail mmh1 at washington dot edu Michael Hoffman, mmh1 at washington dot edu
Segmentation validation
segtools External validation of segmentation. Calculates and plots distribution of segment lengths, histograms of signal values found in various segments, nucleotide and dinucleotide frequency stratified by segments, overlap with TSSes

Input: segmentation (BED), signal tracks and sequence (genomedata), feature file for TSSes (GFF)
Output: report (HTML), tabulations (tab-delimited), plots (PNG, PDF)
Michael M. Hoffman, Orion Buske, Mirela Andronescu, and Bill Noble encodestatistics.org Orion Buske, orion.buske at gmail.com
startend Computes fold enrichments of states relative to the distance to the nearest transcription start or end site

Input: posterior distributions by chromosome, RefSeq gene table
Output: table of fold enrichments relative to nearest TSS, table of fold enrichments relative to nearest TES
Jason Ernst encodestatistics.org Jason Ernst, jernst at mit dot edu
Generic feature file tools: BED
Genome Structure Correction (GSC) significance of overlap (bases or regions) between two sets of segments over a given domain

Input: 2 BED files to compare and a GCF file defining the domain of the analysis
Output: p-value and z-score
Nathan Boley encodestatistics.org Nathan Boley, npboley at gmail dot com
bedmergesort in uw_utils sort your largest bed-like input file, first by chromosome (lexicographically), then by the start coordinate

Input: BED
Output: BED
Richard Sandstrom encodestatistics.org Richard Sandstrom, sull at u dot washington dot edu
feat-dist in uw_utils compare the relative locations of two genomic features, calculating the numerical distance between two different segments

Input: master feature file (BED), comparator file (BED)
Output: integers (newline-delimited text)
Richard Sandstrom, Bill Noble, Scott Kuehn, Shane Neph encodestatistics.org Richard Sandstrom, sull at u dot washington dot edu
Setops in uw_utils perform set operations on BED files (complement, difference, element of, intersection, merge, not element of, symmetric difference, union all)

Input: BED
Output: BED
Richard Sandstrom, Scott Kuehn, Bill Noble encodestatistics.org Richard Sandstrom, sull at u dot washington dot edu
jarch in uw_utils compression utility optimized for BED-like data

Input: 3 or 5-column BED file
Output: compressed jarch file for use with gchr
Richard Sandstrom, Scott Kuehn, Shane Neph encodestatistics.org Richard Sandstrom, sull at u dot washington dot edu
gchr in uw_utils retrieves all data or data for a specific chromosome (efficiently) from jarch file

Input: jarch file
Output: uncompressed BED file data
Richard Sandstrom, Scott Kuehn, Shane Neph encodestatistics.org Richard Sandstrom, sull at u dot washington dot edu
Generic feature file tools: GFF
Overlap Computes the overlap between two sets of genomic features.
Input: two GFF files
Output: one GFF file
Sarah Djebali, Roderic Guigó http://genome.crg.es/software/index.php#overlap Sarah Djebali, sarah dot djebali at crg dot es
Project Project a set of genomic features onto their genomic sequences.
Input: one GFF file
Output: one GFF file
Sarah Djebali, Roderic Guigó http://genome.crg.es/software/index.php#project Sarah Djebali, sarah dot djebali at crg dot es
Database
genomedata efficient storage of numeric data defined in multiple tracks on a genomic scale, allowing fast random access to hundreds of gigabytes of data, while retaining a small disk space footprint

Input: rawSignal data (wigVar, wigFix, bedGraph), sequence (FASTA)
Output: signal tracks and sequence (genomedata)
Michael M. Hoffman and Bill Noble E-mail mmh1 at washington dot edu Michael Hoffman, mmh1 at washington dot edu
Pipeline
HTS-Workflow An Integrated LIMSystem and data analysis pipeline for high throughput sequencing experiments Myers and Wold Groups http://htsworkflow.caltech.edu/  ???
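
To make the BED-oriented rows in the table above more concrete, here is a minimal Python sketch of two of the described operations: sorting BED records by chromosome (lexicographically) and then by start coordinate, as in the bedmergesort description, and reducing signal values over target regions with a chosen function, as in the signalmap description. This is not the uw_utils code. The function names (parse_bed, sort_bed, remap_signal), the assumption that the signal score sits in the fifth BED column, the half-open overlap rule, and the use of population variance for "var" are all assumptions made for illustration.

```python
import statistics

# Reducers named in the signalmap row. How the real tool computes each one
# (for example sample vs. population variance) is not stated; this is a guess.
REDUCERS = {
    "max": max,
    "min": min,
    "sum": sum,
    "mean": statistics.mean,
    "median": statistics.median,
    "var": statistics.pvariance,
}


def parse_bed(lines):
    """Parse whitespace-delimited BED lines into (chrom, start, end, extra)."""
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip blanks, comments, and browser/track header lines
        fields = line.split()
        records.append((fields[0], int(fields[1]), int(fields[2]), fields[3:]))
    return records


def sort_bed(records):
    """Sort by chromosome name (lexicographic), then by start coordinate."""
    return sorted(records, key=lambda rec: (rec[0], rec[1]))


def remap_signal(regions, signal, how="mean"):
    """Reduce the scores of signal elements overlapping each region.

    Assumes signal records carry a name in column 4 and a score in column 5.
    Regions with no overlapping signal get None.
    """
    reduce_fn = REDUCERS[how]
    remapped = []
    for chrom, start, end, _ in regions:
        scores = [
            float(extra[1])
            for s_chrom, s_start, s_end, extra in signal
            if s_chrom == chrom and s_start < end and s_end > start
        ]
        value = reduce_fn(scores) if scores else None
        remapped.append((chrom, start, end, value))
    return remapped


if __name__ == "__main__":
    regions = parse_bed(["chr1 0 100 r1", "chr1 100 200 r2"])
    signal = parse_bed(["chr1 10 20 a 2.0", "chr1 50 60 b 4.0", "chr1 150 160 c 7.0"])
    for row in remap_signal(sort_bed(regions), signal, how="mean"):
        print(*row, sep="\t")
```

Note that lexicographic chromosome ordering places chr10 before chr2. The linear scan over the signal is quadratic in the worst case; a tool meant for large files would presumably rely on sorted input instead.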

Software used by production groups

The various labs currently use a number of alignment programs for high-throughput sequencing, and several different algorithms for calling peaks or sites. In addition, labs within the ENCODE consortium have developed tools to verify data and provide some measure of quality control for the sequencing pipeline. The Data Analysis Center (DAC) has been asked to assess whether standardizing is appropriate in some cases. Please take a moment to update your software resources and describe them below. Feel free to be verbose in the table below if several techniques and parameters are used for different experiments.

Lab Aligner (Dups) Peak Calling Data Verification/QC Other Software of Interest
Bernstein (Broad) MAQ
Crawford (Duke) MAQ (<=10 non-unique) Fseq (stdDev)
Gingeras (Affy/CSHL) Overlap, Project
Hubbard (Sanger) Otterlace (manual annotation), AnnoTrack (data tracking & integration)
Myers (Stanford) QuEST HTS-Workflow
Stam (UW) Eland (Unique; <=2 mismatches) Hotspot
Snyder (Yale) Eland (Unique; <=2 mismatches) PeakSeq
Pilot region projects
Dekker (UMass)
Elnitski (NHGRI)
Margulies (NHGRI)
Tenenbaum (SUNY)
Weng (BU/UMass)


Aligners

Several sequence alignment programs have been used by the ENCODE labs. One parameter of great importance is the handling of non-unique alignments. For some applications, unique alignment may be considered essential; for others, reads that align to a few or even many places in the genome may be acceptable.
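
As a rough illustration of the unique versus non-unique distinction described above, the following Python sketch filters reads by the number of genomic locations they align to. It does not reflect how any of the listed aligners work internally; the function name filter_by_multiplicity, the max_hits parameter, and the (read_id, chrom, position) tuple layout are hypothetical choices made for this example.

```python
from collections import defaultdict


def filter_by_multiplicity(alignments, max_hits=1):
    """Keep reads that align to at most max_hits genomic locations.

    alignments: iterable of (read_id, chrom, position) tuples, one per
    reported alignment (a hypothetical layout, not a real aligner format).
    max_hits=1 keeps uniquely aligned reads only; a larger value also keeps
    reads that align to a few places.
    """
    hits = defaultdict(list)
    for read_id, chrom, position in alignments:
        hits[read_id].append((chrom, position))
    return {rid: locs for rid, locs in hits.items() if len(locs) <= max_hits}


if __name__ == "__main__":
    example = [
        ("read1", "chr1", 100),
        ("read2", "chr1", 200),
        ("read2", "chr5", 300),
        ("read3", "chr2", 50),
        ("read3", "chr2", 900),
        ("read3", "chrX", 10),
    ]
    print(sorted(filter_by_multiplicity(example, max_hits=1)))
    print(sorted(filter_by_multiplicity(example, max_hits=2)))
```

With max_hits=1 only uniquely aligned reads survive. Raising the threshold corresponds to the more permissive policies, where reads aligning to a bounded number of places are retained.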

Program Description Handles Non-Unique Availability
Arachne (Broad) Whole-Genome Shotgun Assembler Y http://www.broad.mit.edu/wga/
CalTech to be described
Crossmatch Developed by Phil Green From University of Washington Genome Center
ELAND Illumina(Solexa) Aligner
Solexa_Pipeline_User_Guide.pdf
N
MAQ Standard Aligner used by 1000 Genomes.
Developed by Durbin's Lab, Sanger
Y Source Forge