NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.

Elements


Introduction

Previous analysis page

Cat-herder - Joel Rozowsky

Goal: Compare ChIP-seq peak-caller algorithms so that a set of binding sites (i.e. elements) can be generated uniformly for data sets from different labs.

Conference Calls

Agenda and Minutes 2011
Date Subgroup Agenda Minutes Other Docs
Thurs 2011-08-11 11am-noon ET Elements / Networks Xianjun Dong - Slides (pdf)
Thurs 2011-08-04 11am-noon ET Elements / Networks 2011-08-04 Agenda Chao Cheng - Slides (pdf)
Thurs 2011-07-21 11am-noon ET Elements / Networks 2011-07-21 Agenda Yong Chen - Slides (rename ppt to pdf!)
Thurs 2011-07-14 11am-noon ET Elements / Networks 2011-07-14 Minutes Jiali Motif positional preferences (slides)

http://encodewiki.ucsc.edu/EncodeDCC/images/d/dc/Motif_dist.pdf


Thurs 2011-06-30 11am-noon ET Elements / Networks 2011-06-30 Agenda 2011-06-30 Minutes Kevin track comparisons (slides)

Kevin track comparisons (supplementary spreadsheet)

Xianjun_enhancer comparisons (PDF)

Koon-Kiu PPI between TFs update

Thurs 2011-06-09 11am-noon ET Elements / Networks 2011-06-09 Agenda 2011-06-09 Minutes Note on Encode naming

Kevin Element Tracks version 1.1 Track feature comparisons (GM12878)

Thurs 2011-05-19 11am-noon ET Elements / Networks 2011-05-19 Agenda 2011-05-19 Minutes Nitin/Kevin GSC updates
Thurs 2011-05-05 11am-noon ET Elements / Networks 2011-05-05 Agenda 2011-05-05 Minutes Anshul/Manoj-TF Associations

Nitin/Kevin GSC updates

Thurs 2011-04-21 11am-noon ET Elements / Networks 2011-04-21 Agenda 2011-04-21 Minutes
Thurs 2011-04-14 11am-noon ET Elements / Networks 2011-04-14 Agenda 2011-04-14 Minutes
Thurs 2011-04-07 11am-noon ET Elements / Networks 2011-04-07 Agenda Discovering Cooperative Transcription Factor Binding using the ABC Test and ALPHABIT Pipeline (Konrad Karczewski)
Thurs 2011-03-24 11am-noon EDT Elements / Networks 2011-03-24 Minutes
Thurs 2011-03-17 11am-noon EDT Elements / Networks 2011-03-17 Agenda 2011-03-17 Minutes
Thurs 2011-02-24 11am-noon EST Elements / Networks 2011-02-24 Agenda TF-combinatorics (Manoj/Anshul)
Thurs 2011-01-26 11am-noon EST Elements / Networks 2011-01-26 Agenda Integrating miRNAs into regulatory networks (Koon-Kiu Yan)
Thurs 2011-01-13 11am-noon EST Elements / Networks 2011-01-13 Minutes

Files for July 2010 Meeting (Barcelona)

  • Motif Results (Pouya Kheradpour) based on Jan 2010 freeze

Motif Results Motifs.txt.gz

  • SPP Uniform Peak Calls on July 2010 Freeze data

Genome version: hg19/GRCh37

The peak calls and related information can be downloaded from

ftp://encodeftp.cse.ucsc.edu/users/akundaje/rawdata/peaks/jul2010/idr0_02

Username: encode Password: human

The cutoffs on the number of peaks selected for each dataset at various IDR thresholds are listed in this google doc. Datasets with more than 2 replicates have multiple cutoffs (separated by ';'), one for each pairwise comparison of replicates. The maximum is used as the final cutoff for each dataset. More details below.

DIRECTORIES AND FILES

narrowPeak/ : contains all IDR (0.02) corrected narrowPeak format peak files

idrplots/ : contains all the IDR-based diagnostic plots

jul2010.numPeaks.IDR.xlsx : Table containing the cutoff on the number of peaks for each dataset at various IDR thresholds. Datasets with more than 2 replicates have multiple cutoffs (separated by ';'), one for each pairwise comparison of replicates. The maximum is used as the final cutoff for each dataset. More details in the Procedure section.
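As an illustration of how a multi-valued cutoff cell could be reduced to the final cutoff, here is a minimal Python sketch. The function name `final_cutoff` and the sample cell value are hypothetical; only the ';' separator and the "take the maximum" rule come from the description above.

```python
def final_cutoff(cell: str) -> int:
    """Return the largest of the ';'-separated pairwise cutoffs in one table cell.

    A dataset with exactly two replicates has a single value; datasets with
    more replicates have one value per pairwise comparison.
    """
    values = [int(v) for v in cell.split(";") if v.strip()]
    if not values:
        raise ValueError("no cutoff values found in cell")
    return max(values)


if __name__ == "__main__":
    # Hypothetical cell for a dataset with three replicates (three pairs).
    print(final_cutoff("1200; 1500;900"))
```

Reading the spreadsheet itself is left out on purpose, since the exact column layout of the table is not described here.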

PROCEDURE

The SPP peak caller was used.

1. All TF ChIP-seq datasets were paired with their corresponding 'Control' datasets.

2. Peaks were called on all TF ChIP-seq replicate datasets using a relaxed FDR threshold of 0.7. Control replicates are pooled together but TF ChIP-seq replicates are NOT pooled at this step. We refer to these as ReplicatePeakCall.

3. Aligned reads from all replicates for a particular TF ChIP-seq experiment were then pooled. Peaks were called on the pooled ChIP-seq data with respect to the pooled Control data, again using an FDR threshold of 0.7. We refer to these as PooledPeakCall.

So now for each unique TF ChIP-seq dataset, we have one PooledPeakCall file and two or more ReplicatePeakCall files (depending on the number of replicates).

4. We perform IDR/consistency analysis on all pairs of ReplicatePeakCall files that correspond to each PooledPeakCall file. Consistency is evaluated in terms of the rank of the peaks and their reproducibility. A copula mixture model is fitted to pairs of replicates. The method was developed by Qunhua Li (qli@stat.berkeley.edu) in Peter Bickel's group at Berkeley.

5. For each pairwise comparison of ReplicatePeakCall files, we obtain the number of peaks that pass an IDR threshold of 0.02. We refer to this as PairwiseNumPeakCutoff.

6. For each PooledPeakCall file, we obtain the largest of all the corresponding PairwiseNumPeakCutoff thresholds (i.e. the cutoff based on the most consistent pair of replicates). We refer to this as MaxPairwiseNumPeakCutoff. We use this cutoff to trim the PooledPeakCall file i.e. we keep the top N peaks where N = MaxPairwiseNumPeakCutoff.

10. If MaxPairwiseNumPeakCutoff < 100 for a particular PooledPeakCall file, it generally means that the dataset has very low signal-to-noise enrichment. This happens in a few cases (~6 datasets). In such cases, the IDR-based threshold can be too conservative. To squeeze the most signal out of these datasets, we instead apply a threshold on signal enrichment: we keep only peaks with signal fold-enrichment > 25. This is based on the observation that, for a large fraction of 'good' datasets, the IDR-based threshold tends to be equivalent to a signal fold-enrichment threshold of ~25.

11. For datasets with no replicate data, we use the signal fold-enrichment cutoff of 25 to trim the peak call file. There are 20 such datasets.
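The selection logic in the steps above (maximum pairwise cutoff, top-N trimming, and the fold-enrichment fallback) can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the `Peak` record, its `rank_score` field, and the function name are hypothetical, and the IDR analysis itself is assumed to have already produced the pairwise cutoffs.

```python
from typing import List, NamedTuple, Sequence


class Peak(NamedTuple):
    name: str
    rank_score: float       # hypothetical score used to rank pooled peaks
    fold_enrichment: float  # signal fold-enrichment of the peak


MIN_IDR_PEAKS = 100            # below this, the IDR cutoff is considered too conservative
FOLD_ENRICHMENT_CUTOFF = 25.0  # fallback threshold on signal fold-enrichment


def trim_pooled_peaks(pooled: Sequence[Peak],
                      pairwise_cutoffs: Sequence[int]) -> List[Peak]:
    """Trim a PooledPeakCall list.

    pairwise_cutoffs holds one PairwiseNumPeakCutoff per replicate pair;
    it is empty for datasets without replicates.
    """
    by_enrichment = [p for p in pooled
                     if p.fold_enrichment > FOLD_ENRICHMENT_CUTOFF]
    if not pairwise_cutoffs:
        # No replicate data: use the fold-enrichment cutoff only.
        return by_enrichment
    max_cutoff = max(pairwise_cutoffs)  # MaxPairwiseNumPeakCutoff
    if max_cutoff < MIN_IDR_PEAKS:
        # Very low signal: fall back to the fold-enrichment cutoff.
        return by_enrichment
    ranked = sorted(pooled, key=lambda p: p.rank_score, reverse=True)
    return ranked[:max_cutoff]  # keep the top N peaks
```

For example, a dataset with pairwise cutoffs of 120 and 150 would keep the top 150 pooled peaks, while one with cutoffs of 50 and 80 would be filtered by fold-enrichment instead.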

COMMENTS:

- In earlier versions of the IDR pipeline, we transferred the model learned on pairwise replicate comparisons to the pooled data. That step relied on assumptions that do not really hold; it was also slow, highly unstable, and extremely sensitive to the number of peaks called on the pooled data. It has now been eliminated.

- The new procedure is very robust to the peak-calling threshold used (i.e. the FDR or signal-enrichment threshold). The only requirement is that peaks be called with a very relaxed threshold, so that the initial list is large and misses few true peaks (few false negatives).

- ONE THRESHOLD TO RULE THEM ALL! There is only one threshold to set for all datasets, and that is a single IDR threshold. I tested several IDR thresholds ranging from 0.01 to 0.05 and found 0.02 to give the best trade-off between false positives and false negatives over a wide range of data quality. The definitions of false positives and false negatives are based on manual inspection of signal profiles, the distribution of reads around the peak summit, and other peak characteristics for a wide range of datasets.

Files for March 2010 Meeting

  • GENCODE Promoters split by CpG (Anshul Kundaje)
    • README
    • GFF file (Column 6 contains the CpG score)
    • Distribution of CpG scores. Note the bimodal distribution
    • Regions with scores > 0.4 can be considered High CpG promoters. This is based on the bimodal distribution of CpG ratios [[1]]
  • DNase element coverage (Steven Wilder)
    • File of single linkage elements (split at 5kb) created from all Duke and UW DNase data from the January 2010 freeze. The union of samples from the same cell line within and across labs was taken. Each line shows the region, the count of cell lines covered, a bit string representing the sum of bits representing cell lines (see key), and the integer representing this binary number in decimal.
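To illustrate how the bit encoding of cell lines in the DNase element file could be read back, here is a small Python sketch. The key mapping in `CELL_LINE_BITS` is hypothetical (the real key is the one linked from the file description), as are the function names; the sketch only assumes that each cell line owns one bit and that the decimal integer is the sum of those bits.

```python
from typing import Dict, List

# Hypothetical key: each cell line is assigned one bit (a power of two).
CELL_LINE_BITS: Dict[str, int] = {
    "GM12878": 1,
    "K562": 2,
    "HepG2": 4,
    "HeLa-S3": 8,
}


def decode_cell_lines(mask: int,
                      key: Dict[str, int] = CELL_LINE_BITS) -> List[str]:
    """List the cell lines whose bit is set in the decimal integer."""
    return [name for name, bit in key.items() if mask & bit]


def is_consistent(count: int, bit_string: str, mask: int) -> bool:
    """Check that the count, bit string and decimal integer of a line agree."""
    return int(bit_string, 2) == mask and bin(mask).count("1") == count


if __name__ == "__main__":
    # 5 = 1 + 4, so the element is covered in two of the hypothetical cell lines.
    print(decode_cell_lines(5))
    print(is_consistent(2, "0101", 5))
```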

Uniformly Aligned ChIP-Seq/DNase/FAIRE Sequence Reads

Use MAQ to align data sets (Settings -C 10 -e 70 -n 2)

K562 DNase1 (DNase) [Crawford - Duke]

Description File Alignment
DNase1 rep 1 ftp://encodeftp.cse.ucsc.edu/freeze/2009_02/Elements/wgEncodeDukeDNaseSeqAlignmentsRep1K562.tagAlign.gz MAQ
DNase1 rep 2 ftp://encodeftp.cse.ucsc.edu/freeze/2009_02/Elements/wgEncodeDukeDNaseSeqAlignmentsRep2K562.tagAlign.gz MAQ
DNase1 rep 3 ftp://encodeftp.cse.ucsc.edu/freeze/2009_02/Elements/wgEncodeDukeDNaseSeqAlignmentsRep3K562.tagAlign.gz MAQ
Input ftp://encodeftp.cse.ucsc.edu/freeze/2009_02/Elements/wgEncodeUtaustinChIPseqAlignmentsK562Input.tagAlign.gz MAQ

K562 CTCF (Sequence Specific) [Bernstein - Broad]

Description File Alignment
CTCF rep 1 ftp://encodeftp.cse.ucsc.edu/freeze/2009_02/Elements/ctcf_k562_rep1.tagAlign.gz MAQ
CTCF rep 2 ftp://encodeftp.cse.ucsc.edu/freeze/2009_02/Elements/ctcf_k562_rep2.tagAlign.gz MAQ
Input rep 1 ftp://encodeftp.cse.ucsc.edu/freeze/2009_02/Elements/k562_rep1.tagAlign.gz MAQ
Input rep 2 ftp://encodeftp.cse.ucsc.edu/freeze/2009_02/Elements/k562_rep2.tagAlign.gz MAQ
CTCF rep 1 http://genome-test.cse.ucsc.edu/goldenPath/hg18/wgEncodeBroadChipSeq/wgEncodeBroadChipSeqAlignmentsRep1K562Ctcf.tagAlign.gz Submitted
CTCF rep 2 http://genome-test.cse.ucsc.edu/goldenPath/hg18/wgEncodeBroadChipSeq/wgEncodeBroadChipSeqAlignmentsRep2K562Ctcf.tagAlign.gz Submitted
Input rep 1 http://genome-test.cse.ucsc.edu/goldenPath/hg18/wgEncodeBroadChipSeq/wgEncodeBroadChipSeqAlignmentsRep1K562Control.tagAlign.gz Submitted
Input rep 2 http://genome-test.cse.ucsc.edu/goldenPath/hg18/wgEncodeBroadChipSeq/wgEncodeBroadChipSeqAlignmentsRep2K562Control.tagAlign.gz Submitted

See ftp://encodeftp.cse.ucsc.edu/freeze/2009_02/Elements/Methods.pdf for additional notes.

K562 Pol II (Localized Region Binding) [Snyder - Yale]

Description File Alignment
Pol II rep 1 http://archive.gersteinlab.org/proj/ENCODE/ELEMENTS/MAQ_Aligned_Reads/K562_PolII/K562_PolII_Rep1.tagAlign.gz MAQ
Pol II rep 2 http://archive.gersteinlab.org/proj/ENCODE/ELEMENTS/MAQ_Aligned_Reads/K562_PolII/K562_PolII_Rep2.tagAlign.gz MAQ
Input http://archive.gersteinlab.org/proj/ENCODE/ELEMENTS/MAQ_Aligned_Reads/K562_PolII/K562_InputDNA.tagAlign.gz MAQ
Pol II rep 1 http://genome-test.cse.ucsc.edu/goldenPath/hg18/wgEncodeYaleChIPseq/wgEncodeYaleChIPseqAlignmentsRep1K562Pol2.tagAlign.gz Submitted (Eland)
Pol II rep 2 http://genome-test.cse.ucsc.edu/goldenPath/hg18/wgEncodeYaleChIPseq/wgEncodeYaleChIPseqAlignmentsRep2K562Pol2.tagAlign.gz Submitted (Eland)
Input http://genome-test.cse.ucsc.edu/goldenPath/hg18/wgEncodeYaleChIPseq/wgEncodeYaleChIPseqAlignmentsK562Input.tagAlign.gz Submitted (Eland)


Submitted Peak-Caller Results

  • DNase 1 Scoring (K562 DNase1)
Description File Parameters
Hotspot (Thurman) Media:Hotspot.DNaseI.tgz Default
F-Seq (Boyle) Media:Fseq.DNaseHS.tgz -b bff_20/ -f 0 -o peaks/ -of npf -p ploidy/ -v -t 8 sequence.final.bed


  • Sequence Specific ChIP-Seq Factor (K562 CTCF)

Results should be in the UCSC narrowPeak format (i.e. giving the nucleotide position of each "peak").
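For readers unfamiliar with narrowPeak, the following Python sketch parses one line of the ten-column, tab-separated format and recovers the nucleotide position of the peak summit. The class and function names and the sample line are illustrative assumptions; the column order follows the UCSC narrowPeak definition, in which the last column is the summit offset from chromStart (or -1 if no summit is called).

```python
from typing import NamedTuple, Optional


class NarrowPeak(NamedTuple):
    chrom: str
    start: int          # chromStart (0-based)
    end: int            # chromEnd
    name: str
    score: int
    strand: str
    signal_value: float
    p_value: float      # -log10 p-value, or -1 if not assigned
    q_value: float      # -log10 q-value, or -1 if not assigned
    peak_offset: int    # summit offset from start, or -1 if not called


def parse_narrowpeak(line: str) -> NarrowPeak:
    """Parse one tab-separated narrowPeak line."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 10:
        raise ValueError("narrowPeak lines have 10 columns, got %d" % len(f))
    return NarrowPeak(f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5],
                      float(f[6]), float(f[7]), float(f[8]), int(f[9]))


def summit_position(peak: NarrowPeak) -> Optional[int]:
    """Return the 0-based summit coordinate, or None if no summit was called."""
    if peak.peak_offset < 0:
        return None
    return peak.start + peak.peak_offset


if __name__ == "__main__":
    # Made-up example line, for illustration only.
    example = "chr1\t1000\t1200\tpeak1\t850\t.\t32.5\t12.1\t-1\t87"
    print(summit_position(parse_narrowpeak(example)))
```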

Description File Parameters
Hotspot (Thurman) Media:Hotspot.CTCF.tgz Media:Hotspot.CTCF.150bp.merge.tgz Default. First archive contains 1bp peak calls. Second contains 150bp peaks, merged if overlapping.
F-Seq (Boyle) Media:Fseq.CTCF.tgz -b bff_35/ -l 300 -o peaks/ -of npf -p ploidy/ -v -t 16 sequence.final.bed
MACS (Pepke) Media:macs.ctcf_k562.tgz Tag size 36 for Rep1 and combined. Tag size 51 for Rep2.
PeakSeq (Rozowsky) Media:PeakSeq_CTCF.tar.gz Default
CisGenome (Muratet) Media:cisgenome.ctcf_k562.tgz

First round summary: hts_windowsummaryv2_2sample -w 100

First round peak detection: hts_peakdetectorv2_2sample -w 100 -s 25 -c 0.1 -m 17 -p 0.288865 -ssf 1 -cf 15 -cr 15 -br 1 -brl 30

Read shift (the median length of the top 5000 first-round peaks is 132, so reads are shifted by 66 bp): hts_alnshift2bar -s 66

Second round summary: hts_windowsummaryv2_2sample -w 100 -z 1

Second round peak detection: hts_peakdetectorv2_2sample -w 100 -s 25 -c 0.1 -m 18 -p 0.282769 -ssf 1 -cf 15 -cr 15 -br 1 -brl 30 -z 1 -fc 2 -tfc 2

Erange (Pepke) Media:erange.ctcf_k562.tgz Default.
QuEST (Pepke) Media:quest.ctcf_k562.tgz Script configure (generate_QuEST_parameters.pl).
SISSRS (Muratet) Media:sissrs.ctcf_k562.tgz -s 3080436051 -D 0.001 -e 10 -p 0.001 -m 0.8 -w 20 -E 4 -r -u
SPP (Park) Media:Spp.ctcf.tar.gz fdr=0.001, method=tag.lwcc, mle.filter=T, min.dist=300, tec.z=100, whs=325


  • Localized Region Binding (K562 Pol II)

Results should be in the UCSC broadPeak format (i.e. giving the localized region, less than a couple of kb, where binding occurs).
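As a rough illustration of the "localized region" requirement, the Python sketch below reads the region from a nine-column, tab-separated broadPeak line and checks its width. The 2000 bp limit in `MAX_REGION_BP` is only an assumed reading of "a couple of kb", and the function names and sample line are hypothetical.

```python
from typing import Tuple

MAX_REGION_BP = 2000  # assumed interpretation of "a couple of kb"


def broadpeak_region(line: str) -> Tuple[str, int, int]:
    """Return (chrom, start, end) from one tab-separated broadPeak line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("broadPeak lines have 9 columns, got %d" % len(fields))
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    if end <= start:
        raise ValueError("region end must be greater than start")
    return chrom, start, end


def is_localized(line: str, max_bp: int = MAX_REGION_BP) -> bool:
    """True if the region is narrower than max_bp base pairs."""
    _, start, end = broadpeak_region(line)
    return (end - start) < max_bp


if __name__ == "__main__":
    # Made-up example line, for illustration only.
    print(is_localized("chr1\t5000\t6200\tregion1\t500\t.\t10.0\t5.0\t-1"))
```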

Description File Parameters
Hotspot (Thurman) Media:Hotspot.PolII.tgz Default
F-Seq (Boyle) Media:Fseq.PolII.tgz -b bff_35/ -l 800 -o peaks/ -of npf -p ploidy/ -v -t 8 sequence.final.bed
MACS (Pepke) Media:Macs.polII_k562.tar.gz Tagsize=27.
PeakSeq (Rozowsky) Media:PeakSeq_PolII.tar.gz Default
CisGenome (Muratet) Media:cisgenome.pol2_k562.tgz hts_windowsummaryv2_2sample -w 100

hts_peakdetectorv2_2sample -w 100 -s 25 -c 0.1 -m 6 -p 0.374436 -ssf 1 -cf 5 -cr 5

hts_alnshift2bar -s 27

hts_windowsummaryv2_2sample -w 100 -z 1

hts_peakdetectorv2_2sample -w 100 -s 25 -c 0.1 -m 6 -p 0.374609 -g 100 -ssf 1 -cf 5 -cr 5 -z 1 -fc 2 -tfc 2

Erange (Pepke) Media:erange.polII_k562.tar.gz -nodirectionality -notrim
QuEST (Pepke) Media:quest.polII_k562.tar.gz bandwidth=100, calc_window=1000
SISSRS (Muratet) Media:sissrs.pol2_k562.tgz -s 3080436051 -D 0.001 -e 10 -p 0.001 -m 0.8 -w 20 -E 4 -r -u
Broad Segmenter (Mikkelsen)

Physical interactions between ENCODE Transcription factors

  • Integration of PPI data to ENCODE TFs
Description File
Interactions from BIOGRID File:Interacting TFs BIOGRID.txt
Interactions from Ravasi et al., Cell 2010 File:Interacting TFs Ravasi Cell2010.txt

Proposed Revised Timeline

  • Prepare Aligned Sequences - Wed 28 Jan
  • Lists of Peaks Called - Monday 13 Feb
  • Comparison Presentation on AWG Call - TBD

Results

Space for listing peak caller comparison concerns

Blacklist

Blacklist of problematic genomic regions

Links

People