NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.
- 1 Introduction
- 2 Proposed Revised Timeline
- 3 Results
- 4 Blacklist
- 5 Links
- 6 People
Cat-herder - Joel Rozowsky
Goal: Compare ChIP-seq peak-caller algorithms so that a uniform set of binding sites (i.e. elements) can be generated for datasets from different labs.
Conference Calls
|Agenda and Minutes 2011|
|Thurs 2011-08-11 11am-noon ET||Elements / Networks||Xianjun Dong - Slides (pdf)|
|Thurs 2011-08-04 11am-noon ET||Elements / Networks||2011-08-04 Agenda||Chao Cheng - Slides (pdf)|
|Thurs 2011-07-21 11am-noon ET||Elements / Networks||2011-07-21 Agenda||Yong Chen - Slides (rename ppt to pdf!)|
|Thurs 2011-07-14 11am-noon ET||Elements / Networks||2011-07-14 Minutes||Jiali Motif positional preferences (slides)|
|Thurs 2011-06-30 11am-noon ET||Elements / Networks||2011-06-30 Agenda||2011-06-30 Minutes||Kevin track comparisons (slides)|
|Thurs 2011-06-09 11am-noon ET||Elements / Networks||2011-06-09 Agenda||2011-06-09 Minutes||Note on Encode naming|
|Thurs 2011-05-19 11am-noon ET||Elements / Networks||2011-05-19 Agenda||2011-05-19 Minutes||Nitin/Kevin GSC updates|
|Thurs 2011-05-05 11am-noon ET||Elements / Networks||2011-05-05 Agenda||2011-05-05 Minutes||Anshul/Manoj-TF Associations|
|Thurs 2011-04-21 11am-noon ET||Elements / Networks||2011-04-21 Agenda||2011-04-21 Minutes|
|Thurs 2011-04-14 11am-noon ET||Elements / Networks||2011-04-14 Agenda||2011-04-14 Minutes|
|Thurs 2011-04-07 11am-noon ET||Elements / Networks||2011-04-07 Agenda||Discovering Cooperative Transcription Factor Binding using the ABC Test and ALPHABIT Pipeline (Konrad Karczewski)|
|Thurs 2011-03-24 11am-noon EDT||Elements / Networks||2011-03-24 Minutes|
|Thurs 2011-03-17 11am-noon EDT||Elements / Networks||2011-03-17 Agenda||2011-03-17 Minutes|
|Thurs 2011-02-24 11am-noon EST||Elements / Networks||2011-02-24 Agenda||TF-combinatorics (Manoj/Anshul)|
|Thurs 2011-01-26 11am-noon EST||Elements / Networks||2011-01-26 Agenda||Integrating miRNAs into regulatory networks (Koon-Kiu Yan)|
|Thurs 2011-01-13 11am-noon EST||Elements / Networks||2011-01-13 Minutes|
Files for July 2010 Meeting (Barcelona)
- Motif Results (Pouya Kheradpour) based on Jan 2010 freeze
Motif Results Motifs.txt.gz
- SPP Uniform Peak Calls on July 2010 Freeze data
Genome version: hg19/GRCh37
The peak calls and related information can be downloaded from
Username: encode Password: human
The cutoffs on the number of peaks selected for each dataset at various IDR thresholds are listed in this Google doc. Datasets with more than 2 replicates have multiple cutoffs (separated by ';'), one for each pairwise comparison of replicates. The maximum is used as the final cutoff for each dataset. More details below.
DIRECTORIES AND FILES
narrowPeak/ : contains all IDR (0.02) corrected narrowPeak format peak files
idrplots/ : contains all the IDR-based diagnostic plots
jul2010.numPeaks.IDR.xlsx : Table containing cutoff on number of peaks for each dataset at various IDR thresholds. Datasets with more than 2 replicates have multiple cutoffs ( separated by ';' ) corresponding to each pairwise comparison of replicates. The maximum is used as the final cutoff for each dataset. More details in the Procedure section.
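As a minimal sketch of how one cell of the cutoff table could be reduced to a final cutoff: the function name and the example values below are hypothetical, and the only behaviour taken from the description above is "split on ';' and keep the maximum".

```python
def final_cutoff(cell: str) -> int:
    """Return the largest of the ';'-separated pairwise cutoffs in one table cell."""
    values = [int(v) for v in cell.split(";") if v.strip()]
    if not values:
        raise ValueError("no cutoffs found in cell")
    return max(values)


# Hypothetical cell for a dataset with three replicates (three pairwise comparisons).
print(final_cutoff("15230;14871;16002"))  # 16002
```

A dataset with exactly two replicates has a single pairwise comparison, so its cell holds one number and that number is returned unchanged.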
PROCEDURE
The SPP peak caller was used.
1. All TF ChIP-seq datasets were paired with their corresponding 'Control' datasets.
2. Peaks were called on all TF ChIP-seq replicate datasets using a relaxed FDR threshold of 0.7. Control replicates are pooled together but TF ChIP-seq replicates are NOT pooled at this step. We refer to these as ReplicatePeakCall.
3. Aligned reads from all replicates for a particular TF ChIP-seq experiment are then pooled. Peaks are called on the pooled ChIP-seq data with respect to the pooled Control data, again using an FDR threshold of 0.7. We refer to these as PooledPeakCall.
So now for each unique TF ChIP-seq dataset, we have one PooledPeakCall file and two or more ReplicatePeakCall files (depending on the number of replicates).
4. We perform IDR/consistency analysis on all pairs of ReplicatePeakCall files that correspond to each PooledPeakCall file. Consistency is evaluated in terms of the rank of the peaks and their reproducibility. A copula mixture model is fitted to pairs of replicates. The method was developed by Qunhua Li (email@example.com) in Peter Bickel's group at Berkeley.
5. For each pairwise comparison of ReplicatePeakCall files, we obtain the number of peaks that pass an IDR threshold of 0.02. We refer to this as PairwiseNumPeakCutoff.
6. For each PooledPeakCall file, we obtain the largest of all the corresponding PairwiseNumPeakCutoff thresholds (i.e. the cutoff based on the most consistent pair of replicates). We refer to this as MaxPairwiseNumPeakCutoff. We use this cutoff to trim the PooledPeakCall file i.e. we keep the top N peaks where N = MaxPairwiseNumPeakCutoff.
10. If MaxPairwiseNumPeakCutoff < 100 for a particular PooledPeakCall file, it generally means that the dataset has very low signal-to-noise enrichment. This happens in a few cases (~6 datasets). In such cases, the IDR-based threshold can be too conservative. To squeeze the most signal out of these datasets, we instead apply a threshold on signal enrichment: we only keep peaks where the signal fold-enrichment > 25. This is based on the observation that for a large fraction of 'good' datasets, the IDR-based threshold tends to be equivalent to a signal fold-enrichment threshold of ~25.
11. For datasets with no replicate data, we use the signal fold-enrichment cutoff of 25 to trim the peak call file. There are 20 such datasets.
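The cutoff logic in the steps above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the `Peak` class, the function name and the constant names are hypothetical, and ranking the pooled peaks by signal fold-enrichment is an assumption (the text only says that the top N peaks are kept).

```python
from dataclasses import dataclass
from typing import List, Sequence

MIN_IDR_PEAKS = 100            # below this, the IDR-based cutoff is treated as too conservative
FOLD_ENRICHMENT_CUTOFF = 25.0  # fallback threshold on signal fold-enrichment


@dataclass
class Peak:
    name: str
    signal: float  # signal fold-enrichment (assumed to be the ranking measure)


def trim_pooled_peaks(pooled: Sequence[Peak],
                      pairwise_cutoffs: Sequence[int]) -> List[Peak]:
    """Trim a PooledPeakCall list using the PairwiseNumPeakCutoff values.

    pairwise_cutoffs is empty for datasets without replicates.
    """
    ranked = sorted(pooled, key=lambda p: p.signal, reverse=True)
    if not pairwise_cutoffs or max(pairwise_cutoffs) < MIN_IDR_PEAKS:
        # No replicates, or very low signal-to-noise: use the enrichment threshold.
        return [p for p in ranked if p.signal > FOLD_ENRICHMENT_CUTOFF]
    # Otherwise keep the top N peaks, N = MaxPairwiseNumPeakCutoff.
    return ranked[:max(pairwise_cutoffs)]
```

For example, with pairwise cutoffs of 120 and 150 the top 150 pooled peaks are kept, while a dataset whose only pairwise cutoff is 50 falls back to the fold-enrichment threshold.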
- In earlier versions of the IDR pipeline, we transferred the model learned on pairwise replicate comparisons to the pooled data. This step relied on assumptions that do not hold; it was also slow, highly unstable, and extremely sensitive to the number of peaks called on the pooled data. It has now been eliminated.
- The new procedure is very robust and does not depend on the peak-calling thresholds used (i.e. FDR or signal-enrichment thresholds). The only requirement is that peaks are called with a very relaxed threshold, so that the initial peak list is large and has few false negatives.
- ONE THRESHOLD TO RULE THEM ALL! There is only one threshold to set for all datasets: a single IDR threshold. I tested several IDR thresholds ranging from 0.01 to 0.05 and found 0.02 to give the best trade-off between false positives and false negatives over a wide range of data quality. False positives and false negatives were defined by manual inspection of signal profiles, the distribution of reads around the peak summit, and other peak characteristics for a wide range of datasets.
Files for March 2010 Meeting
- GENCODE Promoters split by CpG (Anshul Kundaje)
- Write up on consistency/IDR analysis (Qunhua Li)
- SPP Uniform Peak Calls (Anshul Kundaje)
- PeakSeq Uniform Peak Calls (Joel Rozowsky/Steven Wilder)
- DNase element coverage (Steven Wilder)
- File of single linkage elements (split at 5kb) created from all Duke and UW DNase data from the January 2010 freeze. The union of samples from the same cell line within and across labs was taken. Each line shows the region, the count of cell lines covered, a bit string representing the sum of bits representing cell lines (see key), and the integer representing this binary number in decimal.
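The per-line encoding described above (cell-line count, bit string, decimal integer) can be illustrated with a short sketch. The cell-line-to-bit assignment in `KEY` is hypothetical, since the real key ships with the file, and the function name is an assumption.

```python
from typing import Dict, Iterable, Tuple


def encode_cell_lines(covered: Iterable[str],
                      key: Dict[str, int]) -> Tuple[int, str, int]:
    """Return (count of cell lines, bit string, decimal value) for one element."""
    value = 0
    for cell_line in set(covered):
        value |= 1 << key[cell_line]  # set the bit assigned to this cell line
    count = bin(value).count("1")
    bit_string = format(value, f"0{len(key)}b")  # zero-padded to one bit per cell line
    return count, bit_string, value


# Hypothetical key: bit position assigned to each cell line.
KEY = {"K562": 0, "GM12878": 1, "HepG2": 2, "HeLa-S3": 3}
print(encode_cell_lines(["K562", "HepG2"], KEY))  # (2, '0101', 5)
```

Because each cell line owns exactly one bit, the decimal value identifies the covered set unambiguously and the count is simply the number of set bits.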
Uniformly Aligned ChIP-Seq/DNase/FAIRE Sequence Reads
Use MAQ to align data sets (Settings -C 10 -e 70 -n 2)
K562 DNase1 (DNase) [Crawford - Duke]
K562 CTCF (Sequence Specific) [Bernstein - Broad]
See ftp://encodeftp.cse.ucsc.edu/freeze/2009_02/Elements/Methods.pdf for additional notes.
K562 Pol II (Localized Region Binding) [Snyder - Yale]
Submitted Peak-Caller Results
- DNase 1 Scoring (K562 DNase1)
|F-Seq (Boyle)||Media:Fseq.DNaseHS.tgz||-b bff_20/ -f 0 -o peaks/ -of npf -p ploidy/ -v -t 8 sequence.final.bed|
- Sequence Specific ChIP-Seq Factor (K562 CTCF)
Results should be in the form of the UCSC narrowPeak format (i.e giving the nucleotide position of each "peak").
|Hotspot (Thurman)||Media:Hotspot.CTCF.tgz Media:Hotspot.CTCF.150bp.merge.tgz||Default. First archive contains 1bp peak calls. Second contains 150bp peaks, merged if overlapping.|
|F-Seq (Boyle)||Media:Fseq.CTCF.tgz||-b bff_35/ -l 300 -o peaks/ -of npf -p ploidy/ -v -t 16 sequence.final.bed|
|MACS (Pepke)||Media:macs.ctcf_k562.tgz||Tag size 36 for Rep1 and combined. Tag size 51 for Rep2.|
|CisGenome (Muratet)||Media:cisgenome.ctcf_k562.tgz||First round summary: hts_windowsummaryv2_2sample -w 100
First round peak detection: hts_peakdetectorv2_2sample -w 100 -s 25 -c 0.1 -m 17 -p 0.288865 -ssf 1 -cf 15 -cr 15 -br 1 -brl 30
Median length of the top 5000 first round peaks is 132, so reads are shifted by 66bp: hts_alnshift2bar -s 66
Second round summary: hts_windowsummaryv2_2sample -w 100 -z 1
Second round peak detection: hts_peakdetectorv2_2sample -w 100 -s 25 -c 0.1 -m 18 -p 0.282769 -ssf 1 -cf 15 -cr 15 -br 1 -brl 30 -z 1 -fc 2 -tfc 2|
|QuEST (Pepke)||Media:quest.ctcf_k562.tgz||Script configure (generate_QuEST_parameters.pl).|
|SISSRS (Muratet)||Media:sissrs.ctcf_k562.tgz||-s 3080436051 -D 0.001 -e 10 -p 0.001 -m 0.8 -w 20 -E 4 -r -u|
|SPP (Park)||Media:Spp.ctcf.tar.gz||fdr=0.001, method=tag.lwcc, mle.filter=T, min.dist=300, tec.z=100, whs=325|
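For readers working with the submitted narrowPeak files, here is a minimal parsing sketch. It assumes the standard ten-column UCSC narrowPeak layout (chrom, chromStart, chromEnd, name, score, strand, signalValue, pValue, qValue, peak), where `peak` is the 0-based summit offset from chromStart and -1 means no summit was called. The class and function names, and the example record, are made up for illustration.

```python
from typing import NamedTuple, Optional


class NarrowPeak(NamedTuple):
    chrom: str
    start: int           # chromStart, 0-based
    end: int             # chromEnd, exclusive
    name: str
    score: int
    strand: str
    signal_value: float
    p_value: float       # -log10 p-value, or -1 if not assigned
    q_value: float       # -log10 q-value, or -1 if not assigned
    peak: int            # summit offset from start, or -1 if not called

    def summit(self) -> Optional[int]:
        """Genomic position of the point-source summit, if one was called."""
        return None if self.peak < 0 else self.start + self.peak


def parse_narrowpeak_line(line: str) -> NarrowPeak:
    fields = line.split()
    if len(fields) != 10:
        raise ValueError(f"expected 10 narrowPeak columns, got {len(fields)}")
    return NarrowPeak(fields[0], int(fields[1]), int(fields[2]), fields[3],
                      int(fields[4]), fields[5], float(fields[6]),
                      float(fields[7]), float(fields[8]), int(fields[9]))


# Made-up record, for illustration only.
record = parse_narrowpeak_line("chr1\t1000\t1200\tpeak1\t900\t.\t35.2\t-1\t-1\t80")
print(record.summit())  # 1080
```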
- Localized Region Binding (K562 Pol II)
Results should be in the form of the UCSC broadPeak format (i.e. giving the localized region, less than a couple Kb, where binding occurs).
|F-Seq (Boyle)||Media:Fseq.PolII.tgz||-b bff_35/ -l 800 -o peaks/ -of npf -p ploidy/ -v -t 8 sequence.final.bed|
|CisGenome (Muratet)||Media:cisgenome.pol2_k562.tgz|| hts_windowsummaryv2_2sample -w 100
hts_peakdetectorv2_2sample -w 100 -s 25 -c 0.1 -m 6 -p 0.374436 -ssf 1 -cf 5 -cr 5
hts_alnshift2bar -s 27
hts_windowsummaryv2_2sample -w 100 -z 1
hts_peakdetectorv2_2sample -w 100 -s 25 -c 0.1 -m 6 -p 0.374609 -g 100 -ssf 1 -cf 5 -cr 5 -z 1 -fc 2 -tfc 2|
|Erange (Pepke)||Media:erange.polII_k562.tar.gz||-nodirectionality -notrim|
|QuEST (Pepke)||Media:quest.polII_k562.tar.gz||bandwidth=100, calc_window=1000|
|SISSRS (Muratet)||Media:sissrs.pol2_k562.tgz||-s 3080436051 -D 0.001 -e 10 -p 0.001 -m 0.8 -w 20 -E 4 -r -u|
|Broad Segmenter (Mikkelsen)|
Physical interactions between ENCODE Transcription factors
- Integration of PPI data to ENCODE TFs
|Interactions from BIOGRID||File:Interacting TFs BIOGRID.txt|
|Interactions from Ravasi et al., Cell 2010||File:Interacting TFs Ravasi Cell2010.txt|
Proposed Revised Timeline
- Prepare Aligned Sequences - Wed 28 Jan
- Lists of Peaks Called - Monday 13 Feb
- Comparison Presentation on AWG Call - TBD
- Other pages