NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.

Large-scale behavior

Introduction

Previous analysis page: "Large-scale segmentation of individual data sets"

The analysis will begin with an initial tool-building phase, followed by an iterative procedure of

  1. producing various segmentations, and
  2. analyzing the segmentations using our tools.

Phone calls

The dial-in details are as follows:

  • UK Freephone: 08081095644
  • US Freephone: 877 420 0272
  • (Alternative International No.: +44 2083222500)

Participant Access Code: 1571500#

Newest results

Segmentation progress

Repressed/Active/Dead segmentation from AWG meeting at Segway.rad.bed.gz

Summary for November 2010 PI meeting

Large-scale Behaviour/Summary November 2010

Analysis tasks

Large-scale annotation zoo

  • Promoter proximal regions/enhancers: Michael Hoffman, Orion Buske, Jason Ernst
    • Analysis 2.a.i.ii: Do different segments from the segmentation correspond to different splits of promoters by CpG content and expression level?
    • Status: Jason has completed multiple round5b ChromHMM segmentations. Michael has completed multiple round5b Segway segmentations on the whole genome. CpG and expression level (K562 RPKM, by gene) have been added to Segtools reports.
    • Done: By 08Feb2010 large-scale call: Michael to e-mail Roderic Guigo about getting RPKM data for GENCODE genes and report back. It sounds like we should use the total RNA data instead of nuclear or cytosolic.
    • Done: By 16Feb2010 large-scale call: Michael to set up table of files for other groups to use. Should include name, URL, short description, and any patterns (in conservation, etc.) that you would expect other groups to find.
    • Done: By 16Feb2010 large-scale call: Jason to send one segmentation to Orion with annotation of which labels should be promoter-proximal and enhancer.
    • Done: By 16Feb2010 large-scale call: Michael to send one segmentation to Orion with annotation of which labels should be promoter-proximal and enhancer.
    • Done: By 29Feb2010 large-scale call: Orion to have Segtools reports of Michael's and Jason's segmentations (including expression level measure)
      • Expression level measure is file: wgEncodeGencodeFullV3.krpkm.split.gtf.gz in segtools reports. There are three groups, corresponding to all GENCODE genes split by K562 RPKM value (top 25%, zeros (50%), remaining (25%)).
  • Silent domains: Michael Hoffman, Orion Buske, Jason Ernst, Ross Hardison
    • Analysis 2.e: Identify silent domains as regions lacking transcripts in the cell (line), with no/low H3K36me3 and high (moderate) H3K27me3 signal. Define the boundaries as the change point in H3K27me3 signal. Perform same analyses as for promoters examining the domain as a whole, and also the domain boundaries.
    • Status: Waiting until after March Meeting.
    • Done: By 08Feb2010 large-scale call: Orion to make H3K27me3 segmentation from peak calls. Broad peaks accounted for ~37% of the genome. Enriched at centers of IG and pseudogenes; depleted at centers of protein-coding and "other" genes.
    • Done: By 16Feb2010 large-scale call: Orion to upload H3K27me3 segtools report.
  • New gene elements: Zhiping Weng, Gerstein/Rozowsky, Michael Hoffman, Jason Ernst, Anshul Kundaje
    • Analysis 2.f: From each of the predictive models (i.e. signals that we have identified as being characteristic of these elements, whether from segmentation, SVM, or another machine learning approach, supervised or otherwise), identify new elements. The segmentation is also a classifier and so could be used to identify new gene elements. Can we find things like dead zones?
    • Status: Michael will try to find things from the Segtools report after March Meeting
    • Done: By 29Feb2010 large-scale call (preferably soon): Orion to let Jason/Michael know when reports are done
    • Done: By 22Feb2010 large-scale call: Michael to look at reports and identify hypothetical new behavior
    • Action: By 22Mar2010 large-scale call: Michael and Jason to identify all unexplained or surprising segment labels
  • TF environment: Jason Ernst, Anshul Kundaje
    • Analysis 2.h: For each TF, produce aggregate plots of signal over the element and immediate surrounds (+/- 500bp? 1000bp?) for histone modifications, nucleosome positioning and DNase 1 sensitivity. Split these plots by TF binding signal levels.
    • Status: Jason has done nucleosome calls using NPS software from Shirley Liu and Broad histone modification data. Anshul is making aggregate signal plots with new data.
    • Done: By 08Feb2010 large-scale call: Orion to download NPS data from [1]. Data appears to be fine.
    • Action: By March Meeting: Anshul to upload some plots (~8 TFs in K562) to do this for discussion in the call.
  • Broad chromosome organization: Broad, Michael Hoffman, Jason Ernst, Steve Wilder
    • For the following genome splits:
      • gene rich/poor
      • transcriptionally active/inactive
    • Analysis 3.a/b.i: Determine the bulk average histone modifications, DNase1, etc., with confidence intervals for these regions. To do this, calculate a per-kb figure for each segment, or perhaps generate 1 kb windows and make aggregate plots?
    • Analysis 3.a/b.ii: Identify enrichment for particular segments from the genome segmentation within these splits.
    • Done: Steven to present findings during 22Feb2010 large-scale call
    • Done: By 16Feb2010 large-scale call: Steven Wilder to use a sliding window approach to mark gene rich/poor regions.
    • Action: By March Meeting: Orion to run Segtools on gene rich/poor regions segmentation
  • Chromosome interaction: Job Dekker group
    • Analysis 3.c: Identify pairs of interacting regions from Dekker 5C data. Plot scatter plots of modification signals, TF ChIP-seq signal, DNase1 signal, transcription, etc. for the two sites and calculate r and r². Note: Isn't there a caveat that interacting regions will be pulled down together in ChIP-seq assays?
    • Status: Finished list of interaction pairs (~3000 in ENCODE pilot regions). Presentation (Brian Lajoie).
    • Done: By 16Feb2010 large-scale call: Job to provide a list of interactions to be used through March meeting. Can do another round after March meeting
    • Done: By 22Feb2010 large-scale call: Brian to have updated README
    • Action: By March Meeting: Brian to have some correlation plots of signals between the interacting regions
  • Transcript domains: Michael Hoffman, Ross, Georgi
    • Analysis 3.e: What proportion of the transcriptional output of the genome is organized into well-defined, multi-gene/transcript domains that can be systematically connected with chromatin modifications, accessibility, and interactions (5C)?
    • Done: By 16Feb2010 large-scale call: Michael to send segmentation to Georgi
    • Done: By 29Feb2010 large-scale call: Michael to ping Georgi re: sent segmentations. No response.
    • Action: Discuss with Georgi at March Meeting.
  • Large-scale transcript-unspecific chromatin domains: Jason, Tony, James
    • Analysis 3.f: Are there large-scale chromatin phenomena that are not connected with the transcriptional output of the underlying genomic sequence (e.g., expression status of genes within some defined domain)?
    • Status: Still unsure of what exactly this is supposed to entail. This analysis will probably fall out of the segmentation results.
  • CTCF/cohesin/Rad21: Jason, James, Tony
    • Analysis 3.i: The role of CTCF, Cohesin, Rad21 in chromosomal organization, identify
      • CTCF sites at boundaries of domains i.e. Insulators
      • CTCF sites that lie between regulatory elements and promoters i.e. enhancer blocker sites
      • CTCF sites at promoters
      • how does this interact with Job Dekker's 5C results?
    • Status: Still unsure of what exactly this is supposed to entail. This analysis will probably fall out of the segmentation results.
    • Done: By 29Feb2010 large-scale call: Orion to have cohesin/Rad21 peaks in new Segtools reports
    • Action: By 29Feb2010 large-scale call: Michael to ask Zhiping Weng for suggestions/additions.
  • Standardized signal generation: Anshul Kundaje, Michael Hoffman, Orion Buske, Richard Sandstrom
    • Status: Round 5b generation is done. Everything else (GRCh37 conversion) postponed until after March meeting
    • Done: By 4Feb2010: Orion and Richard to send bigWig URLs to the AWG mailing list so that regions can be discussed on
    • Action: By 19Mar2010: Anshul to produce a version of tagAlign2rawSignal that supports SAM/BAM, and filtering by proportion of nearby mappable positions, and produce tag extension lengths considering randoms for multimapping.
    • Action: By 19Mar2010: Michael to pick round6 assays.
    • Action: By 26Mar2010: Orion to produce round6 signal files, genomedata archive, considering randoms for multimapping. Both DNase assays to use same two-way extension length.
  • Confusion matrices between different segmentations: Orion Buske
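A confusion matrix between two segmentations can be built by counting the base pairs shared by each pair of labels. The following is only a minimal sketch of that idea, not the code used by the group: the function name `confusion_matrix`, the toy intervals, and the labels are all hypothetical, and it assumes each segmentation is a sorted, non-overlapping list of half-open (start, end, label) intervals on a single chromosome, as in BED.

```python
from collections import defaultdict


def confusion_matrix(seg_a, seg_b):
    """Count base pairs shared by each (label_a, label_b) pair.

    seg_a and seg_b are lists of (start, end, label) tuples on one
    chromosome, sorted by start, non-overlapping, half-open as in BED.
    """
    counts = defaultdict(int)
    i = j = 0
    while i < len(seg_a) and j < len(seg_b):
        start_a, end_a, label_a = seg_a[i]
        start_b, end_b, label_b = seg_b[j]
        overlap = min(end_a, end_b) - max(start_a, start_b)
        if overlap > 0:
            counts[(label_a, label_b)] += overlap
        # Advance whichever segment ends first.
        if end_a <= end_b:
            i += 1
        else:
            j += 1
    return dict(counts)


# Toy data (hypothetical labels, not real Segway/ChromHMM output).
segmentation_1 = [(0, 100, "A"), (100, 250, "R"), (250, 300, "D")]
segmentation_2 = [(0, 120, "1"), (120, 300, "2")]
matrix = confusion_matrix(segmentation_1, segmentation_2)
for (label_1, label_2), bases in sorted(matrix.items()):
    print(label_1, label_2, bases)
```

Counting base pairs rather than segments keeps the comparison independent of how finely each method splits the genome. For whole-genome data the same sweep would be run once per chromosome and the counts summed.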

GRCh37 Results

Results for March 2010 meeting

Round 5b results for the March 2010 meeting.

Please link files with this example wiki code:

* [http://my.website.edu/mydir/mypage.html Name] - description, followed by any other relevant information, for instance, if you expect certain patterns of conservation/evolution for that particular data set

If you have expected results that are not available yet, please add them as well (just without the URL link)

  • Nucleosome positioning data
    • K562 NPS calls from Broad histone modification data (Jason Ernst)
    • GM12878 NPS calls from Broad histone modification data (Jason Ernst)
  • Signal aggregation
    • Plots for all transcription factors (Anshul Kundaje)
    • Tab-delimited data for all transcription factors (Anshul Kundaje)

Beyond the March 2010 meeting

Software tools

  • tagAlign2rawSignal (Anshul)
  • Segtools (Michael, Orion) obtain - Software is in a somewhat haphazard state. Please contact Orion with any questions or if you have any trouble at all!
    • Analyze the distributions of various data tracks within each type of segment
    • Analyze the length distributions of the segments
    • Summarize the transition matrix between segments with different labels
    • Make "aggregation plots" around punctate sites (e.g., TSSs) or predefined regions (e.g., exons) of the frequency of various segment labels
    • Summarize nucleotide and dinucleotide frequencies by segment label
  • GSC (Genomic Structural Correction) (Ben, Elliot, Nathan, ?) obtain
    • Significance test for the overlap between regions (binary or continuous) in two BED files over a given domain
    • Dyadic segmentation
    • BED parser
    • Wiggle parser
  • SegOverlap (Bob, Richard, Orion) (alpha) obtain
    • Analyze the overlap / enrichment of one segmentation with respect to a data track or a second segmentation
  • Annote (Bob) obtain
    • Tabulate various enrichment statistics between one or more labels and a variety of data tracks and genomic annotations.
  • ??? (Peter, Jessica)
    • Cluster the various segment types and evaluate levels of clustering with respect to some biological / statistical criterion
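Two of the summaries listed for Segtools above, segment length distributions and the transition matrix between labels, can be illustrated on a toy segmentation. This is a rough sketch under stated assumptions and does not use or reproduce the Segtools interface: the function names `length_summary`, `transition_counts`, and `transition_probabilities` and the toy segments are hypothetical, and segments are assumed to be sorted, half-open (start, end, label) intervals on one chromosome.

```python
from collections import Counter, defaultdict
from statistics import mean


def length_summary(segments):
    """Return {label: (number of segments, mean length in bp)}."""
    lengths = defaultdict(list)
    for start, end, label in segments:
        lengths[label].append(end - start)
    return {label: (len(v), mean(v)) for label, v in lengths.items()}


def transition_counts(segments):
    """Count label-to-label transitions between directly adjacent segments."""
    counts = Counter()
    for (_, prev_end, prev), (next_start, _, nxt) in zip(segments, segments[1:]):
        if prev_end == next_start:  # skip gaps between segments
            counts[(prev, nxt)] += 1
    return counts


def transition_probabilities(counts):
    """Normalize counts so that each label's outgoing transitions sum to 1."""
    totals = Counter()
    for (prev, _), n in counts.items():
        totals[prev] += n
    return {pair: n / totals[pair[0]] for pair, n in counts.items()}


# Toy segmentation (hypothetical labels).
segments = [
    (0, 100, "A"), (100, 300, "R"), (300, 350, "A"),
    (350, 500, "D"), (500, 600, "A"), (600, 800, "R"),
]
summary = length_summary(segments)
counts = transition_counts(segments)
probs = transition_probabilities(counts)
```

Row-normalizing the counts gives an empirical estimate of how often one label is followed by another, which is one simple way to compare segmentations that use different numbers of labels.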

Links

Analysis tasks (old)

  • RNA - which classes correspond to transcribed genes. (Mark, Robert Bjornson, Zhi John Lu – novel transcribed regions)
  • TFs held out from segmentation (Anshul, Xiaoyu)
  • DNase1/FAIRE to identify open chromatin classes in various segmentations (Richard)
  • Gene annotation (Mark, Robert Bjornson, Zhi John Lu – novel transcribed regions, Michael, Orion)
  • Sequence conservation enrichment (Jason, Steve, Elliott)
  • Logical relationships among segment labels (Jason)

Old Results

Round 4

Round 4 input files

Round 3

Interim studies with the round 3 input files spreadsheet. No public results

Round 2

These are produced using the data tracks in the round 2 input files spreadsheet.

Obsolete results

People

  • Ray Auerbach
  • Peter Bickel
  • Nathan Boley
  • Orion Buske
  • Xiaoyu Chen
  • Jason Ernst
  • Mark Gerstein
  • Michael Hoffman
  • Haiyan Huang
  • Anshul Kundaje
  • Qunhua Li
  • Bill Noble
  • Steve Parker
  • Joel Rozowsky
  • Richard Sandstrom
  • Noam Shoresh
  • Bob Thurman