NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.



Cat-herder - Ewan Birney

Overarching goals

  • Making a taxonomy of non-genic features
    • “enhancers”, “insulators”, “something-else”
  • Investigating the sub-architecture of features
    • Certainly promoters
  • Joint model of non-genic elements and genic annotation
    • Certainly gene expression
    • Also pathway/GO
    • Element to Gene assignment
      • Outstanding problem to be sorted out


  • Proposed format for element matrix circulated, pseudo-code for matrix construction provided 19th Dec
  • Format frozen, 1 example file posted 16th Jan
  • First set of matrix files created (~10 or so, different data manipulations) 31st Jan
  • First runs of machine learning done 28th Feb:
    • Anshul Kundaje
    • Steven Wilder
    • Zhiping Weng
    • Yale crowd
    • Bickel crowd
  • Ability to present the first round of clustering/modelling at the March meeting


  • Proposed transformation of data

For broader peaks, here is the suggested way to transform the information around the peak for the integration code.

Assume the following attributes of the elements and broad peaks:

    • element goes from element_start to element_end, with mid point element_mid
    • broad peak goes from broad_start to broad_end
    • total signal of the broad peak is total_broad_peak
    • mid-point of the broad peak, mid_broad_peak, is the point where the integrated signal to the left is the same as the integrated signal to the right

This is what I propose we hand over to the machine learning matrix if the element overlaps the broad peak (note that there will be one set of these columns for each broad peak assay). If the element does not overlap, I suggest scoring 0 in each of these columns, though we might want to standardise on "NA" meaning no overlap, which can easily be transformed to 0 on the "client" side.

    • signal of the broad peak between element_start and element_end (ie, local signal)
    • the total_broad_peak
    • the distance from the mid-point of the element to the mid-density of peak, ie abs(element_mid-mid_broad_peak)
    • length of the broad peak (broad_end - broad_start)

This gives the machine learning techniques the density of the signal under the element, the total effect of the broad peak, and a sort of "skewness" parameter.

Where an element overlaps two broad peaks of the same assay, presumably because it bridges the two (assuming that broad peaks of the same assay cannot overlap each other), it takes the parameters from the peak with the larger overlap. I think this should be a rare event.
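The four proposed columns can be sketched in Python as below. This is a minimal illustration, not the pipeline's code: the per-base signal list, the half-open [start, end) coordinates, the function names, and the use of None for "NA" are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

# Hypothetical layout: a broad peak is (broad_start, per_base_signal), covering
# the half-open interval [broad_start, broad_start + len(per_base_signal)).
Peak = Tuple[int, List[float]]


def mid_density(broad_start: int, signal: List[float]) -> int:
    """First position at which the cumulative signal reaches half the total."""
    half = sum(signal) / 2.0
    running = 0.0
    for offset, value in enumerate(signal):
        running += value
        if running >= half:
            return broad_start + offset
    return broad_start  # only reached for an empty signal


def broad_peak_columns(
    element_start: int, element_end: int, peaks: List[Peak]
) -> Optional[Tuple[float, float, float, int]]:
    """Return (local signal, total_broad_peak, mid-point distance, peak length).

    Returns None (standing in for "NA") when no peak overlaps the element.
    If several peaks overlap, the one with the larger overlap is used.
    """
    best = None
    best_overlap = 0
    for broad_start, signal in peaks:
        broad_end = broad_start + len(signal)
        overlap = min(element_end, broad_end) - max(element_start, broad_start)
        if overlap > best_overlap:
            best, best_overlap = (broad_start, broad_end, signal), overlap
    if best is None:
        return None

    broad_start, broad_end, signal = best
    lo = max(element_start, broad_start) - broad_start
    hi = min(element_end, broad_end) - broad_start
    local_signal = sum(signal[lo:hi])
    total_broad_peak = sum(signal)
    element_mid = (element_start + element_end) / 2
    distance = abs(element_mid - mid_density(broad_start, signal))
    return (local_signal, total_broad_peak, distance, broad_end - broad_start)
```

On the "client" side, a None result can be mapped to 0 in each column if that convention is preferred.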

  • Proposed element or region data

There are at least two ways to define the regions or elements over which we want to integrate signals. A region or element here means a unit for data integration, a small slice of the genome. First, in an ultimately unbiased way, we can split the genome into pre-defined windows of fixed size (with or without overlap). The problem with this is that the region file itself is usually quite big, and many regions may not even have values across the datasets we want to integrate. A few examples:

Second, we can create regions on the fly, depending on the datasets that we want to integrate. A simple merge puts all genomic intervals from the input data files together. If a merged region is too big (for example, greater than 5kb), it is split up into 5kb pieces.

This process should be integrated into the matrix building code.
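The on-the-fly approach (merge, then split oversized regions) can be sketched as follows. The function names are hypothetical, intervals are assumed to be half-open (start, end) pairs on a single chromosome, and treating touching intervals as mergeable is a choice made for this example rather than something the text specifies.

```python
from typing import List, Tuple

Interval = Tuple[int, int]


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into their union."""
    merged: List[List[int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extends the current merged region.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def split_region(start: int, end: int, max_size: int = 5000) -> List[Interval]:
    """Cut a region into consecutive pieces of at most max_size, from its start."""
    return [(s, min(s + max_size, end)) for s in range(start, end, max_size)]


def build_regions(intervals: List[Interval], max_size: int = 5000) -> List[Interval]:
    """Merge all input intervals, then split any merged region that is too big."""
    regions: List[Interval] = []
    for start, end in merge_intervals(intervals):
        regions.extend(split_region(start, end, max_size))
    return regions
```

For example, build_regions([(0, 3000), (2000, 12000), (20000, 21000)]) merges the first two intervals into one 12kb region and cuts it into two 5kb pieces plus a 2kb remainder, leaving the third interval untouched.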

  • Matrix documentation

Element generation

Certain factors (including DNase1, FAIRE, and selected transcription factors, excluding PolII and PolIII) were chosen to be feature_generating. The elements were created from the union of the peak calls (see spec file for peak files used). Elements longer than the specified max_feature_length (for versions 1 and 3: 1000 bp; for version 2: 2000 bp) were split, starting from the start of the element.

Matrix population

The matrix file's header contains the genome version, max_feature_length, default_method, column names, and the files and column numbers used to generate the values in the matrix columns.

For versions 1 and 2 of the matrix, the peak heights for each factor were used to populate the matrix: the mean height if multiple peaks overlapped an element, and 0 if no peaks overlapped it.

For version 3 of the matrix, Signal (or, if unavailable, Raw Signal) tracks were used. For each factor, the maximum value of the signal in a region overlapping the element was inserted in the matrix, or 0 if no region overlapped the element.
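The two population rules (mean peak height for versions 1 and 2, maximum signal for version 3) can be illustrated with a short sketch. The (start, end, value) tuples, half-open coordinates, and function names are assumptions for the example and are not taken from the actual matrix building code.

```python
from typing import List, Tuple

# Hypothetical record: (start, end, value), where value is a peak height or a
# signal value over the half-open interval [start, end).
Feature = Tuple[int, int, float]


def overlapping_values(element: Tuple[int, int], features: List[Feature]) -> List[float]:
    """Values of all features that overlap the element by at least one base."""
    e_start, e_end = element
    return [value for start, end, value in features if start < e_end and end > e_start]


def mean_peak_height(element: Tuple[int, int], peaks: List[Feature]) -> float:
    """Versions 1 and 2: mean height of overlapping peaks, or 0 if none overlap."""
    heights = overlapping_values(element, peaks)
    return sum(heights) / len(heights) if heights else 0.0


def max_signal(element: Tuple[int, int], signal_regions: List[Feature]) -> float:
    """Version 3: maximum signal among overlapping regions, or 0 if none overlap."""
    values = overlapping_values(element, signal_regions)
    return max(values) if values else 0.0
```

With one element and the same three records, the two rules can give different cell values, which is worth keeping in mind when comparing matrix versions.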

  • First test run of the pipeline

The aim is to recreate the matrix file Ewan brought to the DC meeting with our existing pipeline.

  • Pre-freeze run

The aim is to generate data matrices with pre-freeze data (see Ewan's email to the AWG group on 2/22/09)

  • Post DC meeting, small element integration (See Ewan's email to the AWG group on 4/2/2009)

We will have a set of 'chromatin focused' small elements, which will at first be the union set of (chip-seq-factors,dnase1,faire). This will currently be built from lab-submitted elements but will switch to peakseq and spp elements once calls are standardised. Ting will make a new file for these elements within a week.

We will take an initial set of RNA elements as the union of (RNAseq, TransFrags, Gencode-exons) over all compartments. Sarah will make this file within a week.

  • 4/12/09: Post DC meeting, second round, CTCF data removed (See Ewan's email to the AWG group on 4/8/2009)

Same sets of data as 040209, except without the CTCF datasets. Element size is up to 2kb.

  • 7/23/09: Pre Huntsville meeting, third round, consistent peak calling

Additional elements and factors added. Maximum element size 1kb. All values generated from signal tracks.


  • Other pages