NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.

Unsupervised pattern discovery in human chromatin structure through genomic segmentation

We have developed a method that finds patterns in multiple tracks of continuous-scale data across a genome. The method partitions each chromosome into a number of contiguous segments, where each segment is associated with one of a fixed set of segment classes. Every class has a corresponding set of probability distributions, one for each data track. Our implementation, Segway, uses dynamic Bayesian network techniques to discover the parameters of these distributions and the genomic segmentation that best fits these parameters.
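The segmentation idea can be illustrated with a much simpler model than the one Segway actually uses. The sketch below assumes a plain hidden Markov model with one Gaussian per class and track, fixed (already known) parameters, and a single hypothetical `stay_prob` transition parameter; it only decodes the most probable labeling with the Viterbi algorithm and does not learn anything. All function and parameter names are illustrative and are not part of Segway.

```python
import math

def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)

def viterbi_segment(observations, means, sds, stay_prob=0.9):
    """Return the most probable class label for each position.

    observations: one tuple per genomic position, one value per track.
    means[k][t], sds[k][t]: assumed Gaussian parameters of class k, track t.
    stay_prob: assumed probability of keeping the same class at the next
    position (requires at least two classes).
    """
    n_classes = len(means)
    log_stay = math.log(stay_prob)
    log_switch = math.log((1 - stay_prob) / (n_classes - 1))

    def emit(k, obs):
        # Tracks are treated as independent given the class.
        return sum(gaussian_logpdf(x, means[k][t], sds[k][t])
                   for t, x in enumerate(obs))

    def trans(j, k):
        return log_stay if j == k else log_switch

    score = [math.log(1.0 / n_classes) + emit(k, observations[0])
             for k in range(n_classes)]
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = [], []
        for k in range(n_classes):
            best_prev = max(range(n_classes),
                            key=lambda j: score[j] + trans(j, k))
            new_score.append(score[best_prev] + trans(best_prev, k) + emit(k, obs))
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final class.
    label = max(range(n_classes), key=lambda k: score[k])
    path = [label]
    for pointers in reversed(backpointers):
        label = pointers[label]
        path.append(label)
    path.reverse()
    return path

def to_segments(labels):
    """Collapse per-position labels into (start, end, label) half-open segments."""
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((start, i, labels[start]))
            start = i
    return segments

# Two tracks, two classes: class 0 is "low" in both tracks, class 1 is "high".
means = [[0.0, 0.0], [5.0, 5.0]]
sds = [[1.0, 1.0], [1.0, 1.0]]
data = [(0.1, -0.2), (0.3, 0.1), (5.2, 4.8), (4.9, 5.1), (5.0, 5.3), (0.0, 0.2)]
labels = viterbi_segment(data, means, sds)
segments = to_segments(labels)
```

In a real setting the class parameters would be estimated from the data rather than supplied by hand, which is the part this sketch leaves out.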

This method provides several benefits, including the ability to work at 1-bp resolution across an entire genome and to handle heterogeneous patterns of missing data in different tracks without downsampling or interpolation. These capabilities allow us to take full advantage of the high-resolution data generated by sequencing assays. Segway can also incorporate the relative abundance of dinucleotides into its classification.
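One common way to tolerate missing data without interpolation, when tracks are modeled as independent given the segment class, is to leave a missing track out of the likelihood at that position. The sketch below assumes Gaussian per-track distributions and uses `None` to mark a missing value; both choices, and all names, are illustrative assumptions rather than a description of Segway's internals.

```python
import math

def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)

def emission_loglik(obs, means, sds):
    """Log-likelihood of one position under one class, skipping missing tracks.

    obs: one value per track, with None where the track has no data.
    means, sds: assumed Gaussian parameters of this class, one per track.
    """
    total = 0.0
    for x, mean, sd in zip(obs, means, sds):
        if x is None:
            continue  # a missing track contributes nothing for any class
        total += gaussian_logpdf(x, mean, sd)
    return total

def best_class(obs, class_means, class_sds):
    """Index of the class with the highest emission log-likelihood."""
    scores = [emission_loglik(obs, m, s)
              for m, s in zip(class_means, class_sds)]
    return max(range(len(scores)), key=lambda k: scores[k])

class_means = [[0.0, 0.0], [5.0, 5.0]]
class_sds = [[1.0, 1.0], [1.0, 1.0]]

# The second track is missing here; the decision rests on the first track alone.
choice = best_class((4.7, None), class_means, class_sds)
```

Because the skipped term is dropped for every class alike, positions with different sets of available tracks can still be compared class against class, and a position with no data at all leaves the decision to the neighboring positions in a full sequence model.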

Segway has produced segmentations of a number of ENCODE data tracks acquired in the GM12878 and K562 cell lines. We focus on a segmentation of data tracks from DNaseI-seq experiments and ChIP-seq experiments with antibodies against CTCF, Pol II, and seven histone modifications. Low intensities in all of these tracks cluster together, as do high intensities. Segment classes that were trained on K562 data only, and that are associated with low or high intensities in that cell line, also tend to include genomic positions associated with low or high intensities, respectively, in the equivalent GM12878 experiments.

We anticipate that combining ENCODE data and the positions of known genomic features (for example, transcription start sites, enhancers, silencers, and locus control regions) will allow us to elucidate data patterns that predict these features. Similarly, adding annotation of expressed or repressed genes will allow identification of patterns associated with those signals.

doi:10.1038/nmeth.1937