NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.

# Companion papers

## Companion Paper Submission Status

This page contains privileged information (including unpublished analyses and manuscripts). This material is posted here solely to facilitate coordination and planning by ENCODE consortium members and their collaborators. Any unauthorized disclosure, copying, use, or distribution of this information for any purpose other than coordinating submission of ENCODE papers and planning future ENCODE analyses is not permitted.

• Submission of Figures and Data Spreadsheets to Googledocs (High Profile Companion Papers) - September 9, 2011
• Submission of Titles and Abstracts for GR companions to Genome Research - September 15, 2011
• Submission of Draft Manuscript to Googledocs (High Profile Companion Papers) - September 23, 2011
• Submission of Integrative and High Profile Companion Papers to Nature - October 15, 2011
• Submission of GR companions to Genome Research - October 15, 2011

# Three sentence policy for dataset use

In the spirit of transparency and consistency, we would like to make explicit the guidelines for use of genome-wide datasets in high-profile and other companion papers. Datasets that are (a) past the 9-month embargo, (b) in the January freeze, (c) used in the integrative paper, or (d) used in a high-profile companion are all available for use by all the groups to validate their hypotheses or extend their results. However, analyses that use these datasets as the primary information should be reserved to the papers written by the data producers, unless permission is specifically granted to carry out an independent primary analysis that does not conflict with the data producers' own papers. In both cases, an email should be sent to the data production PIs to inform them of such use, to avoid duplication or conflicts.

# Best Practices for Companion Papers

We are setting out some best practices for the ENCODE papers. We will follow them in the main flagship paper, we expect all the high profile papers to follow them, and we encourage all the companion papers to do the same.

• The data should substantially come from the Jan2011 freeze; any additional dataset should be explicitly mentioned as not coming from the Jan2011 freeze. Additional datasets should of course be submitted to UCSC. Any statistics that draw information from another resource (e.g., the TSSs or ChIP-seq results) must use the freeze-based data. Gencode version 7 is considered to be part of the Jan2011 freeze.
• Where there exists a uniform calling procedure (e.g., by IDR), that statistic should be used; if authors want to drop thresholds (which is appropriate in a number of scenarios), they should quote both the results from the IDR threshold and the results from their own threshold set. Wherever possible, genome-wide overlap statistics should go through the GSC statistical framework. We have a first-class statistical group in the Bickel group, so do run any statistical questions past them.
• All data-orientated figures should have the data accessible as a column-orientated table, either a .csv or an R data frame. For the main paper we will have a structured system (at a minimum an ftp-able directory structure, perhaps tar'd up) for downloads of these. We would strongly encourage everyone to take the same view.
• Wherever possible there should be a pseudo-code level description of how one gets from the raw data files to the .csv or R data frame of a figure, and ideally the actual scripts/code.
• All statistics which mention a genome-wide dataset should ideally also be a track submitted to UCSC.
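
As a rough illustration of the table and provenance bullets above, the following Python sketch writes the data behind a hypothetical figure as a column-orientated .csv, quoting peak counts at both a uniform IDR threshold and a relaxed, author-chosen threshold. The per-peak IDR values are assumed to have been computed upstream. The function names, column names, thresholds and values are placeholders invented for this example, not ENCODE conventions or results.

```python
import csv
import io

# Hypothetical column layout for one figure's data table.
COLUMNS = ["dataset", "threshold_name", "idr_threshold", "n_peaks"]


def count_peaks(idr_values, threshold):
    """Count peaks whose precomputed IDR is at or below the threshold."""
    return sum(1 for value in idr_values if value <= threshold)


def figure_rows(datasets, thresholds):
    """Build one row per (dataset, threshold) pair for the figure table."""
    rows = []
    for name, idr_values in datasets.items():
        for label, threshold in thresholds.items():
            rows.append({
                "dataset": name,
                "threshold_name": label,
                "idr_threshold": threshold,
                "n_peaks": count_peaks(idr_values, threshold),
            })
    return rows


def write_figure_table(rows, handle):
    """Write the rows as a .csv with a header line."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


# Placeholder inputs: per-peak IDR values for two made-up datasets.
example_datasets = {
    "dataset_A": [0.001, 0.004, 0.03, 0.2],
    "dataset_B": [0.008, 0.06, 0.5],
}
# Assumed thresholds: the uniform one and a relaxed, author-chosen one.
example_thresholds = {"uniform": 0.01, "relaxed": 0.05}

buffer = io.StringIO()
write_figure_table(figure_rows(example_datasets, example_thresholds), buffer)
print(buffer.getvalue())
```

Keeping a script like this next to the figure gives both the downloadable table and the description of how it was derived from the upstream files.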

# High Profile Articles

High profile companions have their own page: High_profile_companions

# Other (non GR, GB) Companion Papers

Please list your other papers to be included in the web site here:

• List starts here

## Extensive promoter-centered chromatin interactions provide a topological basis for transcription regulation. Cell. 2012 Jan 20;148(1-2):84-98

• Authors

Li G, Ruan X, Auerbach RK, Sandhu KS, Zheng M, Wang P, Poh HM, Goh Y, Lim J, Zhang J, Sim HS, Peh SQ, Mulawadi FH, Ong CT, Orlov YL, Hong S, Zhang Z, Landt S, Raha D, Euskirchen G, Wei CL, Ge W, Wang H, Davis C, Fisher-Aylor KI, Mortazavi A, Gerstein M, Gingeras T, Wold B, Sun Y, Fullwood MJ, Cheung E, Liu E, Sung WK, Snyder M, Ruan Y

• Abstract

Higher-order chromosomal organization for transcription regulation is poorly understood in eukaryotes. Using genome-wide Chromatin Interaction Analysis with Paired-End-Tag sequencing (ChIA-PET), we mapped long-range chromatin interactions associated with RNA polymerase II in human cells and uncovered widespread promoter-centered intragenic, extragenic, and intergenic interactions. These interactions further aggregated into higher-order clusters, wherein proximal and distal genes were engaged through promoter-promoter interactions. Most genes with promoter-promoter interactions were active and transcribed cooperatively, and some interacting promoters could influence each other implying combinatorial complexity of transcriptional controls. Comparative analyses of different cell lines showed that cell-specific chromatin interactions could provide structural frameworks for cell-specific transcription, and suggested significant enrichment of enhancer-promoter interactions for cell-specific functions. Furthermore, genetically-identified disease-associated noncoding elements were found to be spatially engaged with corresponding genes through long-range interactions. Overall, our study provides insights into transcription regulation by three-dimensional chromatin interactions for both housekeeping and cell-specific genes in human cells.

## UNCOVERING TRANSCRIPTION FACTOR MODULES USING ONE-DIMENSIONAL AND THREE-DIMENSIONAL ANALYSES. J. Biol. Chem. 2012 In Press

• Authors

Xun L, Farnham PJ, Jin VX

• Abstract

Transcriptional regulation is a critical mediator of many normal cellular processes as well as disease progression. Transcription factors (TFs) often co-localize at cis-regulatory elements on the DNA, form protein complexes, and collaboratively regulate gene expression. Machine learning and Bayesian approaches have been used to identify TF modules in a one-dimensional context. However, recent studies using high-throughput technologies have shown that TF interactions should also be considered in three-dimensional nuclear space. Here we describe methods for identifying TF modules and discuss how moving from a one-dimensional to a three-dimensional paradigm, along with integrated experimental and computational approaches, can lead to a better understanding of TF association networks.

## SWI/SNF Chromatin Remodeling Factors: Multiscale Analyses and Diverse Functions. J. Biol. Chem. 2012 In Press

• Authors

Euskirchen GM, Auerbach RK, Snyder M

• Abstract

Chromatin remodeling enzymes play essential roles in many biological processes, including gene expression, DNA replication, DNA repair and cell division. Although one such complex, SWI/SNF, has been extensively studied, new discoveries are still being made. Here we review SWI/SNF biochemistry, highlight recent genomic and proteomic advances and address the role of SWI/SNF in human diseases including cancer and viral infections. These studies have greatly increased our understanding of complex nuclear processes.

## Genome-wide studies of CTCF and cohesin provide insight into chromatin structure and regulation. J. Biol. Chem. 2012 In Press

• Authors

Lee B-K, Iyer VR

• Abstract

Eukaryotic genomes are organized into higher-order chromatin architectures by protein-mediated long-range interactions in the nucleus. CTCF, a sequence-specific transcription factor, serves as a chromatin organizer in building this complex chromatin structure by linking chromosomal domains. Recent genome-wide studies mapping the binding sites of CTCF and its interacting partner, cohesin, using chromatin immunoprecipitation coupled with deep sequencing (ChIP-seq) reveal that CTCF globally co-localizes with cohesin. This partnership between CTCF and cohesin is emerging as a novel and perhaps pivotal aspect of gene regulatory mechanisms, in addition to playing a role in the organization of higher-order chromatin architecture.

## Genome-wide Epigenetic Data Facilitate Understanding of Disease Susceptibility Association Studies. J. Biol. Chem. 2012 Mini-review, In Press

• Authors

Ross C. Hardison

• Abstract

Complex traits, such as susceptibility to diseases, are determined in part by variants at multiple genetic loci. Genome-wide association studies can identify these loci, but most phenotype-associated variants lie distal to protein-coding regions and likely are involved in regulating gene expression. Understanding how these genetic variants affect complex traits depends on the ability to predict and test the function of the genomic elements harboring them. Community efforts such as the ENCODE Project provide a wealth of data about epigenetic features associated with gene regulation. These data enable the prediction of testable functions for many phenotype-associated variants.

## Integrative annotation of chromatin elements from ENCODE data

• Authors

Michael M. Hoffman*, Jason Ernst*, Steven P. Wilder, Anshul Kundaje, Robert S. Harris, Max Libbrecht, Belinda Giardine, Jeffrey A. Bilmes, Ewan Birney, Ross C. Hardison†, Ian Dunham†, Manolis Kellis, William Stafford Noble

• Abstract

While the annotation of protein-coding genes has benefited from the presence of strong evolutionary conservation and primary sequence signals, the annotation of regulatory elements in the remaining 98% of the human genome has remained a great challenge. The ENCODE project has generated a wealth of experimental information mapping a diversity of chromatin properties and histone modifications in several human cell lines. While each of these tracks is independently informative towards the annotation of regulatory elements, their complex interrelations are not yet understood, requiring the development of new computational methods for integrating this information for the systematic annotation of regulatory elements. Here, we apply unsupervised learning methodologies to generate an interpretable summary of the massive and complex data sets of the ENCODE Project. We describe two related machine learning methods for converting dozens of ENCODE functional genomics data sets into discrete annotation maps of regulatory and chromatin elements along the human genome. These methods automatically rediscover and succinctly summarize diverse aspects of the architecture of the human genome, including transcribed genes, transcription start sites, gene-proximal promoter regions, distal regulatory elements, insulator regions, and diverse other classes of regulatory elements. They also help interpret evolutionarily-conserved regions that had remained uncharacterized, providing biochemical support for recently-defined conserved elements based on biased patterns of substitution. The two chromatin state annotations are highly consistent with each other, enabling us to produce a single, human-interpretable summary of the functional architecture of the human genome in a joint segmentation. 
Overall, our results provide a foundation for interpreting the human genome through the lens of ENCODE experimental datasets, and a general approach for systematic integration of large-scale experimental datasets.

## Adaptive calibrated measures for rapid automated quality control of massive collections of ChIP-seq experiments (Was GBCP019, In re-submission, For October/November addition)

• Authors

Anshul Kundaje*, Yungsook L. Jung*, Peter V. Kharchenko, Barbara J. Wold, Arend Sidow, Serafim Batzoglou, Peter J. Park

• Abstract

In recent years Chromatin immunoprecipitation with massively parallel sequencing (ChIP-seq) has emerged as the predominant tool for generating high resolution, genome-wide maps of DNA binding proteins and chromatin modifications. Large-scale projects such as the ENCODE and modENCODE consortia are generating comprehensive collections of ChIP-seq and other functional genomics datasets across diverse organisms. With the rapid increase of available ChIP-seq datasets, there is a need for development and systematic evaluation of robust data quality measures. We suggest new, robust data quality measures based on cross-correlation analysis of strand-specific, genome-wide tag density profiles. The measures reflect the inherent degree of IP fragment clustering in a dataset and are highly representative of signal-to-noise. They can be computed efficiently, post read alignment without any parameter tuning. We also present other popular data quality measures and explore their relationship to each other and various data parameters using a cohort of ~1000 transcription factor and ~500 histone modification ChIP-seq datasets from the human ENCODE consortium. We provide supporting evidence from drosophila modENCODE datasets. The newly proposed measures show strong correlation with other IP enrichment measures, downstream peak calling results and data reproducibility. Flagged low quality datasets that were repeated and replaced showed significant improvement in data quality. The cross-correlation analysis can also be used to detect sub-optimal sequencing depths and size selection problems in ChIP-seq datasets. We provide a systematic comparison of replicate datasets and a classification of transcription factor antibodies and datasets based on data quality. The proposed measures can also be used to evaluate data quality of other sequencing based functional genomics datasets such as open chromatin assays (DNase-seq, FAIRE-seq) and nucleosome sequencing datasets (MNase-seq). 
We show that strand cross-correlation analysis coupled with other complementary data quality measures can be used to build a robust, automated, quality control pipeline.
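
To make the idea of strand cross-correlation concrete, here is a minimal Python sketch. The abstract does not spell out the exact computation, so this assumes a plain Pearson correlation between per-position plus-strand and minus-strand tag counts, evaluated over a range of shifts. The function names and the toy profiles are invented for illustration, and the quality scores derived in the paper are not reproduced here.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def strand_cross_correlation(plus, minus, max_shift):
    """Correlate plus-strand counts with minus-strand counts at each shift.

    At shift s, position i on the plus strand is paired with position
    i + s on the minus strand, so a peak in the profile suggests the
    typical distance between the two strands' tag clusters.
    """
    if len(plus) != len(minus):
        raise ValueError("strand profiles must have the same length")
    n = len(plus)
    if max_shift >= n:
        raise ValueError("max_shift must be smaller than the profile length")
    return {
        shift: pearson(plus[: n - shift], minus[shift:])
        for shift in range(max_shift + 1)
    }


# Toy profiles (made up): minus-strand clusters sit 5 positions downstream.
toy_plus = [0] * 60
toy_minus = [0] * 60
toy_plus[10], toy_plus[40] = 5, 3
toy_minus[15], toy_minus[45] = 5, 3

profile = strand_cross_correlation(toy_plus, toy_minus, max_shift=10)
best_shift = max(profile, key=profile.get)
print(best_shift)
```

On real data the profiles would be genome-wide tag densities built from aligned reads, and the shape of the whole correlation curve, not only its maximum, would be of interest.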

## Reproducibility measures for automatic threshold selection and quality control in ChIP-seq datasets (Was GRCP020, In submission, For October/November addition)

• Authors

Anshul Kundaje, Qunhua Li, James B. Brown, Joel Rozowsky, Arif Harmanci, Steven Wilder, Serafim Batzoglou, Ian Dunham, Mark Gerstein, Ewan Birney, Arend Sidow, Peter Bickel.

• Abstract

We developed a computational pipeline for automatic threshold selection and quality control for ChIP-seq data. This pipeline implemented a recently developed statistical method, "irreproducible discovery rate" (IDR), which is a formal measure of the evidence for signals to be reproducible between replicates. Based on this measure, this pipeline provides a unified framework, comparable across different data sets, labs and peak callers, to measure the reproducibility of replicate experiments and select peaks based on their reproducibility.

Using this pipeline and other graphical quality tools, we carried out a comprehensive quality assessment for the peaks identified from ENCODE production ChIP-seq data (~500 unique datasets with at least two replicates from six labs across 77 cell lines) using three peak callers (SPP, MACS and Peakseq). Using a universal IDR threshold, we processed a wide range of data quality and binding characteristics, including both sparse and ubiquitous binding. The peaks selected by our method show consistent overlap ratios with binding motifs and known targets across data sets for transcription factors with well-defined motifs, which demonstrates the potential of our method for automatic threshold selection. Moreover, our pipeline successfully identified samples with poor quality and extracted the maximum signal from datasets with replicates of varying quality using a rescue strategy. Our results also showed that, despite marked algorithmic differences, the three peak callers generate very similar peak sets on most datasets in our study.

## Context-specific functional associations of transcription factors and their effect on gene expression (Was GRCP018, In preparation, For October/November addition)

• Authors

Anshul Kundaje, Manoj Hariharan, Nadine Hussami, Alan P Boyle, Zhengqing Ouyang, Yong Cheng, Stephen Landt, Arend Sidow, Serafim Batzoglou, Michael Snyder

• Abstract

Genes are transcriptionally regulated by complex combinations of transcription factors (TFs) in a cell-type and locus specific manner. We integrated over 200 ENCODE transcription factor ChIP-seq datasets in five cell-lines with RNA-seq and CAGE data to reveal context-specific co-localization of transcription factors and their effect on gene expression. Using a target factor-centric approach, we identified all significant pairwise and higher order combinatorial co-associations between transcription factors in the context of the genome-wide binding landscape of each target transcription factor. This method allowed us to differentiate between primary partners of each target factor that tend to strongly coassociate with the target across its entire binding landscape and localized partners that co-bind with the target factor at specific subsets of loci. Thus, we were able to extract specific 'biclusters' consisting of sub-populations of sites of each target factor that preferentially associated with distinct, mutually exclusive or partially overlapping sets of partners. We found that biclusters involving a focus TF with different binding partners often regulated genes with distinct enrichments of gene-ontology functional categories. We revealed associations that were preferentially enriched in gene-proximal and gene-distal subpopulations of binding sites of each target factor. By integrating association preferences across all target factors in a cell-line, we obtained functional classes of transcription factors that show distinct preferences of co-binding with partners. We also found that some transcription factors switched their preferred partners in different cell-lines. Our analyses identified a large number of known associations as well as several novel ones. To corroborate the learned associations based on ChIP-seq data, we learned analogous association models between known and discovered motifs as well as inferred footprinting data based on integration of motif data and DNase-seq data. We found high consistency between these analyses. Using a complementary gene-centric view, we built non-linear regression models linking transcription factor binding in proximal promoter regions to RNA-seq and CAGE based gene expression data. We identified relevant co-associations of transcription factors that regulated specific subsets of genes and could explain differences in expression. These models automatically revealed complex non-linear activating and repressive effects of different transcription factors and transcription factor combinations. Finally, we cross referenced RNAi-based transcription factor knockdown data for 10 transcription factors with their corresponding ChIP-seq based binding landscape to identify potential direct and indirect target genes. Our analysis provides a partial view of the highly complex context-specific regulatory code across multiple genomic environments and cell types.

# Genome Research Companion Papers

Register your GR companion papers here. Take the next available code number from this list and delete it. Add more to the list if it gets short :-)

• GRCP070
• GRCP071

# RNA

## GRCP001: GENCODE: The reference human genome annotation for the ENCODE project.

• Authors

WTSI, Lausanne, CRG, UCSC, WashU, MIT, Yale, CNIO

• Abstract

The publication of the human genome over 10 years ago led to the realisation that developmental complexity is not related to gene number. Recent studies would suggest that a significant proportion of our genome is transcribed, however only 1-2% of the genome is associated with protein-coding loci. The GENCODE consortium aims to identify all gene features in the human genome, using a combination of computational analysis, manual annotation and experimental validation. Since the first public release of this merged annotation set in 2009 the number of protein coding loci has reduced, yet the alternative splicing transcript annotation has steadily increased. The GENCODE 7 release, which has been used as reference annotation for both the ENCODE and FANTOM5 projects, contains approximately 138,000 annotated transcript models at 20,687 protein-coding and 9640 long non-coding RNA loci. Comparing this dataset with other publicly available resources we found that GENCODE v7 has greater than 2 fold more transcripts annotated than UCSC genes and RefSeq. Also it has the most comprehensive annotation of lncRNA loci publicly available with the predominant transcript form consisting of 2 exons. We have also examined how complete transcripts are in GENCODE and found 35% of GENCODE transcriptional start sites supported by CAGE clusters and 62% of protein coding genes have polyA sites annotated. Other data from ENCODE now being integrated into the GENCODE annotation pipeline include mass spectrometry (MS) data. Over a third of GENCODE protein-coding genes are covered by peptide hits derived from spectra submitted to Peptide Atlas. In addition, new models derived from the Illumina "body map" RNAseq data identify 3689 new loci not currently in GENCODE; however 3127 consist of two exon models indicating that they are possibly unannotated long non-coding loci. In addition 200 Scripture models that had high coding potential when using PhyloCSF (Lin et al. 2011) have highlighted 15 new protein-coding loci not present in any other public annotation dataset. The GENCODE gene set is publicly available from http://www.gencodegenes.org, and via the UCSC and Ensembl browsers and is updated every 4 months.

• Bullet points
• Datasets used: Gencode v7

## GRCP002: The GENCODE v7 catalogue of human long non-coding RNAs: Analysis of their gene structure, evolution and expression.

• Authors

Derrien T*1, Johnson R*1, Bussotti G1, Tanzer A1, Djebali S1, Tilgner H1, Guernec G2, Martin D1, Merkel A1, Gonzalez D1, Lagarde J1, Veeravalli L7, Ruan X7, Ruan Y7, Lassmann T8, Carninci P8, Brown JB3, Lipovich L4, Gonzalez J5, Thomas M5, Davis CA6, Shiekhattar R9, Gingeras TR6, Hubbard T5, Notredame C1, Harrow J5, Guigó R1,10.

• Abstract

The human genome contains many thousands of long non-coding RNAs (lncRNAs). While several studies have demonstrated compelling biological and disease roles for individual examples, analytical and experimental approaches to investigate these genes have been hampered by the lack of comprehensive lncRNA annotation. Here we present and analyze the most comprehensive human lncRNA annotation to date, produced by the GENCODE consortium within the framework of the ENCODE project and comprising 9,277 manually-annotated genes producing 14,880 transcripts. Our analyses indicate lncRNAs are generated through pathways similar to that of protein-coding genes, with similar histone modifications profiles, splicing signals, and exon/intron lengths. In contrast to protein-coding genes, however, lncRNAs display a striking bias towards two-exon transcripts, they are predominantly localized in the chromatin and nucleus, and a fraction appear to be preferentially processed into small RNAs. They are under stronger selective pressure than neutrally evolving sequences—particularly in their promoter regions, which display levels of selection comparable to protein-coding genes. Importantly, about one third seem to have arisen within the primate lineage. Comprehensive analysis of their expression in multiple human organs and brain regions shows that lncRNAs are generally less expressed than protein-coding genes, and display more tissue-specific expression patterns, with a large fraction of tissue-specific lncRNAs expressed in the brain. Expression correlation analysis indicates that lncRNAs show particularly striking positive correlation with the expression of antisense coding genes. This GENCODE annotation represents a valuable resource for future studies of lncRNAs.

• Bullet points

Gencode v7 annotated ~15,000 lncRNAs. lncRNAs are lowly expressed and show a tissue-specific expression pattern. lncRNAs are preferentially localized in the nucleus.

• Datasets used

Gencode v7 annotation, CSHL RNASeq, HBM RNASeq, Histone modifications.

## GRCP003: Chromatin mediated regulation of alternative splicing.

• Authors

Hagen Tilgner, Joao Curado, Camilla Inanone, Juan Valcarcel, Ben Brown, Roderic Guigo (CRG, CSHL, Berkeley)

• Abstract

Splicing, the removal of introns from eukaryotic pre-messenger RNAs, is an incompletely understood process. Alternative splicing has been shown to affect between 74 and 100% of human multi-exon genes and is of considerable importance in disease. Despite considerable advances, it is still not possible to predict complete exon-intron structures or inclusion levels of alternative exons from binding of splicing factors. During the last two years, a wave of genomic analyses and experimental papers have shown that there is a strong connection between chromatin variables and splicing decisions, including cell-type specific alternative splicing decisions. Here we make use of a wealth of RNAseq and chromatin-ChIPseq data from the ENCODE project in order to elucidate the interactions between chromatin and splicing. First we make use of nuclear polyadenylated ENCODE RNAseq data and investigate the dynamics of alternative splicing across cell lines. We determine statistically differentially included exons in all pairwise cell comparisons. We find that between 500 and 1000 genes are affected by differential exon inclusion per cell type comparison. Bioinformatic and RT-PCR analysis shows that the majority of these alternative-exon-calls correspond to bona-fide alternative exons. We then monitor histone modification ChIPseq data on these alternative exons and find that the levels of a number of modifications, notably H3K9ac, consistently change with changes in exon inclusion between cell lines. We further verified by individual ChIP that the changes in H3K9ac levels correlate with changes in exon inclusion. Finally we make use of regression models in order to predict cell type specific exon inclusion from histone modifications. We show that histone modifications contribute to predicting exon inclusion, and that the models inferred in one cell type are broadly valid in other cell types. Our results show for the first time a role of chromatin structure in the genome wide regulation of alternative splicing.

• Bullet points
• Datasets used

Histone Modifications and RNAseq in K562, GM12878, H1-hESC, HeLa-S3, HepG2 and HUVEC; GENCODE V7

## GRCP005: CAGE analysis of cell compartments specific coding and non-coding RNA.

• Authors

Timo Lassmann, Piero Carninci

• Abstract

We have applied cap-analysis gene expression (CAGE) to identify mRNA/noncoding RNA transcription start sites (TSSs), simultaneously measure their expression, and analyze the features of TSSs (consensus sequences, promoters, regulatory elements). Altogether, we have characterized the poly-A plus and poly-A-minus transcriptome of cytosolic, nuclear and nuclear sub-compartments (chromatin, nucleoplasm and nucleolus). This allowed us to identify a set of key features of the transcriptome, which can be summarized as follows: (1) the nuclear transcriptome shows much larger complexity than the cytoplasmic one; (2) the poly-A minus fractions are more complex than the poly-A plus fractions; (3) a large part of the TSSs can be mapped to unconventional sites, like introns or novel intergenic regions; (4) retrotransposon elements-derived RNAs are mostly localized in the nuclear fractions, although there are specific repeats enriched in the cytoplasm; LINE-derived RNAs are frequently associated with the chromatin; (5) protein coding mRNAs are associated to specific cell compartments. Some of the compartment TSSs may not only represent promoters, but also cleavage-recapping events: our analysis shows different features of the initiation sites that allow CAGE promoters to be separated from other events. Altogether, the CAGE data provide a very comprehensive dataset that complements the ENCODE efforts to identify TSSs and regulatory regions.

• Bullet points

(1) the nuclear transcriptome shows much larger complexity than the cytoplasmic one; (2) the poly-A minus fractions are more complex than the poly-A plus fractions; (3) a large part of the TSSs can be mapped to unconventional sites, like introns or novel intergenic regions; (4) retrotransposon elements-derived RNAs are mostly localized in the nuclear fractions, although there are specific repeats enriched in the cytoplasm; LINE-derived RNAs are frequently associated with the chromatin; (5) protein coding mRNAs are associated to specific cell compartments.

• Datasets used

RIKEN CAGE

## GRCP006: An integrative analysis of the ENCODE transcriptome with polymerase and TAF1.

• Authors

Wold lab, Myers Lab

• Abstract

Major goals of the ENCODE project are the complete characterization of transcriptional units in the human genome and understanding of the connection between regulatory inputs and transcriptional output. To this end, we combine deep paired-end RNA-seq data across 10 different human cell lines and 16 human tissues to map and quantify the polyadenylated portion of the human transcriptome. We examine transcriptome diversity, catalogue [number] candidate noncoding intergenic loci and transcripts, [number] alternative splicing isoforms, [number] 5 and [number] 3 transcript ends novel relative to the GENCODE v7 annotation and characterize their abundance and extent of usage across cell types and tissues. We integrate the results with orthogonal ChIP-seq data on RNA polymerase and the TAF1 subunit of the TFIID general transcription factor in a subset of cell lines. Using these data, we confirm transcript model predictions and characterize the relationship between the loading of those factors and transcriptional output and the extent of transcriptional and posttranscriptional regulation in the genome across different abundance levels and transcript classes.

• Bullet points
• Datasets used

GENCODE v7, Caltech RNA-Seq, Illumina Body Map RNA-seq data, HA Pol2 and TAF1, GIS RNA PET, CAGE

## GRCP008: Deep sequencing of subcellular RNA fractions shows splicing to be predominantly co-transcriptional in the human genome but inefficient for lncRNAs

• Authors

CSHL and CRG: Hagen Tilgner, David González Knowles, Rory Johnson, Carrie Davis, Sudipto Chakrabortty, Sarah Djebali, João Curado, Michael Snyder, Thomas Gingeras and Roderic Guigó

• Abstract

Splicing remains an incompletely understood process. Recent findings suggest that chromatin structure participates in its regulation. Here, we analyze the RNA from sub-cellular fractions obtained through RNASeq in the cell line K562. We show that in the human genome, splicing occurs predominantly during transcription. We introduce the coSI measure, based on RNASeq reads mapping to exon junctions and borders, to assess the degree of splicing completion around internal exons. We show that, as expected, splicing is almost fully completed in cytosolic polyA+ RNA. In chromatin-associated RNA (which includes the RNA that is being transcribed), for 5.6% of exons, the removal of the surrounding introns is fully completed, compared to 0.3% of exons for which no intron-removal has occurred. The remaining exons exist as a mixture of spliced and fewer unspliced molecules, with a median coSI of 0.75. Thus, most RNAs undergo splicing while being transcribed - co-transcriptional splicing. Consistent with co-transcriptional spliceosome assembly and splicing, we have found significant enrichment of spliceosomal snRNAs in chromatin associated RNA compared to other cellular RNA fractions and other non-spliceosomal snRNAs. CoSI scores decrease along the gene, pointing to a first transcribed, first spliced-rule, yet more downstream exons carry other characteristics, favoring rapid, co-transcriptional intron removal. Exons with low coSI values (that is, in the process of being spliced) are enriched with chromatin marks, consistent with a role for chromatin in splicing during transcription. For alternative exons and long non-coding-RNAs splicing tends to occur later, and the latter might remain unspliced in some cases.
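
The abstract does not give the coSI formula, so the Python sketch below is only a simplified stand-in: it assumes a per-exon score equal to the fraction of informative reads that span exon-exon junctions (spliced) rather than exon-intron borders (unspliced). The function names and the read counts are hypothetical and chosen purely for illustration; the paper's own definition may weight the read classes differently.

```python
from statistics import median


def splicing_completion(junction_reads, boundary_reads):
    """Fraction of informative reads that are spliced, or None if no reads.

    junction_reads: reads spanning exon-exon junctions around the exon.
    boundary_reads: reads spanning the exon's exon-intron borders.
    This ratio is an assumed simplification, not the published coSI formula.
    """
    total = junction_reads + boundary_reads
    if total == 0:
        return None
    return junction_reads / total


def median_completion(exons):
    """Median score over exons that have at least one informative read."""
    scores = [
        score
        for score in (splicing_completion(j, b) for j, b in exons)
        if score is not None
    ]
    return median(scores) if scores else None


# Made-up (junction_reads, boundary_reads) pairs for four internal exons.
toy_exons = [(30, 10), (0, 0), (10, 0), (0, 10)]
print(median_completion(toy_exons))
```

Under this simplification a score of 1 corresponds to an exon whose surrounding introns are fully removed in the sampled reads, and a score of 0 to an exon with no observed intron removal.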

• Bullet points
• Datasets used

Long and short RNAseq datasets from all compartments in the K562 cell line (CSH), nuclear polyA+ RNA from all cell lines (CSH), hg19, GENCODE v3c (currently), chromatin modification sequencing in K562 (Bernstein lab), Polymerase in K562 (HudsonAlpha), nucleosomes in K562

## GRCP029: Global RIP-Chip Analysis of HuR RNA Binding-Protein in Multiple ENCODE Cell Lines Reveals a Highly Interconnected Network of Target Genes.

• Authors

Yidong Chen, Francis Doyle, Sabarinath Jayaseelan, Kihoon Yoon, Doderer, Uthra Suresh, Luiz O. Penalva and Scott A. Tenenbaum*

• Abstract

RNA binding proteins (RBPs) play major roles in post-transcriptional regulation, including splicing, processing, RNA transport, localization, decay and translation. The human genome is predicted to have more than 1,000 RBPs but for the large majority of them, characterization remains limited, especially for binding site recognition and target RNA association. New advances in genomic technology are rapidly changing this scenario and a picture is emerging that RBPs regulate specific biological processes by orchestrating the expression of functionally related mRNAs; the so-called post-transcriptional operon model. Layers of complexity can be added as combinations of RNA subsets can change in response to cell perturbation and context, especially in certain cell types. To better clarify the broader, global role of RBP-mediated regulation, we performed RIP-Chip experiments in a panel of five ENCODE cell lines targeting ELAV1/HuR, one of the best characterized RBPs, which is involved in a variety of cellular processes including inflammation, tumorigenesis, apoptosis and cell proliferation. Although well studied, ELAV1/HuR research has been limited to focused studies in specific cell types and in response to specific biological processes. Our work aimed to look at the global as well as specific function of ELAV1/HuR by using multiple ENCODE cell lines. Our analysis shows that when multiple cell types are studied, interesting mRNA association patterns of regulation emerge, with discrete RNA subsets being shared by most or all cell lines in addition to other cell type specific subsets. Gene ontology and pathway analyses identified Cell Cycle, RNA Post-Transcriptional Modification, DNA Replication and Gene Expression as global core processes regulated by ELAV1/HuR regardless of cell type.
Of note is that although these pathways appear central to the general function of ELAV1/HuR, some of the specific genes in these pathways differed with each cell type, suggesting redundancy of function but not necessarily mRNA target specificity. As expected, target sets display enrichment for U-rich sequences in the 3' UTR in comparison to the overall transcriptome. A strong overlap between these U-rich sequences and predicted miRNA sites suggests interplay between ELAV1/HuR and miRNA targeting, and therefore another layer of complexity in post-transcriptional networks.

• Bullet points
• Datasets used

## GRCP030: Genome wide annotation of pseudogenes and analysis of their transcription, functional genomics and evolutionary constraints

• Authors

B Pei, C Sisu, M Gerstein & others (Yale), J Harrow & A Frankish (Sanger), R Harte & M Diekhans (UCSC)

• Abstract

The GENCODE project identified a list of 11,216 pseudogenes at the genome-wide scale using HAVANA manual annotation and two automatic pipelines (PseudoPipe and retroFinder). Within this annotation list, there are 8,248 processed pseudogenes, 2,127 unprocessed pseudogenes, 138 unitary pseudogenes, 161 immunoglobulin pseudogenes, 21 T-cell receptor pseudogenes and 521 unclassified pseudogenes. However, the manual annotation process is not complete for all chromosomes and we estimate that a few thousand more pseudogenes will eventually be included in the genome-wide total.

A variety of analyses for transcription, functional genomics and conservation were carried out against the GENCODE pseudogenes. Expression of pseudogenes was studied by pooling evidence from EST and mRNA databases, RNAseq data and proteomics data from multiple tissues and cell lines. Pseudogenes with one or more pieces of expression evidence were subject to further experimental validation by RT-PCR. We examine the chromatin states and transcription factor binding sites of each pseudogene, and input the information to integrative models for prediction of active pseudogenes. The results demonstrated that the transcribed pseudogenes, on average, maintain more active chromatin states than the untranscribed pseudogenes. Evolutionary constraints on pseudogenes were also studied by identifying constrained elements within each pseudogene sequence. Again, the results showed that transcribed pseudogenes are under higher selective pressure than untranscribed pseudogenes. The results on transcription, functional genomics and evolutionary constraints of pseudogenes are stored in a flat file with cross references to the GENCODE annotation. This file serves as a comprehensive resource for the study of individual pseudogenes or subgroups of pseudogenes.
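The category counts quoted in the abstract can be checked against the stated total of 11,216 with a few lines of arithmetic. The dictionary keys below are shorthand labels chosen for this sketch; only the numbers come from the abstract.

```python
# Pseudogene counts per category, as quoted in the GRCP030 abstract.
pseudogene_counts = {
    "processed": 8248,
    "unprocessed": 2127,
    "unitary": 138,
    "immunoglobulin": 161,
    "t_cell_receptor": 21,
    "unclassified": 521,
}

# Sum of all categories; this should match the quoted genome-wide total.
total = sum(pseudogene_counts.values())

# Share of the annotation made up of processed pseudogenes.
processed_fraction = pseudogene_counts["processed"] / total

print(f"total = {total}, processed share = {processed_fraction:.1%}")
```

The categories sum to the quoted total, and processed pseudogenes account for roughly three quarters of the list.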

• Bullet points
• Datasets used

## GRCP031: Comparison of Upstream Functional Genomics Data in Duplicated Genes and Pseudogenes

• Authors

C Sisu, B Pei, M Gerstein & others (Yale), J Harrow & A Frankish (Sanger), R Harte & M Diekhans (UCSC)

• Abstract

We compare the upstream regions of paralogous genes and pseudogenes in terms of transcription factor binding and patterns of chromatin modification. In particular, we analyze the degradation of the ChIP-seq signal and motif occurrence in relation to sequence similarity. We find that the number but not the exact positioning of motifs correlates with the similarity of ChIP-seq peaks. We also find that the patterns of preservation differ markedly between duplicated and processed pseudogenes.

• Bullet points
• Datasets used

## GRCP043: A Survey of RNA Editing in the human ENCODE RNA-seq data

Eddie Park1,2, Brian Williams3,4, Barbara Wold3,4, Ali Mortazavi1,2

1. Department of Developmental and Cell Biology, University of California Irvine, Irvine, CA 92697
2. Center for Complex Biological Systems, University of California Irvine, Irvine, CA 92697
3. Division of Biology, California Institute of Technology, Pasadena, CA 91125
4. Beckman Institute, California Institute of Technology, Pasadena, CA 91125

RNA editing is the process by which individual bases in transcripts are changed a fraction of the time from the sequence in the genome. These changes, which typically occur in a fraction of the transcripts, can result in non-synonymous substitutions, changes to splicing, nuclear retention of mRNA, and alterations to RNA interference. If all genomic SNPs or private mutations were known, and if cDNA sequencing were error-free, we should be able to read and quantify RNA editing from RNA-seq data. As part of the ENCODE project, we have surveyed 2x75 bp whole-cell polyA RNA-seq data across 14 different human cell lines for candidate RNA editing events within the boundaries of Gencode version 7 annotated genes. We identified candidate editing events that occurred in biological replicates by mapping stringently to the genome and to known splice junctions, filtering for coverage, looking for a minimum frequency of non-redundant, uniquely mappable reads, and excluding known genomic SNPs. We found that the majority (60 to 80%) of the RNA variants not in dbSNP are of the A-to-G RNA editing type, and that up to 3,700 novel genic A-to-I candidates per cell line are primarily located within introns and 3' UTRs and are only rarely located within exons. Nearly 80% of these SNVs mapped in Alu repeats and another 8% mapped to all other repeat families. Unlike A-to-G, the majority (60-80%) of other base variants we detected mapped to within 5 bases of Gencode intron-exon boundaries, and they are likely to be computational mapping artifacts. The distribution of the remaining non-A-to-G variants resembles that of known SNPs, being located more frequently within coding exons. These are likely to be dominated by private mutations, evidence of which can be found in ChIP-seq data collected from the same samples, thus highlighting the importance of careful mapping of RNA-seq data for such analyses.
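The filtering steps named in the abstract (coverage, minimum variant frequency, replicate support, exclusion of known SNPs) can be sketched as a simple pass over candidate variants. This is a minimal illustration, not the authors' pipeline: the `Variant` record, the function names, and all threshold values are hypothetical, and strand handling (an A-to-G edit appears as T-to-C for minus-strand genes) is deliberately left out.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass
class Variant:
    chrom: str
    pos: int
    ref: str          # genomic reference base
    alt: str          # base observed in the RNA-seq reads
    coverage: int     # non-redundant, uniquely mapped reads covering the site
    alt_reads: int    # reads supporting the variant base
    replicates: int   # biological replicates in which the variant was seen


def candidate_editing_sites(
    variants: Iterable[Variant],
    known_snps: Set[Tuple[str, int]],
    min_coverage: int = 10,          # hypothetical threshold
    min_alt_fraction: float = 0.1,   # hypothetical threshold
    min_replicates: int = 2,         # hypothetical threshold
) -> List[Variant]:
    """Keep variants that pass every filter; thresholds are illustrative."""
    kept = []
    for v in variants:
        if (v.chrom, v.pos) in known_snps:
            continue  # exclude known genomic SNPs
        if v.coverage < max(min_coverage, 1):
            continue  # too few reads (also avoids division by zero below)
        if v.alt_reads / v.coverage < min_alt_fraction:
            continue  # variant frequency too low
        if v.replicates < min_replicates:
            continue  # not reproduced across biological replicates
        kept.append(v)
    return kept


def is_a_to_g(v: Variant) -> bool:
    """True for the A-to-G type expected from A-to-I editing (plus strand only)."""
    return v.ref == "A" and v.alt == "G"


variants = [
    Variant("chr1", 100, "A", "G", coverage=40, alt_reads=12, replicates=2),
    Variant("chr1", 200, "A", "G", coverage=5, alt_reads=3, replicates=2),
    Variant("chr2", 300, "C", "T", coverage=50, alt_reads=25, replicates=2),
    Variant("chr3", 400, "A", "G", coverage=60, alt_reads=2, replicates=2),
]
kept = candidate_editing_sites(variants, known_snps={("chr2", 300)})
```

In this toy input only the first variant survives: the second fails the coverage filter, the third is a known SNP, and the fourth falls below the frequency threshold.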

## GRCP045: RNA-PET for accurate delineation of transcriptional units and gene fusion events

• Authors

Oscar J. Luo, Melissa J. Fullwood, Jayce JY. Koh, Lavanya Veeravalli, Sarah Djebali, Roderic Guigo, Carrie Davis, Tom Gingeras, Atif Shahab, Yijun Ruan, Xiaoan Ruan

• Abstract

Comprehensive understanding of cellular transcriptomes requires characterization of RNA features. Here we describe the RNA analysis with Paired-End Tags (RNA-PET) methodology to identify exact locations of transcription start sites, transcription end sites, as well as special RNA species such as fusion transcripts and trans-splicing events, which are relevant to disease models such as cancer. RNA-PET works by producing full-length cDNA, followed by capture of the 5' and 3' ends into a paired-end tag structure, which is then sequenced and mapped to the genome. Because RNA-PET does not require the sequencing of internal structures, and provides linkage information between the 5' and 3' ends, it is the most cost-effective method for de novo identification of fusion transcripts and trans-splicing events. Here, we present an updated method, RNA-PET (renamed from GIS-PET), which has more appealing properties: simplified cloning steps, integration with new next-generation sequencing methods, and more comprehensive coverage of RNA transcripts. In particular, the 5' and 3' end clusters are complementary to RNA-Seq data and are sufficient to define the start and end of transcripts, yielding complete RNA transcripts.

• Bullet points
• Datasets used

## GRCP046: The combination of RT-PCR-seq and RNA-seq is essential to catalog all genic elements encoded in the human genome

• Authors

Cédric Howald, Andrea Tanzer, Jacqueline Chrast, Felix Kokocinski, Thomas Derrien, Nathalie Walters, Jose Manuel Gonzalez, Adam Frankish, Bronwen L Aken, Thibaut Hourlier, Jan-Hinnerk Vogel, Simon White, Stephen MJ Searle, Jennifer Harrow, Tim Hubbard, Roderic Guigo, Alexandre Reymond (Lausanne, CRG, Sanger)

• Abstract

Within the ENCODE consortium, GENCODE aimed to accurately annotate all protein-coding genes, pseudogenes and non-coding transcribed loci in the human genome through manual curation and computational methods. Annotated transcript structures were rated, and lower confidence transcribed loci were systematically experimentally validated. Predicted exon-exon junctions were evaluated by RT-PCR amplification followed by highly multiplexed sequencing readout, a method we coined RT-PCR-seq. 82% of all assessed junctions are confirmed by this evaluation procedure, demonstrating the high quality of the annotation reached by the GENCODE gene set. RT-PCR-seq was also efficient for screening gene models predicted using the Human Body Map (HBM) RNAseq data. We validated 73% of these predictions, thus confirming 1168 novel genes, mostly non-coding, which will further complement the GENCODE annotation. Our novel experimental validation pipeline is extremely efficient, far more so than unbiased transcriptome profiling through RNA sequencing, which is becoming the norm. Exon-exon junctions unique to GENCODE annotated transcripts are five times more likely to be corroborated with our targeted approach than with extensive large-scale human transcriptome profiling. Datasets such as the HBM and ENCODE RNA-seq data fail to sample lowly expressed transcripts. Our targeted RT-PCR-seq approach also has the advantage of identifying novel exons of known genes, as we discovered unannotated exons in about 10% of assessed introns. We thus estimate that at least 18% of known loci have yet-unannotated exons. Our work demonstrates that the cataloging of all the genic elements encoded in the human genome will necessitate a coordinated effort between unbiased and targeted approaches, like RNA-seq and RT-PCR-seq, respectively.

• Bullet points
• Datasets used

GENCODE annotation v8, Human Body Map, ENCODE subcellularly-compartmentalized transcripts

## GRCP047: Discovery of hundreds of mirtrons in mouse and human small RNA data

• Authors

Erik Ladewig, Katsutomo Okamura, Jakub O. Westholm and Eric C. Lai (Sloan-Kettering Institute)

• Abstract

One of the prime uses of deep sequencing technology has been to annotate miRNA genes from short RNA reads. However, non-canonical miRNA substrates, which do not fit the criteria used for annotation of canonical miRNAs, can escape the notice of typical miRNA genefinders. Recent analyses of Drosophila and C. elegans showed that their catalogs of splicing-derived miRNAs (mirtrons) were far larger than previously recognized, whereas only a few tens of mammalian mirtrons have been annotated to date. Here, we perform meta-analysis of over 750 available mouse and human small RNA datasets comprising >2.5 billion raw reads. From these, we provide highly confident annotation for 169 mouse and 130 human splicing-derived miRNAs, including novel conventional mirtrons, especially frequent 5' tailed mirtrons, and new evidence for 3' tailed mirtrons in mammals. These annotations increase their numbers in mammals by nearly an order of magnitude, and we recognize additional loci with candidate (and often compelling) evidence. As with previously known mirtrons, most of these loci arose relatively recently in their respective lineages; nevertheless, alignments of rodent and primate genomes provide evidence for purifying selection for miRNA-type regulatory capacity on certain mirtrons in each of the three classes. In addition, certain loci present features suggestive of cis-regulatory influence, including mirtrons produced by alternative splicing and evidence for mirtron hairpins evolving by compensatory changes.
Detailed inspection revealed unrecognized complexities in small RNA biogenesis, including abundant 3' uridylation of mirtrons (in contrast to a trend for 3' adenylation amongst non-mirtron intron-3p reads), efficient micro-trimming of an individual 5' nt from certain mirtron-5p substrates (providing a new mechanism for re-setting seed sequences), cases of alternative hairpin dicing consistent with 5' measuring as well as unexpectedly consistent dicing despite 5' heterogeneity, and instances of dual 5' and 3' trimming during mirtron biogenesis. This enormous increase in mirtron annotations provides a new foundation for understanding the scope and evolutionary dynamics of miRNAs in mammals, and raises numerous new directions needed to fully understand miRNA biogenesis.

## GRCP058: Evolution by Gene Innovation

• Authors

Valer Gotea, Hanna Petrykowska, Laura Elnitski

• Abstract

The evolution of organismal complexity relies on the accumulation of genes with new functions. The process of gene duplication is a mechanism to which a large part of genic functional diversity is attributed, especially within the framework of protein coding genes. Only recently have a few anecdotal examples of the de novo emergence of genes been documented; however, no underlying mechanism has been outlined. We propose a mechanism underlying the emergence of new genes from anonymous genomic loci, modeled on the acquisition of bidirectional promoter activity in existing unidirectional promoters. Here we explore the predictions of this evolutionary model in the human genome, and show evidence for a significant contribution to the diversity of the human transcriptome. Moreover, we also describe a novel test for detecting elevated rates of lineage-specific substitutions as indicators of positive selection, and show that such mutations are responsible for the ultimate structure of the host transcript, and thus its function. In addition, we show that regions immediately upstream of promoters of protein-coding genes are particularly suitable for the emergence of genuinely novel transcripts because of the preferential accumulation of transposable elements that create lineage-specific genomic environments. In summary, we show that the activation of bidirectional promoters is not only the first report of a mechanism of de novo gene acquisition, but also an important mechanism for the emergence of novel genes in a lineage-specific fashion, which can thus contribute to the diversification of genic functionality in any vertebrate species.

# Long-range and large-scale phenomena

## GRCP010: Analysis of long-range interaction networks in the ENCODE pilot regions.

• Authors

Dekker lab

• Abstract
• Bullet points
• Datasets used

## GRCP012: Pattern discovery in RNA-seq and CAGE data.

• Authors

Hoffman, CSHL, RIKEN, Birney, Noble.

• Abstract
• Bullet points
• Datasets used

## GRCP013: Integration of Hi-C and ChIP-seq data reveals distinct types of chromatin hubs.

• Authors

Lan, Witt, Katsumura, Ye, Wang, Bresnick, Farnham, Jin

• Abstract

We have analyzed publicly available K562 Hi-C data, which enables genome-wide unbiased capturing of chromatin interactions, using a Mixture Poisson Regression Model to define a highly specific set of interacting genomic regions. We integrated multiple ENCODE Consortium resources with the Hi-C data, using DNase-seq data and ChIP-seq data for 46 transcription factors and 8 histone modifications. We classified 12 different sets (clusters) of interacting loci that can be distinguished by their chromatin modifications and which can be categorized into three types of chromatin hubs. The different clusters of loci display very different relationships with transcription factor binding sites. As expected, many of the transcription factors show binding patterns specific to clusters composed of interacting loci that encompass promoters or enhancers. However, cluster 6, which is distinguished by marks of open chromatin but not by marks of active enhancers or promoters, was not bound by most transcription factors but was highly enriched for 3 transcription factors (GATA1, GATA2, and c-Jun) and 3 chromatin modifiers (BRG1, INI1, and SIRT6). To validate the identification of the clusters and to dissect the impact of chromatin organization on gene regulation, we performed RNA-seq analyses before and after knockdown of GATA1 or GATA2. We found that knockdown of the GATA factors greatly alters the expression of genes within cluster 6. Our work, in combination with previous studies linking regulation by GATA factors with c-Jun and BRG1, provides genome-wide evidence that Hi-C data identifies sets of biologically relevant interacting loci.

• Bullet points
• Datasets used

DNAse-seq, 8 histone mods, 46 TF ChIP-seq: all from K562 cells and all past the embargo deadline

Published Hi-C data from Lieberman-Aiden et al.

New data: RNA-seq in K562 after GATA1 or GATA2 knockdown

GRCP013_manuscript

## GRCP044: Global integration of ChIP-seq and DNAse-seq ENCODE data using Self-Organizing Maps: the chromatin landscape of cell-type specificity

Ali Mortazavi1,2*, Shirley Pepke3,4*, Georgi Marinov4, Barbara Wold4,5

1. Department of Developmental and Cell Biology, University of California Irvine, Irvine, CA 92697
2. Center for Complex Biological Systems, University of California Irvine, Irvine, CA 92697
3. Center for Advanced Computing Research, California Institute of Technology, Pasadena, CA 91125
4. Division of Biology, California Institute of Technology, Pasadena, CA 91125
5. Beckman Institute, California Institute of Technology, Pasadena, CA 91125

A fundamental property of transcriptional regulation in metazoa is its cell type specificity, which we can measure as qualitative and quantitative differences in chromatin state. Within the ENCODE project, a systematic collection of histone modification ChIP-seq and DNAse-hypersensitivity datasets has mapped the chromatin landscape of transcriptionally active, inactive, and inaccessible regions across multiple human cell types. Small cohorts of elements that change chromatin state, in specific patterns, among the different cell types are of particular biological interest. To highlight these in a genome-wide manner and to further relate them to function and sequence-specific transcription factor binding, we adapt self-organizing maps (SOMs), an unsupervised machine-learning method for clustering, visualizing, and mining high-dimensional data, to the analysis of chromatin states. A strength of SOMs for mining integrated data is that they can cluster cohorts of similar elements at both global and increasingly local levels. We train a large, fine-grained SOM constructed using an ENCODE segmentation of the genome on 72 ChIP-seq histone mark and DNAse-seq datasets in six ENCODE Tier 1 and Tier 2 cell lines. We show that this SOM, which clusters the genome into 1350 coherent units, captures both global and cell-type specific chromatin profiles. We then overlay additional ChIP-seq and RNA-seq datasets not used in map training and find that the map highlights functional elements, such as promoters and distal regulatory elements, according to their activity across cell types. We find that specific units on the map are enriched for Gene Ontology terms and genome-wide association lead SNPs. Together, these results demonstrate that self-organizing maps are a powerful method for mining the results of large-scale segmentations of genomes.

## GRCP051: Dynamic chromatin states specify lineage-specific human DNA replication programs

• Authors

R. Scott Hansen a, Theresa K. Canfield b, Molly Weaver b, Richard Sandstrom b, Sean Thomas b, Robert E. Thurman b, and John A. Stamatoyannopoulos b

a. Department of Medicine, Division of Medical Genetics, University of Washington School of Medicine, Seattle, WA.
b. Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA.

The temporal order of DNA replication in higher eukaryotes is thought to reflect a large scale epigenetic compartmentalization of the genome that varies between cell lineages and phenotypic states. To explore this variation in an expanded set of human cell types, we ascertained whole genome replication patterns using Repli-Seq, a previously described method for ascertainment of whole genome replication time patterns that is based on high throughput DNA sequencing. We have updated Repli-Seq methods and analyses so as to allow simple comparisons of the replication program in different cell types and conditions and also to provide such features as the locations of early and delayed replication initiation zones and relative genomic copy number. Comparisons of eleven different cell types indicate that at least 75% of the genome varies in replication time using conservative thresholds. Few such differences are found when comparing replicates (0%) or seven different lymphoblastoid cell lines (6%); much of the latter variation derives from X chromosome differences in males versus females because of X inactivation. Our studies confirm that early replication is generally correlated with active chromatin states defined by ENCODE consortium data, including gene expression, open chromatin, and active chromatin protein modifications. Biphasic or asynchronous replication like that seen on the female X is also found in a limited number of autosomal regions in all cell types examined, including primary cultures. Allelic differences in replication and chromatin state are likely for some of these, but it appears that cell population differences resulting from stochastic events may be contributing to many others. Several instances of biphasic replication have disjoint segments and point to probable errors in the reference genome sequence.
Repli-Seq derived copy number is particularly useful in evaluating instances of replication time asynchrony and sharp changes in replication time that are commonly found in cancer cell lines. Lastly, we find that replication initiation zones are best correlated with the occurrence of clustered DNaseI hypersensitive sites, and that stronger DNaseI signals correspond to earlier activation times. We present DNA deletion studies that support the hypothesis that the DNA replication program is determined at such sites through LCR-like regulatory elements.

# Regulation of Transcription

## GBCP014: Cell type-specific binding patterns reveal that TCF7L2 can be tethered to the genome by association with GATA3.

• Authors

Seth Frietze, Rui Wang, Lijing Yao, Yu Gyoung Tak, Zhenqing Ye, Malaina Gaddis, Heather Witt, Peggy J Farnham, and Victor X Jin

• Abstract

Background: The TCF7L2 transcription factor is linked to a variety of human diseases, including type 2 diabetes and cancer. One mechanism by which TCF7L2 could influence expression of genes involved in diverse diseases is by binding to distinct regulatory regions in different tissues. To test this hypothesis, we performed ChIP-seq for TCF7L2 in 6 human cell lines.

Results: We identified 116,000 non-redundant TCF7L2 binding sites, with only 1,864 sites common to the 6 cell lines. Using ChIP-seq, we showed that many genomic regions that are marked by both H3K4me1 and H3K27Ac are also bound by TCF7L2, suggesting that TCF7L2 plays a critical role in enhancer activity. Bioinformatic analysis of the cell type-specific TCF7L2 binding sites revealed enrichment for multiple transcription factors, including HNF4alpha and FOXA2 motifs in HepG2 cells and the GATA3 motif in MCF7 cells. ChIP-seq analysis revealed that TCF7L2 co-localizes with HNF4alpha and FOXA2 in HepG2 cells and with GATA3 in MCF7 cells. Interestingly, in MCF7 cells the TCF7L2 motif is enriched in most TCF7L2 sites but is not enriched in the sites bound by both GATA3 and TCF7L2. This analysis suggested that GATA3 might tether TCF7L2 to the genome at these sites. To test this hypothesis, we depleted GATA3 in MCF7 cells and showed that TCF7L2 binding was lost at a subset of sites. RNA-seq analysis suggested that TCF7L2 represses transcription when tethered to the genome via GATA3.

Conclusions: Our studies demonstrate a novel relationship between GATA3 and TCF7L2, and reveal important insights into TCF7L2-mediated gene regulation.

• Bullet points
• Datasets used

## GRCP015: Sequence and chromatin determinants of cell-type specific transcription factor binding.

Aaron Arvey, Phaedra Agius, Bill Noble, Christina Leslie.

• Abstract

Gene regulatory programs in distinct cell types are maintained in large part through the cell-type specific binding of transcription factors (TFs). The binding determinants of a given TF include its own DNA sequence preferences, DNA sequence preferences of co-factors, and the local cell-dependent chromatin context. To explore the contribution of DNA sequence preference, histone modifications, and DNase accessibility to cell-type specific binding, we analyzed over 250 ChIP-seq experiments performed by the ENCODE Consortium. This analysis included experiments for 70 transcription factors, 12 of which were profiled in both the GM12878 (lymphoblastoid) and K562 (erythroleukemic) human hematopoietic cell lines. To model DNA sequence preferences, we used support vector machines (SVMs) that use flexible k-mer patterns to model sequence preferences more accurately than traditional motif approaches. In addition, we used SVM-based chromatin signature models to capture the spatial distribution of histone modifications and DNase accessibility, obtaining significantly more accurate predictions than simpler approaches. Consistent with previous studies, we find that DNase accessibility can explain cell-line specific binding for many factors. However, in contrast to these studies, we find that some TFs display distinct cell-dependent sequence preferences that can be learned by training simultaneously on ChIP-seq data from multiple cell types. Moreover, we identify cell-specific binding sites that are accessible in both cell types but bound only in one. For these sites, cell-type specific sequence models, rather than DNase accessibility, are able to explain differential binding. Our results suggest that using a single motif for each TF and filtering for chromatin accessible loci is not always sufficient to accurately account for cell-type specific binding profiles.

• Bullet points

1. Cell-type specific binding for many transcription factors can be predicted by differential DNase accessibility.
2. A subset of TFs have cell-type specific binding even when the locus is accessible in both cell lines.
3. The differential binding in these cases is best captured by differential DNA sequence preferences for a given TF, suggesting that cell-type specific cofactors determine binding.

• Datasets used

All publicly available transcription factor binding, histone modification, and DNase accessibility data for GM12878 and K562 cell lines as of 06/15/2011.

## GRCP017: Ubiquitous heterogeneity and asymmetry of the chromatin environment at regulatory elements

• Authors

Anshul Kundaje1*+, Sofia Kyriazopoulou-Panagiotopoulou1*, Max Libbrecht1*, Cheryl Smith2, Debashish Raha, Elliot Winters, Steven Johnson3, Michael Snyder4, Serafim Batzoglou1+, and Arend Sidow2+

* Joint first authors, + Corresponding authors
• Abstract

Gene regulation is governed by an interplay of nucleosome remodeling, histone modifications, and transcription factor binding at functional elements such as enhancers, promoters, and insulators. The large volume and diversity of relevant ENCODE ChIP-seq data from a variety of cell lines provided an opportunity to comprehensively relate nucleosome signals (modifications and positioning) to transcription factor binding. Using a new method, the Clustered AGgregation Tool (CAGT), we conducted exhaustive quantification of histone modifications and nucleosome positioning signals around bound transcription factors. CAGT conducts pattern discovery that accounts for the inherent heterogeneity in signal magnitude, shape and implicit strand orientation of chromatin marks at a collection of binding sites. We generated signal profiles of 11 chromatin modifications plus nucleosome positioning, H2A.Z, and DNase signal for 123 transcription factors, over several cell lines. A total of 5,084 pairwise factor-by-signal profiles reveal, for each factor, extensive but limited heterogeneity in how histone modifications are deposited, and how nucleosomes are positioned, around its binding sites. Extreme asymmetry of nucleosome positioning is the norm, with the members of the CTCF/cohesin complex being the only factors in which symmetric nucleosome positioning around the binding site is predominant. Asymmetry of histone modifications is also the norm, for all types of chromatin marks examined, including promoter, enhancer, elongation, and repressive marks. Meta-analyses of the signal profiles reveal a common vocabulary of modification patterns and nucleosome positioning.

• Bullet points
• Datasets used

All Jan 2011 Freeze ENCODE Transcription factor ChIP-seq datasets

All DNase-seq and FAIRE-seq datasets (Jan 2011 Freeze)

All Histone modification ChIP-seq datasets (Jan 2011 Freeze)

Nucleosome positioning datasets (GM12878, K562 cell lines)

## GRCP028: Predicting Cell-Type Specific Gene Expression from Regions of Open Chromatin

• Authors

Anirudh Natarajan, Galip Gürkan Yardımcı, Nathan C. Sheffield, Gregory E. Crawford and Uwe Ohler

• Abstract

Complex patterns of cell-type specific gene expression are thought to be achieved by combinatorial binding of transcription factors (TFs) to sequence elements in regulatory regions. Predicting cell-type specific expression in mammals has been hindered by the oftentimes unknown location of distal regulatory regions. To alleviate this bottleneck, we used DNase-seq data from 19 diverse cell types to identify proximal and distal regulatory elements at genome-wide scale. Matched expression data allowed us to separate genes into classes of cell-type specific up-regulated, down-regulated, and constitutively expressed genes. CG dinucleotide content and DNA accessibility in the promoters of these three classes displayed substantial differences, highlighting the importance of including these aspects into modeling gene expression. We associated DNaseI Hypersensitive Sites (DHS) with genes, and trained classifiers for different expression patterns. TF sequence motif matches in DHS provided a strong performance improvement in predicting gene expression over using proximal promoter sequences, a typical baseline approach. In particular, we achieved competitive performance when discriminating up-regulated genes from different cell types or genes up- and down-regulated under the same conditions. We identified previously known and new candidate cell-type specific regulators. The models generated testable predictions of activating or repressive functions of regulators. DNaseI footprints for these regulators were indicative of their direct binding to DNA. In summary, we successfully used information of open chromatin obtained by a single assay, DNase-seq, to address the problem of predicting cell-type specific gene expression in mammalian organisms directly from regulatory sequence.

• Datasets used
1. Duke DNase-seq data from at least 19 diverse cell types
2. Duke Affy exon array expression data from same cell types

## GRCP032: Understanding transcriptional regulation by integrative analysis of transcription factor binding data

• Authors

Yale (Gerstein lab) + others

• Abstract

Gene expression regulation is at the center of many biological processes, in which a specific family of proteins, called transcription factors (TFs), plays a critical role. The ENCODE project has quantified the expression levels of >100,000 promoters using CAGE, diTAG, and RNA-seq methods in great detail, i.e., in five cellular components of nine selected cell lines based on several different protocols. It also generated the genome-wide binding sites for more than 100 TFs in about 500 ChIP-seq datasets. We investigated the relationship between gene expression and TF binding signals using statistical models. We found that TF binding signals are highly predictive of gene expression levels, and that the levels measured by CAGE are better predicted than those measured by diTAG and RNA-seq. We divided the TFs into several different families and found that different families vary dramatically in their predictive capabilities. On the other hand, genes with high CpG content promoters can be predicted with much higher accuracy than those with low CpG content promoters. We also found that TF binding is highly predictive of the nearby histone modification signals, particularly those of the promoter-associated histone marks. Moreover, the TF binding signals are also correlated with other chromatin signals, such as DNase I hypersensitivity and signals from FAIRE experiments. Our analysis suggests a model in which TFs interact with other factors to co-regulate gene expression by affecting the local chromatin structure.

• Datasets used

TF binding, histone modification, DNase I hypersensitivity, FAIRE, CAGE, RNA-seq, diTAG, and Gencode

## GRCP033: Genome-wide analysis of the binding sites of more than 100 transcription-related factors defines different types of genomic regions with distinct biological properties

• Authors

Kevin Y. Yip, Chao Cheng, Nitin Bhardwaj, James B. Brown, Jing Leng, Anshul Kundaje, Joel Rozowsky, Ewan Birney, Peter J. Bickel, Michael Snyder and Mark Gerstein

• Abstract

Transcription factors (TFs) bind different classes of regulatory elements in regulating gene expression. One challenge in the study of gene regulation has been to provide useful annotations for these elements based on large yet incomplete experimental binding datasets. By using binding data of more than 100 TFs and related factors in about 500 ChIP-seq experiments produced by the ENCODE project consortium from multiple cell types, together with matched data for gene expression and chromatin states, we constructed statistical models that capture general properties of three paired types of genomic regions in five cell lines, which we provide as a reference resource. The three pairs include regions with active/inactive binding, extremely high/low degrees of co-occurrence of binding, and regulatory modules proximal to promoters/distal to genes, respectively. From the distal regulatory modules, we developed computational pipelines to identify potential enhancers, and successfully validated a sample of them experimentally. The six types of regions exhibit drastic differences in their chromosomal locations, chromatin features such as DNase I hypersensitivity and histone modifications, factors that bind them, and cell-type specificity. In addition, by correlating chromatin features at the distal regulatory modules and gene expression levels, we associated these modules with potential target transcripts and identified TFs potentially involved in these long-range regulations. Finally, we found significant fractions of binding at high co-occurrence regions without cognate sequence motifs, and showed that they have strong DNA accessibility, which may facilitate non-sequence-specific binding. Our study highlights the complex interplays between TFs, sequence patterns, chromatin features and gene expression.

• Datasets used

TF binding, DNase I hypersensitivity, FAIRE, histone modification, CAGE, RNA-seq and Gencode

## GRCP038: Systematic discovery and characterization of regulatory motifs associated with TF binding

• Authors

• Abstract

Recent advances in technology have led to a dramatic increase in the number of available transcription factor ChIP-seq and ChIP-chip datasets. Understanding the motif content of these datasets is an important step in understanding the underlying mechanisms of regulation.

Here we provide a systematic motif analysis for 427 human ChIP-seq datasets. We collect motifs determined largely in vitro from the literature and supplement this by performing de novo motif discovery using five popular motif discovery tools. We show that the differential enrichment is a principled way for choosing between motif variants found in the literature and for flagging potentially problematic datasets. Enrichment of unexpected discovered motifs can uncover potential co-factors and highlight key activators. We also use cell type specific p300 binding to find factors active in specific conditions.

The results are provided in a format accessible for both browsing and performing large-scale analyses. The motifs discovered here are being used in parallel studies to validate the specificity of antibodies, understand cooperativity between datasets, and measure the variation of motif binding across individuals and species.

• Datasets used

All TF ChIP-seq datasets; Gencode v4

## GRCP039: Modeling gene expression using chromatin features in various cellular contexts

• Authors

Xianjun Dong, Melissa Greven, Anshul Kundaje, Sarah Djebali, Ben Brown, Chao Cheng, Roderic Guigó Serra, Zhiping Weng

• Abstract

Background: Early studies have demonstrated that chromatin feature levels correlate with gene expression. The ENCODE Project enables us to further explore this relationship using an unprecedented volume of data. Expression levels from more than 100,000 promoters were measured using a variety of high-throughput techniques to analyze RNA extracted by different protocols from different cellular compartments of several human cell lines. ENCODE also generated the genome-wide mapping of 11 histone marks, one histone variant, and DNase I hypersensitivity sites in seven cell lines. Results: We built a novel quantitative model to study the relationship between chromatin features and expression levels. Our study not only confirms that the general relationships found in previous studies hold across various cell lines, but also makes new suggestions about the relationship between chromatin features and gene expression levels. We found that expression status and expression levels can be predicted by different groups of chromatin features, both with high accuracy. We also found that expression levels measured by CAGE are better predicted than those measured by RNA-PET or RNA-Seq, and that different categories of chromatin features are the most predictive of expression for different RNA measurement methods. Additionally, PolyA+ RNA is overall more predictable than PolyA- RNA among different cell compartments, and PolyA+ cytosolic RNA measured with RNA-Seq is more predictable than PolyA+ nuclear RNA, while the opposite is true for PolyA- RNA. Conclusions: Overall, our study provides additional insights into transcription regulation by analyzing chromatin features in different cellular contexts.

• Datasets used

histone modification, DNase I hypersensitivity, CAGE, RNA-PET, RNA-Seq, and Gencode

• Documents

## GRCP042: Modulation of the transcription factor affinity in the human genome

• Authors

Kathryn Beal, Pouya Kheradpour, Steven Wilder, Anshul Kundaje, Ian Dunham, Manolis Kellis, Ewan Birney, Javier Herrero

• Abstract

The ENCODE project aims at finding functional elements in the human genome. As part of the project, the binding of more than 120 transcription factors is being tested in approx. 80 cell lines using ChIP-seq experiments.

First, we study the per-base evolutionary conservation in mammalian genomes of the Transcription Factor Binding Motifs (TFBMs). ChIP-seq peaks are called using SPP. We use the concept of Irreproducible Discovery Rate (IDR) to focus on consistent (reproducible) peaks only. The analysis is performed with both known and discovered TFBMs. Known motifs are extracted from the Jaspar database. We supplement these with the discovery of new motifs using several methods. The best 3 motifs from each motif discovery method are considered as long as they are dissimilar enough from the other motifs (Pearson correlation < 0.75). The sequence conservation scores on the motifs are obtained with GERP on a 33-way pan-mammalian alignment made using Enredo and Pecan. For each set of bound motifs, we average the conservation scores across all the motifs. We observe a good linear correlation between the sequence conservation (SC) and the information content (IC) for most of the TFBMs.

We then focus on the differential affinity of the TF with respect to nucleotide changes at every position. The peaks are divided based on the two most common nucleotides at any given position. The distributions of scores are compared using a non-parametric test. We find several cases where a TFBM can tolerate alternative nucleotides at a particular position, but in which the affinity of the TF is affected by the actual nucleotide at that position.

Last, we compare these results with the SC vs IC correlation analysis to find positions in the genome that appear to be more conserved than expected by their IC and that seem to be involved in the modulation of the TF affinity for the sequence. For instance, the CTCF motif contains one such position, where the presence of a G or a T significantly changes the affinity for the TF. Together with two other positions in the motif that also seem to modulate the affinity for the TF, it forms a triplet, such that we find an excess of motifs with all 3 bases with greater affinity or all 3 bases with lower affinity for the TF.

In conclusion, we can establish a relationship between the nucleotide usage in the motif and the affinity of the TF for the sequence. Some of these positions display an unexpected conservation in mammals. We suggest that these positions in the motif are used to modulate the TF affinity in different promoters.
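As an illustration of the SC vs IC comparison described in this abstract, the following minimal sketch computes the per-position information content of a motif and its Pearson correlation with a vector of conservation scores. The motif matrix, the conservation values, and the function names are made up for illustration and are not taken from the paper; a uniform background is assumed, and the GERP scoring and motif discovery steps are not reproduced.

```python
import math


def information_content(column):
    """Information content (bits) of one motif column of A/C/G/T probabilities.

    Computed as 2 + sum(p * log2(p)), i.e. 2 bits minus the Shannon entropy.
    A uniform background is an assumption of this sketch.
    """
    return 2.0 + sum(p * math.log2(p) for p in column if p > 0)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical 4-position motif; each row holds P(A), P(C), P(G), P(T).
motif = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.25, 0.25, 0.25, 0.25],
    [0.7, 0.1, 0.1, 0.1],
]

# Made-up average conservation score for each motif position.
conservation = [3.1, 1.6, 0.2, 1.0]

ic = [information_content(column) for column in motif]
r = pearson(ic, conservation)
print(ic, r)
```

Positions that lie well above the trend between the two vectors would be the candidates for being "more conserved than expected by their IC".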

• Bullet points
• Datasets used

• Authors

Hao Wang1,5, Matthew T. Maurano1,5, Hongzhu Qu1,3,5, Katherine E. Varley4, Jason Gertz4, Florencia Pauli4, Kristen Lee1, Theresa Canfield1, Molly Weaver1, Richard Sandstrom1, Robert E. Thurman1, Rajinder Kaul1, Richard M. Myers4 & John A. Stamatoyannopoulos1,2,6

1 Department of Genome Sciences, University of Washington, Seattle, WA 98195 USA; 2 Department of Medicine, University of Washington, Seattle, WA 98195 USA; 3 Laboratory of Disease Genomics and Individualized Medicine, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100029, China; 4 Hudson Alpha Institute for Biotechnology, Huntsville, AL 35806, USA. 5 These authors contributed equally to this work. 6 Correspondence: jstam@uw.edu.

• Abstract

CTCF is a ubiquitously-expressed regulator of fundamental genomic processes including transcription, intra- and inter-chromosomal interactions, and chromatin structure. Because of its critical role in genome function, CTCF binding patterns have long been assumed to be largely invariant across different cellular environments. Here we analyze genome-wide occupancy patterns of CTCF by ChIP-seq in 19 diverse human cell types including normal primary cells and immortal lines. We observed highly reproducible yet surprisingly plastic genomic binding landscapes, indicative of strong cell-selective regulation of CTCF occupancy. Comparison of a subset of sites with massively parallel bisulfite sequencing data revealed that 40% of regulated CTCF binding is linked to differential DNA methylation concentrated at two critical positions within the CTCF recognition sequence. Unexpectedly, CTCF binding patterns were markedly different in normal vs. immortal cells, with the latter manifesting widespread disruption of CTCF binding associated with increased methylation. Strikingly, this disruption is accompanied by up-regulation of CTCF expression, with the result that both normal and immortal cells maintain the same average number of CTCF occupancy sites genome-wide. Our results reveal a tight linkage between DNA methylation and global CTCF occupancy patterns, and suggest that up-regulation of CTCF expression in malignant cells is a stabilizing response to methylation-associated remodeling of the occupancy landscape.

• Bullet points
• Datasets used

## GRCP052: MYC Represses STAT1 Transcription by Interacting with MIZ1

• Authors

Debasish Raha1, Minyi Shi2, Anshul Kundaje2, Zhengqing Ouyang2, Stephen Landt2 and Michael Snyder2

1. Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, CT 06520 2. Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305-5120

• Abstract

MYC is a key transcription factor that can activate as well as repress gene transcription. The role of MYC in cell transformation is achieved by activating a set of genes involved in cell growth and proliferation and by repressing a set of genes like cyclin dependent kinase inhibitors, which can interfere with the activation function of MYC. A decrease in MYC level was observed in interferon treated cells. Using genome wide mapping studies, we show that MYC binds upstream of the STAT1 gene and other interferon response genes in HeLa S3 cells. A decrease in MYC level, either by targeting MYC using siRNAs or by the treatment of cells with IFNG, leads to activation of STAT1 transcription. The repressor function of MYC at the STAT1 locus is achieved through its interaction with MIZ1. These results indicate that MYC acts as a key regulator of STAT1 transcription in interferon treated cells.

## GRCP053: Transcription factor occupancy at HOT regions quantitatively predicts RNA polymerase recruitment in five ENCODE cell lines

• Authors

Joseph W. Foley, Arend Sidow

• Abstract

Previous work in model organisms has described the existence of high-occupancy target (HOT) regions, genomic loci occupied by many different transcription factors. We developed UniPeak, a peak caller that enables direct quantitative comparison of ChIP-seq data from multiple experimental conditions, and used this method to identify HOT regions in the human genome with data from 64 transcription factors in 5 cell lines. We found that transcription-factor occupancy varies quantitatively within human HOT regions, especially between different cell types. The sequence motif associated with any given factor's direct DNA binding is somewhat predictive of its empirical occupancy, but a great deal of occupancy occurs at sites without the factor's motif, implying indirect recruitment by another factor whose motif is present. Most, but not all, HOT regions co-localize with RNA polymerase II binding sites; however, many are not near the promoters of annotated genes. Of those that are, transcription-factor occupancy is somewhat predictive of Pol II recruitment, but only weakly predictive of RNA transcript abundance, likely due to the many regulatory influences between recruitment of the polymerase and production of the mature transcript. In summary, we find that transcriptional regulation by transcription factors is a quantitative and combinatorial system.

• Datasets used

Myers TF binding

Snyder TF binding

Caltech RNA-seq
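The notion of a HOT region in the abstract above (a locus occupied by many different factors) can be sketched in a few lines. This is a simplified illustration only: it bins peak summits into fixed-size windows and counts distinct factors per window, which is not how UniPeak works. The window size, the threshold, the input tuple layout, and the function name are all assumptions of this sketch.

```python
from collections import defaultdict


def hot_windows(peaks, window=1000, min_factors=3):
    """Return windows occupied by at least `min_factors` distinct factors.

    `peaks` is an iterable of (chrom, summit_position, factor_name) tuples.
    The result maps (chrom, window_index) to the number of distinct factors.
    """
    factors_by_window = defaultdict(set)
    for chrom, position, factor in peaks:
        factors_by_window[(chrom, position // window)].add(factor)
    return {
        key: len(factors)
        for key, factors in factors_by_window.items()
        if len(factors) >= min_factors
    }


# Made-up peak summits for three hypothetical factors A, B and C.
example_peaks = [
    ("chr1", 100, "A"),
    ("chr1", 150, "A"),  # second peak of the same factor is counted once
    ("chr1", 200, "B"),
    ("chr1", 900, "C"),
    ("chr1", 1500, "A"),
    ("chr2", 100, "A"),
]

hot = hot_windows(example_peaks)
print(hot)
```

Counting distinct factors gives only a yes/no view of occupancy; the quantitative comparison across conditions that the abstract describes would need signal values per factor rather than presence alone.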

## GRCP054: Mapping the cyclin-specific genomic binding profile of elongation factor P-TEFb

• Authors Nathan Lamarre-Vincent1, Zarmik Moqtaderi1, Kevin Struhl1 and others.

1. Dept. of Biological Chemistry and Molecular Pharmacology, Harvard Medical School

• Abstract

Positive elongation factor b (P-TEFb) is a cyclin/cyclin dependent kinase 9 heterodimer that is critical to the relief of promoter proximal stalled RNA polymerase II (RNAPII) and proper transcriptional elongation. We have mapped the genome-wide binding profiles of the two cyclins, cyclin T1 and cyclin T2, which can form active P-TEFb heterodimers with cyclin-dependent kinase 9. CycT1 and CycT2 overlap extensively with respect to the genes they target. However, patterns emerge to distinguish the cyclins, when their binding profiles are compared to those of the P-TEFb targets, transcription factors and histone acetylation. CycT2 binding is focused upstream of the transcriptional start site (TSS) coincident with transcription factors. CycT1 binding is coincident with NELF and DSIF, and enriched relative to CycT2 at a termination proximal pause. Additionally, unlike negative elongation factors NELF and DSIF, neither CycT1 nor CycT2 occupancy strongly correlated with transcriptional activity. We conclude that at constitutive genes at least two mechanisms for promoter proximal pausing exist: a NELF/DSIF-dependent mechanism that predominates at active genes and a NELF/DSIF-independent mechanism characteristic of inactive promoters.

• Datasets used

Snyder TF Binding

## GRCP055: Genome wide association analysis of ATF/CREB family of transcription factors

• Authors Nathan Lamarre-Vincent1, Kevin Struhl1 and others.

1. Dept. of Biological Chemistry and Molecular Pharmacology, Harvard Medical School

• Abstract

The activating transcription factor (ATF)/cAMP response element (CRE) binding protein (CREB) family of basic leucine zipper transcription factors plays a critical role in cell growth and survival, as well as regulating a number of cell type specific activities. Members of the ATF/CREB family all recognize the canonical CRE (TGATCA). However, despite this commonality, studies have found unique targets for individual family members. Comparative analysis of the genome wide binding profiles of four ATF/CREB family members, CREB, ATF1, ATF2 and ATF3, identifies a core set of binding sites common to all four TFs, and complementary motif and TF associations that differentiate the ATF/CREB TFs.

A subset of CREB peaks associate directly with the transcription factor NF-Y. ATF2 has previously been shown to heterodimerize with JUN. In addition to significant overlap with JUN, ATF2 also preferentially associates with STAT and GATA motifs and TFs. ATF3 binding is divided almost equally between overlap with the other ATF/CREB family members and the HLH transcription factors Max and USF1.

• Datasets used

Snyder TF Binding

## GRCP060 (Previously GRCP057 clash): Transcription Factor Networks Involved in Interferon Response

• Authors

Debasish Raha1, Minyi Shi2, Anshul Kundaje2, Zhengqing Ouyang2, Stephen Landt2 and Michael Snyder2 1 Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, CT 06520 2 Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305-5120

• Abstract

Type I and Type II interferon are used as antiviral and antitumor agents. The response to interferon is mediated by JAK/STAT and other signaling pathways. We combined binding site data of several transcription factors and RNA expression data to reveal TF networks involved in regulating expression of genes in response to both type I and type II interferon.

## GBCP062: ChIP-seq analysis of SREBP family transcription factors reveals genome-wide patterns of DNA association regulated by cholesterol

• Authors

Brian D. Reed1, Alexandra E. Charos1, Debasish Raha1, Anna M. Szekely2, Sherman M. Weissman2 & Michael Snyder1,3

1Department of Molecular, Cellular, and Developmental Biology, Yale University, New Haven, CT 06520, USA. 2Department of Genetics, Yale University School of Medicine, New Haven, CT 06520, USA. 3Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA.

• Abstract

Sterol regulatory element-binding protein (SREBP) family transcription factors control the expression of genes involved in cholesterol and fatty acid homeostasis and play critical roles in numerous diet-related diseases, such as insulin resistance, diabetes, and cardiovascular disease. Here we have mapped the complete set of binding sites of SREBPs across the human genome using chromatin immunoprecipitation (ChIP) followed by massively parallel DNA sequencing (ChIP-seq). We find that SREBP1a and SREBP2 localize to 1,164 and 815 genomic binding sites, respectively, in a human hepatocyte (HepG2) cell line subjected to cholesterol deprivation and statin treatment. Many sites are co-occupied by both factors, including the promoter regions of most cholesterol metabolism genes, providing the first evidence that co-occupation by endogenous SREBP1a and SREBP2 is widespread in vivo. We find that, in addition to promoter regions, SREBPs also occupy a subset of predicted distal enhancer elements. Using the high-resolution binding data, we identify an enriched and evolutionarily conserved 10 bp consensus motif corresponding to the sterol regulatory element (SRE) and greatly expand the number of these important elements annotated in the human genome. In agreement with the binding data, gene expression analysis reveals that many SREBP-occupied genes are regulated by cholesterol levels. These results provide novel insight into the genome-wide functions of SREBPs and their role in human health and disease.

## GRCP063: The interplay of cell type specific chromatin states and TF binding in six ENCODE cell lines.

• Authors

Jason Ernst and Manolis Kellis

• Abstract

While the genome sequence of each human is invariant across different cell types, the binding of regulators and the underlying state of chromatin can be extremely dynamic. However, the extent to which the chromatin context relates to the cell-type specific binding of various transcription factors (TFs) has remained largely unexplored. In this paper, we leverage genome-wide datasets of chromatin modification patterns and TF binding across six human cell types to study the interplay of chromatin and TFs in establishing cell type specific activity. We use 25 chromatin states, consistently learned across the six cell types, to summarize over 70 chromatin datasets, providing a consistent annotation of strong and weak enhancers, open and closed insulator regions, active and weak promoters, and diverse transcribed or repressed states. We use these to study the cell type specific binding of more than 100 regulators in close to 500 experiments, and reveal their enrichments in specific chromatin states.

• Datasets used

ChromHMM segmentation and uniform peak calls

## GBCP065: Functional analysis of transcription factor binding sites in human promoters.

• Authors

Troy W. Whitfield, Jie Wang, Patrick J. Collins, E. Christopher Partridge, Nathan D. Trinklein, Shelley Force Aldred, Richard M. Myers and Zhiping Weng

• Abstract

The binding of transcription factor (TF) proteins to specific locations in the genome is integral in the orchestration of transcriptional regulation in cells. To characterize TF binding site function on a large scale, we predicted (using a combination of genome-wide binding signals and sequence analysis) and mutagenized 455 TF binding sites (TFBSs) in human promoters. We carried out functional tests on these TFBSs in four different immortalized human cell lines (K562, HCT116, HT1080 and HepG2) by using transient transfections with a luciferase reporter assay. In each cell line, we verified that between 36% and 49% of TF binding sites made a functional contribution to the promoter activity; the overall rate for observing function in any of the four cell lines was 70%. In more than a third of the functional sites, on average, TF binding resulted in transcriptional repression. When compared with predicted TF binding sites whose function was not experimentally verified, the functional TFBSs had higher conservation and were located closer to the transcriptional start site (TSS). Among functional sites, repressive sites tended to be located further from the TSS than were the activating sites. Our functional tests of TFBSs were carried out primarily for six transcription factors: CTCF, GABP, GATA2, E2F proteins, STAT proteins and YY1. Our data provide significant insight into the functional characteristics of binding sites for the YY1 transcription factor. In particular, we were able to detect distinct activating and repressing classes of YY1 binding sites, with the repressing YY1 sites found to lie closer to, and often over, translational start sites and to have a distinct binding motif.

• Bullet points
• Datasets used

ENCODE ChIP-seq data, RNA-seq data, transient transfection data from Weng lab.

# Networks

## GRCP036: A highly integrated and complex PPARGC1A transcription factor binding network in HepG2 cells

• Authors

Alexandra E. Charos1, Brian D. Reed1, Debasish Raha1, Anna M. Szekely2, Sherman M. Weissman2 & Michael Snyder1,3 1Department of Molecular, Cellular, and Developmental Biology, Yale University, New Haven, CT 06520, USA. 2Department of Genetics, Yale University School of Medicine, New Haven, CT 06520, USA. 3Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA.

Corresponding author: Michael Snyder, mpsnyder@stanford.edu

Running Title: The PPARGC1A TF network Keywords: Metabolic and regulatory networks; Functional genomics; PPARGC1A

• Abstract

PPARGC1A is a transcriptional coactivator that binds to and coactivates a variety of transcription factors (TFs) to regulate the expression of target genes. PPARGC1A plays a pivotal role in regulating energy metabolism and has been implicated in several human diseases, most notably type II diabetes. Previous studies have focused on the interplay between PPARGC1A and individual TFs, but little is known about how PPARGC1A combines with all of its partners across the genome to regulate transcriptional dynamics. In this study, we describe a core PPARGC1A transcriptional regulatory network operating in HepG2 cells treated with forskolin. We first mapped the genome-wide binding sites of PPARGC1A using chromatin-IP followed by high-throughput sequencing (ChIP-seq) and uncovered overrepresented DNA sequence motifs corresponding to known and novel PPARGC1A network partners. We then profiled six of these site-specific TF partners using ChIP-seq and examined their network connectivity and combinatorial binding patterns with PPARGC1A. Our analysis revealed extensive overlap of targets including a novel link between PPARGC1A and HSF1, a TF regulating the conserved heat shock response pathway that is misregulated in diabetes. Importantly, we found that different combinations of TFs bound to distinct functional sets of genes, thereby helping to reveal the combinatorial regulatory code for metabolic and other cellular processes. In addition, the different TFs often bound near the promoters and coding regions of each other's genes, suggesting an intricate network of interdependent regulation. Overall, our study provides an important framework for understanding the systems level control of metabolic gene expression in humans.

# Methods, Stats and Standards

## GRCP024: STAR: mapping non-contiguous RNA-seq data

• Authors

Alexander Dobin, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Phillippe Batut and Thomas R. Gingeras

• Abstract

Accurate alignment of high-throughput RNA-seq reads is a challenging and as yet unsolved problem owing to the splicing of RNA molecules and the relatively short length of the reads. Most alignment tools developed for the analysis of high-throughput sequencing data rely upon previously annotated splice junctions, the sequence characteristics of annotated junctions and/or the construction of a reference database of junction sites. To align the large ENCODE Transcriptome long RNA-seq dataset comprising 40 billion Illumina reads, we developed the Splice Transcripts Alignment and Reconstruction tool (STAR), which does not require any previous knowledge of splicing loci and does not rely upon a priori properties of the junctions. Unbiased de novo splice junction detection is imperative for the discovery of novel splice junctions and isoforms, as well as other increasingly important RNA species such as inter-chromosomal chimeric RNAs. In our validation experiments, 80-90% of the novel intergenic junctions detected by STAR were corroborated by 454 sequencing of long cDNAs. Remarkably, the validation rate remains at this high level even if the tested junctions were supported by as few as two RNA-seq reads. On the computational side, the sensitivity and specificity of the alignments produced by several of the most commonly used RNA mappers were calculated against a gold standard set of alignments, which was created by mapping the reads exhaustively to the annotated transcriptome. We demonstrate that STAR's false discovery and false negative rates are significantly lower than those of the other RNA-seq alignment algorithms.

• Bullet points
• STAR algorithm features and advantages
• Experimental validation: high (80-90%) validation rate of novel splice junctions
• Novel method to computationally estimate sensitivity and specificity against exhaustive alignments of real data
• STAR significantly outperforms Tophat and BLAT in sensitivity, specificity and speed
• Non-parametric IDR for RNA-seq data
• Datasets used
CSHL long and small RNA-seq
CSHL 454 sequencing
Gencode 7 annotations
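The comparison against a gold standard described in the abstract can be illustrated at the level of splice junctions. The sketch below is a simplified stand-in, not the evaluation used in the paper: it treats each junction as a hypothetical (chromosome, donor, acceptor) tuple and reports sensitivity and false discovery rate for a set of detected junctions. The function name and the example coordinates are made up.

```python
def junction_accuracy(detected, gold):
    """Return (sensitivity, false_discovery_rate) of detected junctions.

    Both arguments are iterables of hashable junction identifiers, here
    (chrom, donor_position, acceptor_position) tuples.
    """
    detected, gold = set(detected), set(gold)
    true_positives = len(detected & gold)
    false_positives = len(detected - gold)
    false_negatives = len(gold - detected)
    sensitivity = true_positives / (true_positives + false_negatives) if gold else 0.0
    fdr = false_positives / (true_positives + false_positives) if detected else 0.0
    return sensitivity, fdr


# Made-up gold standard and aligner output.
gold_junctions = [
    ("chr1", 1000, 2000),
    ("chr1", 3000, 4000),
    ("chr2", 500, 900),
    ("chr2", 1500, 2500),
]
detected_junctions = [
    ("chr1", 1000, 2000),
    ("chr1", 3000, 4000),
    ("chr2", 500, 900),
    ("chr3", 10, 20),  # not in the gold standard: a false positive
]

sensitivity, fdr = junction_accuracy(detected_junctions, gold_junctions)
print(sensitivity, fdr)
```

The false negative rate is simply one minus the sensitivity, so the two rates quoted in the abstract can both be read off this pair of numbers.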

## GRCP048: ChIP MC: A Monte Carlo method for motif analysis of ChIP sequencing data

• Authors

Troy W. Whitfield and Zhiping Weng

• Abstract

High throughput chromatin immunoprecipitation experiments can reveal the genome-wide patterns of transcription factor binding that orchestrate the transcriptional regulation of cells. Precise determination of the 6-15 base pair binding sites, however, relies on additional computational analysis, typically making use of position weight matrices (PWMs). Currently available analysis tools select binding sites that closely match a known PWM or one discovered de novo using the most significant binding regions. The data from ChIP sequencing can be heterogeneous, with some peaks having only weak examples of binding motifs. As a result of this heterogeneity, PWM-based distinction between the "complete" set of ChIP sequencing peaks from an experiment and a set of background sequences can be poor. To address this problem, we introduce a Monte Carlo (MC) based method, ChIP MC, that can optimize the transcription factor binding site (TFBS) selection to best distinguish between ChIP peaks and background fragments. Application of ChIP MC to synthetic data and ENCODE Consortium datasets demonstrates the effectiveness of the method. The formulation of the method allows for straightforward generalization to include additional data types; extensions are discussed.
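For readers unfamiliar with the PWM-based baseline that the abstract contrasts with ChIP MC, the following minimal sketch scores every window of a sequence against a log-odds PWM and returns the best match. It illustrates the conventional approach only and does not implement the Monte Carlo method. The motif frequencies, pseudocount, uniform background, and function names are assumptions of this sketch, and the input is assumed to contain only uppercase A, C, G and T.

```python
import math

BASES = "ACGT"


def log_odds_pwm(frequencies, background=0.25, pseudocount=0.01):
    """Convert per-position A/C/G/T frequencies into log2-odds scores."""
    pwm = []
    for column in frequencies:
        total = sum(column) + 4 * pseudocount
        pwm.append({
            base: math.log2(((value + pseudocount) / total) / background)
            for base, value in zip(BASES, column)
        })
    return pwm


def best_site(sequence, pwm):
    """Return (start, score) of the highest-scoring window, or None if too short."""
    width = len(pwm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column[base] for column, base in zip(pwm, window))
        if best is None or score > best[1]:
            best = (start, score)
    return best


# Hypothetical 3-position motif that strongly prefers T, G, A in that order.
example_frequencies = [
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]
example_pwm = log_odds_pwm(example_frequencies)
site = best_site("CCTGACC", example_pwm)
print(site)
```

A peak whose best window scores poorly under such a matrix is the kind of "weak example" that makes a fixed-threshold PWM scan a poor discriminator between peaks and background.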

• Bullet points
• Datasets used

## GRCP056: Inferring transcriptional regulation from ChIP-seq data.

• Authors

Snyder Lab

• Abstract
• Bullet points
• Datasets used

## GRCP057: Pre-programming of chromatin structure across the cell cycle.

• Authors

B Miller, Stamatoyannopoulos Lab

• Abstract

The high-resolution dynamics of chromatin accessibility across the replicative phase of the cell cycle have not been investigated on a genome scale, and have the potential to play an important role in elucidating the mechanism for inheritance of epigenetic states. To address this, we measured chromatin accessibility throughout DNA replication time. This revealed that chromatin states in subsequent phases of the cell cycle are almost completely pre-programmed in prior phases, and that the chromatin state is reset very rapidly following replication fork passage. Collectively, our results suggest that the chromatin and gene regulatory network during DNA replication is almost completely predetermined in a sequential fashion.

• Bullet points
• Datasets used

DNaseI across cell cycle, some ChIP-seq and RNA

## GRCP068: ChIP-seq guidelines and practices used by the ENCODE and modENCODE consortia.

• Authors

Stephen G. Landt1*, Georgi K. Marinov2*, Anshul Kundaje4*, Pouya Kheradpour3, Florencia Pauli5, Serafim Batzoglou4, Bradley Bernstein6, Peter Bickel7, Ben Brown7, Philip Cayting1, Yiwen Chen8, Gilberto DeSalvo2, Charles Epstein6, Katherine Fisher-Aylor2, Ghia Euskirchen1, Mark Gerstein9, Jason Gertz5, Roderic Guigo10, Alexander J. Hartemink11, Michael M. Hoffman12, Vishwanath Iyer13, Youngsook L. Jung14,15, Subhradip Karmakar16, Manolis Kellis3, Peter Kharchenko14,15, Qunhua Li17, Tao Liu8, Xiaole Shirley Liu8, Lijia Ma16, Aleksandar Milosavljevic18, Richard M. Myers5, Peter Park14, Michael J. Pazin19, Marc D. Perry20, Debasish Raha21, Timothy E. Reddy5, Joel Rozowsky9, Noam Shoresh6, Arend Sidow1,22, Matthew Slattery16, John Stamatoyannopoulos12,23, Michael Tolstorukov14,15, Kevin White16, Simon Xi24, Peggy Farnham25+, Jason Lieb26+, Barbara Wold2+, Michael Snyder1+

(*) = These authors contributed equally; (+) = corresponding authors
• Abstract

Chromatin Immunoprecipitation (ChIP) followed by high-throughput DNA sequencing (ChIP-seq) has become a valuable and widely used approach for mapping the genomic location of transcription factor binding and histone modification in living cells. Despite its widespread use, there are considerable differences in how these experiments are conducted, how the results are scored and evaluated for quality, and how the data and metadata are archived for public use. These practices affect the quality and utility of any global ChIP experiment. Through our experience in performing ChIP-seq experiments, the ENCODE and modENCODE consortia have developed a set of working standards and guidelines for ChIP experiments. These guidelines, which address antibody validation, experimental replication, sequencing depth, data and metadata reporting, and data quality assessment, are presented here.

• Bullet points

1. Current ENCODE guidelines for antibody characterization, experimental design/replication, sequencing depth, metadata
2. Guidelines for dataset quality assessment with accompanying examples

• Datasets used

All ENCODE ChIP-seq through Jan 2011 freeze

## GRCP066: Predictive accuracy of genome-scale chromatin assays

• Authors

Patrick A. Navas, Ericka M Johnson, Tristan Frum, Eric D Nguyen, Abigail K Ebersol1, Minerva E Sanchez, Hadar H Sheffer, Dimitra Lotakis, Richard Humbert, Eric Haugen, Robert E. Thurman and John A Stamatoyannopoulos

• Abstract

Massively parallel sequencing has enabled the proliferation of genome-scale assays that examine chromatin structure and composition, and chromatin occupancy by transcriptional regulators. However, systematic, well-powered comparisons of chromatin features detected in massively parallel sequencing-based assays with established experimental gold standards have never been performed. To address this, we developed a strategy for comprehensive validation of genome-scale maps of DNaseI hypersensitivity using classical Southern hybridization assays. We analyzed computationally estimated versus actual performance parameters, including sensitivity and positive predictive value, through comparison with >7,500 conventional gold-standard assays. Our results suggest that typically-applied sequencing depths for next-generation chromatin assays substantially compromise detection sensitivity, which can be rescued by deeper sequencing only if data are relatively noise-free. We show further that experimentally-determined predictive accuracy differs substantially from widely-applied computational approaches that rely on false discovery rates. Our results suggest that many genome-scale assays of chromatin structure and regulatory factor binding are both insensitive and may suffer from significantly higher rates of false discoveries than currently believed.

• Datasets used

ENCODE DNase-seq
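
The two performance parameters named in the abstract above have standard definitions, sketched here for reference. The counts are hypothetical placeholders, not values from the paper; "true positives" stands for sequencing-called sites confirmed by the gold-standard assay.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of gold-standard positive sites that the assay detected."""
    return true_pos / (true_pos + false_neg)

def positive_predictive_value(true_pos, false_pos):
    """Fraction of assay-called sites that the gold standard confirmed."""
    return true_pos / (true_pos + false_pos)

# Hypothetical validation counts (illustration only).
tp, fp, fn = 80, 20, 40
sens = sensitivity(tp, fn)               # 80 / 120
ppv = positive_predictive_value(tp, fp)  # 80 / 100
```

Note that 1 - PPV is an empirical false discovery proportion, which is the quantity the abstract contrasts with computationally estimated false discovery rates.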

## GRCP067: Automated quality assessment for next-generation epigenomic assays

• Authors

Robert Thurman, Noam Shoresh, Eric Rynes, John Stamatoyannopoulos

• Abstract

Determination of data quality is a fundamental problem in genome-scale epigenomic analysis. The availability of a standardized quality metric (PHRED) for identifying high quality data with high signal-to-noise ratios greatly enabled sequencing efforts at all levels from the human genome project to individual laboratories. Here we describe a robust signal quality measure, signal portion of tags (SPOT), that directly quantifies signal enrichment in diverse epigenomic profiling assays that utilize massively parallel sequencing. SPOT scores recapitulate the performance of human users in identifying high signal-to-noise data sets by visual inspection of tag density traces. We show further using gold-standard experimental validation that SPOT scores correlate positively with empirical true positive rates and can prospectively identify data with high a posteriori rates of biological reproducibility. SPOT is easily interpretable, automated, and applicable to a wide range of epigenomic assay types including ChIP-seq, DNase-seq, RNA-seq, and methylation enrichment experiments. SPOT scores are currently automatically computed for all Roadmap Epigenomics Program chromatin data and are available for hundreds of data sets from the ENCODE project.

• Datasets used

ENCODE ChIP-seq and DNase-seq
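
Read literally, a "signal portion of tags" score is the fraction of sequence tags that fall inside regions of signal enrichment. The sketch below computes that fraction under simplifying assumptions that are not taken from the paper: a single chromosome, tags reduced to one coordinate each, and enriched regions already called and supplied as sorted, non-overlapping half-open intervals. How the regions are called and how tags are sampled in the actual SPOT procedure is not described on this page.

```python
from bisect import bisect_right

def signal_portion_of_tags(tag_positions, signal_regions):
    """Fraction of tags that fall inside any enriched region.

    signal_regions: sorted, non-overlapping half-open (start, end) intervals.
    """
    if not tag_positions:
        raise ValueError("no tags supplied")
    starts = [start for start, _ in signal_regions]
    in_signal = 0
    for pos in tag_positions:
        # Index of the last region whose start is <= pos, or -1 if none.
        idx = bisect_right(starts, pos) - 1
        if idx >= 0 and pos < signal_regions[idx][1]:
            in_signal += 1
    return in_signal / len(tag_positions)

# Hypothetical tags and regions (illustration only).
tags = [5, 12, 18, 40, 55, 61, 90, 95]
regions = [(10, 20), (50, 65)]
spot = signal_portion_of_tags(tags, regions)
```

Here four of the eight tags lie inside a region, so the toy score is 0.5.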

• Authors

Cydney B. Nielsen, Hamidreza Younesy, Henriette O'Geen, Xiaoqin Xu, Torsten Möller, Ting Wang, Joseph F. Costello, Martin Hirst, Peggy J. Farnham, Steven J. Jones

• Abstract

Biologists possess the detailed knowledge critical for extracting biological insight from genome-wide data resources and yet they are increasingly faced with non-trivial computational analysis challenges posed by genome-scale methodologies. To lower this computational barrier, particularly in the early data exploration phases, we have developed an interactive pattern discovery and visualization prototype, Spark, designed with epigenomic data in mind. Here we demonstrate Spark’s ability to reveal both known and novel epigenetic signatures using genome-wide histone modification, DNA methylation, and transcription factor data from human embryonic stem cells.

• Datasets used

ENCODE ChIP-seq

# Translation

## GRCP025: Mining mass spec data for lncRNA translation: the ectopic ORFeome and cryptic mRNAs.

• Authors

Balazs Banfai^, Hui Jia^, Jainab Khatun^, Emily Wood, Christopher Maier, Will Gundling, Peter Bickel, Morgan Giddings, Ben Brown% & Leonard Lipovich%

^ contributed equally to this work; % corresponding authors

• Abstract

The ENCODE Consortium has produced tandem mass spectrometry data for the human cell lines GM12878 and K562, both for whole cells and cellular fractions and compartments. Whole cells and matched compartments have also been assayed by RNA-seq in long and short, PolyA+ and PolyA- fractions. From these data, we estimated the frequency of translation of Gencode v7 lncRNAs as a function of their expression across cellular compartments. The expression levels in the nucleus and cytoplasm were the most important covariates we analyzed. In particular, mRNAs expressed at high levels in the nucleus and at low levels in the cytoplasm were five-fold less likely to be translated than other RNAs. On average, in whole cell data, lncRNAs are expressed at 10% the level of mRNAs in the long RNA-seq, but expressed at two-fold higher levels in the short RNA-seq data in these cell lines. Finally, lncRNAs are at least 100 times less likely to be translated than mRNAs expressed at similar levels. Intersecting 15,512 lncRNAs with 79,333 peptides yielded only 111 peptide-matched lncRNAs. For each lncRNA with detectable translation, we manually re-annotated the locus. In all but two cases where multiple in-frame peptides matched a transcript, we found the translated putative lncRNA in question to be a coding transcript of a known gene. The two outliers were one unprocessed pseudogene and one bona fide lncRNA gene; the translated ORFs were compromised by upstream stop codons in both cases. All lncRNA ORFs that lacked upstream stops, and could be translated, were represented only by a single peptide per transcript. We conclude that with very few exceptions ribosomes are able to distinguish coding from non-coding transcripts with at least 99% fidelity, and that ectopic translation and cryptic mRNAs are rare in the human genome.

• Bullet points
• Datasets used

Proteomic data used: Mass spectrometry data from two ENCODE Tier1 cell lines. Sequences used: GENCODE V7 transcripts and UCSC hg19 human genome sequence.
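
As a quick arithmetic check on the counts quoted in the abstract above, the raw share of lncRNAs with any peptide match can be computed directly. This is only the unadjusted intersection fraction. It is not the expression-matched comparison with mRNAs that the abstract also reports, and the variable names are illustrative.

```python
# Counts quoted in the abstract.
lncrnas_tested = 15_512
peptide_matched = 111

matched_fraction = peptide_matched / lncrnas_tested
unmatched_fraction = 1 - matched_fraction

print(f"{matched_fraction:.2%} of lncRNAs had a peptide match")
print(f"{unmatched_fraction:.2%} had none")
```

The matched share comes out below 1%, which is in line with the abstract's statement of at least 99% fidelity.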

## GRCP026: Whole human genome proteogenomic mapping for ENCODE cell line data: identifying protein-coding regions.

• Authors

Jainab Khatun1*, Yanbao Yu2, John Wrobel1, Brian A. Risk1, Harsha P. Gunawardena2,3, Ashley Secrest1, Wendy J. Spitzer1, Ling Xie2, Li Wang2, Xian Chen2,3 and Morgan C. Giddings1,2

• Abstract

Background: Proteogenomic mapping is an approach that uses mass spectrometry data from proteins to directly map protein-coding genes and could aid in locating translational regions in the human genome. In concert with the ENcyclopedia of DNA Elements (ENCODE) project, we applied proteogenomic mapping to produce proteogenomic tracks for the UCSC Genome Browser, to identify putative translational regions across the human genome. Results: We generated ~1 million high-resolution tandem mass (MS/MS) spectra for Tier 1 ENCODE cell lines K562 and GM12878 and mapped them against the UCSC hg19 human genome, and the GENCODE V7 annotated protein and transcript sets. We then combined and compared the results from the three searches to identify the best-matching peptide for each MS/MS spectrum, thereby increasing the confidence of the putative new protein-coding regions found via the whole genome search. At a 1% false discovery rate, we identified 26,472, 24,406, and 13,128 peptides from the protein, transcript, and whole genome searches, respectively; of these, 481 were found solely via the whole genome search. The proteogenomic mapping data are available on the UCSC Genome Browser at http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=wgEncodeUncBsuProt. Conclusions: The whole genome search revealed that ~4% of the uniquely mapping identified peptides were located outside GENCODE V7 annotated exons. The combination and comparison of disparate searches also resulted in identifying 15% more spectra than would have been identified solely from a protein database search. Therefore, whole genome proteogenomic mapping is a complementary method for genome annotation when performed in conjunction with other searches.

• Bullet points
• Datasets used

Proteomic data used: Mass spectrometry data from two ENCODE Tier1 cell lines. Sequences used: GENCODE V7 transcripts and UCSC hg19 human genome sequence.

# Variation

## GBCP027: Analysis of variation at transcription factor binding sites in fly and human

• Authors

Mikhail Spivakov, Ewan Birney and collaborators

• Provisional abstract

Advances in sequencing technology have boosted population genomics and made it possible to map the positions of transcription factor binding sites (TFBS) with high precision by ChIP-seq. Here we investigate TFBS variability by combining ChIP data generated by the ENCODE (human) and modENCODE (Drosophila) consortia with large-scale genomic variation data from 1000 Genomes and the Drosophila Population Genomics Project. We introduce a metric of TFBS variability, motif mutational load, that takes into account motif fitness and makes it possible to investigate TFBS functional constraints instance-by-instance as well as in sets that share common biological properties. These analyses have provided intriguing insights into the relationship between within- and cross-species conservation and have shown evidence for the functional "buffering" of TFBS mutations in each of the two species. The emerging ChIP-seq studies comparing genome-wide TFBS profiles across individuals are instrumental to further address these questions, and by analysing one such dataset we show evidence that "buffering" mechanisms which reduce the deleterious effects of TFBS mutations develop in the course of evolution.

• Datasets used

ENCODE ChIP-seq and motif discovery data, 1000 Genomes pilot, modENCODE ChIP data and motif discovery data, Drosophila Population Genomics Project variation data.

## GRCP034: Structural and functional analysis of the full spectrum of genomic variations in ENCODE non-coding RNAs and Transcription Factor Binding Sites

• Authors

Xinmeng Jasmine Mu, Arif Harmanci, Jieming Chen, Joel Rozowsky, Robert Bjornson, and Mark B. Gerstein

• Abstract

Natural selection within non-coding elements manifests itself in many different ways. In previous studies, the genomic annotations of non-coding RNAs (ncRNAs) and transcription factor binding sites (TFBSes) have been shown to be under selective constraint. In this work, we investigate selection pattern within these elements at high resolution. Using data from the ENCODE and 1000 Genomes Projects, we examine genomic variations at interacting sites within a single molecule or between different molecules. We first analyze variants at sites having structural interactions within a single ncRNA molecule. We discover that mutations at sites under stronger selection tend to destabilize the structures. Intriguingly, the paired regions in ncRNA structures harbor an enrichment of synonymous mutations that are non-disruptive to the structures relative to non-synonymous ones. We further analyze variants at interacting sites in different molecules through physical binding. In parallel to earlier observations, we find a similar destabilizing effect of mutations at sites under stronger selection on the hybridization between microRNAs and their targets. Another type of binding reaction involves protein-DNA binding at TFBSes. In this regard, we explore various sub-patterning of genomic variants at high occupancy and cell-line specific TFBSes, and find that more binding reactions lead to stronger selection. Lastly, we analyze selection in functional units that are interacting through regulatory networks, which typically involves a regulator gene, a non-coding regulatory element, and a target gene. To overcome the small number of coevolving sites between these units, we develop a consistency score to measure how consistent selection pressure is between the units. Consistencies in selection are found across the units for both TF-binding and miRNA-binding networks.

• Bullet points
• Datasets used

ENCODE ChIP-seq data, 1000 Genomes pilot, GENCODE v7

## GRCP035: Linking Disease Associations with Regulatory Information in the Human Genome

• Authors

Marc A. Schaub, Alan P. Boyle, Anshul Kundaje, Serafim Batzoglou, Michael Snyder

• Abstract

Genome Wide Association Studies have been successful in identifying Single Nucleotide Polymorphisms (SNPs) associated with a large number of phenotypes. However, an associated SNP is likely part of a larger region of Linkage Disequilibrium. This makes it difficult to precisely identify the SNPs that have a biological link with the phenotype. We have systematically investigated the association of multiple types of ENCODE data with disease-associated SNPs and show that there is significant enrichment for functional SNPs among the currently identified associations. This enrichment is strongest when integrating multiple sources of functional information and when highest confidence disease associated SNPs are used. We propose an approach that integrates multiple types of functional data generated by the ENCODE Consortium to help identify “functional SNPs” that may be associated with the disease phenotype. Our approach generates putative functional annotations for up to 80% of all previously reported associations. We show that for most associations, the functional SNP most strongly supported by experimental evidence is a SNP in linkage disequilibrium with the reported association rather than the reported SNP itself. Our results show that the experimental data sets generated by the ENCODE consortium can be successfully used to suggest functional hypotheses for variants associated with diseases and other phenotypes.

• Bullet points

- Identify functional SNPs (overlapping regulatory or coding regions in ENCODE) that are associated with diseases in GWAS.

- Search for the SNP most strongly supported by functional evidence in the LD region around each association.

- Generate functional hypotheses for up to 80% of all existing GWAS associations.

- In the majority of cases, the reported SNP is not the one most strongly supported by functional evidence.

• Datasets used

- All the data in RegulomeDB (see GRCP041)

- Gencode v7 Annotation (see GRCP001)

- NHGRI GWAS catalog

- HAPMAP2/3 (for LD information)
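
The search step in the bullet points above (find the SNP in the LD region with the strongest functional support) can be sketched as follows. The r² cutoff of 0.8, the numeric evidence scores, the SNP identifiers, and the function name are all assumptions made for illustration; the page does not state how LD regions are delimited or how evidence is scored.

```python
def best_supported_snp(lead_snp, ld_partners, evidence_score, r2_threshold=0.8):
    """Return the SNP in the lead SNP's LD block with the highest evidence score.

    ld_partners: dict mapping SNP id -> r2 with the lead SNP.
    evidence_score: dict mapping SNP id -> numeric score (higher = more support).
    SNPs without a score count as 0; ties are resolved in favour of the lead SNP.
    """
    candidates = [lead_snp] + [
        snp for snp, r2 in ld_partners.items() if r2 >= r2_threshold
    ]
    return max(candidates, key=lambda snp: evidence_score.get(snp, 0))

# Hypothetical identifiers and values (illustration only).
ld = {"snp_a": 0.95, "snp_b": 0.50, "snp_c": 0.85}
scores = {"snp_lead": 1, "snp_a": 3, "snp_b": 5, "snp_c": 2}
best = best_supported_snp("snp_lead", ld, scores)
```

In this toy case `snp_b` has the highest score but falls below the LD cutoff, so `snp_a` is returned instead of the reported lead SNP, mirroring the situation described in the last bullet point.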

## GRCP037: Simultaneous SNP genotyping and assessment of allele-specific bias from ChIP-seq data

• Authors

Y. Ni, A. Hall, A. Battenhouse, V. Iyer

• Abstract

Single nucleotide polymorphisms (SNPs) have been associated with many aspects of human development and disease, and many non-coding SNPs associated with disease risk are presumed to affect gene regulation. We have previously shown that SNPs within transcription factor binding sites can affect transcription factor binding in an allele-specific and heritable manner. These SNPs are likely to be regulatory SNPs that modulate transcription. However, such analysis has relied on prior whole-genome genotypes provided by large external projects such as HapMap and the 1000 Genomes Project. This requirement limits the study of allele-specific effects of SNPs in primary patient samples from diseases of interest, where complete genotypes are not readily available. In this study, we show that we are able to identify SNPs de novo and accurately from ChIP-seq data. Where independent genotyping data was available, our de novo identified SNPs from ChIP-seq data are highly concordant with published genotypes. Analysis of transcription factor binding at discovered SNPs revealed widespread heritable allele-specific binding, confirming previous observations. Our approach combines SNP discovery, genotyping and allele-specific analysis, but is selectively focused on functional regulatory elements occupied by transcription factors or epigenetic marks, and will therefore be valuable for identifying the functional regulatory consequences of non-coding SNPs in primary disease samples.

• Bullet points
• Datasets used

ENCODE ChIP-seq data from the Iyer lab, 1000 Genomes Pilot 2 (published)
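
One common way to test for allele-specific binding at a heterozygous SNP is an exact two-sided binomial test of the reference and alternate read counts against an expected 50:50 ratio. The sketch below shows that generic test. It is an assumption for illustration, not necessarily the statistic used in the paper, and it ignores reference-mapping bias and multiple-testing correction.

```python
from math import comb

def allelic_imbalance_pvalue(ref_reads, alt_reads):
    """Exact two-sided binomial p-value for a 50:50 allelic ratio."""
    n = ref_reads + alt_reads
    if n == 0:
        raise ValueError("no reads cover this site")
    k = min(ref_reads, alt_reads)
    # Lower tail P(X <= k) for X ~ Binomial(n, 0.5); doubled by symmetry.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical read counts at two heterozygous sites (illustration only).
p_balanced = allelic_imbalance_pvalue(10, 10)
p_skewed = allelic_imbalance_pvalue(18, 2)
```

A balanced site gives a p-value of 1, while the 18:2 site gives a small p-value, flagging it as a candidate for allele-specific binding.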

## GRCP040: Identification and Characterization of Allele Specific Gene Expression and Regulation in the Human Genome

• Authors

Hualin Simon Xi, Jie Wang, Jia Xu, Zhiping Weng

• Abstract

Allele-specific gene expression is an important source of phenotypic variation. Studying the underlying regulatory mechanisms will help draw the links between genetic/epigenetic variation and phenotypic variation. Recent advances in high-throughput technologies and the availability of genome-wide experimental data from large international consortia like ENCODE and the 1000 Genomes Project have enabled us to investigate this phenomenon on a genome-wide scale. Here we conducted an integrated analysis using the rich datasets provided by the consortia to identify allele-specifically expressed genes as well as allele-specific protein-DNA interactions and epigenetic changes in histone modifications and DNase I-hypersensitive sites throughout the genome. We demonstrated the feasibility of identifying allele-specific events in cells derived from the same individual using data generated from next-generation sequencing. We uncovered well-coordinated allele-specific transcriptional regulation, epigenetic changes and gene expression. A survey of the genetic variation at the binding sites for 33 transcription factors also showed the extent of genetic causes of allele-specific gene regulation.

• Bullet points
• Datasets used

## GRCP041: RegulomeDB: Functional Annotation of Genomic Variation

• Authors

Alan P. Boyle, Eurie L. Hong, Manoj Hariharan, Yong Cheng, Marc A. Schaub, Maya Kasowski, Konrad J. Karczewski, Julie Park, Benjamin C. Hitz, Shuai Weng, J. Michael Cherry, Michael Snyder

• Abstract

As the sequencing of healthy and disease genomes becomes more commonplace, detailed annotation provides interpretation for individual variation responsible for normal and disease phenotypes. Current approaches focus on direct changes in protein coding genes, particularly nonsynonymous mutations that directly affect the gene product. However, most individual variation occurs outside of genes and, indeed, most markers generated from genome-wide association studies (GWAS) identify variants outside of coding segments. Identification of potential regulatory changes that perturb these sites will lead to a better localization of truly functional variants and interpretation of their effects. We have developed a novel approach and database, RegulomeDB, which guides interpretation of regulatory variants in the human genome. RegulomeDB includes high-throughput, experimental data sets from ENCODE and other sources, as well as computational predictions and manual annotations to identify putative regulatory potential and identify functional variants. These data sources are combined into a powerful tool that scores variants to help separate functional variants from a large pool and provides a small set of putative sites with testable hypotheses as to their function. We demonstrate the applicability of this tool to the annotation of noncoding variants from 69 fully sequenced genomes as well as that of a personal genome, where thousands of functionally associated variants were identified. Moreover, we demonstrate a GWAS where the database is able to quickly identify the known associated functional variant and provide a hypothesis as to its function. Overall, we expect this approach and resource to be valuable for the annotation of human genome sequences.

• Bullet points
• Datasets used

All Jan2011 Freeze datasets

## GRCP050: Personal and population genomics of human regulatory variation (note has also used GRCP028)

• Authors

Vernot B, Stergachis A, Maurano M, Vierstra J, Neph S, Thurman R, Stamatoyannopoulos J, Akey J, University of Washington

• Abstract

The characteristics of, and evolutionary forces acting on, regulatory variation in humans remain elusive because of the difficulty in defining functionally important non-coding DNA. Here, we combine genome-scale maps of regulatory DNA marked by DNaseI hypersensitive sites from 138 cell and tissue types with whole-genome sequences of 53 geographically diverse individuals in order to better delimit the patterns of regulatory variation in humans. We estimate that individuals contain up to seven times as many functional variants in regulatory DNA compared to protein-coding regions, although they are likely to have, on average, smaller effect sizes. Moreover, we demonstrate that there is significant heterogeneity in the level of functional constraint in regulatory DNA among different cell types, with a striking difference between normal and immortal cells. We also find marked variability in functional constraint among transcription factor motifs in regulatory DNA, with sequence motifs for major developmental regulators, such as HOX proteins, exhibiting more constraint than protein-coding regions. Finally, we perform a genome-wide scan of recent positive selection, and identify hundreds of novel substrates of adaptive regulatory evolution that are enriched for biologically interesting pathways such as melanogenesis and adipocytokine signaling. These data and results provide new insights into patterns of regulatory variation in individuals and populations, and demonstrate that a large proportion of functionally important variation lies beyond the exome.

## GRCP059: Evidence of selection in the human population for biochemically-active non-conserved elements

• Authors

Luke Ward, Manolis Kellis

• Abstract

The pilot ENCODE project revealed that a surprisingly large number of biochemically-active elements were not conserved across species, and that evidence for purifying selection among humans at these non-conserved elements was weak. We reconsider this evidence using whole-genome sequence data from the 1000 Genomes Project and functional data from ENCODE, which allow a much more comprehensive examination of the relationship between human variation and biochemical activity. Specifically, we inspected patterns of human diversity outside of previously-annotated exons and conserved elements. We find that a broad range of unconserved elements (protein binding sites, regulatory elements detected by DNase and FAIRE, novel transcribed regions, and distal enhancers marked by histone modifications) show depressed human diversity (~16%) and derived allele frequency despite not showing constraint across mammals. This signal of negative selection is robust to the confounding effects of CpG content and background selection from nearby exons. Conversely, we have also studied the activity patterns of constrained elements revealed by comparative analysis of 29 mammals. We find that among conserved noncoding elements, those without detected biochemical activity by ENCODE have relaxed constraint among humans, suggesting that some have become nonfunctional on the human lineage. Taken together, these data highlight the potential of ENCODE functional genomics data to explain the ubiquity of recent selection among primates and humans.

• Datasets used

ENCODE ChIP-seq and motif discovery data, DNase I and FAIRE, ChromHMM segmentation, 1000 Genomes pilot, GENCODE v7

## GRCP061: Allele-specific transcription factor binding, chromatin modification, and gene expression in the human genome

• Authors

Bob Altshuler, Manolis Kellis

• Abstract

Understanding how variation in genome sequence leads to differences in gene regulation is a longstanding challenge that is essential to explaining the many phenotypic differences and complex diseases that are observed in humans. Sequencing-based functional genomics assays provide unique insight into this problem by allowing direct observation of differences between homologous chromosomes in, for example, gene expression or chromatin state. The ENCODE project provides a unique opportunity to study allele-specific activity jointly across many layers of regulation including DNA methylation, chromatin structure and modifications, occupancy by transcription factors and RNA Polymerase II, and ultimately gene expression. We find widespread evidence of allele-specific activity across ENCODE datasets, as well as genome-wide correlations among combinations of TFs, histone modifications, and several classes of RNAs. Through this study, we have identified thousands of functionally-associated sequence variants in non-coding portions of the genome, demonstrating their potential to provide new insights into gene regulatory mechanisms. Ultimately, these variants may provide mechanistic insights into intergenic variants associated with human disease and may point the way towards more direct interpretation of rare variants in personal human genomes.

• Datasets used

ENCODE ChIP-seq and RNA-seq data, ChromHMM segmentation, 1000 Genomes pilot, GENCODE v7

## PGCP001: Widespread Site-dependent Buffering of Human Regulatory Polymorphism

• Authors

Matthew T. Maurano1,*, Hao Wang1,*, Tanya Kutyavin1, John A. Stamatoyannopoulos1,2

• Abstract

The average individual is expected to harbor thousands of variants within non-coding genomic regions involved in gene regulation. However, it is currently not possible to interpret reliably the functional consequences of genetic variation within any given transcription factor recognition sequence. To address this, we comprehensively analyzed heritable genome-wide binding patterns of a major sequence-specific regulator (CTCF) in relation to genetic variability in binding site sequences across a multi-generational pedigree. We localized and quantified CTCF occupancy by ChIP-seq in 12 related and unrelated individuals spanning three generations, followed by comprehensive targeted resequencing of the entire CTCF-binding landscape across all individuals. We identified hundreds of variants with reproducible quantitative effects on CTCF occupancy (both positive and negative). While these effects paralleled protein-DNA recognition energetics when averaged, they were extensively buffered by striking local context dependencies. In the significant majority of cases buffering was complete, resulting in silent variants spanning every position within the DNA recognition interface irrespective of level of binding energy or evolutionary constraint. The prevalence of complex partial or complete buffering effects severely constrained the ability to predict reliably the impact of variation within any given binding site instance. Surprisingly, 40% of variants that increased CTCF occupancy occurred at positions of human-chimp divergence, challenging the expectation that the vast majority of functional regulatory variants should be deleterious. Our results suggest that, even in the presence of perfect genetic information afforded by resequencing and parallel studies in multiple related individuals, genomic site-specific prediction of the consequences of individual variation in regulatory DNA will require systematic coupling with empirical functional genomic measurements.