NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.
Notes from 10/22 call follow.
Collaborators: Mark Groudine (co-PI?), Michael Bender at Hutchinson Cancer Center.
Data will be DNase and gene expression (Exon Array track ?) in 30-40 mouse primary tissues, plus a few cell lines, including ES. 10-12 of the cell lines will have deep sequencing w/ footprint calls (DGF track). All will be in the reference genome strain (C57BL/6J).
Logistics: All bioinformatics will run through Stam lab. Data already available -- will be able to make first submissions in the mid-Nov. to Dec. 1 timeframe.
John's suggestion for companion tracks: Castaneus strain genome sequence (Stanford, Sidow lab)
They will also have DNA replication timing data for ENCODE Tier1 and Tier2 analogs.
Studies in GATA-1 knockouts, ES, erythroid, myocyte (6 cell types):
1) Gene expression (RNA-seq)
2) DNase
3) 3 histone marks (ChIP)
4) 3 TF's (ChIP)
5) 5C
Experiment matrix is: File:HardisonExperimentMatrix.doc
Human/mouse comparative analysis.
Collaborators: Barbara Wold (RNA), Job Dekker (5C), James Taylor (analysis), and others (Bob Paulson at PSU, Gerd Blobel at Philadelphia Children's Hospital, Greg Crawford, and more).
Ross is determining best person for contact, possibly Sheryl Kapore (will let us know)
Sequencing at Caltech and PSU (Steve Schuster)
Already have GATA-1 knockout data from previously funded effort
UCSD/Ludwig Institute (Ren)
Contacted 11/21 to discuss project (late notification from NHGRI to us -- it's actually an R01, not an ARRA grant).
PI: Bing Ren firstname.lastname@example.org
Dr. Yin Shen (<email@example.com>) is leading this project and she will be the main contact for general issues related to experimental design and method. Ms. Lee Edsall (<firstname.lastname@example.org>) will be in charge of data submission.
Sherman Weissman at Yale is co-PI.
30 TF's in 2 cell lines
2 RNA-seq datasets
Will have some ready to submit by end-of-year
Will have bioinformatics lead at Stanford (trained by Mike Wilson).
U Chicago (White lab)
Notes from 11/4 call follow.
PI: Kevin White <email@example.com>
Liza Herendeen <firstname.lastname@example.org> Administrative Assistant (773) 834-3913
Collaborator: Mike Snyder, Stanford
Data will be ChIP-seq to identify TFBS, using an alternative method to TF-antibodies (protein-tagged TF's, used with a single antibody to the protein). TF's are expressed from BACs (?). Currently using Solexa platform, single reads.
Initially K562, will try IPS cells, lots of TF's (overcomes limitations of antibody availability). Expect to generate 50-70 datasets per year for the 2-year period of the grant.
Logistics: still determining whether all data will go through White lab bioinformatics (Cloud-based). White lab has experience with modENCODE submissions. Pilot data is currently available. Expecting production data submission to be ready around Jan 1.
Nick Bild <email@example.com> will be the bioinformatics/database person who will interface with DCC wrangler. Subhradip Dhar <firstname.lastname@example.org> is the experimental lead. We will send data from our "Cistrack" system (www.cistrack.org), as we do for modENCODE (this was built by Bob Grossman).
(1) Describes the database for Drosophila binding sites determined by ChIP. Flynet: a genomic resource for Drosophila melanogaster transcriptional regulatory networks. Tian F, Shah PK, Liu X, Negre N, Chen J, Karpenko O, White KP, Grossman RL. Bioinformatics. 2009.
(2) Application of the BAC tagging strategy in human cancer cells. Genomic antagonism between retinoic acid and estrogen signaling in breast cancer. Hua S, Kittler R, White KP. Cell. 2009 Jun 26;137(7):1259-71.
(3) Description of the BAC tagging strategy. BAC TransgeneOmics: a high-throughput method for exploration of protein function in mammals. Poser I, Sarov M, Hutchins JR, Hériché JK, Toyoda Y, Pozniakovsky A, Weigl D, Nitzsche A, Hegemann B, Bird AW, Pelletier L, Kittler R, Hua S, Naumann R, Augsburg M, Sykora MM, Hofemeister H, Zhang Y, Nasmyth K, White KP, Dietzel S, Mechtler K, Durbin R, Stewart AF, Peters JM, Buchholz F, Hyman AA. Nat Methods. 2008 May;5(5):409-15. Epub 2008 Apr 6. Erratum in: Nat Methods. 2008 Aug;5(8):748.
UNC (Morgan Giddings lab)
Giddings office phone: 919-843-3513 email: email@example.com
Bioinformatics (database) contact: Chris Maier <firstname.lastname@example.org>
Protein extraction from cell fractions (cytosolic vs. whole cell ?) followed by protease digestion into peptides, chromatography, and 'tandem' mass spec. Initially will work with peptides 5-20 AA long. Later, 50-100 AA using newer 'middle-down proteomics' technology.
Giddings lab will be performing bioinformatics, collaborator in biochem dept. will be doing cell culture and mass spec.
First dataset expected in next 2 months (by Jan. 1).
Each mass spec run (24 hrs) produces 100K spectra; about 50% are filtered out for quality, and only 10-20% of the remainder can be uniquely identified. Replicates -- expect 8-10 per cell line. Will be using one (or both?) of the Tier1 cell lines.
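As a back-of-the-envelope check on these figures, the expected number of uniquely identified spectra per cell line can be worked out as below. This is only a sketch: the function name is ours, and it assumes the quoted rates apply uniformly to every replicate run. It counts spectra, not distinct peptides, since replicates will identify many of the same peptides.

```python
def identified_spectra(spectra_per_run, pass_quality, identified_frac, replicates):
    """Estimate uniquely identified spectra summed over all replicate runs."""
    good = spectra_per_run * pass_quality   # spectra surviving the quality filter
    identified = good * identified_frac     # spectra matched to a unique peptide
    return round(identified * replicates)

# Figures quoted on the call: 100K spectra per run, ~50% pass quality,
# 10-20% of those identifiable, 8-10 replicates per cell line.
low = identified_spectra(100_000, 0.5, 0.10, 8)
high = identified_spectra(100_000, 0.5, 0.20, 10)
print(low, high)
```

Under these assumptions the range is roughly 40K to 100K identified spectra per cell line.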
Based on features of the mass spec spectra, they match to the genome w/ HMM and database matching.
Data submissions will initially consist of peptides mapped to the genome, with a score. Later, possibly gene models based on peptide data.
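One plausible shape for such a submission is one BED-style line per mapped peptide. The sketch below is an assumption on our part, not a format agreed on the call: it follows the standard BED6 field order (chrom, start, end, name, score, strand), and the `to_bed6` helper, coordinates, and peptide sequence are made up for illustration.

```python
def to_bed6(chrom, start, end, peptide, score, strand):
    """Format one peptide-to-genome mapping as a tab-separated BED6 line.

    BED coordinates are 0-based, half-open; the score is an integer 0-1000.
    """
    if not 0 <= score <= 1000:
        raise ValueError("BED score must be between 0 and 1000")
    if start >= end:
        raise ValueError("start must be less than end")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    return "\t".join([chrom, str(start), str(end), peptide, str(score), strand])

# Hypothetical example: a 7-residue peptide spans 7 * 3 = 21 bases.
line = to_bed6("chr1", 1000, 1021, "PEPTIDE", 850, "+")
print(line)
```

Whatever scale the lab's own score uses would need to be rescaled to the 0-1000 range for this layout.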
Raw data (mass spec) will be submitted to 'proteome commons' database.
We had a fascinating conference call with Morgan Giddings, who's received one of the new ENCODE ARRA grants. In a nutshell, she's doing mass spec proteomic analysis, and mapping the protein fragments back to the genome.
Her collaborator in the Chemistry department is actually developing the cell cultures and doing the mass spec runs. She does the data analysis. They take the cell culture (the ENCODE Tier 1 cell lines, in this case), optionally fractionate for a certain cellular fraction, do an assay to select the protein (by getting rid of the DNA and RNA? I'm unclear how this selection works), digest the proteins down into peptides, and run through a 2D gel to separate them by size and charge so that the mass spec won't be overwhelmed. In a typical mass spec run, they'll get about 100K peaks, which is basically state of the art. So, they'll ultimately get about N*100K peaks, where N is the number of sections they divide their gels into.
The next step is standard mass spec data analysis: given peaks on the spectrum, identify what peptides they represent. This is done by searching through databases that list various known peptides by molecular weight. To boost the reliability of this identification, they do tandem mass spec: first they do mass spec analysis on the peptides, then they digest them a bit more to get smaller peptides, and then run mass spec analysis on the smaller peptides. The digestion steps generally break up the protein at specific amino acids, so one can make a more positive identification by seeing if a peak in the first spectrum corresponds to a known peptide, and if peaks in the second spectrum are what would be expected by breaking up that peptide at specific amino acids.
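The lookup-by-weight step, and why the second round of spectra helps, can be sketched as follows. This is a simplified illustration rather than the lab's pipeline: it uses standard monoisotopic residue masses, a made-up three-entry candidate list, hypothetical helper names, an arbitrary tolerance, and models the smaller fragments simply as prefixes of the peptide.

```python
# Standard monoisotopic residue masses (Da) for the 20 amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # an intact peptide carries one extra water

def peptide_mass(seq):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER

def match_by_mass(observed, candidates, tol=0.01):
    """Return candidate peptides whose mass is within tol Da of the observed mass."""
    return [p for p in candidates if abs(peptide_mass(p) - observed) <= tol]

def prefix_masses(seq):
    """Cumulative residue masses of each proper prefix (a simplified fragment ladder)."""
    total, ladder = 0.0, []
    for aa in seq[:-1]:
        total += RESIDUE_MASS[aa]
        ladder.append(round(total, 4))
    return ladder

candidates = ["GASP", "VTCK", "GAPS"]  # hypothetical database entries
hits = match_by_mass(peptide_mass("GASP"), candidates)
print(hits)                   # GASP and GAPS have the same composition, so both match
print(prefix_masses("GASP"))  # the fragment ladders differ, which separates the two
print(prefix_masses("GAPS"))
```

Intact mass alone cannot distinguish peptides with the same amino acid composition; the fragment masses from the second stage can.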
Then comes the cool part, where she maps the peptides back to the genome. She does this as a two-step process. In the first step, she identifies a set of genomic regions that could encode each peptide, probably with something akin to a BLAT search (note that the peptides can be as small as 5 amino acids). The next step, and the thing that sets her apart from the others, is doing a more refined match in which an HMM somehow estimates the likelihood of each genomic region matching the peaks in the spectrum.
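The first, coarse step (finding genomic regions that could encode a peptide) can be illustrated with a naive six-frame translation search. This is a toy stand-in under our own assumptions: the actual search tool was not specified on the call, the HMM refinement is not modeled, peptides spanning splice junctions are not found, and the sequence and function names are invented.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def revcomp(dna):
    """Reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))

def find_peptide(genome, peptide):
    """Return (start, end, strand) hits, 0-based half-open on the forward strand."""
    hits = []
    n = len(genome)
    for strand, seq in (("+", genome), ("-", revcomp(genome))):
        for frame in range(3):
            protein = translate(seq[frame:])
            pos = protein.find(peptide)
            while pos != -1:
                start = frame + 3 * pos
                end = start + 3 * len(peptide)
                if strand == "-":
                    # convert reverse-strand offsets back to forward coordinates
                    start, end = n - end, n - start
                hits.append((start, end, strand))
                pos = protein.find(peptide, pos + 1)
    return hits

genome = "CCATGGCTAAAG"  # made-up sequence containing ATG GCT AAA (M A K)
print(find_peptide(genome, "MAK"))
print(find_peptide(revcomp(genome), "MAK"))
```

With peptides as short as 5 residues, a search like this would return many chance hits across a whole genome, which is presumably where the second, spectrum-aware scoring step earns its keep.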
Out of 100K peaks, she says that 50K are typically junk. Of the remaining 50K peaks, something like 90% are typically unidentifiable. One contributing factor: the peptide databases often show the peptide only in its canonical state, with no post-translational modification. Out of the peptides she maps to the genome, she keeps the ones that don't border splice sites, and uses the unique mappings of peptide to genome to deconvolute the non-unique mappings. She's starting to work with an experimental mass spec protocol that involves larger peptides, and says that with those one can identify spliced peptides.
So when this is all done, what's left is essentially a track composed of blocks with confidence scores, where each confidence score reflects the likelihood that the block was translated to part of a protein. The utterly cool thing is that these blocks have made it through lots of regulatory processes: transcription, splicing, translation, etc., and have existed as proteins in the cell long enough to be picked up by the assay. This protocol will miss tons of genuine translated proteins (e.g. probably anything that's membrane-bound), but what it picks up is that much more likely to be functional.