NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.

Locations of ENCODE Data


Post Jan 2011 Freeze data


Site Build Location Notes
UCSC Download Sites
UCSC Test hg19 http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/encodeDCC/

ftp://encodeftp.cse.ucsc.edu/pipeline/hg19/

Since the public site is behind the test site in terms of data availability, for most AWG analyses you will want to use the test site.
UCSC Public hg19 http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/

ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/

UCSC Accessions NA accessions.hg19.2011-09-07.txt An index file relating Data submissions to their DCC Accession numbers (Sep 7th 2011).
AWG uniform processing pipeline
Jan 2011 Freeze files and meta tags hg19 Encode data table A table prepared by Ian indicating the relationships of all ENCODE data files from the Jan 2011 freeze to their metatags.
Uniform signal tracks BEDGRAPH (Jan 2011 Freeze) hg19 Download BEDGRAPH FILES: encode-box-01@fasp.encode.ebi.ac.uk:byDataType/signal/jan2011/bedgraph/ or ftp://ftp-private.ebi.ac.uk/byDataType/signal/jan2011/bedgraph Login:encode-box-01 Passwd: enc*deDOWN Notes on using the Aspera client (and EBI ftp), README on procedure used for signal generation
Uniform signal tracks BIGWIG (Jan 2011 Freeze) hg19 Download BIGWIG FILES: encode-box-01@fasp.encode.ebi.ac.uk:byDataType/signal/jan2011/bigwig/ or ftp://ftp-private.ebi.ac.uk/byDataType/signal/jan2011/bigwig Login:encode-box-01 Passwd: enc*deDOWN Notes on using the Aspera client (and EBI ftp) To stream the BIGWIGs to UCSC for custom tracks (without authentication) you can use [1] and [2] (for lab-combined data tracks). Please do not use this link for batch downloads. Instead use the Aspera/FTP download link, README on procedure used for signal generation
Uniform Element Calls (SPP) (Jan 2011 Freeze) hg19 encode-box-01@fasp.encode.ebi.ac.uk:byDataType/peaks/jan2011/spp/optimal or ftp://ftp-private.ebi.ac.uk/byDataType/peaks/jan2011/spp/optimal Login:encode-box-01 Passwd: enc*deDOWN Notes on using the Aspera client (and EBI ftp). Also check out https://spreadsheets.google.com/ccc?key=0Am6FxqAtrFDwdE9pTHgxelBpV28tSFBjeU94TWJUeXc&hl=en&authkey=CLSc98cK for data quality measures and unreliable datasets. These peak calls have been filtered against blacklisted regions.
Uniform Element Calls (Peak Seq) hg19 encode-box-01@fasp.encode.ebi.ac.uk/byDataType/peaks/jan2011/peakSeq/optimal or ftp://ftp-private.ebi.ac.uk/byDataType/peaks/jan2011/peakSeq/optimal Login:encode-box-01 Passwd: enc*deDOWN Notes on using the Aspera client (and EBI ftp). These peak calls have been filtered against blacklisted regions. See https://spreadsheets.google.com/ccc?key=0AjmaDOiaxkCidG00MnhIejVkdXZtby1nc1Vhdm8tQ2c&hl=en_GB&authkey=CIrnpiw for pre-blacklist filtering peak counts.
Factorbook metadata table hg19 googledoc table This is the Factorbook metadata table, including various IDs and HGNC names for factors detected in ChIP-seq.
Pouya's Motif Discovery Pipeline hg19 website see readme on website for information and files containing all matches, motifs
Motifless peaks hg19 ftp://encodeftp.cse.ucsc.edu/users/benbrown/ Login:encode passwd:human This gives Pouya's motif scores and Ben's threshold for "motiflessness" for each ENCODE ChIP-seq peak. There is also a readme explaining the format.
Blacklist regions hg19 Consensus empirical signal-artifact blacklist (BED), Duke repeat-based blacklist (BED), README for blacklists For most types of data the consensus blacklist is good enough. If you want to be more conservative, you can also use the Duke repeat-based blacklist, which eliminates more regions. These extra regions don't seem to show ultra-high signal artifacts.
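
As an illustration of what filtering against a blacklist involves, here is a minimal Python sketch. It assumes BED-style half-open intervals and assumes that any overlap with a blacklisted region removes a peak; the criterion actually used by the uniform peak-calling pipeline is not stated on this page. All names and coordinates are hypothetical.

```python
from typing import List, Tuple

# (chrom, start, end) in BED-style half-open coordinates
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if two half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def filter_blacklisted(peaks: List[Interval], blacklist: List[Interval]) -> List[Interval]:
    """Keep only peaks that overlap no blacklisted region.

    Assumption: any overlap is enough to drop a peak. This is a simple
    O(n*m) scan, adequate for illustration but not for genome-scale files.
    """
    return [p for p in peaks if not any(overlaps(p, b) for b in blacklist)]


# Hypothetical coordinates, for illustration only.
peaks = [("chr1", 100, 200), ("chr1", 500, 650), ("chr2", 100, 200)]
blacklist = [("chr1", 180, 300)]
kept = filter_blacklisted(peaks, blacklist)
```

To be more conservative, the same function could be applied again with the Duke repeat-based regions as the blacklist.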
Data quality
TF ChIPseq quality score summary hg19 Editable spreadsheet with quality+peakcalling summary Quality assessments for TF ChIP-seq datasets from Anshul Kundaje
SPOT quality scores hg19 Spreadsheet with SPOT quality scores Quality metrics for DNase-seq, TF and Histone ChIP-seq, and some RNA-seq datasets from Bob Thurman
Segmentations
Version 8 Segway hg19 Segmentation_progress#Segmentations See links on this page Segway version 8 segmentations.
Version 8 ChromHMM hg19 Segmentation_progress#Segmentations See links on this page ChromHMM version 8 segmentations. Direct link to files is of the form http://www.broadinstitute.org/~jernst/ROUND8_ChromHMM/fourcol_ChromHMM_HEPG2_concatenate_25.bed.gz.
Combined Segmentation hg19 Segmentation_progress#Segmentations See links on this page Combined segmentation. Direct links to files at http://www.ebi.ac.uk/~swilder/Superclustering/concordances4/
RNA
RNA data
RNA datasets through the dashboard hg19 RNA dashboard

How to

Front end to all RNA data generated within ENCODE.
Reflects the contents of the ENCODE DCC test site, but can also contain data not visible at UCSC yet. See also this explanatory note re: RNA elements.
RNA Elements
Long RNASeq Contigs hg19 ftp://ftp2.cshl.edu/transfer/contigs/contigs1.11/ (login:encode password:human): Also available via aspera at encode-box-01@fasp.encode.ebi.ac.uk:byDataType/rna_elements/jan2011/LongRnaSeq/raw (raw file) and encode-box-01@fasp.encode.ebi.ac.uk:byDataType/rna_elements/jan2011/LongRnaSeq/idrFilt (filtered at idr threshold as in the README) (see Notes on using the Aspera client (and EBI ftp)) Contigs (continuous regions covered by uniquely aligned reads) from shotgun long RNA-seq in BED9 format. Contact: Felix Schlesinger, schlesin@cshl.edu. Files are named following this nomenclature.
Short RNASeq Contigs hg19 http://genome.crg.es/~jlagarde/encode/pre-DCC/wgEncodeCshlShortRnaSeq/20110706_short_contigs_gencodev7.tgz (iIDR'd, with smoothing) Contigs (continuous regions covered by uniquely aligned reads) from short RNA-seq in BED9 format. Contact: Felix Schlesinger, schlesin@cshl.edu, Wei Lin wlin@cshl.edu.
Gencode Annotation
Gencode hg19 / v7 Gencode release 7 (ftp) Currently use release 7
Gencode elements and expression from CSHL long RNAseq hg19 / v7 http://genome.crg.es/~jlagarde/encode/pre-DCC/wgEncodeCshlLongRnaSeq/20110713_long_quantifications_gencodev7.tgz Gencode v7 exons/transcripts/genes with expression from 115 CSHL long RNA. Quantifications from CSHL long were obtained using the flux capacitor on the STAR mapping. With iIDR. Contact: julienlag@gmail.com
Gencode elements and expression from CSHL short RNAseq hg19 / v7 http://genome.crg.es/~jlagarde/encode/pre-DCC/wgEncodeCshlShortRnaSeq/20110706_short_contigs_gencodev7.tgz Gencode v7 exons with expression from CSHL short RNASeq. Contact: Wei Lin (wlin@cshl.edu)
Gencode Genome partition hg19 / v7 gen7.Partition.tgz Partitioning the genome into elements using the hierarchy gap-exon-intron-intergenic. Contact: andrea.tanzer@crg.eu
Gencode Genome labeling hg19 / v7 gen7.GenomeLabels.tgz these segments are continuous genome regions with identical annotation (type of element and transcript). Contact: andrea.tanzer@crg.eu
TSS and TTS Files
Consolidated CAGE and Gencode TSS, all Tier 1+2 cell lines hg19 / v7 CAGE+Gencode TSS definition README file Gencode TSSs merged with CAGE tags that have passed appropriate cluster effects. Use this file to partition genome into TSS+ and TSS- regions. Contact: sarah.djebali@crg.eu or timolassmann@gmail.com
Gencode TSS quantitation for predictability hg19 / v7 Gencode TSS quantitation for predictability README file GencodeV7TSS Description This file only includes Gencode v7 TSS (defined as most 5' bp of Gencode transcripts). Each TSS is quantified in 267 RNA experiments. Contact: sarah.djebali@crg.eu
Active Gencode TSS hg19 / v7 Gencode TSS with active CAGE experiment file

name schema

Gencode v7 TSSs that are expressed in at least one CAGE experiment. This removes the ~20% of TSSs with 0 expression in all 69 CAGE experiments from the file above.
  • An active TSS can also be defined as a TSS with >0 expression in at least one cell line under specific conditions (e.g. same technique/RNA extract/cell compartment). For example, TSSs that are active in at least one CAGE.PolyA+.Cytosol experiment can be downloaded as file activeTSS.in.cp_c.tab in directory http://zlab.umassmed.edu/~dongx/encode/Chromatin_Xianjun/data/activeTSS/.

Contact: xianjun.dong@umassmed.edu
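
The "active TSS" definition above (expression >0 in at least one experiment) can be sketched in a few lines of Python. The in-memory dictionary and the TSS names below are hypothetical; the layout of the actual quantification files is described in their README and is not reproduced here.

```python
from typing import Dict, List


def active_tss(expression: Dict[str, List[float]]) -> List[str]:
    """Return TSS identifiers with expression > 0 in at least one experiment.

    `expression` maps a TSS identifier to its values across the selected
    experiments (for example, all CAGE experiments, or only one condition).
    """
    return [tss for tss, values in expression.items() if any(v > 0 for v in values)]


# Hypothetical values, for illustration only.
expression = {
    "tssA": [0.0, 0.0, 0.0],  # zero everywhere: not active
    "tssB": [0.0, 2.5, 0.0],
    "tssC": [1.0, 0.0, 4.2],
}
active = active_tss(expression)
```

Restricting the value lists to one condition before calling the function gives the condition-specific variant described in the bullet above.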

CAGE TSS clusters hg19 HMM-TSSs Timo Lassmann generated a set of transcriptional start sites in all available cell lines based solely on CAGE and a non-supervised TSS classifier. The latter separates CAGE signal into transcription initiation and "other" signal based on local genomic sequences. The prediction accuracy appears to be high (AUC: 0.95). A description can be found here.
Gencode TTS hg19 / v7 Gencode TTS file Gencode v7 TTS file with confidence and list of transcripts they come from. The tts were made exactly the same way as the tss above but looking at the transcript 3' ends. All transcript biotypes were considered. Contact: sarah.djebali@crg.eu
Comparative
Comparative Alignments hg19 Comparative datasets Also back lifted onto hg18
GERP scores on Pouya's Motifs hg19 Conservation of bound motifs When more than one peak file per cell line is provided, we have taken the union of all the peaks (we are planning to add the intersection as well)
Methylation
DNA methylation in Tier 1 and 2 (RRBS) hg19 DNA methylation in Tier 1 and 2 cell lines
Variation
NA12878 personal diploid genome constructed from variants hg18 http://sv.gersteinlab.org/NA12878_diploid

Use May 3, 2011 version (does not include HiSeq SNPs which were not part of the 1KG Paper)

SNPs:

G1K: 2,766,607 = 89% (phased) + 11% (unphased)
Extra HiSeq: 890,475 = 39% (phased, homozygous) + 61% (unphased) <= April 4, 2011 version only
Total: 3,657,082 = 77% (phased) + 23% (unphased)

INDELs:

G1K: 328,528 = 89% (phased) + 11% (unphased)

SVs:

G1K: 1,522 = 77% (phased) + 8% (unphased) + 15% (inconsistent)
Fosmid: 33 = 94% (phased) + 6% (unphased)
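
As a quick consistency check on the SNP figures above, the following Python snippet recomputes the total and the combined phased fraction from the G1K and extra HiSeq rows. Because the per-row percentages are already rounded, the result is only approximate.

```python
# Counts and rounded percentages copied from the SNP rows above.
g1k_snps = 2_766_607
hiseq_snps = 890_475

total = g1k_snps + hiseq_snps

# Weighted phased fraction: 89% of G1K SNPs plus 39% of the extra HiSeq SNPs.
phased = 0.89 * g1k_snps + 0.39 * hiseq_snps
phased_pct = round(100 * phased / total)
unphased_pct = 100 - phased_pct

print(total, phased_pct, unphased_pct)
```

The recomputed total and percentages agree with the "Total" row (3,657,082; 77% phased; 23% unphased).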
GWAS catalog hg19 gwascatalog.june_16_2011.txt

SNP-phenotype associations used in Fig13
SNP unique positions used in Fig13

GWAS catalog SNPs from June 16, 2011. Processed into a bed-like format with selected data from original download. Plus 2 subsets of the first file that are used in Fig 13.
shifted GWAS hg19 gwasShifted3kMax5k.june.sortChr.bed.gz SNP positions from the genotyping SNPs that are 3-5 kb from a GWAS catalog SNP.
Null set hg19 gwasNullSet.tar Null set of matched SNPs. SNPs from Illumina 1M array are matched to GWAS by CEU frequency, distance to TSS, and genomic region (intron, exon, etc.) The SNPs are then randomly selected from the matched SNPs. There are 1000 samples in this set.
Genotyping SNPs hg19 snpArrayIllumnina1M.sortChr.merged.bed.gz SNP positions from Illumina 1M array, excluding chrY.
24 personal genomes hg19 pgSnpsCombined24.hg19.noRand.noY.sortChr.bed.gz SNP positions from 24 personal genomes, excluding those mapped to *random chromosomes and chrY.
69 personal genomes from Complete Genomics hg19 completeGenomics69.bed.gz SNP positions from 69 genomes from Complete Genomics (6 of the individuals are in the set of 24 above, processed by 1000 genomes)
1000 Genomes low coverage SNPs hg19 low_coverage.2010_07.hg19.sorted.bed.gz SNP positions from 1000 Genomes low coverage SNPs July 2010 freeze.
GWAS phenotypes and DHS peaks and TF OS hg19 GWAS_DHSpeaks_all.xlsx

GWAS_TFandDHSpeaks_all.xlsx
GWAS_TFandDHSpeaks_all_subset.xlsx

Tables of intersections of GWAS phenotypes, DHS peaks, and TF OS. Full table of DHS peaks, full table with both, and filtered table with both (as used in Fig 13)
Elements
Element tracks (version 1.2) hg19 ElementTracks_v1.2.tar.gz (readme.txt included in package)

Cell-line-specific predicted active promoters, predicted active enhancers, high occupancy of TF (HOT) and low occupancy of TF (LOT) regions (New in v1.2: switched to Gencode v7)

Single linkage clusters hg19 Download at encode-box-01@fasp.encode.ebi.ac.uk:byDataType/slc/jan2011/ or ftp://ftp-private.ebi.ac.uk/byDataType/slc/jan2011 Login:encode-box-01 Passwd: enc*deDOWN Single linkage clusters of ChIP-seq peaks, RNA elements and Gencode exons for the Tier 1 and Tier 2 cell lines (separately and combined) + Dnase clusters across cell lines
Fish and Mouse Experiments hg19 Media:encode_fish_mouse.Rdata Rdata package with Fish and Mouse data frames Contact Ewan Birney (birney@ebi.ac.uk) for details
Jim Kent's Enhancer Picks hg19 Files_for_ENCODE_Analysis#Enhancer_Picks
UMASS Dekker 5C Looping Interactions hg19 5C Looping Interactions for K562, GM12878, H1-hESC and HeLa-S3 cells. (see links on page) 5C Looping interactions between TSS and Distal Elements. Detailed information available at dekkerlab website (see link - login info: encode/human)
DHS
DHS tracks hg19 dhs merged bed file (.gz) comprehensive set, made from UW and Duke ENCODE tracks (genome-preview Aug 22, 2011)

UW "hot spots" tracks:

"DNaseI Hypersensitivity by Digital DNaseI from ENCODE/University of Washington", including all 84 cell types.

Duke peak tracks:

"Open Chromatin by DNaseI HS from ENCODE/OpenChrom(Duke University)", all cell types
"Open Chromatin by FAIRE from ENCODE/OpenChrom(UNC Chapel Hill)", all cell types
Uniform processing of Duke/UW DnaseI data (hotspot pipeline) hg19 To reduce a source of variability between UW and Duke DNaseI processed datasets, the Stam lab did the following with the Duke DNase-seq aligned reads, to more closely match UW processing:
  • Combined all replicates for a given cell-type
  • Subsampled the result at a level of 30 million tags
  • Ran results through the Stam lab hotspot pipeline

Results are here, alongside the UW replicate 1 calls, which are identical to what are available on the browser. Note that hotspots (broadPeaks in the browser) should be thought of as regions of generalized chromatin accessibility, loosely thresholded (z score of 2), and of variable size. FDR 1% peaks (narrowPeaks) are arrived at by first thresholding hotspots (using random simulation) at FDR 1%, and then (essentially) locating local maxima of the tag density (150bp window, sliding every 20bp) within the hotspots. FDR 1% peaks are set to a fixed width of 150bp.

The combined calls are just a union of the calls for both groups, except for those 14 cell-types where both groups have data. In the latter case, a collapsed set of FDR 1% peaks is generated by taking a non-overlapping selection of the calls from both centers and, when calls do overlap, by giving preference to the peak that has the higher z-score. A collapsed set of hotspots on these cell-types is generated by simply merging the calls from both centers (taking the union interval of overlapping intervals). The "master list" is the union of FDR 1% peaks across all cell-types, removing overlaps by again giving preference to peaks that have the higher z-scores. Note especially that, by construction, the total genomic territory covered by the master list will be less than the territory covered by merging all the peaks from each cell-type.
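
The two collapsing strategies described above can be sketched in Python as follows. This is one possible reading, not the Stam lab implementation: the peak collapse is written as a greedy pass in decreasing z-score order, and the names `Peak`, `collapse_by_zscore` and `merge_intervals` as well as the example coordinates are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Peak(NamedTuple):
    chrom: str
    start: int  # BED-style half-open coordinates
    end: int
    z: float


def collapse_by_zscore(peaks: List[Peak]) -> List[Peak]:
    """Greedy non-overlapping selection that prefers higher z-scores.

    Assumption: visit peaks from highest to lowest z-score and keep a peak
    only if it overlaps none of the peaks already kept.
    """
    kept: List[Peak] = []
    for p in sorted(peaks, key=lambda q: q.z, reverse=True):
        if all(p.chrom != k.chrom or p.end <= k.start or k.end <= p.start for k in kept):
            kept.append(p)
    return sorted(kept, key=lambda q: (q.chrom, q.start))


def merge_intervals(intervals: List[Tuple[str, int, int]]) -> List[Tuple[str, int, int]]:
    """Union of overlapping intervals, as described for the collapsed hotspots."""
    merged: List[Tuple[str, int, int]] = []
    for chrom, start, end in sorted(intervals):
        # Strict overlap only: intervals that merely touch are kept separate.
        if merged and merged[-1][0] == chrom and start < merged[-1][2]:
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical 150bp peaks from two centers, for illustration only.
uw = [Peak("chr1", 100, 250, 5.0), Peak("chr1", 1000, 1150, 3.0)]
duke = [Peak("chr1", 200, 350, 8.0), Peak("chr1", 2000, 2150, 2.5)]

collapsed = collapse_by_zscore(uw + duke)
merged = merge_intervals([(p.chrom, p.start, p.end) for p in uw + duke])

collapsed_bp = sum(p.end - p.start for p in collapsed)
merged_bp = sum(end - start for _, start, end in merged)
```

In this toy example the collapsed peaks cover fewer bases than the merged intervals, which illustrates why the master list covers less territory than a plain merge of all peaks.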

Validation
Mapping of Greg Crawford's Insulator constructs to Version 8 segmentations hg19 Greg_insulators.segments.txt Tab delimited table. Fields are "Type Chr Beg End Segway.11.state ChromHMM.10.state Composite.state Score". Within each state field the state names are listed in a comma separated list with the fraction of the construct in that state separated by a colon, e.g. T:0.413,WE:0.082. Fractions may not add to exactly 1, either because the construct is not completely covered by segments for the composite segmentation, or due to rounding.
Mapping of ENCODE enhancer validation constructs to Version 8 segmentations hg19 ENCODE_VALIDATION_PHASE1.segments.txt Tab delimited table. See format description above.
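
A minimal Python sketch for reading one row of the segment-mapping tables described above. The field names come from the format description; the example row, the handling of an empty state field, and the decision to leave Score as a string are assumptions, since the files themselves are not shown here.

```python
from typing import Dict, List

FIELDS: List[str] = [
    "Type", "Chr", "Beg", "End",
    "Segway.11.state", "ChromHMM.10.state", "Composite.state", "Score",
]
STATE_FIELDS = FIELDS[4:7]


def parse_state_field(field: str) -> Dict[str, float]:
    """Parse 'T:0.413,WE:0.082' into {'T': 0.413, 'WE': 0.082}."""
    states: Dict[str, float] = {}
    for item in field.split(","):
        if not item:  # assumption: an empty field means no state is listed
            continue
        name, _, fraction = item.partition(":")
        states[name] = float(fraction)
    return states


def parse_row(line: str) -> dict:
    """Split one tab-delimited line into a dict keyed by field name."""
    row: dict = dict(zip(FIELDS, line.rstrip("\n").split("\t")))
    row["Beg"] = int(row["Beg"])
    row["End"] = int(row["End"])
    for key in STATE_FIELDS:
        row[key] = parse_state_field(row[key])
    return row  # Score is left as a string; its type is not documented here


# Hypothetical row, for illustration only.
example = "insulator\tchr1\t1000\t2000\tT:0.413,WE:0.082\tT:1.0\tT:0.5,WE:0.5\t0.7"
row = parse_row(example)
covered = sum(row["Segway.11.state"].values())
```

Summing the fractions of a state field, as in `covered`, shows how much of a construct is accounted for, which is useful given that the fractions may not add to exactly 1.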


Post Jan 2011 Freeze data (Obsolete data, e.g. Gencode v3c)

Do not use these data for ongoing analysis post July 2011.


Site Build Location Notes
Segmentations
Version 7 Segmentations hg19 Segmentation_progress#Segmentations See links on this page
Gencode Annotation
Gencode hg19 / v3c Gencode release 3c (ftp) Do not use for analysis from July 2011
Gencode elements and expression from CSHL long RNAseq hg19 / v3c Gencode exon file with expression

Gencode transcript file with expression

Gencode gene file with expression

Gencode v3c exons/transcripts/genes with expression from 115 CSHL long RNA. Quantifications from CSHL long were obtained using the flux capacitor on the STAR mapping. Contact: sarah.djebali@crg.eu
Gencode Genome partition hg19 / v3c Gen3c.unstranded.partition.gtf.gz Gen3c.stranded.partition.gtf.gz

Partitioning the genome into elements using the hierarchy gap-exon-intron-intergenic.

Gencode Genome labels hg19 / v3c Gen3c.unstranded.txtype_labels.gtf.gz Gen3c.stranded.txtype_labels.gtf.gz

Labelling of each nucleotide with transcript_type and element, represented as intervals in gtf format.

TSS and TTS Files
Consolidated CAGE and Gencode TSS, all Tier 1+2 cell lines hg19 / v3c TSS cluster file README file Gencode TSSs merged with CAGE tags that have passed appropriate cluster effects. Use this file to partition genome into TSS+ and TSS- regions
Gencode TSS (all transcript biotypes) hg19 / v3c Gencode TSS file (all biotypes) Gencode v3c TSS file from all transcript biotypes. Low confidence TSS are probably not TSSs. Contact: sarah.djebali@crg.eu
Gencode TSS (7 long transcript biotypes) with expression hg19 / v3c Gencode TSS file (7 biotypes) with expression Gencode v3c TSS from 7 long transcript biotypes with expression from 62 CAGE, 12 PET, 11 small and 115 CSHL long RNA. Low confidence TSS are probably not TSSs. Quantifications from CSHL long RNAseq were obtained using the flux capacitor on the STAR mapping. See Vignette R04 for some results about the tss and the way the file was obtained. Contact: sarah.djebali@crg.eu
Gencode TTS hg19 / v3c Gencode TTS file Gencode v3c TTS file with confidence and list of transcripts they come from. The tts were made exactly the same way as the tss above but looking at the transcript 3' ends. Only transcripts from protein_coding, NMD, ambiguous_orf, non_coding, antisense, processed_transcript and retained_intron transcript biotypes were considered. Contact: sarah.djebali@crg.eu


Autumn 2010 Data Links (hg19)


Site Build Location Notes
UCSC Test hg19 http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/encodeDCC/ Since the public site is behind the test site in terms of data availability, for most AWG analyses you will want to use the test site.
UCSC Public hg19 http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/
Uniform signal tracks (July 2010 Freeze) hg19 EBI Mirror (BEDGRAPH, BIGWIG)

UCSC Mirror (BEDGRAPH, BIGWIG)

Signal files were generated for all July 2010 lab-submitted BAM files using align2rawsignal. The signal values are fold-changes with respect to an equivalent uniform distribution, so the values are reasonably comparable across datasets, cell lines and labs. There is one signal file for each dataset (with replicates combined).
Uniform raw signal tracks (Jan 2010 Freeze) hg19 U. Wash http Note: this is from Anshul/Michael's remapping of tier1 data from the Jan 2010 freeze
Uniform raw signal tracks (Jan 2010 Freeze) hg19 ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20100629_ENCODE_data/ use Aspera client (Instructions here) for fast download. Note: this is from Anshul/Michael's remapping of tier1 data from the Jan 2010 freeze
Uniform Element Calls (SPP) (July 2010 Freeze) hg19 SPP peak calls (NarrowPeak format) see also http://spreadsheets.google.com/ccc?key=0Am6FxqAtrFDwdFNGNThjWDQteTJZU3Rfd21pelFLS0E&hl=en . Editable version with comments from production groups is https://spreadsheets.google.com/ccc?key=0Am6FxqAtrFDwdE5lVFllaWgtbWhfbmF2TXdTZXRoUFE&hl=en&authkey=CILzpeIH
Uniform Element Calls (Peak Seq) hg19 Pending.
Comparative Alignments hg19 Comparative datasets Also back lifted onto hg18
RNA datasets hg19 RNA dashboard Still under construction. All data should make it to the ENCODE DCC site, but the dashboard will also have data not visible at UCSC yet.
Gencode hg19 Gencode release 3c (ftp) Currently use release 3c
Pouya's Motif Discovery Pipeline hg19 http://www.broadinstitute.org/~pouyak/encode-motif-disc-jun2010/ All Motifs and all matches
GERP scores on Pouya's Motifs hg19 Conservation of bound motifs When more than one peak file per cell line is provided, we have taken the union of all the peaks (we are planning to add the intersection as well)


This is for older datasets; note the build shift.

Site Build Location Notes
UCSC Test hg18 http://hgdownload-test.cse.ucsc.edu/goldenPath/hg18/encodeDCC/
UCSC Public hg18 http://hgdownload.cse.ucsc.edu/goldenPath/hg18/encodeDCC/
RNA dashboard hg18 RNA datasets Awaiting new version for hg19
Genome Segmentations hg18 Segmentations for March 2010 meeting
Uniform Element Calls (SPP) hg18 SPP peak calls (NarrowPeak format)
Uniform Element Calls (Peak Seq) hg18 PeakSeq peak calls (NarrowPeak format)
SNPs in ChIP-seq Peaks hg18 1000 genomes SNPs in ChIP-seq peaks Simon from Zhiping's group has processed all of the Hudson Alpha, Broad and Yale groups' ChIP-seq data against the released 1000 Genomes SNPs from GM12878.