NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.
Hg19 (GRCh37) Migration Planning Working Group
The goal of this working group is to discuss issues surrounding the migration of the ENCODE project from NCBI build36 (UCSC hg18) to GRCh37 (UCSC hg19) and to define a plan and timeline.
In 2010, the ENCODE Consortium will transition to using the GRCh37 (hg19) human genome assembly. This will require data providers to change their analysis pipelines to generate processed data referenced to the new assembly. In conjunction with this move, the DCC recommends two other changes to the submission process:
1) Adoption of SAM/BAM format for sequence mappings. This format is a standard that has emerged since the ENCODE production phase began. It was developed by the 1000 Genomes project, and has been adopted already by modENCODE. This format would replace UCSC's tagAlign and pairedTagAlign formats.
2) Submission of short read sequences from ENCODE labs to a public archive (NCBI or EBI implementations of the SRA) with accessions reported to the DCC. (Sequences already submitted to the DCC will be accessioned by the DCC on behalf of the submitting lab.)
The migration period is also a convenient time for the DCC to update some pipeline processing:
1) Represent 'wiggle' data with the newer 'bigWig' format. 'BigWig' format is higher-resolution than 'wiggle' and has a smaller database footprint. It also supports high-performance whole-genome custom tracks, so production labs may want to represent their wiggle data in this format so they can preview data in the browser without requiring wrangler assistance. The DCC can implement pipeline support for bigWig if the data providers find this convenient.
2) Update file and database table naming conventions to distinguish migrated from newer data and human from mouse, and to shorten names (e.g. wgEncode prefix to be replaced by encHs or encMm).
Another component of the transition will be migrating previously submitted (hg18-referenced) ENCODE data to the new assembly. Using migration tools to 'lift' annotations by converting genome coordinates may be unsuitable for some types of processed data; remapping and recalling of regions of significance may be the preferred approach. This determination should be made by the data providers for their data types and file types.
UCSC uses Chain format to represent differences between genome assemblies, and provides the liftOver tool to convert file coordinates using the liftOver chains. (Briefly, same-species liftOver chains are blat alignments that are processed with the UCSC pairwise alignment chain/net pipeline, followed by extraction of chains that remain after the net processing).
GRCh37 and Liftover
The Genome Reference Consortium has a description of the new assembly here.
Below are some links related to UCSC liftOver tool and data:
liftOver <in.bed> <lift.chain> <out.bed> <unmapped.txt> where <unmapped.txt> is the name of an output file that will contain the list of items that could not be mapped, with the reason.
liftOver k562_rep1.hg18.tagAlign.gz hg18ToHg19.over.chain.gz k562_rep1.hg19.tagAlign k562_rep1.tagAlign.unmapped
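After a lift like the one above, it can be useful to summarize why items failed to map. The following is a minimal Python sketch, not part of the DCC pipeline. It assumes that each unmapped record in the unmapped file is preceded by a reason line starting with '#'; the function name and the example records are hypothetical.

```python
from collections import Counter
from typing import Iterable


def count_unmapped_reasons(lines: Iterable[str]) -> Counter:
    """Tally the '#'-prefixed reason lines of a liftOver unmapped file.

    Assumption: every unmapped record is preceded by one line that
    starts with '#' and states why the record could not be mapped.
    """
    reasons = Counter()
    for line in lines:
        line = line.strip()
        if line.startswith("#"):
            # Drop the leading '#' and surrounding whitespace.
            reasons[line.lstrip("#").strip()] += 1
    return reasons


# Made-up example content, for illustration only.
example = [
    "#Deleted in new",
    "chr1\t100\t136\tACGT\t1000\t+",
    "#Partially deleted in new",
    "chr2\t500\t536\tTTGA\t1000\t-",
    "#Deleted in new",
    "chr3\t900\t936\tGGCC\t1000\t+",
]

for reason, count in count_unmapped_reasons(example).most_common():
    print(f"{count}\t{reason}")
```

To run this on a real file, pass an open file handle in place of the example list, e.g. open("k562_rep1.tagAlign.unmapped").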
To provide a visual guide to differences between these assemblies, the hg18 and hg19 human browsers on the UCSC genome-test browser now have liftOver chain tracks available in the Mapping and Sequencing track group (Hg19 Chain, Hg18 Chain). NB: The description pages are not entirely correct (they are from cross-species tracks). To browse, use:
- Split-frame (hg18 and hg19) browser with liftover chains by chromosome.
Some statistics about assembly diffs suggested by Ali:
1) by chrom, how much new sequence (not attributable to gaps)
2) by chrom, how much sequence moved from chrom to random, and vice-versa
3) For non_random chroms, how many new nucleotides, how many changed nucleotides
Information about use of this format is posted here.
Accessioning short reads
Links to the public archives for sequence reads are here:
The archives operate as a single consortium for sequence read archiving, with identical XML and data formats and daily exchange of data. This lets labs submit to whichever archive is geographically convenient and provides worldwide redundancy of the data.
Both support high-speed file upload via Aspera Connect.
Which sequences to use?
GRCh37 (hg19) has more sequences, and more alternate sets of sequences than UCSC hg18:
hg18: 49 sequences (25 chroms, 20 randoms, 4 alternates)
hg19: 93 sequences (25 chroms, 59 randoms, 9 alternates)
The full list of GRC/hg19 sequences, with sizes, is here: http://genome.ucsc.edu/cgi-bin/hgTracks?chromInfoPage=
Input received so far suggests the following list for use in ENCODE mapping and reporting:
- Include mitochondrion
- Exclude alternate sequences (*hap*)
- Exclude randoms
- Create a female genome consisting of autosomes and chrX, and a male genome consisting of autosomes, chrX, and chrY with the PAR regions masked.
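The selection rules above can be sketched in Python as a filter over UCSC sequence names. This is an illustration under stated assumptions, not the DCC's procedure: it assumes that primary chromosome names contain no underscore, that alternates contain "_hap" (as in the *hap* pattern above), and that every other underscore-containing name is a random. The function names are hypothetical.

```python
def classify_sequence(name: str) -> str:
    """Classify a UCSC sequence name as 'chrom', 'alternate', or 'random'.

    Assumptions: alternates contain '_hap'; any other name with an
    underscore is a random; names without an underscore are chroms.
    """
    if "_hap" in name:
        return "alternate"
    if "_" in name:
        return "random"
    return "chrom"


def select_for_mapping(names, sex):
    """Return the sequences to keep for the 'female' or 'male' genome."""
    if sex not in ("female", "male"):
        raise ValueError("sex must be 'female' or 'male'")
    selected = []
    for name in names:
        if classify_sequence(name) != "chrom":
            continue  # exclude alternates and randoms
        if name == "chrY" and sex == "female":
            continue  # female genome has no chrY
        selected.append(name)  # chrM is kept: it classifies as 'chrom'
    return selected


# A small subset of names, for illustration only.
names = ["chr1", "chrX", "chrY", "chrM",
         "chr6_apd_hap1", "chr19_gl000209_random", "chrUn_gl000211"]
print(select_for_mapping(names, "female"))
print(select_for_mapping(names, "male"))
```

Masking of the PAR regions on chrY is a separate step applied to the sequence itself and is not handled by this name filter.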
The exclusion of randoms warrants some discussion, as there are genes on the randoms -- a total of 16 RefSeqs (60 UCSC genes) on 5 randoms, with the preponderance (11 RefSeqs) on a single random (chr19_gl000209). The list of genes is here: Media: Hg19randomgenes.bed.
It may be worth having separate lists of sequences used for mapping (e.g. include randoms), and a subset (e.g. w/o randoms) used for analysis.
Note that in the future we may want to instantiate the genome of NA12878, which has been sequenced by the 1000 Genomes project. This would entail putting all of the sequence variants found in NA12878 (both SNPs and SVs) into the "reference" used for ENCODE. Note also that the 1000 Genomes production project maps to a slightly different reference.
We have also decided to create two genome sequences to align to, one constituting the female genome (no chrY) and one constituting the male genome (chrY included). Since the PAR1/PAR2 regions are exactly duplicated in chrX and chrY, these regions would be masked (replaced by N's) in chrY so that they appear only once in the male genome.
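The masking step described above can be illustrated with a short Python sketch. It replaces the given intervals of a sequence with N's. The function name is hypothetical, the intervals are assumed to be 0-based and half-open, and the toy sequence and coordinates below are made up; real PAR1/PAR2 coordinates on chrY are not given here and would need to be supplied.

```python
def mask_regions(sequence: str, regions) -> str:
    """Return a copy of `sequence` with each (start, end) interval set to N.

    Assumption: intervals are 0-based and half-open, i.e. `start` is
    included and `end` is excluded.
    """
    masked = list(sequence)
    for start, end in regions:
        if not 0 <= start <= end <= len(sequence):
            raise ValueError(f"interval out of range: ({start}, {end})")
        # Overwrite the interval with the same number of N characters.
        masked[start:end] = "N" * (end - start)
    return "".join(masked)


# Toy example: mask the first 3 and the last 2 bases of a 10-base sequence.
toy_chr_y = "ACGTACGTAC"
print(mask_regions(toy_chr_y, [(0, 3), (8, 10)]))
```

Because the masked sequence keeps its original length, chrY coordinates outside the masked regions are unchanged.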
Regarding sequence names, it has been requested that GRC names be used (e.g. '1' instead of 'chr1'). The DCC proposes using GRC names in SAM/BAM files submitted for ENCODE, but continuing the use of 'chr*' nomenclature in BED, wiggle, and other UCSC format files.
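For the numbered and sex chromosomes, the naming proposal above amounts to dropping the 'chr' prefix when writing SAM/BAM. A minimal Python sketch follows; the function name is hypothetical, and the mitochondrion, randoms, and alternates are deliberately left out because their GRC names would need an explicit lookup table rather than a prefix rule.

```python
# UCSC -> GRC names for the primary chromosomes only (chr1..chr22, chrX, chrY).
PRIMARY_UCSC_TO_GRC = {f"chr{n}": str(n) for n in range(1, 23)}
PRIMARY_UCSC_TO_GRC.update({"chrX": "X", "chrY": "Y"})


def ucsc_to_grc(name: str) -> str:
    """Translate a UCSC primary chromosome name to its GRC-style name.

    Raises ValueError for any other sequence, so that unhandled names
    are not silently passed through.
    """
    try:
        return PRIMARY_UCSC_TO_GRC[name]
    except KeyError:
        raise ValueError(f"no GRC name defined in this sketch for {name!r}") from None


print(ucsc_to_grc("chr1"))
print(ucsc_to_grc("chrX"))
```

Keeping the table explicit means BED, wiggle, and other UCSC-format files can continue to use 'chr*' names, with translation applied only where SAM/BAM output is written.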
The FASTA and twoBit files for the male and female genomes can be found here using HTTP:
FTP instructions from hgdownload to follow.
- The official transition to the new genome assembly and formats will begin on Jan. 15, 2010.
- This date will open a migration window during which data submissions are suspended while the production labs and DCC complete work on tools and processes to handle the new assembly, and migrate previously generated data. The migration window is expected to take 2-3 weeks.
- When submissions resume, all new data will be referenced to GRCh37. Sequence reads will be submitted to the SRA or ERA. Alignments will be accepted in SAM/BAM format (and for a transitional period, in tagAlign as well).
- Labs that prefer to remap and recall their data will notify the DCC by Jan. 15 that they wish to opt out of the lift process. Labs that have experiments that span the migration window (e.g. replicate 1 before Jan. 15 and replicate 2 afterwards) may submit the remaining files for the submission in hg18, but should determine ahead of the freeze which datasets will be affected, and provide a list of these to the DCC.
In summary, Jan. 15 to early February 2010 will be a migration window for ENCODE, during which:
- Data submissions are suspended
- Labs complete pipeline transition (GRCh37, SAM, SRA/ERA)
- UCSC lifts datasets to hg19
- UCSC completes SAM support in pipeline
- UCSC submits existing fastQ's to SRA