NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.
This page contains information about the cloning-free RNAPET (27/27) that the Genome Institute of Singapore (GIS) is producing for ENCODE.
9 long poly A+ RNAPET (27/27) libraries have been constructed (as of the time this page was last updated):
1) K562 Cytosol
2) GM12878 Cytosol
3) NHEK Cytosol
4) NHEK Nucleus
6) HelaS3 Cytosol
7) HelaS3 Nucleus
8) HUVEC Cytosol
9) HUVEC Nucleus
Library construction and RNAPET (27/27) production:
RNA-PET is a modified version of the earlier “GIS-PET” method, which was published and described in Nature Methods 2005. The new version of the PET construct has two major improvements. First, it has longer (27-27 bp) paired-end tags, extracted from the 5’ and 3’ ends of full-length cDNAs (flcDNAs) using the type III restriction enzyme EcoP15I. Second, construction of the new PET eliminates the expensive and time-consuming bacterial cloning procedures by using a cloning-free approach, which significantly reduces reagent cost (6-fold) and construction time (2- to 3-fold).
To start construction of an RNA-PET library, approximately 5 micrograms of poly(A) mRNA are required for reverse transcription to generate cDNA. The Cap-Trapper method is used to select full-length cDNA (flcDNA) molecules (Meth. Enzym. 1999; Nature Methods 2005), and the collected flcDNAs are then modified, ligated at both ends to specific DNA linkers, and circularized by ligation of the linkers. The 5’ and 3’ ends (27-27 bp) of the flcDNAs are then extracted by EcoP15I digestion, and the extracted PET structures [3’-27bp tag--linker sequence--27bp tag-5’] are purified from the mixture with streptavidin magnetic beads. Solexa sequencing adaptors are then ligated to both ends, and the constructs are amplified by PCR to generate sequencing templates. The sequencing template structure is shown below:
PET sequencing and mapping:
Solexa paired-end (PE) sequencing is performed and 36-bp paired-end reads are generated, each consisting of the 27-bp tag and a 9-bp linker sequence as shown in the diagram above. After noise reads are filtered out, the 3’-end tags are identified by searching for a signature sequence (AACTGCTG) characteristic of the 3’-end tag. The signature sequence is searched for in both reads of every PET. If the signature sequence is found in exactly one read of a PET, that read is considered the 3’-end tag and the other read the 5’-end tag. PETs in which no signature sequence can be identified or, rarely, in which the signature sequence appears on both ends are discarded without further analysis.
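The orientation rule can be sketched in a few lines of Python. This is only an illustration, not the GIS pipeline: the function name `orient_pet` is hypothetical, and the sketch assumes the signature may occur anywhere in a read and must match exactly, since the page does not state its position or any mismatch tolerance.

```python
SIGNATURE = "AACTGCTG"  # 3'-end signature sequence quoted on this page


def orient_pet(read1, read2, signature=SIGNATURE):
    """Return (five_prime_read, three_prime_read), or None if the PET is discarded.

    Assumption: a plain exact substring search stands in for the real
    signature search, whose details are not given on this page.
    """
    in_read1 = signature in read1
    in_read2 = signature in read2
    if in_read1 == in_read2:
        # Signature found in neither read or in both reads: discard the PET.
        return None
    # The read carrying the signature is the 3'-end tag.
    return (read2, read1) if in_read1 else (read1, read2)


# Made-up 36-bp reads, for illustration only.
with_signature = "A" * 20 + SIGNATURE + "C" * 8
without_signature = "G" * 36
print(orient_pet(with_signature, without_signature))
```

A PET whose two reads both lack the signature, or both contain it, returns `None` and would be dropped before mapping.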
RNA-PET mapping is done using the Solexa ELAND pipeline with a seed of 24 bp, allowing up to 2 mismatches in the seed sequence of each tag. The orientation-determined PETs are mapped to the reference genome, and PETs whose 5’ and 3’ ends both map uniquely are classified as uniquely mapped PETs. The majority of PETs (~90%), defined as concordant PETs, map on the same chromosome, on the same strand and in the same direction as known transcripts or splice variants. A small portion (~10%), referred to as discordant PETs, map in the wrong orientation (e.g., the 3’-end tag maps before the 5’-end tag), on different strands, or on different chromosomes (e.g., one end maps to chromosome 3 and the other to chromosome 8). Discordant PETs indicate transcription variations that could be caused by genome rearrangements such as deletion, inversion, tandem replication or translocation, or by trans-splicing. In the current ENCODE datasets, only concordant PETs are submitted to UCSC.
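The concordant/discordant distinction can be illustrated with a simplified Python sketch. It is not the ELAND-based pipeline: the `Tag` type and `classify_pet` function are hypothetical, both tags are assumed to be already uniquely mapped, positions are assumed to be genomic coordinates, and no comparison against known transcript annotation is attempted.

```python
from typing import NamedTuple


class Tag(NamedTuple):
    """Hypothetical record for one uniquely mapped tag."""
    chrom: str
    strand: str  # "+" or "-"
    pos: int     # genomic coordinate of the mapped tag


def classify_pet(five, three):
    """Label a PET as concordant, or give the reason it is discordant."""
    if five.chrom != three.chrom:
        return "discordant: different chromosomes"
    if five.strand != three.strand:
        return "discordant: different strands"
    # On the + strand the 5' tag should precede the 3' tag; on the - strand
    # the order of genomic coordinates is reversed.
    if five.strand == "+":
        in_order = five.pos < three.pos
    else:
        in_order = five.pos > three.pos
    return "concordant" if in_order else "discordant: wrong orientation"


print(classify_pet(Tag("chr3", "+", 1000), Tag("chr3", "+", 5000)))
print(classify_pet(Tag("chr3", "+", 1000), Tag("chr8", "+", 5000)))
```

Only PETs labelled `concordant` by a check of this kind would proceed to clustering.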
After concordant PETs are identified, they are clustered using a 200-bp extension search window around each paired-end tag at the 5' and 3' ends. Specifically, the mapping locations of the 5' and 3' tags of a given PET are extended by 200 bp in both directions, creating 5' and 3' search windows, respectively. If the 5' and 3' tags of a second PET map within the 5' and 3' search windows of the first PET, the two PETs are clustered and the search windows are re-adjusted and expanded to capture potential new PETs. This process is iterative and continues until no new PETs are found within the search windows. All clusters are formed in this way; PETs that do not fall within the search windows of any other PET are defined as singletons and filtered out from further analysis. In short, for PETs to be clustered, both the 5' and 3' tags must lie within 200 bp of the respective 5' and 3' tags of other PETs. In the submitted ENCODE datasets, every cluster contains at least 2 PETs.
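The iterative window expansion can be sketched as follows. This is a minimal illustration under stated assumptions, not the production clustering code: the `PET` record and `cluster_pets` function are hypothetical, each tag is reduced to a single coordinate, PETs are only clustered with others on the same chromosome and strand, and the search windows are taken to be the span of the cluster's tags extended by 200 bp on each side.

```python
from collections import namedtuple

# Hypothetical record: one coordinate per tag of a concordant PET.
PET = namedtuple("PET", "chrom strand five three")


def cluster_pets(pets, ext=200, min_count=2):
    """Greedy clustering with 5' and 3' search windows that grow with the cluster."""
    remaining = list(pets)
    clusters = []
    while remaining:
        seed = remaining.pop(0)
        members = [seed]
        lo5 = hi5 = seed.five
        lo3 = hi3 = seed.three
        grew = True
        while grew:  # repeat until a full pass adds no new PET
            grew = False
            not_joined = []
            for p in remaining:
                same_locus = p.chrom == seed.chrom and p.strand == seed.strand
                in_5p_window = lo5 - ext <= p.five <= hi5 + ext
                in_3p_window = lo3 - ext <= p.three <= hi3 + ext
                if same_locus and in_5p_window and in_3p_window:
                    members.append(p)
                    # Re-adjust both windows to include the new PET.
                    lo5, hi5 = min(lo5, p.five), max(hi5, p.five)
                    lo3, hi3 = min(lo3, p.three), max(hi3, p.three)
                    grew = True
                else:
                    not_joined.append(p)
            remaining = not_joined
        clusters.append(members)
    # Singletons (clusters below min_count) are filtered out.
    return [c for c in clusters if len(c) >= min_count]


example = [
    PET("chr1", "+", 1000, 5000),
    PET("chr1", "+", 1300, 5250),  # joins only after the windows expand
    PET("chr1", "+", 1150, 5100),
    PET("chr1", "+", 9000, 12000),  # singleton
    PET("chr2", "+", 1000, 5000),   # singleton on another chromosome
]
result = cluster_pets(example)
print(len(result), [len(c) for c in result])
```

In the example, the second PET lies more than 200 bp from the first and is only picked up on a later pass, after the third PET has widened the windows; this is the reason the procedure has to iterate.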