NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.

Long RNA-Seq

This page describes the Long RNA-Seq production pipeline that the Gingeras lab is running for ENCODE.

Cell Culturing and RNA Isolation:

The Gingeras lab is culturing cells for RNA isolation and distribution. Biological replicates (bioreps) are prepared whenever possible for all cell lines. The replicates are cultured independently, and their RNA isolations and subsequent manipulations are kept separate at every step. When possible we aim to grow enough cells to isolate RNA not only from Whole Cells (WC) but also from the Nuclear (N) and Cytoplasmic (C) compartments of each biorep. Each biorep is then issued an RNA ID# (for details see the online Google Doc linked from the main Transcriptome page).

For example, the IDs "001WC", "001C" and "001N" mean that we cultured one large batch of cells and split it into 2 sub-batches at the time of RNA isolation. From one sub-batch we isolated Total RNA from Whole Cells. We lysed the cells of the other sub-batch and fractionated them into Cytoplasm and Nucleus. The isolated RNAs are then partitioned into Long (>200 nt) and Small (<200 nt) RNAs. We take BioAnalyzer images of the final RNAs prior to distribution to ensure they are of good quality.
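
As an illustration of the naming scheme in the example above, the following Python sketch splits an RNA ID into its batch number and compartment code. The pattern and the `COMPARTMENTS` mapping are assumptions inferred only from the three example IDs; the authoritative ID scheme is the one in the Google Doc, and `parse_rna_id` is a hypothetical helper, not part of the lab's tooling.

```python
import re
from typing import NamedTuple

# Assumed mapping, inferred only from the example IDs "001WC", "001C", "001N".
COMPARTMENTS = {"WC": "Whole Cell", "C": "Cytoplasm", "N": "Nucleus"}

# Digits (the batch number) followed by one of the known compartment codes.
_ID_PATTERN = re.compile(r"^(\d+)(WC|C|N)$")


class RnaId(NamedTuple):
    batch: str        # e.g. "001"; kept as a string to preserve leading zeros
    compartment: str  # human-readable compartment name


def parse_rna_id(rna_id: str) -> RnaId:
    """Split an RNA ID such as '001WC' into batch and compartment."""
    match = _ID_PATTERN.match(rna_id.strip())
    if match is None:
        raise ValueError(f"unrecognized RNA ID: {rna_id!r}")
    batch, code = match.groups()
    return RnaId(batch=batch, compartment=COMPARTMENTS[code])
```

IDs that share a batch number (here "001") come from the same culture, so grouping on `batch` recovers which Whole Cell, Cytoplasm and Nucleus samples belong together.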

Prior to sample distribution we will generate a "Production Document", one per library. This document will contain detailed information about the methods of RNA isolation. It will be passed to the production groups, who will add their library generation protocol and QC images. It will be submitted to UCSC alongside the data.

Long-RNA Seq Library Generation:

We have decided to use the published T -> U protocol to generate our libraries. We are focusing on making libraries from Poly-A+ RNA for each biorep. We are also exploring methods of making libraries from the same bioreps to obtain sequence for:

  • Poly-A(-) RNA
  • Sites of Polyadenylation from A+ RNA
  • Some others....

When possible we include spike-ins. We are currently adding the NIST beta test (pool 14) spike-ins to each Long RNA-Seq library.

For each library we are supplying a document that contains:

  • RNA Isolation Methods
  • Library Protocol
  • Spike-In Info
  • Gel and BioAnalyzer Q.C. images

Example Library Generation Metadata document:

The A+ libraries are being sequenced in PE76 format. We are aiming for more than 100 million individual reads per library (~3 lanes for each biorep).
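
The target above implies roughly 33 million reads per lane (100 million / 3). The lane-count arithmetic can be sketched as follows; the per-lane yield used in the usage note is a hypothetical figure chosen for illustration, not a measured GAIIx value.

```python
def lanes_needed(target_reads: int, reads_per_lane: int) -> int:
    """Smallest number of lanes whose combined yield reaches target_reads."""
    if target_reads <= 0 or reads_per_lane <= 0:
        raise ValueError("target_reads and reads_per_lane must be positive")
    # Integer ceiling division, avoiding floating-point rounding.
    return -(-target_reads // reads_per_lane)
```

With an assumed yield of 35 million reads per lane, `lanes_needed(100_000_000, 35_000_000)` gives 3 lanes, in line with the estimate above.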

Data Processing:

As the data comes off the GAIIx we do an initial assessment to determine whether it should be processed further. Some of the things we look for are:

  • Quality Score Box Plots
  • % A, T, C, G across the cycles
  • Pass filtering rates
  • PhiX error rates
  • etc...
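
One of the checks listed above, the percentage of each base across the cycles, can be sketched in Python as below. This is a minimal illustration that takes read sequences as plain strings; it is not the lab's QC tool, and `base_composition_per_cycle` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, Iterable, List


def base_composition_per_cycle(reads: Iterable[str]) -> List[Dict[str, float]]:
    """Return, for each cycle (read position), the percentage of A, C, G, T, N."""
    counts: List[Counter] = []
    for read in reads:
        for cycle, base in enumerate(read.upper()):
            if cycle == len(counts):
                # First read long enough to reach this cycle.
                counts.append(Counter())
            counts[cycle][base] += 1

    result = []
    for cycle_counts in counts:
        total = sum(cycle_counts.values())
        result.append({b: 100.0 * cycle_counts[b] / total for b in "ACGTN"})
    return result
```

A strong drift in these percentages from one cycle to the next, or a rising share of N calls, is the kind of pattern such a plot is meant to expose.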

Each lane is mapped independently using STAR (Alex Dobin, CSHL; in preparation). Summary statistics are gathered for each lane, such as:

  • % mapped
  • Average number of mismatches in mapped reads
  • % unique vs. multi
  • etc...
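
The statistics listed above can be derived from a handful of per-lane counts. The sketch below is illustrative only: the `LaneCounts` fields are assumed inputs and do not correspond to any particular STAR output format.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class LaneCounts:
    total_reads: int       # reads passed to the mapper for this lane
    unique_reads: int      # reads mapped to exactly one location
    multi_reads: int       # reads mapped to more than one location
    total_mismatches: int  # mismatches summed over all mapped reads


def summarize_lane(c: LaneCounts) -> Dict[str, float]:
    """Compute % mapped, % unique vs. multi, and average mismatches."""
    mapped = c.unique_reads + c.multi_reads
    if c.total_reads <= 0 or mapped > c.total_reads:
        raise ValueError("inconsistent read counts")
    if mapped == 0:
        return {"pct_mapped": 0.0, "pct_unique": 0.0,
                "pct_multi": 0.0, "avg_mismatches": 0.0}
    return {
        "pct_mapped": 100.0 * mapped / c.total_reads,
        # Unique and multi percentages are relative to mapped reads.
        "pct_unique": 100.0 * c.unique_reads / mapped,
        "pct_multi": 100.0 * c.multi_reads / mapped,
        "avg_mismatches": c.total_mismatches / mapped,
    }
```

Here the unique and multi percentages are expressed relative to mapped reads; reporting them relative to all reads would be an equally reasonable convention.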

For each library we generate an Excel spreadsheet.

Example Summary Stats document:

Elements:

We are generating 3 main kinds of elements for the Long RNA-Seq data.

  • Clusters
  • Splice Junctions
  • Polyadenylation sites


We are aiming to incorporate as many of the analysis tools as possible (STAR, element calling, Q.C., FPKM generation, etc.) into our local Galaxy framework.
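
FPKM is mentioned above without a definition. Under its usual meaning (fragments per kilobase of feature per million mapped fragments) it can be computed as in the sketch below. The function and parameter names are illustrative, and this is not necessarily how the lab's pipeline calculates it.

```python
def fpkm(fragments: int, feature_length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments per kilobase of feature per million mapped fragments."""
    if feature_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total mapped fragments must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million).
    return fragments * 1e9 / (feature_length_bp * total_mapped_fragments)
```

For example, 500 fragments on a 2,000 bp feature in a library of 10 million mapped fragments gives an FPKM of 25.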