NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.

Cell phenotype study

This page lists microarray expression studies of Tier1 ENCODE cell samples performed for the ENCODE Common Resources WG by the Tenenbaum lab at SUNY-Albany.

The following table is under construction. Please be patient...


Initial Analyses Proposal

(by Zhiping Weng)

1) Goal: QC. Assess the effect of differences in lab protocols on cell phenotype, using gene expression level measurements on the Affymetrix (Affy) exon array.

2) Final report will include:

  • Identify outlier labs for each cell type
  • Provide between-lab variance/correlation for each cell type (a measure of the robustness of each cell type)
  • Provide cell-type-dependent expression variance for each gene, controlling for between-lab variance (a measure of the robustness of each gene)
  • Establish a pipeline for new cells / protocols

3) Computational issues we'll handle:

  • Derive representative mRNA expression values from exon-level values. (Variance/correlation should be computed at the gene level because of the strong dependence among exons of the same gene.)
  • Identify and use reliable values only. (For QC, it is better to use only reliable estimates.)
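
The two computational issues above could be sketched roughly as follows. This is an illustrative Python sketch only, not the method used in the study: the function name, the use of a per-gene median, and the threshold of 2.0 on the DABG -log10(p) value (i.e. p <= 0.01) are all assumptions made for the example.

```python
from statistics import median

def gene_level_summary(exon_values, exon_dabg_neglog10, exon_to_gene,
                       min_neglog10=2.0):
    """Collapse exon-level values to one value per gene.

    Only exons whose DABG -log10(p) is at least `min_neglog10` are treated
    as reliable (hypothetical threshold). Genes with no reliable exon are
    omitted from the result.
    """
    per_gene = {}
    for exon, value in exon_values.items():
        # Skip exons that are missing a detection call or fail the cutoff.
        if exon_dabg_neglog10.get(exon, 0.0) < min_neglog10:
            continue
        per_gene.setdefault(exon_to_gene[exon], []).append(value)
    # Assumption: the median of reliable exons stands in for the gene.
    return {gene: median(vals) for gene, vals in per_gene.items()}

# Made-up identifiers and values, for illustration only.
values = {"e1": 8.0, "e2": 9.0, "e3": 2.0, "e4": 5.0}
dabg = {"e1": 3.0, "e2": 4.0, "e3": 0.5, "e4": 2.5}
genes = {"e1": "geneA", "e2": "geneA", "e3": "geneA", "e4": "geneB"}
summary = gene_level_summary(values, dabg, genes)
```

In this toy input, exon e3 fails the detection cutoff, so geneA is summarized from e1 and e2 only.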

For higher level analysis:

1) We will examine the dependence of the variation on:
  • types of genes. Perhaps stress-related genes would have greater variation than housekeeping genes.
  • expression level
  • genes transcribed by Pol I, Pol II, and Pol III
2) We will also examine the differences between the two cell lines, which may reveal differences in their genomic structures.

Preliminary Analyses by SUNY Albany Microarray Core

PCA Mapping

Cluster Analysis

Significant Pairwise K562

Significant Pairwise GM12878

Detailed Analysis by Tenenbaum Lab (SUNY)

Ajish George November 20, 2008


Affymetrix's apt-probeset-summarize was used to summarize the CEL files into both exon-level and gene-level probeset summaries. The specific commands and parameters used are given below. No cross-sample NORMALIZATION was performed!

gene-level : apt-probeset-summarize -a rma-bg,pm-gcbg,med-polish -a pm-gcbg,plier -a pm-only,dabg.neglog10=true --precision 10 -p lib/HuEx-1_0-st-v2.r2.pgf -c lib/HuEx-1_0-st-v2.r2.clf -b lib/HuEx-1_0-st-v2.r2.antigenomic.bgp -m lib/HuEx-1_0-st-v2.r2.dt1.hg18.full.mps --qc-probesets lib/HuEx-1_0-st-v2.r2.qcc -o gene/ --cel-files cel_list

exon-level : apt-probeset-summarize -a rma-bg,pm-gcbg,med-polish -a pm-gcbg,plier -a pm-only,dabg.neglog10=true --precision 10 -p lib/HuEx-1_0-st-v2.r2.pgf -c lib/HuEx-1_0-st-v2.r2.clf -b lib/HuEx-1_0-st-v2.r2.antigenomic.bgp --qc-probesets lib/HuEx-1_0-st-v2.r2.qcc -o exon/ --cel-files cel_list
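
Both commands read the list of input chips from the file cel_list. As far as we understand the APT file format, this is a plain text file with a header line reading cel_files followed by one CEL path per line; please check this against the APT documentation for your version. A minimal Python sketch for producing such a file is below. The function name and the file names in the usage line are hypothetical.

```python
from pathlib import Path

def write_cel_list(cel_paths, out_path="cel_list"):
    """Write a CEL list file: a 'cel_files' header, then one path per line.

    The header convention is an assumption based on the APT documentation.
    Paths are sorted so the output is reproducible.
    """
    lines = ["cel_files"] + [str(p) for p in sorted(cel_paths)]
    Path(out_path).write_text("\n".join(lines) + "\n")
    return lines

# Hypothetical usage:
# write_cel_list(Path("cel").glob("*.CEL"), "cel_list")
```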

Further processing was done within R.

All input datasets, the library files used, RData files, and the lists generated are available here: [1]

The gene-level analysis also allowed the selection of probesets at three stringency levels:

  1. core: the best-annotated probesets, supported by RefSeq & GenBank mRNAs, ESTs from dbEST -- approx 22,500
  2. extended: also includes Ensembl annotations, syntenic RNAs from mouse & rat, microRNAs -- approx 130,000
  3. full: also includes predicted genes from Geneid, Genscan, Twinscan -- approx 262,000

Similarity of Cell Lines Across Labs

Update Nov 20, 2008

The most important question to answer here is whether the same cell line grown by different labs (potentially using different protocols) is as similar as replicates from one lab. If the same line behaves wildly differently in different labs, then cross-lab comparisons of data are suspect. Since degree of similarity is a relative concept, we are better off looking at as many different cell lines and as many different labs as are available. Therefore, in addition to the two Tier1 cell lines, we've also included replicates of all five Tier2 cell lines (from the Crawford lab at Duke) in our analysis.

We then take a multidimensional scaling (MDS) based visualization approach to examine the clustering of the data. To do this, we take the matrix of pairwise Spearman correlations between chips, transform each correlation into a distance ( ~ (1-correl)/2 ), and apply the non-parametric Sammon multidimensional scaling algorithm to reduce the data to two dimensions. This gives a two-dimensional space that captures the relationships between individual samples. Using the most noise-free transformation of the data (gene level, core probesets only, plier method), we plot all the samples on the two resulting MDS dimensions (below).
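
The first two steps of this procedure (Spearman correlation, then the (1-correl)/2 distance) can be sketched in plain Python as below, together with the stress function that Sammon mapping minimizes. This is an illustration, not the R code actually used: all function names are ours, and the optimization step itself is left out. In R, something like sammon from the MASS package would do that step, though this page does not say which implementation was used.

```python
from math import sqrt

def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = mean_rank
        i = j + 1
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def distance_matrix(chips):
    """chips: one list of expression values per chip, same probeset order."""
    n = len(chips)
    return [[(1 - spearman(chips[i], chips[j])) / 2 for j in range(n)]
            for i in range(n)]

def sammon_stress(dist, coords):
    """Sammon stress of a low-dimensional layout against target distances.

    Assumes every off-diagonal target distance is non-zero.
    """
    total, err = 0.0, 0.0
    for i in range(len(dist)):
        for j in range(i + 1, len(dist)):
            d = sqrt(sum((a - b) ** 2 for a, b in zip(coords[i], coords[j])))
            total += dist[i][j]
            err += (dist[i][j] - d) ** 2 / dist[i][j]
    return err / total
```

With this transform, perfectly correlated chips are at distance 0 and perfectly anti-correlated chips at distance 1. Dividing each squared error by the target distance is what makes Sammon mapping weight small distances more heavily, which suits a question about how tightly replicates cluster.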

Using MDS to look at core gene level intra-lab and inter-lab correlations.

As we can see, MDS dimension one best separates the two Tier1 cell lines, with dimension two resolving the Tier2 cell lines. The major difference then is between the two Tier1 lines but the Tier2 lines are different enough to cluster independently. We also see that cell-lines with similar origins (see GM12878 and GM18507) are closer together. Groovy.

Comparing the within-lab distances between samples of the same cell line to the overall average distance between same-cell-line samples, we find that for K562 the overall average is about one standard deviation from the average within-lab distance, but that for GM12878 it is more than two standard deviations away. This bears some discussion.
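
One way to make this comparison concrete is sketched below in Python. The exact definition used for the numbers above is not stated on this page, so this is only one plausible reading: we assume the standard deviation is that of the within-lab distances, and that the overall average runs over all same-cell-line pairs, within-lab pairs included. The function name and the toy distances are ours.

```python
from itertools import combinations
from statistics import mean, stdev

def overall_vs_within(dist, labs):
    """How far the overall mean distance lies from the within-lab mean,
    in units of the within-lab standard deviation.

    dist: square distance matrix between samples of ONE cell line.
    labs: lab label of each sample, in the same order as the matrix.
    Needs at least two within-lab pairs to compute a standard deviation.
    """
    pairs = list(combinations(range(len(labs)), 2))
    within = [dist[i][j] for i, j in pairs if labs[i] == labs[j]]
    overall = [dist[i][j] for i, j in pairs]
    return (mean(overall) - mean(within)) / stdev(within)

# Toy example: two labs, two replicates each (made-up distances).
toy_dist = [
    [0.0, 0.1, 0.5, 0.5],
    [0.1, 0.0, 0.5, 0.5],
    [0.5, 0.5, 0.0, 0.3],
    [0.5, 0.5, 0.3, 0.0],
]
toy_z = overall_vs_within(toy_dist, ["labA", "labA", "labB", "labB"])
```

A value near zero would mean that samples from different labs are about as close as replicates within a lab; larger values point to a lab effect.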

MDS on the exon level probesets.

Here exon-level probesets are used to show the clustering; they are noisier. This is probably due to the high level of similarity between the individual exons probed. The gene-level data bear out the differences much better.