Events News Research CBS CBS Publications Bioinformatics
Staff Contact About Internal CBS CBS Other

RNAseq exercises

Francesco Favero - June 2013


Overview

PCA3 is an antisense RNA, expressed specifically in prostate cancer cells. The specificity of PCA3 for prostate cancer is well know, and in fact, an assay to test presence of PCA3 in the urine is often coupled with the normal PCA blood test.

Public gene expression repositories as GEO have a large number of RNAseq cancer study. Among them few are from prostate cancer patient studies.
For this exercise we chosen the dataset
GSE22260, which consists of 30 samples Illumina Genome Analyzer II.

Q0: find from GEO the following data information:
a) Read length
b) Insert size
c) standard deviation of the insert size
d) are the samples paired or single end?

Hint: open few sample (GSM*) from the dataset page (GSE), and find the fields Charachteristic, Extraction protocol....

For the exercise we subsets the original dataset to the reads mapping at chromosome 9, in order to test the RNA level of PCA3 in tumor versus normal in shorter amount of time.

Each student will choose a sample and will perform the first part of the exercise on that.
In the second part of the exercise all the files generated by the students will be grouped

The following steps will be performed:

  1. Data preparation
  2. Alignment with tophat
  3. Counts reads with HTSeq
  4. Explore the result
  5. ------Working with R now----------
  6. Create the Targets.txt file
  7. Load data into R
  8. Associate ENS_genes with other annotations
  9. Normalize
  10. Check differential expressed genes
  11. Plot PCA3 expression in cancer and normal samples

Data preparation

Choose a sample in the Google Document, and make sure other student are not working on the same file.

Create a directory rnaseq in your home folder, and copy all the fastq.gz files of your sample in your local folder. (NOTE: MYSAMPLE stands for the id of you own sample)

mkdir /home/local/ngs_course/studXXX/rnaseq
cp /home/local/27626/exercises/rnaseq/dataset/chr9_GSE22260/MYSAMPLE/*.fastq.gz /home/local/ngs_course/studXXX/rnaseq/

Now we can have a quick look at the fastq files. Command lines are often handy to check the number of lines of a file.

cd /home/local/ngs_course/studXXX/rnaseq/
zcat SAMPLE_1.fastq.gz | wc -l
Q1: Considering that the file is in fastq format, how many reads have each files?

Alignment with tophat

Tophat is using bowtie for alignment, therefore we need a reference indexed in the format required for bowtie.
An already indexed reference for chr9 is available at /home/local/27626/exercises/rnaseq/reference/. Tophat also does not support gzipped files, so we need to unzip the files first:

gunzip SAMPLE*.fastq.gz
/tools/bin/tophat --num-threads 1 --output-dir tophat_out --mate-inner-dist 80 --mate-std-dev 38 \
   /home/local/27626/exercises/rnaseq/reference/Homo_sapiens.GRCh37.66.dna.chromosome.9 \
    SAMPLE_1.fastq SAMPLE_2.fastq
Q2: Looking at the help message of tophat2 (tophat2 --help), what does the options --mate-inner-dist and --mate-std-dev means? Have you an idea of how the value of --mate-inner-dist is chosen?

Counts reads with HTSeq

HTSeq need a small workaround to run on padawan, type the following command (need to be done only once every log-in)

 setenv LD_LIBRARY_PATH /home/local/27626/lib/

In the slides we mentioned HTSeq, and the ability to choose the feature. In this example there will be a practical example of how it works. You can check the working status and the various option with the command

/home/local/27626/bin/htseq-count --help

If the student installation does not work you can use this one: /home/projects3/sbge_cancer_seq/sequence/RNAseq/src/myPy/bin/htseq-counts
Since we need a two column header, with the gene_id and the number of reads, we create an empty file with only the header, and we will tell the program to write in it the counts information. The required GTF file is in the folder together with the reference genome.

echo "ens_gene\tcounts" > SAMPLE_htseq.txt
samtools view tophat_out/accepted_hits.bam | /home/local/27626/bin//htseq-count --quiet - \
   /home/local/27626/exercises/rnaseq/reference/Homo_sapiens.GRCh37.66.chr9.gtf \
   -m intersection-strict >> SAMPLE_htseq.txt
If htseq-count does not work, you can copy the file from
cp /home/local/27626/test/rnaseq/analysis/SAMPLE_htseq.txt .

Explore the result

The 2 column files contains the reads of the various ensembl genes in chromosome 9. Using standard UNIX tools is possible to have an idea of the results:

head SAMPLE_htseq.txt
cat SAMPLE_htseq.txt | gawk '{sum +=$2}END{print sum}'
tail SAMPLE_htseq.txt
Q3: a) Why the total read counts changes by selecting diffrent modes (-m) option of HTSeq?
b) Using htseq-count with the mode "intersection_strict", complete the information on the Google document according with the finding of your file


Hint use the -o option of HTSeq to output a sam files, and use the grep command to count (grep -c)ambiguous and no_features reads


Create the Targets.txt file

EdgeR, as the limma packages, require an index file, where all the relevant information for each sample are stored.
The relevant header entry necessary in the Target file are the column files and group. In the dataset folder there is a description of the samples, which can be used to generate the targets file.

cat /home/local/27626/exercises/rnaseq/dataset/chr9_GSE22260/Description.txt | \
   sed 's/"//g' | awk 'BEGIN { FS = "\t" } ; {print $1"_htseq.txt""\t"$2"\t"$3}' | \
   sed 's/files_htseq.txt/files/' > Targets.txt
Q4: How many normal prostate tissue sample do we have?

Load data into R

From now we need to use R and the package edgeR. Load the files in R is extremely easy after create the correct Targets.txt.

R
library(edgeR)
targets = readTargets("Targets.txt")
colnames(targets)[1] = "files"
d = readDGE(targets,path='/home/local/27626/test/2012/rnaseq/analysis/')
Q5: Using the commands, dim and names in R, which dimension the object d have and which sub-objects does it contain?

Associate ENS_genes with other annotations

The annotation file for ens_gene in chromosome 9 is located at: /home/local/27626/exercises/rnaseq/reference/GRCh37_66_ens_genes_annotation.chr9.txt. (The annotation was obtained querying the list of ens_gene in the order of the generated files, with the biomaRt package)

genes = read.table("/home/local/27626/exercises/rnaseq/reference/GRCh37_66_ens_genes_annotation.chr9.txt", header=T,sep='\t',as.is=T)
d$genes = genes
d$genes$gene_biotype
Q6: using the table command on the gene_biotype column we just printed - how many antisense RNAs are present on chr9?

Normalization

The following two commands will perform the normalization by library size, and fit the negative-binomial distribution to the model. For more details in R you can try ?calcNormFactors and ?estimateCommonDisp.

d = calcNormFactors(d)
d = estimateCommonDisp(d)
Q7: Using again the function names(d) which new objects appears in the list?

Check differential expressed genes

d.com      = exactTest(d,pair=c('normal','carcinoma'))
top.genes  = topTags(d.com,n=sum(d.com$table$PValue <= 0.01))

In R is possible to view a desired part of the data, by indexing the object on the square bracket "[]", or by accessing the desired line per raw name. In our case the raw names in the d.com object corresponds to the ENSG symbol

top.genes$table[1:20,] ## Print the first 20 line of the object table in top.gene
d.com$table["ENSG00000160563",]
Q8: Which is the most expressed RNA in this analysis?
Hint: look the logFC and the logCPM values

Plot PCA3 expression in cancer and normal samples

R is a really good tool for statistics, in particular when it needed to plot and visualize the results. Below are two example of graphics representation, a boxplot and a barplot.

par(mfrow=c(1,2))
boxplot(log2(d$pseudo.alt["ENSG00000225937",])~c(rep("cancer",20),rep("normal",10)),col=c("red","blue"), main="ENSG00000225937")
legend("topright",c("cancer","normal"),col=c("red","blue"), pch=16)
barplot(d$pseudo.alt["ENSG00000225937",],col=c(rep("red",20),rep("blue",10)), main="ENSG00000225937")
legend("topright",c("cancer","normal"),col=c("red","blue"), pch=16)
dev.print(file="ENSG00000225937.pdf", device=pdf)

Q9: a) Is PCA3 expression significant?
b) After plotting the 2 figures for PCA3 what conclusion would you say?

Congratulations you finished the exercise!