Next Generation Sequencing and transcriptomics exercise
Introduction:
The data set used
here is Solexa sequencing runs performed on cDNA from phototrophic sulfur
bacteria Thiodictyon sp.
Cad16. Each experiment represents a specific environmental condition that the
bacteria were grown under. One of the objectives was to learn which genes that
were differentially expressed beween the two growing conditions.
The objective of
this exercise is not to learn new biology, but rather to show you how next
generation sequencing data can be preprocessed, such that it can be used for
analysing gene expression using the tools youÕve learned for analyzing DNA
microarrays.
A.
To use sequenced cDNA reads
as measures of gene expression, you must first take some steps in order to
transform your fastq file (file with your reads and read qualities) into
something resembling a microarray file.
Q1) Can you think of a necessary analysis that you must perform in
order to transform a fastq (or fasta) file into something resembling a DNA
microarray data file?
(Hint: You can safely ignoring the
quality measures, which incidentally turns your data file into a fasta file)
Q2) Suggest a method for achieving this.
NOTE!
Performing this step on
each of the guest accounts is unfortunately not feasible (software
installation, cpu time, data etc.), so we will discuss this together after
youÕve all finished answering the first two questions.
B.
Not all fragments ends up
being sequenced. The ones that do are sampled according to a random stochastic
(probabilistic) process.
Q3) Can you think of a reason (an error) how a given gene can have
a much higher number of reads mapping to it compared to another gene with
identical expression level? (answer this question before moving on with the exercise or
reading further)
Note important!: commands that should be executed are illustrated
with a Ô>Õ character in front of them. You should NOT write this character explicitly, it is a mere
standard for showing when something takes place in the terminal window.
Create a directory on you computer to store data and work from.
Download the exercise files
from
http://www.cbs.dtu.dk/~plate/all.reads.num http://www.cbs.dtu.dk/~plate/all.SL.num
http://www.cbs.dtu.dk/~plate/normalize.R
to you exercise folder.
The files contain the
following information:
a.
Number of reads mapping to a gene. Each row represents one gene,
each column represents one sequencing experiment.
b.
Sequence lengths
c.
A script containing all the procedures for this exercise. DonÕt
cheat yourself and look at the solutions before giving a fair attempt to solve
the questions.
Hint: In order to learn R youÕll
do yourself a favour typing the commands manually rather than copy paste from
the exercise. ItÕll stick better.
Start up R:
Load the data in the file Òall.reads.numÓ.
> X <- read.table('all.reads.num')
> SL < read.table(Ôall.SL.numÕ)
First take look at the data
(as you should always do) and check for outliers:
> par(mfrow=c(2,1))
> plot(X[,1],X[,2],pch=16,col=2)
> plot(log(X[,1]),log(X[,2]),pch=Ó.Ó, col=4)
Q4) Normalize the read counts by the sequence lengths (as youÕve
just answered) and store these normalized counts in a variable with some
descriptive variable name like r1SL (read counts
experiment 1 normalized by sequence length)
>
r1SL <- ???
>
r2SL <- ???
Does the content of the
arrays look correct? Check by printing the first 5-6 elements of r2SL, X and SL. Verify
that this is an important step by plotting the seq. lengths vs. counts:
> par(mfrow=c(2,1))
> plot(log(SL[,1]), log(X[,1]), pch=Ó.Ó, xlab=ÓlogSLÓ, Óylab=ÓlogX1Ó)
> plot(log(SL[,1]), log(X[,2]), pch=Ó.Ó, xlab=ÓlogSLÓ, Óylab=ÓlogX2Ó)
C.
As you might remember was the case for microarrays, there are two
kinds of errors in sequencing systems like these. Systematic error and
stochastic error. The stochastic error is hard to do much about, but the
systematic error we definitely must reduce as much as possible, in order to
extract the true signal.
One systematic error is variance in total read counts in between
experiments. Different runs on otherwise identical samples can produce varying
amount of total sequence reads, which must be adjusted for (normalized) in
order to compare differential gene expression across experiments or runs
fairly.
Q5) What quantity can you
calculate that will give you an indication of systematic difference in read
numbers across the samples?
Q6) Having sequence counts
from two experiments, as in this case, use R to correct this systematic bias.
Note: you should now use the counts that you have corrected for sequence
length bias (r1SL, s2SL). Write the further corrected counts to new variables,
for instance Òr2NÓ (N for normalized - it gets increasingly awkward to come up
with short and descriptive variable names, so use your fantasy).
Hint: Use one experiment (e.g. Òr1SLÓ) as a reference and correct the
counts in the other experiment relative to this. It makes it easier if you use
an intermediary variable to save your normalization factor in.
Try first for yourself and
see in the script Ònormalize.RÓ whether you did it correct.
D.
Non-linear signal dependent
normalization, is a standard approach that you should be familiar with by now.

Figure: X: mean(log2 intensity) Y: log(ratio)
Read more about this techique here: qspline
normalization recap
Load the affymetrix library by typing either:
> require(affy) or >
library(affy)
First we check for non linear bias:
> rat12 <- log2(XN[,1]/XN[,2])
> int <- apply(log2(XN),1,mean)
> plot(int,rat12)
Look closely to see if the dots are centered horizontally.
Then we apply qspline to remove any non-linear bias
XN_spl <- normalize.qspline(XN)
Again we check for non linear bias
> rat12_spl <- log2(XN_spl[,1]/XN_spl[,2])
> int_spl <- apply(log2(XN_spl),1,mean)
> plot(int_spl,rat12_spl)
The difference is small, but if you look closely youÕll see that
the non qsplineÕd ratios are not perfectly centered horizontally. The qspline
fixes this nice and tidy.
E.
Q7) If you were searching for genes that were differentially
expressed across the samples, think of ways how you could indentify the most
variable genes.
As goes for microarray
analysis, in transcriptomics the more experimental replications the
better. This data set unfortunately only contains two experiments (priced
$8-12.000 each). This limits how advanced statistical methods are worth
applying. A t-test between two observations of a gene for example doesnÕt make much
sense, so in lack of more data, we will (as you have just answered above) use either
fold change or the standard deviation across samples, and sort genes according to
this measure. The more varying the expression is, the higher the SD or FC.
> sd_spl <- sd(t(XN_spl))
Q8) See if you can perform the same step using fold change instead
of standard deviation.
> fc_spl <- ???
You could save your
normalized counts and your standard deviations like so:
> write(XN_spl, file="reads.normalized.num",
ncolumns=2)
> write(sd_spl, file="all.stddev.num", ncolumns=1)
All that is left to do is
to join the calculated standard deviations of normalized counts with their
respective gene names, sort them in decreasing order so that the top genes
represent the most differentially expressed genes. All of this could be done
very easily in R. This data set has not been published yet, so unfortunately
weÕll have to save this little step for now.
F.
Q9) Which of the steps that weÕve just performed would be
different if we had a series of sequencing runs (say from a time series
experiment) instead of just two experiments? What would do different
specifically?
Take home message:
1.
Nextgen sequencing can be used to measure gene expression on a wider
and more detailed scale than DNA microarrays.
2.
Nextgen sequencing WILL account for a large proportion of expression
studies in the very near future.
3.
With a series of preprocessing steps, you can use the analysis tools
youÕve learned for microarrays in the exact same manner.