Next Generation Sequencing and transcriptomics exercise
The data set used here consists of Solexa sequencing runs performed on cDNA from the phototrophic sulfur bacterium Thiodictyon sp. Cad16. Each experiment represents a specific environmental condition under which the bacteria were grown. One of the objectives was to learn which genes were differentially expressed between the two growing conditions.
The objective of this exercise is not to learn new biology, but rather to show you how next generation sequencing data can be preprocessed so that gene expression can be analyzed using the tools you’ve learned for DNA microarrays.
To use sequenced cDNA reads as measures of gene expression, you must first take some steps to transform your fastq file (a file with your reads and read qualities) into something resembling a microarray data file.
Q1) Can you think of a necessary analysis that you must perform in order to transform a fastq (or fasta) file into something resembling a DNA microarray data file?
(Hint: You can safely ignore the quality measures, which incidentally turns your data file into a fasta file.)
Q2) Suggest a method for achieving this.
Performing this step on each of the guest accounts is unfortunately not feasible (software installation, cpu time, data etc.), so we will discuss this together after you’ve all finished answering the first two questions.
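To give a flavour of what that discussion will cover, here is a toy R sketch of the general idea: reads are mapped to gene sequences and the hits are tabulated into a count table. The sequences, reads and gene names below are entirely made up, and a real pipeline would use a dedicated read mapper rather than substring search.

```r
## Toy illustration (NOT the real pipeline): map reads to genes by
## exact substring search, then tabulate the number of reads per gene.
genes <- c(geneA = "ATGGCGTTTAAACCC", geneB = "TTTGGGCCCAAATTT")  # made-up gene sequences
reads <- c("GCGTTT", "GGGCCC", "ATGGCG", "AAATTT")                # made-up reads

hits <- sapply(reads, function(r) {
  m <- which(sapply(genes, function(g) grepl(r, g, fixed = TRUE)))
  if (length(m) == 1) names(genes)[m] else NA   # discard ambiguous/unmapped reads
})

## One row per gene, one column per experiment -- here just one experiment:
counts <- table(factor(hits, levels = names(genes)))
counts   # geneA gets 2 reads, geneB gets 2 reads
```

A table like this, with one column per sequencing experiment, is exactly the "all.reads.num" file you will work with below.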
Not all fragments end up being sequenced. The ones that do are sampled according to a stochastic (probabilistic) process.
Q3) Can you think of a reason (a source of error) why a given gene can have a much higher number of reads mapping to it than another gene with an identical expression level? (Answer this question before moving on with the exercise or reading further.)
Important note: commands that should be executed are shown with a ‘>’ character in front of them. You should NOT type this character; it is merely a convention for showing that something takes place in the terminal window.
Create a directory on your computer to store data and work from.
Download the exercise files from
to your exercise folder.
The files contain the following information:
a. Number of reads mapping to a gene. Each row represents one gene, each column represents one sequencing experiment.
b. Sequence lengths
c. A script containing all the procedures for this exercise. Don’t cheat yourself by looking at the solutions before giving the questions a fair attempt.
Hint: In order to learn R, you’ll do yourself a favour by typing the commands manually rather than copy-pasting from the exercise. It’ll stick better.
Start up R:
Load the data in the file “all.reads.num”.
> X <- read.table('all.reads.num')
> SL <- read.table('all.SL.num')
First take a look at the data (as you should always do) and check for outliers:
> plot(log(X[,1]), log(X[,2]), pch=".", col=4)
Q4) Normalize the read counts by the sequence lengths (as you’ve just answered) and store these normalized counts in a variable with some descriptive variable name like r1SL (read counts experiment 1 normalized by sequence length)
> r1SL <- ???
> r2SL <- ???
Does the content of the arrays look correct? Check by printing the first 5-6 elements of r2SL, X and SL. Verify that this is an important step by plotting the seq. lengths vs. counts:
> plot(log(SL[,1]), log(X[,1]), pch=".", xlab="logSL", ylab="logX1")
> plot(log(SL[,1]), log(X[,2]), pch=".", xlab="logSL", ylab="logX2")
As you might remember from microarrays, there are two kinds of errors in sequencing systems like these: systematic error and stochastic error. The stochastic error is hard to do much about, but the systematic error we definitely must reduce as much as possible in order to extract the true signal.
One systematic error is variation in total read counts between experiments. Different runs on otherwise identical samples can produce varying amounts of total sequence reads, which must be adjusted for (normalized) in order to fairly compare differential gene expression across experiments or runs.
Q5) What quantity can you calculate that will give you an indication of systematic difference in read numbers across the samples?
Q6) Having sequence counts from two experiments, as in this case, use R to correct this systematic bias.
Note: you should now use the counts that you have corrected for sequence length bias (r1SL, r2SL). Write the further corrected counts to new variables, for instance “r2N” (N for normalized – it gets increasingly awkward to come up with short and descriptive variable names, so use your imagination).
Hint: Use one experiment (e.g. “r1SL”) as a reference and correct the counts in the other experiment relative to this. It makes it easier if you use an intermediary variable to save your normalization factor in.
Try it first for yourself, then check in the script “normalize.R” whether you did it correctly.
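To make the idea in the hint concrete without giving away the exercise itself, here is a toy sketch on made-up counts (three genes, two experiments, where the second run happened to be twice as deep):

```r
## Toy illustration of library-size normalization, with made-up counts.
## Experiment 1 is used as the reference; experiment 2 is scaled onto it.
x1 <- c(10, 50, 200)        # length-corrected counts, experiment 1
x2 <- c(22, 98, 400)        # length-corrected counts, experiment 2 (deeper run)

f   <- sum(x1) / sum(x2)    # normalization factor relative to experiment 1
x2n <- x2 * f               # experiment 2 counts, now on experiment 1's scale

sum(x1)                     # 260
sum(x2n)                    # also 260: the totals now match
```

After this step, a difference in counts between the two columns for a given gene reflects expression, not sequencing depth.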
Non-linear, signal-dependent normalization is a standard approach that you should be familiar with by now.
Figure: log ratio (y-axis) plotted against mean log2 intensity (x-axis).
Read more about this technique here: qspline normalization recap
Load the affy library by typing either:
> require(affy) or > library(affy)
First we check for non-linear bias (here XN is the two-column matrix holding the counts you normalized in the previous step):
> rat12 <- log2(XN[,1]/XN[,2])
> int <- apply(log2(XN),1,mean)
> plot(int, rat12, pch=".")
Look closely to see whether the dots are centered horizontally.
Then we apply qspline to remove any non-linear bias
> XN_spl <- normalize.qspline(XN)
Again we check for non-linear bias:
> rat12_spl <- log2(XN_spl[,1]/XN_spl[,2])
> int_spl <- apply(log2(XN_spl),1,mean)
> plot(int_spl, rat12_spl, pch=".")
The difference is small, but if you look closely you’ll see that the non-qspline’d ratios are not perfectly centered horizontally. The qspline fixes this nice and tidy.
Q7) If you were searching for genes that were differentially expressed across the samples, think of ways you could identify the most variable genes.
As with microarray analysis, in transcriptomics the more experimental replicates the better. This data set unfortunately contains only two experiments (priced at $8,000 to $12,000 each). This limits which statistical methods are worth applying: a t-test between two observations of a gene, for example, doesn’t make much sense. So, in lack of more data, we will (as you have just answered above) use either the fold change or the standard deviation across samples and sort genes according to this measure. The more the expression varies, the higher the SD or FC.
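A toy comparison on made-up counts for two hypothetical genes shows why both measures flag the same kind of gene:

```r
## Made-up counts for two genes across the two experiments:
a <- c(100, 400)   # gene 1: four-fold difference between experiments
b <- c(100, 110)   # gene 2: nearly unchanged

log2(a[1] / a[2])  # log2 fold change of gene 1: -2
log2(b[1] / b[2])  # log2 fold change of gene 2: about -0.14

sd(a)              # standard deviation of gene 1: about 212
sd(b)              # standard deviation of gene 2: about 7
```

Gene 1 stands out by both measures; with only two samples the two rankings carry essentially the same information.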
> sd_spl <- apply(XN_spl, 1, sd)
Q8) See if you can perform the same step using fold change instead of standard deviation.
> fc_spl <- ???
You could save your normalized counts and your standard deviations like so:
> write(XN_spl, file="reads.normalized.num", ncolumns=2)
> write(sd_spl, file="all.stddev.num", ncolumns=1)
All that is left to do is to join the calculated standard deviations of normalized counts with their respective gene names, sort them in decreasing order so that the top genes represent the most differentially expressed genes. All of this could be done very easily in R. This data set has not been published yet, so unfortunately we’ll have to save this little step for now.
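Since the real gene names cannot be released, here is a hedged sketch of that final step on hypothetical names and made-up standard deviations:

```r
## Hypothetical gene names and made-up per-gene standard deviations:
gene_names <- c("cad16_0001", "cad16_0002", "cad16_0003")
sds        <- c(0.2, 1.5, 0.7)

## Join names with their SDs and sort in decreasing order of SD:
ranked <- data.frame(gene = gene_names, sd = sds, stringsAsFactors = FALSE)
ranked <- ranked[order(ranked$sd, decreasing = TRUE), ]

ranked   # the most variable gene, "cad16_0002", ends up on top
```

With the real data you would substitute the actual gene identifiers and your sd_spl vector; the top rows would then be your candidate differentially expressed genes.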
Q9) Which of the steps that we’ve just performed would be different if we had a series of sequencing runs (say, from a time series experiment) instead of just two experiments? What specifically would you do differently?
Take home message:
1. Nextgen sequencing can be used to measure gene expression on a wider and more detailed scale than DNA microarrays.
2. Nextgen sequencing WILL account for a large proportion of expression studies in the very near future.
3. With a series of preprocessing steps, you can use the analysis tools you’ve learned for microarrays in the exact same manner.