CLASSIFICATION BASED ON DNA
MICROARRAY DATA
This exercise will
illustrate classification based on data produced by DNA microarrays used for
monitoring expression levels of whole cell mRNA populations.
The exercise uses a
leukemia dataset and several analysis methods in the R statistics package.
Dataset
The leukemia dataset
contains gene expression levels for patients with high and low minimal residual
disease (MRD). MRD is a measure of the number of leukemic cells that are
present in the bone marrow after a period of treatment. MRD is therefore a
measure for how well the patients respond to the given treatment. Patients with
low MRD have a good response to treatment (and the cure rates for these
patients are high), whereas patients with high MRD have a poor response to
treatment (and a lower chance of long term survival).
The Affymetrix Focus Array
GeneChips where applied (a total of 8793 measured gene expressions).
Method
R: Public domain data analysis
package available for download for a number of different computer platforms at http://www.rproject.org/. There are
several online manuals available, e.g.: An Introduction to R
(PDF file).
Copy the commands given in
the exercise into the window. If you get
an error, try to copy one line at a time.
Please write down your answers during the exercise, so you can compare
the results obtained in the different parts of the exercise!!!
Exercises:
First load
the data into R:
R
Check
that your Xwin is working by typing
plot(1:10)
ls()
load("/home/projects/hanni/Courses/Phdcourse/leukemiaMRDdata")
ls()
You should now have the following
objects loaded into R:

Matrix

classes

expnames
Look at your data by typing "classes" and "expnames".
“classes” is a vector with the
assigned classes (H: high MRD, L: low MRD).
The “expnames”
are the names of the experiments followed by the class. Eg.
patient number 20 has high MRD and is named: “P20.H”.
Look at the first 15 lines of the
data object "Matrix":
Matrix[1:15,]
Here, you see the first 15 rows of
the ‘Matrix’ object, where each row is a different gene (the Affymetrix ID’s
are listed), and each column is an experiment. The values are gene expression
levels for each combination of experiment and gene.
Visualize the data before performing
any analysis.
library(MASS) # function for writing to file is used
library(class) # functions for class prediction
source("/home/projects/hanni/Courses/Phdcourse/R.functions")
tposed < t(Matrix)
pca < fast.prcomp(tposed, retx=TRUE,center=TRUE, scale. = TRUE)
plot(pca$x[,1:2],type="n")
text(pca$x[,1:2], expnames)
This plot shows the the two patient groups using the labels of low (L) and high
(H) MRD.
Question 1. How well are the individual
classes separated?
Remaining in the R session started
above; enter the following commands to perform a knearest neighbor
classification on the Euclidian distance between ALL 8793 GENES.
knn.targets < factor(classes)
predictions < knn.cv(tposed, knn.targets, k=3, prob=TRUE)
The predicted classes are saved in
"predictions". Type “predictions” to look at the predictions type.
knn.targets
Question 2. How good is your classifier at
classifying each of the 15 patients:
number of true positives (tp), false negatives
(fn), true negatives (tn), false positives (fp)?
Question 3. What is the classification
accuracy in percent?
Question 4. What is the correlation
coefficient?
Hint
(Correlation coefficient  CC):
tp < specify number
of true positives
fn < specify number
of false negatives
tn < specify number
of true negatives
fp < specify number of
false positives
CC < ((tp*tn)(fp*fn))/sqrt((tn+fn)*(tn+fp)*(tp+fn)*(tp+fp))
To
see the calculated correlation coefficients type “CC”.
Question 5. From the accuracies and
correlation coefficients, are these results any good?
Remaining in the R session started
above; enter the following commands to perform a ttest of all genes.
factors < factor(classes)
index < 1:length(factors)
index1 <
index[factors==levels(factors)[1]] #
Category H
index2 <
index[factors==levels(factors)[2]] #
Category L
pValues < get.pval.ttest(Matrix,index1,index2)
orders < order(pValues)
Matrix.pvalues < cbind(Matrix[orders,],pValues[orders])
Matrix.pvalues[1:25,]
Perform a new principal component
analysis by using only the 25 most significant genes.
tposed < t(Matrix.pvalues[1:25,1:length(classes)])
pca < fast.prcomp(tposed, retx=TRUE,center=TRUE, scale. = TRUE)
plot(pca$x[,1:2],type="n")
text(pca$x[,1:2], expnames)
Question 7. How well are the individual classes separated,
now?
Remaining in the R session started
above; enter the following commands to perform a knearest neighbor
classification on the Euclidian distance between the 25 most significant genes.
predictions < knn.cv(tposed, knn.targets, k=3, prob=TRUE)
The predicted classes are saved in
predictions. To look at the prediction type “predictions”.
knn.targets
Question 8. How good is your classifier at
classifying each of the 15 patients:
number of true positives (tp), false negatives
(fn), true negatives (tn), false positives (fp)?
Question 9. What is the classification
accuracy in percent?
Question 10. What is the correlation
coefficient?
Question 11. From the accuracies and correlation
coefficients, how are the results compared to classification based on all
genes?
This exercise demonstrates how the
number of top ranking genes chosen to include for classification have an
influence on classification performance.
In step 18, the
number of top ranking genes to include was chosen as 25. The code below will
calculate the CC with number of genes 2 to 500 and plot a graph (performance
based on number of genes)
perf = c(2:500)
for(x in 2:500){
tposed < t(Matrix.pvalues[1:x,1:length(classes)])
predictions < knn.cv(tposed, knn.targets, k=3, prob=TRUE)
test < predictions==knn.targets
accuracy <
length(test[test==TRUE])*100/length(test)
cc < calc.correlation(return.score(predictions,knn.targets))
perf[x1] = cc}
plot(c(2:500),perf,type="l",xlab="Number of genes", ylab="CC")
Question 12. How does the classification
performance change with the number of genes?
Question 13. Why does the graph look like it
does (“squared fluctuation”)?
With regard to prediction of classes
based on top ranking genes selected by a ttest of all samples (this exercise
and the previous exercise):
Question 14. Is this the right way
of estimating classification performance?
Question 15. How would you get a truer estimate
of your classification performance?
Question 16. Would this classification
performance be better or worse than the just estimated performance?
cluster.matrix < Matrix[orders,]
colnames(cluster.matrix) < expnames
d < dist(t(cluster.matrix))
hc < hclust(d)
plot(hc)
Question 17. What do you see on this plot? How
are the samples clustered?
m < cluster.matrix[1:50,]
d < dist(t(m))
hc < hclust(d)
plot(hc)
Question 18. What do you see on this plot? How
are the samples clustered?
colcol<vector()
colcol[classes=="H"]<"red"
colcol[classes=="L"]<"magenta"
heatmap(m, distfun = cordist, col = topo.colors(32),ColSideColors=colcol)
Question 19. What is the difference between
this plot and the previous plot? Describe the plot.
Quit the R session with ControlD
followed by the answer "n" to the question whether you want to save
the workspace image.
Last updated by Agnieszka S. Juncker,
CBS, May 10, 2006