CLASSIFICATION BASED ON DNA MICROARRAY DATA
This exercise will illustrate classification based on data produced by DNA microarrays used for monitoring expression levels of whole cell mRNA populations.
The exercise uses a leukemia dataset and several analysis methods in the R statistics package.
Dataset
The leukemia dataset contains gene expression levels for patients with high and low minimal residual disease (MRD). MRD is a measure of the number of leukemic cells that are present in the bone marrow after a period of treatment. MRD is therefore a measure for how well the patients respond to the given treatment. Patients with low MRD have a good response to treatment (and the cure rates for these patients are high), whereas patients with high MRD have a poor response to treatment (and a lower chance of long term survival).
The Affymetrix Focus Array GeneChips where applied (a total of 8793 measured gene expressions).
Method
R: Public domain data analysis package available for download for a number of different computer platforms at www.r-project.org. There are several online manuals available: Kickstarting R and An Introduction to R (PDF file).
Copy the commands given in the exercise into the window. If you get an error, try to copy one line at
a time.
Please write down your answers during
the exercise, so you can compare the results obtained in the different parts of
the exercise!!!
Exercises:
First load the data into R:
R
ls()
load("~hanni/Courses/Phdcourse/leukemiaMRD-data")
ls()
You should now have the following objects loaded into R:
- Matrix
- classes
- expnames
Look
at your data by typing "classes" and "expnames".
“classes” is a vector with the assigned classes (H: high MRD, L: low MRD).
The “expnames” are the names of the experiments followed by the class. Eg. patient number 20 has high MRD and is named: “P20.H”.
Look at the first 15 lines of the data object "Matrix":
Matrix[1:15,]
Here, you see the first 15 rows of the ‘Matrix’ object, where each row is a different gene (the Affymetrix ID’s are listed), and each column is an experiment. The values are gene expression levels for each combination of experiment and gene.
Visualize the data before performing any analysis.
library(MASS) # function for writing to file is used
library(class) # functions for class prediction
source("~hanni/Courses/Phdcourse/R.functions")
tposed <- t(Matrix)
pca <- fast.prcomp(tposed, retx=TRUE,center=TRUE, scale. = TRUE)
plot(pca$x[,1:2],type="n")
text(pca$x[,1:2], expnames)
This plot shows the the two patient groups using the labels of low (L) and high (H) MRD.
Question 1. How well are the individual classes separated?
(See Chapter 8 and page 102 in the Microarray book )
Remaining in the R session started above; enter the following commands to perform a k-nearest neighbor classification on the Euclidian distance between ALL 8793 GENES.
knn.targets <- factor(classes)
predictions <- knn.cv(tposed, knn.targets, k=3, prob=TRUE)
The predicted classes are saved in "predictions". Type “predictions” to look at the predictions type.
knn.targets
Question 2. How good is your classifier at classifying each of the 15 patients: number of true positives (tp), false negatives (fn), true negatives (tn), false positives (fp)?
Question 3. What is the classification accuracy in percent?
Question 4. What is the correlation coefficient?
Hint (Correlation coefficient - CC):
tp <- specify number of true positives
fn <- specify number of false negatives
tn <- specify number of true negatives
fp <- specify number of false positives
CC <- ((tp*tn)-(fp*fn))/sqrt((tn+fn)*(tn+fp)*(tp+fn)*(tp+fp))
To see the calculated correlation coefficients type “CC”.
Question 5. From the accuracies and correlation coefficients, are these results any good?
(See Section 3.5 and page 100 in the Microarray book )
Remaining in the R session started above; enter the following commands to perform a t-test of all genes.
factors <- factor(classes)
index <- 1:length(factors)
index1 <- index[factors==levels(factors)[1]] # Category H
index2 <- index[factors==levels(factors)[2]] # Category L
pValues <- get.pval.ttest(Matrix,index1,index2)
orders <- order(pValues)
Matrix.pvalues <- cbind(Matrix[orders,],pValues[orders])
Matrix.pvalues[1:25,]
Question 6. With multiple testing cutoff of 0.05 percent change of 1 false positive (a maximum p-value of 0.05/8793 = 5.7e-6), how many genes are significant?
Perform a new principal component analysis by using only the 25 most significant genes.
tposed <- t(Matrix.pvalues[1:25,1:length(classes)])
pca <- fast.prcomp(tposed, retx=TRUE,center=TRUE, scale. = TRUE)
plot(pca$x[,1:2],type="n")
text(pca$x[,1:2], expnames)
Question 7. How well are the individual classes separated, now?
Remaining in the R session started above; enter the following commands to perform a k-nearest neighbor classification on the Euclidian distance between the 25 most significant genes.
predictions <- knn.cv(tposed, knn.targets, k=3, prob=TRUE)
The predicted classes are saved in predictions. To look at the prediction type “predictions”.
knn.targets
Question 8. How good is your classifier at classifying each of the 15 patients: number of true positives (tp), false negatives (fn), true negatives (tn), false positives (fp)?
Question 9. What is the classification accuracy in percent?
Question 10. What is the correlation coefficient?
Question 11. From the accuracies and correlation coefficients, how are the results compared to classification based on all genes?
This exercise demonstrates how the number of top ranking genes chosen to include for classification have an influence on classification performance.
In step 18, the number of top ranking genes to include was chosen as 25. Vary this number (e.g. try some numbers between 2 and 500) and plot the classification accuracy and/or correlation coefficient dependent on top ranking genes included for classification. In the following code, the accuracy and correlation coefficient is calculated automatically. The R code for calculating the accuracy is shown directly, while the written R code for calculation of the correlation coefficient is stored in to functions "calc.correlation" and "return.score".
x <- 2
tposed <- t(Matrix.pvalues[1:x,1:length(classes)])
predictions <- knn.cv(tposed, knn.targets, k=3, prob=TRUE)
test <- predictions==knn.targets
accuracy <- length(test[test==TRUE])*100/length(test)
cc <- calc.correlation(return.score(predictions,knn.targets))
plot(x=c(x1,x2,.....,xi),y=c(y1,y2,.....,yi),type="p",xlab="Number of genes", ylab="CC")
Here, x1 and y1 are corresponding pairs of the number of genes (x1) and classification accuracy or correlation coefficient (y1).
Question 12. How does the classification performance change with the number of genes?
With regard to prediction of classes based on top ranking genes selected by a t-test of all samples (this exercise and the previous exercise):
Question 13. Is this the right way of estimating classification performance?
Question 14. How would you get a truer estimate of your classification performance?
Question 15. Would this classification performance be better or worse than the just estimated performance?
cluster.matrix <- Matrix[orders,]
colnames(cluster.matrix) <- expnames
d <- dist(t(cluster.matrix))
hc <- hclust(d)
plot(hc)
Question 16. What do you see on this plot? How are the samples clustered?
m <- cluster.matrix[1:50,]
d <- dist(t(m))
hc <- hclust(d)
plot(hc)
Question 17. What do you see on this plot? How are the samples clustered?
heatmap(m, distfun
= cordist, col = topo.colors(32))
Question 18. What is the difference between this plot and the previous plot? Describe the plot.
Quit the R session with Control-D followed by the answer "n" to the question whether you want to save the workspace image.
Last
updated by Steen Knudsen, CBS, April 1, 2005