**CLASSIFICATION BASED ON DNA MICROARRAY DATA**

This exercise will illustrate classification based on data produced by DNA microarrays used for monitoring expression levels of whole cell mRNA populations.

The exercise uses a leukemia dataset and several analysis methods in the R statistics package.

Dataset

The leukemia dataset contains gene expression levels for patients with high and low minimal residual disease (MRD). MRD is a measure of the number of leukemic cells that are present in the bone marrow after a period of treatment. MRD is therefore a measure for how well the patients respond to the given treatment. Patients with low MRD have a good response to treatment (and the cure rates for these patients are high), whereas patients with high MRD have a poor response to treatment (and a lower chance of long term survival).

The Affymetrix Focus Array GeneChips where applied (a total of 8793 measured gene expressions).

Method

**R**: Public domain data analysis package available for download for a
number of different computer platforms at www.r-project.org.
There are several online manuals available: Kickstarting R and An
Introduction to R (PDF file).

Copy the commands given in the exercise into the window. *If you get an error, try to copy one line at
a time.*

**Please write down your answers during
the exercise, so you can compare the results obtained in the different parts of
the exercise!!!**

Exercises:

First load the data into R:

- Make sure that you have started the Xwin32 server on the labtops (windows computers).
- Login to genome.

- Start R:

R

- Check that your R directory is empty:

ls()

- Load the data:

load("~hanni/Courses/Phdcourse/leukemiaMRD-data")

- Check that you have loaded the data correctly:

ls()

You should now have the following objects loaded into R:

- Matrix

- classes

- expnames

Look
at your data by typing "classes" and "expnames".

“classes” is a vector with the assigned classes (H: high MRD, L: low MRD).

The “expnames” are the names of the experiments followed by the class. Eg. patient number 20 has high MRD and is named: “P20.H”.

Look at the first 15 lines of the data object "Matrix":

`Matrix[1:15,]`

Here, you see the first 15 rows of the ‘Matrix’ object, where each row is a different gene (the Affymetrix ID’s are listed), and each column is an experiment. The values are gene expression levels for each combination of experiment and gene.

Visualize the data before performing any analysis.

- Load packages and functions:

library(MASS) # function for writing to file is used

library(class) # functions for class prediction

source("~hanni/Courses/Phdcourse/R.functions")

- Transpose data matrix:

`tposed <- t(Matrix)`

- Run PCA:

pca <- fast.prcomp(tposed, retx=TRUE,center=TRUE, scale. = TRUE)

- Plot the first two principal components:

plot(pca$x[,1:2],type="n")

text(pca$x[,1:2], expnames)

This plot shows the the two patient groups using the labels of low (L) and high (H) MRD.

Question 1. How well are the individual classes separated?

(See Chapter 8 and page 102 in the Microarray book )

Remaining in the R session started above; enter the following commands to perform a k-nearest neighbor classification on the Euclidian distance between ALL 8793 GENES.

- Make a list of target classes:

knn.targets <- factor(classes)

- Perform k-nearest neighbor classification with leave-one-out-cross-validation (LOOCV) and k=3:

predictions <- knn.cv(tposed, knn.targets, k=3, prob=TRUE)

The predicted classes are saved in "predictions". Type “predictions” to look at the predictions type.

- Check your classification accuracy by comparing with target classes:

knn.targets

Question 2. How good is your classifier at classifying each of the 15 patients: number of true positives (tp), false negatives (fn), true negatives (tn), false positives (fp)?

Question 3. What is the classification accuracy in percent?

Question 4. What is the correlation coefficient?

Hint (Correlation coefficient - CC):

tp <- specify number of true positives

fn <- specify number of false negatives

tn <- specify number of true negatives

fp <- specify number of false positives

CC <- ((tp*tn)-(fp*fn))/sqrt((tn+fn)*(tn+fp)*(tp+fn)*(tp+fp))

To see the calculated correlation coefficients type “CC”.

Question 5. From the accuracies and correlation coefficients, are these results any good?

(See Section 3.5 and page 100 in the Microarray book )

Remaining in the R session started above; enter the following commands to perform a t-test of all genes.

- First, assign your data to each class according to the vector “classes”:

factors <- factor(classes)

index <- 1:length(factors)

index1 <- index[factors==levels(factors)[1]] # Category H

index2 <- index[factors==levels(factors)[2]] # Category L

- Calculated p-values for each gene (this will take several minutes):

pValues <- get.pval.ttest(Matrix,index1,index2)

- Order the data according to lowest p-value:

orders <- order(pValues)

Matrix.pvalues <- cbind(Matrix[orders,],pValues[orders])

- Look at the 25 genes with the lowest p-value (the most significant genes):

Matrix.pvalues[1:25,]

Question 6. With a Benjamini and Hochberg cutoff of 1 false positive (a maximum p-value of 1/8793 = 1.14e-4), how many genes are significant?

Perform a new principal component analysis by using only the 25 most significant genes.

tposed <- t(Matrix.pvalues[1:25,1:length(classes)])

- Run PCA:

pca <- fast.prcomp(tposed, retx=TRUE,center=TRUE, scale. = TRUE)

- Plot the first two principal components:

plot(pca$x[,1:2],type="n")

text(pca$x[,1:2], expnames)

Question 7. How well are the individual classes separated, now?

Remaining in the R session started above; enter the following commands to perform a k-nearest neighbor classification on the Euclidian distance between the 25 most significant genes.

predictions <- knn.cv(tposed, knn.targets, k=3, prob=TRUE)

The predicted classes are saved in predictions. To look at the prediction type “predictions”.

- Check your classification accuracy by comparing with target classes:

knn.targets

Question 8. How good is your classifier at classifying each of the 15 patients: number of true positives (tp), false negatives (fn), true negatives (tn), false positives (fp)?

Question 9. What is the classification accuracy in percent?

Question 10. What is the correlation coefficient?

Question 11. From the accuracies and correlation coefficients, how are the results compared to classification based on all genes?

This exercise demonstrates how the number of top ranking genes chosen to include for classification have an influence on classification performance.

In step 18, the number of top ranking genes to include was chosen as 25. Vary this number (e.g. try some numbers between 2 and 500) and plot the classification accuracy and/or correlation coefficient dependent on top ranking genes included for classification. In the following code, the accuracy and correlation coefficient is calculated automatically. The R code for calculating the accuracy is shown directly, while the written R code for calculation of the correlation coefficient is stored in to functions "calc.correlation" and "return.score".

- In
the following, change
**x**– the number of top ranking genes:

x <- 2

tposed <- t(Matrix.pvalues[1:x,1:length(classes)])

predictions <- knn.cv(tposed, knn.targets, k=3, prob=TRUE)

test <- predictions==knn.targets

accuracy <- length(test[test==TRUE])*100/length(test)

cc <- calc.correlation(return.score(predictions,knn.targets))

- The curve may be drawn using R:

plot(x=c(x1,x2,.....,x_{i}),y=c(y1,y2,.....,y_{i}),type="p",xlab="Number
of genes", ylab="CC")

Here, x1 and y1 are corresponding pairs of the number of genes (x1) and classification accuracy or correlation coefficient (y1).

Question 12. How does the classification performance change with the number of genes?

With regard to prediction of classes based on top ranking genes selected by a t-test of all samples (this exercise and the previous exercise):

Question 13. Is this the right way of estimating classification performance?

Question 14. How would you get a truer estimate of your classification performance?

Question 15. Would this classification performance be better or worse than the just estimated performance?

- Make a copy of the original Matrix to work with and sort the rows according to p-values:

cluster.matrix <- Matrix[orders,]

colnames(cluster.matrix) <- expnames

- Calculate Euclidian distances:

d <- dist(t(cluster.matrix))

- Perform clustering:

hc <- hclust(d)

- Make a cluster plot:

plot(hc)

Question 16. What do you see on this plot? How are the samples clustered?

- Continue working with only the 50 most significant genes:

m <- cluster.matrix[1:50,]

- Calculate Euclidian distances:

d <- dist(t(m))

- Perform clustering:

hc <- hclust(d)

- Make a cluster plot:

plot(hc)

Question 17. What do you see on this plot? How are the samples clustered?

- Perform another clustering using the top 50 most significant genes (here, the distances are based on Pearson's correlation coefficient):

heatmap(m, distfun
= cordist, col = topo.colors(32))

Question 18. What is the difference between this plot and the previous plot? Describe the plot.

Quit the R session with Control-D followed by the answer "n" to the question whether you want to save the workspace image.

*Last
updated by Steen Knudsen, CBS, April 1, 2005*