Data Analysis - Exercise

Statistical Analysis of Microarray Data


In this exercise we will analyze the data from two different microarray experiments
using the common t-test and an ANalysis of VAriance (ANOVA)


Take your time and read this introduction carefully.

Introduction

It has been found that some bacteria have an increased ability to adhere to solid surfaces when growing at sub-lethal alcohol concentrations. This phenotype is likely to cause problems during manufacturing of food, where alcohol frequently is used for disinfection. The mechanism behind this alcohol-induced adherence is still unknown, but studies like the one you will do in this exercise are hopefully going to point out in which direction(s) the research should go.

There is another alcohol-induced phenotype, which of course may be part of the overall reponse. A majority of the alcohol affected cells appear to have a different morphology. They tend to have a strange looking surface and grow slightly curved (like scrimps).

The first of the two experiments is a classical medium comparison: The wild type bacteria is grown in rich medium with and without alcohol. This experiment has been replicated 4 times. We therefore have 4 values corresponding to the wild type grown in rich medium and 4 values corresponding to the wild type grown in rich medium supplemented with a sub-lethal concentration of alcohol.

The second experiment is the same setup, but instead of using the wild type a strain with a mutation in the gene encoding a transcription factor (TF) has been used. The transcription factor is suspected to be involved in gene regulation during growth in the presence of alcohol. This experiment has been replicated 7 times.

The data file contains the combined data from both experiments. Each gene is represented by a line (row) and the format is like this: The gene name (in column 1) followed by the "raw" expression values from the above described experiments (column 2-23) in this order:
column 2-5: wild type without alcohol
column 6-9: wild type with alcohol
column 10-16: Mutant with alcohol
column 17-23: Mutant without alcohol

First: A little warming-up question:

  • What do you think we mean by "raw"?

    Exercise

    Download the file: data.combined.tab (to your Desktop), take a look at it and open R.

    Loading and normalizing:

    Load your data and normalize it by the use of Qspline.
    require(affy)
    Intensity <- read.table(file="Desktop/data.combined.tab")
    Int.m <- as.matrix(Intensity[,2:23])
    Norm.Int.m <- signif(normalize.qspline(Int.m,na.rm=TRUE),digits = 6)

  • Why is it important to normalize the data before running the statistical tests?

    (hint: Copy and paste the commands directly into your R prompt)

    As you may notice there is a warning message: "NaNs produced in: log(x)", this is due to negative values in the data.

  • How do you think we got negative expression values?

    Try to plot the "raw" data from one of the slides and compare it to a plot of the same data after it has been normalized.
    par(mfrow=c(2,2)) # to get a window with room for 4 plots
    plot(Int.m[,1],Int.m[,5],main="wild type - raw",xlab="no alcohol",ylab="alcohol",log="xy")
    plot(Norm.Int.m[,1],Norm.Int.m[,5],main="wild type - normalized",xlab="no alcohol",ylab="alcohol",log="xy")
    
    ... and one of the mutant arrays:
    plot(Int.m[,18],Int.m[,11],main="mutant - raw",xlab="no alcohol",ylab="alcohol",log="xy")
    plot(Norm.Int.m[,18],Norm.Int.m[,11],main="mutant - normalized",xlab="no alcohol",ylab="alcohol",log="xy")
    

  • Do you see the difference?

    The t-test:

    In order to run the two t-tests you first need to make a function that returns the desired P-values:
    get.pval.ttest <- function(dataf,index1,index2,
    datafilter=as.numeric){
    f <- function(i) {
    return(t.test(datafilter(dataf[i,index1]),
    datafilter(dataf[i,index2]))$p.value)}
    return(sapply(1:length(dataf[,1]),f))}
    

    Do the tests and get out the P-values in a nice format:
     
    pVal.TF <- signif(get.pval.ttest(Norm.Int.m,9:15,16:22), digits = 3)
    pVal.wt <- signif(get.pval.ttest(Norm.Int.m,5:8,1:4), digits = 3)
    
     
    Order the data in accordance to the P-values from the wildtype experiment:
    orders <- order(pVal.wt)
    ordered.data <- cbind(as.vector(Intensity[,1])[orders],pVal.wt[orders],pVal.TF[orders])
     
    Write your results into a file:
    write(t(as.matrix(ordered.data)),ncolumns=3,file = "Desktop/t-test.wt.tab")

    Now do the same thing with the results from the mutant experiment:
    orders <- order(pVal.TF)
    ordered.data <- cbind(as.vector(Intensity[,1])[orders],pVal.wt[orders],pVal.TF[orders])
    write(t(as.matrix(ordered.data)),ncolumns=3,file = "Desktop/t-test.TF.tab")

    The two files you get from this contain exactly the same data, but are sorted differently. The "t-test.wt.tab" is sorted on the P-values from the t-test on the data from the 4 wildtype experiments (second-last column). The "t-test.TF.tab" file is sorted on the P-values from the t-test on the data from the 7 mutant experiments (last column). Examine the two files and try to answer the following questions:

    t-test Questions

    1. Which of the t-tests have given us the "best" result (more significant genes)? And why do you think so?
    2. What have we tested in these experiments?
    3. How many genes have we tested? (hint: length(Intensity[,1]))
    4. At which P-value would you apply the cutoff (use Bonferroni at 0.01)?
    5. How many genes are significant in each of the two experiments?
    6. Which correction for multiple testing is most strict, Bonferroni or Benjamini-Hochberg?
    7. What is the "allowed" P-value at the Bonferroni cutoff according to Benjamini-Hochberg? Look at the output from both tests. You will get two different values. (hint: P = i/N*0.01 and i is the rank of the last significant gene, ... or in other words: The same numbers that are the answers in question 5)

    Biological Interpretation

    Download to your Desktop: annotation.tab, and see if you can find out which genes may be involved in the mechanism that changes the phenotype.

    Now, produce a file from the transcription-factor experiment, that gives you the gene name, the annotation, the P-value, the fold change and the log2-fold change:
    ann <- read.delim(file="Desktop/annotation.tab",sep="\t")
    fold.m <- Norm.Int.m[,9:15]/Norm.Int.m[,16:22]
    fold.mean.m <- signif(apply(fold.m,1,function(x){mean(x,na.rm=TRUE)}), digits=3)
    log.fold.mean.m <- signif(log2(fold.mean.m), digits=3)
    ordered.data <- cbind(as.vector(Intensity[,1])[orders],
    pVal.TF[orders],fold.mean.m[orders],log.fold.mean.m[orders],as.vector(ann[,2])[orders])
    write(t(as.matrix(ordered.data)),ncolumns=5,file = "Desktop/t-test.fold.TF.tab")

    Biological Questions

    1. Do you find any significantly affected genes, that you could suspect plays a role in the alcohol-induced adhesion?
    2. From these results which genes would you investigate further?
    3. How do you think you could verify these findings?

    Volcano Plot:

    A Volcano plot shows the connection between the P-values and the log2 of the fold change, compared to the same analysis of the permuted data. The permutation is a total randomization of the columns of the experiment. After this we do the t-test again and calculate a new log2-fold-change (M) value. Finally we plot the two plots on top of each other in order to see if our significant genes are still significant in the permuted data, which would be an indication of false positives. In order to make a Volcano Plot we need to obtain the belonging P-values, so you perform a t-test on a permuted matrix (Permuted.m). Then you calculate the "permuted" M value (log.fold.permuted.mean.m) and plot it against the "permuted" P-values (, green dots). The "real" M values are plottet against the "real" P-values (blue dots) on top of the permuted plot.
    random <- sample(c(9:22))
    permuted.m <- Norm.Int.m[,c(1:8,random)]
    pVal.permuted.TF <- get.pval.ttest(permuted.m,9:15,16:22)
    log.fold.permuted.mean.m <- log2(apply(permuted.m[,9:15]/permuted.m[,16:22],1,function(x){mean(x,na.rm=TRUE)}))
    plot(log.fold.mean.m,pVal.TF,main="Volcano Plot", log="y",xlab="M (log2 fold change)",ylab="P-value", pch="+", col="blue")
    points(log.fold.permuted.mean.m,pVal.permuted.TF, type="p", pch="+", col="red")
    

    Volcano Plot Question

    1. Do you now think that your cutoff was reasonable?

    The two-way ANOVA:

    Because the two performed experiments are performed under identical conditions we have everything constant except two factors (+/- alcohol and +/- mutation), this gives us the possibility of using a two-way ANOVA. The two-way ANOVA takes advantage of all data from both experiments. In this way we are not only able to extract genes affected by alcohol with a higher degree of confidence (lower P-values), but also genes affected by the mutation and genes affected by an interaction of the two factors. We therefore get three P-values in the output. Run the two-way ANOVA by the use of the following R-commands:
    anova.2F<-function(vector){
    anova.df$x<-vector
    return(anova(aov(x ~ F1*F2, anova.df))$Pr[1:3])}
    TF.f<-factor(c(0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1))
    alcohol.f<-factor(c(0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0))
    anova.df<-data.frame(F1=TF.f, F2=alcohol.f, x="Na")
    BA.2f.pval<-apply(Norm.Int.m, 1, anova.2F) # this command may take a few minutes
    pVal1 <- signif(BA.2f.pval[1,], digits = 3)
    pVal2 <- signif(BA.2f.pval[2,], digits = 3)
    pVal3 <- signif(BA.2f.pval[3,], digits = 3)
    orders <- order(pVal2)
    ordered.data <- cbind(as.vector(Intensity[,1])[orders],
    pVal2[orders],pVal1[orders],pVal3[orders])
    write(t(as.matrix(ordered.data)),ncolumns=4,file = "Desktop/ANOVA.tab")
    The output is sorted on the first of the three P-values (the third last column of the output file). The top of the list is therefore showing the genes that are significantly affected by the presence of alcohol. The second last column is the P-value, that shows which genes are affected by the transcription-factor mutation. The last column show the P-values from the test where the two conditions are combined, that is which genes that are affected by the presence of the mutation and the alcohol at the same time, not just one of them (a synergistic effect).

    ANOVA Questions

    1. How does this affect the P-values?
    2. Is the rank of the genes the same?
    3. What do the "1"/"0" stand for in each factor?
    4. Is the two-way ANOVA "stronger" than the t-test?
    Now try to run the commands again but change one or two of the "1"/"0" in the alcohol factor.
    alcohol.changed.f<-factor(c(0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0)) # make changes! 
    anova.df<-data.frame(F1=TF.f, F2=alcohol.changed.f, x="Na")
    BA.2f.changed.pval<-apply(Norm.Int.m, 1, anova.2F) # this command may take a few minutes
    pVal2.changed <- BA.2f.changed.pval[2,]
    
    Plot the "real" P-values against the permuted P-values:
    plot(pVal2, pVal2.changed,xlab="P-value",ylab="permuted P-value",log="xy",ylim=range(pVal2)) 
    abline(0,1,col="red") # makes a red line through the diagonal of the plot
    

    permuted ANOVA Questions

    1. What is it you do when you make this change?
    2. How does this affect the P-values?
    Do also the comparison of the output from the t-test (the TF experiment) and the ANOVA (+/- alcohol):
    plot(pVal2, pVal.TF,xlab="P-value from ANOVA",ylab="P-value from t-test",log="xy",ylim=range(pVal2))
    abline(0,1,col="red")
    

    Biological interpretation

    Make as before a smaller file, that contains the gene name, P-value (with and without alcohol), fold change and log2-fold change.

    fold.m <- Norm.Int.m[,5:8]/Norm.Int.m[,1:4]
    fold.mean.m <- signif(apply(fold.m,1,function(x){mean(x,na.rm=TRUE)}), digits=3)
    log.fold.mean.m <- signif(log2(fold.mean.m), digits=3)
    orders <- order(pVal2)
    ordered.data <- cbind(as.vector(Intensity[,1])[orders],
    pVal2[orders],fold.mean.m[orders],log.fold.mean.m[orders],as.vector(ann[,2])[orders])
    write(t(as.matrix(ordered.data)),
    ncolumns=5,
    file = "Desktop/ANOVA.fold.tab")
    Compare this to the output from the t-tests, where the effect of alcohol on both the wild type and the transcription-factor mutant were tested.

    Final Questions

    1. Do you find the same genes in the top?
    2. Do you find any genes that you think could be involved in the unknown mechanism? - consult the annotation file and maybe PubMed
    3. Suggest laboratory experiments that may reveal more about the investigated mechanism?
    4. Besides the alcohol effect, what other two effects do we test for?

    BONUS EXERCISE

    If you have time left, try running the ANOVA on the "raw" data and compare the two results.


    Made by Hanne Jarmer, CBS, August 17, 2004