Events News Research CBS CBS Publications Bioinformatics
Staff Contact About Internal CBS CBS Other

Exercise in Comparative Genomic Hybridization

Written by: Carsten Friis


This exercise deals with the analysis of data from Comparative Genomic Hybridization (CGH). The exercise will illustrate some of the basic concepts and possibilities with this type of data. Some of the calculations can be quite cumbersome and you will need a fair bit of pre-written code to get through. Use cut-and-paste on the code to transfer it to R.

For more on R turn to the "Introduction to R"


Background

So far, microarrays have mostly covered single or few strains from the same species. However, with cheaper high-throughput sequencing techniques emerging, multiple strains of the same species are rapidly becoming available, allowing for the definition and characterization of a whole species as a population of genomes - the 'pan-genome'. A so-called pan-genome microarray has been developed by Willenbrock et al. 2007 covering 32 E. coli and Shigella genomes, designed in order to provide a tool for characterization of the E. coli pan-genome.

The microarray data originates from a study of Escherichia coli involved in a food poison outbreak a few years ago. In the study two distinct strains were identified. Of these, only one was found to express Shiga toxins, which we will call stx+, while we will the other stx-. As a reference strain we included E. coli O157:H7 EDL933, which means we have data on a total of three separate strains.


Load Data

The first step is to start R and load the data. The data set consists of CGH data on three E. coli strains.

  1. Start 'R' by typing R on the command prompt
  2. This exercise must run on the 'ibiology' server. It is computationally demanding and it uses certain data which have not yet been published.

  3. Use the following commands to load the data into R
  4. path <- "/home/projects/carsten/MVNexercises/CGH/"
    load(paste(path, "data.all.Rdata", sep=""))
    

    Note: Loading the data can take several minuttes...

  5. Let us take a quick look at what we loaded
  6. ls()
    

    The ls() function shows what data objects we have in memory. The genomic microarray data are stored in the pair.data object. The other object contain annotation for the genes on the array as well as other descriptory objects.

  7. Look at the data
  8. str(pair.data)
    

    The str() function shows us the structure of an object without actually printing all of the data on the screen (which you do not want to do!). You should be able to see that you have a data.frame (essentially a data matrix) listing an id for each probe as well as its x and y coordinates on the array surface. In addition you have data from 12 genomic samples, each consisting of 389211 measurements.


Quality Control

We can use our knowledge of the EDL933 strain to perform quality control on the array design. Since this is genomic data, if we look only at the arrays to which EDL933 was hybridized, we should see signal only from the probes which target genes present in EDL933.

  1. To make this analysis we need to calculate a couple of intermediate results first
  2. Simply cut-n-paste the following chunck of code into R:

    edl933probes.tbl <- design[which((design[,5]==43 | design[,5]==23) & design[,1]==1),]
    edl933probes <- paste(edl933probes.tbl[,1],edl933probes.tbl[,2],sep="-")
    index.edl933probes <- which(!is.na(match(pair.data$PROBE_ID,edl933probes)))
    
    sub1.filtered <- pair.data[!is.na(match(pair.data$PROBE_ID,probes.filter.sub1.ok)),]
    edl933 <- grep("EDL933g",colnames(sub1.filtered))
    sub1 <- pair.data[grep("^1",pair.data$PROBE_ID),]
    index.edl933probes <- which(!is.na(match(sub1$PROBE_ID,edl933probes)))
    
  3. Now plot the density distribution of known EDL933 probes against the others
  4. plot(density(log2(sub1[-index.edl933probes,edl933[1]])),col="red",
         main="Distribution of signal from all probes", xlab="Log-transformed intensities")
    lines(density(log2(sub1[index.edl933probes,edl933[1]])),col="green")
    legend("topright", c("EDL933 probes","non-EDL933 probes"), lty=1, col=c("green", "red"), cex=0.7)
    
    • What can you tell from the plot?

  5. Apply a filter and draw the figure again
  6. Designing probes for certain genes can be very difficult. If a gene has an unfortunate degree of homology with other genes it can be difficult to avoid cross-hybridization while still ensuring an appropriate sensitivity to the intended target(s). On top of that problem genes are often very short thus narrowing the field of possible non-overlapping probe candidates. But being restircted to using only a few probes for a given gene means that the overall signal for that gene will be based on fewer measurements and will thus be potentially more noisy.

    It is very simple to apply a filter which removes the signal for any gene which is based on fewer than three good probes.

    index.edl933probes <- which(!is.na(match(sub1.filtered$PROBE_ID,edl933probes)))
    plot(density(log2(sub1.filtered[-index.edl933probes,edl933[1]])),col="red",
         main="Distribution of signal from filtered probes")
    lines(density(log2(sub1.filtered[index.edl933probes,edl933[1]])),col="green")
    legend("topright", c("EDL933 probes","non-EDL933 probes"), lty=1, col=c("green", "red"), cex=0.7)
    
    • Did the filtering help?

Copy Number Analysis

For this part we will use the Circular Binary Segmentation (CBS) algorithm to analyse the presence or absence of each gene in each strain (No, no relation to our institute, Sorry). The CBS algorithm is implemented in R in the DNAcopy package. Unfortunately, it is very computationally demanding and running it on the whole data set would take hours, even on our servers. For this reason we have prepared a pre-calculated object for you.

  1. Load the prepared data object
  2. load(paste(path,"pred.present.lst.Rdata",sep=""))
    

    The new object contains a list for each array with the identifiers for every gene found present in that sample. We will use this information to compare the three strains in a Venn diagram using the generic microarray utility package limma.

  3. Load the limma package
  4. library(limma)
    
  5. Count the genes idenified in each bacterial strain
  6. pred.present.venn <- sapply(pred.present.lst, tabulate, nbins=max(unlist(pred.present.lst)))
    samples.venn <- sapply(c("EDL933","Stx_neg","Stx_pos"), grep, x=names(pred.present.lst))
    venn.CNA <- cbind(matrix(tabulate(5:8),ncol=1,nrow=8), matrix(tabulate(3:4),ncol=1,nrow=8), matrix(tabulate(2),ncol=1,nrow=8))
    
    venn.counts <- apply(venn.CNA,1, function(x) {col <- unlist(samples.venn[x>0])
      sum(sapply(1:nrow(pred.present.venn), function(x, ppv) {
        (sum(ppv[x,col]) == sum(ppv[x,])) && (sum(ppv[x,col]) == length(col))}, ppv = pred.present.venn))})
    
  7. Draw the venn diagram
  8. venn.CNA <- cbind(venn.CNA,venn.counts)
    colnames(venn.CNA) <- c(names(samples.venn), "Counts")
    class(venn.CNA) <- "VennCounts"
    
    vennDiagram(venn.CNA, cex=0.8, main="Relationship between gene content")
    
    • What does the diagram tell you?

End of Part I

This is the end of the first part of the exercise. We will continue with Part 2 tomorrow, which can be found here