|
Exercise in Comparative Genomic Hybridization
Written by: Carsten Friis
This exercise deals with the analysis of data from Comparative Genomic Hybridization (CGH). The exercise will illustrate some of the basic concepts and possibilities with this type of data. Some of the calculations can be quite cumbersome and you will need a fair bit of pre-written code to get through. Use cut-and-paste on the code to transfer it to R.
For more on R turn to the "Introduction to R"
Background
So far, microarrays have mostly covered single or few strains from the same species. However, with cheaper high-throughput sequencing techniques emerging, multiple strains of the same species are rapidly becoming available, allowing for the definition and characterization of a whole species as a population of genomes - the 'pan-genome'. A so-called pan-genome microarray has been developed by Willenbrock et al. 2007 covering 32 E. coli and Shigella genomes, designed in order to provide a tool for characterization of the E. coli pan-genome.
The microarray data originates from a study of Escherichia coli involved in a food poison outbreak a few years ago. In the study two distinct strains were identified. Of these, only one was found to express Shiga toxins, which we will call stx+, while we will the other stx-. As a reference strain we included E. coli O157:H7 EDL933, which means we have data on a total of three separate strains.
Load Data
The first step is to start R and load the data. The data set consists of CGH data on three E. coli strains.
- Start 'R' by typing R on the command prompt
This exercise must run on the 'ibiology' server. It is computationally demanding and it uses certain data which have not yet been published.
- Use the following commands to load the data into R
path <- "/home/projects/carsten/MVNexercises/CGH/"
load(paste(path, "data.all.Rdata", sep=""))
Note: Loading the data can take several minuttes...
- Let us take a quick look at what we loaded
ls()
The ls() function shows what data objects we have in memory. The genomic microarray data are stored in the pair.data object. The other object contain annotation for the genes on the array as well as other descriptory objects.
- Look at the data
str(pair.data)
The str() function shows us the structure of an object without actually printing all of the data on the screen (which you do not want to do!). You should be able to see that you have a data.frame (essentially a data matrix) listing an id for each probe as well as its x and y coordinates on the array surface. In addition you have data from 12 genomic samples, each consisting of 389211 measurements.
Quality Control
We can use our knowledge of the EDL933 strain to perform quality control on the array design. Since this is genomic data, if we look only at the arrays to which EDL933 was hybridized, we should see signal only from the probes which target genes present in EDL933.
- To make this analysis we need to calculate a couple of intermediate results first
Simply cut-n-paste the following chunck of code into R:
edl933probes.tbl <- design[which((design[,5]==43 | design[,5]==23) & design[,1]==1),]
edl933probes <- paste(edl933probes.tbl[,1],edl933probes.tbl[,2],sep="-")
index.edl933probes <- which(!is.na(match(pair.data$PROBE_ID,edl933probes)))
sub1.filtered <- pair.data[!is.na(match(pair.data$PROBE_ID,probes.filter.sub1.ok)),]
edl933 <- grep("EDL933g",colnames(sub1.filtered))
sub1 <- pair.data[grep("^1",pair.data$PROBE_ID),]
index.edl933probes <- which(!is.na(match(sub1$PROBE_ID,edl933probes)))
- Now plot the density distribution of known EDL933 probes against the others
plot(density(log2(sub1[-index.edl933probes,edl933[1]])),col="red",
main="Distribution of signal from all probes", xlab="Log-transformed intensities")
lines(density(log2(sub1[index.edl933probes,edl933[1]])),col="green")
legend("topright", c("EDL933 probes","non-EDL933 probes"), lty=1, col=c("green", "red"), cex=0.7)
- What can you tell from the plot?
- Apply a filter and draw the figure again
Designing probes for certain genes can be very difficult. If a gene has an unfortunate degree of homology with other genes it can be difficult to avoid cross-hybridization while still ensuring an appropriate sensitivity to the intended target(s). On top of that problem genes are often very short thus narrowing the field of possible non-overlapping probe candidates. But being restircted to using only a few probes for a given gene means that the overall signal for that gene will be based on fewer measurements and will thus be potentially more noisy.
It is very simple to apply a filter which removes the signal for any gene which is based on fewer than three good probes.
index.edl933probes <- which(!is.na(match(sub1.filtered$PROBE_ID,edl933probes)))
plot(density(log2(sub1.filtered[-index.edl933probes,edl933[1]])),col="red",
main="Distribution of signal from filtered probes")
lines(density(log2(sub1.filtered[index.edl933probes,edl933[1]])),col="green")
legend("topright", c("EDL933 probes","non-EDL933 probes"), lty=1, col=c("green", "red"), cex=0.7)
Copy Number Analysis
For this part we will use the Circular Binary Segmentation (CBS) algorithm to analyse the presence or absence of each gene in each strain (No, no relation to our institute, Sorry). The CBS algorithm is implemented in R in the DNAcopy package. Unfortunately, it is very computationally demanding and running it on the whole data set would take hours, even on our servers. For this reason we have prepared a pre-calculated object for you.
- Load the prepared data object
load(paste(path,"pred.present.lst.Rdata",sep=""))
The new object contains a list for each array with the identifiers for every gene found present in that sample. We will use this information to compare the three strains in a Venn diagram using the generic microarray utility package limma.
- Load the limma package
library(limma)
- Count the genes idenified in each bacterial strain
pred.present.venn <- sapply(pred.present.lst, tabulate, nbins=max(unlist(pred.present.lst)))
samples.venn <- sapply(c("EDL933","Stx_neg","Stx_pos"), grep, x=names(pred.present.lst))
venn.CNA <- cbind(matrix(tabulate(5:8),ncol=1,nrow=8), matrix(tabulate(3:4),ncol=1,nrow=8), matrix(tabulate(2),ncol=1,nrow=8))
venn.counts <- apply(venn.CNA,1, function(x) {col <- unlist(samples.venn[x>0])
sum(sapply(1:nrow(pred.present.venn), function(x, ppv) {
(sum(ppv[x,col]) == sum(ppv[x,])) && (sum(ppv[x,col]) == length(col))}, ppv = pred.present.venn))})
- Draw the venn diagram
venn.CNA <- cbind(venn.CNA,venn.counts)
colnames(venn.CNA) <- c(names(samples.venn), "Counts")
class(venn.CNA) <- "VennCounts"
vennDiagram(venn.CNA, cex=0.8, main="Relationship between gene content")
- What does the diagram tell you?
End of Part I
This is the end of the first part of the exercise. We will continue with Part 2 tomorrow, which can be found here
|