Exercise in preprocessing using Bioconductor
Written by: Carsten Friis
The purpose of this exercise is to familiarize you with the BioConductor and to give you some hands-on experience with the preprocessing of microarray data.
BioConductor is an international open-source oriented initiative which provides a host of packages for R to facilitate analysis of microarray data. For more on BioConductor itself, see the Bioconductor home page.
Bioconductor and Packages in R
The true power of R lies in its packages. Packages contain extra functions and objects which expand the basic functionality of R,
but they are too numerous and too specialized to be loaded into R by default.
- Begin R session.
If you have not done so already you need to install Bioconductor:
The function source() is a nifty function which reads and executes the R code found in the specified text file or (in this case) internet location. If you want to know more on how to install specific packages via Bioconductor you can visit following adress http://www.bioconductor.org/install/ .
For installation of Bioconductor core packages we use biocLite() function with no arguments (like above). To install a specific Bioconductor package we use name of the package as an argument to biocLite() function. Let us install 'affy' package which we will use today.
- Use the library() function without arguments to get an overview of the installed libraries:
The library() function has a special help system for retrieving package descriptions. It works like this:
library(help = [package_name])
- Find the help page for the package "affy", which contains a lot of functions collected by Bioconductor for working with microarray data from Affymetrix-style chips.
- Spend a few minutes reading a bit about some of the functions in the affy package (look under where it says 'Index:')
Help pages including descriptions of each function can be found for packages not currently loaded into memory like this:
help([function_name], package = affy)
- Look up the function normalize.qspline, you should recognize the name/algorithm from the talk earlier.
Before we can work with this function, we must first load the affy package. Please remember that you only need to install packages once but you need to load them every time you open a new R session:
Now that the package is loaded, you can view help on affy functions without specifying the package name. For example:
The affy package contains tools useful for analysing oligonucleotide arrays such as those produced by the company Affymetrix.
Because these arrays are highly standardized it is easy to write generic methods for handling them.
Bioconductor also provides packages like 'oligo' and 'marray' for working with NimbleGen, spotted arrays and the like, but such arrays are much less standardized and marray requires much more customization from the analyst. For this reason, we will focus on Affymetrix in today's exercise.
- Now type:
The function search() can be used to reveal which packages are currently loaded. When you type search() you should notice several names like 'graphics' and 'datasets'. These are packages automatically loaded by R every time you start the program. You should also see the affy package, which although not automatically loaded by R, we just loaded manually.
Exercise in Normalization of Microarray Data
In this exercise we will work with normalization and pre-processing of microarray data.
We will use the qspline normalization method as well as other pre-processing algorithms and investigate their effect on data
1 Installing additional packages
For the purpose of today's exercise we will use an additional R package called 'hexbin'. It will be used to plot and visualize your data. The package can be installed using the following command (you may be asked to choose a package source, you can choose Denmark):
To load the package into R type:
2 Affymetrix Arrays
During this exercise, you will use data from oligonucleotides arrays sold by the company Affymetrix (one of the primary manufacturer of oligonucleotide chips). The way data are stored is worth a look. It will help us in understanding what we manipulate and how to access what we want.
2.1 Affymetrix design
As you should know already, Affymetrix arrays are constituted of short probes (oligonucleotides), with several probes related to one gene. Most of the probe pair sets are constituted of 20 probe pairs.
2.2 Affymetrix files
The data you are about to use are stored in two different types of files, the CDF files (standing for Chip Definition File) and the CEL files. CDF files are used to store information related to a specific type of oligonucleotide array. All the arrays belonging to a given type will share this same information. As the quantity of information in a CDF can be rather large, this is an important point. This is stored in the package as only one cdf object, to which the corresponding Cel files will refer. In most cases, the user of the R-package affy does not have to worry about CDF data, since the package automatically takes care of this. The CEL file holds the intensity of the probes from a single GeneChip hybridization.
2.3 Loading the affy package
If you haven't already done so, now is your last chance to load the affy package. If, for some reason you should restart you R session you will have to load it again. (It is a component of the Bioconductor project)
2.4 Loading the data
For this exercise, we will use a subset of a larger data set developed by a company called GeneLogic. This data set is primarily geared towards evaluation and benchmarking of normalization algorithms, so it will do fine for our purposes today. The data are located here and you will need to download it to your own computers. You can download the *.CEL files individually or you can download and uncompress the *.zip file which contains all the data files. Pay particular attention to where you put the data, you will need the path to load it with the following commands:
mixture.batch <- ReadAffy(celfile.path="Directory where you put the data")
TIP: If you put the data in your working directory, you can load it succesfully with just ReadAffy(), no arguments, since the function defaults to your current working directory. You can see your working directory and change it with the getwd() and setwd() commands, respectively. You can also control it through the GUI in the pull-down menus.
Before starting any data analysis, it is always a good idea to understand the underlying experiment. In short, this experiment consists of a mixture of two samples (one from liver tissue, one from CNS=Central Nervous System) at various concentrations. Each set of concentrations is replicated two times. The whole thing is summarized here:
Overview of Data Set
|Chip name||ug Liver||ug CNS
You can learn more by looking at this whitepaper from GeneLogic. You don't need to look at it in detail, but if you do, please remember that you are only working with a subset of the whole GeneLogic data set.
2.5 Observing the raw data
It is usually wise to have a glance at the images before going further. Experimental problems might be revealed and subsequent action concerning the data could be taken. So, let's make an image of the data.
image(mixture.batch[,1], transfo = I)
You can also visualize log-transformed data:
image(mixture.batch[,1], transfo = log)
Q1 - What comment can you make about the log transformation of the data?
Q2 - Do you think the picture is more or less detailed after the log transformation?
The probe intensities stored in an AffyBatch object (i.e. mixture.batch) can be accessed by the function intensity(). Scatterplots of probe intensities represent an interesting view on your data, as shown below. However, modern Affymetrix chips contain several millions of probes. This makes traditional scatterplots very heavy to display or print and hard to overview. To overcome the problem, we can use a hexagonal binning plot. In brief, the plotting plane is broken into hexagons and the number of 'dots' that fall in each hexagon is counted. The hexagons are then colored according to the number of points they represent. The darker the hexagon, the more points it represents.
h1 <- hexbin(intensity(mixture.batch)[, 1], intensity(mixture.batch)[,2])
plot(h1, main = "Raw intensities")
h2 <- hexbin(log2(intensity(mixture.batch)[, 1]), log2(intensity(mixture.batch)[,2]))
plot(h2, main = "Log transformed intensities")
Q3 - Comments about the transformation on the data?
Q4 - Plot the intensities of different CEL files against each others. You do this by re-using the code above, changing the numbers in the square brackets to match other data columns (each column corresponds to one chip).
Q5 - What observation can you make?
A superposition of the density estimates for the intensities on each chip is also a very helpful plot:
hist(mixture.batch,main="Log2 transformed intensities",log=TRUE)
Q6 - Comments?
2.6 Normalization (Scaling)
The general purpose of normalization is to adjust (or correct) a signal in order to make the comparison with other signals more meaningful. In the case of microarrays, we need to make the results from each of our arrays comparable with the rest. The normalization step is done at the probe level.
The normalization method Qspline is one of the normalization methods available in the affy package.
mixture.qsp <- normalize(mixture.batch, method="qspline")
A plot of the distribution of the probe intensities is a useful diagnostic plot:
hist(mixture.qsp, main = "Normalized by qspline")
hist(mixture.batch, main="Un-normalized intensities")
Q7 - What observation can you make?
Q8 - How would you perform the normalization method called 'constant'?
2.7 Expression Index Calculation
Data analysts are usually interested in the expression values rather than individual probe values. The complete preprocessing of raw Affymetrix data into expression values can be done using background correction, normalization, perfect match correction and summary value computation. Several options exist for each step. To simplify things, we will skip the background correction here, and just focus on normalization and expression index calculation. One way to preprocess affymetrix chips is through the expresso() function, which performs both background correction and normalization as well as expression index calculation. It is slow, however, which is why we only use a small data set in this exercise.
mixture.ei.es <- expresso(mixture.batch, bg.correct = F, normalize = F, pmcorrect.method = "pmonly", summary.method = "liwong")
mixture.ei <- exprs(mixture.ei.es)
The expresso() function takes an object of class 'AffyBatch', but returns an object of class 'ExpressionSet'. Objects of class 'ExpressionSet' are easily converted to matrices using the function exprs().
You can easily see the difference after expression index calculation by comparing the number of data points for each chip before and after:
See how the number of data points has decreased.
Q9 - There are typically around 11-20 unique probes matching each target (i.e. each gene), so howcome the number of data points has decreased with more than a factor of 20?
3 Effect of Normalization in Detail [Optional Exercise]
Because the GeneLogic data set is constructed around a dilution experiment (same samples used, but diluted to different concentrations), the expression levels should correlate with the degree of dilution as given by this vector:
dilution <- c(1,1,0.75,0.75,0.5,0.5,0.25,0.25,0,0)
We can use this information to verify that normalization indeed does remove noise and not biologically relevant information. But first we need to calculate some expression indexes for qsplined data. For speed, the normalization was omitted before when you calculated the expression indexes above, but that is okay, for we need both normalized and un-normalized data here.
mixture.qsp.es <- expresso(mixture.batch, bg.correct = F, normalize.method = "qspline", pmcorrect.method = "pmonly", summary.method = "liwong")
mixture.qsp <- exprs(mixture.qsp.es)
Now calculate the correlations for the two data sets with the vector vec:
unnorm.cor <- apply(mixture.ei,1, cor, y=dilution)
norm.cor <- apply(mixture.qsp, 1, cor, y=dilution)
Visualize the difference in correlation for the normalized and un-normalized data with vec in the following 2 histograms:
Q10 - What do the plots tell you?
4 cDNA arrays [Optional Exercise]
cDNA microarrays differ much from Affymetrix arrays, and the strategies to normalize data are different. For this reason Bioconductor contains a package called 'marray' addressing cDNA arrays. As stated previously, however, using the 'marray' package to process cDNA arrays can be very cumbersome. Because of the lack of standardization, you would have to manually input information on slide layouts, choice of probes, type of equipment used, etc. (all the stuff given in the CDF files from Affymetrix, see 2.2 above). Indeed, the marray package is mostly useful for laboratories who produce thousands of cDNA arrays with the same setup. If, however, you want to use the qspline algorithm to normalize a cDNA array data set, you can do this by loading the cDNA data set into R as a matrix, and use the function normalize.qspline(). The only required argument would be the data matrix. Nevertheless, here is a small optional exercise showing how using marray would look
4.1 Load the package and data
The package marray addresses normalization for cDNA arrays. We will use the dataset swirl. More information about it can be found using the help system.
4.2 Observing your arrays
Once again, observing the spatial distribution of your intensities can show interesting features (mainly artefacts)
The function maImage() has several options. You can see them using the help system, and eventually want to explore some of them.
For cDNA array data the boxplot is often used.
You can normalize cDNA array data like this:
swirl.norm <- maNormMain(swirl)
You can observe the effect of normalization:
Here again, there are different strategies to normalize, some being very computer intensive. Help files for maNormMain() can tell you more.
5 Bibliography and links
- Laurent Gautier, Leslie Cope, Benjamin M. Bolstad and Rafael A. Irizarry: affy -analysis of Affymetrix data at the probe level. Bioinformatics, 2003
- Robert Gentleman and Ross Ihaka. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5(3), 1996.
- Steen Knudsen: A Biologistis Guide to Analysis of DNA Microarray Data. Wiley, New York, 2002.
- C Workman, LJ Jensen, H Jarmer, R Berka, L Gautier, HB Nielsen, HH Saxild, C Nielsen, S Brunak, and S Knudsen. A new non-linear normalization method to reduce variability in dna microarray experiments. Genome Biology, 2002.
Last modified by Marcin Krzystanek on 03/01/2011
Inspired by Carsten Friis, Laurent Gautier and Chris Workman