data loading considerations:
Data can be read directly from CEL files, including compressed CEL files (decompression is performed
while reading and parsing the file).
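As an illustration of decompressing while reading, here is a minimal sketch using base R connections only. It does not show how the package itself reads CEL files; the file content is a few placeholder lines, not a real CEL file.

```r
## Write a few placeholder lines through a gzip connection
## (stand-in content, not a real CEL file).
path <- tempfile(fileext = ".gz")
con <- gzfile(path, open = "wt")
writeLines(c("[HEADER]", "Cols=2", "Rows=2"), con)
close(con)

## Read it back: gzfile() decompresses on the fly while
## readLines() consumes the stream, so no uncompressed copy
## of the file is ever written to disk.
con <- gzfile(path, open = "rt")
header <- readLines(con)
close(con)
unlink(path)

print(header)
```

The same pattern applies to any line-oriented parser: it only needs a connection to read from, whether or not the underlying file is compressed.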
Data from CEL files can be matched to the data in the corresponding CDF file. For now, CDF files have to be preprocessed (this is not very limiting, since there are not that many CDF files one is likely to work with). Preprocessed CDF files are available here. To generate a preprocessed file from an Affymetrix CDF file, a standalone program is available. It was written with memory constraints in mind: on a Linux PC with a Duron CPU, it generates the preprocessed files for the large HG_U95 chips in a couple of tens of seconds while using at most 25 MB of memory.
hardware considerations:
Memory consumption was kept low: the in-memory representation of a Hu6800 CEL file takes
at most about 5 MB, and can go down to 3 MB under certain conditions. To limit memory usage, the data
relating to the CDF file are kept separate from the data relating to the CEL file, avoiding duplicated
information as much as possible.
With the current price of memory, holding 20 CEL files in memory at the same time is not reserved to expensive systems.
As a test, 50 Hu6800 CEL files were loaded in R and the RSSIZE of the R process reached 180 MB.
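The figures above can be turned into a rough per-file estimate. This is only back-of-the-envelope arithmetic on the numbers quoted in the text; since the resident size also includes R itself, the result is an upper bound on what each CEL file costs.

```r
## Figures quoted in the text
n.files  <- 50     # Hu6800 CEL files loaded
total.mb <- 180    # resident size of the R process, in MB

## Upper bound on the footprint of one CEL file
## (the R process baseline is included in total.mb)
per.file.mb <- total.mb / n.files

## Rough memory needed to hold 20 CEL files at once
est.20.mb <- 20 * per.file.mb

cat("per file:", per.file.mb, "MB; 20 files:", est.20.mb, "MB\n")
```

The per-file value falls inside the 3 to 5 MB range given for a single Hu6800 CEL file.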
The chosen data structure is not the fastest one for tasks like generating expression values (or expression indexes),
but it performs decently (especially since a script does not need human interaction). On the same AMD Duron PC running
Linux, the computation of the Affymetrix trimmed average difference takes about 30 minutes. There is no question about it, this
is slow. One has to consider, however, that the click-menu-click operation to obtain the same result with other software takes at
least 20 minutes of somebody's time. More advanced methods that estimate parameters iteratively will be slower
(on the same PC, processing Hu6800 chips with E. Lazaridis's playerout method takes about one hour and a half).
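To give an idea of what a trimmed average difference involves, here is a minimal sketch in base R. It is not the package's implementation and not the exact Affymetrix trimming rule: it simply takes a trimmed mean of the PM - MM differences of one probe set. The function name `trimmed.avgdiff`, the `trim` fraction and the intensities are all made up for illustration.

```r
## Hypothetical helper: trimmed mean of the perfect match (pm)
## minus mismatch (mm) differences for a single probe set.
## 'trim' is the fraction of values dropped at each end.
trimmed.avgdiff <- function(pm, mm, trim = 0.1) {
  stopifnot(length(pm) == length(mm))
  mean(pm - mm, trim = trim)
}

## Made-up intensities for one probe set of 10 probe pairs;
## the fifth pair is an obvious outlier.
pm <- c(120, 135, 150, 160, 900, 140, 130, 145, 155, 125)
mm <- c( 60,  70,  80,  85, 100,  75,  65,  72,  78,  20)

plain   <- mean(pm - mm)           # pulled up by the outlier pair
trimmed <- trimmed.avgdiff(pm, mm) # smallest and largest difference dropped

cat("plain:", plain, " trimmed:", trimmed, "\n")
```

Applied to every probe set of every chip, even a simple summary like this adds up, which is part of why the computation takes a noticeable time.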
analysis considerations:
The package allows one to interact with the data at different levels. The work done will be presented at some point...
Very recent approaches were incorporated into the package quickly, such as the generation of expression values from probe pairs according to Li and Wong or according to E. Lazaridis et al., and normalization by invariant set according to Li and Wong. More can be added by the user. This should demonstrate the package's versatility and adaptability (thanks to the R machinery).
Compatibility considerations:
The package offers compatibility with the
affy package (up to version 0.4; after that, the merging of both packages was initiated).
The function convert2probepair.set.format does the conversion.
library(affyR)
data(listcel)
data(CDF.HU6800)
# name=TRUE is important
all <- getall.pps.val(CDF.HU6800, listcel, name=TRUE)
all.converted <- convert2probepair.set.format(all)
# let's try to use affy functions...
library(affy)
affy.mva.pairs(all.converted)
Data structures returned by the generate.all.ev method can be used with the permax package, and should be usable with the sma package.
| spatial level | The package allows quick visual quality control of the chip data.
The following lines of R code open a window displaying the image of a CEL file.
mycel <- read.celfile("path/to/my/celfile.CEL")
## display the image
image(mycel, transf=log)
|
| probe pair level |
Data can also be observed at the probe pair level.
(insert part here)
|
| generate expression values |
How to generate expression values from probe pair information has become a matter of debate. One can choose how to do it, and easily implement one's own method if desired. The code that follows generates the expression values for a set of CEL files according to a given method.
## read the data in the CEL files
listcel <- list()
listcel$cel1 <- read.celfile("exp1.CEL")
listcel$cel2 <- read.celfile("exp2.CEL")
listcel$cel3 <- read.celfile("exp3.CEL")
listcel$cel4 <- read.celfile("exp4.CEL")
## read the data in the corresponding preprocessed CDF file
cdf <- read.cdffile("HU6800.CDF.forR.dat")
## generate the expression values
exprvalues <- generate.all.ev(listcel, cdf, method="liwong.reduced")
The generated expression values can then be exported, or used in other R packages (like the sma package). |
| automated processing |
A sequence of operations that generates expression values for the genes can be automated.
Building partly on what has been presented above, the following example shows what an R script could look like.
##################################################
## The_way_I_want_it_to_be_done. #
## An R script to generate expression value data #
## #
## author: Dr. Frankenstein #
##################################################
library(affyR)
## reading the data from the CEL files.
## (One may want strategies better suited to real use,
## like reading the file names from the command line, from stdin,
## or, why not, having the user specify them the 'select'n'click' way using
## the Gtk interface. This is up to you.)
listcel <- list()
listcel$cel1 <- read.celfile("exp1.CEL", sd=FALSE)
listcel$cel2 <- read.celfile("exp2.CEL", sd=FALSE)
listcel$cel3 <- read.celfile("exp3.CEL", sd=FALSE)
listcel$cel4 <- read.celfile("exp4.CEL", sd=FALSE)
## normalization by a constant factor
## (now almost generally agreed to be not too good, but well this is just an example)
listcel <- normalize.celfile.constant(listcel, refindex=1)
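## read the data in the corresponding preprocessed CDF file
cdf <- read.cdffile("HU6800.CDF.forR.dat")
## generate the expression values
## (assuming you strongly believe the average difference is what should be done)
exprvalues <- generate.all.ev(listcel, cdf, method="avgdiff")
## dump the expression values in flat-file format
## (written to standard output here; pass file= to write.table to name a file)
write.table(exprvalues, quote=FALSE)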
Hopefully this demonstrates that one can keep up with the most recent processing strategies, and compare them or try new ones in an easy and concise way... with the possibility to dig into the details if needed. |
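For readers unfamiliar with the normalization by a constant factor used in the script, the idea can be sketched in base R as follows. This is an assumption about the general technique (scaling each chip so that its mean intensity matches that of a reference chip), not the code behind normalize.celfile.constant; the function name `rescale.to.reference` and the intensities are made up.

```r
## Hypothetical helper: rescale every chip so that its mean
## intensity equals the mean intensity of the reference chip.
## 'intensities' is a list of numeric vectors, one per chip.
rescale.to.reference <- function(intensities, refindex = 1) {
  ref.mean <- mean(intensities[[refindex]])
  lapply(intensities, function(x) x * ref.mean / mean(x))
}

## Made-up intensities: chip2 is globally half as bright as chip1
chips <- list(chip1 = c(100, 200, 300),
              chip2 = c( 50, 100, 150))

scaled <- rescale.to.reference(chips, refindex = 1)

## After rescaling, all chips share the reference mean
print(sapply(scaled, mean))
```

A single multiplicative factor per chip cannot correct intensity-dependent effects, which is one reason this kind of normalization is no longer considered very good.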