|
Stabilization matrix method
In todays exercise you will accomplish the following
- Implement two minimization procedures of the SMM method (gradient decent, and Monte Carlo)
- Design a cross-validation training procedure to evaluate the two algorithms on data set of MHC-peptide binding
- Optimize the weight on the stabilization term in the SMM method, and test if this optimization differs between MHC
molecules and minimization method.
Data
The project consists of two algorithm-templates where you must fill out the missing code, and two data sets of MHC-peptide binding data.
First you must copy the data files
Login to CBS servers (login.cbs.dtu.dk) using your login and password
Make a new directory called SMM by typing
mkdir SMM
Change to this directory and copy two sets of MHC binding data files
to this directory. You can get a list of the possible alleles by typing
ls -ltr /usr/opt/www/pub/CBS/courses/27625.algo/exercises/data/SMM/
Copy data files for at least TWO alleles by typing the following command twice (remember the "." in the end of the command).
Note, you could select the alleles so that one has many peptide data points, and one has few. You can find the
number of peptides and number of peptide binders per allele in the following link
Data statistics. Here a binder is a peptide with a binding affinity stronger than
500 nM.
cd
cd SMM
cp -r /usr/opt/www/pub/CBS/courses/27625.algo/exercises/data/SMM/XXXXX .
Here, XXXXX is the name of an MHC allele (A0201 for instance). For each allele (MHC molecule) you will now have 10 files
called c000, c001, c002, c003, c004, f000, f001, f002, f003, and f004. Check this by typing
ls -ltr XXXXX
where XXXXX is the name of one of the selected MHC alleles. Each of these files contains peptide-MHC binding data. The format of the
files is
YVMTMILFL 1 A0201
YLYALYSPL 1 A0201
YLMDKLNLT 1 A0201
YITALNHLV 1 A0201
WLIGFDFDV 1 A0201
VLALYSPPL 1 A0201
TLAPFNFLV 1 A0201
SLDVINYLI 1 A0201
The first column gives the peptide sequence, the second the binding value, and the last column the MHC allele name. The MHC binding values
are given in 1-log(IC50)/log(50000) values meaning that a value of 1 corresponds to an IC50 value of 1 nM, and a value of 0 to an IC50 value
of 50000 nM. Note, that low IC50 values correspond to strong binding.
The data files are prepared to be used for 5 fold cross validated training. The files c000, f000 for instance contain data
for one of the five cross-validation partitions where the f000 file is used for training and the c000 file for testing.
Implementing the algorithms
Now we are ready to code. First you must copy the program templates to your src directory.
cd
cd src
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/code/SMM/smm.c .
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/code/SMM/smm_mc.c .
The two files are templates for the gradient decent, and Monte Carlo based minimization procedures of the SMM method,
respectively.
smm.c
Open the file smm.c in your favorite editor. You shall fill in the missing code in the update_weight and main routines.
Go to the update_weight and main routines and find the place
with the missing code (XXXX). Make sure you understand the structure of the routines, and then fill out the
missing code.
Next, compile the program
make smm
When the compilation is successful, copy the code to the bin directory
cp smm ../bin
smm_mc.c
The smm_mc.c implements a Monte Carlo based minimization of the SMM algorithm. To keep the work load down
this program is almost completely written out for you. Open the file smm_mc.c. Spend some time to make sure you understand the
structure of the program. Fill in the missing code (XXXXXX), and compile the program
make smm_mc
When the compilation is successful copy the code to the bin directory
cp smm_mc ../bin
Training and evaluation of the two SMM algorithms
Now we are ready to train the first MHC peptide binding prediction algorithm. Go to the SMM directory
cd
cd SMM
Demonstrate to your-self that the two implementations work, and that they convert
to similar error values. Use default parameters, the -v option and a peptide-file to do this. For example
cat c00? | smm -v -- > smm.out
Plot a curve showing the error as a function of the number of iterations.
Design a 5 fold cross validation training procedure for each allele using the data sets c000, c001, c002, c003, and c004.
Train the SMM method
using the smm and smm_mc programs with default parameters and calculate the cross-validated
performance in terms of the Pearsons correlation
(using the xycorr program) and the MSE (mean-squared-error). The latter is calculated as
cat file | gawk 'BEGIN{n+0; e=0.0}{n++; e += ($1-$2)*($1-$2)}END{print e/n}'
where file is a two-column file with target and prediction values.
Note, that you can evaluate an SMM-matrix using the program pep2score
pep2score -mat smm.mat peptide-file | grep -v "#" | gawk '{print $2,$3}'
where smm.mat is a matrix generate by one of the two SMM minimization programs, and peptide-file is a file with peptides (one of the c00? files).
Using default parameters for the two programs smm and smm_mc which program has the highest performance, and
is this consistent for the different MHC alleles?
Using the same 5 fold-cross validation setup, try to find the optimal value for lambda (the -l option).
Does this value differ between the two methods and MHC alleles? If you have time you can do a series of MHC alleles
and try to find a correlation exists between the optimal lambda value and the size of the training data
(or the number of peptide binders in the training data).
You might want to use some shell-scripting to make the training and evaluation more effective. You can get an exammple
of such a tcsh script here
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/code/SMM/doit .
|