Stabilization matrix method

In todays exercise you will accomplish the following
  • Implement two minimization procedures of the SMM method (gradient decent, and Monte Carlo)
  • Design a cross-validation training procedure to evaluate the two algorithms on data set of MHC-peptide binding
  • Optimize the weight on the stabilization term in the SMM method, and test if this optimization differs between MHC molecules and minimization method.


The project consists of two algorithm-templates where you must fill out the missing code, and two data sets of MHC-peptide binding data. First you must copy the data files

Make a new directory called SMM by typing

mkdir SMM

Copy data files for at least TWO HLA alleles by typing the following command (remember the "." in the end of the command). Note, you could select at least three alleles so that one has many peptide data points, and one has few, and one is in between these.. You can find the number of peptides and number of peptide binders per allele in the following link Data statistics. Here a binder is a peptide with a binding affinity stronger than 500 nM.

cd SMM
cp -r /usr/opt/www/pub/CBS/courses/27625.algo/exercises/data/SMM/XXXXX .

Here, XXXXX is the name of an MHC allele (A0201 for instance). For each allele (MHC molecule) you will now have 10 files called c000, c001, c002, c003, c004, f000, f001, f002, f003, and f004. Check this by typing

ls -ltr XXXXX

where XXXXX is the name of one of the selected MHC alleles. Each of these files contains peptide-MHC binding data. The format of the files is


The first column gives the peptide sequence, the second the binding value, and the last column the MHC allele name. The MHC binding values are given in 1-log(IC50)/log(50000) values meaning that a value of 1 corresponds to an IC50 value of 1 nM, and a value of 0 to an IC50 value of 50000 nM. Note, that low IC50 values correspond to strong binding.

The data files are prepared to be used for 5 fold cross validated training. The files c000, f000 for instance contain data for one of the five cross-validation partitions where the f000 file is used for training and the c000 file for testing.

Implementing the algorithms

Now we are ready to code. First you must copy the program templates to your src directory.

cd src
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/code/SMM/smm.c .
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/code/SMM/smm_mc.c .

The two files are templates for the gradient decent, and Monte Carlo based minimization procedures of the SMM method, respectively.


Open the file smm.c in your favorite editor. You shall fill in the missing code in the update_weight and main routines. Go to the update_weight and main routines and find the place with the missing code (XXXX). Make sure you understand the structure of the routines, and then fill out the missing code.

Next, compile the program

make smm

When the compilation is successful, copy the code to the bin directory

cp smm ../bin


The smm_mc.c implements a Monte Carlo based minimization of the SMM algorithm. To keep the work load down this program is almost completely written out for you. Open the file smm_mc.c. Spend some time to make sure you understand the structure of the program. Fill in the missing code (XXXXXX), and compile the program

make smm_mc

When the compilation is successful copy the code to the bin directory

cp smm_mc ../bin

Training and evaluation of the two SMM algorithms

Now we are ready to train the first MHC peptide binding prediction algorithm. Go to the SMM directory

cd SMM

Demonstrate to your-self that the two implementations work, and that they convert to similar error values. Use default parameters, the -v option and a peptide-file to do this. For example

cat c00? | smm -v -- > smm.out

Plot a curve showing the error as a function of the number of iterations.

Design a 5 fold cross validation training procedure for each allele using the data sets c000, c001, c002, c003, and c004. Train the SMM method using the smm and smm_mc programs with default parameters and calculate the cross-validated performance in terms of the Pearsons correlation (using the xycorr program) and the MSE (mean-squared-error). The latter is calculated as

cat file | gawk 'BEGIN{n=0; e=0.0}{n++; e += ($1-$2)*($1-$2)}END{print e/n}'

where file is a two-column file with target and prediction values.

Note, that you can evaluate an SMM-matrix using the program pep2score developed earlier in the course.

pep2score -mat smm.mat peptide-file | grep -v "#" | gawk '{print $2,$3}' 

where smm.mat is a matrix generate by one of the two SMM minimization programs, and peptide-file is a file with peptides (one of the c00? files).

Using default parameters for the two programs smm and smm_mc which program has the highest performance, and is this consistent for the different MHC alleles?

Using the same 5 fold-cross validation setup, try to find the optimal value for lambda (the -l option). Does this value differ between the two methods and MHC alleles? If you have time you can do a series of MHC alleles and try to find a correlation exists between the optimal lambda value and the size of the training data (or the number of peptide binders in the training data).

You might want to use some shell-scripting to make the training and evaluation more effective. You can get an exammple of such a tcsh script here

cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/code/SMM/doit .