
Support vector machines

Morten Nielsen (mniel@cbs.dtu.dk)

In today's exercise you shall use support vector machines (SVMs) to train methods for predicting peptide binding to MHC molecules. For the first (and only) time in the course you shall not be developing your own code, but rather use a program package called Weka.

Details on the Weka program package can be found here: Weka Userguide

In the exercise you shall first develop a small program used to transform amino acid sequences (peptides) into numerical values suitable for numerical data mining. This is a program that will come in handy for many data mining projects, including the work we will do on artificial neural networks at the end of the course.

Once you have developed this program, you shall use the Weka software to train different SVMs to predict peptide-MHC binding, testing the use of different kernel functions and sequence encoding schemes.

Please remember for this week's exercise to upload both the pep2inp.c program and a document with the answers to the exercise questions to CampusNet.


Implementing the sequence encoding algorithm

First, you must make a new directory for today's exercise and copy some data files.

cd
mkdir SVM
cd SVM
cp /usr/opt/www/pub/CBS/courses/27623.algo/exercises/data/SVM/* .

The eval.dat and train.dat files contain peptide binding data. The 2nd column in each file gives the binding affinity. A value of 1 indicates a strong binder and a value of 0 a non-binder.
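Each line thus holds a peptide in the first column and its rescaled binding value in the second. As an illustration only (these peptides and values are made up; the actual files may differ), a couple of lines could look like

ILYQVPFSV 0.874
ALAKAAAAV 0.067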

The other three files (sparse.txt, blosum50.txt, and zscales.txt) contain the information used by three different sequence encoding schemes. Each file has the format

A, 0.24, -2.32, 0.60, -0.14, 1.30
R, 3.52, 2.50, -3.50, 1.99, -0.17
N, 3.05, 1.62, 1.04, -1.15, 1.61
D, 3.98, 0.93, 1.93, -2.46, 0.75
C, 0.84, -1.67, 3.71, 0.18, -2.65
Q, 1.75, 0.50, -1.44, -1.34, 0.66
E, 3.11, 0.26, -0.11, -3.04, -0.25
G, 2.05, -4.06, 0.36, -0.82, -0.38
..
W, -4.36, 3.94, 0.59, 3.44, -1.59
Y, -2.54, 2.44, 0.43, 0.04, -1.47
V, -2.59, -2.64, -1.54, -0.85, -0.02
with one line per amino acid, and each line containing information on how the amino acid should be encoded. For sparse encoding, each amino acid is encoded as one value of 1 and 19 zeros. For Blosum encoding, each amino acid is encoded as the Blosum50 scores against each of the 20 amino acids. The z-score encoding is a condensed way to represent each amino acid; here only 5 values are used to encode each amino acid. For details on how these z-scores were estimated, please refer to PMID 9651153.
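For example, the line for alanine in sparse.txt presumably looks like this, with the single 1 placed in the column corresponding to A itself (an assumption; check the file):

A, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0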
Next, you must copy the program template to your src directory. (Remember the "." at the end of the command.)
cd
cd src
cp /usr/opt/www/pub/CBS/courses/27623.algo/exercises/code/SVM/pep2inp.c .

The program pep2inp.c encodes peptides in numerical form given an encoding scheme (specified with the -sfile option).

Open the file pep2inp.c in your favorite editor. Most of the code is (as usual) pre-written. Go to the main procedure. Make sure you understand the structure of the program.

You shall fill in the missing code (XXXX) and make sure you understand the structure of the routine.
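If you get stuck, it may help to see the idea in isolation. Below is a minimal, self-contained sketch of the encoding step. It is not the actual template code; the variable names, the fixed alphabet string, and the toy scheme table in main are assumptions made for illustration only.

/* Sketch of the core of pep2inp: turn one peptide into one CSV line given
 * an encoding table. NOT the actual template code; names, the fixed
 * alphabet string, and the toy table in main are illustrations only. */

#include <stdio.h>
#include <string.h>

#define NVAL 5   /* values per amino acid: 5 for z-scores, 20 for sparse/Blosum */

static const char alphabet[] = "ARNDCQEGHILKMFPSTWYV";
static double scheme[20][NVAL];   /* filled from the -sfile file in the real program */

static void encode_peptide(const char *pep, double target)
{
        for (int i = 0; pep[i] != '\0'; i++) {
                const char *p = strchr(alphabet, pep[i]);
                if (p == NULL)
                        continue;   /* unknown residue; real code should report an error */
                int row = (int)(p - alphabet);
                for (int k = 0; k < NVAL; k++)
                        printf("%.2f,", scheme[row][k]);   /* one feature column per value */
        }
        printf("%.3f\n", target);   /* the binding value ends up in the last column */
}

int main(void)
{
        /* toy table: every value for residue j is simply j, just to run the sketch */
        for (int j = 0; j < 20; j++)
                for (int k = 0; k < NVAL; k++)
                        scheme[j][k] = (double) j;

        encode_peptide("ILYQVPFSV", 0.874);   /* made-up peptide and target */
        return 0;
}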

Next, compile the program

make -f Makefile pep2inp

When the compilation is successful, copy the program to the bin directory

cp pep2inp ../bin/
rehash

Now go to the SVM directory.

Next, run the pep2inp program on the train.dat and eval.dat files using the three different encoding schemes

pep2inp -sfile sparse.txt train.dat | grep -v "#" > train_sparse.csv
pep2inp -sfile sparse.txt eval.dat | grep -v "#" > eval_sparse.csv

pep2inp -sfile blosum50.txt train.dat | grep -v "#" > train_blosum.csv
pep2inp -sfile blosum50.txt eval.dat | grep -v "#" > eval_blosum.csv

pep2inp -sfile zscales.txt train.dat | grep -v "#" > train_zscore.csv
pep2inp -sfile zscales.txt eval.dat | grep -v "#" > eval_zscore.csv

Look at the output. How do the following files compare?

eval_sparse.csv
eval_blosum.csv
eval_zscore.csv

train_sparse.csv
train_blosum.csv
train_zscore.csv


PSSM recap

To get an idea of how difficult it is to make a model for peptide-MHC binding using the training data, we first construct a simple PSSM using the pep2mat program (the gawk command below selects the binders, i.e. peptides with binding values greater than 0.426), and next evaluate the predictive performance of this PSSM on the evaluation set.

cat train.dat | gawk '$2>0.426' | pep2mat -sw -- > train.mat
pep2score -mat train.mat eval.dat | grep -v "#" | gawk '{print $2,$3}' | xycorr
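The xycorr program compares the two columns (measured and predicted binding values). Below is a minimal sketch of the Pearson correlation coefficient it presumably reports; the function name and the toy data in main are made up for illustration.

/* Sketch of the Pearson correlation coefficient presumably reported by
 * xycorr. Compile with: cc pcc.c -lm. The toy data in main are made up. */

#include <math.h>
#include <stdio.h>

static double pearson(const double *x, const double *y, int n)
{
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;

        for (int i = 0; i < n; i++) {
                sx  += x[i];        sy  += y[i];
                sxx += x[i] * x[i]; syy += y[i] * y[i];
                sxy += x[i] * y[i];
        }
        /* covariance divided by the product of the standard deviations */
        return (n * sxy - sx * sy) /
               sqrt((n * sxx - sx * sx) * (n * syy - sy * sy));
}

int main(void)
{
        double x[] = {0.1, 0.4, 0.8, 0.3};   /* e.g. predicted values */
        double y[] = {0.2, 0.5, 0.7, 0.2};   /* e.g. measured values */

        printf("PCC = %f\n", pearson(x, y, 4));
        return 0;
}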

What is the predictive performance?


Using WEKA

Now we are ready to use Weka to train some SVMs. Open Weka by typing

weka-3-6-1 &

Sparse encoding

This will open the Weka GUI Chooser window on your terminal. Select the Explorer and open the train_sparse.csv file (use the Open file option). Select Files of Type: CSV. Spend some time looking at the data (use the Visualize All button). Can you understand why the histograms look as they do? How many alanines (A) do you have at position one in the training data?

Next go to Classify. Select your classifier using the Choose option: under functions, select SMOreg.

Next, select what type of SVM kernel to use. Click on the SMOreg text. This will open a new window where the different kernel options are shown. You shall first use the polynomial kernel, so leave all parameters as they are.
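For reference, here is a minimal sketch of the two kernel functions you will use in this exercise. With default options, Weka's PolyKernel computes K(x,y) = (x.y)^p for exponent p (p = 1 gives a linear kernel), and the RBFKernel you will meet later computes K(x,y) = exp(-gamma * ||x-y||^2); the toy vectors in main are made up.

/* The two kernels used in this exercise (sketch; compile with: cc kernels.c -lm).
 * With default options Weka's PolyKernel computes K(x,y) = (x.y)^p and the
 * RBFKernel computes K(x,y) = exp(-gamma*||x-y||^2). Toy vectors are made up. */

#include <math.h>
#include <stdio.h>

static double dot(const double *x, const double *y, int n)
{
        double s = 0;
        for (int i = 0; i < n; i++)
                s += x[i] * y[i];
        return s;
}

/* polynomial kernel: p = 1 gives the linear kernel */
static double poly_kernel(const double *x, const double *y, int n, double p)
{
        return pow(dot(x, y, n), p);
}

/* radial basis function kernel: similarity decays with squared distance */
static double rbf_kernel(const double *x, const double *y, int n, double gamma)
{
        double d2 = 0;
        for (int i = 0; i < n; i++)
                d2 += (x[i] - y[i]) * (x[i] - y[i]);
        return exp(-gamma * d2);
}

int main(void)
{
        double x[] = {1.0, 0.0, 0.0}, y[] = {0.5, 0.5, 0.0};

        printf("poly p=2: %f  rbf gamma=0.1: %f\n",
               poly_kernel(x, y, 3, 2.0), rbf_kernel(x, y, 3, 0.1));
        return 0;
}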

Now we are ready to train the SVM. First, select Use training set as the Test option, and click Start. What is the performance of the model (correlation coefficient)?

Next, select Cross-validation with the number of folds equal to 5, and click Start. The training takes a little while (2-4 minutes), so be patient. What is the performance of the model now (correlation coefficient)? Why is this value lower compared to what you found above?

You can evaluate how the model performs on an independent test set by clicking on Supplied test set and uploading the eval_sparse.csv file there. Note, also here you must specify that the file format is CSV. To evaluate the model on the test data, right-click on the model file (lower left corner of the Weka Explorer window), and select Re-evaluate model on current test set. What is the predictive performance of the model on the test set?

Now change the kernel function to a polynomial kernel of power 2 (click on SMOreg in the Weka Explorer window, then on PolyKernel in the weka.gui.GenericObjectEditor window, and set the exponent value to 2). Set the test option back to cross-validation, and retrain the model. What is the performance with the new kernel function, and what is the test set performance?

What has happened? Has the model improved by changing the kernel?

Change the power of the polynomial kernel back to 1, and change the complexity parameter C to 0.01. Redo the training and evaluation. What is the cross-validated and evaluation performance of the model?

Now change the kernel function to be a polynomial kernel of power 2. Redo the training and evaluation. What is the cross-validated and evaluation performance of the model? Can you understand why the power 2 kernel now performs better than the linear kernel?
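A hint for the last question: the power 2 kernel implicitly works in a feature space containing all pairwise products of the input features, since

$(\mathbf{x} \cdot \mathbf{y})^2 = \sum_{i,j} x_i x_j \, y_i y_j$

so the SVM can capture contributions from pairs of peptide positions, which the linear kernel cannot.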

Now change the kernel to an RBFKernel. Click on SMOreg in the Weka Explorer window, and on the kernel Choose option in the weka.gui.GenericObjectEditor window. Here, select RBFKernel (radial basis function). Set the complexity parameter C to 1. Leave all other options at their default values, and train the model using 5-fold cross-validation.

What is the predictive performance and what is the performance on the test set?

Change the gamma value of the RBF function to 0.1, and retrain the model.

What is the predictive performance and what is the performance on the test set?

What has happened? Has the model improved by changing the kernel?


Blosum encoding

Now repeat all the above analyses using the blosum encoded version of the training and evaluation peptides. Which kernel (PolyKernel with exponent X or RBF with gamma Y) works best for the Blosum encoding scheme?


Z-score encoding

As a final thing, you shall try to reduce the number of parameters in the SVM model by reducing the size of the numerical encoding of the amino acids. To do this, you shall use the z-score encoded versions of the training and evaluation files.

Upload the training and evaluation files, and train the SVM using the optimal kernel function found for the sparse and Blosum encoding schemes above. Note, the training takes longer (up to 10 minutes) using the reduced amino acid encoding compared to the Blosum and sparse encodings, so be patient.

What is the predictive performance and what is the performance on the test set?

What has happened? Has the model improved by changing the amino acid encoding scheme?

As you will probably find, the reduced amino acid encoding leads to a large drop in the predictive performance of the method. This is a general observation. It is very hard (if not impossible) to design an intelligent way to reduce the encoding of amino acids for data mining prediction algorithms.

What happens if you, instead of reducing the encoding, expand it? You can do this for instance by combining different encoding schemes, which can be done very simply using gawk. For instance, you can combine Blosum and sparse encoding as shown below. (Note that the tr command replaces all z's with y's; this is done in order to give all columns unique names.)

cat train_blosum.csv | tr "z" "y" > tmpfile
paste tmpfile train_sparse.csv | gawk '{printf( "%s,%s\n", $1,$2)}' | 
gawk -F "," '{for ( i=1;i<181;i++ ) { printf( "%s,", $i)} for ( i=182;i<181+181;i++ ) {  printf( "%s,", $i)} printf( "%s\n", $NF)}' > train_blosum_sparse.csv
cat eval_blosum.csv | tr "z" "y" > tmpfile
paste tmpfile eval_sparse.csv | gawk '{printf( "%s,%s\n", $1,$2)}' | 
gawk -F "," '{for ( i=1;i<181;i++ ) { printf( "%s,", $i)} for ( i=182;i<181+181;i++ ) {  printf( "%s,", $i)} printf( "%s\n", $NF)}' > eval_blosum_sparse.csv
rm -f tmpfile

or combine sparse and zscore encoding as

cat train_sparse.csv | tr "z" "y" > tmpfile
paste tmpfile train_zscore.csv | gawk '{printf( "%s,%s\n", $1,$2)}' | 
gawk -F "," '{for ( i=1;i<181;i++ ) { printf( "%s,", $i)} for ( i=182;i<181+46;i++ ) {  printf( "%s,", $i)} printf( "%s\n", $NF)}' > train_sparse_zscore.csv
cat eval_sparse.csv | tr "z" "y" > tmpfile
paste tmpfile eval_zscore.csv | gawk '{printf( "%s,%s\n", $1,$2)}' | 
gawk -F "," '{for ( i=1;i<181;i++ ) { printf( "%s,", $i)} for ( i=182;i<181+46;i++ ) {  printf( "%s,", $i)} printf( "%s\n", $NF)}' > eval_sparse_zscore.csv
rm -f tmpfile

Make sure you understand how the gawk code works! With sparse or Blosum encoding, each line has 9 x 20 = 180 feature columns plus one target column (181 in total), and with z-score encoding 9 x 5 = 45 feature columns plus one target (46 in total). The second gawk command keeps the feature columns from both files, skips the target column of the first file (field 181), and keeps the last field ($NF) as the single target column.

Try to train an SVM using a combined amino acid encoding scheme. Can you improve the predictive performance?


If you have more time, try to use other regression methods (for instance LinearRegression and MultilayerPerceptron). Note, MultilayerPerceptron (an artificial neural network) takes a very long time to train.

Now you are done. As you have seen, it has been very difficult to design an SVM model that performs better than a simple PSSM trained using pseudo counts and sequence weighting. What you also find is that the best SVM model has a linear kernel. What does this say about the peptide:MHC binding specificity?

What you have observed is a general feature of machine learning. In many cases different machine learning algorithms will have equal performance when predicting a given biological feature, and the choice of method is determined by personal preference rather than objective arguments. Bioinformaticians tend to behave like chickens. When chickens leave the egg, they will believe that any moving object larger than themselves is their mother (source: Ole Lund). Likewise, bioinformaticians will keep using the machine learning algorithm they first encountered during their scientific career.