Gibbs sampling

In this exercise, you shall implement the Gibbs sampling algorithm for identification of linear binding motifs in peptide ligand data.

Implementing the algorithms

First you must make a new directory for todays exercise and copy some data files. Change to the directory where you keep the files for the course. This could be your home directory, or a directory called Algo?

mkdir gibbs
cd gibbs
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/data/Gibbs/DRB10401.lig .
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/data/Gibbs//DRB10401.eval .

Next, you must copy the program templates to you src directory. (Remember the "." in the end of the command)

cd $ALGOHOME/src
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/code/Gibbs/gibbs_sampler.c .
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/code/Gibbs/cl2pred.c .

The first program constructs a PSSM from a list of peptides using the Gibbs sampler algorithm to identify the optimal core alignment and writes the PSSM to standard output. The second program scores a peptide list against a PSSM and prints the peptide sequence, (eventual binding affinity) and PSSM score to standard output. The prediction score of the peptide to a PSSM is the score of the highest scoring binding core contained within the peptide. The format of the PSSM matrix made by the gibbs_sampler program is the standard Psi-Blast profile matrix output.


Open the file gibbs_sampler.c in your favorite editor. Go to the main procedure. Make sure you understand the structure of the program.

You shall fill in the missing code (XXXX). Again make sure you understand the structure of the routine, and then fill out the missing code.

Next, compile the program

make gibbs_sampler.c

When the compilation is successful copy the program to the binary directory

cp gibbs_sampler ../bin/

The rehash command updates a table used by the operating system to keep track of all executable programs to include your new program pep2mat

Now go to the gibbs directory. First check what is the content of the file DRB10401.lig by typing

cat DRB10401.lig

Next, run the gibbs_sampler program on the file DRB10401.lig

gibbs_sampler DRB10401.lig > DRB10401.mat

Look at the output. How does it compare to the DRB10401.mat scoring matrix?

Note, the solution you get might differ (even dramatically) from the solution shown above. This is due to the randomness in the search "order" of the Gibbs sample imposed by different computers used to make the calculation. The solution above was found on the CBS interaction server. If you run on MACOS your solution could be more similar to DRB10401.mat (MACOS)

You can alter the search order, by specifying a new seed for the random number generator using the -s option of the gibbs_sampler, i.e

gibbs_sampler -s 135 DRB10401.lig > DRB10401.mat_newseed

Doing this you will see that the solution comes out rather different, and might even have a better fitness score ("Final Kullbach Leibler distance sum").

You can make a sequence logo visualization of the predicted binding motif from the identified binding cores using the Seq2Logo server. Here you simply copy the PSSM predicted by the Gibbs sampler to the Seq2Logo server. Note, you have pass the complete PSSM including the first line defining the amino acid order to the Seq2Logo server. Look at the predicted binding motif. Does it have the expected characteristics (a dominant hydrophobic P1 anchor for example)?


Now go back to the src directory and open the file cl2pred.c. Spend some time to make sure you understand the structure of the program. Fill in the missing code (XXXXXX), and compile the program

make cl2pred

When the compilation is successful copy the program to the binary directory

cp cl2pred ../bin/

Now go to the Gibbs directory, and evaluate the predictive performance of the first PSSM. To do this you shall use the evaluation file DRB10401.eval. This file contain peptides with measured binding affinity to the DRB10401 MHC molecule. The format of the file is

PRYVKQNTLKLATGM 0.340972 1249.549551
YLQMNSLRAEDTAVY 0.919907 2.378782
ESSFVMMSAPPAEYK 0.375605 859.040473
FVMMSAPPAEYKLQQ 0.412954 573.469869
SFVMMSAPPAEYKLQ 0.415043 560.653628
GKWYLKAMTADQEVPE 0.769569 12.100058
where the first column is the peptide, the second column the log-transformed binding affinity (1-log(IC50)/log(50000)), and the last column in the IC50 binding affinity.
cl2pred -mat DRB10401.mat DRB10401.eval | more

You can now evaluate the predictive performance (in terms of the Pearsons correlation) of the Gibbs sampler on the evaluation set.

cl2pred -mat DRB10401.mat DRB10401.eval | grep -v "#" | gawk '{print $4,$6}' | xycorr

Next, do a Gibbs sampling where you include sequence weighting and compare the performance of this matrix to the one above

gibbs_sampler -sw DRB10401.lig > DRB10401.mat_sw
cl2pred -mat DRB10401.mat_sw DRB10401.eval | grep -v "#" | gawk '{print $4,$6}' | xycorr

You can try different values of the weight of low-count (-wlc), and see how this impacts the predictive performance.

Next include the -p1a option together with sequence weighting. How does this improve the predictive performance?

Make a sequence logo of the biding motif identified using both sequence weighting and -p1a bias as described above. Does this logo to a higher degree match to the expected binding motif of the MHC class II molecule (hydrophobic P1 anchor and auxiliary anchors at (P4), P6 and P9)?

Finally if you have more time you can play with the -w option of the cl2pred program to place different weights on the different positions in the binding core. Can you find a set of weights that improve the predictive performance?

Now you are done. Remember to upload both the gibbs_sampler.c and the cl2pred.c programs via CampusNet.