|
Gibbs sampling
In todays exercise, you shall implement the Gibbs sampling algorithm for identification of linear binding motifs
in peptide ligand data.
Implementing the algorithms
First you must make a new directory for todays exercise and copy some data files. Change to the directory where you
keep the files for the course. This could be your home directory, or a directory called Algo?
cd (or where your course directory is)
mkdir Gibbs
cd Gibbs
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/data/Gibbs/DRB10401.lig .
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/data/Gibbs/DRB10401.eval .
Next, you must copy the program templates to you src directory. (Remember the "." in the end of the command)
cd
cd src
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/code/Gibbs/gibbs_sampler.c .
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/code/Gibbs/cl2pred.c .
The first program constructs a PSSM from a list of peptides using the Gibbs sampler algorithm to identify the optimal
core alignment and writes the PSSM to standard output.
The second program scores a peptide list against a PSSM and prints the peptide
sequence, (eventual binding affinity) and PSSM score to standard output. The prediction score
of the peptide to a PSSM is the score of the highest scoring binding core contained within the peptide.
The format of the PSSM matrix
made by the gibbs_sampler program is the standard Psi-Blast profile matrix output.
Gibbs_sampler
Open the file gibbs_sampler.c in your favorite editor. Go to the main procedure. Make sure you understand the
structure of the program.
You shall fill in the missing code (XXXX). Again make sure you understand the structure of the routine,
and then fill out the missing code.
Next, compile the program
make gibbs_sampler.c
When the compilation is successful copy the program to the binary directory
cp gibbs_sampler ../bin/
rehash
The rehash command updates a table used by the operating system to keep track of all executable programs to
include your new program pep2mat
Now go to the Gibbs directory. First check what is the content of the file DRB10401.lig by typing
cat DRB10401.lig
Next, run the gibbs_sampler program on the file DRB10401.lig
gibbs_sampler DRB10401.lig > DRB10401.mat
Look at the output. How does it compare to the DRB10401.mat
scoring matrix?
You can make a sequence logo visualization of the predicted binding motif from the identified binding cores
using the EasyPred webserver. Run the calculation from before but this time save the core alignment to a file using
the -pc option
gibbs_sampler -pc DRB10401.lig > DRB10401.mat
Now you will have a file called core containing the predicted binding cores of each MHC class II ligand.
The first column is the file gives the core and the second the sequence weight (in this case 1.0 for
all peptides since you did not include sequence weighting in the calculation).
Go to the EasyPred webserver EasyPred.
Upload the core file as training examples, set the cutoff for counting an example as positive to 0 (make
sure you understand why this is important and why it is ok). Leave all other parameters unchanged and press submit.
Look at the predicted binding motif. Does it have the expected characteristics (a dominant hydrophobic P1 anchor
for example)?
Alternatively you can visualize the predicted weight-matrix directly using the Seq2Logo server. Here you simply copy the PSSM predicted by the Gibbs sampler to the Seq2Logo server. Note, you have pass the complete PSSM including the first line defining the amino acid order to the Seq2Logo server.
cl2pred
Now go back to the src directory and open the file cl2pred.c. Spend some time to make sure you understand the
structure of the program. Fill in the missing code (XXXXXX), and compile the program
make cl2pred
When the compilation is successful copy the program to the binary directory
cp cl2pred ../bin/
rehash
Now go to the Gibbs directory, and evaluate the predictive performance of the first PSSM. To do this you shall
use the evaluation file DRB10401.eval. This file contain peptides with measured binding affinity to the
DRB10401 MHC molecule. The format of the file is
PRYVKQNTLKLATGM 0.340972 1249.549551
TLYLQMNSLRAEDTA 1 1.0
YLQMNSLRAEDTAVY 0.919907 2.378782
NTLYLQMNSLRAEDT 1 1.0
ESSFVMMSAPPAEYK 0.375605 859.040473
FVMMSAPPAEYKLQQ 0.412954 573.469869
SFVMMSAPPAEYKLQ 0.415043 560.653628
GKWYLKAMTADQEVPE 0.769569 12.100058
where the first column is the peptide, the second column the log-transformed binding affinity (1-log(IC50)/log(50000)),
and the last column in the IC50 binding affinity.
cl2pred -mat DRB10401.mat DRB10401.eval | more
You can now evaluate the predictive performance (in terms of the Pearsons correlation)
of the Gibbs sampler on the evaluation set.
cl2pred -mat DRB10401.mat DRB10401.eval | grep -v "#" | gawk '{print $4,$6}' | xycorr
Next, do a Gibbs sampling where you include sequence weighting and compare the performance of this matrix
to the one above
gibbs_sampler -sw DRB10401.lig > DRB10401.mat_sw
cl2pred -mat DRB10401.mat_sw DRB10401.eval | grep -v "#" | gawk '{print $4,$6}' | xycorr
You can try different values of the weight of low-count (-wlc), and see how this impacts the predictive
performance.
Next include the -p1a option together with sequence weighting. How does this improve the predictive performance?
Make a sequence logo of the biding motif identified using both sequence weighting and -p1a bias as
described above. Does this logo to a higher degree match to the expected binding motif of the MHC class II molecule
(hydrophobic P1 anchor and auxiliary anchors at (P4), P6 and P9)?
Finally if you have more time you can play with the -w option of the cl2pred program to place different weights
on the different positions in the binding core. Can you find a set of weights that improve the predictive
performance?
Now you are done. Remember to upload both the gibbs_sampler.c and the cl2pred.c programs via CampusNet.
|