Events News Research CBS CBS Publications Bioinformatics
Staff Contact About Internal CBS CBS Other

Construction of PSSMs

In todays exercise, you shall implement the algorithm for construction of position specific scoring matrices (PSSM) introduced in this weeks lecture. The algorithm shall include pseudo counts to correct for small number of observations, and sequences weighting using heuristics.


Implementing the algorithms

First you must make a new directory for todays exercise and copy some data files. Change to the directory where you keep the files for the course. This could be tyou home directory, or a directory called Algo?

cd (or where your course directory is)
mkdir PSSM
cd PSSM
cp /home/projects/mniel/ALGO/data/PSSM/* .

To evaluate the predictive performance of your PSSM's you shall use the Pearson's correlation coefficient. This number gives you the correlation between two numerical columns of the input file. A value of 1 indicates the perfect prediction, and a value of 0 a random prediction. I have made a small GAWK script calculating the Pearson's correlation.

Copy the xycorr script to your bin directory

cd
cp /home/projects/mniel/ALGO/code/bin/xycorr bin

Next, you must copy the program templates to you src directory. (Remember the "." in the end of the command)

cd (or where your course directory is)
cd src
cp /home/projects/mniel/ALGO/code/PSSM/pep2mat.c .
cp /home/projects/mniel/ALGO/code/PSSM/pep2score.c .

The first program constructs a PSSM from a list of peptides of equal length and writes the PSSM to standard output. The program takes the option -b specifying the weight on prior, and the option -sw to switch on sequence weighting by use of sequence weighting heuristics. The second program scores a peptide list against a PSSM and prints the peptide sequence, (eventual binding affinity) and PSSM score to standard output. The format of the PSSM matrix made by the pep2mat program is the standard Psi-Blast profile matrix output.


pep2mat

Open the file pep2mat.c in your favorite editor. Go to the main procedure. Make sure you understand the structure of the program.

You shall fill in the missing code (XXXX). Again make sure you understand the structure of the routine, and then fill out the missing code.

Next, compile the program

make pep2mat

When the compilation is successful copy the program to the binary directory

cp pep2mat ../bin/
rehash

The rehash command updates a table used by the operating system to keep track of all executable programs to include your new program pep2mat

Now go to the PSSM directory. First check what is the content of the file A0201.single_lig by typing

cat A0201.single_lig

Next, run the pep2mat program on the file A0201.single_lig

pep2mat A0201.single_lig

Look at the output. How does the out compare to the BLOSUM62 scoring matrix? That is, the first line in the PSSSM matrix should correspond to the BLOSUM scoring matrix line for the amino acids in the first position in the peptide from the A0201.single_lig file.

Next, make a PSSM based on the peptide data in the file A0201.small_lig without sequence weighting and pseudo counts

pep2mat -b 0 A0201.small_lig > A0201.mat_b0_small_lig

Look at the output. Can you understand why the matrix has so many zero values?

Make a matrix using the pseudo counts with a weight on prior of 50

pep2mat A0201.small_lig > A0201.mat_b50_small_lig

Look at the output. What happened to all the zero values. Can you understand what has happened?


pep2score

Now go back to the src directory and open the file pep2score.c. Spend some time to make sure you understand the structure of the program. Fill in the missing code (XXXXXX), and compile the program

                        
make pep2score

When the compilation is successful copy the program to the binary directory

cp pep2score ../bin/
rehash

Now go to the PSSM directory. You can now test the performance of your code by making a PSSM and evaluating how this matrix can be used to predict the binding affinity for other peptide data. The file A0201.eval contain 1266 such peptide data. The format of the file is

ILYQVPFSV 0.8532
VVMGTLVAL 0.5891
ILDEAYVMA 0.4941
KILSVFFLA 0.8512
HLYQGCQVV 0.5386
where the first column is the peptide sequence and the second column is the binding score normalized so the a value of 1 corresponds to strong binding and a value of 0 corresponds to non-binding.

Now evaluate you matrix from before on the data set by typing

pep2score -mat A0201.mat_b0_small_lig A0201.eval | grep -v "#" | gawk '{print $2,$3}' | xycorr
The xycorr program computes the Pearson's correlation between two columns (here the measured and predicted binding values). A value of 1 indicates the perfect prediction, and a value of 0 a random prediction. What is the predictive performance of the A0201.mat_b0_small_lig matrix?

Repeat the analysis for the matrix made using pseudo counts A0201.mat_b50_small_lig. What is the predictive performance for this matrix?

Make a new matrix now using sequence weighting

pep2mat -sw A0201.small_lig > A0201.mat_small_lig_sw
and test the performance of that matrix using the command from before (with the new PSSM matrix). Does the performance improve?

Finally, you shall make a PSSM from a larger peptide data set

pep2mat -sw A0201.large_lig > A0201.mat_large_lig_sw
and test the performance of that matrix using the command from before (with the new PSSM matrix). What is the predictive performance of that matrix?

If you have more time you can try to modify the pep2score program to allow for position specific weighting to increase the weights of position 2 and 9 in the scoring matrix.


You can see the output from the calcutations in

A0201.single_lig_mat
A0201.mat_b0_small.lig
A0201.mat_b50_small.lig
A0201.mat_small.lig_sw
A0201.mat_large.lig_sw
A0201.eval scored against A0201.mat_large.lig_sw

Now you are done. Remember to upload both the pep2mat.c and the pep2score.c programs via CampusNet.