|
Construction of PSSMs
In todays exercise, you shall implement the algorithm for construction of position specific scoring matrices (PSSM)
introduced in this weeks lecture. The algorithm shall include pseudo counts to correct for small number of
observations, and sequences weighting using heuristics.
Implementing the algorithms
First you must make a new directory for todays exercise and copy some data files. Change to the directory where you
keep the files for the course. This could be tyou home directory, or a directory called Algo?
cd (or where your course directory is)
mkdir PSSM
cd PSSM
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/data/PSSM/* .
To evaluate the predictive performance of your PSSM's you shall use the Pearson's correlation coefficient. This
number gives you the correlation between two numerical columns of the input file.
A value of 1 indicates the perfect prediction, and a value of 0 a random prediction. I have made a small GAWK script
calculating the Pearson's correlation.
Copy the xycorr script to your bin directory
cd
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/code/bin/xycorr bin
Next, you must copy the program templates to you src directory. (Remember the "." in the end of the command)
cd
cd src
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/code/PSSM/pep2mat.c .
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/code/PSSM/pep2score.c .
The first program constructs a PSSM from a list of peptides of equal length and writes the PSSM to standard output.
The program takes the option -b specifying the weight on prior, and the option -sw to switch on sequence weighting
by use of sequence weighting heuristics. The second program scores a peptide list against a PSSM and prints the peptide
sequence, (eventual binding affinity) and PSSM score to standard output. The format of the PSSM matrix
made by the pep2mat program is the standard Psi-Blast profile matrix output.
pep2mat
Open the file pep2mat.c in your favorite editor. Go to the main procedure. Make sure you understand the
structure of the program.
You shall fill in the missing code (XXXX). Again make sure you understand the structure of the routine,
and then fill out the missing code.
Next, compile the program
make pep2mat
When the compilation is successful copy the program to the binary directory
cp pep2mat ../bin/
rehash
The rehash command updates a table used by the operating system to keep track of all executable programs to
include your new program pep2mat
Now go to the PSSM directory. First check what is the content of the file A0201.single_lig by typing
cat A0201.single_lig
Next, run the pep2mat program on the file A0201.single_lig
pep2mat A0201.single_lig
Look at the output. How does the out compare to the BLOSUM62
scoring matrix? That is, the first line in the PSSSM matrix should correspond to the BLOSUM scoring
matrix line for the amino acids in the first position in the peptide from the A0201.single_lig file.
Next, make a PSSM based on the peptide data in the file A0201.small_lig without sequence weighting and pseudo
counts
pep2mat -b 0 A0201.small_lig > A0201.mat_b0_small_lig
Look at the output. Can you understand why the matrix has so many zero values?
Make a matrix using the pseudo counts with a weight on prior of 50
pep2mat A0201.small_lig > A0201.mat_b50_small_lig
Look at the output. What happened to all the zero values. Can you understand what has happened?
pep2score
Now go back to the src directory and open the file pep2score.c. Spend some time to make sure you understand the
structure of the program. Fill in the missing code (XXXXXX), and compile the program
make pep2score
When the compilation is successful copy the program to the binary directory
cp pep2score ../bin/
rehash
Now go to the PSSM directory. You can now test the performance of your code by making a PSSM and evaluating
how this matrix can be used to predict the binding affinity for other peptide data. The file A0201.eval
contain 1266 such peptide data. The format of the file is
ILYQVPFSV 0.8532
VVMGTLVAL 0.5891
ILDEAYVMA 0.4941
KILSVFFLA 0.8512
HLYQGCQVV 0.5386
where the first column is the peptide sequence and the second column is the binding score normalized so the a value
of 1 corresponds to strong binding and a value of 0 corresponds to non-binding.
Now evaluate you matrix from before on the data set by typing
pep2score -mat A0201.mat_b0_small_lig A0201.eval | grep -v "#" | gawk '{print $2,$3}' | xycorr
The xycorr program computes the Pearson's correlation between two columns (here the measured and predicted
binding values). A value of 1 indicates the perfect prediction, and a value of 0 a random prediction. What is
the predictive performance of the A0201.mat_b0_small_lig matrix?
Repeat the analysis for the matrix made using pseudo counts A0201.mat_b50_small_lig.
What is the predictive performance for this matrix?
Make a new matrix now using sequence weighting
pep2mat -sw A0201.small_lig > A0201.mat_small_lig_sw
and test the performance of that matrix using the command from before (with the new PSSM matrix). Does the
performance improve?
Finally, you shall make a PSSM from a larger peptide data set
pep2mat -sw A0201.large_lig > A0201.mat_large_lig_sw
and test the performance of that matrix using the command from before (with the new PSSM matrix). What is the
predictive performance of that matrix?
If you have more time you can try to modify the pep2score program to allow for position specific weighting to
increase the weights of position 2 and 9 in the scoring matrix.
You can see the output from the calcutations in
A0201.single_lig_mat
A0201.mat_b0_small.lig
A0201.mat_b50_small.lig
A0201.mat_small.lig_sw
A0201.mat_large.lig_sw
A0201.eval scored against A0201.mat_large.lig_sw
Now you are done. Remember to upload both the pep2mat.c and the pep2score.c programs via CampusNet.
|