Construction of sequence logos and weight matrices
Morten Nielsen (firstname.lastname@example.org)
You shall use Bioinformatics tools to predict peptide-MHC
binding and select potential epitope vaccine candidates. The exercise has four parts:
- Identification of MHC binding motif
- Visualize the motif using sequence logos
- Training of MHC binding prediction methods
- Use the MHC binding prediction method to select potential epitope vaccine candidates
Background: Peptide MHC binding
The most selective step in identifying potential peptide immunogens is
the binding of the peptide to the MHC complex. Only one in about
200 peptides will bind to a given MHC complex. A very large number of
different MHC alleles exist each with a highly selective peptide
The binding motif for a given MHC class I complex is in
most cases 9 amino acids long. The motif is characterized
by a strong amino acid preference at specific positions
in the motif. These position are called anchor positions. For
many MHC complexes the anchor position are placed at
P2 and P9 in the motif. However this is not always the case.
Large number of peptide data exist describing this MHC
specificity variation. One important source of data is the
SYFPEITHI MHC database (http://www.syfpeithi.de).
This database contains information on MHC ligands and binding motifs.
Once an accurate and reliable prediction method of peptide MHC binding
has been developed it can be applied in the context of effective vaccine
design to search for potential epitope candidates. On a whole genome
scale one can search for peptides with high MHC binding affinity.
Purpose of exercise
In this exercise you are going to
- Visualize the binding motif using sequence logos.
- Use the Easypred web-interface to train Bioinformatics predictor for
- Apply a MHC binding prediction method to select peptides in the Sars
genome useful for vaccine design.
We shall use two performance measures to evaluate the predictive performance
of a prediction method. The two measures are AUC (the area under the ROC curve),
and the Pearson correlation. A short description of different performance methods
is given here Performance measures.
Identification of MHC binding motifs
Go to the
SYFPEITHI MHC database (click on the right mouse button, and
select open in a new window). Use the Find motif, Ligand or epitope
option. Have a look at the peptide characteristics for HLA-A*0201 MHC allele.
This you do by selecting HLA-A*0201 and press do Query.
What are the characteristics of the peptides
that bind HLA-A*0201?
Q1: Which positions are anchor position and what amino acids
are found at the anchor positions?
Answer: Anchor positions are P2 and P9. Prefered amino acids are P2: LM, P9: VL.
A powerful way to visualize the peptide characteristics of the binding
motif of an MHC complex, is to plot a sequence logo.
In a logo the information content at each
position in the sequence motif corresponds to the height of a
column of letters. The Information content Ip on each position
p is defined as log2(20) + Σa(pap
The information content is a measure of the degree of conservation
and lies within the range of 0 (no conservation-all amino acids
are equally probable) and
log2(20) = 4.3 (full conservation-only a single amino acid is
observed at that position).
The height of each letter within the columns is proportional to the
frequency pap of the corresponding amino acid a at that position p.
The amino acids are colored according to their properties:
- Acidic [DE] red
- Basic [HKR]: blue
- Hydrophobic: [ACFILMPVW] black
- Neutral [GNQSTY]: green
In the exercise, you shall use logo plots to visualize the MHC
binding motif contained in different weight matrix predictors. Below you
see an example of a logo visualization of the binding motif of the MHC
Construction of weightmatrices
First you shall use the EasyPred web-server
(click on the right mouse button, and select open in a new window) for confirm the
results from the lecture calculating the weight matrix from the multiple alignment
Go to the EasyPred web-server
and past the five peptide sequences from above into the Paste in training examples
window. Select No clustering, and set Weight on prior to 10000. Next, press
Have a look at the sequence logo.
Q2: How many different amino acids are present at the P1 position in the logo?
Answer: More than 10!
Q3: How many different amino acids are present at the P1 position in the binding data?
Click on the link Parameters for prediction method. This will open a window with the
weightmatrix estimated from the input peptide sequences.
Q3.1:Try to reproduce the matrix values for
P1(I), and P1(K). Remember, you shall use the blosum substitution matrix to estimate the
pseudo frequency for the two amino acids as we did in the lecture. Here, you find a link
to a file containing the Blosum substitution matrix.
Here, is a link to a file with the background frequences for the different amino acids
Background frequency tables.
Also, remember that the values in the weightmatrix are calculated as
where p is the frequency of a given amino acid (calculated using pseudo counts), and q is the background frequence.
You might not get exactly identical numbers to those of the EasyPred program. So, +/- 20%
Answer:. The pseudo frequences for I and K are
g(I) = 0.4*0.12 + 0.6*0.16 = 0.144
g(K) = 0.4*0.03 + 0.06*0.03 = 0.03
Since weight on prior (or weight on pseudo count) is much greater than the number of sequences, the final
frequences p(I) = g(I), and p(K) = g(K). Using the formula for the values in the weight matrix with q(I)=0.068,
and q(K)=0.058 (remember q is the back ground frequeces), we find the weight matrix value to be
Score(I) = 2.17, and Score(K) = -1.90
These values compare fine with the values calculated y the Easypred program (2.34, and -2.16).
Prediction of MHC-peptide binding
You shall now use the EasyPred web-interface
to train and evaluate a series of different MHC-peptide binding
predictors. The EasyPred web-server is an interface that allows
for easy training and performance evaluation of weight matrix and
artificial neural prediction methods. In this exercise, you shall only
use the part of the web-server that trains weight-matrices.
You shall use three files (eval.set, small.train.set
and large.train.set) that contain peptides and binding affinity to the MHC alleles HLA-A*0201.
In the two train.sets you find peptide with binding affinity of either 0.1 (this
value has absolutely no meaning, it is set to 0.1 for practical reasons) or 1, where
0.1 indicates a non-binder and 1 that the peptide binds to the MHC complex.
The small.train.set contains 110 peptides,
whereas the large.train.set contains 232 peptides.
For the evaluation set (eval.set) the affinities are given as real values. A high value
indicates strong binding (a value of 0.5 corresponds to a
binding affinity of approximately 200 nM). The values are
calculated from the actual nM binding affinities using the
relation x = 1 - log(aff nM)/log(50000). A peptide that binds with
an affinity stronger than 500 nM is said to be an intermediate binder, and
a peptide that binds stronger than 50 nM is a high binder. Note that low
affinity means strong binding. As a rule of thumb a peptide must be
at least an intermediate binder in order to induce an immune-response.
Using the above transformation an intermediate binder (500 nM) will have
a value og 0.426, and a strong binding a value of 0.638, respectively. Again
note that a high transformed value corresponds to a low affinity value, and hence
to a strong binder.
The eval.set contains 1266,
Click on the filenames to view the content of the files.
During the exercise you shall use the files small.train.set and eval.set
often. It might therefore be smart if you save these files
on the Desktop on your lab-top. You
do that by clicking on the files names (eval.set,
small.train.set) and saving the files as text files
on the Desktop. You can also just copy and paste the files every time
you need them.
The accuracy of a prediction method can be measured in many ways. Here we use
two measures. The area under the receiver operating curve (ROC) curve (Aroc),
and the Pearson correlation. Both measures have a value of 1 for the perfect method.
For a method that is random the Aroc measure has a value of 0.5 whereas the
Pearson correlation is 0. When you evaluate the performance of a prediction method
you hence rank a method with high Aroc and Pearson correlation highest (if you what
to know more, use Goggle to find information on how the two measures are calculated).
Weight Matrix generation
You shall now use EasyPred web-server to
train a series of methods to predict peptide-MHC binding. Go to the
(again click on the right mouse button, and select open in a new window).
When you go through the exercise please read carefully the instructions (all of
them). Many of the examples only make sense if you do exactly as described
in the text.
In the upload training examples window
browse and select the small.train.set file from the Desktop, in the upload evaluation window
browse and select the eval.set file from the Desktop.
Important! In the window "Cutoff for counting an example as a positive example"
type 0. Now press Submit query.
Q4 What is the predictive performance of the matrix method?
Answer:: Pearson coefficient for N= 1266 data: 0.07628
Aroc value: 0.56979
View the logo plot of the calculated matrix. Can you understand
why the matrix performs so poorly?
Answer:: The logo shows very low information at all positions. We have trained a
method for peptide:MHC binding on a mixture of peptide binders and peptide non-binders. This
is clearly wrong. We can only include binders when estimating a binding motif.
Q5 How many of the 110 peptides in the small.train.set are included
in the matrix construction?
Answer:: All 110 peptides were included
Go back to the EasyPred server window (use the Back button). Set
the "Cutoff for counting an example as a positive example" to 0.5. Set
clustering method to No clustering and the
weight on prior to 0.0 and redo calculation.
Q6 What is the predictive performance of the matrix method now?
Answer:: Pearson coefficient for N= 1266 data: 0.29529
Aroc value: 0.71191
Q7 How many of the 110 peptides in the small.train.set are included in the matrix construction?
View the logo plot of the calculated matrix. Does the logo
resemble the logo for the HLA-A*0201 binding motif shown in the
beginning of the exercise?
Return to the EasyPred server window (use the Back button). Set the
clustering method to Clustering at 62% identity. Keep
weight on prior on 0.0. Redo calculation.
Q8 What is the predictive performance of the matrix method now?
Answer:: Pearson coefficient for N= 1266 data: 0.45328
Aroc value: 0.81865
Again view the logo plot of the calculated matrix. Has it changed
compared to the previous calculation?
Return to the EasyPred server window (use the Back button). Keep
the clustering method to Clustering at 62% identity.
Set the weight on prior to 200.0, and redo calculation.
Q9 What is the predictive performance of the matrix method now?
Answer:: Pearson coefficient for N= 1266 data: 0.49684
Aroc value: 0.84838
View the logo plot of the calculated matrix. What is the
big difference between this logo and the two previous ones? (how many different amino acids are present at each
position in the binding motif?)
Answer:: In the two previous logo plots, only four amino acids where present at for instance P2. In this
last logo all amino acids are present.
What are reasons for these differences?
Answer:: Using pseudo counts will give non-zero frequency values also for amino acids not observed.
Does the logo begin
to resemble the logo for the HLA-A*0201 binding motif shown in the
beginning of the exercise?
Note, that you using as few as 10 binding peptides in the small.train.set
was able to derive a weight-matrix with a reasonable predictive performance, and
a corresponding logo plot that captured many the features described in
the both the SYFPHETHI database, and in the logo plot calculated from the large
training data set.
Now you shall do the last matrix training. Return to the EasyPred server window.
Now you shall train a weight matrix using a larger set of data.
Press Clear fields. Upload the large.train.set
in the "Paste in training examples" window., and the eval.set
in the "Paste in evaluation examples". Leave all other option unchanged
(Cluster at 62% identity, and weight on prior at 200). Select
Sort output on predicted values and submit query.
Q10 What is the predictive performance of the matrix method now?
Answer:: Pearson coefficient for N= 1266 data: 0.71798
Aroc value: 0.96651
View the logo plot of the calculated matrix. Does the logo compare
to the logo for the HLA-A*0201 binding motif shown in the
beginning of the exercise?
Q11 Look at the prediction list. How many false positive hits do you
find among the top 20 highest scoring peptides (Assignment score < 0.426)?
Answer:: One false positive
Before you continue click on the "Parameters for prediction method" link and save
the content to a file on you Desktop (para.dat for instance).
Finding epitopes in real proteins
As the last part you shall use the weight matrix
to find potential epitopes in the Sars virus. In the EasyPred
web-interface clear field to reset all
parameter fields. Go to the Swiss-prot
home-page Swiss-prot. Search
for a Sars entry by typing Sars in the search window. Click
you way to the FASTA format for one of the proteins.
is a link if you are lazy. Paste in FASTA file into the Paste in
evaluation examples. Upload the weight matrix parameter file (para.dat) from before
into the Load saved prediction method window.
Make sure that the option for sorting the output
is set to Sort output on predicted values, and press Submit query.
Now the top scoring peptides should be the peptides that resemble the binding
motif (i.e. your weight matrix) the most. Selecting the top 10 peptides
you should have a high change of including a set of potential HLA-A*0201 epitopes. These
you could sell to a big pharma company and become rich and famous!!
Now you are done!!