Prediction of T cell epitopes
During this exercise you will use bioinformatics tools to predict peptide-MHC
binding. The exercise has three parts:
- Identification of MHC binding motif
- Visualize the motif using sequence logos
- Training of MHC class I and class II binding prediction methods
Background: Peptide MHC binding
The most selective step in identifying potential peptide immunogens is
the binding of the peptide to the MHC complex. Only one in about
200 peptides will bind to a given MHC complex. A very large number of
different MHC alleles exist each with a highly selective peptide
The binding motif for a given MHC class I complex is in
most cases 9 amino acids long. The motif is characterized
by a strong amino acid preference at specific positions
in the motif. These position are called anchor positions. For
many MHC complexes the anchor position are placed at
P2 and P9 in the motif. However this is not always the case.
Large number of peptide data exist describing this MHC
specificity variation. One important source of data is the
SYFPEITHI MHC database (http://www.syfpeithi.de).
This database contains information on MHC ligands and binding motifs.
Purpose of exercise, description of data
In this exercise you are going to
- Search the SYFPEITHI database to characterize the binding motif for
different MHC alleles.
- Visualize the binding motif using sequence logos.
- Use the Easypred and NNAlign web-interfaces to train bioinformatics predictor for
Identification of MHC binding motifs
Go to the
SYFPEITHI MHC database (click on the right mouse button, and
select open in a new window). Use the Find motif, Ligand or epitope
option. Have a look at the peptide characteristics for different MHC alleles
like HLA-A*0201, HLA-A*01, HLA-A*1101, and HLA-B27.
This you do by selecting for instance
HLA-A*0201 and press do Query.
Repeat the analysis for a few (at least two) other MHC alleles.
- Q1: What are the characteristics of the peptides that bind HLA-A*0201, that is
which positions are anchor position and what amino acids
are found at the anchor positions?
- A1: P2 and P9 are anchor positions. At P2 L and M are referred, and at P9 V and L are preferred. For HLA-A1, the anchors are P3 and P9, and the preferred amino acids are D/E at P3 and Y at P9.
Does the location of anchors differ, and does the amino acid preferences differ
between different MHC alleles?
- A2: Yes the motifs differ both in the anchor positions and the preferred amino acids
A powerful way to visualize the peptide characteristics of the binding
motif of an MHC complex, is to plot a sequence logo. The files
in the exercise directory contain peptides
known to bind to a particular MHC complex (HLA-A*0201 for example).
The files are in the format used by a program that generates sequence logos.
Go to the web-site Seq2Logo.
You shall use Seq2Logo to visualize sequence
logos. This you by pasting in the three files one at the time into the Submission window and type
Submit query. Do this for each of the three peptide files and examine the sequence logos.
Does the anchor
positions and amino acid preferences correspond to what you found using
the SYFPEITHI database?
- A3: For HLA-A0201, the anchor positions are P2 and P9 as reported
in the SYFPEITHI database. The preferred amino acids are L/I/V at P2 and V/L/I at P9. This is similar but not identical to what is reported in the SYFPEITHI
Prediction of MHC-peptide binding
In this part of the exercise you shall use the EasyPred web-interface
to train and evaluate a series of different MHC-peptide binding
predictors. You shall use two data sets (eval.set, train.set)
that contain peptides and binding affinity to the MHC alleles HLA-A*0201.
The binding affinity is a number between 0 and 1, where a high value
indicates strong binding (a value of 0.5 corresponds to a
binding affinity of approximately 200 nM). The eval.set
contains 66, and the train.set 1200 such peptides.
Click on the filenames to view the content of the files.
Before you start using the EasyPred you must save the
train.set and eval.set files locally on the Desktop on your lab-top. You
do that by clicking on the files names (eval.set,
train.set) and saving the files as text files
on the Desktop.
You shall now use EasyPred web-server to
train a series of methods to predict peptide-MHC binding. Go to the
Weight Matrix construction
First you shall train a matrix predictor. On the EasyPred web-server
press Clear fields. In the upload training examples window
browse and select the train.set file from the Desktop, in the upload evaluation window
browse and select the eval.set file from the Desktop. In the Matrix method parameters
select Clustering at 62% identity, and set weight on prior (weight on
pseudo counts) to 200. Press Submit query.
This will calculate a weight-matrix using sequence
weighting by clustering, and a weight on prior (pseudo counts) of 200.
- Q4: What is the predictive performance of the matrix method (Pearson coefficient and Aroc value)?
- A4: The predictive performance is: Pearson coefficient for N= 66 data: 0.63507, Aroc value: 0.83399
- Q5: How many of the 1200 peptides in the train set are included
in the matrix construction?
- A5: Number of positive training examples: 136
Go back to the EasyPred server window (use the Back bottom). Set
clustering method to No clustering and the
weight on prior to zero and redo calculation.
- Q6: What is the predictive performance of the matrix method now?
- A6: The performance has dropped (Pearson coefficient for N= 66 data: 0.51898, and Aroc value: 0.78954)
- Q7: Does clustering and pseudo count (weight on prior) improve the prediction accuracy?
- A7: Yes
Training set partition
Now the fun starts. You shall now train some neural networks
to predict MHC-peptide binding. In the Type of prediction method
window select neural networks. Leave all other parameters
as they are. Press Submit query.
This will train a neural network with 2 hidden neurons
running up-to 300 training epochs.
The top 80% (960 peptides) of the train.set is used to train
the neural network and the bottom 20% (240 peptides) are used to
stop the training to avoid over-fitting.
- Q8: What is the maximal test performance (maximal test set Pearson correlation),
and in what epoch (training cycle) does it occur?
- A8: The test set performance is Maximal test set pearson correlation coefficent sum = 0.801300 in epoch 103, and minimal per example squared error = 0.016800 in epoch 101
- Q9: What is evaluation performance (Pearson correlation and Aroc values)?
- A9: Pearson coefficient for N= 66 data: 0.58693, Aroc value: 0.85490
Go back to the EasyPred interface and change the parameters so
that you use the bottom 80% of the train.set to train the neural
network and the top 20% to stop the training. Redo the network training
with the new parameters.
- Q10: What is the maximal test performance, and in
what epoch does it occur?
- A10: Maximal test set pearson correlation coefficent sum = 0.837800 in epoch 90, minimal per example squared error = 0.015300 in epoch 111
- Q11: What is evaluation performance?
- A11: Pearson coefficient for N= 66 data: 0.55571, and Aroc value: 0.78170
- Q12: How does the performance differ from what you found in the previous
- A12: The performance differs between the two networks. The test performance is better in the first case, but the evaluation performance is best for the last network.
- Q13: Why do you think the performance differ so much?
- A13: This is not a trivial question. When the network is stopped on the top 20% of the data, the network will have an inherent bias
towards peptides similar to the once in the top 20%. If these peptides are either
very similar to the
peptides in the evaluation set this network will perform better. Also if the diver
sity of the peptides in the
bottom 20% of the data is very low, then this set of peptides will be easy to lear
n, but the network will not learn
to be able to generalize to other peptides. This was what was observed in question
Q9 and Q11.
Go back to the EasyPred interface and change the parameters back
so that you use the top 80% of the train.set of training. Next
do neural network training with a different set of hidden neurons
(1 and 5 for instance).
- Q14: How does the test performance differ when
you vary the number of hidden neurons?
NH1: Maximal test set pearson correlation coefficent sum = 0.795600 in epoch 94,
minimal per example squared error = 0.017300 in epoch 101
NH5: Maximal test set pearson correlation coefficent sum = 0.805600 in epoch 100,
minimal per example squared error = 0.016500 in epoch 101
- Q15: How does the evaluation performance differ?
NH1: Pearson coefficient for N= 66 data: 0.58614, Aroc value: 0.85621
NH5: Pearson coefficient for N= 66 data: 0.59031, Aroc value: 0.84706
- Q16: Can you decide on an optimal number of hidden neuron?
- A16: The number of hidden neuron does not seem to influence the predictive performance much
- Q17: Why do you think the number of hidden neurons has
so little importance?
- A17: To learn higher order correlations the network must estimate paired amino acids correlations.
These are defined in terms of 400 amino acid pair frequencies estimated from the set of
peptide binders. The data available contains 1200 peptides, but only 136 are binders (affinity value > 0.5). It is hence unlikely
that the network can accurately pick up these 400 amino acid pair frequencies for
this limited set of binding peptides.
As you found in the first part of neural network training,
the network performance can depend strongly on the partition
of the training data into the training and stop set. One way
of improving the network performance is to make use
of this network variation in a cross-validated training.
The general idea behind the cross-validated training is that
since you cannot in advance tell which training set partition
that will be optimal you make a series of N network trainings
each with a different partition. The final network prediction is
then taken as the simple average over the N predictions. In
a 5-fold cross-validated training, the training set is split up
into 5 sets. In one training the sets 1,2,3 and 4 are used to train
the network and the 5th set to stop the training, in the another
training the sets 1,3,4,5 are used for training and the 2nd set
to stop the training, and so forth.
Go back to the EasyPred interface and set the hidden neuron parameter back to 2.
Next set the number of partitions
for cross-validated training to 5 and redo the neural network training (this
might take some minutes).
Write down the test performance for each of the five networks
Now you must save the parameters for the cross-validated
network you have just trained. Use the right mouse-bottom on the
Parameters for prediction method to save the neural network
parameters to a file (say para.dat). You can now run predictions using
this neural network without redoing network training by uploading the
parameter file in the Load saved prediction method window.
- Q18: How does the train/test performance differ between the different
- Q19: What is the evaluation performance and how does it compare to
the performance you found previously?
- A19: Pearson coefficient for N= 66 data: 0.61446. Aroc value: 0.83137. The
Pearsons correlation is the best neural network performance obtained so fare.
Finding epitopes in real proteins
You shall use the neural network
to find potential epitopes in the Sars virus. In the EasyPred
web-interface clear field to reset all
parameter fields. Go to the Uniprot
homepage Uniprot. Search
for a Sars entry by typing "Sars virus" in the search window. Click
you way to the FASTA format for one of the proteins.
is a link if you are lazy. Paste in FASTA file into the Paste in
evaluation examples. Upload the network parameter file (para.dat) from before
into the Load saved prediction method window.
Leave the window Networks to chose in ensemble
blank, make sure that the option for sorting the output
is set to Sort output on predicted values, and press Submit query.
- Q20: How many high binding epitopes do you find?
Is this number reasonable (how large a fraction of random 9meric peptides are expected to bind to a given HLA complex?)
- A20: Depending on the protein sequence selected you will find in between 2-5 peptides with a prediction score greater than 0.5. This corresponds to a fraction of 0.005 - 0.02, and is thus very reasonable since we expect around
1-2% of random peptides to bind to a given MHC molecule.
Training a prediction method for HLA class II binding.
If you have more time, you shall now make a prediction method for MHC class II binding.
HLA class II binding peptides have a broad length distribution complicating the development of prediction
methods. Identifying the correct alignment of a set of peptides known to bind the MHC class II complex is a crucial part of
identifying the core of an MHC class II binding motif.
Here you shall use the NNAlign web-server to
develop and evaluate a prediction method for the HLA class II allele DRB1*0401.
The data used to train and evaluate the method are down-loaded from the
Immune Epitope Database (IEDB).
The DRB10401.train contains 800 peptides with measured binding
affinity to the DRB1*0401 MHC class II allele. The binding values have also here been transformed to fall in the
range 0-1, where a high value indicates strong binding (a value of 0.5 corresponds to a
binding affinity of approximately 200 nM).
The DRB10401.test likewise contains 800 peptides
with associated binding affinity to be used for evaluation.
As explained in the todays lecture, the prediction of MHC class II is complicated by the fact that
one needs to correctly identify the binding register in order to learn the binding preferences. To
illustrate this, we have extracted the set of binding peptides from the Class II_train.set. Use this data
set DRB10401.train_bind to display the binding motif using
the Seq2Logo server.
- Q21: Does the calculated logo display any positions with high information content, and any
amino acid specificities?
- A21: The motif shows no positions with high information content.
Next you shall use the NNAlign server to derive the binding motif for the class II molecule.
Go to the NNAlign web-server.
Upload the DRB10401.train as training data, and the
DRB10401.test as evaluation data. Select expert
options. Set the motif length to 9, Folds for cross-validation to 0, and leave the other
settings as default. Press submit. It takes a little while for the calculation to complete so be patient.
- Q22: What is the predictive performance of the method (Pearsons correlation, and Spearmans correlation)?
- A22: RMSE = 0.209096, Pearson correlation = 0.5508, and Spearman correlation = 0.5584
- Q23: And how does the sequence logo differ from what you found in question 21?
- A23: The logo is very different. Here we find that P1 and P6 are anchors with high information content, and the preferred amino acids at P1 are Y/L/I/M and at P6 S/T/N.
The reason for this is that the Seq2Logo in contrast to NNAlign does not re-align the peptide
sequences to a common binding core before calculating the logo.
Now you have within less than 3 hours developed advanced and competitive methods for predicting
binding of peptides to HLA class I and II. Also you have identified potential CTL epitope vaccine
candidates for the SARS virus. All you need now is to find some venture capital and make your own
Biotech startup company. That is not so bad an outcome of 3 hours work!
Now you are done!!