
Prediction of T cell epitopes

Claus Lundegaard (lunde@cbs.dtu.dk) and Morten Nielsen (mniel@cbs.dtu.dk)


Overview

During this exercise you will train bioinformatics tools to predict peptide-MHC binding.

  1. Training of MHC class I prediction methods


Background: Peptide MHC binding

The most selective step in identifying potential peptide immunogens is the binding of the peptide to the MHC complex. Only about one in 200 peptides will bind to a given MHC complex (with an affinity stronger than 200 nM). A very large number of different MHC alleles exist, each with a highly selective peptide binding specificity.

The binding motif for a given MHC class I complex is in most cases 9 amino acids long. The motif is characterized by a strong amino acid preference at specific positions in the motif. These positions are called anchor positions. For many MHC complexes the anchor positions are placed at P2 and P9 in the motif. However, this is not always the case.

A large amount of peptide data exists describing this variation in MHC specificity. One important source of data is the SYFPEITHI MHC database (http://www.syfpeithi.de). This database contains information on MHC ligands and binding motifs.

Purpose of exercise, description of data

In this exercise you are going to

  1. Use the EasyPred web-interface to train bioinformatics predictors for MHC-peptide binding.


The exercise

Prediction of MHC-peptide binding

In this part of the exercise you shall use the EasyPred web-interface to train and evaluate a series of different MHC-peptide binding predictors. You shall use two data sets (eval.set, train.set) that contain peptides and their binding affinity to the MHC allele HLA-A*0201. The binding affinity is a number between 0 and 1, where a high value indicates strong binding (a value of 0.5 corresponds to a binding affinity of approximately 200 nM). The eval.set contains 66 such peptides, and the train.set 1200.
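The exercise does not say how the measured affinities (in nM) were rescaled to the 0 to 1 range. A transformation often used for this kind of data is 1 - log(affinity)/log(50000), and it is consistent with the statement that 0.5 corresponds to roughly 200 nM. The sketch below assumes that transformation; the constant 50000 and both function names are assumptions for illustration and are not taken from the data files.

```python
import math

# Assumed upper bound of the affinity scale in nM (not stated in the exercise).
MAX_AFFINITY_NM = 50000.0


def affinity_to_score(affinity_nm):
    """Map an affinity in nM to a score in [0, 1]; high score = strong binding."""
    score = 1.0 - math.log(affinity_nm) / math.log(MAX_AFFINITY_NM)
    return min(1.0, max(0.0, score))  # clip values outside the scale


def score_to_affinity(score):
    """Inverse mapping: score in [0, 1] back to an affinity in nM."""
    return MAX_AFFINITY_NM ** (1.0 - score)


if __name__ == "__main__":
    print(round(affinity_to_score(200.0), 2))
    print(round(score_to_affinity(0.5), 1))
```

Under this assumption a score of exactly 0.5 maps to about 224 nM, which matches the "approximately 200 nM" given above.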

Open each file, and describe the content in a few words - differences and similarities between the two datasets.

Before you start using EasyPred you must save the train.set and eval.set files locally on the Desktop of your laptop. You do that by clicking on the file names (eval.set, train.set) and saving the files as text files on the Desktop.

You shall now use the EasyPred web-server to train a series of methods to predict peptide-MHC binding. Go to the EasyPred web-server.


Weight Matrix construction

First you shall train a matrix predictor. This you have done before, so it should be relatively fast. On the EasyPred web-server press Clear fields. In the upload training examples window browse and select the train.set file from the Desktop, and in the upload evaluation window browse and select the eval.set file from the Desktop. In the Matrix method parameters select Clustering at 62% identity, and set weight on prior (weight on pseudo counts) to 200. Press Submit query. This will calculate a weight-matrix using sequence weighting by clustering, and a weight on prior (pseudo counts) of 200.
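To make the role of the weight on prior concrete, here is a minimal sketch for a single motif position. It blends observed amino acid frequencies with pseudo frequencies as p = (alpha*f + beta*g)/(alpha + beta), where beta is the weight on prior. Two simplifications are assumptions of this sketch and not a description of EasyPred: the pseudo frequency g is a flat background of 1/20 (weight matrix methods normally derive it from a substitution matrix), and alpha is simply the number of observed residues, with no sequence clustering. The function name is hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumed flat background frequency


def column_log_odds(residues, weight_on_prior):
    """Log-odds scores for one motif position.

    residues: the amino acids observed at this position (one per peptide).
    weight_on_prior: beta, the weight given to the pseudo frequencies.
    """
    counts = Counter(residues)
    alpha = len(residues)  # simplification: no clustering, no sequence weights
    beta = weight_on_prior
    scores = {}
    for aa in AMINO_ACIDS:
        observed = counts[aa] / alpha
        pseudo = BACKGROUND  # simplification: flat pseudo frequency
        p = (alpha * observed + beta * pseudo) / (alpha + beta)
        scores[aa] = math.log2(p / BACKGROUND) if p > 0 else float("-inf")
    return scores


if __name__ == "__main__":
    column = "LLLLM"  # toy data: residues seen at one position
    for beta in (0, 200):
        scores = column_log_odds(column, beta)
        print(beta, round(scores["L"], 2), scores["W"])
```

Running it with a weight of 0 and of 200 on the same toy column shows how the scores of observed and unobserved amino acids change.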

  • Q1: What is the predictive performance of the matrix method (Pearson correlation coefficient and Aroc value)? Both the Pearson correlation and the Aroc (area under the ROC curve) value should be as close to 1 as possible for a perfect prediction method. A Pearson correlation of 0 and an Aroc value of 0.5 correspond to a random prediction method.
  • Q2: How many of the 1200 peptides in the train set are included in the matrix construction?
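The two performance measures can be computed by hand from a list of measured and predicted values. The following is a minimal sketch in plain Python. The binder threshold of 0.5 used to turn measured values into binder/non-binder labels for the Aroc is an assumption based on the 200 nM definition above; EasyPred may use a different cutoff. Function names are illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def aroc(measured, predicted, threshold=0.5):
    """Area under the ROC curve.

    Computed as the fraction of (binder, non-binder) pairs in which the
    binder gets the higher prediction; ties count as half.
    """
    binders = [p for m, p in zip(measured, predicted) if m >= threshold]
    non_binders = [p for m, p in zip(measured, predicted) if m < threshold]
    wins = 0.0
    for b in binders:
        for nb in non_binders:
            if b > nb:
                wins += 1.0
            elif b == nb:
                wins += 0.5
    return wins / (len(binders) * len(non_binders))


if __name__ == "__main__":
    measured = [0.9, 0.8, 0.1, 0.2]   # toy values, not from eval.set
    predicted = [0.7, 0.6, 0.2, 0.3]
    print(pearson(measured, predicted), aroc(measured, predicted))
```

Both functions expect at least one binder and one non-binder, and non-constant inputs; otherwise the denominators are zero.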

Go back to the EasyPred server window (use the Back button). Set the clustering method to No clustering and the Weight on prior to zero, and redo the calculation.

  • Q3: What is the predictive performance of the matrix method now?
  • Q4: Can you argue why clustering and pseudo count (weight on prior) improve the prediction accuracy?

Neural networks

Training set partition

Now the fun starts. You shall now train some neural networks to predict MHC-peptide binding. In the Type of prediction method window select neural networks. Leave all other parameters as they are. Upload the train.set as training example, and the eval.set as evaluation examples and press Submit query.

This will train a neural network with 2 hidden neurons running up to 300 training epochs (training cycles). The top 80% (960 peptides) of the train.set is used to train the neural network and the bottom 20% (240 peptides) is used to stop the training to avoid over-fitting.
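The partition and the stopping rule described here can be sketched as follows. The code splits an ordered list of examples into a training part and a stop part, and picks the epoch with the highest test-set Pearson correlation. The function names and the `train_on_top` flag are hypothetical, and the sketch only mirrors the description above; it does not reproduce how EasyPred trains its networks.

```python
def split_top_bottom(examples, train_fraction=0.8, train_on_top=True):
    """Split examples, in file order, into (train, stop) lists.

    train_on_top=True  -> first 80% for training, last 20% for stopping.
    train_on_top=False -> last 80% for training, first 20% for stopping.
    """
    n_train = round(len(examples) * train_fraction)
    if train_on_top:
        return examples[:n_train], examples[n_train:]
    n_stop = len(examples) - n_train
    return examples[n_stop:], examples[:n_stop]


def best_epoch(test_pearson_per_epoch):
    """Return (epoch number starting at 1, correlation) of the best epoch."""
    best = max(range(len(test_pearson_per_epoch)),
               key=lambda i: test_pearson_per_epoch[i])
    return best + 1, test_pearson_per_epoch[best]


if __name__ == "__main__":
    peptides = list(range(1200))  # stand-in for the 1200 lines of train.set
    train, stop = split_top_bottom(peptides)
    print(len(train), len(stop))
    print(best_epoch([0.10, 0.50, 0.40]))  # toy learning curve
```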

  • Q5: What is the maximal test performance (maximal test set Pearson correlation), and in what epoch (number of training cycles) does it occur?
  • Q6: What is the evaluation performance (Pearson correlation and Aroc values)?

Go back to the EasyPred interface and change the parameters so that you use the bottom 80% of the train.set to train the neural network and the top 20% to stop the training. Redo the network training with the new parameters.

  • Q7: What is the maximal test performance, and in what epoch (number of training cycles) does it occur?
  • Q8: What is the evaluation performance?
  • Q9: How does the performance differ from what you found in the previous training?
  • Q10: Why do you think the performance differs so much?

Hidden Neurons

Go back to the EasyPred interface and change the parameters back so that you use the top 80% of the train.set for training. Next, do neural network training with a different number of hidden neurons (1 and 5, for instance).

  • Q11: How does the test performance differ when you vary the number of hidden neurons?
  • Q12: How does the evaluation performance differ?
  • Q13: Can you decide on an optimal number of hidden neurons?
  • Q14: As you most likely have found, the number of hidden neurons has little influence on the performance of the neural network. Why do you think this is? In the lectures, it was explained that neural networks with hidden neurons can capture higher order correlations. Could it be that peptide:MHC binding has no higher order correlations, or can you think of another explanation?

Cross-validated training

As you found in the first part of the neural network training, the network performance can depend strongly on the partition of the training data into the training and stop set. One way of improving the network performance is to make use of this network variation in a cross-validated training. The general idea behind cross-validated training is that, since you cannot tell in advance which training set partition will be optimal, you make a series of N network trainings, each with a different partition. The final network prediction is then taken as the simple average over the N predictions. In a 5-fold cross-validated training, the training set is split up into 5 sets. In one training, sets 1, 2, 3 and 4 are used to train the network and the 5th set to stop the training; in another training, sets 1, 3, 4 and 5 are used for training and the 2nd set to stop the training, and so forth.
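The partition scheme and the averaging step can be sketched as follows. How EasyPred assigns peptides to the five sets is not stated here, so the interleaved assignment below is an assumption, and the "networks" are stand-in functions that return a fixed value. All names are illustrative.

```python
def cross_validation_partitions(examples, n_folds=5):
    """Return a list of (train, stop) pairs, one per fold.

    Assumption: example i goes to fold i % n_folds (interleaved split).
    """
    folds = [examples[i::n_folds] for i in range(n_folds)]
    partitions = []
    for stop_index in range(n_folds):
        train = [ex
                 for i, fold in enumerate(folds) if i != stop_index
                 for ex in fold]
        partitions.append((train, folds[stop_index]))
    return partitions


def ensemble_predict(networks, peptide):
    """Final prediction: simple average over the N network predictions."""
    return sum(net(peptide) for net in networks) / len(networks)


if __name__ == "__main__":
    peptides = list(range(1200))  # stand-in for the lines of train.set
    for train, stop in cross_validation_partitions(peptides):
        print(len(train), len(stop))
    # Stand-in "networks": each returns a fixed score for any peptide.
    networks = [lambda p: 0.2, lambda p: 0.4]
    print(ensemble_predict(networks, "ILKEPVHGV"))
```

With 1200 peptides and 5 folds, each training uses 960 peptides for training and 240 for stopping, and every peptide is in the stop set exactly once.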

Go back to the EasyPred interface and set the hidden neuron parameter back to 2. Next, set the number of partitions for cross-validated training to 5 and redo the neural network training (this might take some minutes).

Write down the test performance for each of the five networks.

  • Q15: How does the train/test performance differ between the different partitions?
  • Q16: What is the evaluation performance and how does it compare to the performance you found previously?

Now you must save the parameters for the cross-validated network you have just trained. Right-click on Parameters for prediction method to save the neural network parameters to a file (say para.dat). You can now run predictions using this neural network without redoing the network training by uploading the parameter file in the Load saved prediction method window.


Finding epitopes in real proteins

You shall use the neural network to find potential epitopes in the SARS virus. In the EasyPred web-interface press Clear fields to reset all parameter fields. Go to the UniProt home page. Search for a SARS entry by typing Sars virus in the search window. Click your way to the FASTA format for one of the proteins (select your protein of interest, click retrieve at the bottom of the page, and then open as FASTA). Paste the FASTA file into the Paste in evaluation examples window. Upload the network parameter file (para.dat) from before into the Load saved prediction method window. Leave the window Networks to chose in ensemble blank, make sure that the option for sorting the output is set to Sort output on predicted values, and press Submit query.

  • Q17: How many high binding epitopes do you find (affinity stronger than 200 nM)? Is this number reasonable (how large a fraction of random 9-mer peptides is expected to bind to a given HLA complex)?
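As a sanity check for Q17, the sketch below enumerates the overlapping 9-mers of a protein sequence and estimates how many binders to expect if about one in 200 peptides binds, the figure given in the background section. The function names are illustrative and the sequence in the usage example is a made-up placeholder, not a SARS protein.

```python
def nine_mers(protein, length=9):
    """All overlapping peptides of the given length, in sequence order."""
    return [protein[i:i + length] for i in range(len(protein) - length + 1)]


def expected_binders(protein_length, binder_fraction=1 / 200, length=9):
    """Expected number of binders among the overlapping peptides."""
    n_peptides = max(protein_length - length + 1, 0)
    return n_peptides * binder_fraction


if __name__ == "__main__":
    toy_protein = "MKTAYIAKQRQISFVKSHFSRQ"  # placeholder sequence
    peptides = nine_mers(toy_protein)
    print(len(peptides), peptides[0])
    print(expected_binders(len(toy_protein)))
```

A protein of length L contains L - 8 overlapping 9-mers, so the expected count is roughly (L - 8)/200.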

Now you are done!!