
Training of neural networks

Morten Nielsen (mniel@cbs.dtu.dk)


Overview

During this exercise you will use the EasyPred web server to train and evaluate an artificial neural network method for prediction of peptide-MHC binding.


Background: Peptide MHC binding

The most selective step in identifying potential peptide immunogens is the binding of the peptide to the MHC complex. Only one in about 200 peptides will bind to a given MHC complex. A very large number of different MHC alleles exist, each with a highly selective peptide binding specificity.


Purpose of exercise, description of data

In this exercise you are going to use the EasyPred web-interface to train bioinformatics predictors for MHC-peptide binding. First you shall make a little toy example to show how hidden neurons can allow the artificial neural network to learn the XOR function, and next you shall (once more) use peptide/MHC binding data to train an artificial neural network method to do peptide/MHC binding predictions.


The XOR function

As stated in the lecture today, the XOR function cannot be learned by a linear method such as the SMM method used yesterday. To capture the higher order correlations in the XOR function, you must use a higher order method like artificial neural networks.

You shall now see that this is indeed the case. Go to the EasyPred web-server.

Make an XOR example in the training example window. Use only two amino acids to make the example, i.e. something like the following (S for small side-chain, L for large side-chain):

SS 0
SL 1
LS 1
LL 0
..
..

Repeat the example twice, so that you have 8 training examples in total. Likewise, fill in the XOR examples in the evaluation window.

Select Neural network method. Set number of hidden neurons to 1, Number of iterations (epochs) to 60000, and Fraction of data to train on to 0.5. Next, press Submit query.

Note, be patient. It might take a few minutes for the calculation to complete.

Could the network learn the XOR function? Make sure you understand the output produced by EasyPred; in particular, make sure you understand where the predictive performance is reported.

Now go back to the EasyPred website, and change the number of hidden neurons to 2. Leave the other options as they were in the previous run. Next, press Submit query.

Note, again be patient. It might take a few minutes for the calculation to complete.

Could the neural network with two hidden neurons learn the XOR function?
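To see why two hidden neurons are enough to represent XOR, here is a minimal sketch of a feed-forward network with hand-picked weights. The weights, the S=0/L=1 encoding and the function names are assumptions made for illustration; they are not the weights EasyPred would learn, and the sketch shows only that such a network can represent XOR, not how it is trained.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def xor_network(x1, x2):
    """Two hidden neurons with hand-picked (not trained) weights."""
    h_or = sigmoid(20 * x1 + 20 * x2 - 10)    # fires if at least one input is 1
    h_and = sigmoid(20 * x1 + 20 * x2 - 30)   # fires only if both inputs are 1
    # Output is high when h_or is on and h_and is off, i.e. exactly one input is 1.
    return sigmoid(20 * h_or - 20 * h_and - 10)


# ASSUMED encoding for this sketch: S -> 0, L -> 1.
for peptide in ("SS", "SL", "LS", "LL"):
    x1, x2 = (1 if aa == "L" else 0 for aa in peptide)
    print(peptide, round(xor_network(x1, x2), 3))
```

The first hidden neuron behaves like OR and the second like AND; the output neuron combines them as "OR but not AND", which is XOR.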


MHC/peptide binding predictions

You shall now use the EasyPred web-interface to train and evaluate a series of different MHC-peptide binding predictors. You shall use two data sets (eval.set, train.set) that contain peptides and their binding affinity to the MHC allele HLA-A*0201. The binding affinity is a number between 0 and 1, where a high value indicates strong binding (a value of 0.42 corresponds to a binding affinity of approximately 500 nM, which, generally speaking, is the affinity needed for a peptide to be presented on the cell surface). The eval.set contains 66 such peptides and the train.set 1200. Click on the filenames to view the content of the files.
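The relation between the 0 to 1 score and the affinity in nM can be sketched as follows. The text above does not state the formula; this sketch assumes the logarithmic transformation 1 - log(affinity)/log(50000) that is commonly used for this kind of data, and the constant and function names are hypothetical.

```python
import math

# ASSUMPTION: affinities are rescaled with 1 - log(aff)/log(50000).
# The formula is not given in the exercise text.
MAX_AFFINITY_NM = 50000.0


def affinity_to_score(affinity_nm):
    """Map an affinity in nM to a score between 0 and 1 (high = strong binder)."""
    score = 1.0 - math.log(affinity_nm) / math.log(MAX_AFFINITY_NM)
    return min(1.0, max(0.0, score))


def score_to_affinity(score):
    """Inverse transformation: score between 0 and 1 back to nM."""
    return MAX_AFFINITY_NM ** (1.0 - score)


print(round(affinity_to_score(500.0), 3))
print(round(score_to_affinity(0.42), 1))
```

Under this assumption, 500 nM maps to a score just above 0.42, in line with the threshold quoted above.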

Before you start using EasyPred, you may want to save the train.set and eval.set files locally on the Desktop of your laptop. You do that by clicking on the file names (eval.set, train.set) and saving the files as text files on the Desktop. This will make upload of the files during the exercise easier.

Neural networks

Training set partition

You shall now train some neural networks to predict MHC-peptide binding.

Go to the EasyPred web-server. Press Clean fields. In the upload training examples window, browse and select the train.set file from the Desktop; in the upload evaluation window, browse and select the eval.set file from the Desktop. Select neural networks.

In the window Fraction of data to train on (the rest is used to avoid overtraining) type 0.99. Leave all other parameters as they are. This will train a neural network with 2 hidden neurons, running up to 300 training epochs. The top 99% (1188 peptides) of the train.set is used to train the neural network and the bottom 1% (12 peptides) is used to stop the training to avoid over-fitting. Press Submit query.
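The partition described above can be sketched as a simple slice of the training file. The function name and the `train_on_top` switch are hypothetical and only illustrate the arithmetic; this is not EasyPred's code.

```python
def split_train_stop(examples, fraction, train_on_top=True):
    """Split examples into a training part and a stop (test) part.

    fraction is the share of examples used for training; the rest is
    used to decide when to stop the training.
    """
    n_train = round(fraction * len(examples))
    if train_on_top:
        return examples[:n_train], examples[n_train:]
    n_stop = len(examples) - n_train
    return examples[n_stop:], examples[:n_stop]


# Placeholder data standing in for the 1200 lines of train.set.
peptides = [f"peptide_{i}" for i in range(1200)]

train, stop = split_train_stop(peptides, 0.99)
print(len(train), len(stop))  # 1188 12
```

With only 12 peptides in the stop set, the point at which training is stopped rests on very little data.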

  • Q1: What is the maximal test performance (maximal test set Pearson correlation), and in what epoch (number of training cycles) does it occur?
  • Q2: What is the evaluation performance (Pearson correlation and Aroc values)?
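The two performance measures asked for above can be sketched in a few lines. This is a minimal illustration, assuming Aroc means the area under the ROC curve and that peptides are labelled as binders by thresholding the measured value; the function names and the toy numbers are hypothetical and are not EasyPred output.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def auc(labels, scores):
    """Area under the ROC curve: the probability that a randomly chosen
    binder scores higher than a randomly chosen non-binder (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Invented toy values for illustration only.
measured = [0.80, 0.60, 0.50, 0.30, 0.10]
predicted = [0.70, 0.40, 0.55, 0.35, 0.20]
is_binder = [m > 0.42 for m in measured]  # ASSUMED binder threshold

print(round(pearson(measured, predicted), 3))
print(auc(is_binder, predicted))
```

Pearson correlation measures how well the predicted values follow the measured values on a linear scale, while the Aroc value only depends on how well binders are ranked above non-binders.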

Now go back to the submission site and change the Fraction of data to train on (the rest is used to avoid overtraining) to 0.8 (80%). This will train a neural network running up to 300 training epochs. The top 80% (960 peptides) of the train.set is used to train the neural network and the bottom 20% (240 peptides) is used to stop the training to avoid over-fitting.
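Using part of the data to stop the training can be sketched as picking the epoch at which the performance on the held-out peptides peaks. The function name and the performance curve are invented for illustration, and the sketch assumes the network from the best epoch is the one that is kept; it is not EasyPred's code or output.

```python
def select_stopping_epoch(test_pearson_by_epoch):
    """Return (epoch, performance) for the epoch with the highest
    test-set Pearson correlation. Epochs are counted from 1."""
    best_index = max(
        range(len(test_pearson_by_epoch)),
        key=lambda i: test_pearson_by_epoch[i],
    )
    return best_index + 1, test_pearson_by_epoch[best_index]


# Invented curve: test performance rises, peaks, then drops as the
# network starts to over-fit the training part.
curve = [0.30, 0.52, 0.61, 0.66, 0.64, 0.60]

epoch, value = select_stopping_epoch(curve)
print(epoch, value)  # 4 0.66
```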

  • Q3: What is the maximal test performance (maximal test set Pearson correlation), and in what epoch does it occur?
  • Q4: What is the evaluation performance (Pearson correlation and Aroc values)?
  • Q5A: Does this network perform better or worse than the one from before?
  • Q5B: And why do you think the evaluation performance is so low in the first case (Q2)?

Go back to the EasyPred interface and change the parameters so that you use the bottom 80% of the train.set to train the neural network and the top 20% to stop the training. Redo the network training with the new parameters.

  • Q6: What is the maximal test performance, and in what epoch does it occur?
  • Q7: What is the evaluation performance?
  • Q8: How does the performance differ from what you found in the previous training?
  • Q9: Why do you think the performance differs so much?

Cross-validated training

As you found in the first part of neural network training, the network performance can depend strongly on the partition of the training data into the training and stop set. One way of improving the network performance is to make use of this network variation in a cross-validated training. The general idea behind cross-validated training is that, since you cannot tell in advance which training set partition will be optimal, you make a series of N network trainings, each with a different partition. The final network prediction is then taken as the simple average over the N predictions. In a 5-fold cross-validated training, the training set is split up into 5 sets. In one training, sets 1, 2, 3 and 4 are used to train the network and the 5th set to stop the training; in another training, sets 1, 3, 4 and 5 are used for training and the 2nd set to stop the training, and so forth.
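The partitioning and averaging described above can be sketched as follows. How EasyPred actually assigns peptides to the 5 sets is not stated in this exercise; the sketch assumes an interleaved assignment, and the function names and prediction values are hypothetical.

```python
def cross_validation_partitions(examples, n_folds=5):
    """Yield (train, stop) pairs; each fold is used once as the stop set.

    ASSUMPTION: example i is assigned to fold i % n_folds.
    """
    folds = [examples[i::n_folds] for i in range(n_folds)]
    for stop_index in range(n_folds):
        train = [e for i, fold in enumerate(folds) if i != stop_index for e in fold]
        yield train, folds[stop_index]


def ensemble_average(predictions_per_network):
    """Average the predictions of N networks, peptide by peptide."""
    n = len(predictions_per_network)
    return [sum(values) / n for values in zip(*predictions_per_network)]


peptides = [f"peptide_{i}" for i in range(1200)]  # placeholder for train.set
for train, stop in cross_validation_partitions(peptides):
    print(len(train), len(stop))  # 960 240 for each of the 5 partitions

# Invented predictions from 3 networks for 2 peptides.
print(ensemble_average([[0.2, 0.7], [0.4, 0.9], [0.3, 0.8]]))
```

Each peptide is used for training in four of the five networks and for stopping in the remaining one, so no part of the training data is left out of the final ensemble.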

Go back to the EasyPred interface and set the hidden neuron parameter back to 2. Next set the number of partitions for cross-validated training to 5 and redo the neural network training (this might take some minutes).

Write down the test performance for each of the five networks.

  • Q14: How does the train/test performance differ between the different partitions?
  • Q15: What is the evaluation performance and how does it compare to the performance you found previously?

Now you must save the parameters for the cross-validated network you have just trained. Right-click on the Parameters for prediction method link to save the neural network parameters to a file (say para.dat). You can now run predictions using this neural network without redoing network training by uploading the parameter file in the Load saved prediction method window.


Finding epitopes in real proteins

You shall use the neural network to find potential epitopes in the Swine Influenza virus. In the EasyPred web-interface, press Clean fields to reset all parameter fields. Go to the UniProt homepage. Search for a Swine Influenza protein entry by typing "Swine Influenza" in the search window. Click your way to the FASTA format for one of the proteins. Here is a link if you are lazy. Paste the FASTA file into the Paste in evaluation examples window. Upload the network parameter file (para.dat) from before into the Load saved prediction method window. Leave the window Networks to chose in ensemble blank, make sure that the option for sorting the output is set to Sort output on predicted values, and press Submit query.

  • Q16: How many high binding epitopes do you find (affinity stronger than 500 nM)? Is this number reasonable (how large a fraction of random 9-mer peptides is expected to bind to a given HLA complex)?
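As a rough sanity check for the question above, the number of overlapping 9-mer peptides in a protein, and the number of binders expected if about one in 200 peptides binds (the figure from the background section), can be sketched as below. The function names are hypothetical and the sequence is a toy placeholder, not a real Swine Influenza protein.

```python
def read_fasta_sequence(text):
    """Return the sequence from a single-entry FASTA text (header lines dropped)."""
    lines = [line.strip() for line in text.splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">"))


def overlapping_peptides(sequence, length=9):
    """All overlapping peptides of the given length."""
    return [sequence[i:i + length] for i in range(len(sequence) - length + 1)]


def expected_binders(n_peptides, binder_fraction=1 / 200):
    """Expected number of binders if binder_fraction of peptides bind."""
    return n_peptides * binder_fraction


# Toy placeholder entry (the 20 amino acid letters), NOT a real protein.
fasta = """>toy_protein placeholder
ACDEFGHIKL
MNPQRSTVWY
"""

sequence = read_fasta_sequence(fasta)
peptides = overlapping_peptides(sequence)
print(len(sequence), len(peptides))  # 20 12
print(expected_binders(len(peptides)))
```

A protein of length N contains N - 8 overlapping 9-mers, so dividing that count by 200 gives a ballpark figure to compare with the number of predicted binders.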

Within less than 1 hour you have now developed advanced and competitive methods for predicting binding of peptides to HLA class I. You have also identified potential CTL epitope vaccine candidates for the Swine Flu virus. All you need now is to find some venture capital and start your own biotech company.

Now you are done!!