Training of neural networks
Morten Nielsen (mniel@cbs.dtu.dk)
Overview
During this exercise you will use the EasyPred web server to train and evaluate an
artificial neural network method for prediction of peptide-MHC binding.
Background: Peptide MHC binding
The most selective step in identifying potential peptide immunogens is
the binding of the peptide to the MHC complex. Only about one in
200 peptides will bind to a given MHC complex. A very large number of
different MHC alleles exist, each with a highly selective peptide-binding
specificity.
The binding motif for a given MHC class I complex is in
most cases 9 amino acids long. The motif is characterized
by a strong amino acid preference at specific positions
in the motif. These positions are called anchor positions. For
many MHC complexes the anchor positions are located at
P2 and P9 in the motif. However, this is not always the case.
A large amount of peptide data exists describing this variation in MHC
specificity. One important source of such data is the
SYFPEITHI MHC database (http://www.syfpeithi.de).
This database contains information on MHC ligands and binding motifs.
Purpose of exercise, description of data
In this exercise you are going to use the EasyPred web-interface to train bioinformatics predictors for
MHC-peptide binding. First you shall work through a little toy example showing how hidden neurons allow
an artificial neural network to learn the XOR function, and next you shall (once more :-)) use peptide/MHC
binding data to train an artificial neural network method to do peptide/MHC binding predictions.
The XOR function
As stated in the lecture today, the XOR function cannot be learned by a linear method like, for instance, the SMM
method used in the MINI-project: no single weighted sum of the inputs can separate the patterns with output 1 from
those with output 0. To capture the higher order correlations in the XOR function, you must use a
higher order method like artificial neural networks.
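You can see this directly with a few lines of Python (a sketch for illustration only; numpy is assumed to be available). The best possible linear fit to the four XOR patterns is the constant 0.5, i.e. the same output for every input:

    import numpy as np

    # The four XOR input patterns (encoding A=0, C=1) and their targets
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 1, 1, 0])

    # Best linear model w.x + b in the least-squares sense
    A = np.hstack([X, np.ones((4, 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)

    print(A @ coef)  # [0.5 0.5 0.5 0.5]: identical output for all patterns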
You shall now see that this is indeed the case. Go to the
EasyPred web-server.
Make an XOR example in the training example window. Use only two amino acids to make the example, e.g. letting A represent 0
and C represent 1, so that the four input patterns and their XOR outputs become:
AA 0
AC 1
CA 1
CC 0
Repeat the example twice, so that you have 8 training examples in total. Likewise, fill in the XOR examples in the
evaluation window.
Select the Neural network method. Set the number of hidden neurons to 1, the number of iterations (epochs) to 60000, and
the fraction of data to train on to 0.5. Next, press
Submit query.
Could the network learn the XOR function? Make sure you understand the output produced by the EasyPred method; in particular, make
sure you understand where the predictive performance is reported.
The training is performed using the top 50% of the training examples, and the bottom 50% are used as a test
set to stop the training and avoid overfitting. The test performance is reported as
Maximal test set correlation coefficent sum = 0.577400 in epoch 241
Maximal test set pearson correlation coefficent sum = 0.607500 in epoch 212
minimal per example squared error = 0.167800 in epoch 59705
where Maximal test set correlation coefficent is the Matthews correlation. The performance on the evaluation
data is reported as
Pearson coefficient for N= 8 data: -0.00160
Aroc value: 0.50000
where Aroc is the area under the ROC curve. Both the Pearson correlation and the Aroc are 1 for a perfect prediction;
for a random prediction, the Aroc is 0.5 and the Pearson correlation 0.0.
Both the test set and the evaluation performance values are close to random: the
network could NOT learn the XOR function.
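As an aside, these three performance measures are standard and easy to compute yourself. A minimal Python sketch, assuming scipy and scikit-learn are installed and using made-up numbers (EasyPred computes all of this server-side):

    from scipy.stats import pearsonr
    from sklearn.metrics import matthews_corrcoef, roc_auc_score

    targets = [1, 0, 0, 1, 1, 0, 0, 1]                # true outputs
    preds = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.5, 0.8]  # network predictions

    r, _ = pearsonr(targets, preds)       # Pearson correlation
    mcc = matthews_corrcoef(targets,      # Matthews correlation needs
        [int(p > 0.5) for p in preds])    # binary predictions, so threshold
    aroc = roc_auc_score(targets, preds)  # area under the ROC curve

    print(f"Pearson: {r:.3f}  Matthews: {mcc:.3f}  Aroc: {aroc:.3f}")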
Now go back to the EasyPred website and change the number of hidden neurons to 2. Leave the other options as they were in the
previous run. Next, press Submit query.
Could the neural network with hidden neurons learn the XOR function?
This time the network can indeed learn the XOR function.
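To see why two hidden neurons suffice, here is a small numpy sketch (not the EasyPred implementation) with the weights set by hand: one hidden neuron computes OR, the other AND, and the output neuron fires when OR is on but AND is off, which is exactly XOR:

    import numpy as np

    def step(x):
        # Threshold activation: 1 if the input is positive, else 0
        return (x > 0).astype(int)

    def xor_net(x1, x2):
        s = x1 + x2
        # Hidden neuron 1 computes OR (fires if x1 + x2 > 0.5),
        # hidden neuron 2 computes AND (fires if x1 + x2 > 1.5)
        h = step(np.array([s - 0.5, s - 1.5]))
        # Output neuron: OR minus AND, thresholded -> XOR
        return int(step(np.array([h[0] - h[1] - 0.5]))[0])

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, '->', xor_net(a, b))  # prints the XOR truth table

No single linear combination of the two inputs can reproduce this table, which is why the run with only one hidden neuron failed.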
MHC/peptide binding predictions
You shall use the EasyPred web-interface
to train and evaluate a series of different MHC-peptide binding
predictors. You shall use two data sets (eval.set, train.set)
that contain peptides and their binding affinities to the MHC allele HLA-A*0201.
The binding affinity is a number between 0 and 1, where a high value
indicates strong binding (a value of 0.5 corresponds to a
binding affinity of approximately 200 nM).
The eval.set
contains 66 such peptides, and the train.set 1200.
Click on the filenames to view the content of the files.
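The mapping between these 0-1 scores and affinities in nM follows, as far as can be judged from the 200 nM anchor point, the standard log-transform used in this field, with 50,000 nM taken as the weakest affinity on the scale. A sketch under that assumption:

    import math

    MAX_IC50 = 50000.0  # assumed upper bound of the affinity scale, in nM

    def ic50_to_score(ic50_nm):
        # Map an IC50 affinity (nM) to a 0-1 score; strong binders score high
        return max(0.0, 1.0 - math.log(ic50_nm) / math.log(MAX_IC50))

    def score_to_ic50(score):
        # Inverse transform: 0-1 score back to an IC50 affinity in nM
        return MAX_IC50 ** (1.0 - score)

    print(ic50_to_score(200.0))  # ~0.51, so a score of 0.5 is roughly 200 nM
    print(score_to_ic50(0.5))    # ~224 nM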
Before you start using EasyPred, you must save the
train.set and eval.set files locally on the Desktop of your laptop. You
do that by clicking on the file names (eval.set,
train.set) and saving the files as text files
on the Desktop.
You shall now use the EasyPred web-server to
train a series of methods to predict peptide-MHC binding. Go to the
EasyPred web-server.
Neural networks
Training set partition
You shall now train some neural networks
to predict MHC-peptide binding.
Go to the
EasyPred web-server.
In the Type of prediction method window select neural networks.
In the upload training examples window, browse and select the train.set file from the Desktop; in the upload evaluation
window, browse and select the eval.set file from the Desktop.
In the window Fraction of data to train on (the rest is used to avoid overtraining) type 0.99.
Leave all other parameters as they are.
This will train a neural network with 2 hidden neurons,
running for up to 300 training epochs.
The top 99% (1188 peptides) of the train.set is used to train
the neural network and the bottom 1% (12 peptides) are used to
stop the training to avoid over-fitting.
Press Submit query.
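If you want to reproduce this train/stop logic outside the web server, the sketch below mimics it with scikit-learn (an assumption on my part; EasyPred's own network code is not shown here). It expects the peptides already encoded as numeric arrays X and y, trains on the top fraction, and uses the bottom fraction only to find the best epoch:

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    def train_with_stop_set(X, y, train_frac=0.99, hidden=2, epochs=300, seed=0):
        n = int(len(X) * train_frac)
        X_tr, y_tr = X[:n], y[:n]          # top fraction: weight updates
        X_st, y_st = X[n:], y[n:]          # bottom fraction: stop set
        net = MLPRegressor(hidden_layer_sizes=(hidden,), random_state=seed)
        best_err, best_epoch = np.inf, 0
        for epoch in range(epochs):
            net.partial_fit(X_tr, y_tr)    # one pass over the training data
            err = np.mean((net.predict(X_st) - y_st) ** 2)
            if err < best_err:             # EasyPred reports this best epoch
                best_err, best_epoch = err, epoch
        return net, best_epoch, best_err

(A faithful version would also store the weights from the best epoch; here only the epoch number is kept, to keep the sketch short.)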
- Q1: What is the maximal test performance (maximal test set Pearson correlation),
and in what epoch (number of training cycles) does it occur?
- A1: Maximal test set pearson correlation coefficent sum = 0.932500 in epoch 29
- Q2: What is the evaluation performance (Pearson correlation and Aroc values)?
- A2: Pearson coefficient for N= 66 data: 0.46948. Aroc value: 0.75163.
Now go back to the submission site and change the
Fraction of data to train on (the rest is used to avoid overtraining) to 0.8 (80%).
This will again train a neural network running up to 300 training epochs.
The top 80% (960 peptides) of the train.set is used to train
the neural network and the bottom 20% (240 peptides) are used to
stop the training to avoid over-fitting.
- Q3: What is the maximal test performance (maximal test set Pearson correlation),
and in what epoch does it occur?
- A3: Maximal test set pearson correlation coefficent sum = 0.801300 in epoch 103
- Q4: What is the evaluation performance (Pearson correlation and Aroc values)?
- A4: Pearson coefficient for N= 66 data: 0.58693. Aroc value: 0.85490.
- Q5: Does this network perform better or worse than the previous one?
- A5: The network has a higher performance on the evaluation set, so it performs better.
Go back to the EasyPred interface and change the parameters so
that you use the bottom 80% of the train.set to train the neural
network and the top 20% to stop the training. Redo the network training
with the new parameters.
- Q6: What is the maximal test performance, and in
what epoch does it occur?
- A6: Maximal test set pearson correlation coefficent sum = 0.837800 in epoch 90.
- Q7: What is the evaluation performance?
- A7: Pearson coefficient for N= 66 data: 0.55571. Aroc value: 0.78170.
- Q8: How does the performance differ from what you found in the previous
training?
- A8: The test performance is higher for this second network, but the evaluation performance is clearly lower.
- Q9: Why do you think the performance differs so much?
- A9: When the training is stopped on the top 20% of the data, the network will have an inherent bias
towards peptides similar to the ones in the top 20%. If these peptides happen to be very similar to the
peptides in the evaluation set, the network will perform better. Also, if the diversity of the peptides in the
stop set is very low, then this set of peptides will be easy to fit, but the network will not learn
to generalize to other peptides. This is what was observed in questions Q1 and Q2.
Hidden Neurons
Go back to the EasyPred interface and change the parameters back
so that you use the top 80% of the train.set for training. Next,
redo the neural network training with different numbers of hidden neurons
(1 and 5, for instance).
- Q10: How does the test performance differ when
you vary the number of hidden neurons?
- A10: NH1: Maximal test set pearson correlation coefficent sum = 0.838600 in epoch 85
- A10: NH5: Maximal test set pearson correlation coefficent sum = 0.839700 in epoch 111
- Q11: How does the evaluation performance differ?
- A11: NH1: Pearson coefficient for N= 66 data: 0.55949. Aroc value: 0.80261
- A11: NH5: Pearson coefficient for N= 66 data: 0.55402. Aroc value: 0.78301
- Q12: Can you decide on an optimal number of hidden neurons?
- A12: The number of hidden neurons does not seem to influence the predictive performance much.
- Q13: Why do you think the number of hidden neurons has
so little importance?
- A13: To learn higher order correlations, the network must estimate paired amino acid correlations.
These are defined in terms of 400 amino-acid-pair frequencies estimated from the set of peptide binders.
The available data contain 1200 peptides, but only 136 of these are binders (affinity value > 0.5). It is hence unlikely
that the network can accurately estimate these 400 pair frequencies from such a limited set of binding peptides.
Cross-validated training
As you found in the first part of the neural network training,
the network performance can depend strongly on the partition
of the training data into a training and a stop set. One way
of improving the network performance is to make use
of this variation in a cross-validated training.
The general idea behind cross-validated training is that,
since you cannot tell in advance which training set partition
will be optimal, you make a series of N network trainings,
each with a different partition. The final network prediction is
then taken as the simple average over the N predictions. In
a 5-fold cross-validated training, the training set is split
into 5 sets. In one training, sets 1, 2, 3 and 4 are used to train
the network and the 5th set to stop the training; in another
training, sets 1, 3, 4 and 5 are used for training and the 2nd set
to stop the training; and so forth.
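A sketch of this scheme in Python, again assuming scikit-learn and numerically encoded data (EasyPred does all of this server-side). Each fold in turn is held out, one network is trained per partition, and the final prediction on the evaluation data is the simple average:

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.neural_network import MLPRegressor

    def cv_ensemble(X, y, X_eval, n_splits=5, hidden=2, seed=0):
        preds = []
        for train_idx, stop_idx in KFold(n_splits=n_splits).split(X):
            # The held-out fold should act as the stop set; scikit-learn
            # cannot take an explicit stop set, so its internal early
            # stopping is used here as an approximation.
            net = MLPRegressor(hidden_layer_sizes=(hidden,), max_iter=300,
                               early_stopping=True, random_state=seed)
            net.fit(X[train_idx], y[train_idx])
            preds.append(net.predict(X_eval))
        return np.mean(preds, axis=0)  # ensemble = simple average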
Go back to the EasyPred interface and set the hidden neuron parameter back to 2.
Next, set the number of partitions
for cross-validated training to 5 and redo the neural network training (this
might take a few minutes).
Write down the test performance for each of the five networks.
- Q14: How does the train/test performance differ between the different
partitions?
- Q15: What is the evaluation performance and how does it compare to
the performance you found previously?
- A15: Pearson coefficient for N= 66 data: 0.61468. Aroc value: 0.82745. The Pearson correlation
is the best obtained so far, while the Aroc is a bit lower than what we found in questions Q3 and Q4.
Now you must save the parameters for the cross-validated
network you have just trained. Use the right mouse button on
Parameters for prediction method to save the neural network
parameters to a file (say para.dat). You can now run predictions with
this neural network, without redoing the network training, by uploading the
parameter file in the Load saved prediction method window.
Finding epitopes in real proteins
You shall use the neural network
to find potential epitopes in the SARS virus. In the EasyPred
web-interface, press clear fields to reset all
parameter fields. Go to the Swiss-Prot
homepage. Search
for a SARS entry by typing Sars in the search window, and click
your way to the FASTA format for one of the proteins.
Here
is a link if you are lazy. Paste the FASTA file into the Paste in
evaluation examples window. Upload the network parameter file (para.dat) from before
into the Load saved prediction method window.
Leave the window Networks to choose in ensemble
blank, make sure that the option for sorting the output
is set to Sort output on predicted values, and press Submit query.
- Q16: How many high-binding epitopes do you find?
Is this number reasonable (how large a fraction of random 9-meric peptides is expected to bind to a given HLA complex)?
- A16: Depending on the protein sequence selected, you will find between 2 and 5 peptides with a prediction score greater than 0.5. This corresponds to a fraction of 0.005 - 0.02, which is very reasonable, since we expect around
0.5-2% of random peptides to bind to a given MHC molecule.
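If you want to check such numbers yourself, the fragment below counts the overlapping 9-mers in a protein sequence and how many score above the 0.5 threshold; predict_binding is a hypothetical stand-in for the trained network:

    def nonamers(sequence):
        # All overlapping 9-mer peptides in a protein sequence
        return [sequence[i:i + 9] for i in range(len(sequence) - 8)]

    def count_binders(sequence, predict_binding, threshold=0.5):
        # 'predict_binding' is a hypothetical stand-in for the trained network
        peptides = nonamers(sequence)
        n = sum(1 for p in peptides if predict_binding(p) > threshold)
        return n, n / len(peptides)  # number and fraction of predicted binders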
Now you have, in less than one hour, developed advanced and competitive methods for predicting
binding of peptides to HLA class I. You have also identified potential CTL epitope vaccine
candidates for the SARS virus. All you need now is to find some venture capital and start your own
biotech company. That is not a bad outcome of one hour's work!
Now you are done!!