Exercise: Data redundancy
Exercise written by: Claus Lundegaard
Introduction
Data redundancy is an important issue in training, validation, and comparison of predictive methods.
Here, we will train and evaluate artificial neural networks using a web interface, EasyPred, and different data sets with varying degrees of redundancy.
All data used in this exercise are peptide-MHC binding affinity data. The binding values have been
transformed to fall in the range 0-1. A high value
indicates strong binding (a value of 0.5 corresponds to a
binding affinity of approximately 200 nM). The values are
calculated from the measured nM binding affinities using the
relation x = 1 - log(aff nM)/log(50000). A peptide that binds with
an affinity stronger than 500 nM is said to be an intermediate binder, and
a peptide that binds stronger than 50 nM is a high binder. Note that a low
nM affinity value means strong binding. As a rule of thumb, a peptide must be
at least an intermediate binder in order to induce an immune response.
Using the above transformation, an intermediate binder (500 nM) will have
a value of 0.426, and a strong binder (50 nM) a value of 0.638. Again,
note that a high transformed value corresponds to a low nM affinity value, and hence
to a strong binder.
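The relation above can be checked with a few lines of Python. This is only a minimal sketch of the formula as stated in the text; the function names and the constant name are chosen here for illustration and are not part of EasyPred.

```python
import math

# Normalisation constant taken from the relation x = 1 - log(aff nM)/log(50000).
MAX_AFFINITY_NM = 50000.0


def transform(aff_nm: float) -> float:
    """Map a binding affinity in nM to the transformed binding value."""
    return 1.0 - math.log(aff_nm) / math.log(MAX_AFFINITY_NM)


def inverse_transform(x: float) -> float:
    """Map a transformed binding value back to an affinity in nM."""
    return MAX_AFFINITY_NM ** (1.0 - x)


# Thresholds quoted in the text: high binder (50 nM), intermediate binder (500 nM).
for aff in (50.0, 500.0):
    print(f"{aff:g} nM -> {transform(aff):.3f}")

# Affinity that corresponds to a transformed value of 0.5.
print(f"0.5 -> {inverse_transform(0.5):.0f} nM")
```

This prints 0.638 for 50 nM and 0.426 for 500 nM, the two values quoted above. A transformed value of 0.5 maps back to about 224 nM, in line with the rough figure of 200 nM given in the text.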
Training session 1
First download the first training data set to your desktop: Right-click train1 and save it as train1 on your desktop.
Go to the EasyPred web-site.
Under upload training examples push choose file and select the train1 file from your desktop.
Under Type of prediction method select Neural network.
Leave all other options as is and push Submit query.
Wait until the final result page appears (1-2 minutes).
- Q1: How many training examples are in the training set?
- Q2: How many of the examples have actually been used in the training?
- Q3: What part of the training set has been used for testing?
- Q4: Which epoch (training cycle) gives the best linear correlation, and what is the value?
- Q5: What happens to the training at this point?
Training session 2
Now download the second training data set to your desktop: Right-click train2 and save it as train2 on your desktop.
Go to the EasyPred web-site.
Under upload training examples push choose file and select the train2 file from your desktop.
Under Type of prediction method select Neural network.
Leave all other options as is and push Submit query.
Wait until the final result page appears (1-2 minutes).
- Q6: How many training examples are in this training set?
- Q7: How many examples have been used for testing?
- Q8: Which epoch gives the best linear correlation, and what is the value?
- Q9: Is this different from the previous training?
Both training sets contain only unique peptides.
- Q10: What might be the reason for the difference between the two training sessions?
Training session 3
We now try to evaluate the performance on a set where the data have not been used in training or testing.
Download the evaluation data set eval1 to your desktop: Right-click eval1 and save it as eval1 on your desktop.
We will use the data set from session 1 (train1).
Go to the EasyPred web-site.
Under upload training examples push choose file and select the train1 file from your desktop.
Under upload evaluation examples push choose file and select the eval1 file from your desktop.
Under Type of prediction method select Neural network.
Leave all other options as is and push Submit query.
Wait until the final result page appears (1-2 minutes).
- Q11: How many data points are in this evaluation set?
- Q12: What is the Pearson's correlation coefficient for this evaluation set?
Here we have used the same training set as in the first training session.
- Q13: What is the difference in Pearson's correlation coefficient between the evaluation set and the test set?
- Q14: Would you expect the sequences in this evaluation set to be more or less similar to sequences used for training than the sequences in the test set?
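For reference, Pearson's correlation coefficient between measured and predicted binding values can be computed as in the sketch below. This is the standard textbook definition written in plain Python; how EasyPred computes the number internally is not described here, and the example values are made up purely for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)


# Made-up transformed binding values, for illustration only.
measured = [0.10, 0.43, 0.64, 0.80]
predicted = [0.20, 0.35, 0.60, 0.75]
print(round(pearson(measured, predicted), 3))
```

A value close to 1 means that the predictions rank and scale the peptides much like the measurements do; a value close to 0 means there is no linear relationship.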
Training session 4
We will now use the same evaluation set to evaluate the training using the other training set.
Go to the EasyPred web-site.
Under upload training examples push choose file and select the train2 file from your desktop.
Under upload evaluation examples push choose file and select the eval1 file from your desktop.
Under Type of prediction method select Neural network.
Leave all other options as is and push Submit query.
Wait until the final result page appears (1-2 minutes).
- Q15: How many data points are in this evaluation set?
- Q16: What is the Pearson's correlation coefficient for this evaluation set?
Training session 5
We have also created another evaluation set:
Right-click eval2 and save it as eval2 on your desktop.
Go to the EasyPred web-site.
Under upload training examples push choose file and select the train2 file from your desktop.
Under upload evaluation examples push choose file and select the eval2 file from your desktop.
Under Type of prediction method select Neural network.
Leave all other options as is and push Submit query.
Wait until the final result page appears (1-2 minutes).
- Q17: How many data points are in this evaluation set?
- Q18: What is the Pearson's correlation coefficient for this evaluation set?
- Q19: What could be the reason for the difference in correlation coefficients between the two evaluation sets?
- Q20: Which set will be more similar to the training set?
Use the evaluation set you chose in Q20 to evaluate the training from session 1.
Compare the test set and evaluation set values.
- Q21: Would you be surprised to know that this evaluation set and test set are nearly identical?
- Q22: Why (not)?
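One way to explore how similar an evaluation set is to a training set is to measure, for each evaluation peptide, its highest sequence identity to any training peptide. The sketch below is one simple possibility and not a method used by EasyPred. It assumes that peptides are compared position by position without gaps, that peptides of different lengths are treated as unrelated, and that 0.8 is a reasonable identity cut-off; the function names, the cut-off, and the peptides are all made up for illustration.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two peptides of equal length."""
    if len(a) != len(b) or not a:
        return 0.0  # simplifying assumption: different lengths count as unrelated
    return sum(x == y for x, y in zip(a, b)) / len(a)


def max_identity_to_set(peptide: str, reference: list[str]) -> float:
    """Highest identity between one peptide and any peptide in the reference set."""
    return max((identity(peptide, r) for r in reference), default=0.0)


def fraction_similar(query: list[str], reference: list[str], threshold: float = 0.8) -> float:
    """Fraction of query peptides with a close match (>= threshold) in the reference set."""
    if not query:
        raise ValueError("query set is empty")
    hits = sum(max_identity_to_set(p, reference) >= threshold for p in query)
    return hits / len(query)


# Made-up 9-mer peptides, for illustration only.
train = ["ALAKAAAAM", "GILGFVFTL", "KLVALGINA"]
evaluation = ["ALAKAAAAV", "YLQPRTFLL"]
print(fraction_similar(evaluation, train))
```

In this toy example one of the two evaluation peptides differs from a training peptide at a single position, so half of the evaluation set counts as similar to the training set. Running the same kind of check on real data would require first reading the peptides from the downloaded files, whose format is not specified here.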