The student is advised to keep two browser windows open throughout the
exercise: the cookbook (i.e., this page) should be loaded into one,
while the neural network training takes place in the other.
The exercise proceeds in three stages: first, you will experiment with the neural network simulator
to create the best network you can for predicting protein secondary structure from amino acid sequence.
Then, you will evaluate the performance of your network in test mode. Finally, we will together compare the results
from all groups and discuss what makes a good network.
Data-driven neural network prediction
The problem
The exercise in data-driven neural networks concerns prediction of protein
secondary structure from the linear sequence of amino acids. The
exercise will be carried out using a simple neural network simulator - howlite
- and ready-made sequence data constructed by converting Brookhaven
coordinate data into secondary structure assignments with the Kabsch and
Sander DSSP program.
Take a look at the
sequence data
linking the linear sequence of amino acids and the secondary structure
assignments. The file contains 647 protein sequences of varying length
with secondary structure assignments for each amino acid.
The aim of the exercise is to produce a two-category helix/non-helix
network, i.e., a neural network that predicts whether a given
amino acid belongs to a helix or not. Helix is marked by 'H' in the data
file; all other symbols are treated as non-helix. Initially, we will use the first
100 sequences for training and the last 50 sequences for testing.
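To make the input encoding concrete before you start, here is a minimal Python sketch of the sparse (one-hot) window encoding commonly used for sequence-based networks of this kind. The amino acid alphabet, the zero-padding at the sequence ends, and the example sequence are assumptions for illustration; howlite's exact scheme may differ (for instance, it may use an extra unit per position for window positions beyond the sequence ends).

```python
# Sketch of a sparse (one-hot) sliding-window encoding -- an assumed
# scheme for illustration, not necessarily identical to howlite's.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def encode_window(sequence, center, window_size):
    """One-hot encode a window of residues centered on `center`.

    Positions that fall outside the sequence are encoded as all
    zeros (one common convention for handling the sequence ends).
    """
    half = window_size // 2
    vector = []
    for pos in range(center - half, center + half + 1):
        one_hot = [0] * len(AMINO_ACIDS)
        if 0 <= pos < len(sequence):
            one_hot[AMINO_ACIDS.index(sequence[pos])] = 1
        vector.extend(one_hot)
    return vector

# A 13-residue window over a 20-letter alphabet gives 13 * 20 = 260
# input units; exactly one unit per in-sequence position is set to 1.
x = encode_window("MKVLAAGICFLSA", center=6, window_size=13)
print(len(x))  # 260
```

Note how the number of input units grows linearly with the window size, which is why changing NWSIZE changes the size of the network.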
The method
(NOTE for Mac users: the simulator currently does not run in Safari, but works well in Firefox.)
Make yourself familiar with the WWW based
neural network simulator:
- 1. Take a brief look at the
instructions
, especially the list of run-time parameters. It is not necessary to
understand every parameter for this exercise.
When you have checked the different run-time parameters, make a trial
training run by clicking on the 'Run' button at the bottom of the main
page. Do not modify any parameters this time; just use the default
settings. The program will run for 10 epochs, and produce a trained
neural network. The output of the program will appear in the browser
window. It will show a report of the training process; at the end of the
training run the best training and test set performances will be
reported.
Q1: Let us take a look at the architecture of the neural network you are using. What is the window size? How many hidden units are there? How many sequences are used for training and testing?
Q2: Since the computational units take numerical inputs, amino acid sequences must be encoded as numbers. How many input units does this neural network use?
Q3: Check the performance of the network. What are the best correlation coefficient values on the training set and on the test set? In which epoch were they reached?
Q4: Which parameters do you think should be changed in order to improve the performance and why?
- 2. In the main page, change the value of the LSTOP parameter to 25 to make
the network run for 25 epochs.
Run the program again.
Q5: Check the performance of the network. What are the best correlation coefficient values on the training set and on the test set? In which epochs were they reached?
Q6: Explain why the performance changed.
When howlite has finished running, click on
the 'PLOT' button at the bottom of the resulting page. This will produce
learning curves: four plots of performance measures will be displayed.
The measures are:
a. Percentage correct helix/non-helix
b. Average error
c. Category percent correct
d. Correlation coefficient
Q7: Look at the correlation plot. Does the learning curve appear to have reached a maximum by epoch 25?
Optimize the training
The idea in this section is that all groups independently experiment
with a number of parameters, trying to create as good a network as
possible, i.e., the network with the highest correlation
coefficient on the test data.
Evaluate the performance variation when changing the following
parameters. Try to find the combination giving the best performance:
- number of sequences for training (parameter LEARNC, 100-590),
- specific set of training sequences (parameter LSKIP determines where the training set starts and stops; note that the last 50 sequences in the file are used as the test set in this exercise, and it is important that you do not train on these sequences),
- window size (parameter NWSIZE, try for example windows
of 5, 9, 13, 17 amino acids),
- number of hidden units (parameter N2HID, try 10 - 30).
- random initial values for the weights before training (parameter ISSEED, try a large odd integer),
- number of training epochs (parameter LSTOP, typically 10 - 50).
(Training is time-consuming with high parameter values: with N2HID set to 50, LEARNC to 500, and LSTOP to 50, a run may take around 15 minutes.)
Note:
After trying different values for these parameters, decide which network
gives the best performance.
For that best run, write down the run number displayed at the top of
the output page (a value like 8421541), as well as the window size
and the number of hidden units used in that run.
You will need them to identify and reuse the network you have just trained.
Report the performance of your best network
In this part of the exercise you will not train a new network. Instead,
you will run your best network, which you trained previously, and generate detailed
output needed for comparison with networks made by other groups.
- Enter the run number that you saved before as the value of "Network to use";
- Change NWSIZE and N2HID to the corresponding values;
- Change the number of training samples, LEARNC, to 0;
- Change the value of IVIRGN to 'no';
- Change the value of IACTIV to 'yes'.
- Make sure that ITSKIP is set to 597 and TESTC to 50.
This will put the network in "test mode". The howlite program
will use the trained network to predict on the test
sequences. Output will be produced for the test sequences only, and the actual network output from the two output
units (helix and non-helix) will be shown for each amino acid in each test sequence.
- Run the network in this mode.
Examine the output: see how the binary helix/non-helix decision is made from the
winner-takes-all interpretation of the output values.
Output example:
93 F H H 0.560 0.444
The format is: position in sequence, amino acid, category assignment
in test file, category assignment by network, output value from 'Helix'
neuron, output value from 'non-Helix' neuron. In this case the output
from the 'Helix' neuron is larger than the output from the 'non-Helix'
neuron, and the amino acid is therefore predicted to be in a helix
("winner-takes-all").
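The winner-takes-all rule above can be sketched in a few lines of Python. The field order follows the example line shown; the parsing itself is an illustration, not part of howlite.

```python
# Sketch: apply the winner-takes-all rule to one line of howlite
# output. Field order is taken from the example in the text:
# position, amino acid, observed category, predicted category,
# 'Helix' output, 'non-Helix' output.

def winner_takes_all(helix_out, nonhelix_out):
    """Predict the category whose output unit has the larger value."""
    return "H" if helix_out > nonhelix_out else "-"

line = "93 F H H 0.560 0.444"
pos, aa, observed, predicted, helix_out, nonhelix_out = line.split()
print(winner_takes_all(float(helix_out), float(nonhelix_out)))  # H
```

Since 0.560 > 0.444, the 'Helix' unit wins and the residue is predicted as helix, matching the prediction column in the example line.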
Q8: What is the output for residue 18 in the first test sequence?
- Click on the 'REPORT' button at the bottom of the output page.
Your results
will now contribute to the emerging ensemble of networks. Its performance
will be discussed at the final session of the exercise (see below).
- If time allows, you may continue to experiment with the run-time
parameters and make additional runs.
Analyze the results
Q9: You have now tried different parameters on the neural networks
(mainly: window size, hidden units, training sequences, and training epochs),
and you have created your best network. Please make sure you have done the following:
written down the run number (the ID of the network) displayed at the top of the
output page, and reported your ONE best performance
by clicking on the 'REPORT' button at the bottom of the output page. Then
Kristoffer will help us collect the performance figures from each group. Finally, we
will evaluate the reported networks to find out which network is the real best.