hoWWWlite - Artificial neural network simulator for symbol sequences
Cookbook, exercise 1
Data-driven neural network prediction
The student is advised to keep two browser windows open throughout the
exercise. The cookbook (i.e., this page) should be loaded into one window,
while the neural network training takes place in the other.
The problem
The exercise in data-driven neural networks concerns the prediction of protein
secondary structure from the linear sequence of amino acids. The
exercise is carried out using a simple neural network simulator - howlite
- and ready-made sequence data constructed by converting Brookhaven
coordinate data into secondary structure assignments with the Kabsch and
Sander DSSP program.
Take a look at the
sequence data
linking the linear sequence of amino acids to the secondary structure
assignments. The file contains 134 protein sequences of varying length,
with a helix/non-helix assignment for each amino acid.
The aim of the exercise is to produce a two-category helix/non-helix
network, i.e., a neural network that predicts whether a given
amino acid belongs to a helix or not. Helix is assigned by H in the data
file; all other symbols are treated as non-helix. Initially, we will use the
first 20 sequences for training and the last 10 sequences for testing.
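The two-category labelling described above can be sketched as follows. This is a hypothetical illustration, not the actual howlite input handling; the assignment string format is assumed from the description of the data file.

```python
def helix_labels(assignment: str) -> list[int]:
    """Map a per-residue assignment string to binary labels:
    'H' (helix) -> 1, every other symbol (non-helix) -> 0."""
    return [1 if symbol == "H" else 0 for symbol in assignment]

# Example: a short stretch containing one helical segment.
print(helix_labels("CCHHHHCC"))  # [0, 0, 1, 1, 1, 1, 0, 0]
```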
The method
Make yourself familiar with the WWW based
neural network simulator:
- 1. Take a brief look at the instructions, especially the list of
run-time parameters.
When you have checked the different run-time parameters, make a trial
training run by clicking the 'Run' button at the bottom of the main
page. Do not modify any parameters this time; just use the default
settings. The program will run for 10 epochs and produce a trained
neural network. The output of the program will appear in the browser
window as a report of the training process; at the end of the
training run the best training and test set performances will be
reported.
- 2. In the main page, change the value of the ISSEED parameter. This
parameter changes the initialization of the random number generator
and therefore produces a distinct network. Choose a large odd integer.
Also change the value of the LSTOP parameter to 25 to make
the network run for 25 epochs.
Run the program again. When howlite has finished running, click on
the 'PLOT' button at the bottom of the resulting page. This will produce
learning curves: four plots of performance measures will be displayed.
The measures are:
a. Percentage correct helix/non-helix
b. Average error
c. Category percent correct
d. Correlation coefficient
If you like, you can print the page containing the plots directly
from your browser to the local printer.
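For a two-category prediction, the correlation coefficient is commonly computed as the Matthews correlation coefficient from the counts of true/false helix and non-helix predictions. Whether howlite's plotted "correlation coefficient" uses exactly this definition is an assumption; the formula below is the standard two-category one.

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary (helix/non-helix)
    prediction; ranges from -1 (total disagreement) to +1 (perfect)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# A perfect prediction gives 1.0; a mixed one gives an intermediate value.
print(mcc(tp=10, tn=10, fp=0, fn=0))          # 1.0
print(round(mcc(tp=50, tn=80, fp=10, fn=20), 3))
```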
Optimize the training
The idea in this exercise is that all groups independently experiment
with a number of parameters, trying to create as good a network as
possible, i.e., the network with the highest correlation
coefficient on the test data.
Evaluate the performance variation when changing the following
parameters. Try to find the combination giving the best performance:
- sequences for training (parameter LSKIP; please note that the last
10 sequences in the file are used as the test set in this exercise,
and it is important that you do not train on these sequences),
- window size (parameter NWSIZE, try for example windows
of 5, 9, 13, 17 amino acids),
- number of training epochs (parameter LSTOP, typically 5 - 30).
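The window-size parameter can be pictured as follows: for each residue, the network sees a window of NWSIZE amino acids centred on the residue being predicted. This is a hypothetical sketch of such an encoding, not howlite's actual input code; padding the termini with a dummy symbol 'X' is an assumption.

```python
def windows(sequence: str, nwsize: int):
    """Yield one window of nwsize residues per position in the sequence,
    padded with 'X' at the termini so every residue has a full window."""
    assert nwsize % 2 == 1, "window size must be odd so it has a centre"
    half = nwsize // 2
    padded = "X" * half + sequence + "X" * half
    for i in range(len(sequence)):
        yield padded[i:i + nwsize]

for w in windows("MKTAYIA", 5):
    print(w)  # XXMKT, XMKTA, MKTAY, KTAYI, TAYIA, AYIAX, YIAXX
```

A larger NWSIZE gives the network more sequence context per prediction, at the cost of more input units to train.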
When you get your best result, save the run number displayed at the top of
the output page (typical value: 8421541), alongside the window size
(parameter NWSIZE, typical value: 13) and the number of hidden
neurons (parameter N2HID, typical value: 10) used in that run.
You will need them to identify and use the network you have just trained.
Report the performance of your best network
In this part of the exercise you will not train a new network. Instead,
you will run your best network, trained previously, and generate detailed
output needed for comparison with networks made by the other groups.
- Enter the run number that you saved before as the value of "Network to use";
- Change NWSIZE and N2HID to the corresponding values;
- Change the value of the training parameter LEARNC to 0;
- Change the value of IVIRGN to 'no';
- Change the value of IACTIV to 'yes'.
This will put the network in "test mode". The howlite program
will now use the trained network parameters for predicting on the test
sequences. Output will be produced for the test sequences only, and for each
amino acid in each test sequence the actual network output from the two output
units (helix and non-helix) will be shown.
- Run the network in this mode and examine the output: see how the binary
helix/non-helix decision is made from the winner-takes-all interpretation
of the output values.
Output example:
93 F H H 0.560 0.444
The format is: position in sequence, amino acid, category assignment
in the test file, category assignment by the network, output value from the
'Helix' neuron, output value from the 'non-Helix' neuron. In this case the
output from the 'Helix' neuron is larger than the output from the 'non-Helix'
neuron, and the amino acid is therefore predicted to be in a helix
("winner-takes-all").
- Click on the 'REPORT' button at the bottom of the output page.
Your results will now contribute to the emerging ensemble of networks,
whose performance will be discussed at the final session of the
exercise (see below).
- If time allows, you may continue to play around with the run-time
parameters and make additional runs.
Analyze the results
Gather in plenary session in the lecture room (the same as at the beginning
of the exercise). The performance of the network ensemble based on
the reported networks will be evaluated.