The exercise in data-driven neural networks concerns the prediction of protein secondary structure from the linear sequence of amino acids. The exercise is carried out using a simple neural network simulator, howlite, and ready-made sequence data constructed by converting Brookhaven coordinate data into secondary structure assignments with the Kabsch and Sander DSSP program.
Alternatively you can use, for instance, the nedit editor to see the sequences in the file:
nedit /usr/opt/www/pub/CBS/phdcourse/cookbooks/neuralnets/aa2dssp.seq
The aim of the exercise is to produce a two-category helix/non-helix network, i.e., a neural network that predicts whether a given amino acid belongs to a helix or not. Helix is denoted by H in the data file; all other symbols are treated as non-helix. We will use the first 20 sequences for training and the next 10 sequences for testing.
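The labelling rule can be sketched in a few lines of Python. This is only an illustration and not part of howlite; the function name `to_helix_labels` and the example string are made up for this sketch, and the exact symbols used in the data file may differ.

```python
def to_helix_labels(dssp: str) -> list[int]:
    """Map a DSSP assignment string to binary labels: 1 = helix (H), 0 = non-helix."""
    # Only the symbol H counts as helix; every other symbol is treated as non-helix.
    return [1 if symbol == "H" else 0 for symbol in dssp]


# Hypothetical assignment string, invented for illustration.
example = "TTHHHHEE"
print(to_helix_labels(example))
```

Note that under this rule even other helix-like DSSP classes end up in the non-helix category, since only H is counted as helix.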
cd neuralnets
followed by
less how.dat
Read briefly the man page for the howlite program by typing:
man howlite
When you have checked the different run-time parameters, make a trial training run by typing:
howlite < how.dat
The program will run for 10 epochs and produce a trained neural network. (The weights and thresholds making up the trained network are saved in a file named synapse.syn.) At the end of the training run the best training and test set performances will be reported.
To save the output of the run in a file, redirect it:
howlite < how.dat > how.run
When howlite has finished running, view the temporal evolution of the neural network performance by using the howplot script:
howplot how.run
The script will show four plots of performance measures in a separate window when you press RETURN four times in the command window.
The measures are:
a. Percentage correct helix/non-helix
b. Average error
c. Category percent correct
d. Correlation coefficient
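To make two of these measures concrete, here is a minimal sketch of how percentage correct and a correlation coefficient could be computed from actual and predicted assignments. It assumes that the correlation coefficient is the Matthews correlation coefficient commonly used for two-category predictions; whether howlite computes exactly this quantity is an assumption, and all function names and the example strings are hypothetical.

```python
import math


def confusion_counts(actual: str, predicted: str) -> tuple[int, int, int, int]:
    """Return (tp, tn, fp, fn), treating 'H' as the positive (helix) class."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == "H" and p == "H":
            tp += 1      # helix predicted as helix
        elif a != "H" and p != "H":
            tn += 1      # non-helix predicted as non-helix
        elif a != "H" and p == "H":
            fp += 1      # non-helix predicted as helix
        else:
            fn += 1      # helix predicted as non-helix
    return tp, tn, fp, fn


def percent_correct(tp: int, tn: int, fp: int, fn: int) -> float:
    """Percentage of residues assigned to the right category."""
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)


def matthews_correlation(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Invented example: '-' stands for any non-helix symbol.
counts = confusion_counts("HHHH----", "HHH-H---")
print(percent_correct(*counts), matthews_correlation(*counts))
```

Unlike percentage correct, the correlation coefficient is not inflated when one category dominates the data, which is why both are worth inspecting.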
If you like, you can print the plots by performing the following two steps:
howplot keep=how.run how.run (pressing RETURN four times like before)
gnu2ps how.run.gnu | lp
The sequence file contains a total of 134 sequences. The idea in this exercise is that all groups independently select 20 of these sequences for training a neural net. This selection is done by changing the LSKIP parameter. (Please note that the last 10 sequences in the file are used as the test set in this exercise, and it is important that you don't train on these sequences. This can be avoided by choosing a value of LSKIP between 0 and 103.) When you have selected a value, train a network with NWSIZE 13 and N2HID 10 for 25 epochs (LSTOP=25).
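The constraint on LSKIP can be checked with a little arithmetic. The sketch below assumes that LSKIP is the number of sequences skipped from the start of the file before the 20 training sequences are read; that interpretation is an assumption made for illustration, so consult the howlite man page for the actual meaning. The function name is hypothetical.

```python
TOTAL_SEQUENCES = 134   # sequences in the file (from the text)
TEST_SEQUENCES = 10     # the last 10 sequences form the test set
TRAIN_SEQUENCES = 20    # sequences each group trains on


def training_overlaps_test(lskip: int) -> bool:
    """Return True if the training block [lskip, lskip + 20) reaches into the test set.

    Assumes 0-based indexing and that LSKIP counts sequences skipped from the start.
    """
    first_test = TOTAL_SEQUENCES - TEST_SEQUENCES   # index of the first test sequence
    return lskip + TRAIN_SEQUENCES > first_test


# Under these assumptions, no value in the recommended range 0..103 touches the test set.
print(any(training_overlaps_test(lskip) for lskip in range(0, 104)))
```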
howlite < how.dat
When the network has been trained, set LEARNC (the number of training parameters) to 0, IVIRGN to -1, and IACTIV to 1. This will put the network in "test mode". The howlite program will now use the trained network parameters (that have just been saved in the synapse.syn file) to predict on the test sequences. Output will be produced for the test sequences only, and for each amino acid in each test sequence the actual network output from the two output units will be shown. Run the network in this mode and dump the output into a file:
howlite < how.dat > how.ensemble
Be sure NOT to delete the how.ensemble file, as it will be used later for the ensemble.
Take a look at the single window output activities in the how.ensemble file, and see how the binary helix/coil decision is made from the winner-takes-all interpretation of the output values.
Output example:
93 F H H 0.560 0.444
The format is: position in sequence, amino acid, category assignment in test file, category assignment by network, output value from 'Helix' neuron, output value from 'non-Helix' neuron. In this case the output from the 'Helix' neuron is larger than the output from the 'non-Helix' neuron, and the amino acid is therefore predicted to be in a helix ("winner-takes-all").
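The winner-takes-all rule can be illustrated by parsing the example line in Python. This is a sketch only: it assumes each prediction line has exactly the six whitespace-separated fields described above, and that ties go to non-helix (the text does not say how ties are handled). The names `Prediction`, `parse_line` and `winner_takes_all` are hypothetical, and the real how.ensemble file may contain other lines that need to be skipped.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    position: int            # position in sequence
    residue: str             # amino acid
    actual: str              # category assignment in the test file
    predicted: str           # category assignment by the network
    helix_output: float      # output value from the 'Helix' neuron
    nonhelix_output: float   # output value from the 'non-Helix' neuron


def parse_line(line: str) -> Prediction:
    """Split one output line into its six fields."""
    pos, residue, actual, predicted, helix, nonhelix = line.split()
    return Prediction(int(pos), residue, actual, predicted, float(helix), float(nonhelix))


def winner_takes_all(p: Prediction) -> str:
    """The category of the output neuron with the larger value wins."""
    return "helix" if p.helix_output > p.nonhelix_output else "non-helix"


record = parse_line("93 F H H 0.560 0.444")
print(winner_takes_all(record))
```

For the example line the helix output (0.560) exceeds the non-helix output (0.444), so the sketch reports helix, in agreement with the network's own assignment H.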
If time allows, you may continue to experiment with the run-time parameters and make additional runs, BUT remember to send the output to the screen, or to a file with a name other than how.ensemble.