Data-driven Neural Network Prediction

The exercise in data-driven neural networks concerns the prediction of protein secondary structure from the linear sequence of amino acids. It is carried out using a simple neural network simulator, howlite, and ready-made sequence data constructed by converting Brookhaven coordinate data into secondary structure assignments with the Kabsch and Sander DSSP program.

  1. Take a look at the sequence data linking the linear sequence of amino acids and the secondary structure assignments.

    aa2dssp.seq

    Alternatively, you can use an editor, for instance nedit, to view the sequences in the file:

    nedit /usr/opt/www/pub/CBS/phdcourse/cookbooks/neuralnets/aa2dssp.seq

    The aim of the exercise is to produce a two-category helix/non-helix network, i.e., a neural network that predicts whether a given amino acid belongs to a helix or not. Helix is denoted by H in the data file; all other symbols are treated as non-helix. We will use the first 20 sequences for training and the next 10 sequences for testing.
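    The two-category labelling described above can be illustrated with a minimal Python sketch. The function name helix_labels and the example assignment string are hypothetical and are not taken from aa2dssp.seq; the sketch only shows the rule that H counts as helix and every other DSSP symbol counts as non-helix.

```python
def helix_labels(dssp: str) -> list[int]:
    """Map a DSSP assignment string to 1 (helix, 'H') or 0 (any other symbol)."""
    return [1 if symbol == "H" else 0 for symbol in dssp]


# Hypothetical assignment string, for illustration only.
example_dssp = "--HHHHEETT-GGG"
print(helix_labels(example_dssp))
```

    Note that under this rule other helix-like DSSP symbols, such as G, are also counted as non-helix, because only H is treated as helix.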

  2. Make yourself familiar with the neural network simulator and especially the run-time parameter file how.dat. Type

    cd neuralnets

    followed by

    less how.dat

    Briefly read the man page for the howlite program by typing:

    man howlite

    When you have checked the different run-time parameters, make a trial training run by typing:

    howlite < how.dat

    The program will run for 10 epochs and produce a trained neural network. (The weights and thresholds making up the trained network are saved in a file named synapse.syn.) At the end of the training run, the best training and test set performances will be reported.

  3. Start an editing session of the how.dat file:

    nedit how.dat &

    Change the value of the ISSEED parameter. This changes the initialization of the random number generator and therefore produces a distinct network. Choose a large odd integer. Also change the value of the LSTOP parameter to 25 to make the network train for 25 epochs.

  4. Produce a learning curve by redirecting the output from a training run to a file, for example:

    howlite < how.dat > how.run

    When howlite has finished running, view the temporal evolution of the neural network performance by using the howplot script:

    howplot how.run

    The script will show four plots of performance measures in a separate window when you press RETURN four times in the command window. The measures are:

    a. Percentage correct helix/non-helix

    b. Average error

    c. Category percent correct

    d. Correlation coefficient


    If you like, you can print the plots in two steps. First run the following command, pressing RETURN four times as before:

    howplot keep=how.run how.run

    Then run:

    gnu2ps how.run.gnu | lp
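    Two of the measures plotted in this step, percentage correct and correlation coefficient, can be sketched in Python as follows. The text does not give the exact definitions used by howlite, so this sketch assumes the standard overall percentage correct and the Matthews correlation coefficient for a two-category problem. The function names and the example strings are hypothetical, and "-" is used here as a stand-in for any non-helix symbol.

```python
import math


def confusion_counts(actual: str, predicted: str, positive: str = "H"):
    """Count true/false positives and negatives for the helix category."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def percent_correct(tp: int, fp: int, tn: int, fn: int) -> float:
    """Percentage of residues assigned to the right category."""
    total = tp + fp + tn + fn
    return 100.0 * (tp + tn) / total if total else 0.0


def correlation_coefficient(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient (assumed definition); 0.0 if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical actual and predicted assignments for eight residues.
counts = confusion_counts("HHHH----", "HHH-H---")
print(percent_correct(*counts), correlation_coefficient(*counts))
```

    Unlike the percentage correct, the correlation coefficient is not inflated when one category dominates the data, which is why both are worth inspecting.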

  5. Evaluate how the performance varies when the window size parameter NWSIZE in the how.dat file is changed. Try, for example, windows of 5, 9, 13, and 17 amino acids, and record the maximal test set performance for each network architecture.
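    To make the role of the window size concrete, here is a minimal Python sketch of a sliding window centred on each residue. How howlite treats the ends of a sequence is not described in this text, so the padding symbol "-" is an assumption, and both the function name windows and the example sequence are hypothetical.

```python
def windows(sequence: str, nwsize: int, pad: str = "-") -> list[str]:
    """Return one window of nwsize residues centred on each sequence position."""
    if nwsize < 1 or nwsize % 2 == 0:
        raise ValueError("window size must be a positive odd number")
    half = nwsize // 2
    # Pad both ends so that the first and last residues also get a full window.
    padded = pad * half + sequence + pad * half
    return [padded[i:i + nwsize] for i in range(len(sequence))]


# Hypothetical amino acid sequence, for illustration only.
for size in (5, 9, 13, 17):
    first_window = windows("MKVLAAGIVGLLLA", size)[0]
    print(size, first_window)
```

    A larger window gives the network more sequence context for each prediction, but it also increases the number of inputs and therefore the number of weights to be trained.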

  6. Make your contribution to a network ensemble by training a network on data selected by your group.

    The sequence file contains a total of 134 sequences. The idea in this exercise is that all groups independently select 20 of these sequences for training a neural network. The selection is made by changing the LSKIP parameter. (Please note that the last 10 sequences in the file are used as the test set in this exercise, and it is important that you do not train on them. This can be avoided by choosing a value of LSKIP between 0 and 103.) When you have selected a value, train a network with NWSIZE=13 and N2HID=10 for 25 epochs (LSTOP=25).

    howlite < how.dat

    When the network has been trained, change the value for the number of training parameters, LEARNC, to 0, the value of IVIRGN to -1, and the value of IACTIV to 1. This puts the network in "test mode". The howlite program will now use the trained network parameters (which have just been saved in the synapse.syn file) to predict on the test sequences. Output is produced for the test sequences only, and for each amino acid in each test sequence the actual network output from the two output units is shown. Run the network in this mode and dump the output into a file:

    howlite < how.dat > how.ensemble

    Be sure NOT to delete the how.ensemble file, as it will be used later for the ensemble. Take a look at the single-window output activities in the how.ensemble file, and see how the binary helix/non-helix decision is made from the winner-takes-all interpretation of the output values.

    Output example:

          93 F H H  0.560 0.444
    

    The format is: position in sequence, amino acid, category assignment in test file, category assignment by network, output value from 'Helix' neuron, output value from 'non-Helix' neuron. In this case the output from the 'Helix' neuron is larger than the output from the 'non-Helix' neuron, and the amino acid is therefore predicted to be in a helix ("winner-takes-all").

    If time allows, you may continue to experiment with the run-time parameters and make additional runs, but remember to send the output to the screen or to a file with a name other than how.ensemble.
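    The output format and the winner-takes-all rule described in this step can be sketched in Python as follows. The sketch assumes that every output line has six whitespace-separated fields, as in the example line shown above. The names Prediction, parse_line and winner_takes_all are hypothetical, and so is the "-" symbol used here for the non-helix category, since the text does not show which symbol howlite prints for non-helix.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    position: int        # position in sequence
    residue: str         # amino acid
    actual: str          # category assignment in test file
    predicted: str       # category assignment by network
    helix_out: float     # output value from the 'Helix' neuron
    nonhelix_out: float  # output value from the 'non-Helix' neuron


def parse_line(line: str) -> Prediction:
    """Split one output line into its six fields (assumed whitespace-separated)."""
    pos, aa, actual, predicted, helix, nonhelix = line.split()
    return Prediction(int(pos), aa, actual, predicted, float(helix), float(nonhelix))


def winner_takes_all(helix_out: float, nonhelix_out: float, nonhelix_symbol: str = "-") -> str:
    """Predict helix only when the 'Helix' output is the larger of the two."""
    return "H" if helix_out > nonhelix_out else nonhelix_symbol


p = parse_line("      93 F H H  0.560 0.444")
print(p.position, p.residue, winner_takes_all(p.helix_out, p.nonhelix_out))
```

    For the example line, the 'Helix' output (0.560) exceeds the 'non-Helix' output (0.444), so the sketch reproduces the H assignment made by the network.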

  7. Gather in plenum, where each group reports the performance of its network. The performance of the network ensemble based on the 15 versions of the how.ensemble file will then be evaluated.
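    One possible way to combine the networks into an ensemble is sketched below in Python. The text does not state how the ensemble prediction is formed, so this sketch assumes that the two output values are averaged over all networks for each residue and that winner-takes-all is then applied to the averages. The function name ensemble_predict, the "-" symbol for non-helix, and the output values in the example are all hypothetical.

```python
def ensemble_predict(outputs_per_network, nonhelix_symbol: str = "-") -> list[str]:
    """Average (helix, non-helix) outputs over networks, then apply winner-takes-all.

    outputs_per_network holds one list per network; each list contains one
    (helix_out, nonhelix_out) pair per residue, in the same residue order.
    """
    n_networks = len(outputs_per_network)
    decisions = []
    # zip(*...) groups the pairs that belong to the same residue across networks.
    for per_residue in zip(*outputs_per_network):
        mean_helix = sum(h for h, _ in per_residue) / n_networks
        mean_nonhelix = sum(n for _, n in per_residue) / n_networks
        decisions.append("H" if mean_helix > mean_nonhelix else nonhelix_symbol)
    return decisions


# Hypothetical outputs from three networks for the same two residues.
network_outputs = [
    [(0.560, 0.444), (0.30, 0.70)],
    [(0.40, 0.60), (0.20, 0.80)],
    [(0.70, 0.30), (0.55, 0.45)],
]
print(ensemble_predict(network_outputs))
```

    In this example the second network alone would predict non-helix for the first residue, but the averaged outputs favour helix. This illustrates how an ensemble can overrule the decision of an individual network.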