
hoWWWlite - Artificial neural network simulator for symbol sequences




Cookbook, exercise 1

Data-driven neural network prediction

The student is advised to keep two browser windows open throughout the exercise. The cookbook, i.e. this page, should be loaded into one, while the neural network training takes place in the other.

The problem

The exercise in data-driven neural networks concerns prediction of protein secondary structure from the linear sequence of amino acids. It is carried out using a simple neural network simulator - howlite - and ready-made sequence data constructed by converting Brookhaven coordinate data into secondary structure assignments with the Kabsch and Sander DSSP program.

Take a look at the sequence data linking the linear sequence of amino acids and the secondary structure assignments. The file contains 134 protein sequences of varying length with helix/non-helix assignments for each amino acid.

In the exercise the aim is to produce a two-category helix/non-helix network, i.e., a neural network that predicts whether a given amino acid belongs to a helix or not. Helix is assigned by H in the data file; all other symbols are treated as non-helix. Initially, we will use the first 20 sequences for training and the last 10 sequences for testing.
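The two-category labelling can be sketched as follows. This is an illustrative helper, not part of howlite: it collapses a DSSP-style assignment string to the binary helix/non-helix labels used in the exercise, with 'H' counting as helix and every other symbol as non-helix.

```python
# Sketch: collapse a DSSP-style assignment string to the two-category
# helix/non-helix labelling used in this exercise. 'H' marks helix;
# every other symbol (E, T, S, C, ...) counts as non-helix.

def to_helix_labels(assignment: str) -> list[int]:
    """Return 1 for helix ('H') and 0 for non-helix, per residue."""
    return [1 if s == 'H' else 0 for s in assignment]

labels = to_helix_labels("CCHHHHTCEEC")
print(labels)  # [0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0]
```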

The method

Make yourself familiar with the WWW based neural network simulator:

  • 1. Take a brief look at the instructions, especially the list of run-time parameters.

    When you have checked the different run-time parameters, make a trial training run by clicking on the 'Run' button at the bottom of the main page. Do not modify any parameters this time; just use the default settings. The program will run for 10 epochs, and produce a trained neural network. The output of the program will appear in the browser window. It will show a report of the training process; at the end of the training run the best training and test set performances will be reported.

  • 2. In the main page, change the value of the ISSEED parameter. This parameter changes the initialization of the random number generator and thus produces a different network. Choose a large odd integer.

    Change also the value of the LSTOP parameter to 25 to make the network run for 25 epochs.

    Run the program again. When howlite has finished running, click on the 'PLOT' button at the bottom of the resulting page. This will produce learning curves: four plots of performance measures will be displayed. The measures are:

    a. Percentage correct helix/non-helix
    b. Average error
    c. Category percent correct
    d. Correlation coefficient

    If you like, you can print the page containing the plots directly from your browser to the local printer.
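The four measures above can be computed from per-residue predictions roughly as sketched below. This is an illustrative reconstruction, assuming binary targets (1 = helix) and network outputs in [0, 1]; the exact definitions used by howlite may differ slightly (e.g. the correlation coefficient here is the Matthews correlation).

```python
import math

# Sketch: the four learning-curve measures, computed from per-residue
# targets (1 = helix, 0 = non-helix) and raw network outputs in [0, 1].

def measures(targets, outputs):
    preds = [1 if o >= 0.5 else 0 for o in outputs]
    # a. percentage correct helix/non-helix overall
    pct = 100.0 * sum(p == t for p, t in zip(preds, targets)) / len(targets)
    # b. average (squared) error of the raw outputs
    err = sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(targets)
    # c. percent correct within each category (helix, non-helix)
    helix = [p for p, t in zip(preds, targets) if t == 1]
    nonhx = [p for p, t in zip(preds, targets) if t == 0]
    cat = (100.0 * sum(helix) / len(helix),
           100.0 * (len(nonhx) - sum(nonhx)) / len(nonhx))
    # d. Matthews correlation coefficient
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, targets))
    tn = sum(p == 0 and t == 0 for p, t in zip(preds, targets))
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, targets))
    fn = sum(p == 0 and t == 1 for p, t in zip(preds, targets))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return pct, err, cat, mcc
```

Note that a network predicting "non-helix" everywhere can score high on overall percent correct while its correlation coefficient stays near zero, which is why the correlation coefficient is used as the figure of merit in this exercise.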

Optimize the training

The idea in this exercise is that all groups independently experiment with a number of parameters, trying to create as good a network as possible, i.e., the network with the highest correlation coefficient on the test data.

Evaluate the performance variation when changing the following parameters. Try to find the combination giving the best performance:

  1. sequences for training (parameter LSKIP; please note that the last 10 sequences in the file are used as the test set in this exercise, and it is important that you don't train on these sequences),

  2. window size (parameter NWSIZE, try for example windows of 5, 9, 13, 17 amino acids),

  3. number of training epochs (parameter LSTOP, typically 5 - 30).
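The window size NWSIZE determines how much sequence context the network sees for each prediction. A minimal sketch of this sliding-window idea is shown below; the spacer symbol and function names are illustrative, and howlite's internal encoding may differ.

```python
# Sketch of the sliding-window idea behind NWSIZE: each residue is
# predicted from a window of NWSIZE amino acids centred on it, with
# positions falling outside the sequence padded by a spacer symbol.

def windows(sequence: str, nwsize: int, spacer: str = "-") -> list[str]:
    """One window of width nwsize per residue, centred on that residue."""
    half = nwsize // 2
    padded = spacer * half + sequence + spacer * half
    return [padded[i:i + nwsize] for i in range(len(sequence))]

print(windows("MKVL", 5))
# ['--MKV', '-MKVL', 'MKVL-', 'KVL--']
```

A larger window gives the network more context but also more weights to fit, so the best NWSIZE is a trade-off you explore empirically in step 2 above.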

When you get your best result, save the run number displayed at the top of the output page (typical value: 8421541) alongside the window size (parameter NWSIZE, typical value: 13) and the number of hidden neurons (parameter N2HID, typical value: 10) used in that run. You will need them to identify and use the network you have just trained.

Report the performance of your best network

In this part of the exercise you will not train a new network. Instead, you will run your best network, trained previously, and generate detailed output needed for comparison with networks made by the other groups.

  1. Enter the run number that you saved before as the value of "Network to use";

  2. Change NWSIZE and N2HID to the corresponding values;

  3. Change the value of the number-of-training-cycles parameter LEARNC to 0;

  4. Change the value of IVIRGN to 'no';

  5. Change the value of IACTIV to 'yes'.

This will put the network in "test mode". The howlite program will now use the trained network parameters for predicting on the test sequences. Output will be produced for the test sequences only, and for each amino acid in each test sequence the actual network output from the two output units (helix and non-helix) will be shown.

  1. Run the network in this mode

    Examine the output: see how the binary helix/non-helix decision is made from the winner-takes-all interpretation of the output values.

    Output example:

    93 F H H  0.560 0.444

    The format is: position in sequence, amino acid, category assignment in test file, category assignment by network, output value from the 'Helix' neuron, output value from the 'non-Helix' neuron. In this case the output from the 'Helix' neuron is larger than that from the 'non-Helix' neuron, and the amino acid is therefore predicted to be in a helix ("winner-takes-all").

  2. Click on the 'REPORT' button at the bottom of the output page.

    Your results will now contribute to the emerging ensemble of networks. The ensemble's performance will be discussed at the final session of the exercise (see below).

  3. If time allows you may continue to play around with the run-time parameters, and make additional runs.
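The winner-takes-all rule described in step 1 above can be sketched as follows. The parsing assumes the six-column output format shown in the example line; the function name is illustrative, not part of howlite.

```python
# Sketch: the winner-takes-all decision applied to one line of the
# test-mode output, e.g. "93 F H H  0.560 0.444". The category whose
# output neuron fires most strongly wins.

def winner_takes_all(helix_out: float, nonhelix_out: float) -> str:
    """Return 'H' if the helix neuron wins, '.' for non-helix."""
    return "H" if helix_out > nonhelix_out else "."

line = "93 F H H  0.560 0.444"
pos, aa, true_cat, net_cat, h_out, nh_out = line.split()
print(winner_takes_all(float(h_out), float(nh_out)))  # H
```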

Analyze the results

Gather in plenary session in the lecture room (the same room as at the beginning of the exercise). The performance of the network ensemble based on the reported networks will be evaluated.