Exercise X - artificial neural network prediction - cookbook

Keep two browser windows open throughout the exercise: this cookbook in one, and the neural network simulator, where the training takes place, in the other.

The exercise has three parts. First, you will experiment with the neural network simulator to build the best network you can for predicting protein secondary structure from amino acid sequences. Then, you will check the performance of your network in test mode. Finally, we will compare the results from all groups and discuss what makes a good network.

Data-driven neural network prediction

The problem

The exercise in data-driven neural networks concerns the prediction of protein secondary structure from the linear sequence of amino acids. It will be carried out using a simple neural network simulator, howlite, and ready-made sequence data in which Brookhaven coordinate data have been converted into secondary structure assignments by the Kabsch and Sander DSSP program.

Take a look at the sequence data linking the linear sequence of amino acids and the secondary structure assignments. The file contains 647 protein sequences of varying length with secondary structure assignments for each amino acid.

The aim of the exercise is to produce a two-category helix/non-helix network, i.e., a neural network that predicts whether a given amino acid belongs to a helix or not. Helix is denoted by H in the data file; all other symbols are treated as non-helix. Initially, we will use the first 100 sequences for training and the last 50 sequences for testing.
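The labelling and the initial split described above can be sketched in a few lines of Python. This is only an illustration, not the simulator's own code: the function names are hypothetical, and the data file is assumed to have already been read into a list with one entry per protein.

```python
def helix_labels(assignments: str) -> list[int]:
    """Map a DSSP assignment string to 1 (helix, 'H') or 0 (any other symbol)."""
    return [1 if symbol == "H" else 0 for symbol in assignments]


def split_first_last(items: list, n_train: int = 100, n_test: int = 50):
    """Return the first n_train items for training and the last n_test for testing."""
    if n_train + n_test > len(items):
        raise ValueError("training and test sets would overlap")
    return items[:n_train], items[-n_test:]


# Stand-in for the 647 proteins in the data file (indices only, no real data).
proteins = list(range(647))
train, test = split_first_last(proteins)
print(len(train), len(test), test[0])
```

With 647 entries, the test set starts at index 597 (counting from 0), so no training sequence from the first 100 can end up in it.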

The method

(NOTE, Mac users: this currently does not run on Safari, but works well on Firefox)

Make yourself familiar with the web-based neural network simulator:

  • 1. Take a brief look at the instructions, especially the list of run-time parameters. You do not need to understand every parameter to complete the exercise.

    When you have checked the different run-time parameters, make a trial training run by clicking on the 'Run' button at the bottom of the main page. Do not modify any parameters this time; just use the default settings. The program will run for 10 epochs, and produce a trained neural network. The output of the program will appear in the browser window. It will show a report of the training process; at the end of the training run the best training and test set performances will be reported.

    Q1: Let us take a look at the architecture of the neural network you are using. What is the window size? How many hidden units are there? How many sequences are used for training and for testing?

    Q2: Since the computational units take numerical inputs, amino acid sequences must be encoded as numbers. How many input units does this neural network use?

    Q3: Check the performance of the network. What are the best correlation coefficient values on the training set and on the test set? In which epochs were they reached?

    Q4: Which parameters do you think should be changed in order to improve the performance and why?

  • 2. On the main page, change the value of the LSTOP parameter to 25 to make the network train for 25 epochs.

    Run the program again.

    Q5: Check the performance of the network. What are the best correlation coefficient values on the training set and on the test set? In which epochs were they reached?

    Q6: Explain why the performance changed.

    When howlite has finished running, click on the 'PLOT' button at the bottom of the resulting page. This will produce learning curves: four plots of performance measures will be displayed. The measures are:

    a. Percentage correct helix/non-helix
    b. Average error
    c. Category percent correct
    d. Correlation coefficient

    Q7: Look at the correlation plot. Does the learning curve seem to have reached its maximum by epoch 25?
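To make the performance measures more concrete, here is a minimal sketch of how a correlation coefficient and a percentage correct can be computed for a two-category prediction. It assumes that the reported correlation is the Matthews correlation coefficient, which is commonly used for secondary structure prediction; the cookbook does not state which definition howlite uses. The function names and the two example strings are made up for illustration.

```python
import math


def confusion_counts(observed: str, predicted: str):
    """Count true/false positives/negatives, treating 'H' as the positive class."""
    tp = tn = fp = fn = 0
    for obs, pred in zip(observed, predicted):
        if obs == "H" and pred == "H":
            tp += 1
        elif obs != "H" and pred != "H":
            tn += 1
        elif pred == "H":
            fp += 1  # predicted helix, observed non-helix
        else:
            fn += 1  # predicted non-helix, observed helix
    return tp, tn, fp, fn


def correlation_coefficient(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denominator == 0 else (tp * tn - fp * fn) / denominator


def percent_correct(tp: int, tn: int, fp: int, fn: int) -> float:
    """Percentage of residues assigned to the correct category."""
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)


counts = confusion_counts("HHHH----", "HHH-H---")
print(correlation_coefficient(*counts), percent_correct(*counts))
```

Unlike the percentage correct, the correlation coefficient stays near zero for a network that predicts the same category for every residue, which is one reason it is a useful measure when helix and non-helix residues are not equally frequent.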

Optimize the training

The idea in this section is that all groups independently experiment with a number of parameters to create as good a network as possible, i.e., the network with the highest correlation coefficient on the test data.

Evaluate the performance variation when changing the following parameters. Try to find the combination giving the best performance:

  1. number of sequences for training (parameter LEARNC, 100-590),

  2. specific set of training sequences (parameter LSKIP determines where the training set starts and stops; please note that the last 50 sequences in the file are used as the test set in this exercise, and it is important that you do not train on these sequences),

  3. window size (parameter NWSIZE, try for example windows of 5, 9, 13, 17 amino acids),

  4. number of hidden units (parameter N2HID, try 10 - 30).

  5. random initial values for the weights before training (parameter ISSEED, try a large odd integer).

  6. number of training epochs (parameter LSTOP, typically 10 - 50).

(Training with high parameter values is time-consuming. If you set N2HID to 50, LEARNC to 500 and LSTOP to 50, you may have to wait around 15 minutes for the run to finish.)
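A rough way to see why large values are slow is to count the weights that must be updated for every training example. The sketch below is an estimate under stated assumptions, not a description of howlite's internals: it assumes a fully connected network with one hidden layer, one bias per unit, two output units, and a sparse encoding with `n_symbols` input units per window position (20 here; the simulator may use a different number, for example an extra symbol for positions beyond the sequence ends).

```python
def network_size(nwsize: int, n2hid: int, n_symbols: int = 20, n_outputs: int = 2):
    """Return (number of input units, number of weights including biases)."""
    n_inputs = nwsize * n_symbols
    # Each hidden unit sees every input plus a bias; each output unit sees
    # every hidden unit plus a bias.
    n_weights = (n_inputs + 1) * n2hid + (n2hid + 1) * n_outputs
    return n_inputs, n_weights


for nwsize in (5, 9, 13, 17):
    for n2hid in (10, 30):
        print(nwsize, n2hid, network_size(nwsize, n2hid))
```

Under these assumptions the number of weights grows roughly in proportion to NWSIZE times N2HID, and the total training time grows further with the number of training sequences (LEARNC) and epochs (LSTOP).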


After trying different values for these parameters, decide which network performs best. For that network, write down the run number displayed at the top of the output page (it looks like this: 8421541), together with the window size and the number of hidden units used in that run. You will need them to identify and use the network you have just trained.
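If you prefer to keep your notes in a small script rather than on paper, something like the following would do. The record layout and names are hypothetical, and the numbers are placeholders that only show the format; they are not results from the simulator.

```python
from dataclasses import dataclass


@dataclass
class Run:
    run_number: str          # shown at the top of the output page
    nwsize: int              # window size used in the run
    n2hid: int               # number of hidden units used in the run
    test_correlation: float  # best test-set correlation reported for the run


def best_run(runs: list[Run]) -> Run:
    """Pick the run with the highest test-set correlation coefficient."""
    return max(runs, key=lambda run: run.test_correlation)


# Placeholder entries for illustration only.
log = [
    Run("8421541", nwsize=13, n2hid=10, test_correlation=0.30),
    Run("8421542", nwsize=17, n2hid=20, test_correlation=0.35),
]
print(best_run(log))
```

Keeping the run number as a string avoids accidentally dropping leading zeros if the simulator ever produces them.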

Report the performance of your best network

In this part of the exercise you will not train a new network. Instead, you will run your best network, which you trained previously, and generate detailed output needed for comparison with networks made by other groups.

  1. Enter the run number that you saved before as the value of "Network to use";

  2. Change NWSIZE and N2HID to the corresponding values;

  3. Change the number of training samples, LEARNC, to 0;

  4. Change the value of IVIRGN to 'no';

  5. Change the value of IACTIV to 'yes';

  6. Make sure that ITSKIP is set to 597 and TESTC to 50.

This will put the network in "test mode". The howlite program will use the trained network to predict on the test sequences. Output will be produced for the test sequences only, and the actual network output from the two output units (helix and non-helix) will be shown for each amino acid in each test sequence.
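The test-set values in step 6 follow from the file size: with 647 sequences and the last 50 reserved for testing, 597 sequences come before the test set. The sketch below spells out this arithmetic and adds a check that a training set does not reach into the test set. It assumes that ITSKIP and LSKIP are the numbers of sequences skipped before the test set and the training set start, and that LEARNC counts consecutive sequences; check the simulator instructions for the exact meaning. The function names are hypothetical.

```python
TOTAL_SEQUENCES = 647  # number of protein sequences in the data file


def itskip_for_last(total: int, testc: int) -> int:
    """Number of sequences to skip so that the test set is the last testc sequences."""
    return total - testc


def training_overlaps_test(lskip: int, learnc: int, itskip: int) -> bool:
    """True if a training set of learnc sequences starting after lskip reaches the test set."""
    return lskip + learnc > itskip


itskip = itskip_for_last(TOTAL_SEQUENCES, 50)
print(itskip)                                    # 597
print(training_overlaps_test(0, 590, itskip))    # False
print(training_overlaps_test(100, 500, itskip))  # True
```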

  1. Run the network in this mode.

    Examine the output: see how the binary helix/coil decision is made from the winner-takes-all interpretation of the output values.

    Output example:

    93 F H H  0.560 0.444

    The format is: position in sequence, amino acid, category assignment in test file, category assignment by network, output value from 'Helix' neuron, output value from 'non-Helix' neuron. In this case the output from the 'Helix' neuron is larger than the output from the 'non-Helix' neuron, and the amino acid is therefore predicted to be in a helix ("winner-takes-all").

    Q8: What is the output for residue 18 in the first test sequence?

  2. Click on the 'REPORT' button at the bottom of the output page.

    Your results will now contribute to the emerging ensemble of networks. Its performance will be discussed at the final session of the exercise (see below).

  3. If time allows, you may continue to experiment with the run-time parameters and make additional runs.
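The winner-takes-all rule from step 1 can be checked by hand or with a short script. The sketch below parses one output line in the format described above and compares the two output values. The names are hypothetical, and treating a tie as non-helix is an assumption; the cookbook does not say how howlite resolves ties.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    position: int        # position in sequence
    residue: str         # amino acid
    observed: str        # category assignment in test file
    predicted: str       # category assignment by network
    helix_out: float     # output value from the 'Helix' neuron
    nonhelix_out: float  # output value from the 'non-Helix' neuron


def parse_output_line(line: str) -> Prediction:
    """Split one whitespace-separated output line into its six fields."""
    position, residue, observed, predicted, helix_out, nonhelix_out = line.split()
    return Prediction(int(position), residue, observed, predicted,
                      float(helix_out), float(nonhelix_out))


def predicts_helix(prediction: Prediction) -> bool:
    """Winner-takes-all: helix if the 'Helix' output is the larger one."""
    return prediction.helix_out > prediction.nonhelix_out


record = parse_output_line("93 F H H  0.560 0.444")
print(record.position, record.residue, predicts_helix(record))  # 93 F True
```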

Analyze the results

Q9: You have now tried different parameters for the neural network (mainly window size, number of hidden units, training sequences, and number of training epochs), and you have created your best network. Please make sure you have done the following: written down the run number (the ID of the network) displayed at the top of the output page, and reported your ONE best network by clicking on the 'REPORT' button at the bottom of the output page. Kristoffer will then help us collect the performance from each group. Finally, we will evaluate the reported networks to find out which of the groups' best networks is really the BEST.