
Exercise 3: hoWWWlite - Artificial neural network simulator for symbol sequences




Training a neural network predictor: Background Information

The topic of this exercise will be to examine how to train a neural network predictor. We have a training data set available containing protein sequences and their corresponding secondary structures, and now we wish to find out whether a neural network can learn from these data and predict the secondary structure of proteins from their amino-acid sequence alone.

A brief overview of neural networks

(For a thorough discussion of neural networks please turn to your textbooks).

Many people tend to view neural networks as black boxes. You pour data in from one side, and results pour out from the other - what happens in between is magic. The purpose of this exercise is to lift part of that veil for you. In short, a typical neural network consists of two or three layers of neurons: the input layer, the hidden layer (optional), and the output layer. Neurons are connected across layers (but never within layers), and each connection has a weight, a measure of the influence one neuron holds over another.

The neurons of the input layer are responsible for reading the input sequence, but each neuron only examines a tiny part of the input, and thus casts its vote based on a very narrow viewpoint. If no hidden layer is present, the output neurons, which are responsible for the final outcome, make their predictions based solely on an evaluation of the votes cast by the input layer. Such a network will tend to focus on specific amino acids occurring in specific positions, e.g. it might look for whether the amino acid 'A' (Alanine) occurs in the 13th position of the input sequence.
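How an input neuron ends up 'seeing' only a single amino acid at a single position follows from the way the sequence window is encoded before it is fed to the network. The sketch below is not taken from howlite; it merely illustrates the common sparse (one-hot) encoding of a window, and the plain 20-letter amino-acid alphabet used here is an assumption.

import numpy as np

# Sparse (one-hot) window encoding, for illustration only.
# Assumption: a plain 20-letter amino-acid alphabet; a real simulator may
# add extra symbols, e.g. a spacer for window positions that fall outside
# the sequence.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def encode_window(window):
    """Turn a window of amino acids into one input vector.

    Each window position gets a block of len(ALPHABET) inputs, exactly one
    of which is set to 1 - so each input neuron 'sees' one amino acid at
    one position.
    """
    vec = np.zeros(len(window) * len(ALPHABET))
    for pos, aa in enumerate(window):
        if aa in ALPHABET:
            vec[pos * len(ALPHABET) + ALPHABET.index(aa)] = 1.0
    return vec

print(encode_window("ARNDCEQGHILKM").shape)   # a 13-residue window -> (260,)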

If, on the other hand, a hidden layer is present, it will serve as a mediator between the input neurons and the output neurons, changing the recommendations of the input neurons as it deems necessary. The purpose is to capture underlying correlations in the data. With a hidden layer present, the network can also evaluate specific combinations of input. Now, our network might only find an 'A' in the 13th position interesting provided that an 'A' was also present at the 12th position. Understanding the role of the hidden neurons is key to understanding the true power of neural networks.
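To make the layered picture concrete, here is a minimal forward pass through a network with one hidden layer. This is an illustrative sketch only, not the howlite implementation; the layer sizes, the random weights and the sigmoid activation are all assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed sizes for illustration: 260 inputs (a sparse-encoded 13-residue
# window), 30 hidden neurons, 2 output neurons (helix / non-helix).
rng = np.random.default_rng(0)
W_hidden = rng.normal(scale=0.1, size=(30, 260))   # input -> hidden weights
W_output = rng.normal(scale=0.1, size=(2, 30))     # hidden -> output weights

def forward(x):
    """Propagate one encoded window through the network."""
    hidden = sigmoid(W_hidden @ x)        # hidden layer combines the input 'votes'
    output = sigmoid(W_output @ hidden)   # [helix activity, non-helix activity]
    return output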

The process of training a neural network consists of optimizing the weights (or synapses) of the network, emphasizing some neurons at the expense of others, so that the network becomes more adept at distinguishing between different classes of input.
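In code, 'optimizing the weights' boils down to nudging each weight in the direction that reduces the prediction error on the training examples. The fragment below is a generic gradient-descent (delta-rule) update for a single-layer network; it is a sketch of the idea, not the training rule actually used by howlite.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy single-layer network: 260 inputs, 2 outputs (helix / non-helix).
# The learning rate is an assumed value, for illustration only.
rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(2, 260))
learning_rate = 0.05

def train_step(x, target):
    """One gradient-descent update on a single training example.

    x      : encoded input window (length 260)
    target : desired output, e.g. [1, 0] for helix, [0, 1] for non-helix
    """
    global W
    y = sigmoid(W @ x)
    error = target - y
    delta = error * y * (1.0 - y)           # squared-error gradient through the sigmoid
    W += learning_rate * np.outer(delta, x)
    return float(np.sum(error ** 2))         # remaining error on this example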

The problem

The exercise in data-driven neural networks concerns prediction of protein secondary structure from the linear sequence of amino acids. The exercise is carried out using a simple neural-network simulator - howlite - and ready-made sequence data constructed by converting Brookhaven coordinate data into secondary structure assignments with the Kabsch and Sander DSSP program.

Take a look at the sequence data linking the linear sequence of amino acids and the secondary structure assignments. The file contains 134 protein sequences of varying length with helix/non-helix assignments for each amino acid.

In the exercise the aim is to produce a two-category helix/non-helix network, i.e., a neural network that predicts whether a given amino acid belongs to a helix or not. Helix is marked by H in the data file; all other symbols are treated as non-helix and assigned C in the output from the network. Initially, we will use the first 20 sequences for training and the last 10 sequences for testing. We will not be using the complete set, as this would be too time consuming.
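To see what the two-category reduction amounts to, the snippet below collapses a per-residue assignment string into the H/C labels described above. Only the H-versus-everything-else rule comes from the exercise; the exact file format and the example string are hypothetical.

def to_two_categories(assignment):
    """Collapse a per-residue assignment string into H (helix) / C (non-helix).

    Only 'H' counts as helix; every other symbol (sheet, turn, coil, ...)
    becomes 'C', as specified in the exercise.
    """
    return "".join("H" if symbol == "H" else "C" for symbol in assignment)

print(to_two_categories("  HHHHHHH EE TT "))   # 'CCHHHHHHHCCCCCCC'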



Exercise Part I - The hoWWWlite interface

TIP: The student is advised to keep two browser windows open throughout the exercise. The cookbook (this page) should be loaded into one while the neural network training should take place in the other. The new browser should open automatically when you click on the link below. If it doesn't, right click on the link and select the appropriate item (should be something like 'open in new window').

Training a howlite network

Make yourself familiar with the WWW based neural network simulator:

1.1 Take a brief look at the instructions given on that page, especially the list of run-time parameters.

1.2 When you have checked the different run-time parameters, make a trial training run by clicking on the 'Run' button at the bottom of the main page. Do not modify any parameters this time; just use the default settings. The program will run for 10 learning cycles and produce a trained neural network. The output of the program will appear in the browser window, showing a report of the training process with statistics and correlation coefficients for each learning cycle. Start by writing the run number down. You will find it directly below the headline: a number of seven or more digits that identifies the network you just made. You'll need it later on. At the bottom you should find two lines giving the statistics for the cycles which showed optimal performance on the training and test sets. For an explanation of the correlation coefficients and how they relate to network performance, please look here.
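For a two-category prediction such as helix/non-helix, the correlation coefficient reported is commonly the Matthews correlation coefficient, computed from the counts of correctly and incorrectly predicted residues. Treat the exact formula used by howlite as an assumption; the sketch below only shows how such a coefficient is typically calculated and why 0 corresponds to random guessing.

from math import sqrt

def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a two-class prediction.

    tp/fp : residues predicted as helix that are / are not helix
    tn/fn : residues predicted as non-helix that are / are not helix-free
    Returns a value between -1 and 1; 0 corresponds to random guessing,
    1 to a perfect prediction.
    """
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(matthews_cc(tp=50, tn=120, fp=30, fn=40))   # made-up counts, roughly 0.37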

1.3 At the bottom of the page with the output from the network, you'll find a button called 'plot'. Clicking on it brings up a series of four plots illustrating the progress of the training process. These are different graphical representations of the output.

Using the network as a predictor

While the correlation coefficient values give some indication of the quality of predictions made by the network, they do not actually give us any predictions! To view the predictions, you'll need the run number from 1.2. (If you closed the output window without writing the number down, simply re-train the network.)

1.4 Open a new neural network simulator and enter the run number that you wrote down before into the box called "Network to use". Then change the value for the 'network mode' parameter, selecting 'prediction'.


This will put the network in "test mode". The howlite program will now use the trained network parameters for predicting on the test sequences. Output will be produced for the test sequences only, and for each amino acid in each test sequence, the actual network output from the two output units (helix and non-helix, assigned H and C respectively) will be shown.

1.5 Now click the run button to run the network in this mode. Examine the output: see how the binary helix/coil decision is made from the winner-takes-all interpretation of the output values.

Output example:
93 F H H  0.560 0.444
The format is: position in sequence, amino acid, category assignment in test file, category assignment by network, output value from 'Helix' neuron, output value from 'non-Helix' neuron. In this case the output from the 'Helix' neuron is larger than the output from the 'non-Helix' neuron, and the amino acid is therefore predicted to be in a helix ("winner-takes-all").
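In code, the winner-takes-all rule is simply a comparison of the two output values. The small parser below assumes the six-field line format described above; the exact whitespace layout of the real output is an assumption.

def interpret(line):
    """Apply winner-takes-all to one prediction line.

    Expects the six fields described above: position, amino acid,
    assignment in the test file, assignment by the network,
    'Helix' output, 'non-Helix' output.
    """
    pos, aa, true_cat, net_cat, helix, non_helix = line.split()
    prediction = "H" if float(helix) > float(non_helix) else "C"
    return pos, aa, prediction, prediction == true_cat

print(interpret("93 F H H  0.560 0.444"))   # ('93', 'F', 'H', True)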


As you'll probably agree, although the network performs better than simple random guessing, there is plenty of room for improvement. Let's see if we can do something about that.



Exercise Part II - Optimizing network performance

This, the final part of the exercise, is structured primarily around a series of multiple choice questions. To get the maximum benefit from this exercise, you should take the time to consider each question carefully before answering. There is no prize for finishing first!

When the exercise is over, we'll make a show of hands and see how many of you figured out the right answers.

(Note that more than one answer for each question may be correct).

2.1 Open a new neural network simulator. Leaving all other values at default, try increasing the number of learning cycles (parameter 'epocs') to 35 and compare the training results with your initial network from 1.2.

Would you conclude that the extra cycles made the network into a better predictor?

A) Yes, some improvement.
B) No difference.
C) Actually, the performance became worse.


2.2 Keeping epocs at '35', try training a larger network with an input window size of '41' and compare it with the results from the small network trained in the previous question.

The larger network has an easier time learning the sequences in the learning set; yet, it does no better on the test set. Realizing this, you conclude that:

A) Larger networks have more weights to optimize, and will learn slower. More learning cycles would help.
B) Larger networks are more prone to overfitting. Rather than learning to recognize structurally relevant patterns, the larger network just memorizes all the training sequences.
C) The test set must be polluted with eccentric proteins having non-standard structures. These cannot be predicted correctly, and thus prevent us from obtaining a higher correlation coefficient.

Bonus question: Looking at the top of the output from howlite, you should see that the input layer (layer 1) consists of '861' cells (neurons). Yet, the size of the input window we used to create the network was '41' amino acids. Why does a window size of '41' translate into '861' neurons?


2.3 Return the input window size to '13', but keep epocs at '35'. Now train two networks, a large one with '30' hidden neurons and a small one with zero (thus eliminating the hidden layer completely).

Comparing the two networks, what can you conclude?

A) The small network peaks at epoch '5', the larger not until epoch '25'. The larger network has more weights to optimize and would benefit from more learning cycles.
B) Increasing the number of hidden neurons increases the number of weights, and larger networks are still more prone to overfitting.
C) Hidden neurons look for underlying correlations in the data, but the training data may have few underlying correlations.


2.4 Re-train the two networks from 2.3, but this time increase the number of training sequences to '60'.

Compare the two networks with each other, and with those from 2.3. Now, what can you conclude?

A) Hardly surprising, more training data improves network performance. However, the larger network could benefit from additional learning cycles.
B) Increasing the number of training sequences makes the network less prone to overfitting, because the additional data makes the training sequences harder to memorize.
C) With the addition of new training sequences, the networks can suddenly reveal hidden correlations in the data.


2.5 By now you should have some feel for the effects different run-time parameters have on network performance. If you've done everything so far exactly as specified by this cookbook, the best correlation coefficient you've obtained should be '0.4671'. Try to experiment with the hoWWWlite interface and see if you can outdo this.