Artificial neural network training and back-propagation

In this exercise you will implement the back-propagation algorithm for artificial neural network training.

Implementing the algorithms

First, make sure that you have the data files needed for the exercise. These files were copied earlier to the NN directory. If you did not do the first ANN exercise, you can copy the data files to the NN directory using the commands below. NOTE: you should only do this if you DID NOT do the earlier ANN exercise.
cd NN

cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/data/NN/{c000,f000} .

You should now have the files c000 and f000 in the NN directory. These are data files with peptides for network training.

Next, copy the program template to your src directory. (Remember the "." at the end of the command.)

cd src
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/code/NN-2/nnbackprop.c .

The program nnbackprop.c trains a neural network using back-propagation. The program reads a sequence encoded peptide file made with the program seq2inp from yesterday, and runs back-propagation to minimize the error between the predicted and target values for each data point. The program takes a list of command-line options:

Usage: nnbackprop [-h] [args] inputfile

        [-v]                 0                    Verbose mode
        [-nh int]            2                    Number of hidden neurons
        [-syn filename]      syn.dat              Name of synaps file
        [-s int]             -1                   Seed [-1] Default, [0] Time [>0] Specific seed
        [-nc int]            500                  Number of iterations
        [-eta float]         0.050000             Eta for weight update
        [-ol filename]       trainpred.out        File for training data prediction output
        [-ot filename]       testpred.out         File for test data prediction output
        [-tf filename]                            File with test input
        [-nt int]            10                   Test interval
        [-bl float]          0.000010             Limit for backpropagation
        [-w float]           0.100000             Initial value for weight
        [-dtype int]         0                    Dump type [0] Error [1] Pearson

Most of the options are self-explanatory. The -tf option gives a sequence encoded test file; these data are used to decide when to dump and save the synapse file. The -dtype option defines which criterion is used to decide if the synapses are saved: with -dtype 0 [default] the synapses are dumped when the test error is lowest, otherwise they are dumped when the test Pearson's correlation is maximal. The -eta option defines the step size for the gradient descent minimization, and -bl defines the threshold for the absolute value of the error required to trigger back-propagation.
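To make the roles of -eta and -bl concrete, here is a minimal sketch (NOT the template's actual code; all names are illustrative): an example only triggers back-propagation when the absolute error exceeds the -bl limit, and each weight update is a gradient-descent step scaled by -eta.

```c
#include <math.h>

/* Returns 1 if this example should trigger back-propagation (-bl check). */
int backprop_triggered(double output, double target, double bl)
{
        return fabs(output - target) >= bl ? 1 : 0;
}

/* One gradient-descent step on a single weight: w <- w - eta * dE/dw (-eta). */
double update_weight(double w, double eta, double dEdw)
{
        return w - eta * dEdw;
}
```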

Open the file nnbackprop.c in your favorite editor. Go to the main procedure. Make sure you understand the structure of the program.

You shall fill in the missing code (XXXX). Again, make sure you understand the structure of the routine, and then fill in the missing code.

Next, compile the program

make nnbackprop

When the compilation is successful, copy the program to the bin directory

cp nnbackprop ../bin/
rehash

The rehash command updates a table used by the shell to keep track of all executable programs, so that it includes your new program nnbackprop. Note that this will only work if you are running tcsh.

Now go to the NN directory. Here you should have both sparse and Blosum sequence encoded versions of the two files c000 and f000, ready from the earlier ANN exercise

ls c000_sp f000_sp c000_bl f000_bl

If you do not have these files, you can copy them using the command

cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/data/NN/{c000_sp,c000_bl,f000_sp,f000_bl} .

NOTE: you should only do this if you DID NOT do yesterday's exercise.

You can now train a network with 10 hidden neurons using sparse sequence encoding with the command

nnbackprop -nh 10 -tf c000_sp -syn syn_sp_my.dat f000_sp

What are the training and test performances of the network (in terms of Pearson's correlation)? Do the same using Blosum encoding. Are there any striking differences in the course of training between the two encoding schemes? Can you explain this difference?

You can check your implementation against mine by comparing your output to the following files

sp.out (Output from sparse training)
bl.out (Output from Blosum training)

Try to train the network using sparse encoding and varying some of the training parameters (-w, -nh and -eta).

Now you are done. You have made a series of programs similar to the neural network program suite used at CBS to produce >20 Science and Nature publications, and to attract more than 50 million US dollars in funding.

If you have more time, try to modify the code to do error minimization on the function E = 1/4(O-t)^4.