
Sequence encoding and feed-forward algorithm

In today's exercise you will implement two schemes for amino acid encoding, and the feed-forward algorithm for artificial neural network prediction.


Implementing the algorithms


First, make a new directory for today's exercise and copy some data files.

cd
mkdir NN
cd NN
cp /home/projects/mniel/ALGO/data/NN/{c000,f000,syn_sp.dat} .

You should now have three files (c000, f000, syn_sp.dat) in the NN directory. Check this by typing

ls -ltr

The c000 and f000 files contain peptides for network training, and syn_sp.dat contains the synaps weights from a network training run.

Next, copy the program templates to your src directory. (Remember the "." at the end of each command.)

cd
cd src
cp /home/projects/mniel/ALGO/code/NN-1/seq2inp.c .
cp /home/projects/mniel/ALGO/code/NN-1/nnforward.c .

The first program, seq2inp, reads a list of peptides of equal length from a file and encodes the amino acids using either sparse encoding (values of 0.90 and 0.05) or, with the -bl option, Blosum encoding (the Blosum vector divided by 5.0). For the 9-residue peptides used here, the program translates each line into 180 numbers (9 positions times 20 amino acids) followed by the binding affinity value.
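The sparse scheme can be sketched as follows. This is only an illustration, not the code from the seq2inp.c template: the function name sparse_encode, the alphabet order, and the choice of 0.90 for the observed amino acid and 0.05 for the other 19 are assumptions.

```c
#include <string.h>

#define N_AA 20

/* Assumed alphabet order; the course template may use a different one. */
static const char ALPHABET[] = "ARNDCQEGHILKMFPSTWYV";

/* Sparse-encode a peptide (assumed scheme): 0.90 for the observed amino
 * acid and 0.05 for each of the other 19. Writes N_AA values per residue
 * into out and returns the number of values written, or -1 if a character
 * is not found in ALPHABET. */
int sparse_encode(const char *pep, float *out)
{
    size_t len = strlen(pep);

    for (size_t i = 0; i < len; i++) {
        const char *hit = strchr(ALPHABET, pep[i]);
        if (hit == NULL)
            return -1;

        int idx = (int)(hit - ALPHABET);
        for (int a = 0; a < N_AA; a++)
            out[i * N_AA + a] = (a == idx) ? 0.90f : 0.05f;
    }
    return (int)(len * N_AA);
}
```

The caller must supply a buffer with room for 20 values per residue, that is, at least 180 floats for a 9-residue peptide.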

The input file has the format

ILYQVPFSV 0.8532
GGNNSPAVY 0.0
YLDLALMSV 0.8425
HFADPFSCP 0.0
RMYGVLPWI 0.6889
MLQDMAILT 0.5269
SLYFGGICV 0.7819
YLVAYQATV 0.6391
VIHAFQYVI 0.3433
MMWYWGPSL 0.7704

where the first column gives the peptide and the second column gives the binding affinity in rescaled units, so that a value of 1 is perfect binding and a value of 0 is non-binding.
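Reading one line of this format could look like the sketch below. The function name parse_line and the buffer size PEP_MAX are hypothetical; the template may read its input differently.

```c
#include <stdio.h>

#define PEP_MAX 64   /* hypothetical upper bound on peptide length + 1 */

/* Parse one "PEPTIDE AFFINITY" line into pep and aff.
 * Returns 1 if both fields were read, 0 otherwise. */
int parse_line(const char *line, char pep[PEP_MAX], float *aff)
{
    /* %63s limits the peptide to PEP_MAX - 1 characters plus terminator. */
    return sscanf(line, "%63s %f", pep, aff) == 2;
}
```

Lines that do not contain a word followed by a number, such as comment lines, make the function return 0, so the caller can skip them.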

The second program, nnforward, reads two files. The first is a file listing synaps files (or a single synaps file if the program is called with the -s option). Each synaps file contains the weights of an artificial neural network. The second file contains the neural network input generated by the seq2inp program.


seq2inp

Open the file seq2inp.c in your favorite editor. Go to the main procedure. Make sure you understand the structure of the program.

Your task is to fill in the missing code (XXXX). Again, make sure you understand the structure of each routine before you fill in its missing code.

Next, compile the program

make seq2inp

When the compilation is successful, copy the executable to the bin directory

cp seq2inp ../bin
rehash

The rehash command updates the table that the shell uses to keep track of all executable programs, so that it includes your new program seq2inp.

Now go to the NN directory. First check the content of the files c000 and f000 by typing

head c000 f000

Next, run the seq2inp program on the file c000

seq2inp c000 | grep -v "#" > c000_sp

Look at the output. How does the output compare to the c000_sp?

Now do the encoding using the Blosum50 matrix

seq2inp -bl c000 | grep -v "#" > c000_bl

Look at the output. How does the output compare to the c000_bl?
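The Blosum scheme replaces each amino acid by its row of the substitution matrix, scaled by 1/5. A minimal sketch of the scaling step is shown here; the function name blosum_encode_residue is hypothetical, and the caller is assumed to have already looked up the matrix row for the observed amino acid.

```c
#define N_AA 20

/* Encode one residue from its substitution-matrix row.
 * row: the N_AA integer scores for the observed amino acid (supplied by
 *      the caller, e.g. from a Blosum50 matrix read elsewhere).
 * out: receives the N_AA scores divided by 5.0. */
void blosum_encode_residue(const int row[N_AA], float out[N_AA])
{
    for (int a = 0; a < N_AA; a++)
        out[a] = (float)row[a] / 5.0f;
}
```

Calling this once per position of a 9-residue peptide again yields 180 input values, so both encodings give the network the same number of inputs.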

When both encoding schemes are working, encode the f000 file as well.

seq2inp f000 | grep -v "#" > f000_sp
seq2inp -bl f000 | grep -v "#" > f000_bl

nnforward

Now go back to the src directory and open the file nnforward.c. Take some time to make sure you understand the structure of the program, including how it handles ensembles of networks using the linked list structure. Fill in the missing code (XXXXXX), and compile the program.

                        
make nnforward

When the compilation is successful, copy the executable to the bin directory

cp nnforward ../bin
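The idea of keeping an ensemble of networks in a linked list can be sketched as below. The struct NetNode, its fields, and the choice to average the predictions are all assumptions; the template defines its own list type. To keep the sketch short, each "network" is reduced to a plain weighted sum standing in for a full forward pass.

```c
#include <stddef.h>

/* Hypothetical list node: one network per node. */
typedef struct NetNode {
    const double *w;        /* weights of this (stand-in) network */
    int n_in;               /* number of inputs */
    struct NetNode *next;   /* next network in the ensemble, or NULL */
} NetNode;

/* Stand-in for a full forward pass: a weighted sum of the inputs. */
static double net_forward(const NetNode *net, const double *in)
{
    double sum = 0.0;
    for (int i = 0; i < net->n_in; i++)
        sum += net->w[i] * in[i];
    return sum;
}

/* Walk the list and average the predictions of all networks (assumed
 * combination rule). Returns 0.0 for an empty list. */
double ensemble_predict(const NetNode *head, const double *in)
{
    double sum = 0.0;
    int n = 0;

    for (const NetNode *p = head; p != NULL; p = p->next) {
        sum += net_forward(p, in);
        n++;
    }
    return n > 0 ? sum / n : 0.0;
}
```

A list with a single node behaves like a single network, which matches running with one synaps file.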

Now go to the NN directory. You can now test that your feed-forward algorithm works using the syn_sp.dat file. Make sure you understand the content of this file.

  • You have an input with 9 amino acids. How does that relate to the number of neurons in the first layer?
  • How many hidden neurons does the network have?
  • And how many output values?

Can you make sense of the number of synaps weights in the file? Count them with

cat syn_sp.dat | grep -v ":" | grep -v TEST | wc

The second column of the output from this command gives the number of weights in the synaps file. Can you make sense of this number (365)?
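When checking such a number, it can help to write down how many weights a fully connected network with one hidden layer has. The helper below is a sketch under the assumption that every hidden and output neuron has one bias weight in addition to its input weights; the function name count_weights is hypothetical.

```c
/* Number of weights in a fully connected network with one hidden layer,
 * assuming one bias weight per hidden neuron and per output neuron. */
int count_weights(int n_in, int n_hid, int n_out)
{
    return (n_in + 1) * n_hid + (n_hid + 1) * n_out;
}
```

For example, a toy network with 4 inputs, 3 hidden neurons and 1 output has (4 + 1) * 3 + (3 + 1) * 1 = 19 weights. Try the same calculation with the answers you gave to the questions above.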

The synaps file was generated by training a neural network on sparse-encoded input from the seq2inp program. You must therefore use sparse-encoded input when you use the nnforward program to predict binding.

You can predict binding for the peptides in the file c000 using the command

nnforward -s syn_sp.dat c000_sp | grep -v "#" > c000_sp_pred

How does the output compare to the c000_sp_pred?

What is the predictive performance of the neural network (in terms of the Pearson correlation)?

cat c000_sp_pred | gawk '{print $1,$3}' | xycorr

What would happen if you used Blosum encoding to predict binding with a network trained on sparse encoding?

nnforward -s syn_sp.dat c000_bl | grep -v "#" | gawk '{print $1,$3}' | xycorr

You can now copy a network trained on Blosum-encoded input:

cp /home/projects/mniel/ALGO/data/NN/syn_bl.dat .

You can predict binding for the peptides in the file c000 using the command

nnforward -s syn_bl.dat c000_bl | grep -v "#" > c000_bl_pred

How does the output compare to the c000_bl_pred?

What is the predictive performance of the neural network (in terms of the Pearson correlation)?

cat c000_bl_pred | gawk '{print $1,$3}' | xycorr

You can test whether combining the Blosum and sparse encoded networks improves the predictive performance by averaging their predictions:

paste c000_sp_pred c000_bl_pred | gawk '{print ($1+$4)/2,$3}' | xycorr

What is the predictive performance of the averaged method?

Now you are done.