|
Sequence encoding and feed-forward algorithm
In today's exercise you will implement two schemes for amino acid encoding and
implement the feed-forward algorithm for artificial neural network prediction.
Implementing the algorithms
First, make a new directory for today's exercise and copy some data files.
cd
mkdir NN
cd NN
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/data/NN/{c000,f000,syn_sp.dat} .
You should now have three files (c000, f000, syn_sp.dat) in the NN directory. Check this by typing
ls -ltr
The c000 and f000 files contain peptides for network training, and syn_sp.dat contains the synapses (weights)
from a trained network.
Next, you must copy the program templates to your src directory. (Remember the "." at the end of the command.)
cd
cd src
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/code/NN-1/seq2inp.c .
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/code/NN-1/nnforward.c .
The first program, seq2inp, reads a list of peptides of equal length from a file and encodes the amino acids using
either sparse encoding (values 0.90 and 0.05) or, with the -bl option, Blosum encoding (the Blosum vector divided by 5.0).
The program translates each line into 180 numbers followed by the binding affinity value.
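The 180 numbers correspond to 9 peptide positions times 20 amino acids. The following is a minimal sketch of sparse encoding, not the code from seq2inp.c. It assumes that 0.90 marks the amino acid present at a position and 0.05 marks the others, and the alphabet order and the function name sparse_encode are chosen here for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ALPHABET_SIZE 20

/* Assumed amino acid order; the order used in seq2inp.c may differ. */
static const char ALPHABET[] = "ARNDCQEGHILKMFPSTWYV";

/*
 * Sparse-encode a peptide into out, which must hold
 * strlen(pep) * ALPHABET_SIZE doubles (180 for a 9-mer).
 * Assumption: 0.90 marks the amino acid present at a position and
 * 0.05 marks the 19 others.
 * Returns the number of values written, or -1 on an unknown letter.
 */
int sparse_encode(const char *pep, double *out)
{
    assert(pep != NULL && out != NULL);

    size_t len = strlen(pep);
    for (size_t pos = 0; pos < len; pos++) {
        const char *hit = strchr(ALPHABET, pep[pos]);
        if (hit == NULL)
            return -1;

        size_t aa = (size_t)(hit - ALPHABET);
        for (size_t j = 0; j < ALPHABET_SIZE; j++)
            out[pos * ALPHABET_SIZE + j] = (j == aa) ? 0.90 : 0.05;
    }
    return (int)(len * ALPHABET_SIZE);
}
```

Calling sparse_encode("ILYQVPFSV", buffer) with a buffer of 180 doubles fills one block of 20 values per position.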
The input file has the format
ILYQVPFSV 0.8532
GGNNSPAVY 0.0
YLDLALMSV 0.8425
HFADPFSCP 0.0
RMYGVLPWI 0.6889
MLQDMAILT 0.5269
SLYFGGICV 0.7819
YLVAYQATV 0.6391
VIHAFQYVI 0.3433
MMWYWGPSL 0.7704
where the first column gives the peptide and the second column gives the binding affinity in rescaled units, so that
a value of 1 is perfect binding and a value of 0 is non-binding.
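Reading one line of this format could be sketched as follows. The function name parse_peptide_line and the 63-character limit are assumptions made for this example; the template may read its input differently.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_PEPTIDE_LEN 63

/*
 * Parse one "PEPTIDE AFFINITY" line.
 * peptide must have room for MAX_PEPTIDE_LEN + 1 characters.
 * Returns 1 if both columns were read, 0 otherwise.
 */
int parse_peptide_line(const char *line, char *peptide, double *affinity)
{
    assert(line != NULL && peptide != NULL && affinity != NULL);

    /* %63s stops at whitespace and never overruns the buffer. */
    return sscanf(line, "%63s %lf", peptide, affinity) == 2;
}
```

A line with a missing second column is rejected, so a malformed input file is detected before encoding starts.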
The second program, nnforward, reads two files. The first is a file listing synaps files (or a single synaps
file if called with the -s option). Each synaps file contains the artificial neural network weights.
The second file contains the neural network input as generated from the seq2inp program.
seq2inp
Open the file seq2inp.c in your favorite editor. Go to the main procedure. Make sure you understand the
structure of the program.
You must fill in the missing code (XXXX). Make sure you understand the structure of each routine
before you fill in its missing code.
Next, compile the program
make seq2inp
When the compilation is successful, copy the executable to the bin directory
cp seq2inp ../bin
rehash
The rehash command updates the table the shell uses to keep track of all executable programs, so that it
includes your new program seq2inp.
Now go to the NN directory. First check the content of the files c000 and f000 by typing
head c000 f000
Next, run the seq2inp program on the file c000
seq2inp c000 | grep -v "#" > c000_sp
Look at the output. How does the output compare to the c000_sp?
Now do the encoding using the Blosum50 matrix
seq2inp -bl c000 | grep -v "#" > c000_bl
Look at the output. How does the output compare to the c000_bl?
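For comparison with your own solution, Blosum encoding can be sketched as below: each amino acid is replaced by its row of the substitution matrix, divided by 5.0. This is an illustration under assumptions. The matrix is supplied by the caller (the Blosum50 values are not reproduced here), and the alphabet order and the name blosum_encode are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ALPHABET_SIZE 20

/* Assumed amino acid order; it must match the row/column order of the matrix. */
static const char ALPHABET[] = "ARNDCQEGHILKMFPSTWYV";

/*
 * Blosum-encode a peptide: each residue becomes its row of the
 * substitution matrix, divided by 5.0.
 * matrix: ALPHABET_SIZE * ALPHABET_SIZE scores in row-major order,
 *         supplied by the caller (for example read from a Blosum50 file).
 * out:    must hold strlen(pep) * ALPHABET_SIZE doubles.
 * Returns the number of values written, or -1 on an unknown letter.
 */
int blosum_encode(const char *pep, const int *matrix, double *out)
{
    assert(pep != NULL && matrix != NULL && out != NULL);

    size_t len = strlen(pep);
    for (size_t pos = 0; pos < len; pos++) {
        const char *hit = strchr(ALPHABET, pep[pos]);
        if (hit == NULL)
            return -1;

        const int *row = matrix + (size_t)(hit - ALPHABET) * ALPHABET_SIZE;
        for (size_t j = 0; j < ALPHABET_SIZE; j++)
            out[pos * ALPHABET_SIZE + j] = row[j] / 5.0;
    }
    return (int)(len * ALPHABET_SIZE);
}
```

The output has the same length as the sparse encoding, so both encodings feed a network with the same number of input neurons.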
When both encoding schemes are working, encode the f000 file as well.
seq2inp f000 | grep -v "#" > f000_sp
seq2inp -bl f000 | grep -v "#" > f000_bl
nnforward
Now go back to the src directory and open the file nnforward.c. Spend some time to make sure you understand the
structure of the program. Fill in the missing code (XXXXXX), and compile the program.
Make sure you understand how the program can deal with ensembles of networks using the linked list structure.
make nnforward
When the compilation is successful, copy the executable to the bin directory
cp nnforward ../bin
Now go to the NN directory. You can now test that your feed-forward algorithm works using the
syn_sp.dat file. Make sure you understand the content of this file.
- You have an input with 9 amino acids. How does that relate to the number of neurons in the first layer?
- How many hidden neurons does the network have?
- And how many output values?
Can you understand the number of synaps weights in the file?
cat syn_sp.dat | grep -v ":" | grep -v TEST | wc
The second column of the output gives the number of weights in the synaps file. Can you make sense of this
number (365)?
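One way to check your reasoning is to count the weights for a candidate architecture. The sketch below assumes a fully connected network with one hidden layer and one bias weight per hidden and output neuron; whether the synaps file follows exactly this layout is an assumption, and count_weights is a hypothetical helper.

```c
#include <assert.h>

/*
 * Number of weights in a fully connected network with one hidden layer,
 * assuming every hidden and output neuron also has one bias weight.
 */
int count_weights(int n_in, int n_hid, int n_out)
{
    assert(n_in > 0 && n_hid > 0 && n_out > 0);

    return (n_in + 1) * n_hid + (n_hid + 1) * n_out;
}
```

Trying a few values of n_hid with the input and output sizes of this exercise shows which architecture is consistent with the number of weights in the file.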
The synaps file was generated from a neural network trained on sparse-encoded input from the seq2inp program.
You must hence use sparse-encoded input when you use the nnforward program to predict binding.
You can predict binding for the peptides in the file c000 using the command
nnforward -s syn_sp.dat c000_sp | grep -v "#" > c000_sp_pred
How does the output compare to the c000_sp_pred?
What is the predictive performance of the neural network (in terms of the Pearson correlation)?
cat c000_sp_pred | gawk '{print $1,$3}' | xycorr
What would have happened if you had used Blosum encoding to predict binding with a network trained on sparse
encoding?
nnforward -s syn_sp.dat c000_bl | grep -v "#" | gawk '{print $1,$3}' | xycorr
You can now copy a network trained on Blosum-encoded input:
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/data/NN/syn_bl.dat .
You can predict binding for the peptides in the file c000 using the command
nnforward -s syn_bl.dat c000_bl | grep -v "#" > c000_bl_pred
How does the output compare to the c000_bl_pred?
What is the predictive performance of the neural network (in terms of the Pearson correlation)?
cat c000_bl_pred | gawk '{print $1,$3}' | xycorr
You can test whether combining Blosum and sparse encoding improves the predictive performance
paste c000_sp_pred c000_bl_pred | gawk '{print ($1+$4)/2,$3}' | xycorr
What is the predictive performance of the averaged method?
Now you are done.
|