Data redundancy reduction

I todays exercise you shall implement the two data redundancy algorithms Hobohm 1 and Hobohm 2 introduced in the lecture.

Login to CBS using your login and password (only if you do not work on your own laptop). Make a new directory called hobohm by typing

mkdir hobohm
Change to this directory (i.e. cd hohohm) and copy the file data.fsa and move the output from last weeks exercise (data.fsa.aln) to this directory by typing (remember the "." in the end of the command)
cp ../data/data.fsa .
mv ../data/data.fsa.aln .

If you did not get the fasta_align_all to run, you can get a copy of the output file from

cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/data/Align/data.fsa.aln.gz .

Note, this file is compressed. To uncompress the file type

gunzip data.fsa.aln.gz

You can check the content of the data.fsa.aln by typing

head data.fsa.aln

Part of the file content is shown below

ALN 1US0.A 316 2EAB.A  899 type SW_ALN alen 308 221 64  87 score  87 -99.9  -87 -99.900002 
ALN 1US0.A 316 1QWN.A 1045 type SW_ALN alen 343 228 68 115 score  95 -99.9  -95 -99.900002 
ALN 1US0.A 316 1HLR.A  907 type SW_ALN alen 207 162 43  45 score 105 -99.9 -105 -99.900002 
ALN 1US0.A 316 1MXT.A  504 type SW_ALN alen 160 126 37  34 score  62 -99.9  -62 -99.900002 
ALN 1US0.A 316 1MUW.A  386 type SW_ALN alen 190 118 37  72 score  56 -99.9  -56 -99.900002 
ALN 1US0.A 316 1SU8.A  636 type SW_ALN alen 240 172 56  68 score  67 -99.9  -67 -99.900002 
ALN 1US0.A 316 1WUI.L  534 type SW_ALN alen 232 179 53  53 score  90 -99.9  -90 -99.900002 
ALN 1US0.A 316 2BHU.A  602 type SW_ALN alen 179 137 37  42 score  84 -99.9  -84 -99.900002 
ALN 1US0.A 316 2E7Z.A  727 type SW_ALN alen 222 171 44  51 score  97 -99.9  -97 -99.900002 
ALN 1US0.A 316 1JZ7.A 1023 type SW_ALN alen 432 266 80 166 score 103 -99.9 -103 -99.900002 

Each line has the format

ALN Q-name Q-len D-name D-len type Alignment_type alen alignment-length match-len nid ngap score alignment-score -99.9 ...

Implementing the algorithms

Now we are ready to code. First you must copy the program templates to you src directory.

cd $ALGOHOME/src
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/code/hobohm/hobohm1.c .
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/code/hobohm/hobohm2.c .


Open the file hobohm1.c in your favorite editor. Go to the main procedure. Make sure you understand the structure of the program. In particular make sure you understand how the keep list is updated to keep track of the list of unique sequences.

You shall fill in the missing code in the homolog subroutine. Go to the homolog subroutine and find the place with the missing code (XXXX). Again make sure you understand the structure of the routine, and then fill out the missing code.

Next, compile the program

make hobohm1

When the compilation is successful copy the code to the bin directory

cp hobohm1 ../bin

Now go to the hobohm directory, and run the hobohm1 program on the file data.fsa

cd $ALGOHOME/hobohm
hobohm1 data.fsa > hobohm1.out

Look at the output file hobohm1.out. How many unique clusters does the algorithm find?


Now go back to the src directory and open the file hobohm2.c. Spend some time to make sure you understand the structure of the program. Fill in the missing code (XXXXXX), and compile the program

make hobohm2

When the compilation is successful copy the code to the bin directory

cp hobohm2 ../bin

Now go to the hobohm directory. The input to the hobohm2 algorithm is a file in the format

name1 name2 score

where score if the similarity score between the sequences name1 and name2. We can transform the output from the fasta_align_all into this format using the following command

cat data.fsa.aln | gawk '{if ( 2.9*sqrt($9) > $11 ) { h=0} else { h=1} print $2,$4,h}' > data.fsa.aln.fmt

This command takes each line in the file out and checks if 2.9*sqtr(alen) > nid. The syntax $9 refers to the 9th column in the line, and $11 to the 11th column. If 2.9*sqtr(alen) > nid the variable h is set to 0, otherwise it is set to 1. Next name1 ($2), name2 ($4) and h are printed to standard output.

You can now run the hobohm2 algorithm on the file data.fsa.aln.fmt

hobohm2 data.fsa.aln.fmt > hobohm2.out

Look at the output file hobohm2.out. How many unique clusters does the algorithm find? How does the number compare to the number of clusters found using the Hobohm1 algorithm.

Now you are done.