|
Data redundancy reduction
I todays exercise you shall implement the two data redundancy algorithms Hobohm 1 and Hobohm 2 introduced in the
lecture.
Login to Linux cluster using your login and password
Make a new directory called Hobohm by typing
mkdir Hobohm
Change to this directory and copy the file data.fsa and move the output from last weeks exercise (data.fsa.aln)
to this directory by typing (remember the "." in the end of the command)
cp ../data/data.fsa .
mv ../data/data.fsa.aln .
If you did not get the fasta_align_all to run, you can get a copy of the output file from
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/data/Align/data.fsa.aln.gz .
Note, this file is compressed. To uncompress the file type
gunzip data.fsa.aln.gz
You can check the content of the data.fsa.aln by typing
head data.fsa.aln
Part of the file content is shown below
ALN 1US0.A 316 2EAB.A 899 type SW_ALN alen 308 221 64 87 score 87 -99.9 -87 -99.900002
ALN 1US0.A 316 1QWN.A 1045 type SW_ALN alen 343 228 68 115 score 95 -99.9 -95 -99.900002
ALN 1US0.A 316 1HLR.A 907 type SW_ALN alen 207 162 43 45 score 105 -99.9 -105 -99.900002
ALN 1US0.A 316 1MXT.A 504 type SW_ALN alen 160 126 37 34 score 62 -99.9 -62 -99.900002
ALN 1US0.A 316 1MUW.A 386 type SW_ALN alen 190 118 37 72 score 56 -99.9 -56 -99.900002
ALN 1US0.A 316 1SU8.A 636 type SW_ALN alen 240 172 56 68 score 67 -99.9 -67 -99.900002
ALN 1US0.A 316 1WUI.L 534 type SW_ALN alen 232 179 53 53 score 90 -99.9 -90 -99.900002
ALN 1US0.A 316 2BHU.A 602 type SW_ALN alen 179 137 37 42 score 84 -99.9 -84 -99.900002
ALN 1US0.A 316 2E7Z.A 727 type SW_ALN alen 222 171 44 51 score 97 -99.9 -97 -99.900002
ALN 1US0.A 316 1JZ7.A 1023 type SW_ALN alen 432 266 80 166 score 103 -99.9 -103 -99.900002
Each line has the format
ALN Q-name Q-len D-name D-len type Alignment_type alen alignment-length match-len nid ngap score alignment-score -99.9 ...
Implementing the algorithms
Now we are ready to code. First you must copy the program templates to you src directory.
cd
cd src
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/code/hobohm/hobohm1.c .
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/code/hobohm/hobohm2.c .
Hobohm-1
Open the file hobohm1.c in your favorite editor. Go to the main procedure. Make sure you understand the
structure of the program. In particular make sure you understand how the keep list is updated
to keep track of the list of unique sequences.
You shall fill in the missing code in the homolog subroutine. Go to the homolog subroutine and find the place
with the missing code (XXXX). Again make sure you understand the structure of the routine, and then fill out the
missing code.
Next, compile the program
make hobohm1
When the compilation is successful copy the code to the bin directory
cp hobohm1 ../bin
Now go to the hobohm directory, and run the hobohm1 program on the file data.fsa
cd
cd hobohm
hobohm1 data.fsa > hobohm1.out
Look at the output file hobohm1.out. How many unique clusters does the algorithm find?
Hobohm-2
Now go back to the src directory and open the file hobohm2.c. Spend some time to make sure you understand the
structure of the program. Fill in the missing code (XXXXXX), and compile the program
make hobohm2
When the compilation is successful copy the code to the bin directory
cp hobohm2 ../bin
Now go to the hobohm directory. The input to the hobohm2 algorithm is a file in the format
name1 name2 score
where score if the similarity score between the sequences name1 and name2. We can transform the output from
the fasta_align_all into this format using the following command
cat data.fsa.aln | gawk '{if ( 2.9*sqrt($9) > $11 ) { h=0} else { h=1} print $2,$4,h}' > data.fsa.aln.fmt
This command takes each line in the file out and checks if 2.9*sqtr(alen) > nid. The syntax $9 refers to the
9th column in the line, and $11 to the 11th column. If 2.9*sqtr(alen) > nid the variable h is set to 0, otherwise
it is set to 1. Next name1 ($2), name2 ($4) and h are printed to standard output.
You can now run the hobohm2 algorithm on the file data.fsa.aln.fmt
hobohm2 data.fsa.aln.fmt > hobohm2.out
Look at the output file hobohm2.out. How many unique clusters does the algorithm find? How does the number
compare to the number of clusters found using the Hobohm1 algorithm.
Now you are done.
|