Events News Research CBS CBS Publications Bioinformatics
Staff Contact About Internal CBS CBS Other

Sequence alignment

In todays exercise you shall implement a program to perform sequence alignment.

The exercise has four parts

  • Make a program to perform Smith Waterman sequence alignment in O3 time. O3 time means that the execution time increases to the third power of the sequence length.
  • Make a program to perform Smith Waterman sequence alignment in O2 time
  • Compare execution time for the O3 and O2 programs on biological data
  • Make a program to perform all against all sequence alignment of a database of protein sequences
  • You might find some of the information on this link useful for debugging your code Matrix dumps from alignment programs (to be used for debugging)

Connect to the Linux cluster

Login to the CBS server (login.cbs.dtu.dk) using your CBS username (stud0XX) and password.

Smith Waterman in O3 time

First you must copy the program templates of today exercise into your src directory

cd src
cp /home/projects/mniel/ALGO/code/align/fasta_align_O3.c .
cp /home/projects/mniel/ALGO/code/align/fasta_align.c .
cp /home/projects/mniel/ALGO/code/align/fasta_align_all.c .

Now we can get started. The two files fasta_align_O3.c and fasta_align.c contain program templates for Smith Waterman sequence alignment in O3 and O2 time, respectively. Finally the file fasta_align_all.c contains a program template for performing an all against all sequence alignment.

Open the fasta_align_O3.c in your favorite editor (say nedit) and see if you can understand the structure of the program. Check what each subroutine is doing, and make sure you understand how the main program is organized.

Go the the sw_alignment routine and find the line

/* HERE YOU MUST FILL OUT THE MISSING CODE (XXXXX)*/

This is where you must fill in the missing code. At all places where there are XXX you are expected to write some code. Once you believe you have completed the code, save the program and exit the editor. You can now compile the program by typing

make fasta_align_O3

If the compilation went fine without any error messages you are very smart (or very lucky). Most likely the compiler will come out with some error messages. Try to understand these and make the corrections in the code. Once you have a program that can be compiled, you can test the program by typing

fasta_align_O3 ../data/1PLC._.fsa ../data/1PLB._.fsa

The output to the screen should look like

# fasta_align_O3 ../data/1PLC._.fsa ../data/1PLB._.fsa
# Thu Sep 17 21:40:25 2009
# User: stud000
# PWD : /home/people/stud000/src
# Host: Linux life 2.6.16.54-0.2.12-default ia64
# Command line parameters set to:
#       [-gf float]          11.000000            penalty for first gap
#       [-gn float]          1.000000             penalty for next gap
#       [-mal int]           3                    Minimum alignment lenght
#       [-m filename]        /home/people/stud000/data/BLOSUM50  Scoring matrix
#       [-v]                 0                    Verbose mode
#       [-show]              0                    Show matrices
# Read 25 elements on linelist /home/people/stud001/data/BLOSUM50
# Read_realblosum done. Alphabet ARNDCQEGHILKMFPSTWYV
# Number of FASTA entries read 1 from file ../data/1PLC._.fsa
# Number of FASTA entries read 1 from file ../data/1PLB._.fsa
# Gap penalties. fgap: 11.000000. ngap: 1.000000
ALN 1PLC.____mol:aa_ELECTRON_TRANSPORT 99 1PLB.____mol:aa_ELECTRON_TRANSPORT_PROTEIN 97 type SW_ALN alen 98 96 64 2 score 439 -99.9 -439 -99.900002
QAL 1PLC.____m     1 DVLLGADDGSLAFVPSEFSISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLNAKGETFEVALSNKGEYSFYCSPHQGAGMVGKVTVN
DAL 1PLB.____m     1 EVKLGSDDGGLVFSPSSFTVAAGEKITFKNNAGFPHNIVFDEDEVPAGVNAEKISQPE--YLNGAGETYEVTLTEKGTYKFYCEPHAGAGMKGEVTVN

The output consists of some comment lines (lines starting with a #), followed by the alignment. The alignment comes in three lines. The first line starts with the keywork ALN. This line has the format

ALN Q-name Q-len D-name D-len type Alignment_type alen alignment-length match-len nid ngap score alignment-score -99.9 ...

Make sure you understand each of these fields. The next two lines contain the sequence alignment in the format

QAL Q-name q-offset Q-alignment
DAL D-name d-offset D-alignment

Here the q-offset and d-offset are starting location in the query and database sequences, respectively (counting from 0).

When the program is working as it should, you can copy it to the bin directory. Note, you might have another location of your bin directory compared to what is writen below)

cp fasta_align_O3 ~/bin
rehash

Now, you can test how fast the program is by using the following command

time fasta_align_O3 ../data/1PLC._.fsa ../data/test.fsa > test.out1

This command will give the CPU time needed for the program to align the sequence 1PLC._.fsa against all (1055) sequences in the test.fsa file. This might take some time, so you can open another terminal window and continue the exercise here.

The output will look like

144.992u 0.088s 2:25.28 99.8% 

Here, the first value is the CPU time used to do the calculation. Write down the CPU time used by your program.


Smith Waterman in O2 time

Now you can look at the other alignment program. This program implements the Smith Waterman alignment algorithm in O2 time using the method described by Gotoh. Open the file fasta_align.c in your favorite editor and make sure you understand the structure of the program. Check what each subroutine is doing, and make sure you understand how the main program is organized.

Go the the sw_alignment routine and find the line

/* HERE YOU MUST FILL OUT THE MISSING CODE (XXXXX)*/
This is where you must fill in the missing code. At all places where there are XXX you are expected to write some code. Once you believe you have completed the code, save the program and exit the editor. You can now compile the program by typing
make fasta_align
If the compilation went fine without any error messages you can test the program by typing
fasta_align ../data/1PLC._.fsa ../data/1PLB._.fsa

The output should look like

# fasta_align ../data/1PLC._.fsa ../data/1PLB._.fsa
# Thu Sep 17 21:43:22 2009
# User: stud001
# PWD : /home/people/stud001/src
# Host: Linux life 2.6.16.54-0.2.12-default ia64
#       [-gf float]          11.000000            penalty for first gap
#       [-gn float]          1.000000             penalty for next gap
#       [-mal int]           3                    Minimum alignment lenght
#       [-m filename]        /home/people/stud001/data/BLOSUM50  Alignment matrix
#       [-v]                 0                    Verbose mode
# Read 25 elements on linelist /home/people/stud001/data/BLOSUM50
# Read_realblosum done. Alphabet ARNDCQEGHILKMFPSTWYV
# Number of FASTA entries read 1 from file ../data/1PLC._.fsa
# Number of FASTA entries read 1 from file ../data/1PLB._.fsa
# Gap penalties. fgap: 11.000000. ngap: 1.000000
ALN 1PLC.____mol:aa_ELECTRON_TRANSPORT 99 1PLB.____mol:aa_ELECTRON_TRANSPORT_PROTEIN 97 type SW_ALN alen 98 96 64 2 score 439 -99.9 -439 -99.900002
QAL 1PLC.____m     1 DVLLGADDGSLAFVPSEFSISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLNAKGETFEVALSNKGEYSFYCSPHQGAGMVGKVTVN
DAL 1PLB.____m     1 EVKLGSDDGGLVFSPSSFTVAAGEKITFKNNAGFPHNIVFDEDEVPAGVNAEKISQPE--YLNGAGETYEVTLTEKGTYKFYCEPHAGAGMKGEVTVN

The output from the two programs is hence identical.

Copy the compiled program to your bin directory. Note, you might have another location of your bin directory compared to what is writen below)

cp fasta_align ~/bin 
rehash

Now you can repeat the command used to test the speed of the program

time fasta_align ../data/1PLC._.fsa ../data/test.fsa > test.out2

Write down the CPU time used by the program and compare this with the time used by the O3 code. How many times fast does the O2 code run?


Aligning all against all

Now you are ready for the last task: Making a program that can align all against all from a protein database. The file fasta_align_all.c contains such a program. Again make sure you understand the structure of the program. Check what each subroutine is doing, and make sure you understand how the main program is organized. The program is identical to fasta_align except for the central for loops in the main program. Make sure you understand what is going on here.

You can now compile the program by typing

make fasta_align_all

Copy the compiled program to your bin directory. Note, you might have another location of your bin directory compared to what is writen below)

cp fasta_align_all ~/bin
rehash

You can test that this work by going to the data directory and align two sequences

cd
cd data
cat 1PLC._.fsa 1PLB._.fsa | fasta_align_all --

This should give the same output as when you tested the code earlier.

You can now do a large scale calculation where you align 1000 protein sequences against each other. We shall use the output from the calculation in the exercise Thursday, 9. January.

Copy the FASTA data file to you data directory

cd
cd data
cp /home/projects/mniel/ALGO/data/Align/data.fsa .

The data.fsa file contains 1055 protein sequences. Launch an all-against-all alignment of the sequences in the data.fsa file using the fasta_align_all program and save the output in a file called data.fsa.aln

fasta_align_all data.fsa | grep ^ALN > data.fsa.aln &
The "&" in the end of the command will launch the calculation in the background, meaning that you can logout from you account, and the program will keep running. Just leave the code running. It might take some hours to complete. You need the results for the exercise on Thursday, 9. January.

This is it for today!!!