Events News Research CBS CBS Publications Bioinformatics
Staff Contact About Internal CBS CBS Other

Hidden Markov Models

Morten Nielsen (mniel@cbs.dtu.dk)

I todays exercise you shall implement two algorithms for alignment of a sequence to a hidden Markov Model, The Viterbi, and Posterior decoding algorithms.


Implementing the algorithms

First, you must make a new directory for todays exercise and copy some data files.

cd
mkdir HMM
cd HMM
cp /home/projects/mniel/ALGO/data/HMM/casino.* .

You should now have the files casino.hmm, casino.seq, and casino.seqlong in the HMM directory. Check this by typing

ls -ltr

The casino.hmm file is a parameter file describing the hidden Model, and the casino.seq and casino.seqlong files are files with sequences to be used for decoding.

The casino.hmm parameter files has the following format

M= 6 
N= 2
A:
0.95 0.05
0.1 0.90
B:
0.166 0.166 0.166 0.166 0.166 0.167
0.1 0.1 0.1 0.1 0.1 0.5
pi:
0.5 0.5

where M gives the number of symbols in each state, N gives the number of states, A: gives the transition probability matrix, B: gives the symbol emission probabilities in each state, and pi: gives the initial transition probabilities. Make sure to understand this format, and can make sense of all the parameters.

Next, you must copy the program templates to you src directory. (Remember the "." in the end of the command)

cd
cd src
cp /home/projects/mniel/ALGO/code/HMM/viterbi.c .
cp /home/projects/mniel/ALGO/code/HMM/posterior.c .

The first program viterbi.c implements the (guess what!) Viterbi decoding algorithm, and the second program posterior.c implements the forward/backward posterior decoding algorithm.


Viterbi

Open the file viterbi.c in your favorite editor. Most the code is (as usual) pre-written. Go to the main procedure. Make sure you understand the structure of the program.

You shall fill in the missing code (XXXX) in the viterbi subroutine. Again make sure you understand the structure of the routine, and then fill out the missing code.

Next, compile the program

make viterbi

When the compilation is successful copy the code to the bin directory

cp viterbi ../bin/
rehash

The rehash command updates a table used by the operating system to keep track of all executable programs to include your new program viterbi

Now go to the HMM directory. First check what is the content of the files casino.seq, and casino.seqlong.

head casino.seq casino.seqlong

Next, run the viterbi program on the file casino.seq

viterbi casino.hmm casino.seq

Look at the output. How does the output compare to the casino.seq_vit?

Do the same for the casino.seq_long file. How does the output compare to the casino.seqlong_vit?


Posterior decoding

Now go back to the src directory and open the file posterior.c. Spend some time to make sure you understand the structure of the program. Fill in the missing code (XXXXXX). Note, that both the subroutine posterior and the main routine has un-written code. Next compile the program

                        
make posterior

When the compilation is successful, copy the code to the bin directory

cp posterior ../bin
rehash

Now go to the HMM directory. You can now test your posterior decoding algorithm on the two data set from before casino.seq, and casino.seq_long.

How does the output compare to casino.seq_post and casino.seqlong_post?

Make sure you understand how the output from the posterior decoding for the same observation is modified depending on flanking observations.


Profile hidden Markov models, HMMER

Now you are done with the coding. As the last part of the exercise you shall use the HMMER program packages to construct a profile hidden Markov Model to be used to identify remote homology between two protein sequences. If you need some more detailed explanation of the HMMER program suite, you can refer to the HMMER Userguide

As you most likely do not have the hmmer program installed on your own labtop, you must to this exercise at the CBS servers. Login to the login server at CBS using one of the stud0XX accounts not already in use, and do the exercise here.

First you must copy some files to your HMM directory (on the CBS servers)

cd
cd HMM
cp /home/projects/mniel/ALGO/data/HMM/T0391.fsa .

Next you shall run blast to identify sequences similar to your query sequence. You do this by running the command

blastpgp -j 3 -e 0.001 -m 6 -i T0391.fsa -d sp -b 10000000 -v 10000000 > T0391.fsa.blastout

As explained in the lecture the HMMER program cannot directly read the Blast alignment output format. An easy format to obtain is the Stockholm format. This format looks like

# STOCKHOLM 1.0

QUERY DPEISEQDEEKKKYTSVCVGREEDIRKS-ERMTAVVHD-RE--V-V-IF--Y-H-KGE-Y
Q8K2P6 DPEISEQDEEKKKYTSVCVGREEDIRKS-ERMTAVVHD-RE--V-V-IF--Y-H-KGE-Y
Q8TAC1 ----SAQDPEKREYSSVCVGREDDIKKS-ERMTAVVHD-RE--V-V-IF--Y-H-KGE-Y
Q07947 ------------------ILRQSDLPPG-EMQRYEGGS-EP--V-M-VC--N-V-DGE-F
..

QUERY HAMDIRCYHSGG-PLHL-GEI-EDFNGQSCIVCPWHKYKITLATGE--GLYQSINPKDPS
Q8K2P6 HAMDIRCYHSGG-PLHL-GEI-EDFNGQSCIVCPWHKYKITLATGE--GLYQSINPKDPS
Q8TAC1 HAMDIRCYHSGG-PLHL-GDI-EDFDGRPCIVCPWHKYKITLATGE--GLYQSINPKDPS
Q07947 FAVQDTCTHGDW-ALSE-GYL-DGD----VVECTLHFGKFCVRTGK--VKAL------PA
..

QUERY AKP--KWCSKGVK-QR--IHTVKVD-NGNIYVTLSKEPFKCDSDYYATGEFKVIQSSS
Q8K2P6 AKP--KWCSKGVK-QR--IHTVKVD-NGNIYVTLSKEPFKCDSDYYATGEFKVIQSSS
Q8TAC1 AKP--KWCSKGIK-QR--IHTVTVD-NGNIYVTLSNEPFKCDSDFYATGDFKVIKSSS
Q07947 CKP-----------IK--VYPIKIE-GDEVHVDLDNGELK------------------
..
//

We can make a simple GAWK script to transform the Blast output to the Stockholm format. You can copy such a gawk script to the HMM directory

cp /home/projects/mniel/ALGO/data/HMM/do_msa2sto .

You can see how the script is written by typing

cat do_msa2sto

If you know GAWK, you might be able to make some sense of this. If not, do not worry, just run the command. (As I said in my lecture, most of the work of a bioinformatician is spend transforming data formats!)

You can run the script by typing

 
cat T0391.fsa.blastout | do_msa2sto > T0391.fsa.blastout.sto

Now you can do the HMM model construction.

hmmbuild T0391.hmm T0391.fsa.blastout.sto

You can see how the HMM model is constructed in the file T0391.hmm

more T0391.hmm

Spend a little time looking a the model file. Can you understand the structure of the model? How long is the T0391 sequence, and how large in the HMM model?

Now you are ready to search the PDB database for protein structures for a template suitable for protein homology modeling

hmmsearch T0391.hmm /home/databases/pdb/pdb.fsa > T0391.out

Look at the output for the search (open the file T0391.out in some editor). Do you find a good template with a submission data prior to May 2008. You can get the submission date for a PDB entry by typing

getpdb XXXX | less

where the 4 letter code (shown as XXXX) is the PDB code for the hit (the last letter following the "." is the chain identifier).

Can you verify if this PDB entry would have been a good choice as template for homology modeling of the CASP target (remember the CASP competition took place in 2008)? Use the CE (Combinatorial extension) algorithm for this (you can also use the CE web-server)

The CE program unfortunately only runs on the old CBS server. Login to organism.cbs.dtu.dk using your student id and password.

ssh -Y stud0XXorganism.cbs.dtu.dk

Next go to the HMM directory and type

CE 3D89 A XXXX Y
where 3D89 is the PDB code for the CASP target (T0391), A is the chain identifier for the CASP target. XXX is the PDB identifier for the best scoring HMMER template (with submission date prior to may 2008), and Y is the chain identifier.

Remember that a good template match should have low Rmsd (less than 3-4 AA) and Z-score above 3.8.

So do the two PDB files share structural similarities?


Now you done.