Hidden Markov Models
Morten Nielsen (email@example.com)
I todays exercise you shall implement two algorithms for alignment of a sequence to a hidden Markov Model,
The Viterbi, and Posterior decoding algorithms.
Implementing the algorithms
First, you must make a new directory for todays exercise and copy some data files.
cp /home/projects/mniel/ALGO/data/HMM/casino.* .
You should now have the files casino.hmm, casino.seq, and casino.seqlong in the HMM directory.
Check this by typing
The casino.hmm file is a parameter file describing the hidden Model, and the casino.seq and casino.seqlong files are
files with sequences to be used for decoding.
The casino.hmm parameter files has the following format
0.166 0.166 0.166 0.166 0.166 0.167
0.1 0.1 0.1 0.1 0.1 0.5
where M gives the number of symbols in each state, N gives the number of states, A: gives the transition probability
matrix, B: gives the symbol emission probabilities in each state, and pi: gives the initial transition probabilities.
Make sure to understand this format, and can make sense of all the parameters.
Next, you must copy the program templates to you src directory. (Remember the "." in the end of the command)
cp /home/projects/mniel/ALGO/code/HMM/viterbi.c .
cp /home/projects/mniel/ALGO/code/HMM/posterior.c .
The first program viterbi.c implements the (guess what!) Viterbi decoding algorithm, and
the second program posterior.c implements the forward/backward posterior decoding
Open the file viterbi.c in your favorite editor. Most the code is (as usual) pre-written.
Go to the main procedure. Make sure you understand the
structure of the program.
You shall fill in the missing code (XXXX) in the viterbi subroutine.
Again make sure you understand the structure of the routine,
and then fill out the missing code.
Next, compile the program
When the compilation is successful copy the code to the bin directory
cp viterbi ../bin/
The rehash command updates a table used by the operating system to keep track of all executable programs to
include your new program viterbi
Now go to the HMM directory. First check what is the content of the files casino.seq, and casino.seqlong.
head casino.seq casino.seqlong
Next, run the viterbi program on the file casino.seq
viterbi casino.hmm casino.seq
Look at the output. How does the output compare to the casino.seq_vit?
Do the same for the casino.seq_long file. How does the output compare to the casino.seqlong_vit?
Now go back to the src directory and open the file posterior.c. Spend some time to make sure you understand the
structure of the program. Fill in the missing code (XXXXXX). Note, that both the subroutine posterior
and the main routine has un-written code. Next compile the program
When the compilation is successful, copy the code to the bin directory
cp posterior ../bin
Now go to the HMM directory. You can now test your posterior decoding algorithm on the two data
set from before casino.seq, and casino.seq_long.
How does the output compare to casino.seq_post and
Make sure you understand how the output from the posterior decoding for the same observation is modified depending on
Profile hidden Markov models, HMMER
Now you are done with the coding. As the last part of the exercise you shall use the HMMER program packages to
construct a profile hidden Markov Model to be used to identify remote homology between two protein sequences.
If you need some more detailed explanation of the HMMER program suite, you can refer to the
As you most likely do not have the hmmer program installed on your own labtop, you must to this exercise at the CBS servers.
If you have not selected a student account, use stud001 and make a directory
unique for you in the stud001 home directory, i.e;
First you must copy some files to your directory on the CBS servers
cd HMM (or where ever you directory is located)
cp /home/projects/mniel/ALGO/data/HMM/T0391.fsa .
Next you shall run blast to identify sequences similar to your query sequence. You do this by running the command
blastpgp -j 3 -e 0.001 -m 6 -i T0391.fsa -d sp -b 10000000 -v 10000000 > T0391.fsa.blastout
As explained in the lecture the HMMER program cannot directly read the Blast alignment output format. An easy
format to obtain is the Stockholm format. This format looks like
# STOCKHOLM 1.0
We can make a simple GAWK script to transform the Blast output to the Stockholm format. You can copy such a gawk
script to the HMM directory
cp /home/projects/mniel/ALGO/data/HMM/do_msa2sto .
You can see how the script is written by typing
If you know GAWK, you might be able to make some sense of this. If not, do not worry, just run the command.
(As I said in my lecture, most of the work of a bioinformatician is spend transforming data formats!)
You can run the script by typing
cat T0391.fsa.blastout | do_msa2sto > T0391.fsa.blastout.sto
Now you can do the HMM model construction.
hmmbuild T0391.hmm T0391.fsa.blastout.sto
You can see how the HMM model is constructed in the file T0391.hmm
Spend a little time looking a the model file.
Can you understand the structure of the model? How long is the T0391 sequence, and how large in the HMM model?
Now you are ready to search the PDB database for protein structures for a template suitable for
protein homology modeling
hmmsearch T0391.hmm /home/databases/pdb/pdb.fsa > T0391.out
Look at the output for the search (open the file T0391.out in some editor). Do you find a good template with
a submission data prior to May 2008. You can get the submission date for a PDB entry by typing
getpdb XXXX | less
where the 4 letter code (shown as XXXX) is the PDB code for the hit (the last letter following the "." is the
Can you verify if this PDB entry would have been a good choice as template for homology modeling of the CASP target (remember the CASP competition took place in 2008)?
Use the CE (Combinatorial extension) algorithm for this (you can also use the
The CE program unfortunately only runs on the old CBS server. Login to organism.cbs.dtu.dk using your student id and password.
ssh -Y stud0XXorganism.cbs.dtu.dk
Next go to the HMM directory and type
CE 3D89 A XXXX Y
where 3D89 is the PDB code for the CASP target (T0391), A is the chain identifier for the CASP target. XXX is the
PDB identifier for the best scoring HMMER template (with submission date prior to may 2008), and Y is the chain
Remember that a good template match should have low Rmsd (less than 3-4 AA) and Z-score above 3.8.
So do the two PDB files share structural similarities?
Now you done.