|
Hidden Markov Models
Morten Nielsen (mniel@cbs.dtu.dk)
I todays exercise you shall implement two algorithms for alignment of a sequence to a hidden Markov Model,
The Viterbi, and Posterior decoding algorithms.
Implementing the algorithms
First, you must make a new directory for todays exercise and copy some data files.
cd
mkdir HMM
cd HMM
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/data/HMM/casino.* .
You should now have the files casino.hmm, casino.seq, and casino.seqlong in the HMM directory.
Check this by typing
ls -ltr
The casino.hmm file is a parameter file describing the hidden Model, and the casino.seq and casino.seqlong files are
files with sequences to be used for decoding.
The casino.hmm parameter files has the following format
M= 6
N= 2
A:
0.95 0.05
0.1 0.90
B:
0.166 0.166 0.166 0.166 0.166 0.167
0.1 0.1 0.1 0.1 0.1 0.5
pi:
0.5 0.5
where M gives the number of symbols in each state, N gives the number of states, A: gives the transition probability
matrix, B: gives the symbol emission probabilities in each state, and pi: gives the initial transition probabilities.
Make sure to understand this format, and can make sense of all the parameters.
Next, you must copy the program templates to you src directory. (Remember the "." in the end of the command)
cd
cd src
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/code/HMM/viterbi.c .
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/code/HMM/posterior.c .
The first program viterbi.c implements the (guess what!) Viterbi decoding algorithm, and
the second program posterior.c implements the forward/backward posterior decoding
algorithm.
Viterbi
Open the file viterbi.c in your favorite editor. Most the code is (as usual) pre-written.
Go to the main procedure. Make sure you understand the
structure of the program.
You shall fill in the missing code (XXXX) in the viterbi subroutine.
Again make sure you understand the structure of the routine,
and then fill out the missing code.
Next, compile the program
make viterbi
When the compilation is successful copy the code to the bin directory
cp viterbi ../bin/
rehash
The rehash command updates a table used by the operating system to keep track of all executable programs to
include your new program viterbi
Now go to the HMM directory. First check what is the content of the files casino.seq, and casino.seqlong.
head casino.seq casino.seqlong
Next, run the viterbi program on the file casino.seq
viterbi casino.hmm casino.seq
Look at the output. How does the output compare to the casino.seq_vit?
Do the same for the casino.seq_long file. How does the output compare to the casino.seqlong_vit?
Posterior decoding
Now go back to the src directory and open the file posterior.c. Spend some time to make sure you understand the
structure of the program. Fill in the missing code (XXXXXX). Note, that both the subroutine posterior
and the main routine has un-written code. Next compile the program
make posterior
When the compilation is successful, copy the code to the bin directory
cp posterior ../bin
rehash
Now go to the HMM directory. You can now test your posterior decoding algorithm on the two data
set from before casino.seq, and casino.seq_long.
How does the output compare to casino.seq_post and
casino.seqlong_post?
Make sure you understand how the output from the posterior decoding for the same observation is modified depending on
flanking observations.
Profile hidden Markov models, HMMER
Now you are done with the coding. As the last part of the exercise you shall use the HMMER program packages to
construct a profile hidden Markov Model to be used to identify remote homology between two protein sequences.
If you need some more detailed explanation of the HMMER program suite, you can refer to the
HMMER Userguide
As you most likely do not have the hmmer program installed on your own labtop, you must to this exercise at the CBS servers. Login to the login server at CBS using one of the stud0XX accounts not already in use, and do the exercise here.
First you must copy some files to your HMM directory (on the CBS servers)
cd
cd HMM
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/data/HMM/T0391.fsa .
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/data/HMM/T0391.fsa.blastout .
To save you time, I have run the Psi-Blast search for you and saved the output in the file T0391.fsa.blastout.
This file was generated using the command
blastpgp -j 3 -e 0.001 -m 6 -i T0391.fsa -d sp -b 10000000 -v 10000000 > T0391.fsa.blastout
NOTE THIS COMMAND YOU DO NOT NEED TO EXECUTE!
As explained in the lecture the HMMER program cannot directly read the Blast alignment output format. An easy
format to obtain is the Stockholm format. This format looks like
# STOCKHOLM 1.0
QUERY DPEISEQDEEKKKYTSVCVGREEDIRKS-ERMTAVVHD-RE--V-V-IF--Y-H-KGE-Y
Q8K2P6 DPEISEQDEEKKKYTSVCVGREEDIRKS-ERMTAVVHD-RE--V-V-IF--Y-H-KGE-Y
Q8TAC1 ----SAQDPEKREYSSVCVGREDDIKKS-ERMTAVVHD-RE--V-V-IF--Y-H-KGE-Y
Q07947 ------------------ILRQSDLPPG-EMQRYEGGS-EP--V-M-VC--N-V-DGE-F
..
QUERY HAMDIRCYHSGG-PLHL-GEI-EDFNGQSCIVCPWHKYKITLATGE--GLYQSINPKDPS
Q8K2P6 HAMDIRCYHSGG-PLHL-GEI-EDFNGQSCIVCPWHKYKITLATGE--GLYQSINPKDPS
Q8TAC1 HAMDIRCYHSGG-PLHL-GDI-EDFDGRPCIVCPWHKYKITLATGE--GLYQSINPKDPS
Q07947 FAVQDTCTHGDW-ALSE-GYL-DGD----VVECTLHFGKFCVRTGK--VKAL------PA
..
QUERY AKP--KWCSKGVK-QR--IHTVKVD-NGNIYVTLSKEPFKCDSDYYATGEFKVIQSSS
Q8K2P6 AKP--KWCSKGVK-QR--IHTVKVD-NGNIYVTLSKEPFKCDSDYYATGEFKVIQSSS
Q8TAC1 AKP--KWCSKGIK-QR--IHTVTVD-NGNIYVTLSNEPFKCDSDFYATGDFKVIKSSS
Q07947 CKP-----------IK--VYPIKIE-GDEVHVDLDNGELK------------------
..
//
We can make a simple GAWK script to transform the Blast output to the Stockholm format. You can copy such a gawk
script to the HMM directory
cp /usr/opt/www/pub/CBS/courses/27625.algo/exercises/data/HMM/do_msa2sto .
You can see how the script is written by typing
cat do_msa2sto
If you know GAWK, you might be able to make some sense of this. If not, do not worry, just run the command.
(As I said in my lecture, most of the work of a bioinformatician is spend transforming data formats!)
You can run the script by typing
cat T0391.fsa.blastout | do_msa2sto > T0391.fsa.blastout.sto
Now you can do the HMM model construction.
hmmbuild T0391.hmm T0391.fsa.blastout.sto
You can see how the HMM model is constructed in the file T0391.hmm
more T0391.hmm
Spend a little time looking a the model file.
Can you understand the structure of the model? How long is the T0391 sequence, and how large in the HMM model?
Now you are ready to search the PDB database for protein structures for a template suitable for
protein homology modeling
hmmsearch T0391.hmm /home/databases/pdb/pdb.fsa > T0391.out
Look at the output for the search (open the file T0391.out in some editor). Do you find a good template with
a submission data prior to May 2008. You can get the submission date for a PDB entry by typing
getpdb XXXX | less
where the 4 letter code (shown as XXXX) is the PDB code for the hit (the last letter following the "." is the
chain identifier).
Can you verify if this PDB entry would have been a good choice as template for homology modeling of the CASP target (remember the CASP competition took place in 2008)?
Use the CE (Combinatorial extension) algorithm for this (you can also use the
CE web-server)
The CE program unfortunately only runs on the old CBS server. Login to organism.cbs.dtu.dk using your student id and password.
ssh -Y stud0XXorganism.cbs.dtu.dk
Next go to the HMM directory and type
CE 3D89 A XXXX Y
where 3D89 is the PDB code for the CASP target (T0391), A is the chain identifier for the CASP target. XXX is the
PDB identifier for the best scoring HMMER template (with submission date prior to may 2008), and Y is the chain
identifier.
Remember that a good template match should have low Rmsd (less than 3-4 AA) and Z-score above 3.8.
So do the two PDB files share structural similarities?
Now you done.
|