Erik L.L. Sonnhammer, Gunnar von Heijne, and Anders Krogh:
A hidden Markov model for predicting transmembrane helices in protein
sequences.
In Proc. of Sixth Int. Conf. on Intelligent Systems for Molecular Biology,
p 175-182
Ed J. Glasgow, T. Littlejohn, F. Major, R. Lathrop, D. Sankoff, and C. Sensen
Menlo Park, CA: AAAI Press, 1998
Download compressed postscript file
Please cite.
Other material (model, training data, etc.) can be found here.
This is an example of the input (one protein, in FASTA format):
>5H2A_CRIGR you can have comments after the ID
MEILCEDNTSLSSIPNSLMQVDGDSGLYRNDFNSRDANSSDASNWTIDGENRTNLSFEGYLPPTCLSILHL
QEKNWSALLTAVVIILTIAGNILVIMAVSLEKKLQNATNYFLMSLAIADMLLGFLVMPVSMLTILYGYRWP
LPSKLCAVWIYLDVLFSTASIMHLCAISLDRYVAIQNPIHHSRFNSRTKAFLKIIAVWTISVGVSMPIPVF
GLQDDSKVFKQGSCLLADDNFVLIGSFVAFFIPLTIMVITYFLTIKSLQKEATLCVSDLSTRAKLASFSFL
PQSSLSSEKLFQRSIHREPGSYTGRRTMQSISNEQKACKVLGIVFFLFVVMWCPFFITNIMAVICKESCNE
HVIGALLNVFVWIGYLSSAVNPLVYTLFNKTYRSAFSRYIQCQYKENRKPLQLILVNTIPALAYKSSQLQA
GQNKDSKEDAEPTDNDCSMVTLGKQQSEETCTDNINTVNEKVSCV
Here is an example of the output:
# ID 5H2A_CRIGR
# Length: 471
# Log-odds: 37.647490 bits
5H2A_CRIGR TMHMM1.0 outside 1 78
5H2A_CRIGR TMHMM1.0 TMhelix 79 101
5H2A_CRIGR TMHMM1.0 inside 102 107
5H2A_CRIGR TMHMM1.0 TMhelix 108 130
5H2A_CRIGR TMHMM1.0 outside 131 148
5H2A_CRIGR TMHMM1.0 TMhelix 149 171
5H2A_CRIGR TMHMM1.0 inside 172 192
5H2A_CRIGR TMHMM1.0 TMhelix 193 215
5H2A_CRIGR TMHMM1.0 outside 216 233
5H2A_CRIGR TMHMM1.0 TMhelix 234 256
5H2A_CRIGR TMHMM1.0 inside 257 325
5H2A_CRIGR TMHMM1.0 TMhelix 326 348
5H2A_CRIGR TMHMM1.0 outside 349 356
5H2A_CRIGR TMHMM1.0 TMhelix 357 379
5H2A_CRIGR TMHMM1.0 inside 380 471
If the whole sequence is labeled as inside or outside, the prediction
is that it contains no membrane helices. It is probably not wise to
interpret it as a prediction of location.
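The output format shown above can be read programmatically. The following is a minimal sketch, assuming each segment consists of five whitespace-separated fields (sequence ID, program tag, region type, start, end) and that lines starting with "#" are comments. The names parse_short_output and has_tm_helix are hypothetical helpers, not part of TMHMM. Tokens are read in groups of five, so the parser does not depend on how the lines are wrapped.

```python
from collections import namedtuple

# One predicted region; field names are illustrative, not defined by TMHMM.
Segment = namedtuple("Segment", ["seq_id", "source", "region", "start", "end"])


def parse_short_output(text):
    """Parse segment records, skipping blank lines and '#' comment lines."""
    tokens = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        tokens.extend(stripped.split())
    if len(tokens) % 5 != 0:
        raise ValueError("expected five fields per segment")
    segments = []
    for i in range(0, len(tokens), 5):
        seq_id, source, region, start, end = tokens[i:i + 5]
        segments.append(Segment(seq_id, source, region, int(start), int(end)))
    return segments


def has_tm_helix(segments):
    """True if at least one region is a predicted transmembrane helix."""
    return any(s.region == "TMhelix" for s in segments)


# First three segments of the example output above.
example = """\
# ID 5H2A_CRIGR
# Length: 471
5H2A_CRIGR TMHMM1.0 outside 1 78
5H2A_CRIGR TMHMM1.0 TMhelix 79 101
5H2A_CRIGR TMHMM1.0 inside 102 107
"""
segments = parse_short_output(example)
for s in segments:
    print(s.region, s.start, s.end)
print("contains TM helix:", has_tm_helix(segments))
```

When has_tm_helix returns False, the only safe reading is "no membrane helices predicted"; the single inside or outside label should not be taken as a location.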
The prediction gives the most probable location and orientation of transmembrane
helices in the sequence. It is found by an algorithm called N-best (or
1-best in this case) that sums over all paths through the model with the
same location and direction of the helices.
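To illustrate what "summing over all paths with the same location and direction of the helices" means, here is a toy sketch. It is not the N-best algorithm itself, which avoids enumerating paths; it simply enumerates a handful of made-up state paths and groups them by labeling. The states, the probabilities, and the names LABEL and labeling_totals are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical joint probabilities of four state paths for a 3-residue sequence.
# "h1" and "h2" are two different model states that both carry the label M.
path_probs = {
    ("in", "in", "in"): 0.30,
    ("in", "h1", "out"): 0.25,
    ("in", "h2", "out"): 0.25,
    ("out", "out", "out"): 0.20,
}

# Map each model state to its label (i, M or o).
LABEL = {"in": "i", "out": "o", "h1": "M", "h2": "M"}


def labeling_totals(path_probs):
    """Sum the probability of all state paths that share the same labeling."""
    totals = defaultdict(float)
    for path, p in path_probs.items():
        labeling = "".join(LABEL[state] for state in path)
        totals[labeling] += p
    return dict(totals)


# The single most probable state path and its labeling.
best_path = max(path_probs, key=path_probs.get)
best_path_labeling = "".join(LABEL[state] for state in best_path)

# The most probable labeling after summing over paths.
totals = labeling_totals(path_probs)
best_labeling = max(totals, key=totals.get)

print("best single path gives:", best_path_labeling)
print("best labeling gives:   ", best_labeling, totals[best_labeling])
```

In this toy the single best path carries no helix, while the best labeling does, because two paths with the same helix placement pool their probability.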
The log-odds score of the predicted structure with respect to the null
model is given (in bits). The higher it is, the more confident the
prediction (there have been no systematic studies of this).
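A score in bits can be turned into a plain odds ratio. The sketch below assumes the usual convention that a log-odds score in bits is the base-2 logarithm of the ratio between the probability under the model and the probability under the null model. The function names are hypothetical.

```python
import math


def bits_to_odds_ratio(bits):
    """Convert a base-2 log-odds score into the odds ratio it represents."""
    return 2.0 ** bits


def bits_to_log10_odds(bits):
    """Express the same score on a base-10 logarithmic scale."""
    return bits * math.log10(2.0)


score = 37.647490  # log-odds from the example output, in bits
print("odds ratio: %.3g" % bits_to_odds_ratio(score))
print("log10 odds: %.2f" % bits_to_log10_odds(score))
```

Since no systematic study of the score exists, the ratio is best used to compare predictions with each other rather than as a calibrated confidence.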
>5H2A_CRIGR
MEILCEDNTSLSSIPNSLMQVDGDSGLYRNDFNSRDANSSDASNWTIDGENRTNLSFEGYLPPTCLSILHLQ
?0 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
EKNWSALLTAVVIILTIAGNILVIMAVSLEKKLQNATNYFLMSLAIADMLLGFLVMPVSMLTILYGYRWPLP
?0 ooooooMMMMMMMMMMMMMMMMMMMMMMMiiiiiiMMMMMMMMMMMMMMMMMMMMMMMoooooooooooooo
SKLCAVWIYLDVLFSTASIMHLCAISLDRYVAIQNPIHHSRFNSRTKAFLKIIAVWTISVGVSMPIPVFGLQ
?0 ooooMMMMMMMMMMMMMMMMMMMMMMMiiiiiiiiiiiiiiiiiiiiiMMMMMMMMMMMMMMMMMMMMMMMo
DDSKVFKQGSCLLADDNFVLIGSFVAFFIPLTIMVITYFLTIKSLQKEATLCVSDLSTRAKLASFSFLPQSS
?0 oooooooooooooooooMMMMMMMMMMMMMMMMMMMMMMMiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii
LSSEKLFQRSIHREPGSYTGRRTMQSISNEQKACKVLGIVFFLFVVMWCPFFITNIMAVICKESCNEHVIGA
?0 iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiMMMMMMMMMMMMMMMMMMMMMMMooooooooMMMM
LLNVFVWIGYLSSAVNPLVYTLFNKTYRSAFSRYIQCQYKENRKPLQLILVNTIPALAYKSSQLQAGQNKDS
?0 MMMMMMMMMMMMMMMMMMMiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii
KEDAEPTDNDCSMVTLGKQQSEETCTDNINTVNEKVSCV
?0 iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii
Predictions are:
i: Inside (cytoplasmic).
M: Transmembrane helix.
O: Outside or non-cytoplasmic, for long outside loops (typically
longer than 100 residues).
o: Outside short loop region.
Usually you will not have to distinguish between long and short.
See paper for more details.
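A per-residue label string can be collapsed into segments with start and end positions. This is a minimal sketch that treats "O" and "o" as the same region, following the remark that the two usually need not be distinguished. The function name and the short label string are made up for illustration; positions are 1-based.

```python
from itertools import groupby

# Map each per-residue label to a region name; O and o are merged.
REGION = {"i": "inside", "M": "TMhelix", "o": "outside", "O": "outside"}


def labels_to_segments(labels):
    """Collapse a label string into (region, start, end) tuples, 1-based."""
    segments = []
    position = 1
    for region, run in groupby(labels, key=lambda label: REGION[label]):
        length = len(list(run))
        segments.append((region, position, position + length - 1))
        position += length
    return segments


# Hypothetical label string, not taken from a real prediction.
for region, start, end in labels_to_segments("ooOOMMMii"):
    print(region, start, end)
```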
The plot is obtained by calculating the total probability that a
residue sits in helix, inside, or outside, summed over all possible
paths through the model. Sometimes the plot and the prediction seem
contradictory, but that is because the plot shows probabilities
for each residue, whereas the prediction is the overall most probable
structure. The plot should therefore be seen as a complementary source
of information.
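The apparent contradiction between plot and prediction can be reproduced with a toy distribution. The sketch below uses three made-up labelings of a 3-residue sequence with invented probabilities; it is not derived from the TMHMM model, and the labelings are not meant to be biologically sensible. It computes per-residue probabilities (what the plot shows) and compares them with the single most probable labeling (what the prediction reports).

```python
from collections import defaultdict

# Hypothetical probabilities of complete labelings; they sum to 1.
toy_labelings = {"MMi": 0.4, "iio": 0.3, "iMo": 0.3}


def per_residue_posteriors(labelings):
    """For each position, sum the probability of every labeling per label."""
    length = len(next(iter(labelings)))
    posteriors = [defaultdict(float) for _ in range(length)]
    for labeling, p in labelings.items():
        for position, label in enumerate(labeling):
            posteriors[position][label] += p
    return [dict(d) for d in posteriors]


posteriors = per_residue_posteriors(toy_labelings)

# Label with the highest probability at each residue (the top curve in a plot).
residue_wise = "".join(max(d, key=d.get) for d in posteriors)

# Overall most probable labeling (the reported prediction).
overall = max(toy_labelings, key=toy_labelings.get)

print("per-residue best:", residue_wise)
print("overall best:    ", overall)
```

Here the first residue is most likely inside when considered on its own, yet the overall most probable labeling places it in a helix, so both views are correct answers to different questions.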
Below the plot there are links to
One of the most common mistakes by the program is to reverse the direction of proteins with one TM segment.
It is possible that the log-odds score can be used to distinguish between TM proteins and other proteins, but it is not obvious, and we have not looked into it yet.
Do not use the program to predict whether a non-membrane protein is
cytoplasmic or not.