Events News Research CBS CBS Publications Bioinformatics
Staff Contact About Internal CBS CBS Other

HMMgene 1.1 web server documentation

  1. Introduction
  2. Using the web server
  3. Sequence format
  4. Output from HMMgene
  5. Options
  6. Annotation
  7. Known Bugs

Introduction

HMMgene is a program for prediction of genes in anonymous DNA.

The methods used are described in the paper:
A. Krogh: Two methods for improving performance of an HMM and their application for gene finding. In Proc. of Fifth Int. Conf. on Intelligent Systems for Molecular Biology, ed. Gaasterland, T. et al., Menlo Park, CA: AAAI Press, 1997, pp. 179-186.

The program predicts whole genes, so the predicted exons always splice correctly. It can predict several whole or partial genes in one sequence, so it can be used on whole cosmids or even longer sequences. HMMgene can also be used to predict splice sites and start/stop codons. If some features of a sequence are known, such as hits to ESTs, proteins, or repeat elements, these regions can be locked as coding or non-coding and then the program will find the best gene structure under these constraints.

The program is based on a hidden Markov model, which is a probabilistic model of the gene structure. This means that all predictions have associated probabilities that reflect how confident it is in the predictions. Apart from reporting the best prediction, HMMgene can also report the N best gene predictions for a sequence. This is useful if the there are several equally likely gene structures and may even indicate alternative splicing.

HMMgene takes an input file with one or more DNA sequences in FASTA format. It also has a few options for changing the default behavior of the program.

The output is a prediction of partial or complete genes in the sequences. The output is in a standardized format that is easily read by other programs, which specifies the location of all the predicted genes and their coding regions and scores for whole genes as well as exon scores.

Using the web server

The server has two forms: You can either submit a local file with sequences in fasta format, or paste your sequence into the window. Then you select which organism the sequence is from and what options you would like to use. Optionally you can also specify known annotation from e.g. database hits.

There are currently models for vertebrate and C. elegans. The vertebrate model is trained entirely on human genes, but it should work reasonably well for other vertebrates.

Sequence format

DNA sequences must be in FASTA format which looks like this example:
>SEQ1 Any text following the identifier is ignored
TCATTGTATCAGAAAGATAAAGAAAAAATAATCGTATTTCAGTACTTCTATACATCCTAAAAGGGAAGAC
GGAACACTTAAGTGGTTGATAAATTTGAAAAGCTGATTAAACATAATAATCACCATGTTGGGGGAAGACA
TAAAAGTCATAAAACAGATTTTTTATAATATTAAAAAAGTGACATGAAAATTATACAATTTTAGAAAGGA
ATATAAAAAGGCAGGAGTTAAAAAATAGTGGGACTAATATCATAGAAAACTATCCATGAGGAAGGTCAAA
TTTATTTTCAACATGTAAAAAGGATAAAGAGTAGAGGTATTTTAAAAATTCACAGATTCTTAATGAGGCA
AATGTTAAAATATGGAACCCAATCTCAGACAAATACATAGAAAGGAGTAAGGGCCAACTCTCATGCATAA
GGTATCCCATCCTATAGCAAATCAGATATATAGGTACGCTTGA

>HS1433PR H. sapiens gene for 14-3-3 protein.
  GAATTCGCGGCGCCGAGAGGGCGCGAGCGGCGGCGCTGCCTGCAGCCTGCAGCCTGCAGCCTCCGGCCGG
  CCGGCGAGCCAGTGCGCGTGCGCGGCGGCGGCCTCCGCAGCGACCGGGGAGCGGACTGACCGGCGGGAGG
  GCTAGCGAGCCAGCGGTGTGAGGCGCGAGGCGAGGCCGAGCCGCGAGCGACATGGGGGACCGGGAGCAGC
  TG
Letters can be upper or lower case. Spaces and other non-letter characters in the sequence are ignored. Letter U is translated to T. All letters not equal to A, C, G, T or U are treated as unknown (N). The sequences can be of any length.

All lines starting with `#' are treated as comment lines, lines starting with `%%' may contain annotation (see below). The execution time of the program is roughly proportional to the sequence length.

Output from HMMgene

The output of the program is in GFF format, which is a sequence annotation format developed with gene finding in mind. It is very simple and therefore it is easy to develop programs in perl or awk to post-process the output. The following is an example of the form it takes with hmmgene.

Note that hmmgene only predicts coding regions. That is, the first exon (`firstex' below) is only the coding part of the first coding exon and similarly for the last exon (`lastex' below). Below a `gene' therefore means the region of the gene from start to stop codon.

SEQ1 HMMgene1.1 firstex 692     702     0.347   +  2    bestparse:cds_1
SEQ1 HMMgene1.1 exon_1  2473    2711    0.421   +  1    bestparse:cds_1
SEQ1 HMMgene1.1 exon_2  2897    3081    0.544   +  0    bestparse:cds_1
SEQ1 HMMgene1.1 exon_3  10376   10563   0.861   +  2    bestparse:cds_1
SEQ1 HMMgene1.1 exon_4  11841   11891   0.857   +  2    bestparse:cds_1
SEQ1 HMMgene1.1 exon_5  12387   12483   0.993   +  0    bestparse:cds_1
SEQ1 HMMgene1.1 exon_6  13076   13211   0.970   +  1    bestparse:cds_1
SEQ1 HMMgene1.1 exon_7  13332   13415   0.926   +  1    bestparse:cds_1
SEQ1 HMMgene1.1 exon_8  13515   13603   1.000   +  0    bestparse:cds_1
SEQ1 HMMgene1.1 exon_9  14180   14235   1.000   +  2    bestparse:cds_1
SEQ1 HMMgene1.1 exon_10 14321   14408   0.999   +  0    bestparse:cds_1
SEQ1 HMMgene1.1 exon_11 14483   14579   0.877   +  1    bestparse:cds_1
SEQ1 HMMgene1.1 exon_12 14697   14764   0.639   +  0    bestparse:cds_1
SEQ1 HMMgene1.1 exon_13 14901   15030   0.835   +  1    bestparse:cds_1
SEQ1 HMMgene1.1 lastex  15643   15704   0.987   +  0    bestparse:cds_1
SEQ1 HMMgene1.1 CDS     692     15704   0.132   +  .    bestparse:cds_1
(the real list is tab separated)

Columns

  1. Sequence identifier
  2. Program name
  3. Prediction (see table below for the meaning).
  4. Beginning
  5. End
  6. Score between 0 and 1
  7. Strand: $+$ for direct and $-$ for complementary
  8. Frame (for exons it is the position of the donor in the frame)
  9. Group to which prediction belong. If several CDS's are found they will be called cds_1, cds_2, etc. `bestparse:' is there because alternative predictions will also be available (see below).
The score that comes with all the exons as well as the entire gene `CDS' above) is a probability, so a value close to one means that the program is fairly certain. (See `Known Bugs'.) The program also outputs some comment lines which are preceeded by `#'.

Predictions

Name Meaning
firstex The coding part of the first coding exon starting with the first base of the start codon.
exon_N The N'th predicted internal coding exon.
lastex The coding part of the last coding exon ending with the last base of the stop codon.
singleex The coding part of an exon in a gene with only one coding exon.
CDS Coding region composed of the exon predictions prior to this line.
START Predicted start codon with position of first and last base (only with signal option).
STOP Predicted stop codon with position of first and last base (only with signals option).
DON Predicted donor site with position of the base before and after the splice site (only with signal option).
ACC Predicted acceptor site with position of the base before and after the splice site (only signal option).

Options

Predict signals

Predict splice sites and start/stop codons associated probabilities. The output format is GFF as above.

The signal prediction is different from most other predictors of splice sites and start/stop, in that only signals that fit well into a whole gene structure is predicted, i.e., the signals are not predicted from the local sequence alone. This yields fewer predictions and usually better, however, if there is an error that frameshifts an actual gene or something like that, the splice sites might be missed as well as the gene.

Alternative predictions

The predicted genes in a sequence are the most probable ones according to the program (or rather the underlying hidden Markov model). It is possible to also see suboptimal predictions. For instance, to see the 3 most probable predictions, hmmgene is run with `3 best predictions' instead of `best prediction'. The program will run approximately 3 times slower in that case.

Because of the slow-down of the program and the large amount of information produced, it is best to use this option on a region, where it is likely that there is only one gene. Then it will be possible to see alternative ways of splicing it together. Although it is quite possible that real alternative splicing can be predicted in this way, this has not yet been investigated. Whether a gene is alternatively spliced or not, it will often be usefull to see the alternative possibilities that might score almost as well as the best prediction.

Annotation

If something is known about one or more of the sequences, it can be specified either in a separate annotation file or in the sequence file. For instance if it is known that SEQ2 is non-coding from base number 105 to 443, the annotation file must contain a line of the form

SEQ2 non-coding 105 443

  • coding
  • non-coding
  • intron
  • non-intron
  • intergenic
  • non-intergenic
Note that these keywords must appear exactly as written here (lower case). An optional + or - at the end of a line indicates direct strand (the direction of the sequence in the file) or the complementary strand.

The same can be specified in the sequence file by preceeding each line with `%%',

%% SEQ2 non-coding 105 443 This has to come before the actual sequence in the file, e.g., all annotation lines can come in the very beginning.

This is very useful if there are database hits to a sequence or if repeats are mapped by some other program. Assume for instance that there is a database hit to base 1503-1594 and alu repeats are found at position 10731-10890 and 13205-13356 in SEQ2. Then one might want to enter the lines

SEQ2 coding 1503 1594 +
SEQ2 non-coding 1503 1594 -
SEQ2 non-coding 10731 10890
SEQ2 non-coding 13205 13356
Here we indicated that the sequence is coding on the direct strand from 1503 to 1594 and non-coding in this region on the complementary strand. The two last lines means that the regions are non-coding on BOTH STRANDS.

Regions specified in the file are not allowed to overlap except on opposite strands. If the annotation you give does not conform to the model, the program will die. This happens for instance if the annotaion you give forces

  • a non-consensus start or stop codon.
  • a donor different from GT, or a an acceptor different from AG.
  • start of an exon less than 25 bases from the beginning of the sequence or an exon extending closer than 10 bases from the end.
  • a stop codon in a coding region.

Known Bugs

For some reason the probability of the final exon is sometimes larger than 1. Usually it is not very much. I can't find the error. Please let me know if it happens in other cases.

If no start codon or stop codon is predicted for a gene (e.g. begins and ends with an intron) the frame information and scores might be wrong.

HMMgene can in principle predict a gene with a stop codon in frame, if splicing happens in the middle of it. I have not yet seen any examples though.


GETTING HELP

Scientific problems:        Technical problems: