Unlike in other areas of sequence analysis, no measure of the statistical
significance of a putative gene has been devised to help discriminate
real genes from the mass of random open reading frames (ORFs) in prokaryotic
genomes. As a result, many genomes have too many short ORFs annotated as genes.
In this paper, we present a new automated gene-finding method, EasyGene, which
gives the statistical significance of a predicted gene. The gene finder is
based on a hidden Markov model (HMM) that is automatically estimated for a new
genome. Using extensions of similarities in Swiss-Prot, a high-quality
training set of genes is automatically extracted from the genome and used to
estimate the HMM. Putative genes are then scored with the HMM, and the
statistical significance is calculated from the score and length of the ORF. The
measure of statistical significance for an ORF is the expected number of ORFs
in one megabase of random sequence at the same significance level or better,
where the random sequence has the same statistics as the genome in the sense of
a third-order Markov chain.
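
To make the null model behind this measure concrete, the following is a minimal sketch, not the EasyGene implementation. It estimates a third-order Markov chain from a genome, simulates random sequence with the same statistics, and counts ORFs per megabase. Several simplifications are assumptions of this sketch: the expectation is estimated by simulation, only ORF length is thresholded (EasyGene combines HMM score and length), only the forward strand is scanned, and ATG is the only start codon. All function and parameter names are hypothetical.

```python
import random
from collections import defaultdict

STOPS = {"TAA", "TAG", "TGA"}
START = "ATG"  # assumption: ATG is the only start codon considered


def train_markov3(genome):
    """Count next-base occurrences for every 3-base context in the genome."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(genome) - 3):
        counts[genome[i:i + 3]][genome[i + 3]] += 1
    return counts


def sample_sequence(counts, length, rng):
    """Simulate a sequence from the estimated third-order Markov chain."""
    contexts = list(counts)
    seq = list(rng.choice(contexts))
    while len(seq) < length:
        successors = counts.get("".join(seq[-3:]))
        if not successors:
            # Context never seen with a successor: restart from a random context.
            seq.extend(rng.choice(contexts))
            continue
        bases, weights = zip(*successors.items())
        seq.append(rng.choices(bases, weights=weights)[0])
    return "".join(seq[:length])


def count_orfs(seq, min_codons):
    """Count forward-strand ORFs (first ATG to next in-frame stop)
    with at least min_codons codons before the stop codon."""
    total = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    total += 1
                start = None
    return total


def expected_orfs_per_mb(genome, min_codons, sim_length=1_000_000, seed=0):
    """Estimate the number of ORFs of at least min_codons codons
    per megabase of random sequence with the genome's statistics."""
    rng = random.Random(seed)
    counts = train_markov3(genome)
    random_seq = sample_sequence(counts, sim_length, rng)
    return count_orfs(random_seq, min_codons) * 1_000_000 / sim_length
```

Under these assumptions, a call such as `expected_orfs_per_mb(genome, min_codons=100)` gives a rough estimate of how many ORFs of that length or longer would arise by chance in one megabase. A small value suggests that an ORF of that length is unlikely to be random.
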
The result is a flexible gene finder whose overall
performance matches or exceeds that of other methods. The entire
pipeline of computer processing from the raw input of a genome or set of
contigs to a list of putative genes with significance is automated, making it
easy to apply EasyGene to newly sequenced organisms. EasyGene with pre-trained
models can be accessed at
http://www.cbs.dk/services/EasyGene.