Current version (1.2)
Large-scale prokaryotic gene prediction and comparison
to genome annotation.
P. Nielsen and A. Krogh.
Bioinformatics: 21:4322-4329,
2005.
Bioinformatics Centre, Institute of Molecular Biology and Physiology,
University of Copenhagen, Universitetsparken 15, 2100 Copenhagen, Denmark
PMID: 16249266
MOTIVATION: Prokaryotic genomes are sequenced and annotated at an increasing
rate. The methods of annotation vary between sequencing groups. This makes genome
comparison difficult and may lead to propagation of errors when questionable
assignments are adapted from one genome to another. Genome comparison either on
a large or small scale would be facilitated by using a single standard for
annotation, which incorporates a transparency of why an open reading frame
(ORF) is considered to be a gene.
RESULTS: A total of 143 prokaryotic genomes
were scored with an updated version of the prokaryotic genefinder EasyGene.
Comparison of the GenBank and RefSeq annotations with the EasyGene predictions
reveals that in some genomes up to approximately 60% of the genes may have been
annotated with a wrong start codon, especially in the GC-rich genomes. The
fractional difference between annotated and predicted confirms that too many
short genes are annotated in numerous organisms. Furthermore, genes might be
missing in the annotation of some of the genomes. We predict 41 of 143 genomes
to be over-annotated by >5%, meaning that too many ORFs are annotated as genes.
We also predict that 12 of 143 genomes are under-annotated. These results are
based on the difference between the number of annotated genes not found by
EasyGene and the number of predicted genes that are not annotated in GenBank.
We argue that the average performance of our standardized and fully automated
method is slightly better than the annotation.
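The over- and under-annotation estimate described above can be illustrated with a minimal sketch. The abstract states only that the prediction rests on the difference between the two counts. Dividing that difference by the total number of annotated genes, and applying the same 5% cutoff in the negative direction for under-annotation, are assumptions made here for illustration. The function names are hypothetical and are not part of EasyGene.

```python
def annotation_bias(annotated_total, annotated_not_predicted, predicted_not_annotated):
    """Signed fractional difference between the two disagreement counts.

    Positive values suggest over-annotation (more annotated genes are missed
    by the predictor than new genes are predicted); negative values suggest
    under-annotation. Normalizing by annotated_total is an assumption.
    """
    if annotated_total <= 0:
        raise ValueError("annotated_total must be positive")
    return (annotated_not_predicted - predicted_not_annotated) / annotated_total


def classify(bias, threshold=0.05):
    """Label a genome using a symmetric cutoff (the symmetry is assumed)."""
    if bias > threshold:
        return "over-annotated"
    if bias < -threshold:
        return "under-annotated"
    return "consistent"


# Made-up counts for one genome: 4000 annotated genes, 400 of them not
# predicted, and 100 predicted genes that are not annotated.
bias = annotation_bias(4000, 400, 100)
label = classify(bias)
```

With these invented counts the bias is 300/4000, which is above the 5% cutoff, so the genome would be labeled over-annotated under the stated assumptions.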
Original EasyGene paper
EasyGene - a prokaryotic gene finder that ranks ORFs by statistical
significance.
Thomas Schou Larsen (1) and Anders Krogh (1,2).
BMC Bioinformatics: 4:21,
2003
(1) Center for Biological Sequence Analysis BioCentrum,
Technical University of Denmark, Building 208, 2800 Lyngby, Denmark
(2) Present address: The Bioinformatics Centre,
University of Copenhagen, Universitetsparken 15, 2100 Copenhagen, Denmark
PMID: 12783628
BACKGROUND: Contrary to other areas of sequence analysis, a measure of
statistical significance of a putative gene has not been devised to help in
discriminating real genes from the masses of random Open Reading Frames (ORFs)
in prokaryotic genomes. Therefore, many genomes have too many short ORFs
annotated as genes.
RESULTS: In this paper, we present a new automated
gene-finding method, EasyGene, which estimates the statistical significance of
a predicted gene. The gene finder is based on a hidden Markov model (HMM) that
is automatically estimated for a new genome. Using extensions of similarities
in Swiss-Prot, a high quality training set of genes is automatically extracted
from the genome and used to estimate the HMM. Putative genes are then scored
with the HMM, and based on score and length of an ORF, the statistical
significance is calculated. The measure of statistical significance for an ORF
is the expected number of ORFs in one megabase of random sequence at the same
significance level or better, where the random sequence has the same statistics
as the genome in the sense of a third order Markov chain.
CONCLUSIONS: The
result is a flexible gene finder whose overall performance matches or exceeds
other methods. The entire pipeline of computer processing from the raw input of
a genome or set of contigs to a list of putative genes with significance is
automated, making it easy to apply EasyGene to newly sequenced organisms.
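The idea behind the significance measure, the expected number of ORFs per megabase of random sequence that shares the genome's third-order Markov statistics, can be sketched with a small simulation. This is an illustration only and not the EasyGene implementation: it estimates the count by sampling instead of by whatever calculation the paper uses, it ranks ORFs by length alone where EasyGene combines HMM score and length, and it scans only the forward strand. The start codon set and all function names are assumptions.

```python
import random
from collections import Counter, defaultdict

STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}  # assumed start codons for this sketch


def fit_markov3(genome):
    """Count next-base frequencies for every 3-base context in the genome."""
    counts = defaultdict(Counter)
    for i in range(len(genome) - 3):
        counts[genome[i:i + 3]][genome[i + 3]] += 1
    if not counts:
        raise ValueError("genome must contain at least 4 bases")
    return counts


def sample_markov3(counts, length, rng):
    """Generate a random sequence from the fitted third-order chain."""
    contexts = list(counts)
    seq = list(rng.choice(contexts))
    while len(seq) < length:
        nxt = counts.get("".join(seq[-3:]))
        if not nxt:  # context never seen in the genome: restart from a known one
            seq.extend(rng.choice(contexts))
            continue
        bases, weights = zip(*nxt.items())
        seq.append(rng.choices(bases, weights=weights)[0])
    return "".join(seq[:length])


def orf_lengths(seq):
    """Lengths in codons (start included, stop excluded) of forward-strand ORFs."""
    lengths = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in STARTS:
                start = i
            elif start is not None and codon in STOPS:
                lengths.append((i - start) // 3)
                start = None
    return lengths


def expected_orfs_per_mb(genome, min_codons, n_bases=1_000_000, seed=0):
    """Simulated count of random ORFs at least min_codons long, scaled to 1 Mb."""
    rng = random.Random(seed)
    rand_seq = sample_markov3(fit_markov3(genome), n_bases, rng)
    hits = sum(1 for n in orf_lengths(rand_seq) if n >= min_codons)
    return hits * 1_000_000 / n_bases
```

Under these assumptions, a small returned value for a given length cutoff means that few ORFs of that length arise by chance in sequence resembling the genome, which is the intuition the abstract attaches to a significant prediction.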