Exercise: Predicting genes using Prodigal

Mette Voldby Larsen (metteb@cbs.dtu.dk)


Knowing the sequence of the bases in a genome is all very nice, but on its own it doesn't tell us much about the properties of the organism harboring the genome. For that, we need to identify the genes and the proteins they encode.

Rasmus Wernersson from CBS, DTU, has developed VirtualRibosome, a tool for translating DNA sequences into the corresponding peptide sequences, with an integrated Open Reading Frame (ORF) finder. However, the tool outputs only the longest ORF, so it is not optimal for our purpose: we have an entire genome and expect to find many biologically relevant ORFs.

Instead, Salvatore Cosentino, CBS, DTU, has set up a web service running Prodigal. Prodigal was developed in 2010 by Doug Hyatt et al. from Oak Ridge National Laboratory, USA, and is available as open source code. It is considered state-of-the-art for predicting prokaryotic genes.

If you'd like, you can read more about Prodigal here: Research paper describing Prodigal.

Now, go to the Prodigal service and use it to predict proteins in the assembled draft genome file that you generated during the exercise Assembling NGS data.

When Prodigal has finished (be patient), download the file containing the predicted proteins.

Have a look at the content in a text editor.

  • How many proteins did Prodigal predict? Is this a reasonable number? (Hint: Research paper comparing E. coli genomes).
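If the file is too large to count by eye, the number of predicted proteins can be counted with a few lines of Python. This is only a minimal sketch: it assumes the downloaded file is in FASTA format, where every protein starts with a header line beginning with ">". The records in the demo and the file name mentioned in the comments are made up for illustration.

```python
def count_fasta_records(lines):
    """Count FASTA records by counting header lines (lines starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


# Small in-memory demo with three hypothetical records.
demo = [
    ">protein_1\n", "MKTAYIAKQR\n",
    ">protein_2\n", "MSLLTEVETY\n",
    ">protein_3\n", "MADEEKLPPG\n",
]
print(count_fasta_records(demo))  # prints 3

# For the real file, pass an open file handle instead
# ("predicted_proteins.fasta" is a placeholder name):
#
#     with open("predicted_proteins.fasta") as handle:
#         print(count_fasta_records(handle))
```

Counting header lines rather than sequence lines matters here, because a single protein sequence is often wrapped over several lines.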