Exercise M14


The purpose of the first part of this exercise is to demonstrate how to do rRNA and tRNA annotation. The second part of the exercise focuses on benchmarking different annotation methods.

Key tools used in this exercise:
RNAmmer - A program that uses hidden markov models (HMM) to search for ribosomal RNA subunits. It is based on the HMMER package.
tRNAscan-SE - A program developed at Eddy lab, that searches for tRNA sequences. It runs on unix and linux systems.
R - will be used for the comparison of the gene annotations.

Run the fixR script
Do this by downloading the script, opening a terminal and typing:
sh fixR
Then R should work nicely and be integrated with the emacs.editor. Make typing easier
You can consider running the fix_keyboard script again.
sh fix_keyboard
  1. RNA feature annotations

    When we are doing annotation, we will usually work with the bare DNA sequence in fasta format. Thus we will need to extract the DNA sequences from the genbank files, and store them in fasta format. Sadly as of today, writing parsers, that convert files from one format to another, is a central part of a bioinformaticians work. Today, however, we have done like the chefs on TV and prepared one in advance. So go to the data directory and for each genbank file run:
    gbk2fsa FILE.gbk > FILE.fsa
    An aspiring bioinformatician will look for an easier way to do this. You may recall doing something yesterday along the lines of:
    for FILE in *.gbk
    do gbk2fsa $FILE > `basename $FILE .gbk`.fsa
    To understand the above command you should know that *.gbk translates into all the .gbk files. The 'for', 'do' and 'done' defines a loop, and the 'basename' command removes the .gbk ending from the files. You can get more information on the basename command by typing
    man basename
    By the way, you exit the man program with 'q'.

    First we will search for ribosomal RNA using RNAmmer. RNAmmer takes a DNA fasta file as input, and provides output in a variety of formats: fasta, gff, xml and raw hmmsearch output.
    Now we will run RNAmmer on the genomes to find 16S rRNA. You can see the syntax for RNAmmer with
    man rnammer
    So, to make it easy, we will run RNAmmer in another for loop.
    for FILE in *.gbk
    do rnammer -S bac -m ssu -f `basename $FILE .gbk`.ssu.fsa `basename $FILE .gbk`.fsa &
    How many 16S rRNA genes did you find in each genome?

    Now we will run tRNAscan-SE, to find tRNA genes. tRNAscan-SE is an improvement of an older version of tRNAscan by Fichant and Burks. It uses a covariance model, and uses the old tRNAscan and another algorithm as a quick filter to reduce the high computation time of covariance models.
    for FILE in *.gbk
    do tRNAscan-SE -B -o `basename $FILE .gbk`.trnascan  `basename $FILE .gbk`.fsa &
    While tRNAscan is searching, open a browser and point it to the atlas pages.
    Here you should find the codon usage of the bacteria from the exercise.
    When tRNAscanSE has finished, open the *.trnascan files with less, and see if your predictions are in accordance with the codon usage rose plot.

  2. Benchmarking gene finders

    In this part of the exercise, we will compare predictions by Glimmer3 and genemark.
    As you should have seen by now, the prediction of basic RNA genes is relatively straight forward. Looking for protein coding genes is much more challenging. Especially it is difficult to deal with short reading frames. One of the reasons is that it is very easy for short reading frames to appear randomly by simple chance, while it is very improbable for longer reading frames to appear by simple chance. Thus the it is almost always a real gene if you find a very long reading frame.

    In your data directory you will find glimmer and genemark predictions for all the genomes (except for CP000247 and AP009048, actually :). What we are going to do now, is related to what we did in the first exercise. We will extract the lengths of the predicted proteins and make density plots of this.
    To do this, you should download the following script: getlength.pl, and place it in the bin directory. Then make it executable with this command:
    chmod a+x ~/bin/getlength.pl
    Now we will extract the lengths of the various predicted genes.
    for FILE in *.proteins.fsa
    do getlength.pl $FILE > `basename $FILE .fsa`.length
    for FILE in *.GeneMarkHMM-2.6m
    do getlength.pl $FILE > $FILE.length
    for FILE in *.Glimmer3
    do getlength.pl $FILE > $FILE.length
    for FILE in *.easygene
    do getlength.pl $FILE > $FILE.length
    Now download the R-script densities.R and place it in the data directory.
    Now open densities.R in emacs.
    emacs densities.R
    Notice that the R toolbar is present in the top of the emacs window.
    To start an R session, type Ctrl+x 3 then Ctrl+x o to open a new frame in the emacs window and selecting it. Then press the R button in the toolbar. When you see that R has loaded in one of the frames, go back to the frame with the densities.R script, either by clicking with the mouse, or by pressing Ctrl+x o.

    Now you can execute the whole script by typing Ctrl+c Ctrl+b, or you can execute a block at a time with Ctrl+c Ctrl+c. To execute only the line where the curser currently is, type Ctrl+c Ctrl+n.

    Do you see any differences in the length distributions?
    Are there differences between the results you get for the different organisms?

    Answering the above questions concludes the exercise.