|
Exercise M14
OBJECTIVES
The purpose of the first part of this exercise is to demonstrate how to do rRNA and tRNA annotation.
The second part of the exercise focuses on benchmarking different annotation methods.
Key tools used in this exercise:
RNAmmer -
A program that uses hidden markov models (HMM) to search for ribosomal RNA
subunits. It is based on the HMMER package.
tRNAscan-SE -
A program developed at Eddy lab, that searches for tRNA sequences. It runs on unix and linux systems.
R - will be used for the comparison of the gene annotations.
Run the fixR script
Do this by downloading the script, opening a terminal and typing:
su
sh fixR
exit
Then R should work nicely and be integrated with the emacs.editor.
Make typing easier
You can consider running the fix_keyboard script again.
sh fix_keyboard
|
|
-
RNA feature annotations
When we are doing annotation, we will usually work with the bare DNA sequence in fasta format.
Thus we will need to extract the DNA sequences from the genbank files,
and store them in fasta format. Sadly as of today, writing parsers,
that convert files from one format to another, is a central part of a
bioinformaticians work.
Today, however, we have done like the chefs on TV and prepared one in advance.
So go to the data directory and for each genbank file run:
gbk2fsa FILE.gbk > FILE.fsa
An aspiring bioinformatician will look for an easier way to do this. You may recall doing something yesterday along the lines of:
for FILE in *.gbk
do gbk2fsa $FILE > `basename $FILE .gbk`.fsa
done
To understand the above command you should know that *.gbk translates into all the .gbk files. The 'for', 'do' and 'done' defines a loop, and the 'basename' command removes the .gbk ending from the files. You can get more information on the basename command by typing
man basename
By the way, you exit the man program with 'q'.
RNAmmer
First we will search for ribosomal RNA using RNAmmer. RNAmmer takes a DNA fasta
file as input, and provides output in a variety of formats: fasta, gff, xml and raw hmmsearch output.
Now we will run RNAmmer on the genomes to find 16S rRNA. You can see the syntax for RNAmmer with
man rnammer
So, to make it easy, we will run RNAmmer in another for loop.
for FILE in *.gbk
do rnammer -S bac -m ssu -f `basename $FILE .gbk`.ssu.fsa `basename $FILE .gbk`.fsa &
done
wait
How many 16S rRNA genes did you find in each genome?
tRNAscan-SE
Now we will run tRNAscan-SE, to find tRNA genes. tRNAscan-SE is an improvement
of an older version of tRNAscan by Fichant and Burks. It uses a covariance
model, and uses the old tRNAscan and another algorithm as a quick filter to
reduce the high computation time of covariance models.
for FILE in *.gbk
do tRNAscan-SE -B -o `basename $FILE .gbk`.trnascan `basename $FILE .gbk`.fsa &
done
wait
While tRNAscan is searching, open a browser and point it to the atlas pages.
http://www.cbs.dtu.dk/services/GenomeAtlas/versions/beta/show-kingdom.php?kingdom=Bacteria
Here you should find the codon usage of the bacteria from the exercise.
When tRNAscanSE has finished, open the *.trnascan files with less, and see if your predictions are in accordance with the codon usage rose plot.
-
Benchmarking gene finders
In this part of the exercise, we will compare predictions by Glimmer3 and genemark.
As you should have seen by now, the prediction of basic RNA genes is relatively straight forward.
Looking for protein coding genes is much more challenging. Especially it is difficult to deal with short reading
frames. One of the reasons is that it is very easy for short reading frames to appear randomly by simple chance,
while it is very improbable for longer reading frames to appear by simple chance. Thus the it is almost always a real
gene if you find a very long reading frame.
In your data directory you will find glimmer and genemark predictions for all the genomes
(except for CP000247 and AP009048, actually :).
What we are going to do now, is related to what we did in the first exercise. We will extract the lengths of the
predicted proteins and make density plots of this.
To do this, you should download the following script:
getlength.pl, and place it in the bin directory.
Then make it executable with this command:
chmod a+x ~/bin/getlength.pl
Now we will extract the lengths of the various predicted genes.
for FILE in *.proteins.fsa
do getlength.pl $FILE > `basename $FILE .fsa`.length
done
for FILE in *.GeneMarkHMM-2.6m
do getlength.pl $FILE > $FILE.length
done
for FILE in *.Glimmer3
do getlength.pl $FILE > $FILE.length
done
for FILE in *.easygene
do getlength.pl $FILE > $FILE.length
done
Now download the R-script densities.R and place it in the data directory.
Now open densities.R in emacs.
emacs densities.R
Notice that the R toolbar is present in the top of the emacs window.
To start an R session, type Ctrl+x 3 then Ctrl+x o to open a new frame in the
emacs window and selecting it. Then press the R button in the toolbar.
When you see that R has loaded in one of the frames, go back to the frame with
the densities.R script, either by clicking with the mouse, or by pressing Ctrl+x o.
Now you can execute the whole script by typing Ctrl+c Ctrl+b,
or you can execute a block at a time with Ctrl+c Ctrl+c. To execute only the line where the curser currently is, type Ctrl+c Ctrl+n.
Do you see any differences in the length distributions?
Are there differences between the results you get for the different organisms?
Answering the above questions concludes the exercise.
|
|