|
Exercise 2 - rRNA and tRNA predictions - searching protein databases
cont. Predicting coding sequences
At the end of last exercise you predicted the number of coding sequences
in your own bacterium. We will now work with these results, and calculate the coding density
and compare with the box-and-whiskers plot you made last time. Also, we will compare your CDS prediction
with that of the genbank entry.
- Examine ORFs
umask 022
setenv MAKEFILES /home/people/pfh/makesystem/make/Makefile
set accession = AE017262
cp -rL ~www/pub/CBS/dtucourse/27101/files/ex2 ~/
cd ~/ex2
cp ~/ex1/$accession.orfs.fsa .
cp ~/ex1/$accession.fsa .
less $accession.orfs.fsa
- Count the predicted CDS' and report to MySQL
set nprot = `grep '>' $accession.orfs.fsa | wc -l`
echo $nprot
mysql -B -e "insert ignore into MicrobialGenomicsGroup.27101_08 (accession) values ('$accession')"
mysql -B -e "update MicrobialGenomicsGroup.27101_08 set nprot = $nprot where accession = '$accession'"
- Compare with box-and-whiskers plot from last exercise
ghostview ~/ex1/cds.ps
- Compare with genbank annotations
set nprot_annotated = `cat /home/projects/MicrobialGenomicsGroup/genbank_complete/data/$accession.gbk | grep ' CDS ' | wc -l`
echo $nprot_annotated
mysql -B -e "update MicrobialGenomicsGroup.27101_08 set nprot_annotated = $nprot_annotated where accession = '$accession'"
QUESTION Do you predict more or fewer protein coding genes than what is annotated in the genbank file
- Translate ORFs into proteins and upload to common directory
6pack -1 < ~/ex1/$accession.orfs.fsa > $accession.proteins.fsa
cp $accession.proteins.fsa ~pfh/upload
less $accession.proteins.fsa
QUESTION How many coding sequences do you find? Find the phyla of your bacterium, and compare with the box-and-whiskers plot you made last time
- Get genome size
set length = `saco_convert -I fasta -O length $accession.fsa`
echo $length
mysql -B -e "update MicrobialGenomicsGroup.27101_08 set length = $length where accession = '$accession'"
- Compare with box-and-whiskers plot
ghostview ~/ex1/length.ps
QUESTION What is the length of your genome? Find the phyla of your bacterium, and compare with the box-and-whiskers plot you made last time
- Calculate coding density
set coding = `perl -ne 'next unless /CDS_(\d+)\-(\d+)/; $cod += abs($1-$2);END {print $cod}' < ~/ex1/$accession.orfs.fsa`
set coding_density = `perl -e "printf ('%0.2f' , $coding / $length);"`
echo $coding_density
mysql -B -e "update MicrobialGenomicsGroup.27101_08 set coding_density = $coding_density where accession = '$accession'"
- Compare with box-and-whiskers plot
ghostview ~/ex1/coding_density.ps
QUESTION What is the coding density of your bacterium? Find the phyla of your species, and compare with the box-and-whiskers plot you made last time.
Introduction: searching for conserved genes
The 5S rRNA genes constitute a conserved RNA structure which is only partly conserved on the sequence level.
It makes the task of predicting the location of 5S rRNA genes in novel sequences more difficult, when using
sequence based methods. In this exercise you will experience, that one needs to account for observed variation and
not rely on a single templates and a BLAST search when searching in novel sequences.
We have extracted the 5S rRNA gene of E. coli K12, rrnE operon, accession U00096.
We will try to use this as a template for finding 5S rRNA in an example sequence:Mycoplasma genitalium G37.
- Find 5S rRNA gene using E. coli as template
# we download a small genome genome sequence as example
perl getseq.pl L43967 > L43967.fsa
# we convert this into a BLAST database we can search against
formatdb -i L43967.fsa -p F -t L43967
# we make a blast search, finding an Ecoli 5S rRNA gene in the genome we just downloaded
blastall -F no -p blastn -d L43967.fsa -i 5s_Ecoli_u00096.fsa > 5s_Ecoli_u00096.out
less 5s_Ecoli_u00096.out
- Find 5S rRNA gene using an example M. gallisepticum as template
blastall -F no -p blastn -d L43967.fsa -i 5s_Mgallisepticum_AE015450.fsa > 5s_Mgallisepticum_AE015450.out
less 5s_Mgallisepticum_AE015450.out
QUESTION Please comment on these results - are you satisfied with the alignments?
We will now try a different approach using a profile hidden markov model. You can find a quick explanation of HMM here.
The profile HMM accounts for the
variability of a larger number of observed sequences. We have prepared an alignment of multiple 5S rRNA sequences located
in the file 5s_bacteria.fsa and you will now use the software package HMMER
to build, calibrate and test a 5S rRNA profile HMM.
- Build a profile HMM from 5S multiple alignment
# see how the alignment looks
less 5s_bacteria.fsa
# construct an HMM from the alignment
hmmbuild -F 5s_bacteria.hmm 5s_bacteria.fsa
# the model needs to be calibrated - this is done using random sequences
hmmcalibrate 5s_bacteria.hmm
- Examine the HMM
less 5s_bacteria.hmm
- Finally, we search the the genome sequence using the profile HMM:
hmmsearch 5s_bacteria.hmm L43967.fsa > L43967.hmmsearch.out
less L43967.hmmsearch.out
QUESTION Please comment of the result - how does the alignment look now - why do you think this is better?
Predicting rRNA genes
The downside of profile HMM is the speed! Searching a one mega base genome
for a 16S rRNA gene takes roughly 20 minutes. Lagesen and co-authors have implemented a library of HMM using
shorter 'spotter' models, which uses only the most conserved regions to estimate the proximate location of the gene.
Once this is established, it uses a full-length model to accurately determine the start and stop position of the
rRNA gene. This software is called RNAmmer, and is available both as traditional web application and a web service.
We will continue with the example genome of Mycoplasma genitalium G37 and see how this tool works.
Web application method
- Download Mycoplasma genitalium G37
Download the sequence of the Mycoplasma genitalium G37 genome, accession L43967. We have prepared a
link to NCBI for this. Copy the entire fasta sequence into your clip board (CTRL-A, CTRL-C).
- Go to RNAmmer Go to the RNAmmer web site and familiarize yourself with the page.
Paste in (CTRL-V) the genome sequence in the sequence input field, and chose 'Bacteria' as sequence kingdom
- Submit
Submit your job to the server and wait - it will takes only a few minutes. The default output gives the gene annotations,
but the gene sequences as well as more detailed information is available for download.
Web Services
We have prepared a Web Services script, rnammer.pl which simply reads the genome sequence from STDIN.
You will now use this to predict the rRNA genes in your own bacterium:
- Fetch genome sequence, run RNAmmer, and report result
cd ~/ex2
perl getseq.pl L43967 > L43967.fsa
perl rnammer.pl < L43967.fsa > L43967.rrna.fsa
set rrna_count = `grep '>' L43967.rrna.fsa | wc -l`
echo $rrna_count
mysql -B -e "update MicrobialGenomicsGroup.27101_08 set rrna_count = '$rrna_count' where accession='L43967-$user'"
rRNA genes in your bacterium
The members of group stud043 have been assigned the following sequence:
Listeria monocytogenes str. 4b F2365 , accession AE017262
- Run RNAmmer on Listeria monocytogenes str. 4b F2365
cd ~/ex2
set accession = AE017262
perl getseq.pl $accession > $accession.fsa
perl rnammer.pl < $accession.fsa > $accession.rrna.fsa
set rrna_count = `grep '>' $accession.rrna.fsa | wc -l`
echo $rrna_count
mysql -B -e "update MicrobialGenomicsGroup.27101_08 set rrna_count = $rrna_count where accession = '$accession'"
cp $accession.rrna.fsa ~pfh/upload
- Compare with box-and-whiskers plot
perl boxplot -f 5 -x '# rRNA genes' -t "Distribution of genomic properties in 690 complete prokarytic genomes" < genomestat.tab > rrna.ps
gmake rrna.pdf
ghostview ~/ex2/rrna.ps
QUESTION How many rRNA genes do you find? Find the phyla of your bacterium, and compare with the box-and-whiskers plot
tRNA genes in your bacterium
We will use the program tRNA-scan SE to predict the number of tRNA genes in your bacterium. This time, we jump
directly to the web services script:
- Run tRNAscan Web Service and report result
cd ~/ex2
set accession = AE017262
perl trnascan.pl < $accession.fsa > $accession.trna.fsa
set trna_count = `grep '>' $accession.trna.fsa | wc -l`
mysql -B -e "update MicrobialGenomicsGroup.27101_08 set trna_count = $trna_count where accession = '$accession'"
- Compare with box-and-whiskers plot
perl boxplot -f 4 -x '# tRNA genes' -t "Distribution of genomic properties in 690 complete prokarytic genomes" < genomestat.tab > trna.ps
gmake trna.pdf
ghostview ~/ex2/trna.ps
QUESTIONWhat is the function of tRNA genes and how many would you intuitively expect to find? How many tRNA genes did you find - explain if there isa difference. Find the phyla of your bacterium, and compare with the box-and-whiskers plot.
Wrapping up your results
You have now made a draft annotation of your personal bacterium, and we will now create
an annotation file which we will use later.
- Join your annotations into a single file
cat $accession.trna.fsa ~/ex1/$accession.orfs.fsa $accession.rrna.fsa > $accession.all.fsa
perl -ne 'next unless /^>(\w+)_(\d+)\-(\d+)/; my @F = ($1,$2,$3); ($F[1],$F[2]) = ($3,$2) if $3 < $2; $F[3] = "+"; $F[3] = "-" if $3 < $2; \
printf "%-15s%9s%9s %s %s%sto%s\n",$F[0],$F[1],$F[2],$F[3],$F[0],$F[1],$F[2];' < $accession.all.fsa | sort -k2 -n > $accession.ann
less $accession.ann
QUESTION See if you can identify the operons of rRNA genes. The annotation list is sorted by position (done by sort -k2 -n). Writing slash ('/') in less will allow you to search for the text 'rRNA'.
- Assignment of protein function
You have predicted the coding sequences, but apart from this, you do not know much about the function of the genes. A quick
and rough search against sequence databases might give you a hit about what a specific gene of your interest, is responsible for.
Now, you should take (quite arbitrarily) the first gene you have predicted, and figure out what it does, if possible. We will do so by making a BLAST search, first against SwissProt ( a relatively reliable, small database of curated proteins ). If you find no (good) hits
to this database, we continue a search against NR ( a bigger database, containg all sequence database currently available!)
# try this first
saco_convert -I fasta $accession.proteins.fsa | \
head -1 | saco_convert -O fasta | \
blastall -p blastp -d sp | less
# ... and this if you don't get a hit with SwissProt
saco_convert -I fasta $accession.proteins.fsa | \
head -1 | saco_convert -O fasta | \
blastall -p blastp -d nr | less
QUESTION See if you can figure out what this protein is responsible for.
- Predict presence of sigma factors
We have constructed protein based HMMs to find the presence of sigma factors sigma70, sigma54 and ECF sigma factors.
ln -s $accession.proteins.fsa $accession.Gprot.fsa
saco_convert -I fasta -O tab $accession.Gprot.fsa > $accession.Gprot.tab
# 'gmake' will detect, that you have a file called '$accession.Gprot.fsa' and it will
# know which rule should be applied in order to make '$accession.s54.sigmas.fsa'
gmake -j $accession.s54.sigmas.fsa
set s54 = `grep '>' $accession.s54.sigmas.fsa | wc -l`
echo $s54
mysql -B -e "update MicrobialGenomicsGroup.27101_08 set s54 = $s54 where accession = '$accession'"
QUESTION How many matches do you find - you may not find any matches at all
Predict Sigma 70
gmake -j $accession.s70.sigmas.fsa
set s70 = `grep '>' $accession.s70.sigmas.fsa | wc -l`
echo $s70
mysql -B -e "update MicrobialGenomicsGroup.27101_08 set s70 = $s70 where accession = '$accession'"
QUESTION How many matches do you find - you may not find any matches at all
Predict ECF sigma factors
gmake -j $accession.ecf.sigmas.fsa
set ecf = `grep '>' $accession.ecf.sigmas.fsa | wc -l`
echo $ecf
mysql -B -e "update MicrobialGenomicsGroup.27101_08 set ecf = $ecf where accession = '$accession'"
QUESTION How many matches do you find - you may not find any matches at all
Perhaps it is time to look back to see what you have done so far. From the DNA sequence of a bacterial genome, you
have run tools, which has provided you with the following information:
- Genome length in pase pairs
- Predicted protein coding genes
- Predicted tRNA genes
- Predicted rRNA genes
- Calculated the coding density of your genome
- Created an annotation file of all the predicted features (in file $accession.ann)
Next time we will make genome visualizations comparing the proteins of your organism with those predicted
in all of the other groups, look at codon usage diagrams.
|