 |
|
Exercise - Module M16 - BLAST matrices, core/pan genome plots, journal club work
Setup exercise
- Set environment and copy files required for the exercise
umask 022
setenv MAKEFILES /home/databases/genomeatlasdb-3.0_cur/make/Makefile
cp -rL ~www/pub/CBS/courses/thaiworkshop08/m16 ~/
cd ~/m16
BLAST matrix
The blastmatrix perl script performs an all-against-all BLAST
comparison of mulitple organisms. For every organism, it calculates how many
proteins are homologus to all organism in the comparison. Including 4 organisms
in a single comparison will leave 4 x 4 = 16 cells in the matrix. The diagonal
of the matrix represent a special case, since this is equal to the genome
compared to itself. Naturally, when aligning a given gene with itself, it will
have perfect match alignment, and these hits are therefor excluded from the
diagonal. This leaves the diagonal as a measure of the number of paraloges
(homology within genomes) whereas all other cells represent the orthologs
(homology between genomes)
The purpose of this exercise is to gather all the predicted translations of
the other groups and build a BLAST matrix showing the homology btween all organisms.
The input for the script is an XML
file. We have provided a template for this file, and shall add only a section defining which genomes to
be compared. This will depend on who has finished the gene prediction and translation ;-)
- Build the configuration
This is done by the scell script make-bm-config.csh - feel free to examine it using less)
.
The scripts simply checks who has finished the *.proteins.fsa file and then build the configuration
file accordingly.
csh make-bm-config.csh blastmatrix.head.xml > blastmatrix.xml
less blastmatrix.xml
- Running blast matrix program
Running the BLAST matrix program scales badly with the number of proteomes (n2) and
therefor takes a long time to finish for all your genomes. Luckily the program features a
caching system which enables it to reuse results from searches it has already
performed. For those of you who are interrested, the program uses MD5 checksums of the input proteomes
to keep track of the results - this is a very convinient way to keep track of
changes in large data files! We have primed the cache with all the proteins sequences of your genomes
so you can expect significantly shorter run time.
perl blastmatrix-1.2.pl blastmatrix.xml > blastmatrix.ps
gmake blastmatrix.pdf
Open the PDF file:m16/blastmatrix.pdf
QUESTION From this plot, can you identify genomes, which appear to share homology?
Can you find genomes, which has a high degree of paralogs (homology within the genome?)
QUESTION Can you identify the least related proteomes?
Phylogenetic trees from 16S rRNA
We will now use one single 16S rRNA gene from each of the genomes you have been assigned to do
a neighbor-joining tree, using CLUSTAL. A small python script helps to convert the tree file
into a Post Script image you can view. Clustal saves the tree as a .dnd per default.
- Extract all predicted 16S rRNA genes from the other groups
cat ~pfh/upload/*rrna.fsa | saco_convert -I fasta | perl -ne '$line = $_;$accession = $1 \
if $line =~ /input_id=(\S+)/;$seq{$accession} = (split(/\t/ , $line))[1]; \
END {foreach my $accession (keys %seq){ print "$accession\t$seq{$accession}\n";}}' \
| saco_convert -O fasta > all.rrna.fsa
less all.rrna.fsa
- Make a mulitple alignments using CLUSTAL and inspect both alignment and tree file
clustalw all.rrna.fsa
less all.rrna.aln
less all.rrna.dnd
- Translated accession numbers into organism names
Working with longer sequences identifiers (e.g. using organism names) are often
impractical due to CLUSTAL truncating identifiers. This is the reason why we
used accession numbers instead. However, we want now to replace the accession
numbers in the .dnd file with proper organism names. For this, we use
the program sed. We will
connect to a central MySQL database, which contains a list of all the accession
numbers we have distributed to the individual groups. In fact, the query builds a
sed script - see for yourself:
mysql -B -N -e "select concat('s/',accession,'/',replace(substring_index(organism_name,' ',2),' ' ,'_'),'_',replace(grp,'/','_'),'/g') from \
MicrobialGenomicsGroup.thaiworkshop08 as t , pfh_public.genbank_complete_seq \
as s , pfh_public.genbank_complete_prj as p where s.pid = p.pid and \
s.genbank = t.accession" > script.sed
less script.sed
- Letting sed translate accession numbers into organism names
sed -f script.sed < all.rrna.dnd > all.rrna.replace.dnd
less all.rrna.replace.dnd
Now, that was much prettier!
- The provided python script treeps.py will draw the neightbour joining tree from the .dnd file
python treeps.py all.rrna.replace.dnd
gmake all.pdf
Open the PDF file:m16/all.pdf
QUESTION Where is your organism on the tree - what other organism are close?
- Reordering the organism based on the tree
The order of the organism of the BLAST matrix is random. You shall now reorder the organisms of the
BLAST matrix. You shall do this manually (sadly!) by opening the XML configuration in an editor called nedit.
Opening nedit will take some seconds - so please be patient! We have arranged the configuration file so that
you simply have to reorder the lines starting with <entry><length>
cp blastmatrix.xml blastmatrix.reorder.xml
nedit blastmatrix.reorder.xml
perl blastmatrix-1.2.pl blastmatrix.reorder.xml > blastmatrix.reorder.ps
gmake blastmatrix.reorder.pdf
Open the PDF file:m16/blastmatrix.reorder.pdf
Two-dimensional clustering
As a final comparison for today will compare mulitple genomic properties of all the genomes your have been
assigned. In fact, as you have been predicting features of your own genomes, you have uploaded your results
to the central MySQL table, you you will now query this table to obtain the results from the other groups.
- Query MySQL to obtain results from the other groups
mysql -B -e "select concat('#',color),organism_name,nprot,rrna_count,trna_count,length,coding_density,ATcontent,\
if(ISNULL(s54),0,s54) as s54,if(ISNULL(s70),0,s70) as s70,if(ISNULL(ecf),0,ecf) as ECF \
from MicrobialGenomicsGroup.thaiworkshop08 as t, pfh_public.phyla as ph, pfh_public.genbank_complete_seq as seq, \
pfh_public.genbank_complete_prj as prj where prj.pid = seq.pid and accession not like '%-%' and \
genbank = accession and phyla = grp" > compare.tbl
less compare.tbl
- Produce a two-dimensional clustering of raw results
perl heatmap < compare.tbl > compare.ps
gmake compare.pdf
Open the PDF file:m16/compare.pdf
QUESTION Why do you think this looks ugly? Can we directly compare any genomic (and numerical) property in this way?
- Again, run the heatmap script, this time using the scale option which will normalize data within columns:
perl heatmap < compare.tbl -scale > compare.scale.ps
gmake compare.scale.pdf
Open the PDF file:m16/compare.scale.pdf
QUESTION Explain why the normalized plot is different
QUESTION Can you identify any of the genomic properties that appear to be correlated? Explain your observations! Does the dendrogram on this plot match the 16S rRNA tree? Why/why not?
QUESTION Could you come up with suggestions as to other genomic properties which could be relevant to look at and include in this comparison?
Concluding remarks
You have now finished the analysis of the 20 genomes you were assigned monday.
It is perhaps time to look back and catch up on what you did. You were given
a genome sequence in fasta format. From this, you were able to
predict the protein coding genes, the ribosomal RNA genes, the transfer RNA
genes, calculate length, AT content, and coding density. Also, as bonus tasks,
you searched the proteins you predicted with Hidden Markov Models to identify the sigma54,
sigma70, and ECF proteins. Your gene predictions were then mapped on to a
circular atlas representation which shows the homology to the other proteomes of
the other groups as well as some DNA structural properties.
Despite the fact that most of you were not familiar with the
UNIX environment, you all did a really good job carrying out the many tasks of
the exercises!
Group work for journal club: Examine the genome
Having finished the analysis of your personal genome, we will now move on to the
journal club work: Together with the publication, we have assigned you a genbank
accession which is relevant to your article. You can find your accession number
from the table below:
The group are listed in this PDF: Presentation Groups
| Accession | Organism | Phylum |
| Group 1 | AP009384 | Azorhizobium caulinodans ORS571 | Alphaproteobacteria |
| Group 2 | CP000951 | Synechococcus sp. PCC 7002 | Cyanobacteria |
| Group 3 | CU633749 | Cupriavidus taiwanensis | Betaproteobacteria |
| Group 4 | AB297474 | Legionella dumoffii plasmid pLD-TEX-KL | Gammaproteobacteria |
| Group 5 | CP000304 | Pseudomonas stutzeri A1501 | Gammaproteobacteria |
| Group 6 | AM422018 | Candidatus Phytoplasma australiense | Chlamydiae |
| Group 7 | AP009493 | Streptomyces griseus IFO 13350 | Actinobacteria |
| Group 8 | CP000854 | Mycobacterium marinum | Actinobacteria |
We will now perform some of the same analysis you have made the past days - on
the genome for the journal club. You shall replace the
XY999999 accession number with the one from the table
above We fell confident, you are able to recognize these
commands, so we will leave out the explanations and the following is a
suggestion as to which analysis would be relevant for your presentation.
set accession = XY999999
set organism = `getgene $accession | perl -ne 'if (/^ ORGANISM /) {s/^ ORGANISM //; s/^([^ ])[^ ]* ([^ 0-9]+( .*)?)$/$1. $2/; print;}'`
echo "Organism name: $organism"
gunzip -c $accession.proteins.fsa.gz > $accession.proteins.fsa
gunzip -c $accession.ann.gz > $accession.ann
gunzip -c $accession.fsa.gz > $accession.fsa
set nprot_annotated = `grep ^CDS $accession.ann | wc -l`
echo "Number of proteins: $nprot_annotated"
perl trnascan.pl < $accession.fsa > $accession.trna.fsa
set trna_count = `grep '>' $accession.trna.fsa | wc -l`
echo "Number of tRNAs: $trna_count"
perl rnammer.pl < $accession.fsa > $accession.rrna.fsa
set rrna_count = `grep '>' $accession.rrna.fsa | wc -l`
echo "Number of rRNAs: $rrna_count"
foreach mol (5s 16s 23s)
set mol_count = `grep "/mol=$mol" $accession.rrna.fsa | wc -l`
echo "Number of $mol rRNAs: $mol_count"
end
set length = `saco_convert -I fasta -O length $accession.fsa`
echo "Genome length: $length"
set coding = `perl -ne 'next unless /CDS_(\d+)\-(\d+)/; $cod += abs($1-$2);END {print $cod}' < $accession.proteins.fsa`
set coding_density = `perl -e "printf ('%0.2f' , $coding / $length);"`
echo "Coding density: $coding_density"
perl -ne 'uc ; next unless /^([A-Za-z]+)/;for($x=0;$x< length($1);$x++){\
$AT++ if ( substr($1,$x,1) eq "A" or substr($1,$x,1) eq "T"); $TOT++;} END {\
printf ("%0.4f\n",(($AT) / $TOT));}' < $accession.fsa > $accession.ATcontent
set ATcontent = `cat $accession.ATcontent`
echo "Global AT content: $ATcontent"
perl genomeatlas.pl -ref $accession.fsa -t "$organism" \
-dnap "Intrinsic Curvature,Stacking Energy,Position Preference,Global Direct Repeats,Global Inverted Repeats,GC Skew,Percent AT" \
-ann $accession.ann > jourclub_genomeatlas.pdf
Open the PDF file:m16/jourclub_genomeatlas.pdf
BLAST atlas
You may find it relevant to compare your organism with other bacteria. In this case,
you need only the proteins of that organism and to construct a configuration file (specified by
the option -blastcfg blast.cfg, just like exercise M12). To obtain the proteome, you need
to download the genome sequence of the organism in question, which can be done like this:
saco_extract -I genbank -O fasta -t < /home/projects/pfh/genomes/data/$accession/$accession.gbk > $accession.proteins.fsa
less $accession.proteins.fsa
Please be aware, that the BLAST search may take longer time compared to those in the exercises, since these searched
have not been cached by our server.
|
|
This file was last modified Thursday 5th of June 2008 07:48:15 GMT |
|
|
|