Setup exercise

  1. Set environment and copy files required for the exercise
    umask 022
    setenv MAKEFILES /home/databases/genomeatlasdb-3.0_cur/make/Makefile
    cp -rL ~www/pub/CBS/courses/thaiworkshop08/m16 ~/
    cd ~/m16
    

BLAST matrix

The blastmatrix perl script performs an all-against-all BLAST comparison of mulitple organisms. For every organism, it calculates how many proteins are homologus to all organism in the comparison. Including 4 organisms in a single comparison will leave 4 x 4 = 16 cells in the matrix. The diagonal of the matrix represent a special case, since this is equal to the genome compared to itself. Naturally, when aligning a given gene with itself, it will have perfect match alignment, and these hits are therefor excluded from the diagonal. This leaves the diagonal as a measure of the number of paraloges (homology within genomes) whereas all other cells represent the orthologs (homology between genomes)

The purpose of this exercise is to gather all the predicted translations of the other groups and build a BLAST matrix showing the homology btween all organisms.

The input for the script is an XML file. We have provided a template for this file, and shall add only a section defining which genomes to be compared. This will depend on who has finished the gene prediction and translation ;-)
  1. Build the configuration This is done by the scell script make-bm-config.csh - feel free to examine it using less)

    . The scripts simply checks who has finished the *.proteins.fsa file and then build the configuration file accordingly.

    csh make-bm-config.csh blastmatrix.head.xml > blastmatrix.xml
    less blastmatrix.xml
    
  2. Running blast matrix program Running the BLAST matrix program scales badly with the number of proteomes (n2) and therefor takes a long time to finish for all your genomes. Luckily the program features a caching system which enables it to reuse results from searches it has already performed. For those of you who are interrested, the program uses MD5 checksums of the input proteomes to keep track of the results - this is a very convinient way to keep track of changes in large data files! We have primed the cache with all the proteins sequences of your genomes so you can expect significantly shorter run time.
    perl blastmatrix-1.2.pl blastmatrix.xml > blastmatrix.ps
    gmake blastmatrix.pdf
    Open the PDF file:m16/blastmatrix.pdf
    


    QUESTION From this plot, can you identify genomes, which appear to share homology? Can you find genomes, which has a high degree of paralogs (homology within the genome?) QUESTION Can you identify the least related proteomes?

Phylogenetic trees from 16S rRNA

We will now use one single 16S rRNA gene from each of the genomes you have been assigned to do a neighbor-joining tree, using CLUSTAL. A small python script helps to convert the tree file into a Post Script image you can view. Clustal saves the tree as a .dnd per default.
  1. Extract all predicted 16S rRNA genes from the other groups
    cat ~pfh/upload/*rrna.fsa | saco_convert -I fasta | perl -ne '$line = $_;$accession = $1 \
     if $line =~ /input_id=(\S+)/;$seq{$accession} =  (split(/\t/ , $line))[1]; \
     END {foreach my $accession (keys %seq){ print "$accession\t$seq{$accession}\n";}}' \
     | saco_convert -O fasta > all.rrna.fsa
    less all.rrna.fsa
    

  2. Make a mulitple alignments using CLUSTAL and inspect both alignment and tree file
    clustalw all.rrna.fsa
    less all.rrna.aln
    less all.rrna.dnd
    

  3. Translated accession numbers into organism names
    Working with longer sequences identifiers (e.g. using organism names) are often impractical due to CLUSTAL truncating identifiers. This is the reason why we used accession numbers instead. However, we want now to replace the accession numbers in the .dnd file with proper organism names. For this, we use the program sed. We will connect to a central MySQL database, which contains a list of all the accession numbers we have distributed to the individual groups. In fact, the query builds a sed script - see for yourself:
    mysql -B -N -e "select concat('s/',accession,'/',replace(substring_index(organism_name,' ',2),' ' ,'_'),'_',replace(grp,'/','_'),'/g') from \
      MicrobialGenomicsGroup.thaiworkshop08 as t , pfh_public.genbank_complete_seq \
      as s , pfh_public.genbank_complete_prj as p where s.pid = p.pid and \
      s.genbank = t.accession" > script.sed
    less script.sed
    

  4. Letting sed translate accession numbers into organism names
    sed -f script.sed < all.rrna.dnd > all.rrna.replace.dnd
    less all.rrna.replace.dnd
    

    Now, that was much prettier!
  5. The provided python script treeps.py will draw the neightbour joining tree from the .dnd file
    python treeps.py all.rrna.replace.dnd
    gmake all.pdf
    Open the PDF file:m16/all.pdf
    

    QUESTION Where is your organism on the tree - what other organism are close?
  6. Reordering the organism based on the tree
    The order of the organism of the BLAST matrix is random. You shall now reorder the organisms of the BLAST matrix. You shall do this manually (sadly!) by opening the XML configuration in an editor called nedit. Opening nedit will take some seconds - so please be patient! We have arranged the configuration file so that you simply have to reorder the lines starting with <entry><length>
    cp blastmatrix.xml blastmatrix.reorder.xml
    nedit blastmatrix.reorder.xml
    perl blastmatrix-1.2.pl blastmatrix.reorder.xml > blastmatrix.reorder.ps
    gmake blastmatrix.reorder.pdf
    
    Open the PDF file:m16/blastmatrix.reorder.pdf
    

Two-dimensional clustering

As a final comparison for today will compare mulitple genomic properties of all the genomes your have been assigned. In fact, as you have been predicting features of your own genomes, you have uploaded your results to the central MySQL table, you you will now query this table to obtain the results from the other groups.
  1. Query MySQL to obtain results from the other groups
    
    mysql -B -e "select concat('#',color),organism_name,nprot,rrna_count,trna_count,length,coding_density,ATcontent,\
       if(ISNULL(s54),0,s54) as s54,if(ISNULL(s70),0,s70) as s70,if(ISNULL(ecf),0,ecf) as ECF \
       from MicrobialGenomicsGroup.thaiworkshop08 as t, pfh_public.phyla as ph, pfh_public.genbank_complete_seq as seq, \
       pfh_public.genbank_complete_prj as prj  where prj.pid = seq.pid and accession not like '%-%' and \
       genbank = accession and phyla = grp" > compare.tbl
    less compare.tbl
    
    


  2. Produce a two-dimensional clustering of raw results
    perl heatmap < compare.tbl > compare.ps 
    gmake compare.pdf
    Open the PDF file:m16/compare.pdf 
    
    QUESTION Why do you think this looks ugly? Can we directly compare any genomic (and numerical) property in this way?

  3. Again, run the heatmap script, this time using the scale option which will normalize data within columns:
    perl heatmap < compare.tbl -scale > compare.scale.ps 
    gmake compare.scale.pdf
    Open the PDF file:m16/compare.scale.pdf
    
    QUESTION Explain why the normalized plot is different

    QUESTION Can you identify any of the genomic properties that appear to be correlated? Explain your observations! Does the dendrogram on this plot match the 16S rRNA tree? Why/why not?

    QUESTION Could you come up with suggestions as to other genomic properties which could be relevant to look at and include in this comparison?



    Concluding remarks

    You have now finished the analysis of the 20 genomes you were assigned monday. It is perhaps time to look back and catch up on what you did. You were given a genome sequence in fasta format. From this, you were able to predict the protein coding genes, the ribosomal RNA genes, the transfer RNA genes, calculate length, AT content, and coding density. Also, as bonus tasks, you searched the proteins you predicted with Hidden Markov Models to identify the sigma54, sigma70, and ECF proteins. Your gene predictions were then mapped on to a circular atlas representation which shows the homology to the other proteomes of the other groups as well as some DNA structural properties. Despite the fact that most of you were not familiar with the UNIX environment, you all did a really good job carrying out the many tasks of the exercises!

    Group work for journal club: Examine the genome

    Having finished the analysis of your personal genome, we will now move on to the journal club work: Together with the publication, we have assigned you a genbank accession which is relevant to your article. You can find your accession number from the table below:


    The group are listed in this PDF: Presentation Groups

    AccessionOrganismPhylum
    Group 1AP009384Azorhizobium caulinodans ORS571Alphaproteobacteria
    Group 2CP000951Synechococcus sp. PCC 7002Cyanobacteria
    Group 3CU633749Cupriavidus taiwanensisBetaproteobacteria
    Group 4AB297474Legionella dumoffii plasmid pLD-TEX-KLGammaproteobacteria
    Group 5CP000304Pseudomonas stutzeri A1501Gammaproteobacteria
    Group 6AM422018Candidatus Phytoplasma australienseChlamydiae
    Group 7AP009493Streptomyces griseus IFO 13350Actinobacteria
    Group 8CP000854Mycobacterium marinumActinobacteria


    We will now perform some of the same analysis you have made the past days - on the genome for the journal club. You shall replace the XY999999 accession number with the one from the table above We fell confident, you are able to recognize these commands, so we will leave out the explanations and the following is a suggestion as to which analysis would be relevant for your presentation.

    set accession = XY999999
    
    set organism = `getgene $accession | perl -ne 'if (/^  ORGANISM  /) {s/^  ORGANISM  //; s/^([^ ])[^ ]* ([^ 0-9]+( .*)?)$/$1. $2/; print;}'`
    echo "Organism name: $organism"
    
    gunzip -c $accession.proteins.fsa.gz > $accession.proteins.fsa
    gunzip -c $accession.ann.gz > $accession.ann
    gunzip -c $accession.fsa.gz > $accession.fsa
    
    set nprot_annotated = `grep ^CDS $accession.ann | wc -l`
    echo "Number of proteins: $nprot_annotated"
    
    perl trnascan.pl < $accession.fsa > $accession.trna.fsa
    set trna_count = `grep '>' $accession.trna.fsa | wc -l`
    echo "Number of tRNAs: $trna_count"
    
    perl rnammer.pl < $accession.fsa > $accession.rrna.fsa
    set rrna_count = `grep '>' $accession.rrna.fsa | wc -l` 
    echo "Number of rRNAs: $rrna_count"
    
    foreach mol (5s 16s 23s)
     set mol_count = `grep "/mol=$mol" $accession.rrna.fsa | wc -l` 
     echo "Number of $mol rRNAs: $mol_count"
    end
    
    set length = `saco_convert -I fasta -O length $accession.fsa`
    echo "Genome length: $length"
    
    set coding = `perl -ne 'next unless /CDS_(\d+)\-(\d+)/; $cod += abs($1-$2);END {print $cod}' < $accession.proteins.fsa`
    set coding_density = `perl -e "printf ('%0.2f' , $coding / $length);"`
    echo "Coding density: $coding_density"
    
    perl -ne 'uc ; next unless /^([A-Za-z]+)/;for($x=0;$x< length($1);$x++){\
      $AT++ if ( substr($1,$x,1) eq "A" or substr($1,$x,1) eq "T"); $TOT++;} END {\
      printf ("%0.4f\n",(($AT) / $TOT));}' < $accession.fsa > $accession.ATcontent
    set ATcontent = `cat $accession.ATcontent`
    echo "Global AT content: $ATcontent"
    
    perl genomeatlas.pl -ref $accession.fsa -t "$organism" \
     -dnap "Intrinsic Curvature,Stacking Energy,Position Preference,Global Direct Repeats,Global Inverted Repeats,GC Skew,Percent AT" \
     -ann $accession.ann > jourclub_genomeatlas.pdf
    
    Open the PDF file:m16/jourclub_genomeatlas.pdf

    BLAST atlas

    You may find it relevant to compare your organism with other bacteria. In this case, you need only the proteins of that organism and to construct a configuration file (specified by the option -blastcfg blast.cfg, just like exercise M12). To obtain the proteome, you need to download the genome sequence of the organism in question, which can be done like this:
    saco_extract -I genbank -O fasta -t < /home/projects/pfh/genomes/data/$accession/$accession.gbk > $accession.proteins.fsa
    less $accession.proteins.fsa
    

    Please be aware, that the BLAST search may take longer time compared to those in the exercises, since these searched have not been cached by our server.