
DNA Structures in Whole Genomes



DTU Ph.D. Course number 27803
Biological Sequence Analysis
David Ussery
Monday, 15 May, 2006






Cookbook for Computer Exercises on Comparative Microbial Genomics



Aim

The aim of this exercise is to give hands-on experience in the analysis of bacterial genomes. The tools and techniques applied were developed for microbial genomes; both eukaryotic and prokaryotic microbial genomes can be analysed. The Comparative Microbial Genomics group at CBS has developed "DNA atlas" methods to visualize global properties of microbial genomes. The atlas web pages, which are an interface to a database with bacterial, archaeal, eukaryotic, and viral genomic information, will be introduced. The database can be used to retrieve information about sequenced genomes and to compare DNA structural properties within genomes. Another way to compare different genomes is to build a "BLAST matrix".

Part I: Acquire General Information

Go to http://www.cbs.dtu.dk/services/GenomeAtlas/ and answer the following questions:
  1. How many Archaeal species currently have genomes stored in the database?
  2. What is the total length of the bacterial genomes that are currently stored in the database?
  3. Select different phyla by clicking on them. How many Bacterial phyla are stored in the database?
  4. How many Escherichia coli strains are published?
  5. What is the AT range in the Euryarchaeota?
     (hint: You can sort genomes by different properties by clicking on the top arrows.)
  6. What are the genome sizes for the stored Pyrococcus and Methanosarcina species?
  7. Are there differences concerning the 5S rRNA predictions?
  8. Look at the Genome Atlas for Methanosarcina acetivorans. How does the number of repeats compare to other Archaeal genomes?
     (hint: Try clicking on Main and look into "DNA Analysis related to Segment Table".)
  9. How many genomes are there from the enterics?
     (note: The enterics are Proteobacteria from the gamma subdivision, and the taxonomic abbreviation we use for this group is "BProt GE", which you can see in the GenomeAtlas row information for E. coli, for example. Type "BProt GE" in the search box and see what you get; it should be a list of 27 organisms.)
  10. What are the differences in codon usage for Wigglesworthia, E. coli, and Sodalis glossinidius?
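
The AT content and codon usage values asked about above are simple counts over the sequence. The following is a minimal Python sketch of how such numbers could be computed from a sequence you have at hand; it is not the code behind the GenomeAtlas pages. The function names `at_content` and `codon_usage` and the toy sequence are made up for illustration.

```python
from collections import Counter


def at_content(seq: str) -> float:
    """Fraction of A and T among the unambiguous (A, C, G, T) positions."""
    counts = Counter(seq.upper())
    acgt = sum(counts[base] for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (counts["A"] + counts["T"]) / acgt


def codon_usage(cds: str) -> dict:
    """Relative frequency of each codon in one in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # Skip codons containing ambiguous bases such as N.
    counts = Counter(c for c in codons if set(c) <= set("ACGT"))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


# Hypothetical toy coding sequence (ATG AAA AAA GGC TAA), not real genome data.
toy = "ATGAAAAAAGGCTAA"
print(f"AT content: {at_content(toy):.3f}")
for codon, freq in sorted(codon_usage(toy).items()):
    print(codon, f"{freq:.2f}")
```

To compare organisms, one would pool the codon counts over all annotated genes of each genome and compare the resulting frequency tables side by side.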


Part II: More Detailed Analysis

Do a 2-D clustering analysis of the genomes from enteric bacteria

Choose the following criteria to cluster the genomes by: AT content, length, GC-skew, number of genes, coding density (bp/gene), number of tRNAs, number of rRNAs (include 5S, 16S, and 23S), number of sigma factors, and the number of proteins for secretion system types I-V. Use Euclidean distance, scale colours by column, normalise by centering and scaling by the standard deviation, and use complete agglomeration (at least for your first try; feel free to alter the parameters and see what happens).

  • Which genomes cluster close together? Does this make biological sense?
  • Which of the chosen genomic properties cluster together? Again, does the clustering in this direction make sense?
  • List some of the chosen genomic properties which don't show much information in this example. (Although of course they might in other genome comparisons - explain why this might be the case.)
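
To make the clustering settings above concrete, here is a minimal Python sketch of one clustering direction (genomes as rows): columns are centered and scaled by their standard deviation, distances are Euclidean, and clusters are merged by complete linkage, which is what "complete agglomeration" is assumed to mean here. This is not the code used by the web tool, the sample standard deviation is an arbitrary choice, and the genome names and values are invented.

```python
from math import sqrt
from statistics import mean, stdev


def zscore_columns(rows):
    """Center each column on its mean and scale by its standard deviation."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) or 1.0 for c in cols]  # constant column: avoid dividing by zero
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def complete_linkage(names, rows):
    """Agglomerative clustering; returns the merges in the order they happen."""
    clusters = {(n,): [r] for n, r in zip(names, rows)}
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        best = None
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                # Complete linkage: cluster distance is the largest pairwise distance.
                d = max(euclidean(a, b)
                        for a in clusters[keys[i]] for b in clusters[keys[j]])
                if best is None or d < best[0]:
                    best = (d, keys[i], keys[j])
        d, a, b = best
        clusters[a + b] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
    return merges


# Hypothetical genomes with invented values: AT content (%), length (Mbp), genes.
names = ["genome_A", "genome_B", "genome_C", "genome_D"]
table = [
    [49.0, 4.6, 4300],
    [49.5, 4.8, 4500],
    [75.0, 0.7, 600],
    [74.0, 0.8, 650],
]
for left, right, dist in complete_linkage(names, zscore_columns(table)):
    print(left, "+", right, f"at distance {dist:.3f}")
```

In this toy table the two large, AT-poor genomes pair up and the two small, AT-rich genomes pair up before the two pairs are joined. Without the column scaling, the gene counts would dominate the distances simply because their numbers are largest.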

Part III: BLAST Matrices and Identification of Five Unknown Genomes

Construct a BLAST matrix of the genomes from enteric bacteria

    This part of the exercise requires a Unix shell.

    • Open the directory MIC_Phdcourse06 and find the configuration files called "[something]_10p10.xml".
    • You can choose different e-values (indicated by the "10pX" part of the "10pX.xml" file name).
    • Type "perl Matrix [chosen config file value 10pX.xml] > groupname_p10pX.ps"
    • When the file is finished, type "ghostview [your groupname_p10pX.ps]".

    • A new window will open with the BLAST matrix.
      You can zoom in by clicking on parts of the matrix.
      Choose orientation 'landscape' and magnification 1-3.

    --> List the likely identities of the five genomes, based on the clustering analysis and the other data obtained from this list.
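
The idea behind a BLAST matrix can be sketched in a few lines of Python: for every pair of genomes, count how many proteins of one genome have a hit in the other below a chosen e-value, and express that as a percentage. This is a simplified illustration under stated assumptions, not the method implemented by the course's Matrix script, whose internals and configuration format are not described here. The function name `blast_matrix`, the input layout, the genome and protein names, and the e-values are all hypothetical.

```python
from collections import defaultdict


def blast_matrix(proteome_sizes, hits, max_evalue):
    """Percentage of each query genome's proteins with a hit in each subject genome.

    proteome_sizes: {genome name: number of proteins in that genome}
    hits: iterable of (query_genome, query_protein, subject_genome, evalue)
    max_evalue: hits with a larger e-value are ignored
    """
    matched = defaultdict(set)
    for q_genome, q_protein, s_genome, evalue in hits:
        if evalue <= max_evalue:
            # A set counts each query protein once, however many hits it has.
            matched[(q_genome, s_genome)].add(q_protein)
    genomes = sorted(proteome_sizes)
    return {
        q: {s: 100.0 * len(matched[(q, s)]) / proteome_sizes[q] for s in genomes}
        for q in genomes
    }


# Invented example: two tiny "genomes" and four pre-computed BLAST hits.
sizes = {"genome_A": 4, "genome_B": 2}
hits = [
    ("genome_A", "a1", "genome_B", 1e-30),
    ("genome_A", "a1", "genome_B", 1e-12),  # second hit for the same protein
    ("genome_A", "a2", "genome_B", 1e-3),   # too weak at the chosen threshold
    ("genome_B", "b1", "genome_A", 1e-20),
]
matrix = blast_matrix(sizes, hits, max_evalue=1e-10)
for query, row in matrix.items():
    print(query, {subject: f"{pct:.0f}%" for subject, pct in row.items()})
```

A stricter e-value threshold keeps fewer hits and lowers the percentages, which is why the same set of genomes can look more or less similar depending on the configuration file chosen. An unknown genome that shares a high percentage of its proteins with one reference genome is a reasonable candidate for a close relative of that reference.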



