
Comparative Microbial Genomics - #27644

Computer Exercises - Prediction of Highly Expressed Genes in Bacterial Genomes


Exercise 7; 9 November 2005


BACKGROUND:


The diagram you will produce in this exercise shows the results of protein BLAST comparisons of 16 Proteobacteria genomes. All proteomes have been BLASTed against each other, and proteins with an alignment overlap of 80% or more and an E-value of 1·10^-5 or better have been counted. The fraction of homologous proteins (F, as indicated by the colour scale) in organism A compared to B is calculated as follows:


F_AB = 100% · [number of genes from A identified in B] / [total number of genes in A]


Note that F_AB = F_BA only for two organisms A and B with an identical ORF count. In all other cases F_AB != F_BA. However, for closely related organisms with a similar number of ORFs, F_AB ~ F_BA.
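
To make the asymmetry concrete, here is a small sketch of the calculation above. The function name percent_shared and all counts are made up for illustration and are not taken from the exercise data; it also assumes, for simplicity, that the same number of proteins is identified in both directions. It is written for sh/bash, so if your login shell is csh, save it to a file and run it with sh.

```shell
# percent_shared HITS TOTAL
# Prints 100 * HITS / TOTAL with one decimal, following the formula above.
percent_shared() {
    awk -v hits="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * hits / total }'
}

# Hypothetical counts: 450 shared proteins, genome A has 500 genes,
# genome B has 4500 genes.
echo "F_AB = $(percent_shared 450 500)%"    # small genome compared to large one
echo "F_BA = $(percent_shared 450 4500)%"   # large genome compared to small one
```

With these invented numbers the small genome A scores 90.0% against B, while B scores only 10.0% against A, because the denominator is the gene count of the first organism.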


START EXERCISE:


Important! The passwords for every 'mic' account have been reset. To log in, you will need the following information:


- Open ssh (it can be found on your desktop)

- In the 'Host' box, enter the following: genome.cbs.dtu.dk

- In the 'User' box, enter the following: micXX (replace 'XX' with your assigned number)

- The password will be given out in class


After logging in, you will be in your "home" directory (path: /home/micXX/). Enter today's exercise directory (i.e. cd 09Nov2005). You can make sure you are in the correct place by typing pwd ('pwd' stands for print working directory). This should output: /home/micXX/09Nov2005


Check that all of today's GenBank files (.gbk) are present in your genbanks directory (i.e. cd genbanks; ls). Below is the list of the 16 genomes (or GenBank files) we will use in today's exercise:


- Baphidicola_BBp_Main.gbk

- Bpennsylvanicus_BPEN_Main.gbk

- Ecarotovora_SCRI1043_Main.gbk

- Ecoli_042_Main.proteins.fsa

- Ecoli_CFT073_Main.gbk

- Ecoli_E2348_Main.gbk

- Ecoli_K-12_MG1655_Main.gbk

- Ecoli_O157_EDL93_Main.gbk

- Pluminescens_laumondiiTTO1_Main.gbk

- Senterica_Ty2_Main.gbk

- Sflexneri_2457T_Main.gbk

- Sflexneri_2a301_Main.gbk

- Styphimurium_LT2_Main.gbk

- Wglossinidia_Strain_Main.gbk

- Ypestis_KIM_Main.gbk

- Ypseudotuberculosis_IP32953_Main.gbk
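
If you would rather not compare the ls output against the list by eye, a check along the following lines may help. This is only a sketch: the helper missing_files is a hypothetical name, not part of the course software, and the example call passes just three of the file names above (extend it with the rest as needed). It is written for sh/bash, so under csh save it to a file and run it with sh.

```shell
# missing_files DIR NAME...
# Prints every NAME that is not present as a regular file in DIR.
# No output means all the named files were found.
missing_files() {
    dir=$1
    shift
    for name in "$@"; do
        [ -f "$dir/$name" ] || echo "$name"
    done
}

# Example call, run from the exercise directory, with three names from the list.
missing_files genbanks \
    Ecoli_K-12_MG1655_Main.gbk \
    Senterica_Ty2_Main.gbk \
    Ypestis_KIM_Main.gbk
```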


The diagram/plot that you will generate is produced by the Perl script compare.pl. The script takes many arguments, and we will not go into the details of them here. To make life easier for you, we have written a shell script that includes all these arguments. Take a look at this shell script to get an idea of what it does (i.e. cat run_matrix_script.csh). Note that the script redirects its output to output.ps (this is your plot file in PostScript format). Also note that the script lists all *.gbk files in the directory genbanks/.


To run this script, execute the following commands:

> ssh organism

> cd 09Nov2005

> csh run_matrix_script.csh


Let this script do its thing. It may take a while.


Once the script is finished, execute the following commands:

> exit

> ghostview -landscape output.ps


From this plot, can you answer the following questions?

  1. Which proteomes are most similar?

  2. Which proteomes are most distant?

  3. Why are the numbers not the same both ways (e.g., B. aphidicola vs. E. carotovora is not the same as E. carotovora vs. B. aphidicola)?

  4. Which organism has the largest fraction of internal homologues, and what might this mean?






Course Organiser: David W. Ussery  Software questions: Christoph Champ