Comparative Microbial Genomics
Exercise 7; 9 November 2005
BACKGROUND:
The diagram you will produce in this exercise shows the results of protein BLASTs of 16 Proteobacteria genomes. All proteomes have been BLASTed against each other, and proteins with an alignment overlap of 80% or more and an E-value of 1·10^-5 or better have been counted. The fraction of homologous proteins (F, as indicated by the colour scale) in organism A compared to B is calculated as follows:
F_AB = 100% · [number of genes from A identified in B] / [total number of genes in A]

Note that F_AB = F_BA only for two organisms A and B having an identical ORF count. In all other cases F_AB != F_BA. However, for closely related organisms with a similar number of ORFs, F_AB ~ F_BA.
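As a worked illustration of the formula, the following shell sketch computes F for A compared to B and for B compared to A. All counts are invented for illustration and are not results from this exercise; `fraction` is a hypothetical helper name introduced here, not part of compare.pl.

```shell
#!/bin/sh
# fraction HITS TOTAL: print 100 * HITS / TOTAL with one decimal place.
# (hypothetical helper, not part of compare.pl)
fraction() {
    awk -v hits="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * hits / total }'
}

# Invented counts, for illustration only.
genes_A=4000    # total number of genes in A
genes_B=500     # total number of genes in B
A_in_B=460      # genes from A identified in B
B_in_A=460      # genes from B identified in A

F_AB=$(fraction "$A_in_B" "$genes_A")
F_BA=$(fraction "$B_in_A" "$genes_B")

echo "F_AB = ${F_AB}%"   # prints: F_AB = 11.5%
echo "F_BA = ${F_BA}%"   # prints: F_BA = 92.0%
```

With the same number of shared proteins, the fraction is small for the large proteome and large for the small one, because each value is divided by a different total.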
START EXERCISE:
Important! The passwords for every 'mic' account have been reset. To log in, you will need the following information:
- Open ssh (it can be found on your desktop)
- In the 'Host' box, enter the following: genome.cbs.dtu.dk
- In the 'User' box, enter the following: micXX (replace 'XX' with your assigned number)
- The password will be given out in class
After logging in, you will be in your "home" directory (path: /home/micXX/). Enter today's exercise directory (i.e. cd 09Nov2005). You can make sure you are in the correct place by typing pwd ('pwd' stands for print working directory). This should output: /home/micXX/09Nov2005
Check that all of today's GenBank files (.gbk) are present in your genbanks directory (i.e. cd genbanks; ls). Below is the list of the 16 genomes (or GenBank files) we will use in today's exercise:
- Baphidicola_BBp_Main.gbk
- Bpennsylvanicus_BPEN_Main.gbk
- Ecarotovora_SCRI1043_Main.gbk
- Ecoli_042_Main.proteins.fsa
- Ecoli_CFT073_Main.gbk
- Ecoli_E2348_Main.gbk
- Ecoli_K-12_MG1655_Main.gbk
- Ecoli_O157_EDL93_Main.gbk
- Pluminescens_laumondiiTTO1_Main.gbk
- Senterica_Ty2_Main.gbk
- Sflexneri_2457T_Main.gbk
- Sflexneri_2a301_Main.gbk
- Styphimurium_LT2_Main.gbk
- Wglossinidia_Strain_Main.gbk
- Ypestis_KIM_Main.gbk
- Ypseudotuberculosis_IP32953_Main.gbk
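Comparing the ls output with the list by eye is easy to get wrong, so the check can also be scripted. The following is a minimal sketch, assuming it is run from the exercise directory (the one containing genbanks/); `report_missing` is a hypothetical helper written for this illustration and is not one of the course scripts.

```shell
#!/bin/sh
# report_missing DIR FILE...: print the name of every FILE not found in DIR.
# (hypothetical helper written for this illustration)
report_missing() {
    dir=$1
    shift
    for name in "$@"; do
        [ -f "$dir/$name" ] || echo "missing: $name"
    done
}

# Example with the first three file names from the list above;
# extend the argument list with the remaining names as needed.
report_missing genbanks \
    Baphidicola_BBp_Main.gbk \
    Bpennsylvanicus_BPEN_Main.gbk \
    Ecarotovora_SCRI1043_Main.gbk
```

If nothing is printed, every file named on the command line was found.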
The diagram/plot that you will generate uses the Perl script compare.pl. The script takes many arguments, and we will not go into the details of them. To make life easier for you, we have constructed a shell script that includes all these arguments. Take a look at this shell script to get an idea of what it does (i.e. cat run_matrix_script.csh). Note that the script will redirect its output to output.ps (this is your plot file in PostScript format). Also note that the script lists all *.gbk files in the directory genbanks/.
To run this script, execute the following commands:
> ssh organism
> cd 09Nov2005
> csh run_matrix_script.csh
Let this script do its thing. It may take a while.
Once the script is finished, execute the following commands:
> exit
> ghostview -landscape output.ps
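If the viewer shows nothing, it can help to confirm that the script actually produced a plot. The sketch below assumes that output.ps begins with the conventional "%!PS" PostScript header; `looks_like_postscript` is a hypothetical helper name, not part of the course scripts.

```shell
#!/bin/sh
# looks_like_postscript FILE: succeed if FILE is non-empty and
# its first four bytes are "%!PS" (assumed PostScript header).
looks_like_postscript() {
    [ -s "$1" ] && [ "$(head -c 4 "$1")" = "%!PS" ]
}

if looks_like_postscript output.ps; then
    echo "output.ps looks like a PostScript file"
else
    echo "output.ps is missing, empty, or not PostScript"
fi
```

An empty or missing file usually means the script stopped early, in which case rerunning it and reading its messages is the next step.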
From this plot, can you answer the following questions?
- Which proteomes are most similar?
- Which proteomes are most distant?
- Why are the numbers not the same both ways (e.g., B. aphidicola vs. E. carotovora is not the same as E. carotovora vs. B. aphidicola)?
- Which organism has the largest fraction of internal homologues, and what might this mean?