Exercise M10

From comparative genomics to a sequence-based taxonomy

The rapid advances in molecular and genome biology have forced microbiologists to revisit one of the oldest branches of their discipline, systematics. A sequence-based approach in which comparisons between strains are carried out by a computer and not a person at a laboratory bench, is clearly the future of microbial systematics and taxonomy. Multilocus Sequence Analysis is the most promising sequence based method.

This set of exercises will allow you step-by-step to come up with a set of candidate genes for a MLSA study starting from a number of whole genome sequences. It contains the following basic steps:
  1. create protein gene families
  2. determine the core set of genes,
  3. create a clean alignment ready for primer design and phylogenetic analysis.

before you start, you want to make a new folder

mkdir mlsa
cd mlsa

Step 1: Selection of whole genomes

a. Make a selection of approx. 5 genomes, that are closely related, e.g. all publicly available Lactobacillus sp. genomes,
Make a selection of this list:
Only one strain per species! An outgroup can be useful.

b. download the embl_GR files for the genomes selected above to your local directory
e.g. all lactobacillus genomes: AE017198_GR.dat, AL935263_GR.dat, CP000033_GR.dat, CP000233_GR.dat, CR936503_GR.dat

Fetch sequences using the wget command. You can automate this by using a for-loop like this:
for f in AE017198 AL935263 CP000033 CP000233 CR936503
do wget "ftp://ftp.ebi.ac.uk/pub/databases/genome_reviews/dat/cellular/${f}_GR.dat.gz"
unzip files: gunzip *_GR.dat.gz

c. Extract the CDS protein and dna sequences using the script emblGR2seq.pl
emblGR2seq.pl lactobacillus *.dat

here lactobacillus is a 'basename' = a name for the outputfiles that will be created, e.g. lactobacillus datfile(s) = CP000033_GR.dat or .dat for all dat-files at once (*.dat is recommended)

files that will be created by the script for you include: 
basename.annotation = info on CDS
basename.pep	= AAseq of CDS
basename.fas = DNAseq of CDS
basename.lst = CDSid and OrgID

Step 2: compare whole genomes

a. create a blastdatabase from prot sequence files:
 formatdb -t lactobacillus_pep -i lactobacillus.pep -p T -n lactobacillus_pep
b. start an all-against-all blast of protein-sequences

This will take some time! Approx. 30min for 5 Lactobacillus genomes!

blastall -p blastp -i lactobacillus.pep -d lactobacillus.pep -o lactobacillus.blastp -m 8 -e 0.000001 &
you'll have to wait here until the blast is finished. The example for lactobacilli takes about 30 min,
perfect time for a break or to catch up with some unfinished exercises. If you would like to keep track
of your blast and know when it is finished you can do either of these 4 things:

  1. use the command 'top' which will show you all processes running. As long as there is a process called "blastall" in the list, your job is still running.
  2. use the command 'ps' (stands for process) which will show you also the running processes in a shorter format
  3. use the command 'tail -f lactobacillus.blastp' which will show you the output of the blast been written to the file lactobacillus.blastp in real time (although with short interruptions). When the job is done, no more data will be written in this file.
  4. use the command 'wait' to keep your computer waiting until the job is done. When the prompt re-appears the job is done.

Step 3: create gene families

a. Add length of proteins/seqs to the output of the blastfile
addlenght.pl lactobacillus.pep  < lactobacillus.blastp > lactobacillus.blastpL
Q: Can you explain why this info will be essential in the next step?

b. Apply cut-off criteria to create genefamilies
createGF2.pl 0.7 30 < lactobacillus.blastpL > lactobacillus.GF 
0.7 = treshold for alignable length/ query protein length 
30  = treshold for percent identity  
Because createGF.pl only outputs those genes which are in a gene family, you need to make a corrected file of
lactobacillus.GF that includes also the genes which are NOT in a genefamily. You can do this using the script
correctGF.pl as follows:
 correctGF.pl lactobacillus.lst lactobacillus.GF > lactobacillus.cGF 
Result: genes not in a gene family will end up in family labeled with "0"

c. Build an overview table including the gene annotation and gene family
information that will allow you to analyze the gene families in different ways!

Combine 2 files: lactobacillus.annotation and lactobacillus.cGF.
First sort them on id: 
sort -k 1 lactobacillus.annotation > lactobacillus.annotation_S 
sort -k 1 lactobacillus.cGF > lactobacillus.cGF_S 
Now join them into one file by linking them on the id
join lactobacillus.cGF_S lactobacillus.annotation_S  > fam_overview.txt
Q: What is the largest gene family, how many gene families are there (in total and for each
organism), do the gene families make sense (check the product descriptions of genes within a
specific family)

This might be of use to answer these questions, it is a combination of several unix commands
to sort files, to cut out columns, to grep only a subset of lines out of the file and to make a histogram
from the data
sort -n -k 2 lactobacillus.cGF | grep "AE017198" | cut -f 2 | histogram.pl -bin 1 -nocenterbin | sort -nr -k 2 
If you want to know more about histogram.pl visit this URL

Step 4: phylogenetic profile

a. Use script phylprof.pl to create the phylogenetic profile
phylprof.pl lactobacillus.lst lactobacillus.cGF > lactobacillus.phylprof
.lst-file is one of the output file from emblGR2seq.pl
.GF-file is the output file from createGF.pl

b. Analyze the profile

Q: look for these things: duplication, lineage specific gene loss, lineage specific gene expansion,
core set of genes, where are the ORFans?

step 5: Select genes for a multi-locus sequence analysis

a. Identify the non-duplicated core set of genes in the phylogenetic
profile and make a sub-selection of 5 to 7 genes that you want to use for phylogenetic analysis.

You might want to take the length of the genes into account and take genes spread over the genome! b. For each selected gene, make a list of gene id's

for example dnaA that would be these id's:

c. Make a multi-fasta file of these selected genes containing the dna-sequences

using the script fetch_mfasta.pl as follows:
for f in AL935263_1 CP000033_1 CP000233_1 CR936503_1
do fetch_mfasta.pl lactobacillus.fas $f  >> dnaA.fas
-> sequence identifiers should be no longer than 9 characters (make sure these are unique!):

perl -pi -e 's/(\>.{9}).*/\1/g' dnaA.fas

Step 6: Analyze the genes

a. Make an alignment of the dna sequences per gene using clustal:
clustalw dnaA.fas
b. visualize and inspect the alignment:
nedit dnaA.aln
c. Clean up the alignment

Read about the importance of an good sequence alignment in this reference:
Gevers, D. & Coenye, T. (2005) Phylogenetic and genomic analysis. Invited book chapter
in ASM Manual of Environmental Microbiology, 3th edition. Editors: Garland, J.L. & O'Connell,
S. ASM press (In Press) (provided to the scholars prior to publication!)

Editing the alignment can be done in Nedit
nedit dnaA.aln &
Use the Gblocks tool to eliminate poorly aligned positions and divergent regions of a DNA or protein alignment
so that it becomes more suitable for phylogenetic analysis

This tool is available on this URL: Gblocks-link

In order to use this tool, you will have to transfer the alignment from the clustal format
to the fasta format, using the clus2fasta tool:
clus2fasta < dnaA.aln > dnaA_aln.fas
d. Look for the conserved regions suitable for primer design

In order for a gene to be useful as a phylogenetic marker in an MLSA approach, it should allow to create primers.
In this step we will automatically evaluate if a gene is suitable for primer design.
We can automatically identify the most highly conserved regions of proteins using the BLOCKS-MAKER webtool
The output of this program can than be used to automatically predict primers using the CODEHOP webtool

e. make a Maximum Likelihood tree using PHYML

PHYML is a simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood

More info on this link

In order to do use this program, you need to reformat aligned sequences in PHYLIP format
seqret -osformat phylip -sequence dnaA.aln -outseq dnaA.phy 

For DNA sequences:
./phyml sequences file data type sequence format nb data sets nb bootstrapped data sets substitution
model ts/tv ratio prop. invariable sites nb categories gamma parameter starting tree optimise topology
optimise branch lengths and rate parameters

Example :

phyml dnaseqs.phy 0 i 1 100 GTR e e 4 e BIONJ y y 

For amino-acids sequences :
./phyml sequences file data type sequence format nb data sets nb bootstrapped data sets substitution
model prop. invariable sites nb categories gamma parameter starting tree optimise topology optimise
branch lengths and rate parameters
Example :

phyml aminoacidseqs.phy 1 i 1 100 JTT e 4 e  BIONJ y y 
For complete details type 'phyml -h' or see http://atgc.lirmm.fr/phyml/

PHYML produces several results files :
< sequence file name >_phyml_lk.txt : likelihood value(s)
< sequence file name >_phyml_tree.txt : inferred tree(s)
< sequence file name >_phyml_stat.txt : detailed execution stats
< sequence file name >_phyml_boot_trees.txt : bootstrap trees
< sequence file name >_phyml_boot_stats.txt : bootstrap statistics

visualize the tree using njplot
njplot dnaA.phy_phyml_tree.txt
This concludes the exercise, but you can continue to apply these tools and look for good markers...