DistanceBased Methods for Phylogenetic Reconstruction
Exercise written by: Anders Gorm Pedersen
Overview
In today's exercise we will reconstruct phylogenetic trees using a variety
of distancebased methods. Specifically, we will explore two different
optimality criteria (least squares and minimum evolution), and one clustering
method (neighbor joining).
Copy files for today's exercise:
Make sure you're still in today's working directory (condist) and that you already have the
hcv.nexus file there. Now, copy the following file to the dir also: ~gorm/distance/simple.nexus
simple.nexus is an artificial data set that I have constructed. It is identical to
the one you analyzed by hand earlier today in class. We will use it to convince
ourselves that PAUP gets the same result as you.
Analysis of the Simple Data Set
Start PAUP* and load the simple data set:
paup simple.nexus
Select distancebased treereconstruction:
set criterion=distance
Select uncorrected distances under the unweighted least squares criterion:
dset distance=p objective=lsfit power=0
The dset command is used to set various options for the
distancebased methods. Option "distance=p" specifies the use of
"uncorrected sequence distances", i.e., we do not want to correct the observed
distances for multiple substitutions. Note that distances are here reported as
"substitutions per site". This simply means that the number of differences has
been divided by the length of the sequence. You can think of this distance as
the fraction of sites that are different between two sequences.
The option "objective=lsfit" specifies that we want to reconstruct trees
using the least squares optimality criterion. Recall that under least squares we are
trying to find the tree that has the smallest possible deviation between the observed
pairwise distances and the pairwise distances measured along the tree. (The distance
between two taxa measured along the tree is called the "patristic" distance). The
overall fit of the tree is found by (1) computing the difference between each observed
distance and the corresponding patristic distance, (2) squaring this difference (this
way we are sure to obtain a positive number, regardless of whether the observed or the
patristic difference is larger), and (3) adding all the squared differences. The option
"power=0" specifies that we do not want to weight the squared differences according to
branch lengths when computing this fit.
Inspect distance matrix
showdist
This command shows the distance matrix as evaluated under the current criteria.
Q1: make a note of all the pairwise fractional distances. For each value compute the
corresponding absolute distance (this is done by muliplying the fractional distance by the length
of the alignment). Do the numbers agree with the distance matrix you constructed earlier today?
Find best tree using exhaustive search:
alltrees
This data set is sufficiently small that we can search through all
possible trees.
Q2: how many different, unrooted trees with 4 leafs is it possible to construct?
Inspect best tree, and analyze branch lengths:
describetrees all/plot=phylogram brlens=yes label=yes
Does the best tree correspond to the one you constructed?
Q3: We now want to investigate whether the fitted branch lengths correspond to the
observed pairwise distances. First, draw a sketch of the tree. Second, label each branch with the
branch length as listed in the table you just produced with describetrees. Finally, compute
the patristic distance between each pair of species on the tree by adding up the branch lengths of
branches lying on the path between the two taxa. Do the observed pairwise distances (from the
distance matrix in Q1) correspond to the patristic distances in this case?
Analysis of HCV Data Set using Neighbor Joining
Set up analysis fo HCV data set
Using what you have learned before do the following steps:
 Load the file hcv.nexus
 Select distancebased treereconstruction (set the criterion to distance)
 Select uncorrected distances
 Define the following outgroup: 2_1_1 2_1_2 2_1_3 2_1_4 2_1_5 2_1_7 2_1_8 2_1_9 2_1_10
 Set outgroup rooting, and ensure outgroup is printed as monophyletic sister group to ingroup.
Construct a neighbor joining tree based on the HCV data:
nj
This will construct a neighbor joining tree using the active
distance measure (currently set to uncorrected).
Print tree and table of branch lengths:
describetrees 1/plot=phylogram brlens=yes
The neighbor joining tree resembles the trees you previously constructed using
parsimony. Importantly, you should see that the viral sequences from different patients
form distinct clusters. Note that only a single tree is produced. This is
characteristic of clustering methods, which work by following a deterministic algorithm
for constructing a tree from distance data. Clustering algorithms such as neighbor
joining do not have any measure of treegoodness and therefore are not able to identify
sets of equally good trees.
Q4: The present neighbor joining tree was computed without correcting the observed
distances for multiple substitutions. In the phylogram, identify the internal node that is
ancestral to the patient 5 sequences (you will see that internal nodes are labeled with
consecutive numbers), and also the internal node that is one level further down in the tree
(i.e., ancestral to the ancestral node). You will note that the branch connecting these two
nodes is relatively long. Locate the branch in the list of branch lengths and make a note of
the length.
Select correction of multiple substitutions using the Jukes and Cantor model:
dset distance=jc
This causes all observed distances to be corrected using a formula based on
the Jukes and Cantor model of evolution. Recall that under the Jukes and
Cantor model all base frequencies are assumed to be equal (at 0.25), and all
base substitution rates are also assumed to be equal.
Construct a new neighbor joining tree using corrected distances:
nj
Print tree and table of branch lengths:
describetrees 1/plot=phylogram brlens=yes
In this tree all branch lengths have been corrected for (unobserved) multiple
substitutions. That means they are slightly longer than the uncorrected distances, and
this correction is more noticeable for longer branches.
Q5: Again locate the internal node that is ancestral to the patient 5 sequences and also
the immediate ancestor of this node (the node labels are not necessarily the same as before). Now
find the corresponding branch in the table and make a note of the length. Is the corrected branch
length longer than the uncorrected one? What is the percentual increase?
NB  do this: You are currently using neighbor joining to reconstruct the phylogenetic
tree. Below you will also explore the use of least squares and minimum evolution methods. In order to
compare the performance and characteristics of these methods we want to record some informative
numbers. On the page, where you are noting answers to today's questions, you should now make a small
table with two columns (labeled "SSE" and "tree length"), and three rows (labeled "NJ", "least
squares", and "minimum evolution").
Q6: At the end of the list of branch lengths (printed with the describetrees command), you will find the sum of all branch lengths.
This is often called the "length" of the tree. Make a note of this number in the proper place
in your table.
Compute fit of NJ branch lengths to observed branch lengths:
dscores 1/objective=lsfit power=0
The dscores command computes the fit of the tree in memory
according the specified criterion. In this case we are computing the fit between
the observed branch lengths and the branch lengths found by neighbor joining.
The measure used is the sum of squared deviations mentioned above.
Q7: Make a note of the sum of squared errors in the proper place in your table (it is
indicated by "S.S" which is an abbreviation for sum of squares).
Analysis of HCV Data Set Using Least Squares
Select JC corrected distances under the unweighted least squares criterion:
dset distance=jc objective=lsfit power=0
Find the best tree using heuristic searching:
hsearch start=nj swap=tbr
As we have seen previously, the HCV data set is far too big for exhaustive
searching, and we therefore have to resort to heuristic techniques when we are using a
phylogenetic reconstruction method that is based on an optimality criterion. In this
case the starting tree is constructed by neighbor joining, i.e., it should be identical
to the tree we just inspected. The heuristic search (which again uses rearrangements
of the "treebisection and reconnection" type) may take a few minutes and should result
in a set of equally good trees.
Inspect trees:
contree all/strict=no majrule=yes percent=50
This constructs a consensus tree from the set of equally good best trees. Again you should see
that the set of best trees have individual patients clustered separately. Note that while the
Neighbor Joining tree also showed this feature, it did not indicate that there might be any
uncertainty as to the details of the tree. However, by using a method that has an explicit measure of
tree goodness (least squares in this case) you have now learned that there are several equally good
reconstructions of the branch order within the individual patient clusters.
Compute fit of least squares branch lengths to observed branch lengths:
dscores 1/objective=lsfit power=0
Again, we are computing the sum of squared deviations between observed and patristic
pairwise distances. Arbitrarily we have chosen to only do this for tree number 1
("dscores all" would have done it for all trees in memory), but recall that all
trees in memory are equally good, so the results would have been identical to what you
now get.
Q8: Now, compare the result from this analysis with the number you obtained from the
neighbor joining tree above. Has the fit improved? Note the sum of squares in your table.
Find total length of tree:
describetrees 1/plot=no brlens=yes
Q9 Again, find the sum of all branch lengths and note the value in your table.
Analysis of HCV Data Set Using Minimum Evolution
Select JC corrected distances under the minimum evolution criterion:
dset distance=jc objective=me
We now want to explore a different optimality criterion for distancebased
analysis. Under minimum evolution we take the shortest tree to be the best one.
This is very similar to parsimony, but in this case we are using pairwise,
JCcorrected distances as the basis for reconstructing the tree. ME proceeds by
searching through a list of possible trees; for each tested topology the best
set of branch lengths are found by the least squares method, but instead of
finally choosing the tree with the best fit, we instead end up by choosing the
shortest tree.
Find the best tree using heuristic searching starting from a NJ tree:
hsearch start=nj swap=tbr
Inspect trees:
contree all/strict=no majrule=yes percent=50
Again you should see that the set of best trees have individual patients clustered
separately.
Find total length of tree:
describetrees 1/plot=no brlens=yes
Q10: At the end of the table listing branch lengths, you will again find the sum of all
branch lengths. Note the number your table and compare it to the tree lengths you found for the
least squares and neighbor joining trees above. Is the minimum evolution tree shorter than the other
two trees?
Compute fit of minimum evolution branch lengths to observed branch lengths:
dscores 1/objective=lsfit power=0
Q11: Again, we are computing the sum of squared deviations between observed and patristic
pairwise distances. Note the result from this analysis in your table and compare it with the numbers
you obtained from the neighbor joining and least squares analyses above. How is the fit of the ME
tree compared to those two?
