|
Distance-Based Methods for Phylogenetic Reconstruction
Exercise written by: Anders Gorm Pedersen
Overview
In today's exercise we will reconstruct phylogenetic trees using a variety
of distance-based methods. Specifically, we will explore two different
optimality criteria (least squares and minimum evolution), and one clustering
method (neighbor joining).
Log in to your account:
Start by logging in to your group's account.
Construct today's working directory:
mkdir distance
ls -l
The mkdir command constructs a new directory. You should be able to
see it using the ls command.
Change to today's working directory
cd distance
Copy files for today's exercise:
cp ~gorm/distance/hcv.nexus ./hcv.nexus
cp ~gorm/distance/simple.nexus ./simple.nexus
Inspect sequence files:
nedit hcv.nexus
This file contains 41 aligned Hepatitis C virus (HCV) sequences isolated from 5
different patients. Sequences are named in the following way:
Patient_Time_Clone. For instance, the sequence labeled 1_1_5 was isolated from
patient number 1 at time point 1 and is clone number 5 from that patient and that time
point.
The file is in the so-called NEXUS format that is widely used by phylogeny-related
software. The NEXUS format allows for entering of both data and commands. The sequenced
region corresponds to the end of the E1 gene and the beginning of the E2 gene,
surrounding the so-called hyper-variable region 1 (HVR1).
nedit simple.nexus
This is an artificial data set that I have constructed. It is identical to
the one you analyzed by hand earlier today in class. We will use it to convince
ourselves that PAUP gets the same result as you.
Analysis of the Simple Data Set
Start the PAUP* program:
paup
This will give you a prompt ("paup>") where you can write commands to the
program.
Load the simple data set:
execute simple.nexus
This command loads the sequences from the nexus file into memory. If the nexus file
had contained any commands, then these would also have been executed.
Select distance-based tree-reconstruction:
set criterion=distance
PAUP* can reconstruct phylogenies using either parsimony, distance-based, or
likelihood-based criteria. Today we will be using the distance-based methods.
Select uncorrected distances under the un-weighted least squares criterion:
dset distance=p objective=lsfit power=0
The dset command is used to set various options for the
distance-based methods. Option "distance=p" specifies the use of
"uncorrected sequence distances", i.e., we do not want to correct the observed
distances for multiple substitutions. Note that distances are here reported as
"substitutions per site". This simply means that the number of differences has
been divided by the length of the sequence. You can think of this distance as
the fraction of sites that are different between two sequences.
The option "objective=lsfit" specifies that we want to reconstruct trees
using the least squares optimality criterion. Recall that under least squares we are
trying to find the tree that has the smallest possible deviation between the observed
pairwise distances and the pairwise distances measured along the tree. (The distance
between two taxa measured along the tree is called the "patristic" distance). The
overall fit of the tree is found by (1) computing the difference between each observed
distance and the corresponding patristic distance, (2) squaring this difference (this
way we are sure to obtain a positive number, regardless of whether the observed or the
patristic difference is bigger), (3) adding all the squared differences. The option
"power=0" specifies that we do not want to weight the squared differences according to
branch lengths when computing this fit.
Inspect distance matrix
showdist
This command shows the distance matrix as evaluated under the current criteria. Make
sure that the matrix corresponds to the one you constructed earlier today (keeping in
mind that the current distances are displayed as substitutions per site - not as the
total number of differences). You will need to use these numbers later in the exercise
(you can then scroll back to inspect the matrix again).
Find best tree using exhaustive search:
alltrees
The "alltrees" command makes PAUP* find the best tree by exhaustively
searching through all possible trees. Since there are just 3 different, unrooted trees
with 4 taxons this is not a problem in this case, but for larger data sets we will have
to use approximate ("heuristic") methods.
For each possible tree topology PAUP* finds the best set of branch lengths, and then
computes the "sum of squared errors" as a measure of how well the patristic distances
fit the observed, pairwise sequence distances. The best tree is the one with the
smallest sum of squared errors. At the end of the run, PAUP* outputs a histogram giving
the distribution of the sum of squares for all trees (only three in this case).
Inspect best tree, and analyze branch lengths:
describetrees all/brlens=yes label=yes
This command prints a description of all the trees currently in memory along with a
table of branch lengths. Since there is only 1 best tree for this data set we get just
one description. We could have also used the command "describetrees 1" to
explicitly request info on tree number 1 in memory. The option "label=yes"
requests labeling of internal nodes.
Does the best tree correspond to the one you constructed?
We now want to investigate whether the fitted branch lengths correspond to
the observed pairwise distances. First, draw a sketch of the tree in the proper
place on today's results form. Second, label each branch with the branch length
as listed in the table you just produced with describetrees. Finally,
compute the patristic distance between each pair of species on the tree by
adding up the branch lengths of branches lying on the path between the two taxa.
Note all patristic distances on the form also. Does the observed pairwise
distances (from the matrix you printed using showdist before) correspond
to the patristic distances in this case?
Analysis of HCV Data Set using Neighbor Joining
Load HCV data set:
execute hcv.nexus
We have now loaded the HCV data set.
Select distance-based tree-reconstruction:
set criterion=distance
Select uncorrected distances:
dset distance=p
Define outgroup:
outgroup 2_1_1 2_1_2 2_1_3 2_1_4 2_1_5 2_1_7 2_1_8 2_1_9 2_1_10
This puts all nine sequences from patient 2 in the outgroup, which will be
used to place the root of the tree.
Set outgroup rooting and select options for tree printing:
set root=outgroup outroot=monophyl
This makes PAUP* use the outgroup we defined above for the purpose of rooting the
tree. The "outroot=monophyl" command makes PAUP construct a tree where the outgroup is a
monophyletic sister group to the ingroup. (Outroot could also have been set to
"polytomy" or "paraphyl").
Construct a neighbor joining tree based on the HCV data:
nj
This will construct a neighbor joining tree using the active distance measure
(currently set to uncorrected).
Print tree and table of branch lengths:
describetrees 1/plot=phylogram brlens=yes label=yes
This plots the neighbor joining tree along with a list of branch lengths. The plot is
a "phylogram" meaning that branch lengths are used (this is different from a "cladogram"
which would only indicate branching order). Note that the viral sequences from different
patients form distinct clusters.
The present neighbor joining tree was computed without correcting the
observed distances for multiple substitutions. In the phylogram, identify the
internal node that is ancestral to the patient 5 sequences, and also the internal
node that is one level further down in the tree (i.e., ancestral to the
ancestral node). You will note that the branch connecting these two nodes is
relatively long. Locate the corresponding branch in the list of branch lengths
and write down the length on today's results form.
Select correction of multiple substitutions using the Jukes and Cantor model:
dset distance=jc
This causes all observed distances to be corrected using a formula based on
the Jukes and Cantor model of evolution. Recall that under the Jukes and
Cantor model all base frequencies are assumed to be equal (at 0.25), and all
base substitution rates are also assumed to be equal.
Construct a new neighbor joining tree using corrected distances:
nj
Print tree and table of branch lengths:
describetrees 1/plot=phylogram brlens=yes label=yes
In this tree all branch lengths have been corrected for (unobserved) multiple
substitutions. That means they are slightly longer than the uncorrected distances, and
this correction is more noticeable for longer branches.
Again locate the internal node that is ancestral to the patient 5 sequences
and also the immediate ancestor of this node (the node labels are not necessarily
the same as before). Now find the corresponding branch in the table and write
down the length on the form. You should see that the corrected branch length is substantially
longer.
Find total length of tree:
At the end of the table listing branch lengths, you will find the sum of all
branch lengths. This is often called the "length" of the tree. Note down this
number in the appropriate place on the results form - you will need it
later.
Compute fit of NJ branch lengths to observed branch lengths:
dscores 1/objective=lsfit power=0
The dscores command computes the fit of the tree in memory according
to the specified criterion. In this case we are computing the fit between the
observed pairwise sequence-distances and the branch lengths found by neighbor
joining. The measure used is the sum of squared deviations mentioned above.
Again, write down the number on today's form (it is indicated by "S.S." which is
an abbreviation for "sum of squares").
Analysis of HCV Data Set Using Least Squares
Select JC corrected distances under the unweighted least squares criterion:
dset distance=jc objective=lsfit power=0
Find the best tree using heuristic searching:
hsearch start=nj swap=tbr
Unlike the simple data set we investigated earlier, the HCV data set is far too big
for exhaustive searching. We therefore have to resort to heuristic techniques when we
are using a phylogenetic reconstruction method that is based on an optimality
criterion. In this case the starting tree is constructed by neighbor joining, i.e., it
should be identical to the tree we just inspected. Once an initial tree has been
constructed, the heuristic search proceeds by rearrangements of the "tree-bisection and
reconnection" type). The search may take a few minutes and should result in a set of
equally good trees.
Inspect trees:
contree all/strict=no majrule=yes percent=50
This constructs a consensus tree showing monophyletic groups occurring in more than
50% of all trees. Scroll back to see the tree. At each internal node you will find an
indication of how often the corresponding group (meaning all taxa descending from that
internal node) was found in the set of all trees. (Numbers are percentages). The option
percent=50 specifies that we want to see only groups occurring at least 50% of the time
(i.e., we are requesting a "majority rule consensus"). You can increase this value (not
lower it) if you want to set a different cutoff.
The consensus tree shows that groups corresponding to individual patients occur in
all trees (i.e., the nodes that are at the base of viruses originating from any single
patient should all be labeled "100"). You will also note that there may be some
sub-trees where the branching order is now unresolved, meaning that three or more taxa
all split out from the same internal node. These multifurcations show that while more
than 50% of the individual trees had those taxa together as a group (the precise number
is indicated at the internal node), different trees nevertheless disagreed on the exact
branching order within that group.
As you can see, consensus trees are a handy way of summarizing the evidence shared in
a set of trees, and they are therefore useful when a search identifies several good
reconstructed phylogenies. Also note that while the Neighbor Joning tree resembled the
present consensus in many ways (e.g., in the clustering of patients), it did not
indicate that there might be any uncertainty as to the details of the tree. However, by
using a method that has an explicit measure of tree goodness (least squares in this
case) you have now learned that there are several equally good reconstructions of the
branch order within the individual patient clusters.
Compute fit of least squares branch lengths to observed branch lengths:
dscores 1/objective=lsfit power=0
Again, we are computing the sum of squared deviations between observed and
patristic pairwise distances. Arbitrarily we have chosen to only do this for tree
number 1 ("dscores all" would have done it for all trees in memory), but
recall that all trees in memory are equally good, so the results would have been
identical to what you now get. Now, compare the result from this analysis with
the number you obtained from the neighbor joining tree above. Has the fit
improved? Note the sum of squares on today's form.
Find total length of tree:
describetrees 1/plot=no brlens=yes
Again, find the sum of all branch lengths (the tree-length) and note the value
on the form - at the end of today you will compare how well different methods worked.
Analysis of HCV Data Set Using Minimum Evolution
Select JC corrected distances under the minimum evolution criterion:
dset distance=jc objective=me
We now want to explore a different optimality criterion for distance-based
analysis. Under minimum evolution we take the shortest tree to be the best one.
This is very similar to parsimony, but in this case we are using pairwise,
JC-corrected distances as the basis for reconstructing the tree. ME proceeds by
searching through a list of possible trees; for each tested topology the best
set of branch lengths are found by the least squares method, but instead of
finally choosing the tree with the best fit, we instead end up by choosing the
shortest tree.
Find the best tree using heuristic searching starting from a NJ tree:
hsearch start=nj swap=tbr
Inspect trees:
contree all/strict=no majrule=yes percent=50
Again you should see that the set of best trees have individual patients clustered
separately.
Find total length of tree:
describetrees 1/plot=no brlens=yes
At the end of the table listing branch lengths, you will again find the sum of
all branch lengths. Note the number on today's form and compare it to the tree
lengths you found for the least squares and neighbor joining trees above. Is the
minimum evolution tree shorter than the other two trees?
Compute fit of minimum evolution branch lengths to observed branch lengths:
dscores 1/objective=lsfit power=0
Again, we are computing the sum of squared deviations between observed and
patristic pairwise distances. Note the result from this analysis on the form and
compare it with the numbers you obtained from the neighbor joining and least
squares analyses above. How is the fit of the ME tree compared to those two?
Report results:
As soon as you have shown your results form to the teacher you are done for
today. See you again Wednesday!
|