Events News Research CBS CBS Publications Bioinformatics
Staff Contact About Internal CBS CBS Other

Distance-Based Methods for Phylogenetic Reconstruction

Exercise written by: Anders Gorm Pedersen


Overview

In today's exercise we will reconstruct phylogenetic trees using a variety of distance-based methods. Specifically, we will explore two different optimality criteria (least squares and minimum evolution), and one clustering method (neighbor joining).


Getting started

  1. Copy files for today's exercise:

    Make sure you're still in today's working directory (condist) and that you already have the hcv.nexus file there. Now, copy the following file to the dir also: ~gorm/distance/simple.nexus

    simple.nexus is an artificial data set that I have constructed. It is identical to the one you analyzed by hand earlier today in class. We will use it to convince ourselves that PAUP gets the same result as you.


Analysis of the Simple Data Set

  1. Start PAUP* and load the simple data set:

    paup simple.nexus
  2. Select distance-based tree-reconstruction:

    set criterion=distance
  3. Select uncorrected distances under the un-weighted least squares criterion:

    dset distance=p objective=lsfit power=0

    The dset command is used to set various options for the distance-based methods. Option "distance=p" specifies the use of "uncorrected sequence distances", i.e., we do not want to correct the observed distances for multiple substitutions. Note that distances are here reported as "substitutions per site". This simply means that the number of differences has been divided by the length of the sequence. You can think of this distance as the fraction of sites that are different between two sequences.

    The option "objective=lsfit" specifies that we want to reconstruct trees using the least squares optimality criterion. Recall that under least squares we are trying to find the tree that has the smallest possible deviation between the observed pairwise distances and the pairwise distances measured along the tree. (The distance between two taxa measured along the tree is called the "patristic" distance). The overall fit of the tree is found by (1) computing the difference between each observed distance and the corresponding patristic distance, (2) squaring this difference (this way we are sure to obtain a positive number, regardless of whether the observed or the patristic difference is larger), and (3) adding all the squared differences. The option "power=0" specifies that we do not want to weight the squared differences according to branch lengths when computing this fit.

  4. Inspect distance matrix

    showdist

    This command shows the distance matrix as evaluated under the current criteria.

    Q1: make a note of all the pairwise fractional distances. For each value compute the corresponding absolute distance (this is done by muliplying the fractional distance by the length of the alignment). Do the numbers agree with the distance matrix you constructed earlier today?

  5. Find best tree using exhaustive search:

    alltrees

    This data set is sufficiently small that we can search through all possible trees.

    Q2: how many different, unrooted trees with 4 leafs is it possible to construct?

  6. Inspect best tree, and analyze branch lengths:

    describetrees all/plot=phylogram brlens=yes label=yes

    Does the best tree correspond to the one you constructed?

    Q3: We now want to investigate whether the fitted branch lengths correspond to the observed pairwise distances. First, draw a sketch of the tree. Second, label each branch with the branch length as listed in the table you just produced with describetrees. Finally, compute the patristic distance between each pair of species on the tree by adding up the branch lengths of branches lying on the path between the two taxa. Do the observed pairwise distances (from the distance matrix in Q1) correspond to the patristic distances in this case?


Analysis of HCV Data Set using Neighbor Joining

  1. Set up analysis fo HCV data set

    Using what you have learned before do the following steps:

    • Load the file hcv.nexus
    • Select distance-based tree-reconstruction (set the criterion to distance)
    • Select uncorrected distances
    • Define the following outgroup: 2_1_1 2_1_2 2_1_3 2_1_4 2_1_5 2_1_7 2_1_8 2_1_9 2_1_10
    • Set outgroup rooting, and ensure outgroup is printed as monophyletic sister group to ingroup.
  2. Construct a neighbor joining tree based on the HCV data:

    nj

    This will construct a neighbor joining tree using the active distance measure (currently set to uncorrected).

  3. Print tree and table of branch lengths:

    describetrees 1/plot=phylogram brlens=yes

    The neighbor joining tree resembles the trees you previously constructed using parsimony. Importantly, you should see that the viral sequences from different patients form distinct clusters. Note that only a single tree is produced. This is characteristic of clustering methods, which work by following a deterministic algorithm for constructing a tree from distance data. Clustering algorithms such as neighbor joining do not have any measure of tree-goodness and therefore are not able to identify sets of equally good trees.

    Q4: The present neighbor joining tree was computed without correcting the observed distances for multiple substitutions. In the phylogram, identify the internal node that is ancestral to the patient 5 sequences (you will see that internal nodes are labeled with consecutive numbers), and also the internal node that is one level further down in the tree (i.e., ancestral to the ancestral node). You will note that the branch connecting these two nodes is relatively long. Locate the branch in the list of branch lengths and make a note of the length.

  4. Select correction of multiple substitutions using the Jukes and Cantor model:

    dset distance=jc

    This causes all observed distances to be corrected using a formula based on the Jukes and Cantor model of evolution. Recall that under the Jukes and Cantor model all base frequencies are assumed to be equal (at 0.25), and all base substitution rates are also assumed to be equal.

  5. Construct a new neighbor joining tree using corrected distances:

    nj
  6. Print tree and table of branch lengths:

    describetrees 1/plot=phylogram brlens=yes

    In this tree all branch lengths have been corrected for (unobserved) multiple substitutions. That means they are slightly longer than the uncorrected distances, and this correction is more noticeable for longer branches.

    Q5: Again locate the internal node that is ancestral to the patient 5 sequences and also the immediate ancestor of this node (the node labels are not necessarily the same as before). Now find the corresponding branch in the table and make a note of the length. Is the corrected branch length longer than the uncorrected one? What is the percentual increase?

    NB - do this: You are currently using neighbor joining to reconstruct the phylogenetic tree. Below you will also explore the use of least squares and minimum evolution methods. In order to compare the performance and characteristics of these methods we want to record some informative numbers. On the page, where you are noting answers to today's questions, you should now make a small table with two columns (labeled "SSE" and "tree length"), and three rows (labeled "NJ", "least squares", and "minimum evolution").

    Q6: At the end of the list of branch lengths (printed with the describetrees command), you will find the sum of all branch lengths. This is often called the "length" of the tree. Make a note of this number in the proper place in your table.

  7. Compute fit of NJ branch lengths to observed branch lengths:

    dscores 1/objective=lsfit power=0

    The dscores command computes the fit of the tree in memory according the specified criterion. In this case we are computing the fit between the observed branch lengths and the branch lengths found by neighbor joining. The measure used is the sum of squared deviations mentioned above.

    Q7: Make a note of the sum of squared errors in the proper place in your table (it is indicated by "S.S" which is an abbreviation for sum of squares).


Analysis of HCV Data Set Using Least Squares

  1. Select JC corrected distances under the unweighted least squares criterion:

    dset distance=jc objective=lsfit power=0
  2. Find the best tree using heuristic searching:

    hsearch start=nj swap=tbr

    As we have seen previously, the HCV data set is far too big for exhaustive searching, and we therefore have to resort to heuristic techniques when we are using a phylogenetic reconstruction method that is based on an optimality criterion. In this case the starting tree is constructed by neighbor joining, i.e., it should be identical to the tree we just inspected. The heuristic search (which again uses re-arrangements of the "tree-bisection and reconnection" type) may take a few minutes and should result in a set of equally good trees.

  3. Inspect trees:

    contree all/strict=no majrule=yes percent=50

    This constructs a consensus tree from the set of equally good best trees. Again you should see that the set of best trees have individual patients clustered separately. Note that while the Neighbor Joining tree also showed this feature, it did not indicate that there might be any uncertainty as to the details of the tree. However, by using a method that has an explicit measure of tree goodness (least squares in this case) you have now learned that there are several equally good reconstructions of the branch order within the individual patient clusters.

  4. Compute fit of least squares branch lengths to observed branch lengths:

    dscores 1/objective=lsfit power=0

    Again, we are computing the sum of squared deviations between observed and patristic pairwise distances. Arbitrarily we have chosen to only do this for tree number 1 ("dscores all" would have done it for all trees in memory), but recall that all trees in memory are equally good, so the results would have been identical to what you now get.

    Q8: Now, compare the result from this analysis with the number you obtained from the neighbor joining tree above. Has the fit improved? Note the sum of squares in your table.

  5. Find total length of tree:

    describetrees 1/plot=no brlens=yes

    Q9 Again, find the sum of all branch lengths and note the value in your table.


Analysis of HCV Data Set Using Minimum Evolution

  1. Select JC corrected distances under the minimum evolution criterion:

    dset distance=jc objective=me

    We now want to explore a different optimality criterion for distance-based analysis. Under minimum evolution we take the shortest tree to be the best one. This is very similar to parsimony, but in this case we are using pairwise, JC-corrected distances as the basis for reconstructing the tree. ME proceeds by searching through a list of possible trees; for each tested topology the best set of branch lengths are found by the least squares method, but instead of finally choosing the tree with the best fit, we instead end up by choosing the shortest tree.

  2. Find the best tree using heuristic searching starting from a NJ tree:

    hsearch start=nj swap=tbr
  3. Inspect trees:

    contree all/strict=no majrule=yes percent=50

    Again you should see that the set of best trees have individual patients clustered separately.

  4. Find total length of tree:

    describetrees 1/plot=no brlens=yes

    Q10: At the end of the table listing branch lengths, you will again find the sum of all branch lengths. Note the number your table and compare it to the tree lengths you found for the least squares and neighbor joining trees above. Is the minimum evolution tree shorter than the other two trees?

  5. Compute fit of minimum evolution branch lengths to observed branch lengths:

    dscores 1/objective=lsfit power=0

    Q11: Again, we are computing the sum of squared deviations between observed and patristic pairwise distances. Note the result from this analysis in your table and compare it with the numbers you obtained from the neighbor joining and least squares analyses above. How is the fit of the ME tree compared to those two?