Events News Research CBS CBS Publications Bioinformatics
Staff Contact About Internal CBS CBS Other

Distance-Based Methods for Phylogenetic Reconstruction

Exercise written by: Anders Gorm Pedersen


Overview

In today's exercise we will reconstruct phylogenetic trees using a variety of distance-based methods. Specifically, we will explore two different optimality criteria (least squares and minimum evolution), and one clustering method (neighbor joining).


Getting started, description of data

  1. Log in to your account:

    Start by logging in to your group's account.

  2. Construct today's working directory:

    mkdir distance
    ls -l

    The mkdir command constructs a new directory. You should be able to see it using the ls command.

  3. Change to today's working directory

    cd distance
  4. Copy files for today's exercise:

    cp ~gorm/distance/hcv.nexus ./hcv.nexus
    cp ~gorm/distance/simple.nexus ./simple.nexus
  5. Inspect sequence files:

    nedit hcv.nexus

    This file contains 41 aligned Hepatitis C virus (HCV) sequences isolated from 5 different patients. Sequences are named in the following way: Patient_Time_Clone. For instance, the sequence labeled 1_1_5 was isolated from patient number 1 at time point 1 and is clone number 5 from that patient and that time point.

    The file is in the so-called NEXUS format that is widely used by phylogeny-related software. The NEXUS format allows for entering of both data and commands. The sequenced region corresponds to the end of the E1 gene and the beginning of the E2 gene, surrounding the so-called hyper-variable region 1 (HVR1).

    nedit simple.nexus

    This is an artificial data set that I have constructed. It is identical to the one you analyzed by hand earlier today in class. We will use it to convince ourselves that PAUP gets the same result as you.


Analysis of the Simple Data Set

  1. Start the PAUP* program:

    paup

    This will give you a prompt ("paup>") where you can write commands to the program.

  2. Load the simple data set:

    execute simple.nexus

    This command loads the sequences from the nexus file into memory. If the nexus file had contained any commands, then these would also have been executed.

  3. Select distance-based tree-reconstruction:

    set criterion=distance

    PAUP* can reconstruct phylogenies using either parsimony, distance-based, or likelihood-based criteria. Today we will be using the distance-based methods.

  4. Select uncorrected distances under the un-weighted least squares criterion:

    dset distance=p objective=lsfit power=0

    The dset command is used to set various options for the distance-based methods. Option "distance=p" specifies the use of "uncorrected sequence distances", i.e., we do not want to correct the observed distances for multiple substitutions. Note that distances are here reported as "substitutions per site". This simply means that the number of differences has been divided by the length of the sequence. You can think of this distance as the fraction of sites that are different between two sequences.

    The option "objective=lsfit" specifies that we want to reconstruct trees using the least squares optimality criterion. Recall that under least squares we are trying to find the tree that has the smallest possible deviation between the observed pairwise distances and the pairwise distances measured along the tree. (The distance between two taxa measured along the tree is called the "patristic" distance). The overall fit of the tree is found by (1) computing the difference between each observed distance and the corresponding patristic distance, (2) squaring this difference (this way we are sure to obtain a positive number, regardless of whether the observed or the patristic difference is bigger), (3) adding all the squared differences. The option "power=0" specifies that we do not want to weight the squared differences according to branch lengths when computing this fit.

  5. Inspect distance matrix

    showdist

    This command shows the distance matrix as evaluated under the current criteria. Make sure that the matrix corresponds to the one you constructed earlier today (keeping in mind that the current distances are displayed as substitutions per site - not as the total number of differences). You will need to use these numbers later in the exercise (you can then scroll back to inspect the matrix again).

  6. Find best tree using exhaustive search:

    alltrees

    The "alltrees" command makes PAUP* find the best tree by exhaustively searching through all possible trees. Since there are just 3 different, unrooted trees with 4 taxons this is not a problem in this case, but for larger data sets we will have to use approximate ("heuristic") methods.

    For each possible tree topology PAUP* finds the best set of branch lengths, and then computes the "sum of squared errors" as a measure of how well the patristic distances fit the observed, pairwise sequence distances. The best tree is the one with the smallest sum of squared errors. At the end of the run, PAUP* outputs a histogram giving the distribution of the sum of squares for all trees (only three in this case).

  7. Inspect best tree, and analyze branch lengths:

    describetrees all/brlens=yes label=yes

    This command prints a description of all the trees currently in memory along with a table of branch lengths. Since there is only 1 best tree for this data set we get just one description. We could have also used the command "describetrees 1" to explicitly request info on tree number 1 in memory. The option "label=yes" requests labeling of internal nodes.

    Does the best tree correspond to the one you constructed?

    We now want to investigate whether the fitted branch lengths correspond to the observed pairwise distances. First, draw a sketch of the tree in the proper place on today's results form. Second, label each branch with the branch length as listed in the table you just produced with describetrees. Finally, compute the patristic distance between each pair of species on the tree by adding up the branch lengths of branches lying on the path between the two taxa. Note all patristic distances on the form also. Does the observed pairwise distances (from the matrix you printed using showdist before) correspond to the patristic distances in this case?


Analysis of HCV Data Set using Neighbor Joining

  1. Load HCV data set:

    execute hcv.nexus

    We have now loaded the HCV data set.

  2. Select distance-based tree-reconstruction:

    set criterion=distance
  3. Select uncorrected distances:

    dset distance=p
  4. Define outgroup:

    outgroup 2_1_1 2_1_2 2_1_3 2_1_4 2_1_5 2_1_7 2_1_8 2_1_9 2_1_10

    This puts all nine sequences from patient 2 in the outgroup, which will be used to place the root of the tree.

  5. Set outgroup rooting and select options for tree printing:

    set root=outgroup outroot=monophyl

    This makes PAUP* use the outgroup we defined above for the purpose of rooting the tree. The "outroot=monophyl" command makes PAUP construct a tree where the outgroup is a monophyletic sister group to the ingroup. (Outroot could also have been set to "polytomy" or "paraphyl").

  6. Construct a neighbor joining tree based on the HCV data:

    nj

    This will construct a neighbor joining tree using the active distance measure (currently set to uncorrected).

  7. Print tree and table of branch lengths:

    describetrees 1/plot=phylogram brlens=yes label=yes

    This plots the neighbor joining tree along with a list of branch lengths. The plot is a "phylogram" meaning that branch lengths are used (this is different from a "cladogram" which would only indicate branching order). Note that the viral sequences from different patients form distinct clusters.

    The present neighbor joining tree was computed without correcting the observed distances for multiple substitutions. In the phylogram, identify the internal node that is ancestral to the patient 5 sequences, and also the internal node that is one level further down in the tree (i.e., ancestral to the ancestral node). You will note that the branch connecting these two nodes is relatively long. Locate the corresponding branch in the list of branch lengths and write down the length on today's results form.

  8. Select correction of multiple substitutions using the Jukes and Cantor model:

    dset distance=jc

    This causes all observed distances to be corrected using a formula based on the Jukes and Cantor model of evolution. Recall that under the Jukes and Cantor model all base frequencies are assumed to be equal (at 0.25), and all base substitution rates are also assumed to be equal.

  9. Construct a new neighbor joining tree using corrected distances:

    nj
  10. Print tree and table of branch lengths:

    describetrees 1/plot=phylogram brlens=yes label=yes

    In this tree all branch lengths have been corrected for (unobserved) multiple substitutions. That means they are slightly longer than the uncorrected distances, and this correction is more noticeable for longer branches.

    Again locate the internal node that is ancestral to the patient 5 sequences and also the immediate ancestor of this node (the node labels are not necessarily the same as before). Now find the corresponding branch in the table and write down the length on the form. You should see that the corrected branch length is substantially longer.

  11. Find total length of tree:

    At the end of the table listing branch lengths, you will find the sum of all branch lengths. This is often called the "length" of the tree. Note down this number in the appropriate place on the results form - you will need it later.

  12. Compute fit of NJ branch lengths to observed branch lengths:

    dscores 1/objective=lsfit power=0

    The dscores command computes the fit of the tree in memory according to the specified criterion. In this case we are computing the fit between the observed pairwise sequence-distances and the branch lengths found by neighbor joining. The measure used is the sum of squared deviations mentioned above. Again, write down the number on today's form (it is indicated by "S.S." which is an abbreviation for "sum of squares").


Analysis of HCV Data Set Using Least Squares

  1. Select JC corrected distances under the unweighted least squares criterion:

    dset distance=jc objective=lsfit power=0
  2. Find the best tree using heuristic searching:

    hsearch start=nj swap=tbr

    Unlike the simple data set we investigated earlier, the HCV data set is far too big for exhaustive searching. We therefore have to resort to heuristic techniques when we are using a phylogenetic reconstruction method that is based on an optimality criterion. In this case the starting tree is constructed by neighbor joining, i.e., it should be identical to the tree we just inspected. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the "tree-bisection and reconnection" type). The search may take a few minutes and should result in a set of equally good trees.

  3. Inspect trees:

    contree all/strict=no majrule=yes percent=50

    This constructs a consensus tree showing monophyletic groups occurring in more than 50% of all trees. Scroll back to see the tree. At each internal node you will find an indication of how often the corresponding group (meaning all taxa descending from that internal node) was found in the set of all trees. (Numbers are percentages). The option percent=50 specifies that we want to see only groups occurring at least 50% of the time (i.e., we are requesting a "majority rule consensus"). You can increase this value (not lower it) if you want to set a different cutoff.

    The consensus tree shows that groups corresponding to individual patients occur in all trees (i.e., the nodes that are at the base of viruses originating from any single patient should all be labeled "100"). You will also note that there may be some sub-trees where the branching order is now unresolved, meaning that three or more taxa all split out from the same internal node. These multifurcations show that while more than 50% of the individual trees had those taxa together as a group (the precise number is indicated at the internal node), different trees nevertheless disagreed on the exact branching order within that group.

    As you can see, consensus trees are a handy way of summarizing the evidence shared in a set of trees, and they are therefore useful when a search identifies several good reconstructed phylogenies. Also note that while the Neighbor Joning tree resembled the present consensus in many ways (e.g., in the clustering of patients), it did not indicate that there might be any uncertainty as to the details of the tree. However, by using a method that has an explicit measure of tree goodness (least squares in this case) you have now learned that there are several equally good reconstructions of the branch order within the individual patient clusters.

  4. Compute fit of least squares branch lengths to observed branch lengths:

    dscores 1/objective=lsfit power=0

    Again, we are computing the sum of squared deviations between observed and patristic pairwise distances. Arbitrarily we have chosen to only do this for tree number 1 ("dscores all" would have done it for all trees in memory), but recall that all trees in memory are equally good, so the results would have been identical to what you now get. Now, compare the result from this analysis with the number you obtained from the neighbor joining tree above. Has the fit improved? Note the sum of squares on today's form.

  5. Find total length of tree:

    describetrees 1/plot=no brlens=yes

    Again, find the sum of all branch lengths (the tree-length) and note the value on the form - at the end of today you will compare how well different methods worked.


Analysis of HCV Data Set Using Minimum Evolution

  1. Select JC corrected distances under the minimum evolution criterion:

    dset distance=jc objective=me

    We now want to explore a different optimality criterion for distance-based analysis. Under minimum evolution we take the shortest tree to be the best one. This is very similar to parsimony, but in this case we are using pairwise, JC-corrected distances as the basis for reconstructing the tree. ME proceeds by searching through a list of possible trees; for each tested topology the best set of branch lengths are found by the least squares method, but instead of finally choosing the tree with the best fit, we instead end up by choosing the shortest tree.

  2. Find the best tree using heuristic searching starting from a NJ tree:

    hsearch start=nj swap=tbr
  3. Inspect trees:

    contree all/strict=no majrule=yes percent=50

    Again you should see that the set of best trees have individual patients clustered separately.

  4. Find total length of tree:

    describetrees 1/plot=no brlens=yes

    At the end of the table listing branch lengths, you will again find the sum of all branch lengths. Note the number on today's form and compare it to the tree lengths you found for the least squares and neighbor joining trees above. Is the minimum evolution tree shorter than the other two trees?

  5. Compute fit of minimum evolution branch lengths to observed branch lengths:

    dscores 1/objective=lsfit power=0

    Again, we are computing the sum of squared deviations between observed and patristic pairwise distances. Note the result from this analysis on the form and compare it with the numbers you obtained from the neighbor joining and least squares analyses above. How is the fit of the ME tree compared to those two?

  6. Report results:

    As soon as you have shown your results form to the teacher you are done for today. See you again Wednesday!