Phylogenetic Analysis using Parsimony (PAUP)

Exercise written by: Anders Gorm Pedersen


In this exercise you are going to examine three different data sets using the PAUP* program. All analyses will be performed under the parsimony criterion. Data set 1 consists of genomic nucleotide sequences from 9 primate species. This set is sufficiently small that an exhaustive search of all possible trees can be performed. Data set 2 consists of mitochondrial nucleotide sequences from 12 primates. This data set is so large that it would take too long to run through all possible trees, but it is still small enough that branch and bound can be used to perform an exhaustive search. The third and last data set consists of Hepatitis C virus sequences isolated from different patients. This set is much too big for exhaustive searching and we will have to employ heuristic methods to analyze it.

In addition to exploring aspects of parsimonious phylogenetic reconstruction, an important goal of this exercise is to introduce you to the PAUP* interface, and to the different types of manipulations and analyses that can be performed within the program. Later in the course you will use PAUP* for distance-based and maximum likelihood-based phylogenetic reconstruction. Several other programs (e.g., MacClade and MrBayes) use command-line interfaces that are very similar to the one used by PAUP*.

Install software:

Please install the FigTree program (choose version appropriate for your platform).

Also make sure you have Java on your computer (this is required for the FigTree viewer and for the Jalview alignment viewer used in today's exercise).

Getting started, description of data

  1. Log in to your Unix shell account on "padawan".

    Make sure to maximize the window: the analyses we will perform give lots of output to the screen, so having a nice and large shell window makes it easier to keep track of what happens.

  2. Construct working directory, copy files:

    mkdir parsimony1
    cd parsimony1
    cp /home/people/gorm/parsimony1/* .
    ls -l

    The cp command above means "copy everything in /home/people/gorm/parsimony1/ to current dir". Make sure to include the dot (".") at the end of the command - this means "current dir".

  3. Have a look at the data files:

    nedit mhc.fasta

    This file is in the so-called fasta-format, and contains unaligned DNA sequences. Specifically, the sequences are major histo-compatibility complex (MHC) class I genes from nine different primate species. MHC class I genes encode proteins that are involved in the immune response. Now, close the nedit window and have a look at the next data file:

    nedit primate_mtDNA_interleaved.fasta

    This file contains mitochondrial DNA sequences from 12 different primate species. Now have a look at the final data file:

    nedit hcv.fasta

    This file contains Hepatitis C virus (HCV) sequences isolated from 4 different patients. The sequenced region corresponds to the end of the E1 gene and the beginning of the E2 gene, surrounding the so-called hyper-variable region 1 (HVR1). When you've had a look close the nedit window so it doesn't clutter your screen.

Links to data (use these if you'd like to use the data, but you are not enrolled in the course):

Moving data between CBS server and local computer

It is often useful to be able to move data between your own (local) computer and the (remote) CBS servers. That way you can perform analyses where you do some of the work on the local machine (e.g., via a web interface) and some on the servers. Below I will describe three ways of achieving this goal. In the rest of this exercise (and the rest of the course) you are free to use whichever method suits you best.

  1. Secure Copy (SCP) or Secure FTP (SFTP):

    Depending on what platform you work on, there are a number of ways of using the secure copy (SCP) or secure FTP (SFTP) protocols. For instance the free, open source program Cyberduck works well for both Windows and Macs. For Linux one option is Filezilla. It is also possible to use the commandline program "scp" (issuing the command "man scp" in a shell window will give you access to the online man(ual)-pages).

  2. Copy/paste between text editor windows:

    If you can run graphical applications over the network connection from the remote machine (i.e., if you have a working X11-system) you can also use the following low-tech approach:

    • First, open the file you want to transfer in nedit (in a shell window) on the remote CBS server.

    • Secondly, open a text editor on your local machine (e.g., Textedit on Macs, Notepad on Windows).

    • Third, copy and paste the entire content from the remote editor to the editor on your local machine.

    • Fourth, save the file (in plain text format) on your local machine

    This approach can obviously also be used in the reverse direction.

  3. Copy/paste between text editor and web interface:

    If you have a text file on one machine and want to use the data in a web interface on the other machine, then you don't always need to first construct the file there: many web-interfaces have text boxes where you can directly paste the data, thus making the intermediate files unnecessary.

Multiple alignment of MHC class I sequences

In this part of the exercise, we will use the program mafft to make a multiple alignment of the MHC class I sequences.

  1. Make multiple alignment using mafft server on EBI website:

    Open the mafft alignment server at EBI (on your local machine) and transfer the data from the mhc.fasta file on padawan (either via the text box or via the file "Browse" button). Set the sequence type to "Nucleic acid" and start the alignment by clicking "Submit".

    Note 1: The starting point of all phylogenetic analyses is a multiple alignment, and your results will be no better than the quality of that initial alignment.

    Note 2: There are many different multiple alignment programs. EBI provides access to some of these on its Multiple Sequence Alignment page.

  2. Inspect the alignment in graphical viewer:

    When the alignment is done: inspect the alignment by clicking the "Result summary" tab, and then the "Start Jalview" button.

    Q1: There is a gap in the "yellow_baboon" sequence. What is the length of this gap? Why does this length make evolutionary sense?

  3. Convert alignment format to NEXUS:

    The program we will use next does not understand FASTA format, so we will first have to convert the alignment to a format it does understand, namely the so-called NEXUS format.

    On the "Result summary" page click the link under "Alignments in FASTA format". Now transfer the alignment data to the Readseq - biosequence conversion tool at EBI. Select "PAUP|NEXUS" under "Output sequence format" and run the converter by clicking the "Submit" button. Note: you can either download the output to a file or view it in the browser window (select requested option under "Return biosequence data").

  4. Save the alignment on the CBS server:

    Finally, save the data (alignment in NEXUS format) to a file named on the CBS server.

  5. Manually edit the header of the NEXUS file:

    Open the file, find the fourth line of text and change "MISSING=-" into "GAP=-". (We want to make sure the PAUP program can recognize the gaps in the input data. It is surprisingly often the case that different programs will disagree slightly on sequence file formats...)

Phylogenetic Analysis of MHC Class I sequences

  1. Start the PAUP* program:


    This will give you a prompt ("paup>") where you can write commands to the program.

  2. Load the nexus file:


    This command loads the sequences from the nexus file into memory. If the nexus file had contained any commands, then these would also have been executed.

  3. Exclude all alignment columns containing any gaps:

    exclude gapped

    The exclude command can be used to ignore specified columns of the alignment. For instance, "exclude 1-29 560-577" makes PAUP* ignore columns 1 through 29 as well as columns 560 through 577. The special word "gapped" automatically selects all columns containing one or more gaps.

    Q2: How many columns were excluded because of gaps?

  4. Define the outgroup:

    outgroup olive_baboon macaque yellow_baboon

    This defines an outgroup consisting of the three old world monkeys in the data set. The names are those that appear in the file.

    The outgroup will be used to place the root of the tree. The rationale is as follows: our data set consists of sequences from man, from a number of great apes, and from a number of old world monkeys. We know from other evidence that the lineage leading to monkeys branched off before any of the remaining organisms diverged from each other. The root of the tree connecting the organisms investigated here, must therefore be located between the monkeys (the "outgroup") and the rest (the "ingroup")). This way of finding a root is called "outgroup rooting".

  5. Activate outgroup rooting and select how tree will be printed:

    set root=outgroup outroot=monophyl

    This makes PAUP* use the outgroup we defined above for the purpose of rooting the tree. The "outroot=monophyl" command makes PAUP construct a tree where the outgroup is a monophyletic sister group to the ingroup. (Outroot could also have been set to "polytomy" or "paraphyl").

  6. Find the most parsimonious tree by examining all possible trees:


    This makes PAUP* find the best trees by exhaustively searching through all possible trees (the number of unrooted trees with 9 taxons is 135,135). By default PAUP* uses the parsimony criterion for constructing trees. For each possible tree PAUP* finds the length (i.e., the number of mutational events required to explain the data set), and upon finishing this, the best tree is the one with the smallest total length. If there are several trees with the same length, then these are all kept since they are equally good. At the end of the run, PAUP* outputs a histogram giving the distribution of all tree lengths.

    Q3: What is the length of the shortest tree?

    Q4: How many trees with this score was found?

    Q5: How far is the best tree from the second best one?

  7. Examine the best trees


    This prints the length of the trees that are currently in memory. (Actually this command not only prints but computes the lengths, so it can be used to evaluate any tree that has been loaded into memory - also trees built by other methods or other programs).

    describetrees all/plot=cladogram

    This will print descriptions and plots of both trees (The command "describetrees 1" would have printed a description of only the first tree). A cladogram is a type of tree where branch lengths are ignored and only branching order is indicated. You can scroll back to compare the two trees. You may notice that the disagreement between the two trees concerns whether gibbon or orangutan is closer to the root. Based on the information in this data set we can not distinguish between these two possibilities.

    Q6: Note where the human sequence is located: who are our closest relatives according to this tree?

    describetrees all/plot=phylogram

    A phylogram is a tree where branches are drawn with lengths proportional to the number of mutational events that has happened on them. (Note that tree-terminology is not entirely consistent, and there are several other names for this type of plot). Branch lengths are based on the reconstructed location of mutational events (and therefore also on the reconstructed ancestral sequences). If a mutational event could have occurred on any one of a number of branches (in the sense that all these reconstructions give the same tree length), then each of the branches are assigned a fraction of a mutation. For instance, if a mutational event could have been placed on either of two branches, then each of them will be counted as having had 0.5 mutational events.

    Q7: What is the longest branch on the tree?

    showtrees all

    This gives just the cladograms without further descriptions.

  8. Save the trees to a file:

    savetrees brlens=yes

    This saves the trees to a file named "" with indication of the location of the root, and with information about branch lengths.

  9. View trees in FigTree viewer on local machine:

    Open a second shell window on padawan, transfer the data in the file to a file on your local machine, and open it using the FigTree tree viewer. With this graphical viewer you can investigate the tree more closely and prepare publication grade figures. The program has several options for altering the display of the tree, including viewing the tree as unrooted, and altering the rooting interactively. After you've played around with the possibilities for a while you should close the window again.

    If, later in this course, you want to construct a tree figure that is prettier than the ASCII rendition that PAUP gives you, then you need to repeat the steps performed above (save tree using the savetrees command, and subsequently transfer the treefile to your own computer and open in FigTree).

  10. Find the most parsimonious trees using branch and bound:

    Now return to the shell window where you have PAUP running and perform a branch and bound search for the best trees:


    As explained in the lecture (and the book), branch and bound is guaranteed to find the best trees without necessarily searching through all of tree-space. There is therefore actually no reason to use alltrees unless you are explicitly interested in examining suboptimal trees or the distribution of all tree lengths.

    Q8: What is the length of the best tree found using branch and bound? Is that the same value found using exhaustive search?

  11. Examine the branch and bound trees:

    showtrees all

    (Remember: you also have the possibility of using FigTree as explained above).

  12. Load the previously found trees into memory:

    We want to convince ourselves that the branch and bound trees really are identical to the trees found by the alltrees command. In order to do that we must now load the previously found trees back into memory while still retaining the branch and bound tree:

    gettrees mode=7

    (Answer "yes" when asked whether you want to proceed). The mode=7 command means that PAUP* should keep all trees that are currently in memory and append all trees from the file. We now have four trees in memory (the two found by alltrees and the two found by bandb). Now compare the four trees and convince yourself that the same two trees were found by bandb:

    showtrees all

    You can also compare the scores:

    pscores all

    As one final way of establishing that the trees are identical you can compute the "distance" between the four trees in memory:


    This indicates how similar the trees in memory are by computing the so-called symmetric-difference metric. The output includes both a table listing all pairwise differences and a bar-chart giving the distribution of differences. Two trees with identical topology will have a distance of zero.

    Q9: Are the two new trees the same as two previously found trees?

  13. Define a constraint for tree-searching:

    constraints homooran (monophyly)=((human,orangutan));

    This defines a constraint named "homooran" which requires the taxons "human" and "orangutan" to form a monophyletic group. The constraint tree was here given as simply: ((human,orangutan)); The tree is here shown using a notation where a pair of parentheses corresponds to an internal node, while a comma-separated list enclosed by a pair of parentheses indicates the subtrees that branch out from this internal node. The constraint tree shown above is a brief way of defining the full unresolved constraint tree: (gorilla,gibbon,bonobo,chimpanzee,yellow_baboon,olive_baboon,macaque,(human,orangutan));

  14. Find the best tree that satisfies the constraint:

    bandb /constraints=homooran enforce=yes

    This performs a branch and bound search for the best tree that has human and orangutan together as a monophyletic group. Note the option enforce=yes which ensures that the named constraint is applied.

    Q10: Compare the length of this tree with the best tree found above. How many extra steps are required in this tree? (The difference tells you something about how much better one hypothesis is than the other. There are tests that will tell you whether the difference is significant, but we will not get into that at this point in the course).

Phylogenetic Analysis of Mitochondrial DNA Sequences

  1. Delete previously found trees from memory:

  2. Prepare data for analysis:

    We will now analyze the primate mitochondrial data set. Using what you learned above, perform the following steps:

    • Align the sequences present in the file primate_mtDNA_interleaved.fasta
    • Convert alignment to NEXUS format
    • Edit header so gaps are recognized correctly by PAUP
    • Load data into PAUP
    • Exclude gapped columns from data
    • Set outgroup to consist of all the non-hominoid species: Macaca_fuscata M_mulatta M_fascicularis M_sylvanus Saimiri_sciureus Tarsius_syrichta Lemur_catta
    • Activate outgroup rooting and select output of outgroup as monophyletic sister group to the ingroup.
  3. Estimate time needed for exhaustive search:

    We now have 12 taxons in our data set, corresponding to 654,729,075 possible trees. We first want to estimate how long it would take to search through every single one of them, but we do not want to actually wait for the required amount of time. Start the search as indicated below, and after it has run for one or two minutes interrupt the run by pressing CTRL-C.


    Once a minute you will get an indication of how far the search has progressed. Both the percentage of all trees covered so far and the actual number of trees investigated, will be shown.

    Q11: Use the numbers to estimate how long it would take to work through all 654,729,075 possible trees.

  4. Find the best trees using branch and bound:


    Q12: How long did it take to find the best trees by branch and bound? Compare this to the estimated time you would have had to wait for the alltrees command to finish (Q11). The time saved by ignoring suboptimal parts of search space can be quite impressive, but it depends on the structure of your data set.

  5. Compare trees:

    showtrees all / tcompress=yes

    There are two equally parsimonious trees. The tcompress=yes option gives a more condensed output of the treeplot. If you prefer the standard plots then simply issue the command:

    showtrees all / tcompress=no

    As before you can also save the trees to a file and view them with the FigTree viewer.

    As you may have noted, the difference between these two trees lies in whether man is more closely related to gorilla or to chimpanzee (named "Pan" in this data set).

Phylogenetic Analysis of viral DNA Sequences

  1. Delete previously found trees from memory:

  2. Prepare and load the data:

    We will now investigate the data in the file hcv.fasta. Using what you have learned above, prepare the data and load it into paup. This dataset consists of 41 hepatitis C virus sequences obtained from four different patients. The alignment should not contain any gaps (there have been no insertions or deletions).

  3. Enable PAUP* to store an unlimited number of trees:

    set increase=auto
  4. Perform a heuristic search using NNI:

    hsearch start=stepwise addseq=random nreps=20 swap=NNI

    We now have so many sequences that exhaustive searching using alltrees or even bandb is impossible (there are approximately 10^60 possible trees). We therefore have to employ heuristic searching. The above command starts a heuristic search using sequential addition with random addition to construct the initial tree. (In all 20 different random addition sequences are tried). Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the "nearest neighbor interchange" type (NNI).

  5. Inspect search result:

    This time the search results in a large number of equally parsimonious trees. The result is summarized in the form of a "tree-island profile". Trees from the same island are much more similar to each other than to trees from other islands. Specifically, a tree-island is defined as a set of trees where all members can reach all other members by one single re-arrangement (the trees are all neighbors).

    Q13: What is the best score found?

    Q14: How many equally parsimonious trees were found in all?

    Q15: How many tree islands are there with the best score?

  6. Perform a heuristic search using SPR:

    hsearch start=stepwise addseq=random nreps=20 swap=SPR

    This starts a heuristic search using rearrangements of the "subtree pruning and re-grafting" type (SPR). SPR is more elaborate than NNI and examines many more neighbors for each tree.

    Q16: Did this search find the same best score as NNI?

  7. Perform a heuristic search using TBR:

    hsearch start=stepwise addseq=random nreps=20 swap=TBR

    This starts a heuristic search using rearrangements of the "tree bisection and reconnection" type (TBR). TBR is more elaborate than both NNI and SPR. TBR is the default swap mode for heuristic searching in PAUP*, and NNI or SPR should mostly be used if you are interested in reducing search time.

    Q17: How many tree islands are there with the best score? Why do you think there are fewer islands with TBR and SPR compared to NNI?

  8. View unrooted tree in FigTree viewer:

    Using what you learned above, save the trees to a file and view them using the FigTree viewer on your local machine. Make sure to select the unrooted view (we have not rooted these trees). Note: if you want to only save tree number 1, then you can do so by using the following option in combination with the savetrees command: from=1 to=1.

    Notice how sequences from a single patient all cluster together, and that even within patients a significant amount of sequence diversity is present.

  9. Exit the PAUP* program:


PAUP* help:

  • PAUP*: command reference document.:
  • PAUP*: quick start tutorial.:
  • From within PAUP* you can get an overview over available commands by typing "help cmds". You can get help for a specific command by entering "command-name ?". For instance: "savetrees ?" will give you a summary of options available for the savetrees command.

Source of data