|
Phylogenetic Analysis using Parsimony (PAUP)
Exercise written by: Anders Gorm Pedersen
Overview
In this exercise you are going to examine three different data sets using the
PAUP* program. All analyses will be performed under the parsimony criterion. Data
set 1 consists of genomic nucleotide sequences from 9 primate species. This set is
sufficiently small that an exhaustive search of all possible trees can be performed.
Data set 2 consists of mitochondrial nucleotide sequences from 12 primates. This
data set is so large that it would take too long to run through all possible trees,
but it is still small enough that branch and bound can be used to perform an exhaustive
search. The third and last data set consists of Hepatitis C virus sequences
isolated from different patients. This set is much too big for exhaustive searching
and we will have to employ heuristic methods to analyze it.
In addition to exploring aspects of parsimonious phylogenetic
reconstruction, an important goal of this exercise is to introduce you to the
PAUP* interface, and to the different types of manipulations and
analyses that can be performed within the program. Later in the course you will
use PAUP* for distance-based and maximum likelihood-based phylogenetic
reconstruction. Several other programs (e.g., MacClade and MrBayes) use
command-line interfaces that are very similar to the one used by
PAUP*.
Please install the
FigTree program (choose version
appropriate for your platform).
Also make sure you have Java
on your computer (this is required for the FigTree viewer and for the Jalview alignment
viewer used in today's exercise).
Log in to your Unix shell account on "padawan".
Make sure to maximize the window: the analyses we will perform give lots of
output to the screen, so having a nice and large shell window makes it easier to
keep track of what happens.
Construct working directory, copy files:
mkdir parsimony1
cd parsimony1
cp /home/people/gorm/parsimony1/* .
ls -l
The cp command above means "copy everything in /home/people/gorm/parsimony1/ to
current dir". Make sure to include the dot (".") at the end of the command - this means "current
dir".
Have a look at the data files:
nedit mhc.fasta
This file is in the so-called fasta-format, and contains unaligned DNA
sequences. Specifically, the sequences are major histo-compatibility complex
(MHC) class I genes from nine different primate species. MHC class I genes
encode proteins that are involved in the immune response. Now, close the
nedit window and have a look at the next data file:
nedit primate_mtDNA_interleaved.fasta
This file contains mitochondrial DNA sequences from 12 different primate
species. Now have a look at the final data file:
nedit hcv.fasta
This file contains Hepatitis C virus (HCV) sequences isolated from 4
different patients. The sequenced region corresponds to the end of the E1 gene
and the beginning of the E2 gene, surrounding the so-called hyper-variable
region 1 (HVR1). When you've had a look close the nedit window so
it doesn't clutter your screen.
Links to data (use these if you'd like to use the data, but you are not enrolled in the course):
It is often useful to be able to move data between your own (local) computer and the
(remote) CBS servers. That way you can perform analyses where you do some of the work on the local
machine (e.g., via a web interface) and some on the servers. Below I will describe three ways of
achieving this goal. In the rest of this exercise (and the rest of the course) you are free
to use whichever method suits you best.
Secure Copy (SCP) or Secure FTP (SFTP):
Depending on what platform you work on, there are a number of ways
of using the secure copy (SCP) or
secure FTP (SFTP) protocols.
For instance the free, open source program Cyberduck works well
for both Windows and Macs. For Linux one option is
Filezilla. It is also possible
to use the commandline program "scp" (issuing the command "man scp" in a
shell window will give you access to the online man(ual)-pages).
Copy/paste between text editor windows:
If you can run graphical applications over the network connection from the remote machine
(i.e., if you have a working X11-system)
you can also use the following low-tech approach:
First, open the file you want to transfer in nedit (in a shell window) on the remote CBS server.
Secondly, open a text editor on your local machine (e.g., Textedit on Macs, Notepad on Windows).
Third, copy and paste the entire content from the remote editor to the editor on your local machine.
Fourth, save the file (in plain text format) on your local machine
This approach can obviously also be used in the reverse direction.
Copy/paste between text editor and web interface:
If you have a text file on one machine and want to use the data in a web interface on the
other machine, then you don't always need to first construct the file there: many web-interfaces
have text boxes where you can directly paste the data, thus making the intermediate files
unnecessary.
In this part of the exercise, we will use the program mafft to make a multiple
alignment of the MHC class I sequences.
Make multiple alignment using mafft server on EBI website:
Open the mafft alignment server at EBI
(on your local machine) and transfer the data from the mhc.fasta file on padawan
(either via the text box or via the file "Browse" button). Set the sequence type to
"Nucleic acid" and start the alignment by clicking "Submit".
Note 1: The starting point of all phylogenetic analyses is a multiple
alignment, and your results will be no better than the quality of that initial
alignment.
Note 2: There are many different multiple alignment programs. EBI provides access to
some of these on its Multiple Sequence Alignment page.
Inspect the alignment in graphical viewer:
When the alignment is done: inspect the alignment by clicking the "Result summary" tab, and then
the "Start Jalview" button.
Q1: There is a gap in the "yellow_baboon" sequence. What is the length
of this gap? Why does this length make evolutionary sense?
Convert alignment format to NEXUS:
The program we will use next does not understand FASTA format, so we will first have to convert
the alignment to a format it does understand, namely the so-called NEXUS format.
On the "Result summary" page click the link under "Alignments in FASTA format". Now transfer the
alignment data to the
Readseq - biosequence conversion tool at EBI.
Select "PAUP|NEXUS" under "Output sequence format" and run the converter by clicking the "Submit"
button. Note: you can either download the output to a file or view it in the browser window
(select requested option under "Return biosequence data").
Save the alignment on the CBS server:
Finally, save the data (alignment in NEXUS format) to a file named mhc.nexus on the CBS server.
Manually edit the header of the NEXUS file:
Open the mhc.nexus file, find the fourth line of text and change "MISSING=-" into
"GAP=-". (We want to make sure the PAUP program can recognize the gaps in the input data.
It is surprisingly often the case that different programs will disagree slightly on sequence file formats...)
Phylogenetic Analysis of MHC Class I sequences
Start the PAUP* program:
paup
This will give you a prompt ("paup>") where you can write
commands to the program.
Load the nexus file:
execute mhc.nexus
This command loads the sequences from the nexus file into memory. If the
nexus file had contained any commands, then these would also have been
executed.
Exclude all alignment columns containing any gaps:
exclude gapped
The exclude command can be used to ignore specified columns of the
alignment. For instance, "exclude 1-29 560-577" makes PAUP*
ignore columns 1 through 29 as well as columns 560 through 577. The special word "gapped"
automatically selects all columns containing one or more gaps.
Q2: How many columns were excluded because of gaps?
Define the outgroup:
outgroup olive_baboon macaque yellow_baboon
This defines an outgroup consisting of the three old world monkeys in the
data set. The names are those that appear in the mhc.nexus file.
The outgroup will be used to place the root of the tree. The rationale is as
follows: our data set consists of sequences from man, from a number of great
apes, and from a number of old world monkeys. We know from other evidence that
the lineage leading to monkeys branched off before any of the remaining
organisms diverged from each other. The root of the tree connecting the
organisms investigated here, must therefore be located between the monkeys (the
"outgroup") and the rest (the "ingroup")). This way of finding a root is called
"outgroup rooting".
Activate outgroup rooting and select how tree will be printed:
set root=outgroup outroot=monophyl
This makes PAUP* use the outgroup we defined above for the purpose of
rooting the tree. The "outroot=monophyl" command makes PAUP construct
a tree where the outgroup is a monophyletic sister group to the ingroup.
(Outroot could also have been set to "polytomy" or "paraphyl").
Find the most parsimonious tree by examining all possible trees:
alltrees
This makes PAUP* find the best trees by exhaustively searching through all
possible trees (the number of unrooted trees with 9 taxons is 135,135). By default PAUP*
uses the parsimony criterion for constructing trees. For each possible tree PAUP*
finds the length (i.e., the number of mutational events required to explain the data
set), and upon finishing this, the best tree is the one with the smallest total
length. If there are several trees with the same length, then these are all kept
since they are equally good. At the end of the run, PAUP* outputs a histogram giving
the distribution of all tree lengths.
Q3: What is the length of the shortest tree?
Q4: How many trees with this score was found?
Q5: How far is the best tree from the second best one?
Examine the best trees
pscores
This prints the length of the trees that are currently in
memory. (Actually this command not only prints but computes the lengths,
so it can be used to evaluate any tree that has been loaded into
memory - also trees built by other methods or other programs).
describetrees all/plot=cladogram
This will print descriptions and plots of both trees (The command
"describetrees 1" would have printed a description of only the first
tree). A cladogram is a type of tree where branch lengths are ignored and only
branching order is indicated. You can scroll back to compare the two trees.
You may notice that the disagreement between the two trees concerns whether
gibbon or orangutan is closer to the root. Based on the information in this
data set we can not distinguish between these two possibilities.
Q6: Note where the human sequence is located: who are our closest relatives
according to this tree?
describetrees all/plot=phylogram
A phylogram is a tree where branches are drawn with lengths proportional to
the number of mutational events that has happened on them. (Note that
tree-terminology is not entirely consistent, and there are several other names
for this type of plot). Branch lengths are based on the reconstructed location
of mutational events (and therefore also on the reconstructed ancestral
sequences). If a mutational event could have occurred on any one of a number of
branches (in the sense that all these reconstructions give the same tree
length), then each of the branches are assigned a fraction of a mutation. For
instance, if a mutational event could have been placed on either of two
branches, then each of them will be counted as having had 0.5 mutational
events.
Q7: What is the longest branch on the tree?
showtrees all
This gives just the cladograms without further descriptions.
Save the trees to a file:
savetrees file=mhcalltrees.nexus brlens=yes
This saves the trees to a file named "mhcalltrees.nexus" with indication of
the location of the root, and with information about branch lengths.
View trees in FigTree viewer on local machine:
Open a second shell window on padawan, transfer the data in the file mhcalltrees.nexus to
a file on your local machine, and open it using the
FigTree tree viewer.
With this graphical viewer you can investigate the tree more closely and prepare publication grade
figures. The program has several options for altering the display of the tree, including viewing
the tree as unrooted, and altering the rooting interactively. After you've played around
with the possibilities for a while you should close the window again.
If, later in this course, you want to construct a tree figure that is prettier than the ASCII
rendition that PAUP gives you, then you need to repeat the steps performed above (save tree
using the savetrees command, and subsequently transfer the treefile to your own computer
and open in FigTree).
Find the most parsimonious trees using branch and bound:
Now return to the shell window where you have PAUP running and perform a branch and
bound search for the best trees:
bandb
As explained in the lecture (and the book), branch and bound is guaranteed
to find the best trees without necessarily searching through all of tree-space.
There is therefore actually no reason to use alltrees unless you are
explicitly interested in examining suboptimal trees or the distribution of all
tree lengths.
Q8: What is the length of the best tree found using branch and bound?
Is that the same value found using exhaustive search?
Examine the branch and bound trees:
showtrees all
(Remember: you also have the possibility of using FigTree as explained above).
Load the previously found trees into memory:
We want to convince ourselves that the branch and bound trees really are
identical to the trees found by the alltrees command. In order to do
that we must now load the previously found trees back into memory while still
retaining the branch and bound tree:
gettrees file=mhcalltrees.nexus mode=7
(Answer "yes" when asked whether you want to proceed).
The mode=7 command means that PAUP* should keep all trees that are
currently in memory and append all trees from the file. We now have four trees
in memory (the two found by alltrees and the two found by
bandb). Now compare the four trees and convince yourself that the same two
trees were found by bandb:
showtrees all
You can also compare the scores:
pscores all
As one final way of establishing that the trees are identical you can
compute the "distance" between the four trees in memory:
treedist
This indicates how similar the trees in memory are by computing the
so-called symmetric-difference metric. The output includes both a table
listing all pairwise differences and a bar-chart giving the distribution of
differences. Two trees with identical topology will have a distance of
zero.
Q9: Are the two new trees the same as two previously found trees?
Define a constraint for tree-searching:
constraints homooran (monophyly)=((human,orangutan));
This defines a constraint named "homooran" which requires the taxons "human" and
"orangutan" to form a monophyletic group. The constraint tree was here given as
simply: ((human,orangutan)); The tree is here shown using a notation where
a pair of parentheses corresponds to an internal node, while a comma-separated list
enclosed by a pair of parentheses indicates the subtrees that branch out from this
internal node. The constraint tree shown above is a brief way of defining the full
unresolved constraint tree: (gorilla,gibbon,bonobo,chimpanzee,yellow_baboon,olive_baboon,macaque,(human,orangutan));
Find the best tree that satisfies the constraint:
bandb /constraints=homooran enforce=yes
This performs a branch and bound search for the best tree that has human and
orangutan together as a monophyletic group. Note the option
enforce=yes which ensures that the named constraint is applied.
Q10: Compare the length of this tree with the best tree found above. How many extra
steps are required in this tree? (The difference tells you something about how
much better one hypothesis is than the other. There are tests that will tell
you whether the difference is significant, but we will not get into that at
this point in the course).
Phylogenetic Analysis of Mitochondrial DNA Sequences
Delete previously found trees from memory:
cleartrees
Prepare data for analysis:
We will now analyze the primate mitochondrial data set. Using what you learned above,
perform the following steps:
- Align the sequences present in the file primate_mtDNA_interleaved.fasta
- Convert alignment to NEXUS format
- Edit header so gaps are recognized correctly by PAUP
- Load data into PAUP
- Exclude gapped columns from data
- Set outgroup to consist of all the non-hominoid species:
Macaca_fuscata M_mulatta M_fascicularis M_sylvanus Saimiri_sciureus Tarsius_syrichta Lemur_catta
- Activate outgroup rooting and select output of outgroup as monophyletic sister group to the ingroup.
Estimate time needed for exhaustive search:
We now have 12 taxons in our data set, corresponding to 654,729,075 possible
trees. We first want to estimate how long it would take to search through every
single one of them, but we do not want to actually wait for the required amount
of time. Start the search as indicated below, and after it has run for one or
two minutes interrupt the run by pressing CTRL-C.
alltrees
Once a minute you will get an indication of how far the search has
progressed. Both the percentage of all trees covered so far and the actual
number of trees investigated, will be shown.
Q11: Use the numbers to estimate how long it would take to work through all
654,729,075 possible trees.
Find the best trees using branch and bound:
bandb
Q12: How long did it take to find the best trees by branch and bound?
Compare this to the estimated time you would have had to wait for the
alltrees command to finish (Q11). The time saved by ignoring suboptimal
parts of search space can be quite impressive, but it depends on the structure of
your data set.
Compare trees:
showtrees all / tcompress=yes
There are two equally parsimonious trees. The tcompress=yes option gives
a more condensed output of the treeplot. If you prefer the standard plots then
simply issue the command:
showtrees all / tcompress=no
As before you can also save the trees to a file and view them with the FigTree viewer.
As you may have noted, the difference between these two trees lies in whether
man is more closely related to gorilla or to chimpanzee (named "Pan" in this
data set).
Phylogenetic Analysis of viral DNA Sequences
Delete previously found trees from memory:
cleartrees
Prepare and load the data:
We will now investigate the data in the file hcv.fasta.
Using what you have learned above, prepare the data and load it into paup.
This dataset consists of 41 hepatitis C virus sequences obtained from four
different patients. The alignment should not contain any gaps (there have been no
insertions or deletions).
Enable PAUP* to store an unlimited number of trees:
set increase=auto
Perform a heuristic search using NNI:
hsearch start=stepwise addseq=random nreps=20 swap=NNI
We now have so many sequences that exhaustive searching using alltrees or
even bandb is impossible (there are approximately 10^60 possible trees). We
therefore have to employ heuristic searching. The above command starts a
heuristic search using sequential addition with random addition to construct
the initial tree. (In all 20 different random addition sequences are tried).
Once an initial tree has been constructed, the heuristic search proceeds by
rearrangements of the "nearest neighbor interchange" type (NNI).
Inspect search result:
This time the search results in a large number of equally parsimonious
trees. The result is summarized in the form of a "tree-island profile". Trees
from the same island are much more similar to each other than to trees from
other islands. Specifically, a tree-island is defined as a set of trees where
all members can reach all other members by one single re-arrangement (the
trees are all neighbors).
Q13: What is the best score found?
Q14: How many equally parsimonious trees were found in all?
Q15: How many tree islands are there with the best score?
Perform a heuristic search using SPR:
hsearch start=stepwise addseq=random nreps=20 swap=SPR
This starts a heuristic search using rearrangements of the "subtree pruning
and re-grafting" type (SPR). SPR is more elaborate than NNI and examines many
more neighbors for each tree.
Q16: Did this search find the same best score as NNI?
Perform a heuristic search using TBR:
hsearch start=stepwise addseq=random nreps=20 swap=TBR
This starts a heuristic search using rearrangements of the "tree bisection
and reconnection" type (TBR). TBR is more elaborate than both NNI and SPR. TBR
is the default swap mode for heuristic searching in PAUP*, and NNI or SPR
should mostly be used if you are interested in reducing search time.
Q17: How many tree islands are there with the best score? Why do you think
there are fewer islands with TBR and SPR compared to NNI?
View unrooted tree in FigTree viewer:
Using what you learned above, save the trees to a file and view them using the
FigTree viewer on your local machine. Make sure to select the unrooted view (we have not
rooted these trees). Note: if you want to only save tree number 1, then
you can do so by using the following option in combination with the savetrees command:
from=1 to=1.
Notice how sequences from a single patient all cluster together, and that even within patients
a significant amount of sequence diversity is present.
Exit the PAUP* program:
quit
PAUP* help:
- PAUP*: command reference document.:
- PAUP*: quick start tutorial.:
- From within PAUP* you can get an overview over available commands by typing "help cmds".
You can get help for a specific command by entering "command-name ?". For instance:
"savetrees ?" will give you a summary of options available for the savetrees
command.
|