Phylogenetic Analysis Using Maximum Likelihood
Exercise written by: Anders Gorm Pedersen
Overview, biological background
Today's exercise will focus on maximum likelihood-based methods, but we will
also touch briefly on some issues related to aligning coding DNA. We will explore
these questions using two data sets. We already discussed the maximum likelihood
method in class, but if you want to read a brief introduction you can find one
at the bottom of this page
Data set number 1 consists of an alignment of full length mitochondrial DNA from human (53
sequences), chimpanzee (1 sequence), bonobo (1 sequence), and Neanderthal (1 sequence). The
Neanderthal DNA was extracted from archaeological material, specifically 38,000 year old bones
found at Vindija in Croatia (all sequence data was taken from this paper: Green et al., Cell, 2008).
The classical view emerging from anatomical and archaeological studies places
Neanderthals as a different species from Homo sapiens. This is in agreement with
the "Out-of-Africa hypothesis", which states that Neanderthals coexisted without
mating with modern humans who originated in Africa somewhere between 100,000 to
200,000 years ago. There is, however, also anatomical and paleontological research
which supports the so-called "multi-regional hypothesis", which propounds that some
populations of archaic Homo evolved into modern human populations in many
geographical regions. Consequently, Neanderthals could have contributed to the
genetic pool of present-day Europeans. We will use the present data set to
consider this issue.
The second data set consists of a set of unaligned DNA sequences from HIV-1,
subtype B. The sequences are approximately 500 bp long, and correspond to a region
surrounding the hyper-variable V3 region in the gene encoding the gp120 surface protein.
Like other retroviruses, particles of HIV are made up of 2 copies of a
single-stranded RNA genome packaged inside a protein core, or capsid. The core
particle also contains viral proteins that are essential for the early steps
of the virus life cycle, such as reverse transcription and integration. A lipid
envelope, derived from the infected cell, surrounds the core particle.
Embedded in this envelope are the surface glycoproteins of HIV: gp120 and
gp41. The gp120 protein is crucial for binding of the virus particle to target
cells, while gp41 is important for the subsequent fusion event. It is the
specific affinity of gp120 for the CD4 protein that targets HIV to those cells
of the immune system that express CD4 on their surface (e.g., T-helper
lymphocytes, monocytes, and macrophages).
The role gp120 plays in infection, and the fact that it sits on the surface of the HIV
particle, make it an obvious target for the immune response. There may therefore be
considerable selective pressure on gp120 to create immune-escape mutants, in which amino acids in the
gp120 epitopes have been substituted. Today you will construct a maximum likelihood tree that we will
use in a later exercise to investigate whether you can detect such a selective pressure on parts of
gp120, again using maximum likelihood methods.
Log in to your Unix shell account on "organism".
Make sure to maximize the window: the analyses we will perform give lots of
output to the screen, so having a nice and large shell window makes it easier
to keep track of what happens.
Construct today's working directory and copy the files:
cp ~gorm/likelihood/neanderthal.nexus ./neanderthal.nexus
cp ~gorm/likelihood/gp120.DNA.fasta ./gp120.DNA.fasta
cp ~gorm/likelihood/codeml.ctl ./codeml.ctl
Neanderthal DNA: Combined Maximum Likelihood and Distance-Based Analysis
The data set that we will examine here is so large that it is difficult to
perform a full maximum likelihood analysis (at least in the time available in class),
and we therefore need to use other
techniques. Here, we will use a distance-based approach. To do so, we would first
like to correct the observed distance data for multiple (hidden) substitutions.
We have reason to believe that the Kimura 2 parameter model is a fair
description of how the sequences evolve (i.e., transitions and transversions have
separate rates). We furthermore have evidence that different sites evolve at quite
different rates, and we want to model this using a gamma distribution. (Although we
will not discuss the issue further at this point, it is important to realize that there
are techniques for stringent selection of the best model, and that one should
never just randomly select one. We will return to such techniques later in the
course when we discuss the likelihood ratio test). For now, however, you should just
accept that K2P + gamma is an adequate model for the present data set.
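To see what the K2P correction actually computes, here is a small Python sketch. This is purely illustrative (it is not PAUP's code, and it ignores the gamma-distributed rate variation we add below): it counts transitions and transversions between two aligned, gap-free sequences and applies Kimura's formula.

```python
import math

PURINES = {"A", "G"}  # A<->G and C<->T changes are transitions

def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned, gap-free sequences.

    With P = proportion of transitions and Q = proportion of transversions:
        d = -1/2 * ln(1 - 2P - Q) - 1/4 * ln(1 - 2Q)
    """
    assert len(seq1) == len(seq2)
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    n = len(seq1)
    P, Q = transitions / n, transversions / n
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)

s1 = "AAAACCCCGGGGTTTT"
s2 = "AGAACCTCGGGGTTTA"   # 2 transitions, 1 transversion out of 16 sites
d = k2p_distance(s1, s2)  # corrected distance exceeds the observed 3/16
```

Note that the corrected distance is always larger than the observed proportion of differing sites, because it accounts for hidden (multiple) substitutions.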
Now, we could use the command dset distance=k2p rates=gamma to request
that each pairwise distance should be estimated using the K2P correction with
gamma distributed rates. The problem with this approach is that correction is then
done independently for each pair of sequences, and we would therefore be ignoring
the fact that all investigated sequences share a large part of their evolutionary
history. Importantly, we would not take into account that the
transition/transversion ratio and the gamma shape parameter may have been constant
over the entire tree, but would instead obtain independent estimates for each
pair of sequences. Below we will see how we can find corrected distances while
assuming that R and the gamma shape parameter have been constant over the entire tree.
Briefly, the procedure is as follows: (1) Build neighbor joining tree using
JC-corrected distances. (2) Use this tree as the basis for estimating one common
set of model parameters using maximum likelihood. (3) Use the estimated set of
parameter values to make maximum likelihood-based distance estimates. (4) Use
these ML estimated distances as the basis for constructing a new, improved neighbor joining tree.
Start PAUP, construct neighbor joining tree:
Using what you've learned previously do the following steps
- Load the file neanderthal.nexus into PAUP
- Define outgroup to consist of these taxa: Pan_troglodytes Pan_paniscus
- Activate outgroup rooting and set outgroup to be printed as monophyletic sister group to ingroup
- Set the tree-reconstruction criterion to distance-based
- Ensure the distances are corrected for multiple substitutions using the Jukes and Cantor model
- Make a neighbor joining tree
Make maximum likelihood estimation of model parameters for entire set of sequences
(this command will take a few minutes to finish):
lscores 1/nst=2 basefreq=equal tratio=estimate rates=gamma shape=estimate
The lscores command computes the likelihood of a specified model based
on the currently active data set. (Recall that the likelihood is the probability
of the observed data given the model). This can be useful in its own right if we
want to manually compare the fit of individual models to the data set (remember
how we used the equivalent commands pscores and dscores to
compare different trees under the parsimony and distance criteria). However, the
command can also be used to estimate model parameters (except for tree topology),
and this is how we will use it here.
In order to use the lscores command, we must give a detailed
description of the model whose parameters we want to estimate. Since this is the
first time we do this, I will give a rather thorough description of each part of the command.
First, "lscores 1" specifies that we want to use tree number 1
in memory (this is the NJ tree that we found above). Recall that tree topology is
considered a model parameter in maximum likelihood phylogenetic analysis.
Second, we specify that we want a model with two different types of substitution
rates (nst=2) and where the frequency of each base is 25%
(basefreq=equal). You will recognize this as the K2P model. Note that, by
default, PAUP assumes that nst=2 means that we want to make a distinction
between transitions and transversions. It is also possible to specify models with
two types of substitutions that are NOT transitions and transversions
respectively. One example would be: lscores 1/nst=6 rmatrix=estimate rclass=(a
a a b b b). I will not explain this example in detail at this point.
Third, we request that the transition/transversion ratio should be estimated
from the data (tratio=estimate).
Finally, we specify that we want to use a model where substitution rates at
different sites follow a gamma distribution (rates=gamma), and that we
want the shape of this distribution to also be estimated from the data (shape=estimate).
To sum up, we are finding the maximum likelihood estimates of the
transition/transversion ratio and the gamma shape parameter, based on the NJ tree
and all the available sequences. A list showing how other important models can be
specified is included at the end of this page. The estimates will be printed to
screen upon completion of the lscores command.
Q1: make a note of the estimated value of the parameters kappa (transition/transversion rate ratio)
and Shape (the gamma distribution shape parameter).
Plot estimated gamma distribution:
gammaplot shape=previous
This command plots the gamma distribution with the shape parameter you just
estimated (instead of "previous" you could have specified the actual value - or any
other value for that matter). In addition to the gamma plot, you will also get a
table showing the discrete approximation that the program is using in order to
reduce the computational burden. By default PAUP approximates the distribution
using 4 rate categories, but this can be specified also (use "lset ?" to
investigate possible options). Each category is represented by its mean value, and
is defined in such a way that it contains exactly 25% of the distribution. This
means that categories which are near the peak of the distribution are relatively
narrow, while categories near the tail are much wider.
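The equal-probability discretization described above can be mimicked in a few lines of Python. PAUP computes the category means analytically (via the incomplete gamma function); the sketch below instead simulates site rates from a mean-1 gamma distribution using only the standard library, sorts them, and averages within each quartile, which gives essentially the same four representative rates.

```python
import random

def discrete_gamma_categories(shape, ncat=4, nsamples=200_000, seed=1):
    """Approximate mean rate of each equal-probability gamma category.

    Site rates are drawn from Gamma(shape) scaled to mean 1 (scale = 1/shape).
    Sorting the draws and averaging within each of `ncat` equal-sized slices
    approximates the category means PAUP uses for its discrete approximation.
    """
    rng = random.Random(seed)
    draws = sorted(rng.gammavariate(shape, 1.0 / shape) for _ in range(nsamples))
    size = nsamples // ncat
    return [sum(draws[i * size:(i + 1) * size]) / size for i in range(ncat)]

cats = discrete_gamma_categories(shape=0.5)
# cats[0] is a very slow rate category, cats[3] a fast one;
# the average of the four category means is close to 1
```

For a small shape parameter the distribution is strongly skewed, so the slowest category sits far below 1 and the fastest far above it, exactly the pattern you should see in PAUP's table.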
Q2: Include a plot of the gamma distribution in your report.
Use the parameter estimates to compute a new set of corrected distances:
dset dist=ml objective=lsfit power=2
lset nst=2 basefreq=equal tratio=previous rates=gamma shape=previous
nj
describetree 1/plot=phylogram brlens=yes label=no
The dset command specifies that we want to use maximum likelihood to
estimate the distances, and that the tree should be constructed using least
squares with inverse square weighting. The lset command specifies just
what model we should be using for this estimation, using the same syntax as the
lscores command described above. Here we have specified the K2P model with
gamma distributed rates. We have furthermore requested that the parameter values
should be set to the values we just estimated above (tratio=previous
shape=previous). You could also have written the values explicitly. Finally,
the nj command finds the neighbor joining tree based on the distances thus
computed. (Normally you would probably have wanted to use hsearch to find
a least squares tree, but we do not want to wait 10-20 minutes right now).
Q3: Are the Neanderthal sequences placed inside or outside the group
of human sequences?
We have now used the information in the entire data set as the basis for finding
one common set of model parameters (here we estimated the "tratio" and "shape"
parameters). These estimated parameter values were then used for correcting the
observed distances. This approach helps reduce the noise considerably, and
furthermore guarantees that we will not run into the problem of undefined corrected
distance when D_obs >= 0.75.
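The "undefined corrected distance" problem is easiest to see in the simplest correction, Jukes-Cantor (the model in this exercise is K2P + gamma, but the same kind of logarithm appears there too). A small Python illustration:

```python
import math

def jc_distance(p_obs):
    """Jukes-Cantor corrected distance from an observed proportion of
    differing sites; undefined once p_obs reaches 0.75, because the
    argument of the logarithm hits zero."""
    return -0.75 * math.log(1 - 4 * p_obs / 3)

print(jc_distance(0.30))   # correction inflates the observed distance

try:
    jc_distance(0.75)      # log(0) -> math domain error
except ValueError:
    print("correction undefined at p_obs = 0.75")
```

Pairwise formulas like this break down for highly diverged pairs, whereas the ML-based distance estimation used above does not rely on inverting such a formula.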
Assess support for the placement of Neanderthal:
bootstrap search=nj nreps=500 grpfreq=no
This command will use "bootstrapping" to investigate the support for different
branches in the tree. It will take a few minutes to finish. We will return to
bootstrapping and other methods of testing phylogenetic hypotheses later, but
briefly, bootstrapping works by constructing a number of data sets (here N=500)
which have been derived from the original alignment but with noise added. From each
of these derived data sets a NJ tree is then built using the current settings
(i.e., we are still using the K2P + gamma parameters estimated above). Finally, a
consensus tree is built from the 500 resulting trees. The frequency with which any
bipartition occurs gives some idea about how well supported it is by the data.
This is NOT a number with any clear statistical meaning, but as a rule of thumb
values above 70% indicate support for the branch.
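The "noise" that bootstrapping adds comes from resampling alignment columns with replacement. A minimal Python sketch of how one bootstrap replicate data set is constructed (PAUP then builds an NJ tree from each such replicate):

```python
import random

def bootstrap_alignment(alignment, seed=None):
    """One bootstrap replicate: resample alignment columns with replacement.

    `alignment` is a list of equal-length sequences. The replicate has the
    same dimensions as the original, but some columns appear several times
    and others not at all.
    """
    rng = random.Random(seed)
    ncols = len(alignment[0])
    cols = [rng.randrange(ncols) for _ in range(ncols)]
    return ["".join(seq[c] for c in cols) for seq in alignment]

aln = ["ACGTACGT", "ACGAACGT", "TCGTACGA"]
rep = bootstrap_alignment(aln, seed=42)
# rep has the same number of sequences and the same length as aln
```

Because whole columns are resampled, each replicate column is always a column that existed in the original alignment.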
Q4: What is the support for all human sequences being grouped together (thus excluding
the Neanderthal)? Does your analysis support the out-of-Africa hypothesis?
Analysis of viral data set: alignment of coding DNA.
DNA sequences are a lot less informative than protein
sequences and for this reason it is always preferable to align coding DNA in translated
form. The simple fact that proteins are built from 20 amino acids while DNA only
contains four different bases, means that the 'signal-to-noise ratio' in protein
sequence alignments is much better than in alignments of DNA. Besides this
information-theoretical advantage, protein alignments also benefit from the information
that is implicit in empirical substitution matrices such as BLOSUM-62. Taken together
with the generally higher rate of synonymous mutations over non-synonymous ones, this
means that the phylogenetic signal disappears much more rapidly from DNA sequences than
from the encoded proteins. It is therefore preferable to align coding DNA at the amino acid level.
However, in the context of molecular evolution, DNA alignments retain a lot of
useful information regarding silent mutations. Especially the ratio between
silent and non-silent substitutions is informative. We would therefore like to
construct a multiple alignment at the DNA level, but using information at the protein
level, and the RevTrans server does exactly that.
RevTrans takes as input an unaligned set of DNA sequences, automatically
translates them to the equivalent amino acid sequences, constructs a multiple
alignment of the protein sequences, and finally uses the protein alignment as a
template for constructing a multiple DNA alignment that is in accordance with the
protein alignment. This also means that gaps are always inserted in groups of three,
so that reading frames are maintained.
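The back-translation step can be sketched in a few lines of Python. This is a toy version of what the RevTrans server does (no validation that the DNA actually translates to the protein is attempted here): each residue in the aligned protein consumes one codon from the unaligned DNA, and each gap becomes "---".

```python
def revtrans(dna, aligned_protein):
    """Thread an unaligned coding DNA sequence onto its aligned protein.

    Each amino acid consumes the next codon from the DNA; each '-' in the
    protein alignment becomes '---', so gaps always come in multiples of
    three and the reading frame is preserved.
    """
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    out, pos = [], 0
    for residue in aligned_protein:
        if residue == "-":
            out.append("---")
        else:
            out.append(codons[pos])
            pos += 1
    return "".join(out)

# aligned protein "M-A" with DNA "ATGGCC" gives the codon alignment "ATG---GCC"
```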
The RevTrans server can be accessed using the following link:
Construct RevTrans alignment:
Using what you've learned previously, do the following steps.
- Align the data in the file gp120.DNA.fasta using the RevTrans server.
- Convert resulting alignment to format understandable by PAUP
- Save alignment file on CBS server
Analysis of viral data set: maximum likelihood tree.
In this part of the exercise we will examine a data set for which a full
maximum likelihood analysis (where all model parameters are optimized
simultaneously) is impractical: due to the size of the data set, the full
analysis would take too long for class. We will instead
use an iterative approach to construct a
maximum likelihood tree. This approach is useful for data sets that are too big
for full maximum likelihood reconstruction,
but not so large that distance-based reconstruction is necessary.
Start full maximum likelihood analysis in separate PAUP window:
As mentioned, the full maximum likelihood analysis will take too long for us to
use it in class. However, we will nevertheless start such an analysis so you (1) can
see how such an analysis is performed, (2) can get an intuition for how much longer
this will take compared to the iterative approach explained below.
In the analysis performed here, we will assume that the sequences are evolving according to the
HKY model with gamma distributed rates, and we will request that the transition/transversion rate
ratio and the gamma shape parameter are estimated from data. To start the full maximum likelihood
analysis, using heuristic searching with subtree pruning and regrafting, you should
first load the RevTrans aligned data into PAUP, and then issue the following commands:
set criterion=likelihood
lset nst=2 tratio=estimate basefreq=empirical variant=HKY rates=gamma shape=estimate
hsearch swap=spr
Now place this window somewhere in the background and let it run while you finish the rest of the exercise.
Start iterative maximum likelihood approach:
The heuristic search started above attempts to find good estimates for all model parameters
simultaneously. The investigated model parameters here include: tree topology, branch lengths,
transition/transversion rate ratio, and gamma shape parameter. Exploration of such a large
parameter space is difficult, and may take prohibitively long. In order to speed up the
maximum likelihood analysis in such cases, it is possible to instead use an approach
where you optimize first some of the parameters while the rest are kept fixed,
and then optimize the rest, while the first are kept fixed. After repeating these two steps a
number of times your solution will hopefully converge to a stable set of parameter estimates.
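This alternate-and-fix strategy is plain coordinate ascent. The toy Python example below illustrates the principle on a simple two-parameter function (a made-up objective, not a phylogenetic likelihood): each step optimizes one parameter exactly while holding the other fixed, just as we alternate between fixing the tree and fixing the model parameters.

```python
def coordinate_ascent(steps=20):
    """Maximize f(x, y) = -(x-2)^2 - (y+1)^2 - 0.5(x-2)(y+1) by
    alternating exact one-dimensional optimizations: update x with y
    fixed, then y with x fixed, and repeat."""
    x = y = 0.0
    for _ in range(steps):
        x = 2.0 - 0.25 * (y + 1.0)   # best x given current y (df/dx = 0)
        y = -1.0 - 0.25 * (x - 2.0)  # best y given current x (df/dy = 0)
    return x, y

x, y = coordinate_ascent()
# converges quickly to the joint optimum (2, -1)
```

Just as with the tree/parameter iteration, each step can only increase the objective, so the procedure settles into a (hopefully global) optimum.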
The outline of the approach used here is as follows: first we will construct a
neighbor joining tree using JC correction. We will then enter the iterative part of the process.
In the first step we fix the tree that has just been estimated in the previous step, and use it as
the basis for estimating parameters in our chosen model. In step two, we fix the parameters that
have just been estimated in the previous step, and use these estimates while searching for the
best tree by maximum likelihood. This two-step procedure is repeated for a number of steps or
until some stop criterion is met. By limiting the number of parameters that have to be estimated
simultaneously, we will save a considerable amount of computer time. To get started, do the following:
- Open a new window and cd to today's working directory
- Start PAUP and load the RevTrans-aligned HIV sequences
Construct neighbor joining tree:
Using what you've learned previously, do the following steps:
- Set the tree-reconstruction criterion to distance-based
- Ensure the distances are corrected for multiple substitutions using the Jukes and Cantor model
- Make a neighbor joining tree
Change criterion to likelihood:
set criterion=likelihood
Estimate parameters and tree using iterative approach:
lscores 1/nst=2 tratio=estimate basefreq=empirical variant=HKY rates=gamma shape=estimate
This specifies the HKY model with gamma distributed rates, and requests that
the transition/transversion ratio and the gamma shape parameter should be
estimated, while keeping the tree topology fixed.
Q5: We want to keep track of the progress: note the -ln L value.
lset nst=2 tratio=previous basefreq=empirical variant=HKY rates=gamma shape=previous
hsearch start=1 swap=spr
The lset command again specifies the HKY model with gamma distributed
rates, and requests that the transition/transversion ratio and the gamma shape
parameter should be set to the previously estimated values. The hsearch
command starts a heuristic search for the maximum likelihood tree using those
(fixed) parameter values but estimating topology and branch lengths. The starting
tree is the first of the trees we found in the previous round, and we are using
re-arrangements of the subtree pruning and regrafting type. This is a compromise:
SPR is faster than TBR (tree bisection and reconnection), but less thorough. We
could also have used the much faster nearest neighbor interchange (NNI), but this
is much less thorough. (If you get very impatient then you might want to try using
NNI instead of SPR). This step will take about 10-20 minutes. Progress is printed to
screen every minute or so. Note that if the -ln L drops 5 points, then the
probability of the data is e^5 (about 148) times higher according to the new parameter estimates.
Q6: Again note the negative log likelihood of the model when the
search is finished.
Q7: Now, repeat steps 1 and 2 once more. Each time note the -lnL. (If you have the
time you may want to try even more iterations to see if the likelihood keeps improving.)
Save tree in phylip format:
savetree from=1 to=1 file=hivtree.phylip format=phylip
The options from=1 to=1 ensure that only one tree will be saved to the file.
Q8: Plot the tree using FigTree and include it in your report.
Export alignment in phylip format:
export file=gp120data.phylip format=phylip
Check progress of full analysis:
Q9: Has the other analysis finished? Has it reached the state where a log likelihood is
reported? Does it seem like it will finish soon?
The tree you just constructed will be used in a later exercise where
we will use maximum likelihood methods to search for positively selected
sites in the HIV data sets.
Very briefly, the idea behind maximum likelihood is the following: You have some
observed data and a probabilistic model of how that data was generated. (Having a
probabilistic model means you are able to compute the probability of any
possible outcome). Your goal is to infer the values of model parameters based on
the observed data. Under maximum likelihood, the best estimate of the parameter
values is taken to be those that give the highest probability of producing the
observed outcome. The likelihood of a model is therefore the probability of the
observed data given that model: L(Model) = P(Data|Model), where "Model" means a
specific set of parameter values.
For instance: your data is the observation that 59 out of 100 coin tosses came
out heads. Your probabilistic model could then simply be that there is a
probability p of getting heads, a probability (1-p) of getting tails, and the
probability of getting h heads out of a total of n tosses can be found using the
binomial distribution. For a given value of p (say, p=0.52) you could then compute
the probability of any possible outcome (h=0, h=1, h=2, ..., h=100). This is a
probability distribution and therefore sums to one. The probability of the outcome
you actually observed (h=59) given the current parameter value (p=0.52) is the
likelihood of the current parameter value. The maximum likelihood estimate
of p is then the value which gives the highest probability of producing h=59 (in
the present case that would be the rather obvious estimate p=0.59). Note that for
each value of p you are only interested in one single value from the entire
probability distribution (namely the one corresponding to the observed data). When
you choose the best parameter value by maximum likelihood, you are therefore
comparing probabilities across different probability distributions.
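The coin-toss example can be checked directly in Python: compute the binomial likelihood over a grid of parameter values and keep the one that maximizes the probability of the observed outcome.

```python
from math import comb

def binomial_likelihood(p, h=59, n=100):
    """P(h heads out of n tosses | p): the likelihood of the parameter p."""
    return comb(n, h) * p**h * (1 - p) ** (n - h)

# Scan a grid of parameter values and keep the one with highest likelihood
grid = [i / 100 for i in range(1, 100)]
p_hat = max(grid, key=binomial_likelihood)
# p_hat is 0.59, matching the intuitive estimate h/n
# and L(0.52) is lower than L(0.59), as the text's example suggests
```

Note that each likelihood value comes from a different probability distribution (one per value of p), which is exactly the "comparing probabilities across distributions" point made above.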
In phylogenetic analysis using maximum likelihood, the observed data is most
often taken to be the set of aligned sequences. Typical model parameters are the
substitution rate matrix, the tree topology, and the branch lengths, but more
complicated models can have additional parameters (the gamma distribution shape
parameter for instance). One or more of these parameter values can be estimated
from the data. While interest has classically focused on the inferred tree,
estimates of other model parameters can also be of interest (e.g., the
transition/transversion ratio and branch lengths).
Note that treating the alignment as if it was observed data is of course a gross
simplification, which means we are ignoring any uncertainty about how the
sequences should be aligned. We really ought to think of the individual sequences
as the observed data, and inference of the alignment should then be part of the
model fitting process. There is ongoing research on this topic (for instance
"tree-HMMs") but we are not quite there yet.
In order to demonstrate the general principle behind model specification I have listed the
descriptions for a few of the most widely used models below. Make sure you understand why the
lset commands below correspond to the various models:
Jukes and Cantor model (JC):
lset nst=1 basefreq=equal
Felsenstein 81 model (F81):
lset nst=1 basefreq=empirical
The command basefreq=empirical means that PAUP should use the average
base frequencies in the data set. The command basefreq=estimate would
instead have used maximum likelihood to find the best values. However, since the
computational burden is increased for every additional free parameter the
empirical approach is often useful.
Kimura 2 parameter model (K2P):
lset nst=2 basefreq=equal tratio=estimate
Hasegawa-Kishino-Yano model (HKY):
lset nst=2 basefreq=empirical tratio=estimate variant=HKY
Felsenstein 84 model (F84):
lset nst=2 basefreq=empirical tratio=estimate variant=F84
General time reversible model:
lset nst=6 basefreq=empirical rmatrix=estimate
The identifier rmatrix refers to the 6-parameter substitution rate matrix.
Sub-model of the general time reversible model:
lset nst=6 basefreq=empirical rmatrix=estimate rclass=(a b a a c a)
If rmatrix=estimate, then the option rclass allows the user to
define a sub-model of the 6-parameter substitution matrix. The syntax of the
rclass command is as follows: rclass=(AC AG AT CG CT GT), i.e., the pairs are
listed in alphabetical order. The above model specification therefore means that all
transversions have the same rate (AC=AT=CG=GT=a), and that the two transition types
each have their own rate (AG=b, CT=c).
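Because the pairs are always listed in alphabetical order, expanding an rclass specification into per-pair rate classes is a simple mapping. A small Python illustration:

```python
# Substitution pairs in the fixed alphabetical order used by rclass
PAIRS = ["AC", "AG", "AT", "CG", "CT", "GT"]

def expand_rclass(rclass):
    """Map each substitution pair to its rate class.

    For rclass=(a b a a c a): all four transversions share class 'a',
    while A<->G gets 'b' and C<->T gets 'c'.
    """
    return dict(zip(PAIRS, rclass))

model = expand_rclass(["a", "b", "a", "a", "c", "a"])
# model["AG"] is "b", model["CT"] is "c", and every transversion maps to "a"
```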