Phylogenetic Analysis Using Maximum Likelihood

Exercise written by: Anders Gorm Pedersen

Overview, biological background

Today's exercise will focus on maximum likelihood-based methods, but we will also touch briefly on some issues related to aligning coding DNA. We will explore these questions using two data sets. We already discussed the maximum likelihood method in class, but if you want to read a brief introduction you can find one at the bottom of this page

Neanderthal DNA:

Data set number 1 consists of an alignment of full length mitochondrial DNA from human (53 sequences), chimpanzee (1 sequence), bonobo (1 sequence), and Neanderthal (1 sequence). The Neanderthal DNA was extracted from archaeological material, specifically 38,000 year old bones found at Vindija in Croatia (all sequence data was taken from this paper: Green et al., Cell, 2008).

The classical view emerging from anatomical and archaeological studies places Neanderthals as a different species from Homo sapiens. This is in agreement with the "Out-of-Africa hypothesis", which states that Neanderthals coexisted without mating with modern humans who originated in Africa somewhere between 100,000 to 200,000 years ago. There is, however, also anatomical and paleontological research which supports the so-called "multi-regional hypothesis", which propounds that some populations of archaic Homo evolved into modern human populations in many geographical regions. Consequently, Neanderthals could have contributed to the genetic pool of present-day Europeans. We will use the present data set to consider this issue.


The second data set consists of a set of unaligned DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hyper-variable V3 region in the gene encoding the gp120 surface glycoprotein.

Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).

The role gp120 plays in infection and the fact that it is situated on the surface of the HIV particle, means it is an obvious target for the immune response. That means that there may be a considerable selective pressure on gp120 for creating immune-escape mutants, where amino acids in the gp120 epitopes have been substituted. Today you will construct a maximum likelihood tree that we will use in a later exercise to investigate whether you can detect such a selective pressure on parts of gp120, again using maximum likelihood methods.

Getting started.

  1. Log in to your Unix shell account on "organism".

    Make sure to maximize the window: the analyses we will perform give lots of output to the screen, so having a nice and large shell window makes it easier to keep track of what happens.

  2. Construct todays working directory, copy files:

    mkdir likelihood
    cd likelihood
    cp ~gorm/likelihood/ ./
    cp ~gorm/likelihood/gp120.DNA.fasta ./gp120.DNA.fasta
    cp ~gorm/likelihood/codeml.ctl ./codeml.ctl

Neanderthal DNA: Combined Maximum Likelihood and Distance-Based Analysis

The data set that we will examine here, is so large that it is difficult to perform a full maximum likelihood analysis (at least in the time available in class), and we therefore need to use other techniques. Here, we will use a distance-based approach. To do so, we would first like to correct the observed distance data for multiple (hidden) substitutions.

We have reason to believe that the Kimura 2 parameter model is a fair description of how the sequences evolve (i.e., transitions and transversions have separate rates). We furthermore have evidence that different sites evolve at quite different rates, and we want to model this using a gamma distribution. (Although we will not discuss the issue further at this point, it is important to realize that there are techniques for stringent selection of the best model, and that one should never just randomly select one. We will return to such techniques later in the course when we discuss the likelihood ratio test). For now, however, you should just accept that K2P + gamma is an adequate model for the present data set.

Now, we could use the command dset distance=k2p rates=gamma to request that each pairwise distance should be estimated using the K2P correction with gamma distributed rates. The problem with this approach is that correction is then done independently for each pair of sequences, and we would therefore be ignoring the fact that all investigated sequences share a large part of their evolutionary history. Importantly, we would not take into account that the transition/transversion ratio and the gamma shape parameter may have been constant over the entire tree, but would instead obtain independent estimates for each pair of sequences. Below we will see how we can find corrected distances while assuming that R and the gamma shape parameter have been constant over the investigated tree.

Briefly, the procedure is as follows: (1) Build neighbor joining tree using JC-corrected distances. (2) Use this tree as the basis for estimating one common set of model parameters using maximum likelihood. (3) Use the estimated set of parameter values to make maximum likelihood-based distance estimates. (4) Use these ML estimated distances as the basis for constructing a new, improved neighbor joining tree.

  1. Start PAUP, construct neighbor joining tree:

    Using what you've learned previously do the following steps

    • Load the file into PAUP
    • Define outgroup to consist of these taxa: Pan_troglodytes Pan_paniscus
    • Activate outgroup rooting and set outgroup to be printed as monophyletic sister group to ingroup
    • Set the tree-reconstuction criterion to distance-based
    • Ensure the distances are corrected for multiple substitutions using the Jukes and Cantor model
    • Make a neighbor joining tree
  2. Make maximum likelihood estimation of model parameters for entire set of sequences (this command will take a few minutes to finish):

    lscores 1/nst=2 basefreq=equal tratio=estimate rates=gamma shape=estimate

    The lscores command computes the likelihood of a specified model based on the currently active data set. (Recall that the likelihood is the probability of the observed data given the model). This can be useful in its own right if we want to manually compare the fit of individual models to the data set (remember how we used the equivalent commands pscores and dscores to compare different trees under the parsimony and distance criteria). However, the command can also be used to estimate model parameters (except for tree topology), and this is how we will use it here.

    In order to use the lscores command, we must give a detailed description of the model whose parameters we want to estimate. Since this is the first time we do this, I will give a rather thorough description of each part of the command.

    First, "lscores 1" specifies that we want to use tree number 1 in memory (this is the NJ tree that we found above). Recall that tree topology is considered a model parameter in maximum likelihood phylogenetic analysis.

    Second, we specify that we want a model with two different types of substitution rates (nst=2) and where the frequency of each base is 25% (basefreq=equal). You will recognize this as the K2P model. Note that, by default, PAUP assumes that nst=2 means that we want to make a distinction between transitions and transversions. It is also possible to specify models with two types of substitutions that are NOT transitions and transversions respectively. One example would be: lscores 1/nst=6 rmatrix=estimate rclass=(a a a b b b). I will not explain this example in detail at this point.

    Third, we request that the transition/transversion ratio should be estimated from the data (tratio=estimate).

    Finally, we specify that we want to use a model where substitution rates at different sites follow a gamma distribution (rates=gamma), and that we want the shape of this distribution to also be estimated from the data (shape=estimate).

    To sum up, we are finding the maximum likelihood estimates of the transition/transversion ratio and the gamma shape parameter, based on the NJ tree and all the available sequences. A list showing how other important models can be specified is included at the end of this page. The estimates will be printed to screen upon completion of the lscores command.

    Q1: make a note of the estimated value of the parameters kappa (transition/transversion rate ratio) and Shape (the gamma distribution shape parameter).

  3. Plot estimated gamma distribution:

    gammaplot shape=previous

    This command plots the gamma distribution with the shape parameter you just estimated (instead of "previous" you could have specified the actual value - or any other value for that matter). In addition to the gamma plot, you will also get a table showing the discrete approximation that the program is using in order to reduce the computational burden. By default PAUP approximates the distribution using 4 rate categories, but this can be specified also (use "lset ?" to investigate possible options). Each category is represented by its mean value, and is defined in such a way that it contains exactly 25% of the distribution. This means that categories which are near the peak of the distribution are relatively narrow, while categories near the tail are much wider.

    Q2: Include a plot of the gamma distribution in your report.

  4. Use the parameter estimates to compute a new set of corrected distances:

    dset dist=ml objective=lsfit power=2
    lset nst=2 basefreq=equal tratio=previous rates=gamma shape=previous
    describetree 1/plot=phylogram brlens=yes label=no

    The dset command specifies that we want to use maximum likelihood to estimate the distances, and that the tree should be constructed using least squares with inverse square weighting. The lset command specifies just what model we should be using for this estimation, using the same syntax as the lscores command described above. Here we have specified the K2P model with gamma distributed rates. We have furthermore requested that the parameter values should be set to the values we just estimated above (tratio=previous shape=previous). You could also have written the values explicitly. Finally, the nj command finds the neighbor joining tree based on the distances thus computed. (Normally you would probably have wanted to use hsearch to find a least squares tree, but we do not want to wait 10-20 minutes right now).

    Q3: Are the Neanderthal sequences placed inside or outside the group of human sequences?

    We have now used the information in the entire data set as the basis for finding one common set of model parameters (here we estimated the "tratio" and "shape" parameters). These estimated parameter values were then used for correcting the observed distances. This approach helps reduce the noise considerably, and furthermore guarantees that we will not run into the problem of undefined corrected distance when D_obs >= 0.75.

  5. Assess support for the placement of Neanderthal:

    bootstrap search=nj nreps=500 grpfreq=no

    This command will use "bootstrapping" to investigate the support for different branches in the tree. It will take a few minutes to finish. We will return to bootstrapping and other methods of testing phylogenetic hypotheses later, but briefly, bootstrapping works by constructing a number of data sets (here N=500) which have been derived from the original alignment but with noise added. From each of these derived data sets a NJ tree is then built using the current settings (i.e., we are still using the K2P + gamma parameters estimated above). Finally, a consensus tree is built from the 500 resulting trees. The frequency with which any bipartition occurs gives some idea about how well supported it is by the data. This is NOT a number with any clear statistical meaning, but as a rule of thumb values above 70% indicate suppoprt for the branch.

    Q4: What is the support for all human sequences being grouped together (thus excluding the neanderthal)? Does your analysis support the out-of-Africa hypothesis?

Analysis of viral data set: alignment of coding DNA.

DNA sequences are a lot less informative than protein sequences and for this reason it is always preferable to align coding DNA in translated form. The simple fact that proteins are built from 20 amino acids while DNA only contains four different bases, means that the 'signal-to-noise ratio' in protein sequence alignments is much better than in alignments of DNA. Besides this information-theoretical advantage, protein alignments also benefit from the information that is implicit in empirical substitution matrices such as BLOSUM-62. Taken together with the generally higher rate of synonymous mutations over non-synonymous ones, this means that the phylogenetic signal disappears much more rapidly from DNA sequences than from the encoded proteins. It is therefore preferable to align coding DNA at the amino acid level.

However, in the context of molecular evolution, DNA alignments retain a lot of useful information regarding silent mutations. Especially the ratio between silent and non-silent substitutions is informative. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.

RevTrans takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment that is in accordance with the protein alignment. This also means that gaps are always inserted in groups of three so reading frames are kept in order.

The RevTrans server can be accessed using the following link:

Construct RevTrans alignment:

Using what you've learned previously, do the following steps.

  • Align the data in the file nedit gp120.DNA.fasta using the RevTrans server.
  • Convert resulting alignment to format understandable by PAUP
  • Save alignment file on CBS server

Analysis of viral data set: maximum likelihood tree.

In this part of the exercise we will examine a data set for which a full maximum likelihood analysis (where all model parameters are optimized simultaneously) is impractical: due to the size of the data set, the full analysis would take too long for class. We will instead use an iterative approach to construct a maximum likelihood tree. This approach is useful for data sets that are too big for full maximum likelihood reconstruction, but not so large that distance-based reconstruction is necessary.

  1. Start full maximum likelihood analysis in separate PAUP window:

    As mentioned, the full maximum likelihood analysis will take too long for us to use it in class. However, we will nevertheless start such an analysis so you (1) can see how such an analysis is performed, (2) can get an intuition for how much longer this will take compared to the iterative approach explained below.

    In the analysis performed here, we will assume that the sequences are evolving according to the HKY model with gamma distributed rates, and we will request that the transition/transversion rate ratio and the gamma shape parameter are estimated from data. To start the full maximum likelihood analysis, using heuristic searching with subtree pruning and regrafting, you should first load the RevTrans aligned data into PAUP, and then issue the following commands:

    set criterion=likelihood
    lset nst=2 tratio=estimate basefreq=empirical variant=HKY rates=gamma shape=estimate
    hsearch swap=spr

    Now place this window somewhere in the background and let it run while you finish the rest of the exercise.

  2. Start iterative maximum likelihood approach:

    The heuristic search started above attempts to find good estimates for all model parameters simultaneously. The investigated model parameters here include: tree topology, branch lengths, transition/transversion rate ratio, and gamma shape parameter. Exploration of such a large parameter space is difficult, and may take prohibitively long. In order to speed up the maximum likelihood analysis in such cases, it is possible to instead use an approach where you optimize first some of the parameters while the rest are kept fixed, and then optimize the rest, while the first are kept fixed. After repeating these two steps a number of times your solution will hopefully converge to a stable set of parameter estimates.

    The outline of the approach used here is as follows: first we will construct a neighbor joining tree using JC correction. We will then enter the iterative part of the process. In the first step we fix the tree that has just been estimated in the previous step, and use it as the basis for estimating parameters in our chosen model. In step two, we fix the parameters that have just been estimated in the previous step, and use these estimates while searching for the best tree by maximum likelihood. This two-step procedure is repeated for a number of steps or until some stop criterion is met. By limiting the number of parameters that have to be estimated simultaneously, we will save a considerable amount of computer time. To get started, do the following:

    • Open a new window and cd to today's working directory
    • Start PAUP and load the RevTrans-aligned HIV sequences
  3. Construct neighbor joining tree:

    Using what you've learned previously, do the following steps:

    • Set the tree-reconstuction criterion to distance-based
    • Ensure the distances are corrected for multiple substitutions using the Jukes and Cantor model
    • Make a neighbor joining tree
  4. Change criterion to likelihood:

    set criterion=likelihood
  5. Estimate parameters and tree using iterative approach:

    Step 1:

    lscores 1/nst=2 tratio=estimate basefreq=empirical variant=HKY rates=gamma shape=estimate

    This specifies the HKY model with gamma distributed rates, and requests that the transition/transversion ratio and the gamma shape parameter should be estimated, while keeping the tree topology fixed.

    Q5: We want to keep track of the progress: note the -ln L value.

    Step 2:

    lset nst=2 tratio=previous basefreq=empirical variant=HKY rates=gamma shape=previous
    hsearch start=1 swap=spr

    The lset command again specifies the HKY model with gamma distributed rates, and requests that the transition/transversion ratio and the gamma shape parameter should be set to the previously estimated values. The hsearch command starts a heuristic search for the maximum likelihood tree using those (fixed) parameter values but estimating topology and branch lengths. The starting tree is the first of the trees we found in the previous round, and we are using re-arrangements of the subtree pruning and regrafting type. This is a compromise: SPR is faster than TBR (tree bisection and reconnection), but less thorough. We could also have used the much faster nearest neighbor interchange (NNI), but this is much less thorough. (If you get very impatient then you might want to try using NNI instead of SPR). This step will take about 10-20 minutes. Progress is printed to screen every minute or so. Note that if the -ln L drops 5 points, then the probability of the data is e^5 = 148 times higher according to the new parameter estimates!

    Q6: Again note the negative log likelihood of the model when the search is finished.

  6. Repeat iteration:

    Q7: Now, repeat steps 1 and 2 once more. Each time note the -lnL. (If you have the time you may want to try even more iterations to see if the likelihood keeps improving).

  7. Save tree in phylip format:

    savetree from=1 to=1 file=hivtree.phylip format=phylip

    The options from=1 to=1 ensure that only one tree will be saved to file.

    Q8:Plot the tree using FigTree and include in report.

  8. Export alignment in phylip format:

    export file=gp120data.phylip format=phylip
  9. Check progress of full analysis:

    Q9:Has the other analysis finished? Has it reached the state where a log likelihood is reported? Does it seem like it will finish soon?

The tree you just constructed will be used in a later exercise where we will use maximum likelihood methods to search for positively selected sites in the HIV data sets.

Introduction to maximum likelihood

Very briefly, the idea with maximum likelihood is the following: You have some observed data and a probabilistic model of how that data was generated. (Having a probabilistic model means you are able to compute the probability of any possible outcome). Your goal is to infer the values of model parameters based on the observed data. Under maximum likelihood, the best estimate of the parameter values is taken to be those that give the highest probability of producing the observed outcome. The likelihood of a model is therefore the probability of the observed data given that model: L(Model) = P(Data|Model), where "Model" means a specific set of parameter values.

For instance: your data is the observation that 59 out of 100 coin tosses came out heads. Your probabilistic model could then simply be that there is a probability p of getting heads, a probability (1-p) of getting tails, and the probability of getting h heads out of a total of n tosses can be found using the binomial distribution. For a given value of p (say, p=0.52) you could then compute the probability of any possible outcome (h=0, h=1, h=2, ..., h=100). This is a probability distribution and therefore sums to one. The probability of the outcome you actually observed (h=59) given the current parameter value (p=0.52) is the likelihood of the current parameter value. The maximum likelihood estimate of p is then the value which gives the highest probability of producing h=59 (in the present case that would be the rather obvious estimate p=0.59). Note that for each value of p you are only interested in one single value from the entire probability distribution (namely the one corresponding to the observed data). When you choose the best parameter value by maximum likelihood, you are therefore comparing probabilities across different probability distributions.

In phylogenetic analysis using maximum likelihood, the observed data is most often taken to be the set of aligned sequences. Typical model parameters are the substitution rate matrix, the tree topology, and the branch lengths, but more complicated models can have additional parameters (the gamma distribution shape parameter for instance). One or more of these parameter values can be estimated from the data. While interest has classically focused on the inferred tree, estimates of other model parameters can also be of interest (e.g., the transition/transversion ratio and branch lengths).

Note that treating the alignment as if it was observed data is of course a gross simplification, which means we are ignoring any uncertainty about how the sequences should be aligned. We really ought to think of the individual sequences as the observed data, and inference of the alignment should then be part of the model fitting process. There is ongoing research on this topic (for instance "tree-HMMs") but we are not quite there yet.

Model specification in PAUP:

In order to demonstrate the general principle behind model specification I have listed the descriptions for a few of the most widely used models below. Make sure you understand why the lset commands below correspond to the various models:

Jukes and Cantor model (JC):

lset nst=1 basefreq=equal

Felsenstein 81 model (F81):

lset nst=1 basefreq=empirical

The command basefreq=empirical means that PAUP should use the average base frequencies in the data set. The command basefreq=estimate would instead have used maximum likelihood to find the best values. However, since the computational burden is increased for every additional free parameter the empirical approach is often useful.

K2P model:

lset nst=2 basefreq=equal tratio=estimate

HKY model:

lset nst=2 basefreq=empirical tratio=estimate variant=HKY

Felsenstein 84 model (F84):

lset nst=2 basefreq=empirical tratio=estimate variant=F84

General time reversible model:

lset nst=6 basefreq=empirical rmatrix=estimate

The identifier rmatrix refers to the 6-parameter substitution matrix.

Tamura-Nei model:

lset nst=6 basefreq=empirical rmatrix=estimate rclass=(a b a a c a)

If rmatrix=estimate, then the option rclass allows the user to define a sub-model of the 6-parameter substitution matrix. The syntax of the rclass command is as follows: rclass=(AC AG AT CG CT GT). This is alphabetical order. The above model specification therefore means that all transversions have the same rate (AC=AT=CG=GT=a), and that the two transition types have each their individual rates (AG=b, CT=c).