
Phylogenetic Analysis Using Maximum Likelihood

Exercise written by: Anders Gorm Pedersen


Overview

Today's exercise will focus on maximum likelihood-based methods, but we will also touch briefly on some issues related to aligning coding DNA. We will explore these questions using two unaligned data sets (MHC and HIV sequences). We already discussed the maximum likelihood method in class, but if you want to read a brief introduction you can find one at the bottom of this page.


Getting started, description of data.

  1. Log in to your account:

    Start by logging in to your group's account.

  2. Construct today's working directory:

    mkdir likelihood
    ls -l

    The mkdir command constructs a new directory. You should be able to see it using the ls command.

  3. Change to today's working directory

    cd likelihood
  4. Copy files for today's exercise:

    cp ~gorm/likelihood/gp120.fasta ./gp120.fasta
    cp ~gorm/likelihood/mhc.fasta ./mhc.fasta
  5. Inspect sequence files:

    nedit mhc.fasta

    This file is in the so-called fasta-format, and contains unaligned DNA sequences. Specifically, the sequences are major histocompatibility complex (MHC) class I genes from six different primate species. MHC class I genes encode proteins that are involved in the immune response. Now, close the nedit window and have a look at the next data file:

    nedit gp120.fasta

    The second data set consists of a set of unaligned DNA sequences from HIV-1, subtype B, again in fasta-format. The sequences are approximately 500 bp long, and correspond to a region surrounding the hyper-variable V3 region in the gene encoding the gp120 surface glycoprotein.


Multiple alignment of MHC class I sequences

In this part of the exercise, we will use the program ClustalX to make a multiple alignment of the MHC class I sequences.

  1. Start ClustalX

    clustalx &

    This will start the multiple alignment program ClustalX, which is a graphic version of the popular ClustalW alignment program. The ampersand ("&") at the end of the command makes the program run "in the background".

  2. Load the sequence file

    File ==> Load sequences ==> select the file mhc.fasta
  3. Start the multiple alignment:

    Alignment ==> Do complete alignment ==> ALIGN
  4. Inspect the alignment:

    When the alignment is done you can inspect the result by scrolling along the sequences in the ClustalX window. In this case the starting point was a set of very similar sequences so not much has changed, but you will notice that gaps have been inserted here and there.

    Note: The starting point of all phylogenetic analyses is a multiple alignment, and your results will be no better than the quality of that initial alignment. Although you do not need to do so in this exercise, it is always a good idea to manually check your alignment for errors. To this end ClustalX gives you the opportunity to interactively realign selected residue ranges or selected sequences.

  5. Save the alignment in NEXUS format:

    File ==> Save sequences as... ==> NEXUS ==> OK
    (Make sure the output file is named "mhc.nxs".)

    The PAUP* program cannot read files in the native Clustal alignment format. However, it does understand the NEXUS format, so we have to convert the alignment file before proceeding to the phylogenetic analysis.

  6. Quit ClustalX:

    File ==> Quit


Phylogenetic Analysis of MHC Class I sequences

In this part of the exercise you will find the maximum likelihood tree for the MHC data, and briefly compare that tree to an alternative phylogeny.

  1. Start the PAUP* program:

    paup

    This will give you a prompt ("paup>") where you can write commands to the program.

  2. Load the nexus file:

    execute mhc.nxs

    This command loads the sequences from the nexus file into memory. If the nexus file had contained any commands, then these would also have been executed.

  3. Exclude all alignment columns containing any gaps:

    exclude gapped

    The exclude command can be used to ignore specified columns of the alignment. For instance, "exclude 1-29 560-577" would make PAUP* ignore columns 1 through 29 as well as columns 560 through 577. The special word "gapped" automatically selects all columns containing one or more gaps.
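    What "exclude gapped" does can be sketched in a few lines of Python (a toy illustration with made-up sequences, not PAUP* code): transpose the alignment into columns, drop every column containing a gap, and rebuild the sequences.

```python
# Toy illustration (not PAUP* code) of what "exclude gapped" does:
# drop every alignment column that contains at least one gap.
def exclude_gapped(seqs):
    """Return the sequences with all gap-containing columns removed."""
    columns = zip(*seqs)                      # transpose into columns
    kept = [col for col in columns if "-" not in col]
    return ["".join(row) for row in zip(*kept)]

# A small made-up alignment: the two gap-containing columns are removed.
aligned = ["ATG-CA", "ATGGCA", "AT--CA"]
print(exclude_gapped(aligned))  # ['ATCA', 'ATCA', 'ATCA']
```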

  4. Define the outgroup:

    outgroup olive_baboon macaque

    This defines an outgroup consisting of the two old world monkeys in the data set. The names are those that appear in the mhc.nxs file.

    The outgroup will be used to place the root of the tree. The rationale is as follows: our data set consists of sequences from man, from a number of great apes, and from two old world monkeys. We know from other evidence that the lineage leading to monkeys branched off before any of the remaining organisms diverged from each other. The root of the tree connecting the organisms investigated here must therefore be located between the monkeys (the "outgroup") and the rest (the "ingroup"). This way of finding a root is called "outgroup rooting".

  5. Activate outgroup rooting and select how tree will be printed:

    set root=outgroup outroot=monophyl

    This makes PAUP* use the outgroup we defined above for the purpose of rooting the tree. The "outroot=monophyl" command makes PAUP construct a tree where the outgroup is a monophyletic sister group to the ingroup. (Outroot could also have been set to "polytomy" or "paraphyl").

  6. Set the criterion to likelihood:

    set criterion=likelihood
  7. Select the Kimura 2-parameter model with gamma-distributed rates:

    We have reason to believe that the Kimura 2-parameter model is a fair description of how these sequences evolve (i.e., transitions and transversions have separate rates). We furthermore have evidence that different sites evolve at quite different rates, and we want to model this using a gamma distribution. (Although we will not discuss the issue further at this point, it is important to realize that there are techniques for stringent selection of the best model, and that one should never just pick a model at random. We will return to such techniques later when we discuss the likelihood ratio test). For now, however, you should just accept that K2P + gamma is an adequate model for the present data set.

    lset nst=2 basefreq=equal tratio=estimate rates=gamma shape=estimate

    The "lset" command is used to set the model under the likelihood criterion. Specifically, we have here indicated that we want a model with two different types of substitution rates (nst=2) where the frequency of each base is 25% (basefreq=equal). This is exactly the K2P model. Note that, by default, PAUP assumes that nst=2 means that we want to make a distinction between transitions and transversions. It is also possible to specify models with two types of substitutions that are NOT transitions and transversions respectively. One example would be: lset nst=6 rmatrix=estimate rclass=(a a a b b b). I will not explain this example in detail at this point.

    Secondly, we request that the transition/transversion ratio should be estimated from the data (tratio=estimate).

    Finally, we specify that we want to use a model where substitution rates at different sites follow a gamma distribution (rates=gamma), and that we want the shape of this distribution to also be estimated from the data (shape=estimate).

  8. Find the maximum likelihood tree by examining all possible trees:

    alltrees

    This makes PAUP* find the best tree by exhaustively searching through all possible trees (there are 105 unrooted trees with 6 taxa). For each possible tree PAUP* computes the likelihood (i.e., the probability of the data given the model parameters), and when the search finishes, the best tree is the one with the highest likelihood. At the end of the run, PAUP* outputs a histogram giving the distribution of all likelihoods. Note that PAUP* outputs the negative logarithm of the likelihood (-lnL), meaning that smaller values are better. (Confusing - I know...)
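    The figure of 105 trees comes from the standard formula for the number of unrooted binary trees with n taxa, the double factorial (2n-5)!! = 3 * 5 * ... * (2n-5). A quick Python check also shows why exhaustive search becomes infeasible for larger data sets:

```python
# The number of unrooted binary trees for n taxa is the double factorial
# (2n-5)!! = 3 * 5 * ... * (2n-5), which gives the 105 trees cited above.
def n_unrooted_trees(n):
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count

print(n_unrooted_trees(6))   # 105
print(n_unrooted_trees(10))  # 2027025 -- exhaustive search soon becomes infeasible
```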

  9. Examine the best tree

    describetrees 1/plot=phylogram

    This shows the maximum likelihood tree. What is the closest relative of man according to this tree? Put the answer on today's results form, and also note the negative lnL for this tree.

  10. Define a constraint for tree-searching:

    constraints homogib (monophyly)=((human,gibbon));

    This defines a constraint named "homogib" which requires the taxa "human" and "gibbon" to form a monophyletic group.

  11. Find the best tree that satisfies the constraint:

    alltrees /constraints=homogib enforce=yes

    This performs an exhaustive search for the best tree that has human and gibbon together as a monophyletic group. Note the option enforce=yes which ensures that the named constraint is applied. Also note that there are only 15 possible trees which conform to this constraint.

    describetrees 1/plot=cladogram

    Note the -lnL of this tree on the results form, and compare the value with that of the best tree found above. The difference tells you something about how much better one hypothesis is than the other. (Recall that the likelihood is the probability of the data given the model, so if the -lnL drops by 5 points, this means the probability of the data is e^5 ≈ 148 times higher). There are tests that will tell you whether the difference is significant, but we will not get into that at this point in the course.
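    The conversion from a difference in -lnL to a likelihood ratio is a one-liner. The two -lnL values below are made-up numbers, just to illustrate the arithmetic:

```python
import math

# A drop of d in -lnL means the data is e**d times more probable under the
# better model. The two -lnL values here are made-up numbers for illustration.
def likelihood_ratio(neg_lnL_worse, neg_lnL_better):
    return math.exp(neg_lnL_worse - neg_lnL_better)

print(round(likelihood_ratio(2510.0, 2505.0)))  # 148 (= e**5, rounded)
```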


Analysis of viral data set: alignment of coding DNA.

In this part of the exercise you will learn how to align coding DNA while taking into account the reading frame and the encoded protein sequence.

  1. Quit PAUP* program:

    quit
  2. Contact RevTrans server, enter sequence:

    The RevTrans server can be accessed using the following link:

    http://www.cbs.dtu.dk/services/RevTrans/

    Use the browse button in the field "Upload file containing DNA sequence" to locate and select the file "gp120.fasta" in your "likelihood" directory.

  3. Align DNA sequences while taking protein level into account:

    On the RevTrans page, start alignment by clicking "Submit query".

    Here is what is going on: DNA sequences are a lot less informative than protein sequences and for this reason it is always preferable to align coding DNA in translated form. The simple fact that proteins are built from 20 amino acids while DNA only contains four different bases, means that the 'signal-to-noise ratio' in protein sequence alignments is much better than in alignments of DNA. Besides this information-theoretical advantage, protein alignments also benefit from the information that is implicit in empirical substitution matrices such as BLOSUM-62. Taken together with the generally higher rate of synonymous mutations over non-synonymous ones, this means that the phylogenetic signal disappears much more rapidly from DNA sequences than from the encoded proteins. It is therefore preferable to align coding DNA at the amino acid level.

    However, in the context of molecular evolution, DNA alignments retain a lot of useful information regarding silent mutations. Especially the ratio between silent and non-silent substitutions is informative. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.

    RevTrans takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment that is in accordance with the protein alignment. This means that gaps are always inserted in groups of three so reading frames are kept in order.
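    The core idea can be sketched in Python (a simplified toy, not the server's actual implementation): walk along the aligned protein, emitting three DNA gap characters for each residue gap and the next codon otherwise.

```python
# Simplified toy version of the RevTrans idea (not the server's actual code):
# thread the unaligned coding DNA onto the aligned protein, so that each
# protein gap becomes a codon-sized (three-base) gap in the DNA alignment.
def revtrans(dna, aligned_protein):
    out, pos = [], 0
    for aa in aligned_protein:
        if aa == "-":
            out.append("---")                 # one residue gap = three DNA gaps
        else:
            out.append(dna[pos:pos + 3])      # copy the codon for this residue
            pos += 3
    return "".join(out)

# "MV" aligned as "M-V": the DNA alignment gains one codon-sized gap.
print(revtrans("ATGGTT", "M-V"))  # ATG---GTT
```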

  4. Save alignment in file:

    Right-click "Save results" ==> Click "Save link as..." ==> Save file with name "gp120.aln" in the "likelihood" directory.
  5. Convert the clustal file to NEXUS format:

    clustalx gp120.aln

    This opens the alignment in clustalx. Notice how gaps have been inserted in groups of three (codon sized). Now, use the following:

    File ==> Save sequences as... ==> NEXUS ==> OK

    to save the alignment in NEXUS format. Make sure the file is named gp120.nxs. This file will be used by PAUP in the next part of this exercise. Close the clustalx window and move on to the final part of today's exercise.


Analysis of viral data set: maximum likelihood tree.

In this part of the exercise we will use an iterative approach to construct a maximum likelihood tree. This approach is useful for data sets that are too big for full maximum likelihood reconstruction (where all parameters are optimized simultaneously). (Note that the present data could probably be analyzed using the full maximum likelihood approach in a matter of hours or days, but we do not want to wait quite that long. Furthermore, it is useful to be familiar with the approach outlined here).

First we will construct a neighbor joining tree using JC correction. We will then enter the iterative part of the process. In the first step we fix the tree and use it as the basis for estimating the other parameters in our chosen model. In step two, we fix these parameters and use these estimates while searching for the best tree by maximum likelihood. This two-step procedure is repeated for a number of steps or until some stop criterion is met. By limiting the number of parameters that have to be estimated simultaneously, we will save a considerable amount of computer time.
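The logic of this alternating scheme is the same as coordinate-wise optimization. A toy Python sketch (nothing to do with trees, just the principle), minimizing f(x, y) = (x - y)^2 + (x - 1)^2 by alternately holding one variable fixed:

```python
# Toy illustration of the alternating two-step strategy: hold one set of
# parameters fixed while optimizing the other, then swap, and repeat.
# Here we minimize f(x, y) = (x - y)**2 + (x - 1)**2 by coordinate descent.
def best_x(y):
    return (y + 1) / 2    # minimizer of f over x with y held fixed

def best_y(x):
    return x              # minimizer of f over y with x held fixed

x = y = 0.0
for _ in range(20):       # "repeat until some stop criterion is met"
    x = best_x(y)         # like estimating model parameters on a fixed tree
    y = best_y(x)         # like searching for a better tree with fixed parameters

print(round(x, 4), round(y, 4))  # both approach the joint optimum at 1.0
```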

  1. Start PAUP:

    paup
  2. Load data file:

    execute gp120.nxs
  3. Select distance criterion:

    set criterion=distance
  4. Specify that we want to use JC model correction:

    dset dist=jc
  5. Construct neighbor joining tree:

    nj
  6. Change criterion to likelihood:

    set criterion=likelihood
  7. Estimate parameters and tree using iterative approach:

    Step 1:

    lscores 1/nst=2 tratio=estimate basefreq=empirical variant=HKY rates=gamma shape=estimate

    This specifies the HKY model with gamma distributed rates, and requests that the transition/transversion ratio and the gamma shape parameter should be estimated, while keeping the tree topology fixed.

    We want to keep track of the progress: note the -ln L value on the results form.

    Step 2:

    lset nst=2 tratio=previous basefreq=empirical variant=HKY rates=gamma shape=previous
    hsearch start=1 swap=spr

    The lset command again specifies the HKY model with gamma distributed rates, and requests that the transition/transversion ratio and the gamma shape parameter should be set to the values you just estimated during step 1 ("tratio=previous", "shape=previous"). The hsearch command starts a heuristic search for the maximum likelihood tree using those (fixed) parameter values but estimating topology and branch lengths. The starting tree is the first of the trees we found in the previous round, and we are using re-arrangements of the subtree pruning and regrafting (SPR) type. This is a compromise: SPR is faster than TBR (tree bisection and reconnection), but less thorough. We could also have used the much faster nearest neighbor interchange (NNI), but that is even less thorough. (If you get very impatient you might want to try using NNI instead of SPR). This step will take a couple of minutes. Progress is printed to screen every minute or so.

    Again note the negative log likelihood of the model when the search is finished.

  8. Repeat iteration:

    Now, repeat steps 1 and 2 once more. Each time note the -lnL. (If you have the time you may want to try even more iterations to see if the likelihood keeps improving).

  9. Save tree in phylip format:

    savetree from=1 to=1 file=hivtree.phylip format=phylip

    The options from=1 to=1 ensure that only one tree will be saved to file.

  10. Export alignment in phylip format:

    export file=gp120data.phylip format=phylip
  11. Quit PAUP:

    quit

The tree you just constructed will be used in a later exercise where we will use maximum likelihood methods to search for positively selected sites in the HIV data sets.


Introduction to maximum likelihood

Very briefly, the idea behind maximum likelihood is the following: You have some observed data and a probabilistic model of how those data were generated. (Having a probabilistic model simply means you are able to compute the probability of any possible outcome). Your goal is to infer the values of model parameters based on the observed data. Under maximum likelihood, the best estimates of the parameter values are taken to be those that give the highest probability of producing the observed outcome. The likelihood of a model is therefore the probability of the observed data given that model: L(Model) = P(Data|Model), where "Model" means a specific set of parameter values.

For instance: your data is the observation that 59 out of 100 coin tosses came out heads. Your probabilistic model could then simply be that there is a probability p of getting heads, a probability (1-p) of getting tails, and the probability of getting h heads out of a total of n tosses can be found using the binomial distribution. For a given value of p (say, p=0.52) you could then compute the probability of any possible outcome (h=0, h=1, h=2, ..., h=100). This is a probability distribution and therefore sums to one. The probability of the outcome you actually observed (h=59) given the current parameter value (p=0.52) is the likelihood of the current parameter value. The maximum likelihood estimate of p is then the value which gives the highest probability of producing h=59 (in the present case that would be the rather obvious estimate p=0.59). Note that for each value of p you are only interested in one single value from the entire probability distribution (namely the one corresponding to the observed data). When you choose the best parameter value by maximum likelihood, you are therefore comparing probabilities across different probability distributions.
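The coin example is easy to verify in code: compute the binomial likelihood of the observed outcome over a grid of p values and pick the p with the maximum.

```python
import math

# The coin example in code: compute the binomial likelihood of h=59 heads
# in n=100 tosses over a grid of p values, and pick the p with the maximum.
def binom_likelihood(p, h=59, n=100):
    return math.comb(n, h) * p**h * (1 - p)**(n - h)

grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=binom_likelihood)
print(p_hat)  # 0.59 -- the maximum likelihood estimate h/n
```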

In phylogenetic analysis using maximum likelihood, the observed data is most often taken to be the set of aligned sequences. Typical model parameters are the substitution rate matrix, the tree topology, and the branch lengths, but more complicated models can have additional parameters (the gamma distribution shape parameter for instance). One or more of these parameter values can be estimated from the data. While interest has classically focused on the inferred tree, estimates of other model parameters can also be of interest (e.g., the transition/transversion ratio and branch lengths).

Note that treating the alignment as if it were observed data is of course a gross simplification: it means we are ignoring any uncertainty about how the sequences should be aligned. We really ought to think of the individual sequences as the observed data, and inference of the alignment should then be part of the model-fitting process. There is ongoing research on this topic (for instance "tree-HMMs"), but we are not quite there yet.


Model specification in PAUP:

Above, you saw how to specify the K2P model (lset nst=2 basefreq=equal tratio=estimate). In order to demonstrate the general principle behind model specification I have listed the descriptions for a few of the most widely used models below. Make sure you understand why the lset commands below correspond to the various models:

Jukes and Cantor model (JC):

lset nst=1 basefreq=equal

Felsenstein 81 model (F81):

lset nst=1 basefreq=empirical

The command basefreq=empirical means that PAUP should use the average base frequencies in the data set. The command basefreq=estimate would instead have used maximum likelihood to find the best values. However, since the computational burden is increased for every additional free parameter the empirical approach is often useful.

HKY model:

lset nst=2 basefreq=empirical tratio=estimate variant=HKY

Felsenstein 84 model (F84):

lset nst=2 basefreq=empirical tratio=estimate variant=F84

General time reversible model:

lset nst=6 basefreq=empirical rmatrix=estimate

The identifier rmatrix refers to the 6-parameter substitution matrix.

Tamura-Nei model:

lset nst=6 basefreq=empirical rmatrix=estimate rclass=(a b a a c a)

If rmatrix=estimate, then the option rclass allows the user to define a sub-model of the 6-parameter substitution matrix. The syntax of the rclass option is as follows: rclass=(AC AG AT CG CT GT), i.e., the substitution types are listed in alphabetical order. The above model specification therefore means that all transversions have the same rate (AC=AT=CG=GT=a), while the two transition types each have their own rate (AG=b, CT=c).
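The grouping implied by rclass=(a b a a c a) can be made explicit with a few lines of Python:

```python
# Making the rclass grouping explicit: pair each substitution type
# (in alphabetical order) with its rate-class letter and collect the groups.
pairs = ["AC", "AG", "AT", "CG", "CT", "GT"]
rclass = ["a", "b", "a", "a", "c", "a"]       # Tamura-Nei

groups = {}
for pair, cls in zip(pairs, rclass):
    groups.setdefault(cls, []).append(pair)

print(groups)  # {'a': ['AC', 'AT', 'CG', 'GT'], 'b': ['AG'], 'c': ['CT']}
```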