Phylogenetic Analysis Using Maximum Likelihood
Exercise written by: Anders Gorm Pedersen
Overview
Today's exercise will focus on maximum likelihood-based methods, but we will also touch
briefly on some issues related to aligning coding DNA. We will explore these questions
using two unaligned data sets (MHC and HIV sequences). We already discussed the maximum
likelihood method in class, but if you want to read a brief introduction you can find one
at the bottom of this page.
Log in to your account:
Start by logging in to your group's account.
Construct today's working directory:
mkdir likelihood
ls -l
The mkdir command creates a new directory. You should be able to
see it in the output of the ls command.
Change to today's working directory:
cd likelihood
Copy files for today's exercise:
cp ~gorm/likelihood/gp120.fasta ./gp120.fasta
cp ~gorm/likelihood/mhc.fasta ./mhc.fasta
Inspect sequence files:
nedit mhc.fasta
This file is in the so-called fasta-format, and contains unaligned DNA sequences.
Specifically, the sequences are major histocompatibility complex (MHC) class I genes from
six different primate species. MHC class I genes encode proteins that are involved in the
immune response. Now, close the nedit window and have a look at the next data file:
nedit gp120.fasta
The second data set consists of a set of unaligned DNA sequences from HIV-1, subtype B,
again in fasta-format. The sequences are approximately 500 bp long, and correspond to a
region surrounding the hyper-variable V3 region in the gene encoding the gp120 surface
glycoprotein.
Alignment of MHC Class I sequences
In this part of the exercise, we will use the program ClustalX to make a multiple
alignment of the MHC class I sequences.
Start ClustalX
clustalx &
This will start the multiple alignment program ClustalX, which is a graphical version of
the popular ClustalW alignment program. The ampersand ("&") at the end of the command makes
the program run "in the background".
Load the sequence file
File ==> Load sequences ==> select the file mhc.fasta
Start the multiple alignment:
Alignment ==> Do complete alignment ==> ALIGN
Inspect the alignment:
When the alignment is done you can inspect the result by scrolling along the
sequences in the ClustalX window. In this case the starting point was a set of very
similar sequences so not much has changed, but you will notice that gaps have been
inserted here and there.
Note: The starting point of all phylogenetic analyses is a multiple
alignment, and your results will be no better than the quality of that initial
alignment. Although you do not need to do so in this exercise, it is always a good idea
to manually check your alignment for errors. To this end ClustalX gives you the
opportunity to interactively realign selected residue ranges or selected
sequences.
Save the alignment in NEXUS format:
File ==> Save sequences as... ==> NEXUS ==> OK
(Make sure the output file is named "mhc.nxs".)
The PAUP* program cannot read files in the native Clustal alignment format.
However, it does understand the NEXUS format, so we have to convert the
alignment file before proceeding to the phylogenetic analysis.
Quit ClustalX:
File ==> Quit
Phylogenetic Analysis of MHC Class I sequences
In this part of the exercise you will find the maximum likelihood tree for
the MHC data, and briefly compare that tree to an alternative phylogeny.
Start the PAUP* program:
paup
This will give you a prompt ("paup>") where you can write
commands to the program.
Load the nexus file:
execute mhc.nxs
This command loads the sequences from the nexus file into memory. If the
nexus file had contained any commands, then these would also have been
executed.
Exclude all alignment columns containing any gaps:
exclude gapped
The exclude command can be used to ignore specified columns of the
alignment. For instance, "exclude 1-29 560-577" would make PAUP*
ignore columns 1 through 29 as well as columns 560 through 577. The special word "gapped"
automatically selects all columns containing one or more gaps.
Define the outgroup:
outgroup olive_baboon macaque
This defines an outgroup consisting of the two old world monkeys in the
data set. The names are those that appear in the mhc.nxs file.
The outgroup will be used to place the root of the tree. The rationale is as
follows: our data set consists of sequences from man, from a number of great
apes, and from two old world monkeys. We know from other evidence that
the lineage leading to monkeys branched off before any of the remaining
organisms diverged from each other. The root of the tree connecting the
organisms investigated here must therefore be located between the monkeys (the
"outgroup") and the rest (the "ingroup"). This way of finding a root is called
"outgroup rooting".
Activate outgroup rooting and select how the tree will be printed:
set root=outgroup outroot=monophyl
This makes PAUP* use the outgroup we defined above for the purpose of
rooting the tree. The "outroot=monophyl" option makes PAUP* construct
a tree where the outgroup is a monophyletic sister group to the ingroup.
(Outroot could also have been set to "polytomy" or "paraphyl").
Set the criterion to likelihood:
set criterion=likelihood
Select the Kimura 2-parameter model with gamma-distributed rates:
We have reason to believe that the Kimura 2-parameter model is a fair description of how
these sequences evolve (i.e., transitions and transversions have separate rates). We
furthermore have evidence that different sites evolve at quite different rates, and we want
to model this using a gamma distribution. (Although we will not discuss the issue further
at this point, it is important to realize that there are techniques for stringent selection
of the best model, and that one should never just randomly select one. We will return to
such techniques later when we discuss the likelihood ratio test). For now, however, you
should just accept that K2P + gamma is an adequate model for the present data set.
lset nst=2 basefreq=equal tratio=estimate rates=gamma shape=estimate
The "lset" command is used to set the model under the likelihood
criterion. Specifically, we have here indicated that we want a model with two different
types of substitution rates (nst=2) where the frequency of each base is 25%
(basefreq=equal). This is exactly the K2P model. Note that, by default, PAUP* assumes that
nst=2 means that we want to make a distinction between transitions and transversions. It is
also possible to specify models with two types of substitutions that are NOT transitions
and transversions respectively. One example would be: lset nst=6 rmatrix=estimate
rclass=(a a a b b b). I will not explain this example in detail at this point.
Secondly, we request that the transition/transversion ratio should be estimated from the
data (tratio=estimate).
Finally, we specify that we want to use a model where substitution rates at different
sites follow a gamma distribution (rates=gamma), and that we want the shape of this
distribution to also be estimated from the data (shape=estimate).
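If you are curious about what these settings correspond to mathematically, here is a
small Python sketch (not part of the exercise; it assumes numpy and scipy are
installed, and the kappa value is an arbitrary example) that builds the K2P rate
matrix and computes substitution probabilities for a branch of length t:

import numpy as np
from scipy.linalg import expm

kappa = 2.0                    # transition/transversion rate ratio (arbitrary example)
# Base order: A, C, G, T. The transitions are A<->G and C<->T.
Q = np.ones((4, 4))            # all transversions get rate 1
Q[0, 2] = Q[2, 0] = kappa      # A<->G transitions get rate kappa
Q[1, 3] = Q[3, 1] = kappa      # C<->T transitions get rate kappa
np.fill_diagonal(Q, 0.0)
np.fill_diagonal(Q, -Q.sum(axis=1))   # diagonal makes each row sum to zero

t = 0.1                        # branch length (arbitrary units; Q is not normalized here)
P = expm(Q * t)                # substitution probability matrix P(t) = exp(Qt)
print(np.round(P, 3))          # note P[A,G] > P[A,C] because kappa > 1

The symmetric matrix above implies equal base frequencies, which is what
basefreq=equal specifies.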
Find the maximum likelihood tree by examining all possible trees:
alltrees
This makes PAUP* find the best trees by exhaustively searching through all possible
trees (there are 105 unrooted trees for 6 taxa). For each possible tree PAUP* finds the
likelihood (i.e., the probability of the data given the model parameters), and upon
finishing this, the best tree is the one with the highest likelihood. At the end of the
run, PAUP* outputs a histogram giving the distribution of all likelihoods. Note that
PAUP* outputs the negative logarithm of the likelihood (-lnL), meaning that smaller
values are better. (Confusing - I know...)
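As an aside, the number 105 comes from the standard formula for the number of distinct
unrooted binary trees: for n taxa there are (2n-5)!! = 3 x 5 x ... x (2n-5) possible
trees. You can check this yourself with a small Python function (the name is mine):

def n_unrooted_trees(n):
    # (2n-5)!! = product of the odd numbers 3, 5, ..., 2n-5
    count = 1
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count

print(n_unrooted_trees(6))    # 105, matching the search above
print(n_unrooted_trees(10))   # 2027025: exhaustive search quickly becomes infeasible

This rapid growth is why exhaustive searches like alltrees are only feasible for
small data sets.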
Examine the best tree
describetrees 1/plot=phylogram
This shows the maximum likelihood tree. What is the closest relative of man according
to this tree? Put the answer on today's results form, and also note the negative lnL for this
tree.
Define a constraint for tree-searching:
constraints homogib (monophyly)=((human,gibbon));
This defines a constraint named "homogib" which requires the taxa "human"
and "gibbon" to form a monophyletic group.
Find the best tree that satisfies the constraint:
alltrees /constraints=homogib enforce=yes
This performs an exhaustive search for the best tree that has human and gibbon
together as a monophyletic group. Note the option enforce=yes which ensures that
the named constraint is applied. Also note that there are only 15 possible trees which
conform to this constraint.
describetrees 1/plot=cladogram
Note the -lnL of this tree on the results form, and compare the value with that of the best
tree found above. The difference tells you something about how much better one hypothesis is than
the other. (Recall that the likelihood is the probability of the data given the model, so if the
-lnL drops by 5 points, this means the probability of the data is e^5 ≈ 148 times higher). There
are tests that will tell you whether the difference is significant, but we will not get into that
at this point in the course.
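The arithmetic in the parenthesis above is easy to verify in Python:

from math import exp
print(exp(5))   # 148.41...; a drop of 5 in -lnL makes the data ~148 times more probable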
Analysis of viral data set: alignment of coding DNA.
In this part of the exercise you will learn how to align coding DNA while
taking into account the reading frame and the encoded protein sequence.
Quit PAUP* program:
quit
Contact RevTrans server, enter sequence:
The RevTrans server can be accessed using the following link:
http://www.cbs.dtu.dk/services/RevTrans/
Use the browse button in the field "Upload file containing DNA sequence" to
locate and select the file "gp120.fasta" in your "likelihood" directory.
Align DNA sequences while taking protein level into account:
On the RevTrans page, start alignment by clicking "Submit query".
Here is what is going on: DNA sequences are a lot less informative than protein
sequences and for this reason it is always preferable to align coding DNA in translated
form. The simple fact that proteins are built from 20 amino acids while DNA only
contains four different bases, means that the 'signal-to-noise ratio' in protein
sequence alignments is much better than in alignments of DNA. Besides this
information-theoretical advantage, protein alignments also benefit from the information
that is implicit in empirical substitution matrices such as BLOSUM-62. Together with
the fact that synonymous mutations generally occur at a higher rate than non-synonymous
ones, this means that the phylogenetic signal disappears much more rapidly from DNA sequences than
from the encoded proteins. It is therefore preferable to align coding DNA at the amino
acid level.
However, in the context of molecular evolution, DNA alignments retain a lot of
useful information regarding silent mutations. In particular, the ratio between
silent and non-silent substitutions is informative. We would therefore like to
construct a multiple alignment at the DNA level, but using information at the protein
level, and the RevTrans server does exactly that.
RevTrans takes as input an unaligned set of DNA sequences, automatically
translates them to the equivalent amino acid sequences, constructs a multiple
alignment of the protein sequences, and finally uses the protein alignment as a
template for constructing a multiple DNA alignment that is in accordance with the
protein alignment. This means that gaps are always inserted in groups of three, so
the reading frame is preserved.
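The back-translation step is simple enough to sketch in a few lines of Python. The
following is only meant to illustrate the principle (the function is hypothetical
and handles none of the practical complications that RevTrans deals with, such as
the translation and the protein alignment itself):

def backtranslate(aligned_protein, dna):
    # Map gaps from an aligned protein sequence back onto its unaligned,
    # in-frame DNA sequence, one codon per residue.
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    out, pos = [], 0
    for aa in aligned_protein:
        if aa == "-":
            out.append("---")         # a protein gap becomes a codon-sized DNA gap
        else:
            out.append(codons[pos])   # the next codon encodes this residue
            pos += 1
    return "".join(out)

print(backtranslate("MK-V", "ATGAAAGTT"))   # prints ATGAAA---GTT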
Save alignment in file:
Right-click "Save results" ==> Click "Save link as..." ==> Save file with name
"gp120.aln" in the "likelihood" directory.
Convert the clustal file to NEXUS format:
clustalx gp120.aln
This opens the alignment in ClustalX. Notice how gaps have been inserted in
groups of three (codon-sized). Now, use the following:
File ==> Save sequences as... ==> NEXUS ==> OK
to save the alignment in NEXUS format. Make sure the file is named gp120.nxs.
This file will be used by PAUP* in the next part of this exercise. Close the ClustalX
window and move on to the final part of today's exercise.
Analysis of viral data set: maximum likelihood tree.
In this part of the exercise we will use an iterative approach to construct a
maximum likelihood tree. This approach is useful for data sets that are too big for
full maximum likelihood reconstruction (where all parameters are optimized
simultaneously). (Note that the present data could probably be analyzed using the full
maximum likelihood approach in a matter of hours or days, but we do not want to wait
quite that long. Furthermore, it is useful to be familiar with the approach outlined
here).
First we will construct a neighbor joining tree using JC correction. We will then
enter the iterative part of the process. In the first step we fix the tree and use it
as the basis for estimating the other parameters in our chosen model. In step two, we
fix these parameters and use these estimates while searching for the best tree by
maximum likelihood. This two-step procedure is repeated a number of times or until
some stopping criterion is met. By limiting the number of parameters that have to be
estimated simultaneously, we will save a considerable amount of computer time.
Start PAUP*:
paup
Load data file:
execute gp120.nxs
Select distance criterion:
set criterion=distance
Specify that we want to use JC model correction:
dset dist=jc
Construct neighbor joining tree:
nj
Change criterion to likelihood:
set criterion=likelihood
Estimate parameters and tree using iterative approach:
Step 1:
lscores 1/nst=2 tratio=estimate basefreq=empirical variant=HKY rates=gamma shape=estimate
This specifies the HKY model with gamma distributed rates, and requests that
the transition/transversion ratio and the gamma shape parameter should be
estimated, while keeping the tree topology fixed.
We want to keep track of the progress: note the -lnL value on the results form.
Step 2:
lset nst=2 tratio=previous basefreq=empirical variant=HKY rates=gamma shape=previous
hsearch start=1 swap=spr
The lset command again specifies the HKY model with gamma distributed
rates, and requests that the transition/transversion ratio and the gamma shape
parameter should be set to the values you just estimated during step 1
("tratio=previous", "shape = previous"). The hsearch command starts a
heuristic search for the maximum likelihood tree using those (fixed) parameter values
but estimating topology and branch lengths. The starting tree is the first of the trees
we found in the previous round, and we are using rearrangements of the subtree pruning
and regrafting (SPR) type. This is a compromise: SPR is faster than TBR (tree bisection
and reconnection) but less thorough, while the even faster nearest neighbor interchange
(NNI) is less thorough still. (If you get very impatient you may want to try NNI
instead of SPR). This step will take a couple of
minutes. Progress is printed to screen every minute or so.
Again note the negative log likelihood of the model when the search is
finished.
Repeat iteration:
Now, repeat steps 1 and 2 once more. Each time note the -lnL. (If you have
the time you may want to try even more iterations to see if the likelihood keeps
improving).
Save tree in phylip format:
savetree from=1 to=1 file=hivtree.phylip format=phylip
The options from=1 to=1 ensure that only one tree will be saved to
file.
Export alignment in phylip format:
export file=gp120data.phylip format=phylip
Quit PAUP*:
quit
The tree you just constructed will be used in a later exercise where
we will use maximum likelihood methods to search for positively selected
sites in the HIV data sets.
Very briefly, the idea behind maximum likelihood is the following: You have some
observed data and a probabilistic model of how that data was generated. (Having a
probabilistic model simply means you are able to compute the probability of any
possible outcome). Your goal is to infer the values of model parameters based on
the observed data. Under maximum likelihood, the best estimates of the parameter
values are taken to be those that give the highest probability of producing the
observed outcome. The likelihood of a model is therefore the probability of the
observed data given that model: L(Model) = P(Data|Model), where "Model" means a
specific set of parameter values.
For instance: your data is the observation that 59 out of 100 coin tosses came
out heads. Your probabilistic model could then simply be that there is a
probability p of getting heads, a probability (1-p) of getting tails, and the
probability of getting h heads out of a total of n tosses can be found using the
binomial distribution. For a given value of p (say, p=0.52) you could then compute
the probability of any possible outcome (h=0, h=1, h=2, ..., h=100). This is a
probability distribution and therefore sums to one. The probability of the outcome
you actually observed (h=59) given the current parameter value (p=0.52) is the
likelihood of the current parameter value. The maximum likelihood estimate
of p is then the value which gives the highest probability of producing h=59 (in
the present case that would be the rather obvious estimate p=0.59). Note that for
each value of p you are only interested in one single value from the entire
probability distribution (namely the one corresponding to the observed data). When
you choose the best parameter value by maximum likelihood, you are therefore
comparing probabilities across different probability distributions.
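If you would like to see this in practice, the coin example can be checked numerically
with a few lines of Python (the variable and function names are mine):

from math import comb

n, h = 100, 59   # observed data: 59 heads out of 100 tosses

def likelihood(p):
    # Probability of the observed outcome h given parameter p (binomial model)
    return comb(n, h) * p**h * (1 - p)**(n - h)

# Evaluate the likelihood over a grid of p values and pick the best one
grid = [i / 1000 for i in range(1, 1000)]
print(max(grid, key=likelihood))   # 0.59, the maximum likelihood estimate
print(likelihood(0.52))            # the data are less probable under p=0.52...
print(likelihood(0.59))            # ...than under the MLE p=0.59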
In phylogenetic analysis using maximum likelihood, the observed data is most
often taken to be the set of aligned sequences. Typical model parameters are the
substitution rate matrix, the tree topology, and the branch lengths, but more
complicated models can have additional parameters (the gamma distribution shape
parameter for instance). One or more of these parameter values can be estimated
from the data. While interest has classically focused on the inferred tree,
estimates of other model parameters can also be of interest (e.g., the
transition/transversion ratio and branch lengths).
Note that treating the alignment as if it were observed data is of course a gross
simplification, which means we are ignoring any uncertainty about how the
sequences should be aligned. We really ought to think of the individual sequences
as the observed data, and inference of the alignment should then be part of the
model fitting process. There is ongoing research on this topic (for instance
"tree-HMMs") but we are not quite there yet.
Above, you saw how to specify the K2P model (lset nst=2
basefreq=equal tratio=estimate). In order to demonstrate the general principle behind model
specification I have listed the descriptions for a few of the most widely used
models below. Make sure you understand why the lset commands below
correspond to the various models:
Jukes and Cantor model (JC):
lset nst=1 basefreq=equal
Felsenstein 81 model (F81):
lset nst=1 basefreq=empirical
The command basefreq=empirical means that PAUP* should use the average
base frequencies in the data set. The command basefreq=estimate would
instead have used maximum likelihood to find the best values. However, since the
computational burden is increased for every additional free parameter the
empirical approach is often useful.
HKY model:
lset nst=2 basefreq=empirical tratio=estimate variant=HKY
Felsenstein 84 model (F84):
lset nst=2 basefreq=empirical tratio=estimate variant=F84
General time reversible model:
lset nst=6 basefreq=empirical rmatrix=estimate
The identifier rmatrix refers to the 6-parameter substitution
matrix.
Tamura-Nei model:
lset nst=6 basefreq=empirical rmatrix=estimate rclass=(a b a a c a)
If rmatrix=estimate, then the option rclass allows the user to
define a sub-model of the 6-parameter substitution matrix. The six entries in
rclass correspond to the substitution types AC, AG, AT, CG, CT and GT, in
alphabetical order. The above model specification therefore means that all
transversions have the same rate (AC=AT=CG=GT=a), and that the two transition types
each have their own rate (AG=b, CT=c).
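If the rclass notation still seems cryptic, the following toy Python snippet spells
out the grouping used above:

pairs  = ["AC", "AG", "AT", "CG", "CT", "GT"]   # alphabetical order
rclass = ["a",  "b",  "a",  "a",  "c",  "a"]    # the Tamura-Nei grouping from above

groups = {}
for pair, cls in zip(pairs, rclass):
    groups.setdefault(cls, []).append(pair)
print(groups)   # {'a': ['AC', 'AT', 'CG', 'GT'], 'b': ['AG'], 'c': ['CT']}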