|
Likelihood Ratio Tests
Exercise written by: Anders Gorm Pedersen
Overview
Today's exercise will focus on the use of likelihood ratio tests (LRTs) in a
biological/phylogenetic context. Specifically, we will look at a number of examples where we use
LRTs to decide whether a parameter-rich model of sequence evolution (the "alternative model") fits
a nucleotide data set significantly better than a simpler model which has fewer parameters, (the
"null model"). Since each model essentially corresponds to a hypothesis about how the sequences
have evolved, we can use this approach as a way of testing biological hypotheses. The type of
questions that one can ask (and answer) in this way include, for instance, "Do these sequences
evolve according to a molecular clock?" and "Is there evidence of positive selection acting on
these sequences?".
Briefly, a likelihood ratio test is performed as follows. We are comparing two models. The
simple (null) model has N_0 free parameters. The parameter-rich (alternative)
model has N free parameters (where N > N_0). First, the simple
model is fitted to the data and its (maximal) log-likelihood recorded (denoted
lnL_0). Then the parameter-rich model is fitted to the data and its likelihood
recorded (lnL). Third, twice the difference in log-likelihoods is computed - this
is the test statistic: D = 2(lnL - lnL_0). The parameter-rich model will always
have a better fit, due to the extra parameters and will therefore have the highest log-likelihood,
so this is a positive number. Note that PAUP reports the negative log likelihood value
(which we will denote nLL). In the present context you could therefore
rearrange the expression above to the following: D = 2(nLL_0 - nLL), in order to
get the signs right. This is the form we will use in todays exercise.
Finally, the test-statistic can be used to find the corresponding p-value by
comparing it to a chi-squared distribution with N - N_0 degrees
of freedom. (If the null model has 10 free parameters and the alternative model
has 13, then the statistic should be compared to a chi-squared distribution
with 13 - 10 = 3 degrees of freedom). This way one can test
whether a parameter-rich model is significantly better than a simpler
model. It should be noted that this statistic only follows a chi-squared
distribution under certain specific circumstances, and there are cases where one
might have to determine the null-distribution empirically (e.g., using a technique
known as parametric bootstrapping).
One important condition that has to be fulfilled before one can use an LRT to compare two
models, is that the models should be "nested". This means that the simpler model must be a
constrained version of the parameter-rich model. For instance, the Jukes and Cantor model (where
all substitution types have the same rate-parameter and where all nucleotides have the same
frequency), is nested within the Felsenstein 81 model (where all substitutions have the same
rate-parameter, but where the nucleotides all have different frequencies): seen this way, the JC
model is a case of the F81 model where the individual base frequencies have been constrained to be
identical.
Log in to your account and construct todays working directory:
mkdir lrt
ls -l
Change to today's working directory
cd lrt
Copy files for today's exercise:
cp ~gorm/lrt/primatemitDNA.nexus ./primatemitDNA.nexus
cp ~/likelihood/gp120data.phylip ./gp120data.phylip
cp ~/likelihood/hivtree.phylip ./hivtree.phylip
cp ~gorm/lrt/modelblock3.gorm ./modelblock3.gorm
cp ~gorm/lrt/codeml.ctl ./codeml.ctl
The file "primatemitDNA.nexus" contains an aligned set of
mitochondrial DNA sequences from man, chimpanzee, gorilla, orangutan and gibbon.
Mitochondria are cellular organelles that are bounded by a lipid membrane and
contain their own genome. Mitochondrial DNA is related to certain bacterial
genomes, and it is believed that the original mitochondrium was a primitive
bacterial cell that was engulfed by an early ancestor of eukaryotic cells and
that the pair subsequently went on to form a constant symbiotic relationship.
Mitochondrial DNA has a higher rate of substitution than nuclear DNA. This makes
it useful for investigating phylogenetic relationships between closely related
species, such as the five primates included in the present data set.
The files "gp120data.phylip" and "hivtree.phylip" were
constructed by you earlier today (we are copying them from your
likelihood directory), and contain an HIV alignment and the
corresponding maximum likelihood tree respectively.
We will here use likelihood ratio tests to analyze a number of questions
regarding these data.
Comparing models of evolution
Our first use of likelihood ratio tests will be to investigate which of two models of evolution
that best fits a set of DNA sequences. We will test the Jukes and Cantor model (null model) vs.
the Kimura 2 parameter model (alternative model). Recall that the difference between these two
models is that under K2P transitions and transversions have different rates. Both models assume
equal base frequencies. (Testing only two models is a slightly artificial situation, and later in
this exercise we will return to a more comprehensive scheme where we test 56 models
systematically.)
Start program:
paup
Load sequences:
execute primatemitDNA.nexus
This file contains mitochondrial DNA sequences from 5 different primate species, and has been
aligned in advance.
Define outgroup:
outgroup Gibbon
Activate outgroup rooting and select how tree will be printed:
set root=outgroup outroot=monophyl
Construct neighbor joining tree using JC model of evolution:
set criterion=distance
dset distance=jc objective=me basefreq=equal
nj breakties=random
The idea is now to use the lscores command to fit two different models to the
neighbor joining tree you just constructed. You can then use the reported log-likelihoods
to perform an LRT that will show which of the two models fit the data better.
Fit JC model, find negative log-likelihood:
set criterion=likelihood
lscores 1/ nst=1 basefreq=equal
This command uses the first (and only) tree in memory as the basis for fitting the JC model
to the sequence data. Write down the reported negative log-likelihood (nLL_0)
on today's result form.
Fit K2P model, find negative log-likelihood:
lscores 1/ nst=2 basefreq=equal tratio=est rates=equal
This command fits the K2P model to the data, again using the nj tree in memory as the
basis. Again write down the negative log-likelihood (nLL).
Compute test statistic:
The test statistic is twice the difference between the two negative log-likelihoods:
D = 2(nLL_0 - nLL). Note the value on the form.
Find number of degrees of freedom:
To test significance we first need to determine the number of degrees of
freedom for the chi-squared test we will now perform. In this case that is
easy: K2P has one extra free parameter compared to JC (K2P has two different
rate parameters where JC has one), so we should use 1 (one) degree of
freedom.
Note that in addition to the substitution parameters there is also one free
parameter for each branch length. However, since we are comparing models on the
same tree the two models will have the same number of branch length
parameters. (In case you are interested: an unrooted tree with n tips contains
2n-3 branches, so the present 5-tip tree therefore has 2 x 5 - 3 =
7 branches, and the branch length for each of these is one free
parameter).
Determine significance:
You now have the necessary information to determine whether the alternative
model (K2P) is significantly better than the null model (JC), taking into
account that K2P uses one additional free parameter. You will do this by
comparing the test statistic computed above with this chi-squared table. Use 1 degree of freedom
and a significance-level of 0.001. Which model is selected by the LRT? How does
this fit in with what you know about mitochondrial DNA? Make sure to
note the results of the LRT on today's result form.
Selecting the best model using the modeltest program:
We will now extend the analysis of this data set to more than just two
possible models. We will do this using a set of predefined PAUP-commands and the program
modeltest which has been specifically designed to perform LRTs according to
a hierarchical system.
Reset PAUP to default settings, clear trees from memory:
factory
cleartrees
Re-set relevant options:
outgroup Gibbon
set root=outgroup outroot=monophyl
Fit 56 models, record negative log-likelihoods in scorefile:
execute modelblock3.gorm
While PAUP is busy fitting the 56 investigated models, we will have a brief
look at what exactly is going on here. The idea in modeltest is as follows: we
are interested in using the model that best describes our data without having
more parameters than strictly necessary. In modeltest this goal is achieved by
fitting a series of increasingly more complex models to the data, each time
testing whether the more complicated model is significantly better than the
simple model, and stopping the procedure when it is not. Specifically, this is
done according to a pre-defined hierarchy of LRTs that is described in the modeltest documentation (see figure
1 on page 6).
As you have noticed, the first step in this process in fact uses PAUP to fit
all the models. The PAUP-commands specifying which models to fit can be found
in the file modelblock3.gorm that you started executing above. Now, have a look at the modelblock file.
This file contains a set of PAUP commands along with several helpful comments
(comments are enclosed in brackets: [This is a PAUP comment]). As mentioned in a
previous exercise, it is possible to control PAUP by placing commands in a
separate text file and then executing the file from within PAUP. This approach
is especially useful when the same set of commands have to be applied to
several data sets.
The commands in this file essentially do the same thing that we did manually above: first a
neighbor joining tree is constructed using the JC model. Then a set of 56 different models are
fitted to the data (using the nj tree as the basis). For each model, the estimated model
parameters and the negative log-likelihood are written to a file named "model.scores". In addition
to varying sets of substitution rate parameters, some of these models also include extra
parameters that take into account the presence of different rates between sites. This is done in
two ways: (1) by fitting a gamma distribution of rates, and (2) by allowing for a proportion of
constant ("invariable") sites.
Inspect result:
Open a second shell window on evolution and cd to the lrt
directory. Now have a look at the file containing information about the fitted
models:
nedit model.scores &
For each model this file lists the negative log-likelihood and all estimated model parameters
(excluding branch lengths).
Use modeltest program to perform LRTs:
As mentioned, modeltest uses the list of negative log likelihoods in
the score-file to select the model that best fits the sequence data. This is
done according to the hierarchy of LRTs shown in the documentation (see above).
Make sure you are in the free shell-window (the one where PAUP is not running)
and then enter the command below:
modeltest3.7 < model.scores > modeltest.results
This command executes the modeltest program using "model.scores" as
the input file and "modeltest.results" as the output file.
Inspect result:
Again make sure you are in the free shell-window.
nedit modeltest.results &
This file contains the output of the modeltest program. At the top are the results
of the individual LRTs, followed by an indication of the model that was finally
selected. What model is selected based on the hierarchical LRTs? Does this seem
reasonable? Make a note of the model on today's result form. In the bottom of the
output file you will find the results of using the Akaike information criterion (AIC)
to select the best model. I will not go further into this method here.
Set model parameters to winning estimates:
Below the indication of the model that was selected by the LRTs, you will find a PAUP command
that sets the model parameters to the estimates that were found using the winning model. Copy and
paste this lset command (without the BEGIN and END parts) into the window where PAUP is
running.
PASTE LSET COMMAND FROM MODELTEST RUN HERE
Find best tree using selected model:
Still in the PAUP-window, enter the following commands:
alltrees
describetree /plot=phylo
You have now found the maximum likelihood tree using the model selected by modeltest, AND the
parameter estimates found by modeltest on that model. You could also have chosen to simply
estimate all the model parameters as part of this step, but fixing them improves speed
tremendously. (And again, you could then have used the iterative approach explained in a previous
exercise).
Quit PAUP:
quit
Detection of positively selected sites in gp120
There is much more to phylogenetic analyses than merely reconstructing trees.
One interesting result of probabilistic methods, is that the parameters of a
model will have their values determined (or estimated) as part of the
optimization procedure. This means that once such a model has been fitted to the
data, it is possible to investigate these parameter values to learn features
about the evolutionary history of the sequences under investigation. In the
present example we will use the codeml program from the PAML package to
investigate whether we can find positively selected sites in a set of sequences
encoding the envelope protein gp120 of HIV. Positive selection is here defined
as codon sites where the dN/dS ratio is larger than 1.
In the maximum likelihood exercise you have previously constructed an
alignment and a maximum likelihood tree for a set of HIV sequences. The
aligned data set is in the file "gp120data.phylip", while the tree is
in the file "hivtree.phylip". We will now use these files as the basis for
performing our analysis. This is typical for how programs in the PAML package
are used: the programs are not well suited for constructing phylogenetic trees,
but rather depend on a pre-defined tree-topology which is then used as the basis
for fitting the other model parameters (branch lengths and substitution
parameters).
Like other retroviruses, particles of HIV are made up of 2 copies of a
single-stranded RNA genome packaged inside a protein core, or capsid. The core
particle also contains viral proteins that are essential for the early steps
of the virus life cycle, such as reverse transcription and integration. A lipid
envelope, derived from the infected cell, surrounds the core particle.
Embedded in this envelope are the surface glycoproteins of HIV: gp120 and
gp41. The gp120 protein is crucial for binding of the virus particle to target
cells, while gp41 is important for the subsequent fusion event. It is the
specific affinity of gp120 for the CD4 protein that targets HIV to those cells
of the immune system that express CD4 on their surface (e.g., T-helper
lymphocytes, monocytes, and macrophages).
The role gp120 plays in infection and the fact that it is situated on the
surface of the HIV particle, means it is an obvious target for the immune
response. That means that there may be a considerable selective pressure on
gp120 for creating immune-escape mutants, where amino acids in the gp120
epitopes have been substituted. One goal of todays exercise is to investigate
whether you can detect such a selective pressure on parts of gp120.
Inspect the parameter file:
nedit codeml.ctl
The file "codeml.ctl" contains several settings that are relevant for running the
program codeml. Here are some of the more important ones (make sure your
file also has the settings indicated below):
seqfile = gp120data.phylip: sequence data filename
treefile = hivtree.phylip: tree structure file name
seqtype = 1: tells the program that our data consists of coding DNA.
NSsites = 0 3: tells the program to analyze models M0 and M3.
cleandata = 1: tells the program to ignore positions with gaps.
method = 1: tells the program to optimize likelihood in a fast, stepwise manner.
The settings entered by us will cause codeml to analyze two hypotheses about dN/dS
ratios. M0 says there is one single dN/dS ratio (w) for all sites in the sequence. M3 says
there are 3 distinct dN/dS ratios (w_0, w_1, w_2) for different sites in the sequence, each
class having its own frequency of occurrence (p_0, p_1, p_2). The value of the three dN/dS
ratios (w_0, w_1, w_2) and the frequency of sites belonging to either class (p_0, p_1,
p_2), will be determined during the analysis.
Start the analysis
codeml
This will start the codeml program using the settings in the file codeml.ctl.
There will be lots of diagnostic output on the screen, but you can ignore most
of it (although you may be interested in keeping an eye on how the log
likelihood improves as the program searches for the best parameter values). The
program is fitting a model that is based on a table of codon-codon substitution
rates. This is different from the nucleotide-based models we usually employ,
but the philosophy is quite similar. Importantly, the codon model includes
separate rates for synonymous and nonsynonymous codon-codon substitutions. This
means that once the parameters have been estimated by maximum likelihood, it is
possible to interpret them in terms of the ratio between synonymous and
non-synonymous mutation rates. The parameter estimates thus directly tell us
about the selective pressure acting on the sequences. Depending on system load
this run will take somewhere around 10 minutes or so. Lean back and enjoy a cup
of mate!
(You may be able to see how the optimization procedure results in
progressively better fits: the likelihood increases, meaning that negative
log-likelihood decreases, as the fit improves).
Inspect result file
Wait until codeml has finished before giving this command!
nedit selection.results &
This file contains a wealth of information concerning your analysis. The top
part of the file gives an overview of your sequences, codon usage and
nucleotide frequencies. You can ignore this information for now, and move on to
the interesting part, namely the model likelihoods and parameter values:
Find likelihood and dN/dS ratio for model M0:
Use the "Find" command in the nedit "Search" menu to find the relevant part of the file:
Search ==> Find... ==> enter "Model 0" and click Find
You are now in the region of the result file where the model likelihoods and parameter
estimates are noted. Now, locate a line that looks a bit like the one shown below:
lnL(ntime: 72 np: 74): -4242.470345 +0.000000
and use today's result-form to note down the number of "free parameters" used
in model M0: This is indicated by "np", and is 74 in the example shown above.
Also note the log-likelihood of the fitted model. This is the number right after
the parenthesis, and is -4242.470345 in the example above.
Now, scroll down a few lines until you get to a table that is similar to the one
shown below:
dN & dS for each branch
branch t S N dN/dS dN dS S*dS N*dN
46..1 0.000 91.6 370.4 0.6020 0.0000 0.0000 0.0 0.0
46..2 0.000 91.6 370.4 0.6020 0.0000 0.0000 0.0 0.0
46..8 0.000 91.6 370.4 0.6020 0.0000 0.0000 0.0 0.0
...
Note the dN/dS ratio from the column "dN/dS" (0.6020 in the example above) in
today's result form. This is the average dN/dS ratio estimated under the
assumption that all codon sites in the gene are under identical selective
pressure. Is your value above or below 1?
Find likelihood and dN/dS ratio for model M3:
Scroll past the M0 table until you get to the results for model M3. Again note down
the number of free parameters and the log-likelihood of the model.
Now, scroll down a few lines until you get to a small table similar to
this:
dN/dS for site classes (K=3)
p: 0.49485 0.36975 0.13540
w: 0.06583 0.66691 3.17311
This gives a summary of the dN/dS ratios that were found in the data set. The
line starting w: lists the three dN/dS ratios that were found (in this case
0.06583, 0.66691, and 3.17311). The line starting p: gives the proportion of
codon sites belonging to each of the dN/dS ratio classes (in the example above
approximately 49% belong to the first class, 37% belong to the second class, and
14% of all sites belong to the class having dN/dS=3.17311). Note the w's and
their corresponding p's on today's result form. Do you have any class of dN/dS
values with a value that is above 1?
Investigate whether model M3 is significantly better than M0:
M3 will always have a better (higher) log-likelihood than model M0 because the two
models are nested and M3 has more free parameters. To determine if it is also
significantly better, first compute twice the difference between the
log-likelihoods of M0 and M3. This will be your test statistic for the chi-squared test
you will perform in a minute.
For instance, if M3 has a log likelihood of -3768.977660
and the value for M0 is -3961.012861, then twice the difference is:
2*(-3768.977660 - (-3961.012861)) = 384.070402
Note the value in today's result form. Now, compute the difference in number
of free parameters, and again note it. For instance, if M0 used 65 free
parameters and M3 used 73 then:
df = 73 - 65 = 8
This is the degrees of freedom you will use in the chi-squared test. Finally,
perform the test by comparing your statistic to a chi-squared table using the
relevant number for degrees of freedom. A chi-squared table can be found here. In the example shown above, we
have the statistic=384.070402, which is much larger than the largest critical
value in our chi-squared table under df=8, which means that M3 is
significantly better than M0 (p<<0.1%). Note the result of your LRT on the result
form.
It should be noted that actually, this is one case where the LRT statistic
will not follow a chi-squared distribution due to the so-called boundary
problem: M0 corresponds to M3 with two of its parameters constrained at the
boundary of the parameter space. Specifically, M0 has only one class of dN/dS
ratios, and can be considered a special case of M3 where the frequency of two of
its dN/dS classes (say, p_0 and p_1) have been constrained to be zero. In such
cases the LRT staistic does not follow a chi-squared distribution. However, it
has been shown that in this case using the normal chi-squared distribution
leads to an overly conservative test. This means that we may miss significant
results, but if we DO find a significant difference then we are certain the
result is believeable. We will therefore ignore the issue here, but we will note
that it can be circumvented by using parametric bootstrapping.
Examine list of positively selected sites
If your M3 model has a dN/dS class with a value larger than 1 (we firmly
believe it should have if you did things according to our instructions...),
and if M3 was found to be significantly better than M0, then you have
evidence for the existence of positively selected sites in the gp120 gene.
That's not bad for less than one hour of work! Now, scroll down to the end of the
result file and locate a list similar to this one:
Positively selected sites Prob(w>1):
7 K 0.6741
50 L 0.9542*
52 D 0.8680
54 V 0.9991**
64 S 0.9516*
68 N 1.0000**
100 H 0.8655
...
This gives you a list of which residues (if any) that were found to belong to
the positively selected dN/dS-class. Each line lists (1) the codon position, (2)
the residue present at that position in the reference sequence, (3) the
probability that this codon does indeed belong to the positively selected class.
Note all the sites that have more than 95% probability of being positively
selected on the results form.
Software
- PAML:
analyze phylogenetic hypotheses using maximum likelihood.
- Modeltest:
testing the model of DNA substitution.
|