
Likelihood Ratio Tests

Exercise written by: Anders Gorm Pedersen


Overview

Today's exercise will focus on the use of likelihood ratio tests (LRTs) in a biological/phylogenetic context. Specifically, we will look at a number of examples where we use LRTs to decide whether a parameter-rich model of sequence evolution (the "alternative model") fits a nucleotide data set significantly better than a simpler model with fewer parameters (the "null model"). Since each model essentially corresponds to a hypothesis about how the sequences have evolved, we can use this approach as a way of testing biological hypotheses. The types of questions that one can ask (and answer) in this way include, for instance, "Do these sequences evolve according to a molecular clock?" and "Is there evidence of positive selection acting on these sequences?".

Briefly, a likelihood ratio test is performed as follows. We are comparing two models. The simple (null) model has N_0 free parameters. The parameter-rich (alternative) model has N free parameters (where N > N_0). First, the simple model is fitted to the data and its (maximal) log-likelihood recorded (denoted lnL_0). Then the parameter-rich model is fitted to the data and its log-likelihood recorded (lnL). Third, twice the difference in log-likelihoods is computed - this is the test statistic: D = 2(lnL - lnL_0). The parameter-rich model will always fit the data at least as well, due to its extra parameters, and will therefore have a log-likelihood at least as high as that of the simple model, so D is a non-negative number. Note that PAUP reports the negative log-likelihood (which we will denote nLL). In the present context you can therefore rearrange the expression above to D = 2(nLL_0 - nLL) in order to get the signs right. This is the form we will use in today's exercise.

Finally, the test statistic can be used to find the corresponding p-value by comparing it to a chi-squared distribution with N - N_0 degrees of freedom. (If the null model has 10 free parameters and the alternative model has 13, then the statistic should be compared to a chi-squared distribution with 13 - 10 = 3 degrees of freedom.) This way one can test whether a parameter-rich model is significantly better than a simpler model. It should be noted that the statistic only follows a chi-squared distribution under certain specific circumstances, and there are cases where one might have to determine the null distribution empirically (e.g., using a technique known as parametric bootstrapping).
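
In the exercises below PAUP, modeltest, and codeml will provide the log-likelihoods, and you will look up the p-values in a chi-squared table. If you prefer to compute the p-value directly, the whole test fits in a few lines of Python; the sketch below is only an illustration, assumes the SciPy library is available, and uses made-up nLL values.

    # Minimal sketch of a likelihood ratio test, assuming SciPy is installed.
    from scipy.stats import chi2

    def lrt(nll_null, nll_alt, df):
        """Return the test statistic D = 2(nLL_0 - nLL) and its chi-squared p-value."""
        D = 2.0 * (nll_null - nll_alt)
        return D, chi2.sf(D, df)   # sf = upper tail of the chi-squared distribution

    # Made-up example: the alternative model has one extra free parameter
    # and improves the negative log-likelihood from 1000.0 to 995.0.
    D, p = lrt(1000.0, 995.0, df=1)
    print(D, p)   # D = 10.0, p is approximately 0.0016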

One important condition that has to be fulfilled before one can use an LRT to compare two models is that the models must be "nested". This means that the simpler model must be a constrained version of the parameter-rich model. For instance, the Jukes and Cantor model (where all substitution types have the same rate parameter and all nucleotides have the same frequency) is nested within the Felsenstein 81 model (where all substitutions have the same rate parameter, but the nucleotides have different frequencies): seen this way, the JC model is a special case of the F81 model where the individual base frequencies have been constrained to be identical.


Getting started.

  1. Log in to your account and construct today's working directory:

    mkdir lrt
    ls -l
  2. Change to today's working directory

    cd lrt
  3. Copy files for today's exercise:

    cp ~gorm/lrt/primatemitDNA.nexus ./primatemitDNA.nexus
    cp ~/likelihood/gp120data.phylip ./gp120data.phylip
    cp ~/likelihood/hivtree.phylip ./hivtree.phylip
    cp ~gorm/lrt/modelblock3.gorm ./modelblock3.gorm
    cp ~gorm/lrt/codeml.ctl ./codeml.ctl

    The file "primatemitDNA.nexus" contains an aligned set of mitochondrial DNA sequences from man, chimpanzee, gorilla, orangutan and gibbon. Mitochondria are cellular organelles that are bounded by a lipid membrane and contain their own genome. Mitochondrial DNA is related to certain bacterial genomes, and it is believed that the original mitochondrium was a primitive bacterial cell that was engulfed by an early ancestor of eukaryotic cells and that the pair subsequently went on to form a constant symbiotic relationship. Mitochondrial DNA has a higher rate of substitution than nuclear DNA. This makes it useful for investigating phylogenetic relationships between closely related species, such as the five primates included in the present data set.

    The files "gp120data.phylip" and "hivtree.phylip" were constructed by you earlier today (we are copying them from your likelihood directory), and contain an HIV alignment and the corresponding maximum likelihood tree respectively.

    We will here use likelihood ratio tests to analyze a number of questions regarding these data.


Comparing models of evolution

Our first use of likelihood ratio tests will be to investigate which of two models of evolution best fits a set of DNA sequences. We will test the Jukes and Cantor model (null model) vs. the Kimura 2-parameter model (alternative model). Recall that the difference between these two models is that under K2P transitions and transversions have different rates. Both models assume equal base frequencies. (Testing only two models is a slightly artificial situation, and later in this exercise we will return to a more comprehensive scheme where we test 56 models systematically.)

  1. Start program:

    paup
  2. Load sequences:

    execute primatemitDNA.nexus

    This file contains mitochondrial DNA sequences from 5 different primate species, and has been aligned in advance.

  3. Define outgroup:

    outgroup Gibbon
  4. Activate outgroup rooting and select how the tree will be printed:

    set root=outgroup outroot=monophyl
  5. Construct neighbor joining tree using JC model of evolution:

    set criterion=distance
    dset distance=jc objective=me basefreq=equal
    nj breakties=random

    The idea is now to use the lscores command to fit two different models to the neighbor joining tree you just constructed. You can then use the reported log-likelihoods to perform an LRT that will show which of the two models fits the data better.

  6. Fit JC model, find negative log-likelihood:

    set criterion=likelihood
    lscores  1/ nst=1  basefreq=equal

    This command uses the first (and only) tree in memory as the basis for fitting the JC model to the sequence data. Write down the reported negative log-likelihood (nLL_0) on today's result form.

  7. Fit K2P model, find negative log-likelihood:

    lscores  1/ nst=2  basefreq=equal  tratio=est  rates=equal

    This command fits the K2P model to the data, again using the nj tree in memory as the basis. Again write down the negative log-likelihood (nLL).

  8. Compute test statistic:

    The test statistic is twice the difference between the two negative log-likelihoods: D = 2(nLL_0 - nLL). Note the value on the form.

  9. Find number of degrees of freedom:

    To test significance we first need to determine the number of degrees of freedom for the chi-squared test we will now perform. In this case that is easy: K2P has one extra free parameter compared to JC (K2P has two different rate parameters where JC has one), so we should use 1 (one) degree of freedom.

    Note that in addition to the substitution parameters there is also one free parameter for each branch length. However, since we are comparing models on the same tree, the two models will have the same number of branch length parameters. (In case you are interested: an unrooted tree with n tips contains 2n-3 branches, so the present 5-tip tree has 2 x 5 - 3 = 7 branches, and the branch length of each of these is one free parameter.)

  10. Determine significance:

    You now have the necessary information to determine whether the alternative model (K2P) is significantly better than the null model (JC), taking into account that K2P uses one additional free parameter. You will do this by comparing the test statistic computed above with a chi-squared table. Use 1 degree of freedom and a significance level of 0.001. Which model is selected by the LRT? How does this fit in with what you know about mitochondrial DNA? Make sure to note the results of the LRT on today's result form.
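
    If you prefer to compute the p-value instead of reading it from the table, the same comparison can be done with a short Python sketch (SciPy assumed; the two nLL values below are placeholders that you should replace with the numbers you wrote down):

    # Sketch only: replace the placeholder nLL values with your own results.
    from scipy.stats import chi2
    nLL_jc  = 0.0                      # negative log-likelihood of the JC fit (placeholder)
    nLL_k2p = 0.0                      # negative log-likelihood of the K2P fit (placeholder)
    D = 2.0 * (nLL_jc - nLL_k2p)       # test statistic
    print("D =", D, "p =", chi2.sf(D, df=1))   # K2P has one extra free parameter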


Selecting the best model using the modeltest program:

We will now extend the analysis of this data set to more than just two possible models. We will do this using a set of predefined PAUP-commands and the program modeltest which has been specifically designed to perform LRTs according to a hierarchical system.

  1. Reset PAUP to default settings, clear trees from memory:

    factory
    cleartrees
  2. Re-set relevant options:

    outgroup Gibbon
    set root=outgroup outroot=monophyl
  3. Fit 56 models, record negative log-likelihoods in scorefile:

    execute modelblock3.gorm

    While PAUP is busy fitting the 56 investigated models, we will have a brief look at what exactly is going on here. The idea in modeltest is as follows: we are interested in using the model that best describes our data without having more parameters than strictly necessary. In modeltest this goal is achieved by fitting a series of increasingly complex models to the data, each time testing whether the more complicated model is significantly better than the simpler one, and stopping the procedure when it is not. Specifically, this is done according to a pre-defined hierarchy of LRTs that is described in the modeltest documentation (see figure 1 on page 6). A simplified code sketch of this stepwise procedure is shown at the end of this step.

    As you have noticed, the first step in this process in fact uses PAUP to fit all the models. The PAUP commands specifying which models to fit can be found in the file modelblock3.gorm that you started executing above. Now, have a look at the modelblock file. This file contains a set of PAUP commands along with several helpful comments (comments are enclosed in brackets: [This is a PAUP comment]). As mentioned in a previous exercise, it is possible to control PAUP by placing commands in a separate text file and then executing the file from within PAUP. This approach is especially useful when the same set of commands has to be applied to several data sets.

    The commands in this file essentially do the same thing that we did manually above: first a neighbor joining tree is constructed using the JC model. Then 56 different models are fitted to the data (using the nj tree as the basis). For each model, the estimated model parameters and the negative log-likelihood are written to a file named "model.scores". In addition to varying sets of substitution rate parameters, some of these models also include extra parameters that account for rate variation among sites. This is done in two ways: (1) by fitting a gamma distribution of rates, and (2) by allowing for a proportion of constant ("invariable") sites.
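
    The sketch below illustrates the stepwise logic with a deliberately simplified, linear chain of nested models; the real modeltest hierarchy is more elaborate, and the nLL values here are placeholders that would come from model.scores.

    # Simplified sketch of stepwise LRT-based model selection (not the real
    # modeltest hierarchy). Each entry is (name, free substitution parameters, nLL);
    # the nLL values are placeholders to be filled in from model.scores.
    from scipy.stats import chi2

    models = [("JC", 0, 0.0), ("K2P", 1, 0.0), ("HKY", 4, 0.0), ("GTR", 8, 0.0)]

    best = models[0]
    for name, npar, nll in models[1:]:
        D  = 2.0 * (best[2] - nll)         # 2(nLL_simple - nLL_complex)
        df = npar - best[1]                # number of extra free parameters
        if chi2.sf(D, df) < 0.01:          # richer model significantly better: keep it
            best = (name, npar, nll)
        else:
            break                          # extra parameters not justified: stop here
    print("selected model:", best[0])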

  4. Inspect result:

    Open a second shell window on evolution and cd to the lrt directory. Now have a look at the file containing information about the fitted models:

    nedit model.scores &

    For each model this file lists the negative log-likelihood and all estimated model parameters (excluding branch lengths).

  5. Use modeltest program to perform LRTs:

    As mentioned, modeltest uses the list of negative log likelihoods in the score-file to select the model that best fits the sequence data. This is done according to the hierarchy of LRTs shown in the documentation (see above). Make sure you are in the free shell-window (the one where PAUP is not running) and then enter the command below:

    modeltest3.7 < model.scores > modeltest.results

    This command executes the modeltest program using "model.scores" as the input file and "modeltest.results" as the output file.

  6. Inspect result:

    Again make sure you are in the free shell-window.

    nedit modeltest.results &

    This file contains the output of the modeltest program. At the top are the results of the individual LRTs, followed by an indication of the model that was finally selected. What model is selected based on the hierarchical LRTs? Does this seem reasonable? Make a note of the model on today's result form. At the bottom of the output file you will find the results of using the Akaike information criterion (AIC) to select the best model. I will not go further into this method here.

  7. Set model parameters to winning estimates:

    Below the indication of the model that was selected by the LRTs, you will find a PAUP command that sets the model parameters to the estimates that were found using the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.

    PASTE LSET COMMAND FROM MODELTEST RUN HERE
  8. Find best tree using selected model:

    Still in the PAUP-window, enter the following commands:

    alltrees
    describetree /plot=phylo

    You have now found the maximum likelihood tree using the model selected by modeltest, AND the parameter estimates found by modeltest on that model. You could also have chosen to simply estimate all the model parameters as part of this step, but fixing them improves speed tremendously. (And again, you could then have used the iterative approach explained in a previous exercise).

  9. Quit PAUP:

    quit


Detection of positively selected sites in gp120

There is much more to phylogenetic analyses than merely reconstructing trees. One interesting feature of probabilistic methods is that the parameters of a model have their values determined (or estimated) as part of the optimization procedure. This means that once such a model has been fitted to the data, it is possible to inspect these parameter values to learn about the evolutionary history of the sequences under investigation. In the present example we will use the codeml program from the PAML package to investigate whether we can find positively selected sites in a set of sequences encoding the envelope protein gp120 of HIV. Positive selection is here defined as codon sites where the dN/dS ratio is larger than 1.

In the maximum likelihood exercise you have previously constructed an alignment and a maximum likelihood tree for a set of HIV sequences. The aligned data set is in the file "gp120data.phylip", while the tree is in the file "hivtree.phylip". We will now use these files as the basis for our analysis. This is typical of how programs in the PAML package are used: the programs are not well suited for constructing phylogenetic trees, but rather depend on a pre-defined tree topology which is then used as the basis for fitting the other model parameters (branch lengths and substitution parameters).

Biological Background

Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).

The role gp120 plays in infection, and the fact that it is situated on the surface of the HIV particle, mean that it is an obvious target for the immune response. There may therefore be considerable selective pressure on gp120 for creating immune-escape mutants, where amino acids in the gp120 epitopes have been substituted. One goal of today's exercise is to investigate whether you can detect such a selective pressure on parts of gp120.

  1. Inspect the parameter file:

    nedit codeml.ctl

    The file "codeml.ctl" contains several settings that are relevant for running the program codeml. Here are some of the more important ones (make sure your file also has the settings indicated below):

    seqfile = gp120data.phylip   : sequence data file name
    treefile = hivtree.phylip    : tree structure file name
    seqtype = 1                  : tells the program that our data consist of coding DNA
    NSsites = 0 3                : tells the program to analyze models M0 and M3
    cleandata = 1                : tells the program to ignore positions with gaps
    method = 1                   : tells the program to optimize the likelihood in a fast, stepwise manner

    The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M0 says there is one single dN/dS ratio (w) for all sites in the sequence. M3 says there are 3 distinct dN/dS ratios (w_0, w_1, w_2) for different sites in the sequence, each class having its own frequency of occurrence (p_0, p_1, p_2). The values of the three dN/dS ratios and the frequencies of sites belonging to each class will be estimated during the analysis.

  2. Start the analysis

    codeml

    This will start the codeml program using the settings in the file codeml.ctl. There will be lots of diagnostic output on the screen, but you can ignore most of it (although you may be interested in keeping an eye on how the log-likelihood improves as the program searches for the best parameter values). The program is fitting a model that is based on a table of codon-codon substitution rates. This is different from the nucleotide-based models we usually employ, but the philosophy is quite similar. Importantly, the codon model includes separate rates for synonymous and nonsynonymous codon-codon substitutions. This means that once the parameters have been estimated by maximum likelihood, it is possible to interpret them in terms of the ratio between nonsynonymous and synonymous substitution rates. The parameter estimates thus directly tell us about the selective pressure acting on the sequences. Depending on system load this run will take around 10 minutes. Lean back and enjoy a cup of mate!

    (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves).

  3. Inspect result file

    Wait until codeml has finished before giving this command!

    nedit selection.results &

    This file contains a wealth of information concerning your analysis. The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values:

  4. Find likelihood and dN/dS ratio for model M0:

    Use the "Find" command in the nedit "Search" menu to find the relevant part of the file:

    Search ==> Find... ==> enter "Model 0" and click Find

    You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now, locate a line that looks a bit like the one shown below:

    lnL(ntime: 72  np: 74):  -4242.470345     +0.000000
    

    and use today's result-form to note down the number of "free parameters" used in model M0: This is indicated by "np", and is 74 in the example shown above. Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example above.

    Now, scroll down a few lines until you get to a table that is similar to the one shown below:

    dN & dS for each branch
    
     branch           t        S        N    dN/dS       dN       dS   S*dS   N*dN
    
      46..1       0.000     91.6    370.4   0.6020   0.0000   0.0000    0.0    0.0
      46..2       0.000     91.6    370.4   0.6020   0.0000   0.0000    0.0    0.0
      46..8       0.000     91.6    370.4   0.6020   0.0000   0.0000    0.0    0.0
      ...
    
    

    Note the dN/dS ratio from the column "dN/dS" (0.6020 in the example above) in today's result form. This is the average dN/dS ratio estimated under the assumption that all codon sites in the gene are under identical selective pressure. Is your value above or below 1?

  5. Find likelihood and dN/dS ratio for model M3:

    Scroll past the M0 table until you get to the results for model M3. Again note down the number of free parameters and the log-likelihood of the model.

    Now, scroll down a few lines until you get to a small table similar to this:

    dN/dS for site classes (K=3)
    p:   0.49485  0.36975  0.13540
    w:   0.06583  0.66691  3.17311
    

    This gives a summary of the dN/dS ratios that were found in the data set. The line starting with w: lists the three dN/dS ratios that were estimated (in this case 0.06583, 0.66691, and 3.17311). The line starting with p: gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 49% of the sites belong to the first class, 37% to the second class, and 14% to the class having dN/dS=3.17311). Note the w's and their corresponding p's on today's result form. Do you have any dN/dS class with a value above 1?
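
    As a small illustration of how these numbers can be combined, the sketch below computes a weighted-average dN/dS and the fraction of sites falling in classes with dN/dS > 1, using the example estimates shown above (plug in your own values):

    # Sketch using the example p and w values shown above; substitute your own estimates.
    p = [0.49485, 0.36975, 0.13540]   # proportion of sites in each class
    w = [0.06583, 0.66691, 3.17311]   # dN/dS ratio of each class

    average_w = sum(pi * wi for pi, wi in zip(p, w))
    positive  = sum(pi for pi, wi in zip(p, w) if wi > 1.0)
    print("weighted-average dN/dS:", round(average_w, 2))    # roughly 0.71 in this example
    print("fraction of sites with dN/dS > 1:", positive)     # 0.1354 in this example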

  6. Investigate whether model M3 is significantly better than M0:

    M3 will always have a better (higher) log-likelihood than model M0 because the two models are nested and M3 has more free parameters. To determine if it is also significantly better, first compute twice the difference between the log-likelihoods of M0 and M3. This will be your test statistic for the chi-squared test you will perform in a minute.

    For instance, if M3 has a log likelihood of -3768.977660 and the value for M0 is -3961.012861, then twice the difference is:

    2*(-3768.977660 - (-3961.012861)) = 384.070402

    Note the value in today's result form. Now, compute the difference in number of free parameters, and again note it. For instance, if M0 used 65 free parameters and M3 used 73 then:

    df = 73 - 65 = 8

    This is the number of degrees of freedom you will use in the chi-squared test. Finally, perform the test by comparing your statistic to a chi-squared table using the relevant number of degrees of freedom. A chi-squared table can be found here. In the example shown above the statistic is 384.070402, which is much larger than the largest critical value listed under df=8 in our chi-squared table, so M3 is significantly better than M0 (p << 0.1%). Note the result of your LRT on the result form. (The same calculation is also sketched in code at the end of this step.)

    It should be noted that this is actually one case where the LRT statistic will not follow a chi-squared distribution, due to the so-called boundary problem: M0 corresponds to M3 with two of its parameters constrained at the boundary of the parameter space. Specifically, M0 has only one class of dN/dS ratios, and can be considered a special case of M3 where the frequencies of two of the dN/dS classes (say, p_0 and p_1) have been constrained to be zero. In such cases the LRT statistic does not follow a chi-squared distribution. However, it has been shown that in this case using the normal chi-squared distribution leads to an overly conservative test. This means that we may miss significant results, but if we DO find a significant difference then we can be certain the result is believable. We will therefore ignore the issue here, but note that it can be circumvented by using parametric bootstrapping.
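
    The example calculation above can also be written out in a few lines of Python (a sketch only, using the example numbers from the text rather than your own results; remember that the chi-squared p-value is conservative here because of the boundary problem):

    # Sketch using the example values from the text; substitute your own numbers.
    from scipy.stats import chi2
    lnL_m0, np_m0 = -3961.012861, 65
    lnL_m3, np_m3 = -3768.977660, 73
    D  = 2.0 * (lnL_m3 - lnL_m0)       # 384.070402
    df = np_m3 - np_m0                 # 8
    print("D =", D, "df =", df, "p =", chi2.sf(D, df))   # p is vanishingly small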

  7. Examine list of positively selected sites

    If your M3 model has a dN/dS class with a value larger than 1 (we firmly believe it should have if you did things according to our instructions...), and if M3 was found to be significantly better than M0, then you have evidence for the existence of positively selected sites in the gp120 gene. That's not bad for less than one hour of work! Now, scroll down to the end of the result file and locate a list similar to this one:

    Positively selected sites Prob(w>1):
    
         7 K 0.6741
        50 L 0.9542*
        52 D 0.8680
        54 V 0.9991**
        64 S 0.9516*
        68 N 1.0000**
       100 H 0.8655
       ...
       
    

    This gives you a list of the residues (if any) that were found to belong to the positively selected dN/dS class. Each line lists (1) the codon position, (2) the residue present at that position in the reference sequence, and (3) the probability that this codon does indeed belong to the positively selected class. Note all the sites that have more than 95% probability of being positively selected on the result form.
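
    If the list is long, the sites above the 95% cutoff can also be pulled out with a small Python sketch; the table lines below are the example values from the text, and you would paste in your own list instead:

    # Sketch: keep only sites with Prob(w>1) > 0.95. The table string holds the
    # example lines from the text; replace them with your own list.
    table = """
      7 K 0.6741
     50 L 0.9542*
     52 D 0.8680
     54 V 0.9991**
     64 S 0.9516*
     68 N 1.0000**
    100 H 0.8655
    """

    for line in table.strip().splitlines():
        position, residue, prob = line.split()
        if float(prob.rstrip("*")) > 0.95:
            print(position, residue, prob)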


Software

  • PAML: analyze phylogenetic hypotheses using maximum likelihood.
  • Modeltest: testing the model of DNA substitution.