Events News Research CBS CBS Publications Bioinformatics
Staff Contact About Internal CBS CBS Other

Model selection

Anders Gorm Pedersen


In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env-genes from HIV-1, subtype B. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein. One major goal with the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (Akaike Information Criterion), based on which you will compute model probabilities. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.

Specifically, you will

  1. perform a multiple alignment of gp120 DNA sequences taking protein-level information into account (using revtrans).
  2. select a suitable nucleotide substitution model (using PAUP and modeltest)
  3. construct a phylogenetic tree (using PAUP).
  4. try to detect positively selected sites in gp120 (using PAML).

Recipe for computing AIC values and model probabilities:

Later in today's exercise you will be asked to compute AIC values and model probabilities. Return to this section and follow the instructions when you need to do so.

  1. Fit a set of models to your data, note the maximized log likelihoods (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.

  2. Compute AIC for each of the models: AIC = -2 x lnL + 2K.
    For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.

  3. Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model "AICmin".

  4. Compute the "ΔAIC" values for each model: ΔAIC = AIC - AICmin
    For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.

  5. For each model compute the following quantity: numerator = exp(-0.5 x ΔAIC)
    For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Also compute the sum of the numerator values for all models.

  6. Finally, the model probabilities for each model are found as: P(model) = numerator / sum
    For example, if sum = 3.75 and a model has numerator = 1.3, then it has P(model) = 1.3 / 3.75 = 0.35

You may want to keep track of the computations by constructing a table along the following lines:

A good starting point for reading more about model selection and multimodel inference is this book:

Note that model probabilities can also be computed using Bayesian methods.

Biological Background

Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).

The role gp120 plays in infection and the fact that it is situated on the surface of the HIV particle, means it is an obvious target for the immune response. That means that there may be a considerable selective pressure on gp120 for creating immune-escape mutants, where amino acids in the gp120 epitopes have been substituted. One goal of todays exercise is to investigate whether you can detect such a selective pressure on parts of gp120.

Getting started, description of data

Log on to you account.

Copy today's working directory:

cp -r ~gorm/maxlik ./modelselect

Change to today's working directory, and have a look at which files are there:

cd modelselect
ls -l

Here, you will find a data-file (gp120.fasta) and two parameter files (modelblock3.gorm and codeml.ctl) that will be used later on. Have a look at the DNA data file:

nedit gp120.fasta &

The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you've had a look.

Creating a DNA alignment based on aligned protein sequences

  1. Align sequences with RevTrans server:

    Using what you've learned previously, use the RevTrans server to construct a multiple alignment.

  2. Save alignment in Phylip format and alter alignment file so PAML will understand it:

    First, convert the alignment file to phylip format. Make sure the file is named "gp120.phy". Secondly, alter the first line of the alignment file so it looks like this:

        39    510 I

    (i.e., add a space and the letter "I" to the end of the first line). Getting data files into formats that can be understood by various programs is something that the average bioinformatician spends a LOT of time doing... Here, we added an upper-case "i" so the PAML programs (which we will use later) will understand that the file is in "interleaved" format.

  3. Save alignment in Nexus format so PAUP will understand it:

    Also, convert the alignment file to Nexus format. Make sure the file is named "".

Selection of substitution model using PAUP and modeltest

As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in using the model that best describes our data without having more parameters than strictly necessary (thus avoiding overfitting). We will investigate this issue by fitting a set of 56 different models to our data and then selecting one with a reasonable balance between model complexity and data fit.

  1. Start PAUP:


    This starts the program, giving you a prompt ("paup> ") where you can enter commands. Each command has to be terminated by hitting the "Enter button".

  2. Load alignment:

  3. Fit 56 models, record negative log-likelihoods in scorefile:

    execute modelblock3.gorm

    This causes PAUP to execute the commands present in the file modelblock3.gorm: first a neighbor joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are written to a file named "model.scores". In addition to varying sets of substitution rate parameters, some of these models also include extra parameters that take into account the presence of different rates between sites. This is done in two ways: (1) by fitting a gamma distribution of rates, and (2) by allowing for a proportion of constant ("invariable") sites.

    Wait until PAUP is done fitting all 56 models (this will take 5-10 minutes - you can tell PAUP has finished since the prompt "paup> " will reappear). Now, quit the program by typing:

    (Don't worry if the program makes a remark about unsaved trees - in this part of the exercise we are not interested in the tree, we just want to know how well the 56 different models fit our data set, i.e., we only want their log likelihood values).
  4. Inspect result, manually check model probabilities for three models:

    nedit model.scores &

    For each model this file lists the negative log-likelihood and all estimated model parameters (excluding branch lengths).

    Q2: Use AIC-based model probabilities to investigate which of the following three substitution models are best at describing how the sequences have evolved: Jukes and Cantor (JC), Felsenstein 81 (F81), Kimura 2 parameter model (K80/K2P). Before you can do the computation you need to know the log likelihood and the number of parameters for each model. First, locate and note the log likelihood values for the JC, F81, and K80 models in the model.scores file and note the values: each model takes up two lines in the file, JC is the first model, F81 is number 5, and K80 is number 9 in the file, so their lnLs are found on lines 2, 10, and 18). Make sure to get the signs right: the values reported in the file are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative). Use the following values for the number of parameters (K) for the three models: JC: K=0, F81: K=3, K80: K=1. Now, use the recipe above to compute AIC values and model probabilities. Report all values. Based on the model probabilities: wich model has more support? How much better is the best model compared to the alternatives (what are the ratios between their probbailities)?

  5. Use modeltest program to select best model:

    What you just did manually for JC, F81 and K80, the program modeltest does automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods in the score-file to compute AIC and model probabilities, and uses this to select the model that best fits the sequence data.

    modeltest < model.scores > modeltest.results

    This command executes the modeltest program using model.scores as the input file and modeltest.result as the output file.

  6. Inspect result:

    nedit modeltest.results &

    This file contains the output of the modeltest program. The first part of the file shows model selection results based on the so-called likelihood ratio test method. Scroll past that until you find the section containing the AIC-based results.

    Q3: Note the model selected by modeltest based on the AIC values.

    Now locate the line containing the PAUP command corresponding to the best model. The command is enclosed between "BEGIN PAUP" and "END;" and should look somethign like "Lset base=(....". (It is located right above a table showing the AIC model probabilities.) You will need to copy this command to a PAUP session in the next section:

Note: There is a newer version of modeltest (jModeltest) that will run using a graphical user interface, and that is independent of PAUP. You can download and run this on your own computer if you want:

  1. jModeltest homepage
  2. Quickstart: Graphical user interface

Construction of phylogenetic tree using PAUP

Above you used modeltest to select the most suitable substitution model for the present data set. You will now use this model to construct a maximum likelihood tree. You will again use PAUP for this purpose.

  1. Start PAUP:

  2. Load alignment:

  3. Set tree-building criterion to maximum likelihood:

    set criterion=likelihood
  4. Set model parameters to winning estimates:

    Above you located a set of lines in the file modeltest.results giving a PAUP command that sets the model parameters to the estimates that were found using the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.

  5. Find best tree using selected model:

    Still in the PAUP-window, enter the following command:

    hsearch swap=nni

    This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the "nearest neighbor interchange" type (NNI). We are using the model selected by modeltest, AND the parameter estimates found by modeltest on that model. You could also have chosen to simply estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Findind the best tree should take a few minutes.

  6. Save best tree to file in phylip tree format:

    savetrees format=phylip brlens=yes file=gp120tree.phy
  7. Quit program:

  8. View tree:

    Q4: You have now produced an unrooted tree of the HIV sequences in the file gp120tree.phy. Use FigTree to view the unrooted tree and include this in your report. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection on a subset of codon positions and the tree is just something we need in order to be able to fit the different codon models to the data.

Detection of positively selected sites in gp120

There is much more to phylogenetic analyses than merely reconstructing trees. One interesting result of probabilistic methods, is that the parameters of a model will have their values determined as part of the optimization procedure. This means that once such a model has been fitted to the data, it is possible to investigate these estimated parameter values to learn features about the evolutionary history of the sequences under investigation. In the present example we will focus on investigating whether we can find positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model where the dN/dS ratio is one of its parameters.

A further strength of the probabilistic approach is that you get a probabilistic measure of how well any model fits the data. This means you can use a stringent approach to determine which model fits the data best. In this framework one uses likelihoods (probabilities of data given model) to determine which model fits the data best. As you saw above, it is for instance possible to compute AIC values and model probabilities from the likelihood values of fitted models, Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, we can thus use a stringent statistical approach to decide which hypothesis best describes our data.

In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:

  1. Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS<1, some with dN/dS=1.
  2. Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS>1.
  3. Assess the strength of evidence for the two models using AIC-based model probabilities
  4. If M2 is better: identify the positively selected codons
  1. Inspect the parameter file:

    nedit codeml.ctl &

    The file "codeml.ctl" contains several settings that are relevant for running the program codeml:

    seqtype = 1: tells the program that our data consists of coding DNA. NSsites = 1 2 : tells the program to analyze models M1 and M2. cleandata = 1: tells the program to ignore positions with gaps.

    The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS<1 (codons under purifying or negative selection), and one class with dN/dS=1 (no selection - neutrally evolving sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with dN/dS<1, one class with dN/dS=1 (these are the same type of classes as for M1), and one class with dN/dS>1 (corresponding to sites under positive selection). The value of the dN/dS ratios (for those classes that have dN/dS<1 or dN/dS>1), the fraction of sites belonging to each class, and the position of sites belonging to each class, are unknown at first and will be determined during the analysis.

  2. Start the analysis


    This will start the codeml program using the settings in the file codeml.ctl. On padawan this will take somewhere around 10-15 minutes or so. (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves).

  3. Inspect result file

    Wait for the run to finish, and then look at the result file:

    nedit selection.results &

    This file contains a wealth of information concerning your analysis. The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values:

  4. Use nedit search menu to find likelihood, K, and dN/dS ratios for model M1:

    Search ==> Find... ==> enter "Model 1" and click Find

    You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now, locate a line that looks a bit like the one shown below:

    lnL(ntime: 72  np: 74):  -4242.470345     +0.000000

    Q5: Note the number of "free parameters", K, used in model M1: This is indicated by "np", and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by "ntime", and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.

    Now, scroll down a few lines until you get to a small table similar to this:

    dN/dS for site classes (K=2)
    p:   0.75111  0.24889
    w:   0.06583  1.00000

    This gives a summary of the dN/dS ratios that were found in the data set. The line starting w: lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000 - the last one was pre-specified by us as part of the model and was therefore not a free parameter). The line starting p: gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 75% belong to the first class and 25% of all sites belong to the class having dN/dS=1.00000).

    Q5: Note the dN/dS value and proportion of sites for both classes.

  5. Find likelihood, K, and dN/dS ratios for model M2:

    Scroll past the M1 output until you get to the results for model M2.

    Q7: Note the number of free parameters and the log-likelihood of model M2.

    Now, scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are 3 separate classes of codons.

    Q8: Note the dN/dS value and proportion of sites for all three classes.

  6. Assess strength of evidence for models M1 and M2:

    M2 will always have a better (higher) log-likelihood than model M1 because M2 has more free parameters. You should now use the recipee given above to compute AIC values and model probabilities.

    Q9: Note the selected model (is M2 better than M1?)

  7. Examine list of positively selected sites

    If your M2 is clearly better than M1 (we firmly believe it should be if you did things according to our instructions...), then you have evidence for the existence of positively selected sites in the gp120 gene. That's not bad for a few hours of work! Now, scroll down to the end of the result file and locate a list similar to this one:

    Bayes Empirical Bayes (BEB) analysis
    Positively selected sites
             Prob(w>1)     mean w
        25 A 0.959*        3.133 +- 0.769
        27 P 0.906         2.990 +- 0.877
        56 K 0.987*        3.197 +- 0.687
        59 V 0.915         3.032 +- 0.873
        78 R 0.637         2.351 +- 1.129
        88 K 0.573         2.148 +- 1.077
        95 V 0.925         3.046 +- 0.843

    This gives you a list of which residues (if any) that were found to belong to the positively selected dN/dS-class. Also listed is the probability that the site really is in the dN/dS class with w>1, and a weighted average of the w at the site. Using only DNA sequences you have now identified likely epitopes on the gp120 protein.

    Q10: Note all sites having more than 95% probability of belonging to the positively selected class