Model selection
Anders Gorm Pedersen
Overview
In this exercise you are going to investigate features of HIV1 evolution. You will
do this by analyzing a large set of envgenes from HIV1, subtype B. Specifically, the
DNA sequences analyzed here correspond to a region surrounding the hypervariable V3
region of the gp120 protein. One major goal with the exercise is to introduce you to
statistically based methods for assessing the strength of evidence for a set of alternative
hypotheses about some biological system of interest. The model selection method we will use is
AIC (Akaike Information Criterion), based on which you will compute model probabilities.
A second goal is to make you aware that phylogenetic analysis is not only about
constructing trees, but that it is also a useful framework for analyzing biological
questions more generally.
Specifically, you will
 perform a multiple alignment of gp120 DNA sequences taking proteinlevel information into account (using revtrans).
 select a suitable nucleotide substitution model (using PAUP and modeltest)
 construct a phylogenetic tree (using PAUP).
 try to detect positively selected sites in gp120 (using PAML).
Recipe for computing AIC values and model probabilities:
Later in today's exercise you will be asked to compute AIC values and model probabilities.
Return to this section and follow the instructions when you need to do so.
Fit a set of models to your data, note the maximized log likelihoods (lnL) and
the number of free parameters (K) for each model in the investigated set. The models you fit
should represent a plausible and comprehensive set of hypotheses about your data.
Compute AIC for each of the models: AIC = 2 x lnL + 2K. For example: a model with
lnL = 2010 and K = 5 will have AIC = 2 x 2010 + 2 x 5 = 4030.
Identify the model with the smallest AIC (this is the best model in the set). We will call the
AIC for this model "AIC_{min}".
Compute the "ΔAIC" values for each model: ΔAIC = AIC  AIC_{min}
For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models
will have positive ΔAICs.
For each model compute the following quantity: numerator = exp(0.5 x ΔAIC)
For example, a model with ΔAIC=4.2 will have
numerator = exp(0.5 x 4.2) = exp(2.1) = 0.1225.
Also compute the sum of the numerator values for all models.
Finally, the model probabilities for each model are found as: P(model) = numerator / sum
For example, if sum = 3.75 and a model has numerator = 1.3, then it has
P(model) = 1.3 / 3.75 = 0.35
You may want to keep track of the computations by constructing a table along the following lines:
A good starting point for reading more about model selection and multimodel inference is this book:
Note that model probabilities can also be computed using Bayesian methods.
Like other retroviruses, particles of HIV are made up of 2 copies of a
singlestranded RNA genome packaged inside a protein core, or capsid. The core
particle also contains viral proteins that are essential for the early steps
of the virus life cycle, such as reverse transcription and integration. A lipid
envelope, derived from the infected cell, surrounds the core particle.
Embedded in this envelope are the surface glycoproteins of HIV: gp120 and
gp41. The gp120 protein is crucial for binding of the virus particle to target
cells, while gp41 is important for the subsequent fusion event. It is the
specific affinity of gp120 for the CD4 protein that targets HIV to those cells
of the immune system that express CD4 on their surface (e.g., Thelper
lymphocytes, monocytes, and macrophages).
The role gp120 plays in infection and the fact that it is situated on the
surface of the HIV particle, means it is an obvious target for the immune
response. That means that there may be a considerable selective pressure on
gp120 for creating immuneescape mutants, where amino acids in the gp120
epitopes have been substituted. One goal of todays exercise is to investigate
whether you can detect such a selective pressure on parts of gp120.
Log on to you account.
Copy today's working directory:
cp r ~gorm/maxlik ./modelselect
Change to today's working directory, and have a look at which files are there:
cd modelselect
ls l
Here, you will find a datafile (gp120.fasta) and two parameter
files (modelblock3.gorm and codeml.ctl) that will be used
later on. Have a look at the DNA data file:
nedit gp120.fasta &
The file contains several DNA sequences
from HIV1, subtype B. The sequences are approximately 500 bp long, and
correspond to a region surrounding the hypervariable V3 region in the gene
encoding gp120. Close the nedit window when you've had a look.
Creating a DNA alignment based on aligned protein sequences
Align sequences with RevTrans server:
Using what you've learned previously, use the
RevTrans server to construct a multiple alignment.
Save alignment in Phylip format and alter alignment file so PAML will understand it:
First, convert the alignment file to phylip format. Make sure the file is named "gp120.phy".
Secondly, alter the first line of the alignment file so it looks like this:
39 510 I
(i.e., add a space and the letter "I" to the end of the first line). Getting
data files into formats that can be understood by various programs is something
that the average bioinformatician spends a LOT of time doing... Here, we
added an uppercase "i" so the PAML programs (which we will use later)
will understand that the file is in "interleaved" format.
Save alignment in Nexus format so PAUP will understand it:
Also, convert the alignment file to Nexus format. Make sure the file is named "gp120.nexus".
As part of the present analysis we are going to build a phylogenetic tree based on
the DNA alignment constructed above. We will construct the tree using maximum
likelihood, but to do that we first have to decide which substitution model we want to
use. Specifically, we are interested in using the model that best describes our data
without having more parameters than strictly necessary (thus avoiding overfitting). We
will investigate this issue by fitting a set of 56 different models to our data and
then selecting one with a reasonable balance between model complexity and data
fit.
Start PAUP:
paup
This starts the program, giving you a prompt ("paup> ") where
you can enter commands. Each command has to be terminated by hitting the
"Enter button".
Load alignment:
execute gp120.nexus
Fit 56 models, record negative loglikelihoods in scorefile:
execute modelblock3.gorm
This causes PAUP to execute the commands present in the file
modelblock3.gorm: first a neighbor joining tree is constructed using the
Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set
of 56 different models to the data. For each model, the estimated model parameters and
the negative loglikelihood are written to a file named "model.scores". In addition to
varying sets of substitution rate parameters, some of these models also include extra
parameters that take into account the presence of different rates between sites. This is
done in two ways: (1) by fitting a gamma distribution of rates, and (2) by allowing for a
proportion of constant ("invariable") sites.
Wait until PAUP is done fitting all 56 models (this will take 510 minutes  you can tell PAUP has finished
since the prompt "paup> " will reappear). Now, quit the program by typing:
quit
(Don't worry if the program makes a remark about unsaved trees  in this part of the exercise we are not interested in the
tree, we just want to know how well the 56 different models fit our data set, i.e., we only want their log likelihood
values).
Inspect result, manually check model probabilities for three models:
nedit model.scores &
For each model this file lists the negative loglikelihood and all estimated model
parameters (excluding branch lengths).
Q2: Use AICbased model probabilities to investigate which of the following three
substitution models are best at describing how the sequences have evolved: Jukes and Cantor (JC),
Felsenstein 81 (F81), Kimura 2 parameter model (K80/K2P). Before you can do the computation you
need to know the log likelihood and the number of parameters for each model. First, locate and
note the log likelihood values for the JC, F81, and K80 models in the model.scores file and note
the values: each model takes up two lines in the file, JC is the first model, F81 is number 5, and
K80 is number 9 in the file, so their lnLs are found on lines 2, 10, and 18). Make sure to get the
signs right: the values reported in the file are lnL values, so you will need to reverse
the sign to get the lnL (the lnL values you write down should be negative). Use the following
values for the number of parameters (K) for the three models: JC: K=0, F81: K=3, K80: K=1. Now, use
the recipe above to compute AIC values and model probabilities. Report all values. Based on the
model probabilities: wich model has more support? How much better is the best model compared to the
alternatives (what are the ratios between their probbailities)?
Use modeltest program to select best model:
What you just did manually for JC, F81 and K80, the program modeltest does automatically
for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods
in the scorefile to compute AIC and model probabilities, and uses this
to select the model that best fits the sequence data.
modeltest < model.scores > modeltest.results
This command executes the modeltest program using model.scores as the input file and
modeltest.result as the output file.
Inspect result:
nedit modeltest.results &
This file contains the output of the modeltest program. The first part of the file shows model
selection results based on the socalled likelihood ratio test method. Scroll past that
until you find the section containing the AICbased results.
Q3: Note the model selected by modeltest based on the AIC values.
Now locate the line containing the PAUP command corresponding to the best model.
The command is enclosed between "BEGIN PAUP" and "END;" and should look somethign like
"Lset base=(....". (It is located right above a table showing the AIC model probabilities.)
You will need to copy this command to a PAUP session in the
next section:
Note: There is a newer version of modeltest (jModeltest) that will run using a graphical user
interface, and that is independent of PAUP. You can download and run this on your own
computer if you want:
 jModeltest homepage
 Quickstart: Graphical user interface
Above you used modeltest to select the most suitable substitution model for
the present data set. You will now use this model to construct a maximum likelihood
tree. You will again use PAUP for this purpose.
Start PAUP:
paup
Load alignment:
execute gp120.nexus
Set treebuilding criterion to maximum likelihood:
set criterion=likelihood
Set model parameters to winning estimates:
Above you located a set of lines in the file modeltest.results
giving a PAUP command that sets the model parameters to the estimates that
were found using the winning model. Copy and paste this lset command
(without the BEGIN and END parts) into the window where PAUP is running.
PASTE LSET COMMAND FROM MODELTEST RUN HERE
Find best tree using selected model:
Still in the PAUPwindow, enter the following command:
hsearch swap=nni
This command causes PAUP to perform a heuristic search for the best maximum
likelihood tree. Once an initial tree has been constructed, the heuristic
search proceeds by rearrangements of the "nearest neighbor interchange" type
(NNI). We are using the model selected by modeltest, AND the parameter
estimates found by modeltest on that model. You could also have chosen to
simply estimate all the model parameters as part of this step (i.e., at the
same time as finding the best tree), but fixing them improves speed
tremendously. Findind the best tree should take a few minutes.
Save best tree to file in phylip tree format:
savetrees format=phylip brlens=yes file=gp120tree.phy
Quit program:
quit
View tree:
Q4: You have now produced an unrooted tree of the HIV sequences in the file
gp120tree.phy. Use FigTree to view the unrooted tree and include this in your report. Note
that in this exercise we will not be interested in the tree as such  our focus is instead on finding
positive selection on a subset of codon positions and the tree is just something we need in order to
be able to fit the different codon models to the data.
Detection of positively selected sites in gp120
There is much more to phylogenetic analyses than merely reconstructing
trees. One interesting result of probabilistic methods, is that the parameters
of a model will have their values determined as part of the optimization
procedure. This means that once such a model has been fitted to the data, it is
possible to investigate these estimated parameter values to learn features
about the evolutionary history of the sequences under investigation. In the
present example we will focus on investigating whether we can find positively
selected sites in our data set, defined as sites where the dN/dS ratio is
larger than 1. We do that by using a codon substitution model where the dN/dS ratio
is one of its parameters.
A further strength of the probabilistic approach is that you get a
probabilistic measure of how well any model fits the data. This means you can
use a stringent approach to determine which model fits the data best.
In this framework one uses likelihoods
(probabilities of data given model) to
determine which model fits the data best. As you saw above, it is for instance possible
to compute AIC values and model probabilities from the likelihood values of fitted models,
Since each model essentially corresponds to a hypothesis about the
evolutionary history of the data, we can thus use a stringent statistical
approach to decide which hypothesis best describes our data.
In outline, you will now use the following steps to investigate whether
there is any evidence for positively selected codons in your data set:
 Fit model M1, which assumes there are two classes of codons in the sequence:
some with dN/dS<1, some with dN/dS=1.
 Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1,
and one extra class with dN/dS>1.
 Assess the strength of evidence for the two models using AICbased model probabilities
 If M2 is better: identify the positively selected codons
Inspect the parameter file:
nedit codeml.ctl &
The file "codeml.ctl" contains several settings that are relevant for running the
program codeml:
seqtype = 1: tells the program that our data consists of coding DNA.
NSsites = 1 2 : tells the program to analyze models M1 and M2.
cleandata = 1: tells the program to ignore positions with gaps.
The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says
there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS<1 (codons
under purifying or negative selection), and one class with dN/dS=1 (no selection  neutrally evolving
sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with
dN/dS<1, one class with dN/dS=1 (these are the same type of classes as for M1), and one class with dN/dS>1
(corresponding to sites under positive selection). The value of the dN/dS ratios (for those classes that
have dN/dS<1 or dN/dS>1), the fraction of sites belonging to each class, and the position of sites
belonging to each class, are unknown at first and will be determined during the analysis.
Start the analysis
codeml
This will start the codeml program using the settings in the file codeml.ctl.
On padawan this will take somewhere around 1015 minutes or so. (You may be able to see how the optimization procedure
results in progressively better fits: the likelihood increases, meaning that negative
loglikelihood decreases, as the fit improves).
Inspect result file
Wait for the run to finish, and then look at the result file:
nedit selection.results &
This file contains a wealth of information concerning your analysis. The top
part of the file gives an overview of your sequences, codon usage and
nucleotide frequencies. You can ignore this information for now, and move on to
the interesting part, namely the model likelihoods and parameter values:
Use nedit search menu to find likelihood, K, and dN/dS ratios for model M1:
Search ==> Find... ==> enter "Model 1" and click Find
You are now in the region of the result file where the model likelihoods and
parameter estimates are noted. Now, locate a line that looks a bit like the
one shown below:
lnL(ntime: 72 np: 74): 4242.470345 +0.000000
Q5: Note the number of "free parameters", K, used in model M1: This is indicated by
"np", and is 74 in the example shown above (most of these parameters are branch lengths
in the tree; specifically, the number of branch length parameters is indicated by "ntime", and
is 72 in this example). Also note the loglikelihood of the fitted
model. This is the number right after the parenthesis, and is 4242.470345 in the example here.
Now, scroll down a few lines until you get to a small table similar to
this:
dN/dS for site classes (K=2)
p: 0.75111 0.24889
w: 0.06583 1.00000
This gives a summary of the dN/dS ratios that were found in the data set. The line
starting w: lists the two dN/dS ratios that were found (in this case 0.06583
and 1.00000  the last one was prespecified by us as part of the model and was therefore not
a free parameter). The line starting p: gives the proportion of codon sites belonging to each
of the dN/dS ratio classes (in the example above approximately 75% belong to the first
class and 25% of all sites belong to the class having
dN/dS=1.00000).
Q5: Note the dN/dS value and proportion of sites for both classes.
Find likelihood, K, and dN/dS ratios for model M2:
Scroll past the M1 output until you get to the results for model M2.
Q7: Note the number of free parameters and the loglikelihood of model
M2.
Now, scroll down a few lines until you get to a small table similar to
the one you examined for M1 before. For this model there are 3 separate
classes of codons.
Q8: Note the dN/dS value and proportion of sites for all three classes.
Assess strength of evidence for models M1 and M2:
M2 will always have a better (higher) loglikelihood than model M1 because M2 has more
free parameters. You should now use the recipee given above to compute AIC values
and model probabilities.
Q9: Note the selected model (is M2 better
than M1?)
Examine list of positively selected sites
If your M2 is clearly better than M1 (we firmly believe it should be if you did things
according to our instructions...), then you have evidence for the existence of positively selected
sites in the gp120 gene. That's not bad for a few hours of work! Now, scroll down to the end of the
result file and locate a list similar to this one:
Bayes Empirical Bayes (BEB) analysis
Positively selected sites
Prob(w>1) mean w
25 A 0.959* 3.133 + 0.769
27 P 0.906 2.990 + 0.877
56 K 0.987* 3.197 + 0.687
59 V 0.915 3.032 + 0.873
78 R 0.637 2.351 + 1.129
88 K 0.573 2.148 + 1.077
95 V 0.925 3.046 + 0.843
...
This gives you a list of which residues (if any) that were found to belong
to the positively selected dN/dSclass. Also listed is the probability that
the site really is in the dN/dS class with w>1, and a weighted average of the w
at the site. Using only DNA sequences you have now identified likely epitopes
on the gp120 protein.
Q10: Note all sites having more than 95% probability of belonging to the
positively selected class
