Maximum likelihood: model selection, phylogenetic reconstruction, and
detection of selection in viral sequences
Anders Gorm Pedersen
In this exercise you are going to investigate features of HIV-1 evolution. You will
do this by analyzing a large set of env-genes from HIV-1, subtype B. Specifically, the
DNA sequences analyzed here correspond to a region surrounding the hypervariable V3
region of the gp120 protein. One major goal with the exercise is to introduce you to
statistically based methods for assessing the strength of evidence for a set of alternative
hypotheses about some biological system of interest. The model selection method we will use is
AIC (Akaike Information Criterion), based on which you will compute model probabilities.
A second goal is to make you aware that phylogenetic analysis is not only about
constructing trees, but that it is also a useful framework for analyzing biological
questions more generally.
Specifically, you will
- perform a multiple alignment of gp120 DNA sequences taking protein-level information into account (using revtrans).
- select a suitable nucleotide substitution model (using PAUP and modeltest)
- construct a phylogenetic tree (using PAUP).
- try to detect positively selected sites in gp120 (using PAML).
Recipe for computing AIC values and model probabilities:
Later in today's exercise you will be asked to compute AIC values and model probabilities.
Return to this section and follow the instructions when you need to do so.
Fit a set of models to your data, note the maximized log likelihoods (lnL) and
the number of free parameters (K) for each model in the investigated set. The models you fit
should represent a plausible and comprehensive set of hypotheses about your data.
Compute AIC for each of the models: AIC = -2 x lnL + 2K.
For example: a model with
lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.
Identify the model with the smallest AIC (this is the best model in the set). We will call the
AIC for this model "AICmin".
Compute the "ΔAIC" values for each model: ΔAIC = AIC - AICmin
For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models
will have positive ΔAICs.
For each model compute the following quantity: numerator = exp(-0.5 x ΔAIC)
For example, a model with ΔAIC=4.2 will have
numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225.
Also compute the sum of the numerator values for all models.
Finally, the model probabilities for each model are found as: P(model) = numerator / sum
For example, if sum = 3.75 and a model has numerator = 1.3, then it has
P(model) = 1.3 / 3.75 = 0.35
You may want to keep track of the computations by constructing a table along the following lines:
A good starting point for reading more about model selection and multimodel inference is this book:
Note that model probabilities can also be computed using Bayesian methods.
Like other retroviruses, particles of HIV are made up of 2 copies of a
single-stranded RNA genome packaged inside a protein core, or capsid. The core
particle also contains viral proteins that are essential for the early steps
of the virus life cycle, such as reverse transcription and integration. A lipid
envelope, derived from the infected cell, surrounds the core particle.
Embedded in this envelope are the surface glycoproteins of HIV: gp120 and
gp41. The gp120 protein is crucial for binding of the virus particle to target
cells, while gp41 is important for the subsequent fusion event. It is the
specific affinity of gp120 for the CD4 protein that targets HIV to those cells
of the immune system that express CD4 on their surface (e.g., T-helper
lymphocytes, monocytes, and macrophages).
The role gp120 plays in infection and the fact that it is situated on the
surface of the HIV particle, means it is an obvious target for the immune
response. That means that there may be a considerable selective pressure on
gp120 for creating immune-escape mutants, where amino acids in the gp120
epitopes have been substituted. One goal of todays exercise is to investigate
whether you can detect such a selective pressure on parts of gp120.
Log on to you account on padawan.cbs.dtu.dk
Copy today's working directory:
cp -r ~gorm/maxlik ./maxlik
Change to today's working directory, and have a look at which files are there:
Here, you will find a data-file (gp120.fasta) and two parameter
files (modelblock3.gorm and codeml.ctl) that will be used
later on. Have a look at the DNA data file:
nedit gp120.fasta &
The file is in the so-called fasta-format, and contains several DNA sequences
from HIV-1, subtype B. The sequences are approximately 500 bp long, and
correspond to a region surrounding the hypervariable V3 region in the gene
encoding gp120. Close the nedit window when you've had a look.
Creating a DNA alignment based on aligned protein sequences
In this part of the exercise, we will use the program revtrans to
make a multiple alignment of the gp120 DNA sequences.
The simple fact that proteins are built from 20 amino acids while DNA only
contains four different bases, means that the 'signal-to-noise ratio' in protein
sequence alignments is much better than in alignments of DNA. Besides this
information-theoretical advantage, protein alignments also benefit from the
information that is implicit in empirical substitution matrices such as
BLOSUM-62. Taken together with the generally higher rate of synonymous mutations
over non-synonymous ones, this means that the phylogenetic signal disappears
much more rapidly from DNA sequences than from the encoded proteins. It is
therefore preferable to align coding DNA at the amino acid level.
For this purpose we will use the program revtrans, which
constructs a multiple DNA alignment by: (i) translating the DNA; (ii) aligning
the resulting peptide sequences; and (iii) building a multiple DNA alignment by
using the aligned protein sequences as a template. In the resulting DNA
alignment, gaps occur in groups of three corresponding to entire codons, and
analogous codon positions are therefore always lined up.
The enthusiastic student can find more info in: Rasmus Wernersson and Anders Gorm
Pedersen. "RevTrans: multiple alignment of coding DNA from aligned amino acid sequences",
Nucl. Acids Res., 2003, 31(13), 3537-3539. (PDF)
Open file containing gp120 data set:
On the server type:
nedit gp120.fasta &
Start web-interface for revtrans in a browser:
On your local machine - not on the server - open a new browser window on the RevTrans server:
- Copy and paste the entire content of the gp120.fasta file to the window labeled "Paste in DNA sequences" in the RevTrans server window.
- Click button labeled "Submit query"
The server will now automatically translate the DNA sequences, align
them at the protein level, and construct the DNA alignment from the protein
Note: You may have to press the link to "refresh status".
Save result to file (when RevTrans is done):
- Click the button labeled "Download result"
- nedit gp120.aln & (on the server)
- copy and paste the entire content from the revtrans result page to the gp120.aln file (the new, open nedit window)
- save the file and exit nedit
Open alignment in ClustalX:
clustalx gp120.aln &
You can inspect the result by scrolling along the sequences in the ClustalX window. In this case
the starting point was a set of very similar sequences so not much has changed but gaps have been
inserted here and there.
Q1: Locate the gaps in the alignment. You will note that in this alignment
gaps come in two different sizes. Note these two lengths. What is the greatest common divisor
in the gap size, and why does it have that value in this alignment?
Save alignment in Phylip format:
From the clustalx window:
File ==> Save sequences as... ==> Tick off "Phylip" format and click "OK"
(Make sure the file is named "gp120.phy".) The next program you will use
does not understand Clustalx format so you have to convert the alignment into
Phylip format before proceeding. Close the Clustalx window when you are done.
Alter alignment file so PAML will understand it:
Alter the first line of the alignment file so it looks like this:
39 510 I
(i.e., add a space and the letter "I" to the end of the first line). Getting
data files into formats that can be understood by various programs is something
that the average bioinformatician spends a LOT of time doing... Here, we
added an upper-case "i" so the PAML programs (which we will use later)
will understand that the file is in "interleaved" format.
As part of the present analysis we are going to build a phylogenetic tree based on
the DNA alignment constructed above. We will construct the tree using maximum
likelihood, but to do that we first have to decide which substitution model we want to
use. Specifically, we are interested in using the model that best describes our data
without having more parameters than strictly necessary (thus avoiding overfitting). We
will investigate this issue by fitting a set of 56 different models to our data and
then selecting one with a reasonable balance between model complexity and data
This starts the program, giving you a prompt ("paup> ") where
you can enter commands. Each command has to be terminated by hitting the
Convert alignment to NEXUS format:
tonexus fromfile=gp120.phy format=phylip interleave=yes tofile=gp120.nexus
Fit 56 models, record negative log-likelihoods in scorefile:
This causes PAUP to execute the commands present in the file
modelblock3.gorm: first a neighbor joining tree is constructed using the
Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set
of 56 different models to the data. For each model, the estimated model parameters and
the negative log-likelihood are written to a file named "model.scores". In addition to
varying sets of substitution rate parameters, some of these models also include extra
parameters that take into account the presence of different rates between sites. This is
done in two ways: (1) by fitting a gamma distribution of rates, and (2) by allowing for a
proportion of constant ("invariable") sites.
Wait until PAUP is done fitting all 56 models (this will take 5-10 minutes - you can tell PAUP has finished
since the prompt "paup> " will reappear). Now, quit the program by typing:
(Don't worry if the program makes a remark about unsaved trees - in this part of the exercise we are not interested in the
tree, we just want to know how well the 56 different models fit our data set, i.e., we only want their log likelihood
Inspect result, manually check model probabilities for two models:
nedit model.scores &
For each model this file lists the negative log-likelihood and all estimated model
parameters (excluding branch lengths).
Q2: Use AIC-based model probabilities to investigate which of the following three
substitution models are best at describing how the sequences have evolved: Jukes and Cantor (JC),
Felsenstein 81 (F81), Kimura 2 parameter model (K80/K2P). Before you can do the computation you
need to know the log likelihood and the number of parameters for each model. First, locate and
note the log likelihood values for the JC, F81, and K80 models in the model.scores file and note
the values: each model takes up two lines in the file, JC is the first model, F81 is number 5, and
K80 is number 9 in the file, so their lnLs are found on lines 2, 10, and 18). Make sure to get the
signs right: the values reported in the file are -lnL values, so you will need to reverse
the sign to get the lnL (the lnL values you write down should be negative). Use the following
values for the number of parameters (K) for the three models: JC: K=0, F81: K=3, K80: K=1. Now, use
the recipe above to compute AIC values and model probabilities. Report all values. Based on the
model probabilities: wich model has more support? How much better is the best model compared to the
alternatives (what are the ratios between their probbailities)?
Use modeltest program to select best model:
What you just did manually for JC, F81 and K80, the program modeltest does automatically
for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods
in the score-file to compute AIC and model probabilities, and uses this
to select the model that best fits the sequence data.
modeltest < model.scores > modeltest.results
This command executes the modeltest program using model.scores as the input file and
modeltest.result as the output file.
nedit modeltest.results &
This file contains the output of the modeltest program. The first part of the file shows model
selection results based on the so-called likelihood ratio test method. Scroll past that
until you find the section containing the AIC-based results.
Q3: Note the model selected by modeltest based on the AIC values.
Now locate the line containing the PAUP command corresponding to the best model.
The command is enclosed between "BEGIN PAUP" and "END;" and should look somethign like
"Lset base=(....". (It is located right above a table showing the AIC model probabilities.)
You will need to copy this command to a PAUP session in the
Above you used modeltest to select the most suitable substitution model for
the present data set. You will now use this model to construct a maximum likelihood
tree. You will again use PAUP for this purpose.
Set tree-building criterion to maximum likelihood:
Set model parameters to winning estimates:
Above you located a set of lines in the file modeltest.results
giving a PAUP command that sets the model parameters to the estimates that
were found using the winning model. Copy and paste this lset command
(without the BEGIN and END parts) into the window where PAUP is running.
PASTE LSET COMMAND FROM MODELTEST RUN HERE
Find best tree using selected model:
Still in the PAUP-window, enter the following command:
This command causes PAUP to perform a heuristic search for the best maximum
likelihood tree. Once an initial tree has been constructed, the heuristic
search proceeds by rearrangements of the "nearest neighbor interchange" type
(NNI). We are using the model selected by modeltest, AND the parameter
estimates found by modeltest on that model. You could also have chosen to
simply estimate all the model parameters as part of this step (i.e., at the
same time as finding the best tree), but fixing them improves speed
tremendously. Findind the best tree should take about 5-10 minutes.
Save best tree to file in phylip tree format:
savetrees format=phylip brlens=yes file=gp120tree.phy
(Optional) View tree:
You have now produced an unrooted tree of the HIV sequences in the file gp120tree.phy.
In this exercise we will not be interested in the tree as such - our focus is instead on finding
positive selection on a subset of codon positions and the tree is just something we need
in order to be able to fit the different codon models to the data. However, if you want to see
the tree you can do so using the following approach:
Copy the contents of the file gp120tree.phy to a file on your local computer.
Install the FigTree tree viewer on your computer
Open the tree file in FigTree
Detection of positively selected sites in gp120
There is much more to phylogenetic analyses than merely reconstructing
trees. One interesting result of probabilistic methods, is that the parameters
of a model will have their values determined as part of the optimization
procedure. This means that once such a model has been fitted to the data, it is
possible to investigate these estimated parameter values to learn features
about the evolutionary history of the sequences under investigation. In the
present example we will focus on investigating whether we can find positively
selected sites in our data set, defined as sites where the dN/dS ratio is
larger than 1. We do that by using a codon substitution model where the dN/dS ratio
is one of its parameters.
A further strength of the probabilistic approach is that you get a
probabilistic measure of how well any model fits the data. This means you can
use a stringent approach to determine which model fits the data best.
In this framework one uses likelihoods
(probabilities of data given model) to
determine which model fits the data best. As you saw above, it is for instance possible
to compute AIC values and model probabilities from the likelihood values of fitted models,
Since each model essentially corresponds to a hypothesis about the
evolutionary history of the data, we can thus use a stringent statistical
approach to decide which hypothesis best describes our data.
In outline, you will now use the following steps to investigate whether
there is any evidence for positively selected codons in your data set:
- Fit model M1, which assumes there are two classes of codons in the sequence:
some with dN/dS<1, some with dN/dS=1.
- Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1,
and one extra class with dN/dS>1.
- Assess the strength of evidence for the two models using AIC-based model probabilities
- If M2 is better: identify the positively selected codons
Inspect the parameter file:
nedit codeml.ctl &
The file "codeml.ctl" contains several settings that are relevant for running the
seqtype = 1: tells the program that our data consists of coding DNA.
NSsites = 1 2 : tells the program to analyze models M1 and M2.
cleandata = 1: tells the program to ignore positions with gaps.
The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says
there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS<1 (codons
under purifying or negative selection), and one class with dN/dS=1 (no selection - neutrally evolving
sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with
dN/dS<1, one class with dN/dS=1 (these are the same type of classes as for M1), and one class with dN/dS>1
(corresponding to sites under positive selection). The value of the dN/dS ratios (for those classes that
have dN/dS<1 or dN/dS>1), the fraction of sites belonging to each class, and the position of sites
belonging to each class, are unknown at first and will be determined during the analysis.
Start the analysis
This will start the codeml program using the settings in the file codeml.ctl.
Depending on system load and the exact topology of your phylogenetic tree this will take
somewhere around 20 minutes or so. (You may be able to see how the optimization procedure
results in progressively better fits: the likelihood increases, meaning that negative
log-likelihood decreases, as the fit improves).
Inspect result file
Wait for the run to finish, and then look at the result file:
nedit selection.results &
This file contains a wealth of information concerning your analysis. The top
part of the file gives an overview of your sequences, codon usage and
nucleotide frequencies. You can ignore this information for now, and move on to
the interesting part, namely the model likelihoods and parameter values:
Use nedit search menu to find likelihood, K, and dN/dS ratios for model M1:
Search ==> Find... ==> enter "Model 1" and click Find
You are now in the region of the result file where the model likelihoods and
parameter estimates are noted. Now, locate a line that looks a bit like the
one shown below:
lnL(ntime: 72 np: 74): -4242.470345 +0.000000
Q4: Note the number of "free parameters", K, used in model M1: This is indicated by
"np", and is 74 in the example shown above (most of these parameters are branch lengths
in the tree; specifically, the number of branch length parameters is indicated by "ntime", and
is 72 in this example). Also note the log-likelihood of the fitted
model. This is the number right after the parenthesis, and is -4242.470345 in the example here.
Now, scroll down a few lines until you get to a small table similar to
dN/dS for site classes (K=2)
p: 0.75111 0.24889
w: 0.06583 1.00000
This gives a summary of the dN/dS ratios that were found in the data set. The line
starting w: lists the two dN/dS ratios that were found (in this case 0.06583
and 1.00000 - the last one was pre-specified by us as part of the model and was therefore not
a free parameter). The line starting p: gives the proportion of codon sites belonging to each
of the dN/dS ratio classes (in the example above approximately 75% belong to the first
class and 25% of all sites belong to the class having
Q5: Note the dN/dS value and proportion of sites for both classes.
Find likelihood, K, and dN/dS ratios for model M2:
Scroll past the M1 output until you get to the results for model M2.
Q6: Note the number of free parameters and the log-likelihood of model
Now, scroll down a few lines until you get to a small table similar to
the one you examined for M1 before. For this model there are 3 separate
classes of codons.
Q7: Note the dN/dS value and proportion of sites for all three classes.
Assess strength of evidence for models M1 and M2:
M2 will always have a better (higher) log-likelihood than model M1 because M2 has more
free parameters. You should now use the recipee given above to compute AIC values
and model probabilities.
Q10: Note the selected model (is M2 better
Examine list of positively selected sites
If your M2 is clearly better than M1 (we firmly believe it should be if you did things
according to our instructions...), then you have evidence for the existence of positively selected
sites in the gp120 gene. That's not bad for a few hours of work! Now, scroll down to the end of the
result file and locate a list similar to this one:
Bayes Empirical Bayes (BEB) analysis
Positively selected sites
Prob(w>1) mean w
25 A 0.959* 3.133 +- 0.769
27 P 0.906 2.990 +- 0.877
56 K 0.987* 3.197 +- 0.687
59 V 0.915 3.032 +- 0.873
78 R 0.637 2.351 +- 1.129
88 K 0.573 2.148 +- 1.077
95 V 0.925 3.046 +- 0.843
This gives you a list of which residues (if any) that were found to belong
to the positively selected dN/dS-class. Also listed is the probability that
the site really is in the dN/dS class with w>1, and a weighted average of the w
at the site. Using only DNA sequences you have now identified likely epitopes
on the gp120 protein.
Q11: Note all sites having more than 90% probability of belonging to the
positively selected class