|
Viral Evolution: Using Phylogenetic Analysis to Detect Epitopes
Anders Gorm Pedersen (gorm@cbs.dtu.dk)
Claus Lundegaard (lunde@cbs.dtu.dk)
Overview
In this exercise you are going to investigate features of HIV-1 evolution. You will do this
by analyzing a large set of env-genes from HIV-1, subtype B isolated from a number of
different patients. Specifically, the DNA sequences analyzed here correspond to a region
surrounding the hypervariable V3 region of the gp120 protein. Using so-called probabilistic
modelling techniques (Bayes and maximum likelihood) you will attempt to find putative
epitopes. A major goal with the exercise is to introduce you to concepts and tools that
are important for phylogenetic analyses of viral DNA, and to make you aware that phylogenetic
analysis is much more than merely constructing evolutionary trees.
Specifically, you will
- perform a multiple alignment of gp120 protein sequences (using ClustalX).
- use the protein-alignment as a template for constructing a DNA alignment (using revtrans).
- construct a phylogenetic tree based on the DNA alignment (using MrBayes).
- try to detect positively selected sites in gp120 (using PAML).
The exercise involves using software on a UNIX-machine, but don't panic!
Everything you need to know is explained here. Just take your time and carefully follow
the instructions. I suggest you constantly keep this page visible somewhere on your
screen (resize the browser window) so you can switch back and forth between reading
instructions and performing the exercise.
Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded
RNA genome packaged inside a protein core, or capsid. The core particle also contains
viral proteins that are essential for the early steps of the virus life cycle, such as
reverse transcription and integration. A lipid envelope, derived from the infected cell,
surrounds the core particle. Embedded in this envelope are the surface glycoproteins of
HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to
target cells, while gp41 is important for the subsequent fusion event. It is the
specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the
immune system that express CD4 on their surface (e.g., T-helper lymphocytes,
monocytes, and macrophages).
The role gp120 plays in infection and the fact that it is situated on the surface of
the HIV particle, means it is an obvious target for the immune response. That means that
there may be a considerable selective pressure on gp120 for creating immune-escape
mutants, where amino acids in the gp120 epitopes have been substituted. One goal of
todays exercise is to investigate whether you can detect such a selective pressure on
parts of gp120.
Change to today's working directory, and have a look at which files are there:
cd exercise3
ls -l
Here, you will find two data-files: gp120.prot.fasta and
gp120.DNA.fasta. Have a look at the DNA data file:
nedit gp120.DNA.fasta &
The file is in the so-called fasta-format, and contains several DNA sequences
from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a
region surrounding the hypervariable V3 region in the gene encoding gp120. Now,
close the nedit window and have a look at the protein data file:
nedit gp120.prot.fasta &
This file contains translated versions of the DNA-sequences in
gp120.DNA.fasta. The file was prepared by us in advance but there are also
webtools for doing this type of virtual translation of DNA sequences - if you want
to try it at home (see list of resources at the end of
this page). When you've had a look, then close the nedit window so it
doesn't clutter your screen.
In this part of the exercise, we will use the program ClustalX to make a multiple
alignment of the gp120 protein sequences.
Start ClustalX
clustalx &
Load the protein sequence file
File ==> Load sequences ==> select the file gp120.prot.fasta
Start the multiple alignment:
Alignment ==> Do complete alignment ==> Align
Inspect the alignment:
When the alignment is done you can inspect the result by scrolling along the
sequences in the ClustalX window. In this case the starting point was a set of very
similar sequences so not much has changed, but you will notice that gaps have been
inserted here and there.
Note: The starting point of all phylogenetic analyses is a multiple
alignment, and your results will be no better than the quality of that initial
alignment. Although you do not need to do so in this exercise, it is always a good idea
to manually check your alignment for errors. To this end Clustalx gives you the
opportunity of interactively realigning selected residue-ranges or selected
sequences.
Creating a DNA alignment from aligned protein sequences
The above multiple alignment was made using protein sequences: DNA
sequences are a lot less informative than protein sequences and for this
reason it is always preferable to align coding DNA in translated form.
(More details).
However, in the context of molecular evolution, DNA alignments retain a
lot of useful information regarding silent mutations. Especially the
ratio between silent and non-silent substitutions is informative.
Therefore, we now want to perform the phylogenetic analysis using the DNA
sequences instead. In order to do that we first have to construct a
multiple DNA-alignment using the protein alignment as a template.
For this purpose we will use the program revtrans which takes as input
an unaligned set of DNA sequences (in fasta-format) and an aligned set of protein
sequences (also in fasta-format) and produces a DNA alignment that is in accordance with
the protein alignment. This means that gaps are always inserted in groups of three so
reading frames are kept in order.
Construct the DNA alignment from the protein-alignment:
revtrans gp120.DNA.fasta gp120.prot.aln > gp120.DNA.aln.fasta
Have a look at the result:
clustalx gp120.DNA.aln.fasta &
Note how the gaps occur in groups of three (codon-sized).
Convert the DNA alignment to nexus format:
fasta2nexus gp120.DNA.aln.fasta > gp120.DNA.aln.nexus
The next program you will use requires that its data is in the so-called nexus format.
Have a look at how you could have done that more simply:
http://www.cbs.dtu.dk/services/RevTrans/
Instead of doing the above steps manually, you could have used the
web-interface for revtrans. This server allows you to simply submit your
unaligned DNA sequences. The server will then automatically translate the DNA
sequences, align them at the protein level, and construct the DNA alignment from
the protein alignment. The server gives you the opportunity of dowloading the
resulting alignment in several formats.
In this part of the exercise we will use the program MrBayes to produce a
phylogenetic tree. MrBayes uses an interactive interface where you type commands at a
prompt. It is a program that uses a probabilistic approach for estimating phylogenies,
meaning that the user specifies a model of sequence evolution and the program then
finds good trees according to that model.
Here, we will first let MrBayes produce a large set of possible trees, and then we use
the best subset of these to construct a "consensus tree" that will be used for further
analysis.
Start Mrbayes:
mb
This starts the program, giving you a prompt ("MrBayes> ") where you can
enter commands. Each command has to be terminated by hitting the "Enter button".
(Entering "help" gives you a list of possible commands).
Load your sequences:
execute gp120.DNA.aln.nexus
Specify your model of sequence evolution:
set autoclose=yes
charset 1stpos=1-.\3
charset 2ndpos=2-.\3
charset 3rdpos=3-.\3
partition bycodon = 3:1stpos,2ndpos,3rdpos
set partition=bycodon
prset ratepr=variable
lset nst=6
The above commands essentially means we are using a model where each possible pair of
nucleotides have distinct substitution rates, and where the three codon positions
furthermore also have distinct rates. The exact values of these rates will be
determined from the data during the optimization procedure.
Search tree-space:
mcmc ngen=100000 samplefreq=100 nchains=4 filename=mrtree
This command essentially starts 4 parallel tree-search paths ("chains"), lets the
search run for 100,000 rounds ("generations") and saves a tree once every 100 rounds
(meaning that a total of 1000 trees will be saved). In real life you would let the
program run for more generations (at least 500,000 or so).
What you are actually doing here is to use a method known as MCMCMC
("Metropolis-coupled Markov chain Monte Carlo") to empirically determine
the posterior probability distribution of trees, branch lengths and
substitution parameters. There's a lot of interesting information you can
extract from the resulting output, but for the purpose of this exercise we will
limit ourselves to contruct a consensus tree.
When the run starts you will see reports about the progress of the four chains. Each line of
output lists the generation number and the log likelihoods of the current tree/parameter
combination for each of the four chains. The run will take a few minutes to finish.
Quit program:
quit
Determine "burnin":
We now have to determine which subset of trees to use for building our consensus tree.
This is done by inspecting the likelihood values of the trees saved by MrBayes and then
selecting a good subset. (More details).
mbplot mrtree.p
This shows you how the likelihood value increases through the generations, to finally reach
a plateau. Look at the x-axis to determine the number of generations required before the
plateau is reached. Divide this number by 100, write down the result, and close the
window.
Restart Mrbayes:
mb
Load your sequences again:
execute gp120.DNA.aln.nexus
Construct consensus tree
The burnin value (XX below) is the number of trees required to reach the
plateau, not the number of generations. You get the number of trees by dividing
the number of generations with 100 (recall that we only saved trees once every 100
generations). For instance, a burnin of 6000 generations correspond to 60 trees.
In the line below, substitute the number you wrote down for "XX".
sumt filename=mrtree.t contype=halfcompat burnin=XX
Quit program:
quit
Convert the consensus tree to paml-format:
~gorm/projekter/bin/mb2paml2 gp120.DNA.aln.nexus.con | sed 's/[0-9]\.[0-9][0-9]//g' > contree.paml
This converts the treefile into a format that is understood by the program you will
use in the next step.
Detection of positively selected sites in gp120
There is much more to phylogenetic analyses than merely reconstructing trees. One
interesting result of probabilistic methods, is that several of the parameters of any
model can have their values determined as part of the optimization procedure. This means
that once such a model has been fitted to the data, it is possible to investigate these
values to learn features about the evolutionary history of the sequences under
investigation. A typical (but perhaps slightly boring) example would be the
transition/transversion rate ratio. In the present example we will focus on
investigating whether we can find positively selected sites in our data set, defined as
sites where the dN/dS ratio is larger than 1.
A further strength of the probabilistic approach is that it is possible to fit different
models to a given data set and then compare their likelihoods to determine which model fits
the data best. Briefly, if the models are "nested", then it is possible to compare their
likelihood values using a chi-squared test, thus using statistics to see if one model is
significantly better than an alternative model. Recall that each model essentially
corresponds to a hypothesis about the evolutionary history of the data, and we can thus use a
stringent statistical approach to decide which hypothesis best describes our data.
Convert DNA alignment to phylip format (and remove columns with gaps):
cat gp120.DNA.aln.fasta | degap | fasta2phylip > gp120.DNA.aln.phylip
Inspect the parameter file:
nedit codeml.ctl &
The file "codeml.ctl" contains several settings that are relevant for running the
program codeml. Here are some of the more important ones:
seqtype = 1: tells the program that our data consists of coding DNA.
NSsites = 0 3: tells the program to analyze models M0 and M3.
The settings entered by us will cause codeml to analyze two hypotheses about
dN/dS ratios. M0 says there is one single dN/dS ratio for all sites in the sequence. M3
says there are 3 distinct dN/dS ratios for different sites in the sequence. The value of
the three ratios and the position of sites belonging to either, will be determined during
the analysis.
Start the analysis
codeml
This will start the codeml program using the settings in the file codeml.ctl.
Depending on system load and the exact topology of your consensus tree this will take
somewhere around 20 minutes or so. Lean back and enjoy a cup of coffee!
(You may be able to see how the optimization procedure results in progressively better
fits: the likelihood increases, meaning that negative log-likelihood decreases, as the
fit improves). When codeml has finished you can move on to the next step:
Inspect result file
nedit selection.results
Make sure to turn off line-wrapping: Preferences ==> Wrap ==> None.
This file contains a
wealth of information concerning your analysis. The top part of the file gives an overview of your
sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and
move on to the interesting part, namely the model likelihoods and parameter values:
Find likelihood and dN/dS ratio for model M0:
Search ==> Find... ==> enter "Model 0" and click Find
You are now in the region of the result file where the model likelihoods and
parameter estimates are noted. Now, locate a line that is similar to the one
shown below:
lnL(ntime: 72 np: 74): -4242.470345 +0.000000
and note down the number of "free parameters" used in model M0: This is indicated by
"np", and is 74 in the example shown above. Also note the log-likelihood of the fitted
model. This is the number right after the parenthesis, and is -4242.470345 in the example
above.
Now, scroll down a few lines until you get to a table that is similar to the one
shown below:
dN & dS for each branch
branch t S N dN/dS dN dS S*dS N*dN
46..1 0.000 91.6 370.4 0.6020 0.0000 0.0000 0.0 0.0
46..2 0.000 91.6 370.4 0.6020 0.0000 0.0000 0.0 0.0
46..8 0.000 91.6 370.4 0.6020 0.0000 0.0000 0.0 0.0
...
Note down the dN/dS ratio from the column "dN/dS" (0.6020 in the example above). This
is the average ratio estimated under the assumption that all codon sites in the gene are
under identical selective pressure. Is your value above or below 1?
Find likelihood and dN/dS ratio for model M3:
Scroll past the M0 table until you get to the results for model M3. Again note down
the number of free parameters and the log-likelihood of the model.
Now, scroll down a few lines until you get to a small table similar to
this:
dN/dS for site classes (K=3)
p: 0.49485 0.36975 0.13540
w: 0.06583 0.66691 3.17311
This gives a summary of the dN/dS ratios that were found in the data set. The line
starting w: lists the three dN/dS ratios that were found (in this case 0.06583, 0.66691,
and 3.17311). The line starting p: gives the proportion of codon sites belonging to each
of the dN/dS ratio classes (in the example above approximately 49% belong to the first
class, 37% belong to the second class, and 14% of all sites belong to the class having
dN/dS=3.17311). Do you have any class of dN/dS values with a value that is above 1?
Investigate whether model M3 is significantly better than M0:
M3 will always have a better (higher) log-likelihood than model M0 because the two
models are nested and M3 has more free parameters. To determine if it is also
significantly better, first compute twice the difference between the
log-likelihoods of M0 and M3. This will be your test statistic for the chi-squared test
you will perform in a minute.
For instance, if M3 has a log likelihood of -3768.977660
and the value for M0 is -3961.012861, then twice the difference is:
2*(-3768.977660 - (-3961.012861)) = 384.070402
Now, compute the difference in number of free parameters. For instance, if
M0 used 71 free parameters and M3 used 75 then:
df = 75 - 71 = 4
This is the degrees of freedom you will use in the chi-squared test.
Finally, perform the test by comparing your statistic to a chi-squared table
using the relevant number for degrees of freedom. A chi-squared table can be
found here if you don't have one
handy. In the example shown above, we have the statistic=384.070402, which is much larger
than the largest critical value in our chi-squared table under df=4, which
means that M3 is (very) significantly better than M0 (p<<0.1%).
Examine list of positively selected sites
If your M3 model has a dN/dS class with a value larger than 1 (we firmly believe it should
have if you did things according to our instructions...), and if M3 was found to be
significantly better than M0, then you have evidence for the existence of positively selected
sites in the gp120 gene. Positive selection in this gene is most likely the result of the
selective pressure from the immune system: if an epitope remains unchanged, then antibodies
against it will eventually be constructed and the virus will be removed. If the epitope
constantly changes then the virus will avoid being recognized by the immune system. Now,
scroll down to the end of the result file and locate a list similar to this one:
Positively selected sites Prob(w>1):
7 K 0.6741
50 L 0.9542*
52 D 0.8680
54 V 0.9991**
64 S 0.9516*
68 N 1.0000**
100 H 0.8655
...
This gives you a list of which residues (if any) that were found to belong to the
positively selected dN/dS-class. Using only DNA sequences you have now identified amino acids
that are probably located in epitopes on the gp120 protein. That's not bad for approximately
2 hours of work!
Downloadable software
Tree Viewers
- TreeView:
tree-viewer with user-friendly, graphical interface.
- PhyloDraw:
another tree-viewer.
- ClustalX:
multiple alignment and distance-based tree-reconstruction (neighbor joining).
- Phylip:
reconstruct trees using maximum likelihood, parsimony, or distance methods.
- PAUP*:
reconstruct and analyze trees using maximum likelihood, parsimony, or distance methods.
- PAML:
analyze phylogenetic hypotheses using maximum likelihood.
- MrBayes:
Bayesian inference of phylogeny.
The program revtrans was written by Rasmus Wernersson here at CBS
(raz@cbs.dtu.dk).
All other scripts were written by Anders Gorm
Pedersen, and Claus Lundegaard, and can bee
seen in their entirety by cd'ing to the directory named "~immu00/bin"
Note: These are quick and dirty, no error-checking type of scripts that
will crash if you so much as raise your voice, so use them with caution...
|