The Phylogeny of HIV
Anders Gorm Pedersen (email@example.com)
During this exercise you will:
- perform a multiple alignment of gp120 protein sequences from HIV and SIV
- construct an unrooted tree from the alignment of gp120 sequences
(using the "neighbor joining" algorithm in ClustalX)
- visualize the gp120-based tree using the program njplot
- consider the evolutionary implications of the gp120-based tree
- investigate the robustness of your tree by bootstrapping
- construct a rooted tree from the alignment of gp120 using the UPGMA algorithm
implemented in the PHYLIP package.
- perform a second multiple alignment based on POL protein sequences from HIV and SIV
- construct a new neighbor joining tree from the POL alignment
- investigate whether the POL-based tree supports the conclusions from the gp120-based analysis
- perform a multiple alignment of the same POL sequences and a POL
sequence from HTLV-1.
- root the POL-based tree using the HTLV sequence as an outgroup
Background: AIDS, HIV-1, HIV-2, and SIV
Acquired Immune Deficiency Syndrome (AIDS) is caused by two divergent viruses, Human Immunodeficiency
Virus type 1 (HIV-1) and Human Immunodeficiency Virus type 2 (HIV-2). HIV-1 is responsible for the global pandemic,
while HIV-2 has, until recently, been restricted to West Africa and appears to be less virulent.
Viruses related to HIV have been found in many species of non-human primates (monkeys, apes, ...) and have been named Simian
Immunodeficiency Virus, SIV.
These primate viruses are lentiviruses, a subfamily of the retroviruses. Retroviruses have RNA genomes but
are unique among RNA viruses because they have a replication cycle that involves the reverse transcription of
their RNA genome into DNA (this is the opposite direction compared to the usual flow of information from DNA to RNA).
The reverse-transcribed viral DNA is stably incorporated into the genomic DNA
of an infected cell and subsequent transcription can then create multiple
copies of mRNA encoding new viral material.
Like other retroviruses, particles of HIV are made up of 2 copies of the single-stranded RNA genome
packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for
the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived
from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of
HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells. It is the
specific affinity of gp120 for the CD4 protein that targets HIV to those cells of
the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).
Purpose of exercise, description of data
In this exercise you are going to investigate the phylogenetic relationship
between HIV and SIV. You will do this using two different data sets:
- a set consisting of 27 different gp120 protein sequences from isolates of
HIV-1, HIV-2, chimpanzee SIV and macaque SIV.
- a set consisting of 20 different POL-polyprotein sequences from HIV-1, HIV-2,
chimpanzee SIV and sooty mangabey SIV.
(Note for enthusiasts: a number of lines of evidence have indicated that
macaques are not naturally infected with SIV and that they have acquired their SIV
infection while in captivity by cross-species transmission of SIV from sooty
mangabeys. This means that both the macaque SIVs and the sooty mangabey SIVs
originate from sooty mangabeys).
Finally, the exercise
Change to this afternoon's working directory and have a look at what it contains:
First, we will use the file gp120.fasta, which contains 27 gp120 envelope
protein sequences from isolates of HIV-1, HIV-2, and SIV in fasta-format. Take
a look at its contents:
In this file, all HIV-1 sequences have names starting HV1. All HIV-2 sequences
have names starting HV2. SIVCZ was isolated from chimpanzee. SIVMK, SIVM1, and
SIVML were isolated from macaques.
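If you want to check the composition of a FASTA file without reading it by eye, the naming convention described above can be tallied with a few lines of Python. This is only a minimal sketch: the helper names read_fasta and count_by_prefix are made up for illustration, and the three short sequences are invented stand-ins, not entries from gp120.fasta.

```python
from collections import Counter


def read_fasta(lines):
    """Parse FASTA-formatted lines into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The sequence name is the first word after ">".
            words = line[1:].split()
            name = words[0] if words else "unnamed"
            records[name] = ""
        elif name is not None:
            # Sequence data may be wrapped over several lines.
            records[name] += line
    return records


def count_by_prefix(names, length=3):
    """Count how many sequence names start with the same characters."""
    return Counter(name[:length] for name in names)


# Invented example data; the real input would be the lines of gp120.fasta.
example = """>HV1XX_demo
MRVKGIQ
RNCQHLW
>HV2YY_demo
MMNQLLI
>SIVZZ_demo
MGCLGNQ
""".splitlines()

records = read_fasta(example)
print(count_by_prefix(records))
```

With the real file you would pass the lines of the opened file to read_fasta instead of the example list.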
We will use the program ClustalX to make a multiple alignment. (ClustalX
is actually a graphical front-end to the command-line program CLUSTALW.)
The program is interactive, with a typical, relatively user-friendly,
windows-style interface. ClustalX has online help available from the pull-down menus.
Start ClustalX with the following command (be sure to include the
"&" which will make the program run in the background, so we can start
more programs from the same shell window.)
This opens the program in a separate window.
The first thing you have to do is load the sequences. In the "File" menu
choose "Load sequences", and select gp120.fasta from the file list.
The sequences are displayed on the screen with names on the left
hand side, and the sequences themselves on the right. Residues are
colored according to amino acid characteristics. It is possible to resize
the window by "pulling" at the edges, so you can fit all 27 lines in one
window. A scroll-bar at the bottom of the window allows you to move along the
alignment (the sequences are too long to fit in the horizontal direction).
Beneath the sequences there is a ruler starting at 1 for the
first residue position. Below this is a graphical indication of the degree of
conservation in each column of the alignment. A high score indicates a
well-conserved column; a low score indicates low conservation. Since the
sequences have not yet been aligned (they are all just lined up starting with
their first residues), most values are quite low.
Have a look at the alignment parameters:
From the "Alignment" pull-down menu choose "Alignment parameters", and then
"Pairwise alignment parameters". In this window you are able to change
gap-penalties and substitution matrix for the initial, pairwise part of the
multiple alignment. You can also
specify whether you want the pairwise alignments performed using a slower but
accurate method, or a faster but approximate method.
For now just leave everything as is.
Exit the window by clicking the tab labeled "Close".
There is a similar window for changing the multiple alignment parameters
("Alignment", "Alignment parameters", "Multiple alignment parameters".) Have a
look, but keep the default values for now.
Specify that you want the alignment output in all available formats:
From the "Alignment" pull-down menu choose "Output format options", and
select all possible file formats (selected formats have checkmarks next to them).
Start the multiple alignment:
From the "Alignment" menu choose "Do complete alignment". This opens a
window giving you the opportunity to rename output files. Accept the suggested
names and start the alignment by clicking the tab labeled "Align".
You may be able to get a glimpse of how ClustalX is working in the bottom
part of the window: first, it does all the pairwise alignments.
These alignments are then used to construct a "guide
tree". (The guide tree should not be confused with the phylogenetic tree we
will construct later. The guide tree is entirely based on the pairwise
alignments, and is used to guide the construction of the multiple alignment).
Finally, ClustalX constructs the multiple alignment by progressively "aligning
alignments" following the guide tree.
Depending on system load, this alignment step may take anywhere between 20
seconds and 5 minutes. If waiting gets to be too painful you can change the
pairwise alignment parameters to "fast but approximate", but then you'll also
get a poorer alignment.
Inspect the alignment:
When the alignment is done you can inspect the result by scrolling along the
sequences in the ClustalX window. You will notice that the conservation graph at
the bottom of the window now has several peaks and plateaus corresponding to the
conserved regions of gp120. An additional summary of conservation is given above
the sequences. "*" indicates positions which have a single, fully
conserved residue, ":" indicates positions with strong functional
conservation, and "." indicates positions with weaker functional
conservation. Don't close ClustalX after inspecting the alignment (you can iconize
it if it clutters the screen).
In addition to the alignment shown in the ClustalX window, you also produced
several text-files with the alignment in different formats. These can be used
as input to other programs (for instance alignment "prettifiers" such as
BOXSHADE, or other programs for constructing phylogenetic trees). Have a look
at the different formats by looking at the text-files: gp120.gde, gp120.msf,
gp120.pir, gp120.phy and gp120.aln. For instance use nedit or
less in the shell window.
Computing an unrooted tree
In this part of the exercise we will use ClustalX to produce a phylogenetic
tree. The tree is built with the neighbor-joining algorithm, and is
based on distances computed from the multiple alignment you just made.
Re-open the ClustalX window containing the multiple
alignment from before.
(Don't panic if you accidentally closed the ClustalX program after the
multiple alignment: start the program as before, and use "File",
"Load sequences" to load the alignment file: "gp120.aln")
Select output of treefile and distance matrix:
From the "Trees" menu choose "Output format options" and select "PHYLIP
format tree" and "PHYLIP distance matrix". Exit the window by clicking the
tab labeled "Close".
Construct the tree:
From the "Trees" menu choose "Draw N-J tree". This gives you a window where you
can change the name of the tree-file. Accept the default and click "OK" to
calculate the tree. ClustalX uses the multiple alignment to calculate a distance
matrix with all pairwise distances between the 27 sequences, and then constructs a
tree by progressively clustering sequences that are close to each other (using the
neighbor-joining algorithm). (Note: the construction of the NJ-tree is so fast that
it is practically finished the second you have clicked "Draw N-J tree", so there is no need
to sit around and wait for the result.)
Inspect the output files:
This treefile is in a text-based format that is
obviously mostly meant for computers. Also look at the file containing
all pairwise distances between all sequences:
These are the numbers that were used by the neighbor joining algorithm for
constructing the tree.
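To make the idea of a distance matrix concrete, here is a minimal Python sketch that computes the fraction of differing residues for every pair of sequences in a toy alignment. It is an illustration under stated assumptions, not ClustalX's actual code: the function names and the three short sequences are invented, columns with a gap in either sequence are simply skipped, and no correction for multiple substitutions is applied.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of differing residues, skipping columns with a gap in either sequence."""
    compared = 0
    different = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            different += 1
    return different / compared if compared else 0.0


def distance_matrix(alignment):
    """Return {(name_x, name_y): distance} for all pairs of aligned sequences."""
    names = list(alignment)
    return {(x, y): p_distance(alignment[x], alignment[y])
            for x in names for y in names}


# Invented toy alignment (all sequences have the same aligned length).
toy = {"seqA": "MKV-LT", "seqB": "MKVALT", "seqC": "MRIALS"}
dist = distance_matrix(toy)

for (x, y), value in dist.items():
    if x < y:
        print(x, y, round(value, 3))
```

A clustering method such as neighbor joining only ever sees numbers like these, never the sequences themselves.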
- View a plot of the unrooted tree:
unrooted gp120.ph &
The UNIX-program unrooted is a very simple viewer for unrooted trees.
- Think for a minute about the implications:
What does this tree tell us about the phylogenetic
relationship of HIV-1, HIV-2 and SIV? Notice especially where the two different
groups of SIV cluster compared to the two different groups of HIV.
When you've thought about the problem, you can read the discussion
that I've prepared. Additionally, you can find good descriptions of HIV
evolution on these two sites:
Bootstrapping a neighbor joining tree
ClustalX also has the possibility of bootstrapping your neighbor-joining tree.
Reopen your alignment from before (described above) and in
the Trees menu choose "Bootstrap NJ-tree".
This gives you a window where you can change the number of resampled data sets.
The default is 1000, but you may want to change this to 100 in order not to wait
for too long. Click OK to start the bootstrapping.
The idea in bootstrapping is to assess how well each branch in your tree is
supported by the data at hand. Imagine you have an alignment with 100 columns.
ClustalX performs bootstrapping by randomly picking one of these columns 100 times
in a row, with replacement, so that the same column may be chosen more than once.
This gives a resampled data set of the same size as the original data set, in which
some columns are present multiple times while other columns are absent. This is
repeated a large number of times (usually 500-1000), so that a lot of data sets are
constructed. For each of these data sets a tree is now constructed. Finally, a
consensus tree is made from the large set of trees. Basically the consensus tree
consists of taxonomic (monophyletic) groups that occur as often as possible in
the resampled data sets.
The tree printed out has on each branch a number indicating how many times the
sequences were divided into the two groups that are on either side of the branch.
For example, if 15 trees were constructed and a branch has the number 15, then the
sequences on one side of the branch were clustered together in all 15 trees, and so
were the sequences on the other side of the branch.
It has been found that the variation you get as a result of this type of
bootstrapping is similar to what you get when collecting new data, and it therefore
gives you an idea about how well supported the individual branches in your tree
are. It is, however, not a real significance value.
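The resampling step described above can be sketched in a few lines of Python. This is only an illustration of drawing alignment columns with replacement, not ClustalX's implementation; the function name bootstrap_columns and the toy alignment are invented for the example.

```python
import random


def bootstrap_columns(alignment, rng):
    """Return a resampled alignment whose columns are drawn with replacement."""
    length = len(next(iter(alignment.values())))
    # Pick `length` column indices at random; the same index may occur repeatedly.
    picks = [rng.randrange(length) for _ in range(length)]
    # Apply the same picks to every sequence so that columns stay intact.
    return {name: "".join(seq[i] for i in picks)
            for name, seq in alignment.items()}


# Invented toy alignment with 6 columns.
toy = {"seqA": "MKVALT", "seqB": "MKIALT", "seqC": "MRIALS"}

rng = random.Random(1)  # fixed seed only to make the sketch reproducible
replicate = bootstrap_columns(toy, rng)
for name, seq in replicate.items():
    print(name, seq)
```

In a full bootstrap analysis this function would be called once per replicate, a tree would be built from each resampled alignment, and the groupings found in those trees would be counted.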
View the bootstrapped tree:
Select "Bootstrap values", and you should now be able to see the tree with the
values attached to nodes. Remember that the number tells how often the data was
divided into the two groups present on either side of the branch. Please note that
this is still an unrooted tree; you are just using a rooted viewer since that is
the only one we have easily available that will display bootstrap values.
Computing a rooted tree
In this part of the exercise you will again construct a phylogenetic tree
from the multiple alignment of gp120. However, this time we will use a
program from the PHYLIP package, and furthermore we will
construct and plot a rooted tree. This time the starting point is the file
gp120.phdst which contains all pairwise distances calculated from the multiple
alignment by ClustalX.
- Provide PHYLIP with an input file with the expected name:
cp gp120.phdst infile
This copies the distance matrix to a file cleverly named
infile (PHYLIP programs all have very rigid requirements for file names).
- Start the PHYLIP program neighbor:
- Type 'N' to change the algorithm to 'UPGMA'.
(If we simply use the default 'Neighbor-joining' we will get the same result
as we already got using ClustalX). The neighbor program has a
text-based interface. After you've entered a command you should remember to
press the ENTER key.
- Type 'Y' to start the clustering.
NEIGHBOR now constructs an output file cleverly named 'treefile' in the
same format as the tree constructed above.
- Have a look at the rooted tree:
The UNIX-program njplot is a very simple viewer for rooted trees.
Was the root placed where you would expect it?
The UPGMA algorithm used for this part of the exercise assumes that the
sequences have evolved at similar rates (evolution has followed a "molecular clock"). That
assumption is often not valid, and generally you should probably not use the UPGMA
algorithm, which is also problematic in other ways. Another way of rooting the tree
(without assuming anything about evolutionary rates) is illustrated in the next
part of the exercise.
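To see why UPGMA always produces a rooted tree, it can help to look at the clustering itself. The following Python sketch is a minimal illustration and not the PHYLIP code: the function name upgma, the leaf names, and the distances are all invented. It repeatedly joins the two closest clusters, places the new node at half their distance, and averages distances weighted by cluster size.

```python
from itertools import combinations


def upgma(names, dist):
    """UPGMA clustering.

    names: list of leaf names.
    dist:  dict mapping (name_i, name_j), in `names` order, to a distance.
    Returns (tree, height): a nested tuple and the height of the root.
    """
    # Active clusters: id -> (tree, number of leaves, node height).
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    d = {}
    for (i, a), (j, b) in combinations(enumerate(names), 2):
        d[(i, j)] = dist[(a, b)]
    next_id = len(names)
    while len(clusters) > 1:
        # Find the closest pair of active clusters (ids are kept sorted).
        i, j = min(combinations(sorted(clusters), 2), key=lambda pair: d[pair])
        tree_i, n_i, _ = clusters.pop(i)
        tree_j, n_j, _ = clusters.pop(j)
        # The new node sits halfway between the two clusters (molecular clock).
        height = d[(i, j)] / 2
        # Distance from every remaining cluster: size-weighted average.
        for k in clusters:
            d_ik = d[(min(i, k), max(i, k))]
            d_jk = d[(min(j, k), max(j, k))]
            d[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j, height)
        next_id += 1
    tree, _, height = next(iter(clusters.values()))
    return tree, height


# Invented distances that happen to fit a molecular clock perfectly.
names = ["A", "B", "C", "D"]
dist = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
        ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
tree, height = upgma(names, dist)
print(tree, height)
```

The last join made by the loop is the root. If the sequences have not evolved at similar rates, that last join can end up in the wrong place, which is the weakness discussed above.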
Rooting a tree using an outgroup
In this part of the exercise you will use a data set of 20 different
POL-polyprotein sequences isolated from HIV-1, HIV-2, chimpanzee SIV, and sooty
mangabey SIV. (The Pol gene encodes three different polypeptides: integrase,
reverse transcriptase, and protease. It is expressed as a single polyprotein and is
subsequently cleaved by protease into its three separate parts).
First, you will construct a neighbor-joining tree like before and investigate
whether this new, independent data set confirms the conclusions you made based on
the alignment of gp120 sequences. Then you will add a POL-polyprotein sequence from
HTLV-1 to the data set and construct a new tree, that you can then root using the
HTLV sequence as an outgroup. (HTLV-1 is another retrovirus, but it is not a
lentivirus and is thus more distantly related to HIV, which, by the way, was
originally named HTLV-3.)
- Have a look at the POL sequence file:
As mentioned, this file contains POL-polyprotein sequences from HIV-1, HIV-2,
chimpanzee SIV, and sooty mangabey SIV.
- Align the POL sequences using ClustalX:
Re-open the ClustalX window and load the sequence file
hiv-siv-pol.fasta. Now, start the alignment by choosing "Do complete
alignment" from the "Alignment" menu as before.
- Construct a neighbor-joining tree with no outgroup:
From the "Trees" menu choose "Draw N-J tree".
- Inspect the unrooted tree:
unrooted hiv-siv-pol.ph &
This tree has been constructed from an entirely independent set of sequences.
Does it support the conclusions that could be made from the gp120-based tree?
- Have a look at the POL sequence file with an added outgroup:
This file contains the same sequences as the file hiv-siv-pol.fasta
plus an additional POL-sequence from the related virus HTLV-1. This
sequence will be used as an outgroup in this part of the exercise.
- Align the outgroup-containing data set using ClustalX:
Re-open the ClustalX window and load the sequence file
hiv-siv-htlv-pol.fasta. Now, start the alignment by choosing "Do complete
alignment" from the "Alignment" menu as before.
- Construct a neighbor-joining tree with an outgroup:
From the "Trees" menu choose "Draw N-J tree".
This step will still just produce an unrooted neighbor joining tree (try having
a look at it with unrooted). In order to root the tree on the outgroup we
first have to use a separate program:
- Prepare a tree file for the program retree:
cp hiv-siv-htlv-pol.ph intree
The program retree is also from the PHYLIP package and is used to make
various manipulations on tree-files (mostly aesthetic). Like other PHYLIP programs
it expects an input file with a certain name ("intree" in this case).
- Start retree:
This will start the program and list a set of plotting parameters. You can
accept the default values by entering "y" at the prompt. Alternatively you can
enlarge your shell window and change the parameters accordingly.
- Select the HTLV sequence as outgroup:
The option "o" selects a specific leaf as an outgroup. If you cannot locate
the HTLV sequence in your initial view of the tree, then you can move the viewing
window using the commands "j" (down) or "k" (up). When you've located the HTLV
node, enter "o" followed by its number at the prompt.
- Exit retree:
Enter "x" ("exit") at the prompt. Answer the next question with a "y" to
indicate that you want to save the tree. Finally enter "r" to indicate that you
want to save it as a rooted tree.
- Have a look at the tree:
Was the root placed in the same way as in previous analyses?
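Outgroup rooting itself is a simple operation: the unrooted tree is "picked up" at the branch leading to the outgroup, and everything else hangs below that point. The Python sketch below illustrates this on an invented five-leaf tree. It is not what retree does internally; the function name, the leaf names (A-D, OUT), and the internal node names (n1-n3) are all hypothetical.

```python
def root_on_outgroup(edges, outgroup):
    """Root an unrooted tree on the branch leading to the outgroup leaf.

    edges: list of (node_a, node_b) pairs describing the unrooted tree.
    Returns a nested tuple (outgroup, rest_of_tree).
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    def subtree(node, parent):
        # Everything reachable from `node` without going back through `parent`.
        children = [n for n in neighbors[node] if n != parent]
        if not children:
            return node  # a leaf
        return tuple(subtree(child, node) for child in children)

    (attachment,) = neighbors[outgroup]  # a leaf has exactly one neighbor
    return (outgroup, subtree(attachment, outgroup))


# Invented unrooted tree: leaves A, B, C, D, OUT and internal nodes n1-n3.
edges = [("n1", "A"), ("n1", "B"), ("n1", "n2"),
         ("n2", "OUT"), ("n2", "n3"),
         ("n3", "C"), ("n3", "D")]
rooted = root_on_outgroup(edges, "OUT")
print(rooted)
```

Note that no assumption about evolutionary rates is needed here: the position of the root follows entirely from the choice of outgroup.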
Bonus exercise (time permitting)
There is much more to phylogenetic analysis than merely reconstructing trees.
Using maximum likelihood methods, it is possible to infer many different aspects
of the evolutionary history of the sequences in a certain tree. For instance, you
can compute the relative ratios of silent and non-silent substitutions at certain
positions in a DNA-sequence, thus discovering which parts of the protein are under
positive or negative selection.
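As a rough illustration of what "silent" and "non-silent" mean, the Python sketch below compares two aligned coding sequences codon by codon and counts the codons that differ without changing the amino acid and those that do change it. This naive count is only meant to convey the idea and is not the maximum likelihood method: it ignores codons containing gaps or ambiguous bases and does not correct for multiple changes. The function name and the two short sequences are invented; the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def count_codon_changes(seq_a, seq_b):
    """Return (silent, non_silent) counts of differing codons."""
    silent = 0
    non_silent = 0
    shortest = min(len(seq_a), len(seq_b))
    for pos in range(0, shortest - 2, 3):
        codon_a = seq_a[pos:pos + 3]
        codon_b = seq_b[pos:pos + 3]
        if codon_a == codon_b:
            continue
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue  # skip gaps and ambiguous bases
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            silent += 1      # same amino acid
        else:
            non_silent += 1  # different amino acid
    return silent, non_silent


# Invented example: ATG AAA CTG versus ATG AAG ATG.
changes = count_codon_changes("ATGAAACTG", "ATGAAGATG")
print(changes)
```

In the example, AAA and AAG both encode lysine (a silent change), while CTG and ATG encode leucine and methionine (a non-silent change).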
This mini-exercise gives you a glimpse of that type of analysis: in the
subdirectory HIVNSsites (under malignment) you will find some files that are
necessary to investigate the selective pressure on the HIV gp120 protein.
Specifically, the data set was obtained from a single patient (there is
always a diverse population of viruses circulating within any one patient), and
contains 13 different V3 regions from the gp120 gene. The V3 region of gp120 is the
most variable part of the gene.
Note: since I didn't have time to test this part of the exercise, it is possible
that running times will be prohibitive. But if you have the time, then try it.
The program used here is part of Ziheng Yang's PAML package which can be found
- Have a look at the data set:
This file is in PAUP-format and contains both the 13 sequences and a
pre-computed tree describing their relationship (actually, it contains three
possible trees, but you will only be using the first of them). The tree (which does
not contain information about branch-lengths - only branching order) can also be
inspected on its own:
- Have a look at the parameter file:
This file contains the settings that will be used in the maximum likelihood
analysis of the silent/non-silent ratios. I won't go into the details.
- Run the analysis:
The program now tries to fit a codon-based model of the evolution of the V3
sequences to the specified tree. While it runs, the program prints out a summary of
how well it is doing. Specifically, the sixth column should be the negative
log-likelihood of the fit (a low negative log-likelihood value is a good one, and you
can see it slowly dropping).
- Have a look at the results:
The program saved the results of its run in this file. Scroll to the bottom
of the file and see if you can figure out what is going on. I'll explain