Exercise: "The Phylogeny of HIV"

Written by Anders Gorm Pedersen, modified by Thomas S. Rask


During this exercise you will:

  1. Perform a multiple alignment of gp120 protein sequences from HIV and SIV (using Clustal)
  2. Construct an unrooted tree from the alignment of gp120 sequences (using the "neighbor joining" algorithm in Clustal)
  3. Visualize the gp120-based tree using the program FigTree
  4. Consider the evolutionary implications of the gp120-based tree
  5. Investigate the robustness of your tree by bootstrapping
  6. Perform a second multiple alignment based on POL protein sequences from HIV and SIV
  7. Construct a new neighbor joining tree from the POL alignment
  8. Investigate whether the POL-based tree supports the conclusions from the gp120-based analysis
  9. Perform a multiple alignment of the same POL sequences and a POL sequence from HTLV-1
  10. Root the POL-based tree using the HTLV sequence as an outgroup

(Added by Rasmus Wernersson): Hand-in: Today we will again try submitting an assignment via CampusNet. You should therefore write a "log book" containing your answers. When you have finished the exercise, upload it to CampusNet under "Kursus 27611/Afleveringer".

As last time, use a plain text editor and do not spend a lot of effort on fancy formatting. The important thing is to separate the individual answers so that they are easy to read. See my example from last time: Multiple Alignment exercise.

Background: AIDS, HIV1, HIV2, and SIV

Acquired Immune Deficiency Syndrome (AIDS) is caused by two divergent viruses, Human Immunodeficiency Virus type 1 (HIV-1) and Human Immunodeficiency Virus type 2 (HIV-2). HIV-1 is responsible for the global pandemic, while HIV-2 has, until recently, been restricted to West Africa and appears to be less virulent in its effects. Viruses related to HIV have been found in many species of non-human primates (monkeys, apes, ...) and have been named Simian Immunodeficiency Virus, SIV.

These primate viruses are lentiviruses, a subfamily of the retroviruses. Retroviruses have RNA genomes but are unique among RNA viruses because they have a replication cycle that involves the reverse transcription of their RNA genome into DNA (this is the opposite direction compared to the usual flow of information from DNA to RNA). The reverse-transcribed viral DNA is stably incorporated into the genomic DNA of an infected cell and subsequent transcription can then create multiple copies of mRNA encoding new viral material.

Like other retroviruses, particles of HIV are made up of 2 copies of the single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).

Purpose of exercise, description of data

In this exercise you are going to investigate the phylogenetic relationship between HIV and SIV. You will do this using two different data sets:

  1. a set consisting of 27 different gp120 protein sequences from isolates of HIV1, HIV2, chimpanzee SIV and macaque monkey SIV:   gp120.fasta
  2. a set consisting of 20 different POL-polyprotein sequences from HIV1, HIV2, chimpanzee SIV and sooty mangabey SIV:   hiv-siv-pol.fasta
    and with the HTLV-1 sequence:   htlv-hiv-siv-pol.fasta

(Note for enthusiasts: a number of lines of evidence have indicated that macaques are not naturally infected with SIV and that they have acquired their SIV infection while in captivity by cross-species transmission of SIV from sooty mangabeys. This means that both the macaque SIVs and the sooty mangabey SIVs originate from sooty mangabeys).

Finally, the exercise

First, we will use the file gp120.fasta, which contains 27 gp120 envelope protein sequences from isolates of HIV-1, HIV-2, and SIV in fasta-format.

Create a working directory called multalign on your hard disk, download the gp120 file to this directory, and take a look at its contents (using a text editor such as Notepad or NEdit, or a sequence alignment editor such as Jalview or BioEdit).

In this file, all HIV-1 sequences have names starting with HV1, and all HIV-2 sequences have names starting with HV2. SIVCZ was isolated from chimpanzee. SIVMK, SIVM1, and SIVML were isolated from macaques.
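
As a quick check of the naming scheme described above, the sequences can be counted per group with a short script. The following is a minimal Python sketch, assuming that each FASTA header line starts with ">" followed directly by the sequence name (for example ">HV1EL"). The function names and the inline example records are hypothetical and are not taken from gp120.fasta.

```python
from collections import Counter


def read_fasta_names(lines):
    """Return the sequence names found in FASTA header lines.

    Assumption: the name is the first whitespace-delimited word after '>'.
    """
    names = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.append(fields[0])
    return names


def classify(name):
    """Map a sequence name to a virus group using the prefixes in the text."""
    if name.startswith("HV1"):
        return "HIV-1"
    if name.startswith("HV2"):
        return "HIV-2"
    if name.startswith("SIVCZ"):
        return "SIV (chimpanzee)"
    if name.startswith(("SIVMK", "SIVM1", "SIVML")):
        return "SIV (macaque)"
    return "unknown"


def count_groups(lines):
    """Count how many sequences fall into each virus group."""
    return Counter(classify(name) for name in read_fasta_names(lines))


# Made-up records for illustration only (not real gp120 sequences).
example = [
    ">HV1EL example",
    "MRVKEK",
    ">HV2BE example",
    "MMNQLL",
    ">SIVCZ example",
    "MKVMEK",
    ">SIVM1 example",
    "MGCLGN",
]
example_counts = count_groups(example)
```

To try this on the real data, open gp120.fasta and pass the file handle to count_groups in place of the example list. If the headers in the file carry extra text before the name, the classify function would need to be adapted.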

Multiple alignment

We will use the program Clustal to make a multiple alignment of the virus sequences. We have previously used the web interface provided by EBI. Today we will use the program ClustalX and thus run the program on our own computer. Windows, Mac, Linux/Unix, SUN or SGI versions of the latest ClustalX version (v1.83) are freely available for download. ClustalX is a graphical front-end to the command-line program ClustalW. The program is interactive, with a typical, relatively user-friendly, windows-style interface. ClustalX has online help available from the pull-down menu "Help".
  1. Download, unpack and start ClustalX

    Download from here the ClustalX package corresponding to your operating system (OS) (HINT: see the README file for which OS corresponds to which files). Place the package in the multalign directory.

    Unpack the ClustalX program files to a subdirectory in your working directory. On Mac/Unix/Linux/SUN systems this can be done by opening a terminal window, changing directory to the working directory (e.g., with the command cd multalign), and issuing the following commands (NOTE: alter the filenames in the commands to fit the package you downloaded):

    gunzip clustalx1.83.sun.tar.gz

    tar -xvf clustalx1.83.sun.tar

    The program files have now been unpacked and are ready to run. Start ClustalX with the following command (be sure to include the "&" which will make the program run in the background, so we can start more programs from the same shell window.)

    clustalx1.83.sun/clustalx &

    This opens the program in a separate window.
  2. Load the sequences in ClustalX

    The first thing you have to do is load the sequences. In the "File" menu choose "Load sequences", and select gp120.fasta in the multalign directory.

    In ClustalX the sequences are displayed on the screen with names on the left hand side, and the sequences themselves on the right. Residues are colored according to amino acid characteristics. It is possible to resize the window by "pulling" at the edges, so you can fit all 27 lines in one window. A scroll-bar at the bottom of the window allows you to move along the alignment (the sequences are too long to fit in the horizontal direction).

    Beneath the sequences there is a ruler starting at 1 for the first residue position. Below this is a graphical indication of the degree of conservation in each column of the alignment. A high score indicates a well-conserved column; a low score indicates low conservation. Since the sequences have not yet been aligned (they are all just lined up starting with their first residues), most values are quite low.

  3. Have a look at the alignment parameters:

    From the "Alignment" pull-down menu choose "Alignment parameters", and then "Pairwise alignment parameters". In this window you are able to change gap-penalties and substitution matrix for the initial, pairwise part of the multiple alignment.

    Test: Note the substitution matrix, the gap opening penalty and the gap extension penalty on today's results form.

    You can also specify whether you want the pairwise alignments performed using a slower but accurate method, or a faster but approximate method. For now just leave everything as is. Exit the window by clicking the tab labeled "Close".

    There is a similar window for changing the multiple alignment parameters ("Alignment", "Alignment parameters", "Multiple alignment parameters".) Have a look, but keep the default values for now.

    Finally, there is also a window for changing a special set of gap parameters used by ClustalX ("Alignment", "Alignment parameters", "Protein gap parameters".).

    Test: Check what the parameters are set to in the "Protein gap parameters" window and note the values on the results form. Explanations of the various parameters are listed below:

    • RESIDUE SPECIFIC PENALTIES are amino acid specific gap penalties that reduce or increase the gap opening penalties at each position in the alignment or sequence. See the documentation for details. As an example, positions that are rich in glycine are more likely to have an adjacent gap than positions that are rich in valine.
    • HYDROPHILIC GAP PENALTIES are used to increase the chances of a gap within a run (5 or more residues) of hydrophilic amino acids; these are likely to be loop or random coil regions where gaps are more common. The residues that are "considered" to be hydrophilic can be entered in HYDROPHILIC RESIDUES.
    • GAP SEPARATION DISTANCE tries to decrease the chances of gaps being too close to each other. Gaps that are less than this distance apart are penalised more than other gaps. This does not prevent close gaps; it makes them less frequent, promoting a block-like appearance of the alignment.
    • END GAP SEPARATION treats end gaps just like internal gaps for the purposes of avoiding gaps that are too close (set by GAP SEPARATION DISTANCE above). If you turn this off, end gaps will be ignored for this purpose. This is useful when you wish to align fragments where the end gaps are not biologically meaningful.
  4. Start the multiple alignment:

    From the "Alignment" menu choose "Do complete alignment". This opens a window giving you the opportunity to rename output files. Accept the suggested names and start the alignment by clicking the tab labeled "Align".

    You may be able to get a glimpse of how ClustalX is working in the bottom part of the window: first, it does all the pairwise alignments. These alignments are then used to construct a "guide tree". (The guide tree should not be confused with the phylogenetic tree we will construct later. The guide tree is entirely based on the pairwise alignments, and is used to guide the construction of the multiple alignment.) Finally, ClustalX constructs the multiple alignment by progressively "aligning alignments" following the guide tree.

  5. Inspect the alignment:

    When the alignment is done you can inspect the result by scrolling along the sequences in the ClustalX window. You will notice that the conservation graph at the bottom of the window now has several peaks and plateaus corresponding to the conserved regions of gp120. An additional summary of conservation is given above the sequences: "*" indicates positions which have a single, fully conserved residue, ":" indicates positions with strong functional conservation, and "." indicates positions with weaker functional conservation. Don't close ClustalX after inspecting the alignment (you can minimize the window if it clutters the screen).
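
The conservation summary described in step 5 can be imitated in a simplified form. The sketch below is written in Python and assumes that the alignment is available as a list of equal-length strings with "-" for gaps. It marks fully conserved columns with "*" and reports the fraction of sequences that share the most common residue in each column. This is not the scoring scheme used by ClustalX (the ":" and "." classes are not reproduced), and the function name and toy alignment are hypothetical.

```python
from collections import Counter


def column_conservation(alignment):
    """Return (marks, scores) for an alignment given as equal-length strings.

    marks[i] is '*' if column i holds one identical residue in every
    sequence (no gaps), otherwise ' '. scores[i] is the fraction of
    sequences that carry the most common non-gap residue in column i.
    """
    if not alignment:
        return "", []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    marks = []
    scores = []
    for i in range(length):
        column = [seq[i] for seq in alignment]
        residues = Counter(c for c in column if c != "-")
        if residues:
            top_count = residues.most_common(1)[0][1]
        else:
            top_count = 0  # column consists of gaps only
        scores.append(top_count / len(column))
        marks.append("*" if top_count == len(column) else " ")
    return "".join(marks), scores


# Toy alignment for illustration only (not real gp120 data).
toy_alignment = [
    "MKV-EK",
    "MRVKEK",
    "MKVKDK",
]
marks, scores = column_conservation(toy_alignment)
```

Printing the toy alignment with the marks line underneath shows which columns are identical in all sequences, which is the same information the "*" symbols convey in the ClustalX window.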

Computing an unrooted tree

In this part of the exercise we will use ClustalX to produce a phylogenetic tree. The tree is built with the neighbor joining algorithm, and is based on distances computed from the multiple alignment you just constructed.
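
To make the idea of "distances computed from the multiple alignment" concrete, the sketch below computes one simple pairwise distance: the fraction of aligned positions at which two sequences differ, skipping positions where either sequence has a gap. This is an illustrative Python sketch only. The distance calculation inside ClustalX may differ in its details (for instance in how gaps are treated or whether a correction for multiple substitutions is applied), and all names and sequences used here are hypothetical.

```python
def p_distance(seq_a, seq_b):
    """Fraction of differing residues between two aligned sequences.

    Positions where either sequence has a gap ('-') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    different = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            different += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return different / compared


def distance_matrix(alignment):
    """Build a full table of pairwise distances from a name -> sequence dict."""
    names = list(alignment)
    return {
        a: {b: p_distance(alignment[a], alignment[b]) for b in names}
        for a in names
    }


# Toy alignment for illustration only (not real gp120 data).
toy = {
    "seqA": "MKV-EK",
    "seqB": "MRVKEK",
    "seqC": "MKVKDK",
}
matrix = distance_matrix(toy)
```

The resulting table is symmetric and has zeros on the diagonal, like the distance matrix that ClustalX writes to the .phdst file later in this exercise.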

  1. Re-open the ClustalX window containing the multiple alignment from before.

    (Don't panic if you accidentally closed the ClustalX program after the multiple alignment: start the program as before, and use "File", "Load sequences" to load the alignment file: "gp120.aln")

  2. Select output of treefile and distance matrix:

    From the "Trees" menu choose "Output format options" and select "PHYLIP format tree" and "PHYLIP distance matrix". Exit the window by clicking the tab labeled "close".

  3. Construct the tree:

    From the "Trees" menu choose "Draw N-J tree". This gives you a window where you can change the name of the tree-file and the distance matrix. Make sure they are named "gp120.ph" and "gp120.phdst" respectively, and then click "OK" to calculate the tree. ClustalX uses the multiple alignment to calculate a distance matrix with all pairwise distances between the 27 sequences, and then constructs a tree by progressively clustering sequences that are close to each other (using the neighbor-joining algorithm). (Note: the construction of the NJ-tree is so fast that it is practically finished the second you have clicked "Draw N-J tree", so there is no need to sit around and wait for the result.)

  4. Inspect the output files:

    Use a text editor like nedit or notepad, or issue the following command at the command prompt (EXIT by pressing q):

    less gp120.ph

    This treefile is in a text-based format that is obviously mostly meant for computers. Also look at the file containing all pairwise distances between all sequences:

    less gp120.phdst

    These are the numbers that were used by the neighbor joining algorithm for constructing the tree. For each sequence, the distances to all other sequences are listed on a number of consecutive lines (there are too many distances to fit on a single line). The format is such that you should imagine a table where the order on the (unlabeled) horizontal axis is the same as on the vertical axis (i.e., the first sequence is HV1EL, the second is HV1Z2, etc.).

    Test: Scroll down to the entry for the sequence named HV2BE. Note the first and last distances on the results form (these are the distances to the sequences HV1EL and SIVM1, respectively).

  5. View a plot of the unrooted tree:

    There are several programs for visualizing tree-files like the PHYLIP-file gp120.ph. Today we will use the java version of the program FigTree version 1.0, which can be downloaded here.

    1. Download, unpack and start FigTree

      Download the FigTree Java package from the link above to the multalign directory, and unpack the program files to a subdirectory in your working directory. On Mac/Unix/Linux/SUN systems this can be done by opening a terminal window, changing directory to the working directory, and issuing the following commands:

      gunzip FigTree.v1.0.tgz

      tar -xvf FigTree.v1.0.tar

      Start FigTree with the following command:

      java -Xms64m -Xmx128m -jar FigTree.v1.0/figtree.jar &

      This opens the program in a separate window.
    2. Load the PHYLIP file in FigTree

      FigTree starts out by asking for a tree file. Select the gp120.ph file.

      Under "Layout" in the column to the left, select the button furthest to the right with an unrooted tree on it. As our tree has not yet been rooted it is only appropriate to visualize it as unrooted. You can enlarge the Tip labels under "Tip labels"->"Font size" if necessary.

    Test: make a sketch of the tree on the results form (do not show every single sequence, just indicate loosely where the HIV1 cluster, the HIV2 cluster, the SIVCZ sequence, and the SIVMK sequences are located).

  6. Think for a minute about the implications:

    What does this tree tell us about the phylogenetic relationship of HIV-1, HIV-2 and SIV? Notice especially where the two different groups of SIV cluster compared to the two different groups of HIV.

    When you've thought about the problem, you can read a brief explanation that I've prepared. Additionally, you can find a good description of HIV evolution here:
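
The distance file inspected in step 4 can also be read by a program instead of by eye. The following Python sketch assumes the layout described in that step: a first line with the number of sequences, and then, for each sequence, its name followed by one distance per sequence, possibly wrapped over several lines. It further assumes that names contain no spaces and are separated from the first distance by whitespace. The function names and the small example matrix are hypothetical; the values are made up and not taken from gp120.phdst.

```python
def parse_phylip_distances(text):
    """Parse a square PHYLIP-style distance matrix with wrapped rows.

    Returns (names, rows), where rows[i][j] is the distance between
    sequence i and sequence j in file order.
    """
    tokens = text.split()
    if not tokens:
        raise ValueError("empty distance file")
    n = int(tokens[0])
    pos = 1
    names = []
    rows = []
    for _ in range(n):
        # Each entry is one name followed by exactly n distances,
        # regardless of how the lines were wrapped.
        if pos + 1 + n > len(tokens):
            raise ValueError("distance file ended early")
        names.append(tokens[pos])
        rows.append([float(t) for t in tokens[pos + 1:pos + 1 + n]])
        pos += 1 + n
    return names, rows


def distance(names, rows, a, b):
    """Look up the distance between two named sequences."""
    return rows[names.index(a)][names.index(b)]


# Made-up three-sequence matrix with wrapped rows, for illustration only.
example = """\
    3
SEQ_A       0.000  0.100
  0.300
SEQ_B       0.100  0.000
  0.250
SEQ_C       0.300  0.250
  0.000
"""
example_names, example_rows = parse_phylip_distances(example)
```

Applied to the contents of gp120.phdst, a call such as distance(names, rows, "HV2BE", "HV1EL") would return the first value asked for in the test in step 4, assuming the names appear in the file exactly as written here.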


Bootstrapping a neighbor joining tree

  1. ClustalX also has the possibility of bootstrapping your neighbor joining tree:

    Reopen your alignment from before (as described above). In the Trees menu choose "Output Format Options", and set Bootstrap labels to NODE. This will allow FigTree (and TreeViewer) to visualize the bootstrap values, since FigTree does not recognize bootstrap labels in PHYLIP files where the labels are located on branches.

    In the Trees menu choose "Bootstrap NJ-tree". This gives you a window where you can change the number of resampled data sets. The default is 1000, but you may want to change this to 100 in order not to wait for too long.

    Make sure the tree file will be named "gp120.phb" and click OK to start the bootstrapping.

    The idea in bootstrapping is to assess how much support there is for each branch in your tree, based on the data at hand. Imagine you have an alignment with 135 columns. ClustalX performs bootstrapping by randomly picking a column 135 times in a row, thereby generating a new, resampled alignment having the same size as the original alignment (135 columns). A column may be chosen more than once (this is normally termed "sampling with replacement"). In the newly generated, resampled alignment some columns will be present multiple times, while some of the original columns will be absent.

    This sampling with replacement is done a large number of times (usually 500-1000), so that a lot of resampled alignments are produced. For each of these alignments a tree is now constructed using neighbor joining. Finally, a consensus tree is made from the large set of trees. Basically the consensus tree consists of the taxonomic (monophyletic) groups that occur most frequently in the large set of trees.

    The consensus tree that is produced has a number on each internal branch indicating how well supported that branch is (larger is better). Specifically, the number indicates how many of the individual trees in the large tree-set contained that particular branch. (It is perhaps useful to think of each internal branch as defining a bi-partition of the leaves. If two different trees each possess an internal branch that divides the leaves into, say, one group with leaves A, B, and C, and another group with leaves D and E, then both trees are said to have the same internal branch.)

    It has been found that the variation you get as a result of this type of bootstrapping is similar to what you get when collecting new data, and it therefore gives you an idea about how well supported the individual branches in your tree are. It is, however, not a real significance or probability (for that purpose you might want to use Bayesian phylogeny which we will, however, not discuss in this course).

  2. View the bootstrapped tree:

    Load the gp120.phb file in FigTree, make sure that "Branch labels" are check-marked to the left and that "Display" is set to "label" under this menu. You should now be able to see the tree with the values attached to all internal branches. Remember that the number tells how often the data was divided into the two groups present on either side of the branch. Please note that this is still an unrooted tree; you are just using a rooted viewer, since that is the only one we have easily available that will display bootstrap values.

    Test: What is the bootstrap value on the internal branch separating the HIV1/SIVCZ cluster from the HIV2/SIVMK cluster? Note the value on the form.
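
The resampling and counting described in step 1 can be sketched in a few lines of Python. The sketch below shows two parts of the procedure: drawing alignment columns with replacement, and counting how often a given bi-partition of the leaves occurs in a set of replicate trees. The tree-building step in between (neighbor joining on each replicate) is left out, so the replicate trees here are simply written down by hand as sets of bi-partitions. All names, the toy alignment, and the replicate trees are hypothetical, and this is not how ClustalX is implemented internally.

```python
import random


def resample_columns(alignment, rng):
    """Draw columns with replacement to build one bootstrap replicate.

    alignment maps names to equal-length aligned sequences. The replicate
    has the same number of columns as the original.
    """
    length = len(next(iter(alignment.values())))
    picks = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[i] for i in picks)
        for name, seq in alignment.items()
    }


def bipartition(group, all_leaves):
    """Canonical form of a split of the leaves into two groups.

    Always returns the side that does not contain the alphabetically
    first leaf, so that either side of a split gives the same key.
    """
    group = frozenset(group)
    anchor = min(all_leaves)
    if anchor in group:
        return frozenset(all_leaves) - group
    return group


def support(splits_per_tree, split):
    """Percentage of replicate trees that contain the given split."""
    hits = sum(1 for splits in splits_per_tree if split in splits)
    return 100.0 * hits / len(splits_per_tree)


# Toy alignment for illustration only.
toy = {"A": "MKVKEK", "B": "MRVKEK", "C": "MKVKDK"}
replicate = resample_columns(toy, random.Random(1))

# Hand-written replicate trees on five leaves; each tree is represented
# by the bi-partitions defined by its two internal branches.
leaves = "ABCDE"
replicate_trees = [
    {bipartition("DE", leaves), bipartition("AB", leaves)},
    {bipartition("ABC", leaves), bipartition("AC", leaves)},
    {bipartition("AB", leaves), bipartition("CD", leaves)},
    {bipartition("DE", leaves), bipartition("BC", leaves)},
]
de_support = support(replicate_trees, bipartition("DE", leaves))
```

Note that the groups "DE" and "ABC" describe the same internal branch, which is why the bipartition function maps both to the same key before counting. In this made-up example the branch separating D and E from A, B and C is present in three of the four replicate trees.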

Rooting a tree using an outgroup

In this part of the exercise you will use a data set of 20 different POL-polyprotein sequences isolated from HIV-1, HIV-2, chimpanzee SIV, and sooty mangabey SIV. (The Pol gene encodes three different polypeptides: integrase, reverse transcriptase, and protease. It is expressed as a single polyprotein and is subsequently cleaved by protease into its three separate parts.)

First, you will construct a neighbor-joining tree like before and investigate whether this new, independent data set confirms the conclusions you made based on the alignment of gp120 sequences. Then you will add a POL-polyprotein sequence from HTLV-1 to the data set and construct a new tree that you can then root using the HTLV sequence as an outgroup. (HTLV-1 is another retrovirus, but it is not a lentivirus and is thus more distantly related to HIV than the SIVs are. Incidentally, HIV was originally named HTLV-3.)

  1. Download and have a look at the POL sequence file:

    Download the hiv-siv-pol.fasta file to the working directory, and inspect the sequences with a text editor or alignment viewer.

    As mentioned, this file contains POL-polyprotein sequences from HIV-1, HIV-2, chimpanzee SIV, and sooty mangabey SIV.

  2. Align the POL sequences using ClustalX:

    Re-open the ClustalX window and load the sequence file hiv-siv-pol.fasta. Now, start the alignment by choosing "Do complete alignment" from the "Alignment" menu as before.

  3. Construct a neighbor-joining tree with no outgroup:

    From the "Trees" menu choose "Draw N-J tree". Make sure the tree will be named "hiv-siv-pol.ph".

  4. Inspect the unrooted tree:

    Open FigTree and load the PHYLIP tree file hiv-siv-pol.ph.

    This tree has been constructed from an entirely independent set of sequences. Does it support the conclusions that could be made from the gp120-based tree?

    Test: Make a sketch of the POL-based tree (again, just loosely indicate the position of the HIV1 cluster, the HIV2 cluster, the chimpanzee SIV sequence and the sooty mangabey SIV sequences).

  5. Have a look at the POL sequence file with an added outgroup:

    Download the htlv-hiv-siv-pol.fasta file to the working directory, and inspect the sequences with a text editor or alignment viewer.

    This file contains the same sequences as the file hiv-siv-pol.fasta plus an additional POL-sequence from the related virus HTLV-1 (the first sequence in this file). The HTLV POL sequence will be used as an outgroup in this part of the exercise.

  6. Align the outgroup-containing data set using ClustalX and create neighbor-joining tree:

    Re-open the ClustalX window and load the sequence file htlv-hiv-siv-pol.fasta.

    Now, start the alignment by choosing "Do complete alignment" from the "Alignment" menu as before.

    From the "Trees" menu choose "Draw N-J tree". Make sure the tree will be named "htlv-hiv-siv-pol.ph".

  7. Define outgroup:

    We will now use the same data for constructing a rooted tree, using the HTLV sequence as a way of defining where to place the root.

    For this purpose we have created a rooting service: http://www.cbs.dtu.dk/services/Trooting-1.0/

    Now, go to the rooting service above and submit the treefile htlv-hiv-siv-pol.ph. When it asks you to define an outgroup, you should choose HTLV and submit.

    The outgroup will be used to place the root of the tree. The rationale is as follows: our data set consists of sequences from HIV-1, HIV-2, SIV and HTLV. We know from other evidence that the lineage leading to HTLV branched off before any of the remaining viruses diverged from each other. The root of the tree connecting the viruses investigated here must therefore be located between the HTLV sequence (the "outgroup") and the rest (the "ingroup"). This way of finding a root is called "outgroup rooting", and constructs a tree where the outgroup is a monophyletic sister group to the ingroup.

    The results from the rooting service show the original tree(s) first and the constructed rooted tree(s) at the bottom.

    Now, open a text editor and copy-paste the rooted tree from the rooting service to the text editor, and save the file as htlv-hiv-siv-pol_og.ph.

  8. Inspect the unrooted tree:

    Load the treefile htlv-hiv-siv-pol.ph in FigTree, and in the Layout menu choose the button furthest to the right, with the unrooted tree on it.

    Observe how the outgroup HTLV is located quite distantly from the other sequences.

  9. Have a look at the outgroup-rooted tree:

    Load the treefile htlv-hiv-siv-pol_og.ph in FigTree, and in the Layout menu choose the button furthest to the left, with the rooted tree on it.

    In this tree we have used the outgroup to place the root.

    Test: On the sketch you made before, indicate which branch the root is located on. Was this where you expected it?
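
The idea behind outgroup rooting, as explained in step 7, can be illustrated with a small Python sketch. It assumes the unrooted tree is given as a table listing the neighbours of every node, and it places the root on the branch that connects the outgroup leaf to the rest of the tree. Branch lengths are ignored, and the node names and toy tree are hypothetical. How the rooting service used above works internally is not described in this exercise, so this should not be read as its implementation.

```python
def root_with_outgroup(adjacency, outgroup):
    """Root an unrooted tree on the branch leading to the outgroup leaf.

    adjacency maps every node to the list of its neighbours; leaves have
    exactly one neighbour. The result is a nested tuple in which leaves
    are plain strings: (outgroup, ingroup_subtree).
    """
    neighbours = adjacency[outgroup]
    if len(neighbours) != 1:
        raise ValueError("the outgroup must be a leaf")

    def subtree(node, parent):
        # Everything reachable from node without going back to parent.
        children = [n for n in adjacency[node] if n != parent]
        if not children:
            return node
        return tuple(subtree(child, node) for child in children)

    ingroup = subtree(neighbours[0], outgroup)
    return (outgroup, ingroup)


# Toy unrooted tree: internal nodes n1, n2, n3 and leaves A, B, C, D, OUT.
toy_tree = {
    "OUT": ["n1"],
    "n1": ["OUT", "n2", "n3"],
    "n2": ["n1", "A", "B"],
    "n3": ["n1", "C", "D"],
    "A": ["n2"],
    "B": ["n2"],
    "C": ["n3"],
    "D": ["n3"],
}
rooted = root_with_outgroup(toy_tree, "OUT")
```

In the rooted result the outgroup is the sister group of everything else, and the first split inside the ingroup is the one next to the branch where the outgroup was attached in the unrooted tree. This is the same reasoning you apply when marking the root on your sketch.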

Additional information, online resources