Events News Research CBS CBS Publications Bioinformatics
Staff Contact About Internal CBS CBS Other

Viral Evolution:
Using Phylogenetic Analysis to Detect Epitopes

Anders Gorm Pedersen (gorm@cbs.dtu.dk)
Claus Lundegaard (lunde@cbs.dtu.dk)


Overview

In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env-genes from HIV-1, subtype B isolated from a number of different patients. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein. Using so-called probabilistic modelling techniques (Bayes and maximum likelihood) you will attempt to find putative epitopes. A major goal with the exercise is to introduce you to concepts and tools that are important for phylogenetic analyses of viral DNA, and to make you aware that phylogenetic analysis is much more than merely constructing evolutionary trees.

Specifically, you will

  1. perform a multiple alignment of gp120 protein sequences (using ClustalX).
  2. use the protein-alignment as a template for constructing a DNA alignment (using revtrans).
  3. construct a phylogenetic tree based on the DNA alignment (using MrBayes).
  4. try to detect positively selected sites in gp120 (using PAML).

The exercise involves using software on a UNIX-machine, but don't panic! Everything you need to know is explained here. Just take your time and carefully follow the instructions. I suggest you constantly keep this page visible somewhere on your screen (resize the browser window) so you can switch back and forth between reading instructions and performing the exercise.


Biological Background

Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).

The role gp120 plays in infection and the fact that it is situated on the surface of the HIV particle, means it is an obvious target for the immune response. That means that there may be a considerable selective pressure on gp120 for creating immune-escape mutants, where amino acids in the gp120 epitopes have been substituted. One goal of todays exercise is to investigate whether you can detect such a selective pressure on parts of gp120.


Getting started, description of data

Change to today's working directory, and have a look at which files are there:

cd exercise3
ls -l

Here, you will find two data-files: gp120.prot.fasta and gp120.DNA.fasta. Have a look at the DNA data file:

nedit gp120.DNA.fasta &

The file is in the so-called fasta-format, and contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Now, close the nedit window and have a look at the protein data file:

nedit gp120.prot.fasta &

This file contains translated versions of the DNA-sequences in gp120.DNA.fasta. The file was prepared by us in advance but there are also webtools for doing this type of virtual translation of DNA sequences - if you want to try it at home (see list of resources at the end of this page). When you've had a look, then close the nedit window so it doesn't clutter your screen.


Multiple alignment of protein sequences

In this part of the exercise, we will use the program ClustalX to make a multiple alignment of the gp120 protein sequences.

  1. Start ClustalX

    clustalx &
  2. Load the protein sequence file

    File ==> Load sequences ==> select the file gp120.prot.fasta
  3. Start the multiple alignment:

    Alignment ==> Do complete alignment ==> Align
  4. Inspect the alignment:

    When the alignment is done you can inspect the result by scrolling along the sequences in the ClustalX window. In this case the starting point was a set of very similar sequences so not much has changed, but you will notice that gaps have been inserted here and there.

    Note: The starting point of all phylogenetic analyses is a multiple alignment, and your results will be no better than the quality of that initial alignment. Although you do not need to do so in this exercise, it is always a good idea to manually check your alignment for errors. To this end Clustalx gives you the opportunity of interactively realigning selected residue-ranges or selected sequences.


Creating a DNA alignment from aligned protein sequences

The above multiple alignment was made using protein sequences: DNA sequences are a lot less informative than protein sequences and for this reason it is always preferable to align coding DNA in translated form. (More details).

However, in the context of molecular evolution, DNA alignments retain a lot of useful information regarding silent mutations. Especially the ratio between silent and non-silent substitutions is informative. Therefore, we now want to perform the phylogenetic analysis using the DNA sequences instead. In order to do that we first have to construct a multiple DNA-alignment using the protein alignment as a template.

For this purpose we will use the program revtrans which takes as input an unaligned set of DNA sequences (in fasta-format) and an aligned set of protein sequences (also in fasta-format) and produces a DNA alignment that is in accordance with the protein alignment. This means that gaps are always inserted in groups of three so reading frames are kept in order.


  1. Construct the DNA alignment from the protein-alignment:

    revtrans gp120.DNA.fasta gp120.prot.aln > gp120.DNA.aln.fasta
  2. Have a look at the result:

    clustalx gp120.DNA.aln.fasta &

    Note how the gaps occur in groups of three (codon-sized).

  3. Convert the DNA alignment to nexus format:

    fasta2nexus gp120.DNA.aln.fasta > gp120.DNA.aln.nexus

    The next program you will use requires that its data is in the so-called nexus format.

  4. Have a look at how you could have done that more simply:

    http://www.cbs.dtu.dk/services/RevTrans/

    Instead of doing the above steps manually, you could have used the web-interface for revtrans. This server allows you to simply submit your unaligned DNA sequences. The server will then automatically translate the DNA sequences, align them at the protein level, and construct the DNA alignment from the protein alignment. The server gives you the opportunity of dowloading the resulting alignment in several formats.



Phylogenetic reconstruction

In this part of the exercise we will use the program MrBayes to produce a phylogenetic tree. MrBayes uses an interactive interface where you type commands at a prompt. It is a program that uses a probabilistic approach for estimating phylogenies, meaning that the user specifies a model of sequence evolution and the program then finds good trees according to that model.

Here, we will first let MrBayes produce a large set of possible trees, and then we use the best subset of these to construct a "consensus tree" that will be used for further analysis.

  1. Start Mrbayes:

    mb

    This starts the program, giving you a prompt ("MrBayes> ") where you can enter commands. Each command has to be terminated by hitting the "Enter button". (Entering "help" gives you a list of possible commands).

  2. Load your sequences:

    execute gp120.DNA.aln.nexus
  3. Specify your model of sequence evolution:

    set autoclose=yes
    charset 1stpos=1-.\3
    charset 2ndpos=2-.\3
    charset 3rdpos=3-.\3
    partition bycodon = 3:1stpos,2ndpos,3rdpos
    set partition=bycodon
    prset ratepr=variable
    lset nst=6
    

    The above commands essentially means we are using a model where each possible pair of nucleotides have distinct substitution rates, and where the three codon positions furthermore also have distinct rates. The exact values of these rates will be determined from the data during the optimization procedure.

  4. Search tree-space:

    mcmc ngen=100000 samplefreq=100 nchains=4 filename=mrtree

    This command essentially starts 4 parallel tree-search paths ("chains"), lets the search run for 100,000 rounds ("generations") and saves a tree once every 100 rounds (meaning that a total of 1000 trees will be saved). In real life you would let the program run for more generations (at least 500,000 or so).

    What you are actually doing here is to use a method known as MCMCMC ("Metropolis-coupled Markov chain Monte Carlo") to empirically determine the posterior probability distribution of trees, branch lengths and substitution parameters. There's a lot of interesting information you can extract from the resulting output, but for the purpose of this exercise we will limit ourselves to contruct a consensus tree.

    When the run starts you will see reports about the progress of the four chains. Each line of output lists the generation number and the log likelihoods of the current tree/parameter combination for each of the four chains. The run will take a few minutes to finish.

  5. Quit program:

    quit
  6. Determine "burnin":

    We now have to determine which subset of trees to use for building our consensus tree. This is done by inspecting the likelihood values of the trees saved by MrBayes and then selecting a good subset. (More details).

    mbplot mrtree.p

    This shows you how the likelihood value increases through the generations, to finally reach a plateau. Look at the x-axis to determine the number of generations required before the plateau is reached. Divide this number by 100, write down the result, and close the window.

  7. Restart Mrbayes:

    mb
  8. Load your sequences again:

    execute gp120.DNA.aln.nexus
  9. Construct consensus tree

    The burnin value (XX below) is the number of trees required to reach the plateau, not the number of generations. You get the number of trees by dividing the number of generations with 100 (recall that we only saved trees once every 100 generations). For instance, a burnin of 6000 generations correspond to 60 trees. In the line below, substitute the number you wrote down for "XX".

    sumt filename=mrtree.t contype=halfcompat burnin=XX
  10. Quit program:

    quit
  11. Convert the consensus tree to paml-format:

    ~gorm/projekter/bin/mb2paml2 gp120.DNA.aln.nexus.con | sed 's/[0-9]\.[0-9][0-9]//g' > contree.paml

    This converts the treefile into a format that is understood by the program you will use in the next step.


Detection of positively selected sites in gp120

There is much more to phylogenetic analyses than merely reconstructing trees. One interesting result of probabilistic methods, is that several of the parameters of any model can have their values determined as part of the optimization procedure. This means that once such a model has been fitted to the data, it is possible to investigate these values to learn features about the evolutionary history of the sequences under investigation. A typical (but perhaps slightly boring) example would be the transition/transversion rate ratio. In the present example we will focus on investigating whether we can find positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1.

A further strength of the probabilistic approach is that it is possible to fit different models to a given data set and then compare their likelihoods to determine which model fits the data best. Briefly, if the models are "nested", then it is possible to compare their likelihood values using a chi-squared test, thus using statistics to see if one model is significantly better than an alternative model. Recall that each model essentially corresponds to a hypothesis about the evolutionary history of the data, and we can thus use a stringent statistical approach to decide which hypothesis best describes our data.

  1. Convert DNA alignment to phylip format (and remove columns with gaps):

    cat gp120.DNA.aln.fasta | degap | fasta2phylip  > gp120.DNA.aln.phylip
  2. Inspect the parameter file:

    nedit codeml.ctl &

    The file "codeml.ctl" contains several settings that are relevant for running the program codeml. Here are some of the more important ones:

    seqtype = 1: tells the program that our data consists of coding DNA. NSsites = 0 3: tells the program to analyze models M0 and M3.

    The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M0 says there is one single dN/dS ratio for all sites in the sequence. M3 says there are 3 distinct dN/dS ratios for different sites in the sequence. The value of the three ratios and the position of sites belonging to either, will be determined during the analysis.

  3. Start the analysis

    codeml

    This will start the codeml program using the settings in the file codeml.ctl. Depending on system load and the exact topology of your consensus tree this will take somewhere around 20 minutes or so. Lean back and enjoy a cup of coffee!

    (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves). When codeml has finished you can move on to the next step:

  4. Inspect result file

    nedit selection.results

    Make sure to turn off line-wrapping: Preferences ==> Wrap ==> None. This file contains a wealth of information concerning your analysis. The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values:

  5. Find likelihood and dN/dS ratio for model M0:

    Search ==> Find... ==> enter "Model 0" and click Find

    You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now, locate a line that is similar to the one shown below:

    lnL(ntime: 72  np: 74):  -4242.470345     +0.000000
    

    and note down the number of "free parameters" used in model M0: This is indicated by "np", and is 74 in the example shown above. Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example above.

    Now, scroll down a few lines until you get to a table that is similar to the one shown below:

    dN & dS for each branch
    
     branch           t        S        N    dN/dS       dN       dS   S*dS   N*dN
    
      46..1       0.000     91.6    370.4   0.6020   0.0000   0.0000    0.0    0.0
      46..2       0.000     91.6    370.4   0.6020   0.0000   0.0000    0.0    0.0
      46..8       0.000     91.6    370.4   0.6020   0.0000   0.0000    0.0    0.0
      ...
    
    

    Note down the dN/dS ratio from the column "dN/dS" (0.6020 in the example above). This is the average ratio estimated under the assumption that all codon sites in the gene are under identical selective pressure. Is your value above or below 1?

  6. Find likelihood and dN/dS ratio for model M3:

    Scroll past the M0 table until you get to the results for model M3. Again note down the number of free parameters and the log-likelihood of the model.

    Now, scroll down a few lines until you get to a small table similar to this:

    dN/dS for site classes (K=3)
    p:   0.49485  0.36975  0.13540
    w:   0.06583  0.66691  3.17311
    

    This gives a summary of the dN/dS ratios that were found in the data set. The line starting w: lists the three dN/dS ratios that were found (in this case 0.06583, 0.66691, and 3.17311). The line starting p: gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 49% belong to the first class, 37% belong to the second class, and 14% of all sites belong to the class having dN/dS=3.17311). Do you have any class of dN/dS values with a value that is above 1?

  7. Investigate whether model M3 is significantly better than M0:

    M3 will always have a better (higher) log-likelihood than model M0 because the two models are nested and M3 has more free parameters. To determine if it is also significantly better, first compute twice the difference between the log-likelihoods of M0 and M3. This will be your test statistic for the chi-squared test you will perform in a minute.

    For instance, if M3 has a log likelihood of -3768.977660 and the value for M0 is -3961.012861, then twice the difference is:

    2*(-3768.977660 - (-3961.012861)) = 384.070402

    Now, compute the difference in number of free parameters. For instance, if M0 used 71 free parameters and M3 used 75 then:

    df = 75 - 71 = 4

    This is the degrees of freedom you will use in the chi-squared test. Finally, perform the test by comparing your statistic to a chi-squared table using the relevant number for degrees of freedom. A chi-squared table can be found here if you don't have one handy. In the example shown above, we have the statistic=384.070402, which is much larger than the largest critical value in our chi-squared table under df=4, which means that M3 is (very) significantly better than M0 (p<<0.1%).

  8. Examine list of positively selected sites

    If your M3 model has a dN/dS class with a value larger than 1 (we firmly believe it should have if you did things according to our instructions...), and if M3 was found to be significantly better than M0, then you have evidence for the existence of positively selected sites in the gp120 gene. Positive selection in this gene is most likely the result of the selective pressure from the immune system: if an epitope remains unchanged, then antibodies against it will eventually be constructed and the virus will be removed. If the epitope constantly changes then the virus will avoid being recognized by the immune system. Now, scroll down to the end of the result file and locate a list similar to this one:

    Positively selected sites Prob(w>1):
    
         7 K 0.6741
        50 L 0.9542*
        52 D 0.8680
        54 V 0.9991**
        64 S 0.9516*
        68 N 1.0000**
       100 H 0.8655
       ...
       
    

    This gives you a list of which residues (if any) that were found to belong to the positively selected dN/dS-class. Using only DNA sequences you have now identified amino acids that are probably located in epitopes on the gp120 protein. That's not bad for approximately 2 hours of work!


    Downloadable software

    Tree Viewers

    • TreeView: tree-viewer with user-friendly, graphical interface.
    • PhyloDraw: another tree-viewer.

    Tools for inferring and interpreting phylogenetic trees

    • ClustalX: multiple alignment and distance-based tree-reconstruction (neighbor joining).
    • Phylip: reconstruct trees using maximum likelihood, parsimony, or distance methods.
    • PAUP*: reconstruct and analyze trees using maximum likelihood, parsimony, or distance methods.
    • PAML: analyze phylogenetic hypotheses using maximum likelihood.
    • MrBayes: Bayesian inference of phylogeny.


    Additional resources


    Local software

    The program revtrans was written by Rasmus Wernersson here at CBS (raz@cbs.dtu.dk).

    All other scripts were written by Anders Gorm Pedersen, and Claus Lundegaard, and can bee seen in their entirety by cd'ing to the directory named "~immu00/bin" Note: These are quick and dirty, no error-checking type of scripts that will crash if you so much as raise your voice, so use them with caution...