Exercise: Multiple Alignment and Phylogeny
Written by Anders Gorm Pedersen,
modified by Thomas S. Rask,
Daniel Aaen Hansen
and Rasmus Wernersson
This exercise has two parts. During the first (fairly small) part you will:
- Compare the performance of 3 different multiple alignment methods (mafft, muscle, clustalw)
for aligning a set of proteins
During the second part you will:
- Perform a multiple alignment of gp120 protein
sequences from HIV and SIV (using Clustal)
- Construct an unrooted tree from the alignment of
gp120 sequences (using the "neighbor joining" algorithm in Clustal)
- Visualize the gp120-based tree using the program FigTree
- Consider the evolutionary implications of the tree
- Investigate the robustness of your tree by bootstrapping
- Perform a second multiple alignment based on POL
protein sequences from HIV and SIV.
- Construct a new neighbor joining tree from the POL alignment
- Investigate whether the POL-based tree supports the
conclusions from the gp120-based analysis
- Perform a multiple alignment of the same POL
sequences and a POL sequence from HTLV-1.
- Root the POL-based tree using the HTLV sequence as an outgroup
Download and install ClustalX
In today's exercises, we will use the program Clustal to construct various multiple alignments.
We have previously used the web-interface
provided by EBI. Today we will use the program ClustalX and thus run
the program on our own computer. Windows, Mac and Linux/Unix
versions of the latest ClustalX version (v2.0.10) are freely
available for download. ClustalX
is a graphical front-end to the command line-based program ClustalW.
The program is interactive with a typical, relatively user friendly,
windows-style interface. ClustalX has online help available.
Download the ClustalX package corresponding to your operating system. Make sure to download ClustalX and NOT ClustalW!
- Windows: Download clustalx-2.0.10-win.msi
Double click the .msi file and follow the instructions to install the program.
Start ClustalX from the Start menu.
- Mac (OSX only): Download clustalx-2.0.10-macosx.dmg
Mount and open the downloaded .dmg file (unless it does so automatically).
Drag the clustalx program to your Applications folder in Finder.
Open ClustalX from the Applications folder.
- Linux/UNIX/SUN: Download clustalx-2.0.10-linux-i386-libcppstatic.tar.gz to the place
where you want to install the program.
Change working directory to the folder where the downloaded .tar.gz file is located and unpack it using these two commands:
gunzip clustalx-2.0.10-linux-i386-libcppstatic.tar.gz
tar -xvf clustalx-2.0.10-linux-i386-libcppstatic.tar
(Clustal X running on Macintosh)
Comparison of multiple alignment programs
In this part of the exercise you will use a dataset of 11 alternatively-spliced gene
products from the human erythrocyte membrane protein band 4.1 (EPB). In addition to the
unaligned data set we also have a pre-aligned optimal alignment. The goal of this
short exercise is to compare how well three different popular multiple alignment programs
perform when attempting to align a set of proteins that are identical except for having
- Align the EPB sequences using mafft, muscle, and clustalw:
Align the EPB sequences using the servers at EBI with default settings.
For each alignment, keep the window (or tab) open after use:
Data: EPB.fasta: unaligned set of EPB proteins
- Have a look at the original, unaligned data:
Open the unaligned set of EPB sequences
in your local version of the ClustalX program and have a look (hint: first save the file to your disk, then open).
Note: In order to load sequences in ClustalX you do the following: In the "File"
menu choose "Load sequences", and then select your fasta file in the proper directory.
In ClustalX the sequences are displayed on the screen with names on the left-hand side, and the sequences themselves on the right. Residues are
colored according to amino acid characteristics. It is possible to resize
the window by "pulling" at the edges, so you can fit more lines in the
window. A scroll-bar at the bottom of the window allows you to move along the
alignment (in case the sequences are too long to fit in the horizontal direction).
Beneath the sequences there is a ruler starting at 1 for the
first residue position. Below this is a graphical indication of the
conservation in each column of the alignment. A high score indicates a
well-conserved column; a low score indicates low conservation. Since the
sequences have not yet been aligned (they are all just lined up by
their first residues), most values are quite low.
Recall that this is the unaligned set of sequences. NOTE: we are here only using ClustalX as a
sequence/alignment viewer, not as an alignment program.
- Have a look at the optimally aligned data:
Now, open the prealigned, optimal alignment of the EPB sequences
in your local ClustalX program. Again drag your window and scroll along to see the entire
alignment. Keep this window hanging around somewhere so you can compare the
other alignments to it.
- Have a look at the three different alignments:
Compare the ideal alignment (shown in your local ClustalX program)
to the three alignments you just constructed using the EBI servers. You can use either the JalView alignment
viewer linked on the EBI result page, or your local ClustalX program for easier viewing. Note that, for unknown
reasons, there is no JalView link on the mafft page, so there you will need to use ClustalX.
Question 1: Are the three alignments different? Which, if any, of the three alignment methods got the alignment entirely correct?
You should note that this was just one particular form of test. On a different problem
the relative performance of the alignment methods could well be different. However, you should also
note that this was a fairly simple problem, and one where you could easily see artefacts. That will
not be the case for most real biological data sets.
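If you prefer a number to a visual comparison, the agreement between a test alignment and the reference alignment can be quantified. The following is a minimal Python sketch of a sum-of-pairs style score: the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. The function names and the toy sequences are invented for illustration, and the alignments are assumed to be given as dictionaries mapping sequence names to gapped sequences of equal length.

```python
from itertools import combinations

def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    alignment: dict mapping sequence name -> gapped sequence ("-" = gap).
    Each residue is identified by (sequence name, index in the ungapped sequence).
    """
    names = sorted(alignment)
    seen = {name: 0 for name in names}  # residues consumed so far per sequence
    pairs = set()
    for col in range(len(alignment[names[0]])):
        present = []
        for name in names:
            if alignment[name][col] != "-":
                present.append((name, seen[name]))
                seen[name] += 1
        pairs.update(combinations(present, 2))
    return pairs

def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs that the test alignment reproduces."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 1.0
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)

# Toy example (made-up sequences, not the EPB data)
reference = {"a": "AC-GT", "b": "ACTGT"}
test = {"a": "ACG-T", "b": "ACTGT"}
print(sum_of_pairs_score(test, reference))  # 0.75
```

A score of 1.0 means that every pair of residues aligned in the reference is also aligned in the test alignment.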
Phylogeny of HIV
Background: AIDS, HIV1, HIV2, and SIV
Acquired Immune Deficiency Syndrome (AIDS) is caused by
two divergent viruses, Human Immunodeficiency Virus one (HIV-1) and
Human Immunodeficiency Virus two (HIV-2). HIV-1 is responsible for the
global pandemic, while HIV-2 has, until recently, been restricted to
West Africa and appears to be less virulent in its effects. Viruses
related to HIV have been found in many species of non-human primates
(monkeys, apes, ...) and have been named Simian Immunodeficiency Virus (SIV).
These primate viruses are lentiviruses, a subfamily of
the retroviruses. Retroviruses have RNA genomes but are unique among
RNA viruses because they have a replication cycle that involves the
reverse transcription of their RNA genome into DNA (this is the
opposite direction compared to the usual flow of information from
DNA to RNA). The reverse-transcribed viral DNA is stably
incorporated into the genomic DNA of an infected cell and subsequent
transcription can then create multiple copies of mRNA encoding new
Like other retroviruses, particles of HIV are made up of
2 copies of the single-stranded RNA genome packaged inside a protein
core, or capsid. The core particle also contains viral proteins that
are essential for the early steps of the virus life cycle, such as
reverse transcription and integration. A lipid envelope, derived from
the infected cell, surrounds the core particle. Embedded in this
envelope are the surface glycoproteins of HIV: gp120 and gp41. The
gp120 protein is crucial for binding of the virus particle to target
cells. It is the specific affinity of gp120 for the CD4 protein that
targets HIV to those cells of the immune system that express CD4 on
their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).
The Pol gene is carried on the RNA genome and encodes three different polypeptides: integrase, reverse transcriptase, and protease. It is expressed as a single polyprotein, which is subsequently cleaved by protease into its three separate parts.
Purpose of exercise, description of data
In this exercise you are going to investigate the relationship
between HIV and SIV. You will do this using two different data sets:
- a set consisting of 27 different gp120 protein
sequences from isolates of
HIV1, HIV2, chimpanzee SIV and macaque monkey SIV: gp120.fasta
- a set consisting of 20 different POL-polyprotein
sequences from HIV1, HIV2, chimpanzee SIV and sooty mangabey SIV: hiv-siv-pol.fasta,
and the same set with the HTLV-1 sequence added: htlv-hiv-siv-pol.fasta
Note for enthusiasts: a number of lines of evidence
have indicated that
macaques are not naturally infected with SIV and that they have
acquired their SIV
infection while in captivity by cross-species transmission of SIV from sooty
mangabeys. This means that both the macaque SIVs and the sooty mangabey
SIVs originate from sooty mangabeys.
Finally, the exercise
First, we will use the file gp120.fasta, which contains
27 gp120 envelope
protein sequences from isolates of HIV-1, HIV-2, and SIV in FASTA format.
Create a working directory
called phylogeny on your computer,
download the gp120 file to this directory and take
a look at its contents (using a text editor like jEdit).
In this file, all HIV-1 sequences have names starting with
HV1. All HIV-2 sequences
have names starting with HV2. SIVCZ was isolated from chimpanzee. SIVMK,
SIVML were isolated from macaques.
Open ClustalX program, load files
Discard all the previously opened browser- and ClustalX-windows and open a fresh ClustalX window
on your own computer. Load the gp120.fasta file you just saved.
Have a look at the alignment parameters:
From the "Alignment" pull-down menu choose
"Alignment parameters", and then
"Pairwise alignment parameters". In this window you are able to change
the gap penalties and the substitution matrix for the initial, pairwise part of the alignment.
QUESTION 2: Note the substitution matrix, the gap
opening penalty and the gap elongation penalty in your report.
You can also specify whether you want the pairwise
alignments performed using a
slower but accurate method, or a faster but approximate method. For now, leave
everything as is and exit the window.
There is a similar window for changing the multiple alignment parameters
("Alignment", "Alignment parameters", "Multiple alignment parameters"). Have a
look, but keep the default values for now.
Finally, there is also a window for changing a
special set of gap parameters
used by ClustalX ("Alignment", "Alignment parameters", "Protein gap parameters").
QUESTION 3: Check what the parameters are set to
in the "Protein gap parameters" window.
Explanations of the various parameters are listed below:
- RESIDUE SPECIFIC PENALTIES are amino acid specific gap penalties that reduce
or increase the gap opening penalties at each position in the alignment or
sequence. See the documentation for details. As an example, positions that
are rich in glycine are more likely to have an adjacent gap than positions that
are rich in valine.
- HYDROPHILIC GAP PENALTIES are used to increase the chances of a gap within
a run (5 or more residues) of hydrophilic amino acids; these are likely to
be loop or random coil regions where gaps are more common. The residues that
are "considered" to be hydrophilic can be entered in HYDROPHILIC RESIDUES.
- GAP SEPARATION DISTANCE tries to decrease the chances of gaps being too close
to each other. Gaps that are less than this distance apart are penalized more
than other gaps. This does not prevent close gaps; it makes them less frequent,
promoting a block-like appearance of the alignment.
- END GAP SEPARATION treats end gaps just like internal gaps for the purposes of
avoiding gaps that are too close (set by GAP SEPARATION DISTANCE above). If
you turn this off, end gaps will be ignored for this purpose. This is useful
when you wish to align fragments where the end gaps are not biologically meaningful.
Start the multiple alignment:
From the "Alignment" menu choose "Do complete
alignment". This opens a
window giving you the opportunity to rename output files. Make sure the
files are named "gp120.dnd" and "gp120.aln" respectively, and start the
alignment by clicking "OK".
You may be able to get a glimpse of how ClustalX is
working in the bottom
part of the window: first, it does all the pairwise alignments.
These alignments are then used to construct a "guide
tree". (The guide tree should not be confused with the phylogenetic tree you
will construct later. The guide tree is entirely based on the pairwise
alignments, and is used to guide the construction of the multiple alignment.)
Finally, ClustalX constructs the multiple alignment by progressively
"aligning alignments" following the guide tree.
Inspect the alignment:
When the alignment is done you can inspect the
result by scrolling along the
sequences in the ClustalX window. You will notice that the conservation plot at
the bottom of the window now has several peaks and plateaus
corresponding to the
conserved regions of gp120. An additional summary of conservation is shown above
the sequences. "*" indicates positions which have a single, fully
conserved residue, ":" indicates positions with strong
conservation, and "." indicates positions with weaker
conservation. Don't close ClustalX after inspecting the alignment (you can minimize
it if it clutters the screen).
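To get a feeling for what a per-column conservation score measures, here is a minimal Python sketch. It is not ClustalX's actual scoring scheme: the function name `column_conservation` and the toy sequences are made up for illustration, and the score is simply the fraction of sequences that share the most common residue in each column.

```python
from collections import Counter

def column_conservation(seqs):
    """Return one score per column: the share of sequences carrying the
    most common residue in that column (gaps never count as a residue)."""
    length = max(len(s) for s in seqs)
    scores = []
    for i in range(length):
        # Sequences shorter than the column index are treated as gapped here
        column = [s[i] if i < len(s) else "-" for s in seqs]
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(seqs))
    return scores

# Toy aligned sequences (invented, not taken from gp120.aln)
toy = ["MKTAY", "MKSAY", "MRTA-"]
for position, score in enumerate(column_conservation(toy), start=1):
    print(position, round(score, 2))
```

Columns where all sequences agree get a score of 1.0, which corresponds roughly to the positions marked "*" in ClustalX.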
(FigTree running on Macintosh - pseudo-rooted tree)
Computing an unrooted tree
In this part of the exercise we will use ClustalX to
produce a phylogenetic
tree. The tree is built with the neighbor joining algorithm, and is
based on distances computed from the multiple alignment you just constructed.
Re-open the ClustalX window containing the
alignment from before.
(Don't panic if you accidentally closed the ClustalX
program after the
multiple alignment: start the program as before, and use "File",
"Load sequences" to load the alignment file: "gp120.aln")
Select output of treefile and distance matrix:
From the "Trees" menu choose "Output format options"
and select "PHYLIP
format tree" and "PHYLIP distance matrix". Click "OK".
Construct the tree:
From the "Trees" menu choose "Clustering Algorithm" and make
sure "NJ" is selected. Then, from the "Trees" menu choose "Draw Tree". This
gives you a window where you can change
the name of the tree-file and the distance matrix. Make sure they are
named "gp120.ph" and
"gp120.dst" respectively, and then click "OK" to calculate the tree.
ClustalX uses the
multiple alignment to calculate a distance matrix with all pairwise
distances between the 27
sequences, and then constructs a tree by progressively clustering
sequences that are close to
each other (using the neighbor-joining algorithm). (Note: the
construction of the NJ-tree is
so fast that it is practically finished the second you have clicked "Draw Tree", so there is no
need to sit around and wait for the result).
Inspect the output files:
Use a text editor like jEdit to open gp120.ph.
This treefile is in a text-based format that is
obviously mostly meant for computers. Also look at the file containing
all pairwise distances between all sequences: gp120.dst.
These are the numbers that were used by the neighbor
joining algorithm for
constructing the tree. For each sequence the distances to all other sequences are
listed on a number of consecutive lines (there are too many distances
to fit on
a single line). The format is such that you should imagine a table
where the order on the (unlabeled) horizontal axis is the same as on
the vertical axis (i.e.,
the first sequence is HV1EL, the second is HV1Z2, and so on).
QUESTION 4: Scroll down to the entry for the
sequence named HV2BE.
Note the first and last distances listed for this sequence (these are the distances to
the sequences HV1EL and SIVM1, respectively).
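If you would rather let the computer pick out the numbers, a distance file with wrapped lines can be read with a few lines of Python. This is a minimal sketch under stated assumptions: the first token in the file is the number of sequences, each entry consists of a name followed by that many distances (possibly spread over several lines), and names contain no spaces. The function name `read_phylip_distances` and the toy matrix are invented for illustration.

```python
def read_phylip_distances(text):
    """Parse a square PHYLIP-style distance matrix whose rows may wrap.

    Returns (names, rows), where rows[i][j] is the distance between
    sequence i and sequence j.
    """
    tokens = text.split()
    n = int(tokens[0])  # number of sequences
    names, rows = [], []
    pos = 1
    for _ in range(n):
        names.append(tokens[pos])
        rows.append([float(t) for t in tokens[pos + 1 : pos + 1 + n]])
        pos += 1 + n
    return names, rows

# Toy matrix with a wrapped first row (made-up numbers)
example = """   3
SeqA        0.000 0.100
            0.300
SeqB        0.100 0.000 0.300
SeqC        0.300 0.300 0.000
"""
names, rows = read_phylip_distances(example)
i = names.index("SeqA")
print(names[i], rows[i][0], rows[i][-1])  # SeqA 0.0 0.3
```

Applied to your own file, you would pass the contents of gp120.dst to the function, look up the index of HV2BE in `names`, and read the first and last values of the corresponding row.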
- View a plot of the unrooted tree:
There are several programs for visualizing
tree-files like the PHYLIP-file gp120.ph. Today we will use the program
FigTree version 1.2, which can be downloaded from here:
Download, install and start FigTree
Download the FigTree program corresponding to your operating system.
NOTE: FigTree requires Java to run. If you don't have Java installed, you can get it from here.
Also note that the Java version of FigTree (see Linux/UNIX/SUN above) can actually be used on any operating system with Java installed (including Windows and Mac OSX).
Load the PHYLIP
file in FigTree
Under "Layout" in the column to the left, select the button furthest to
the right with an unrooted tree on it. As our tree has not yet been
rooted it is only appropriate to visualize it as unrooted. You can
enlarge the Tip labels under "Tip labels"->"Font size" if necessary.
On some operating systems, FigTree starts out by asking for a tree file.
If FigTree doesn't ask for a tree file select "Open..." from the "File" menu.
Select the gp120.ph file.
QUESTION 5: Include a screenshot of the tree in your report (or create a separate file for this) and
indicate where the HIV1 cluster, the HIV2 cluster,
the SIVCZ sequence, and the SIVMK sequences are located.
- Think for a minute about the implications:
What does this tree tell us about the phylogenetic
relationship of HIV-1, HIV-2 and SIV? Notice especially where the two
groups of SIV cluster compared to the two different groups of HIV.
When you've thought about the problem, you can read the explanation
that I've prepared. Additionally, you can find a good description in
Diversity and Evolution of Primate Lentiviruses
Bootstrapping a neighbor joining tree
ClustalX can also bootstrap your neighbor joining tree.
Reopen your alignment from before (described above).
In the "Trees" menu choose "Output Format Options", and set "Bootstrap
labels" to NODE. This will allow FigTree (and TreeViewer) to visualize
the bootstrap values, as it does not recognize bootstrap labels in
PHYLIP files where the labels are located on branches.
(Setting the bootstrap output options)
From the "Trees" menu choose "Bootstrap NJ-tree". This gives you a window
where you can change the
number of resampled data sets.
The default is 1000, but you may want to change this to 100 in order
not to wait
too long.
Make sure the tree file will be named "gp120.phb"
and click OK to start the bootstrapping.
The idea in bootstrapping is to assess how much
support there is for
each branch in your tree, based on the data at hand. Imagine you have an
alignment with 135 columns. ClustalX performs bootstrapping by randomly
picking a column 135 times in a row, thereby generating a new resampled
alignment having the same size as the original alignment (135 columns). Each
column may be chosen more than once (this is normally termed "sampling
with replacement"). In the newly generated, resampled alignment some
columns will be present multiple times, while some of the original columns
will be absent.
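The column sampling described above can be sketched in a few lines of Python. This is only an illustration of sampling with replacement, not ClustalX's implementation: the function name `bootstrap_alignment` and the toy alignment are invented, and the alignment is assumed to be a dictionary mapping sequence names to aligned sequences of equal length.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Build one bootstrap replicate by drawing columns with replacement.

    alignment: dict mapping sequence name -> aligned sequence
    rng:       a random.Random instance (pass a seeded one for repeatable results)
    """
    width = len(next(iter(alignment.values())))
    # Draw as many column indices as there are columns; repeats are allowed
    chosen = [rng.randrange(width) for _ in range(width)]
    return {name: "".join(seq[c] for c in chosen) for name, seq in alignment.items()}

# Toy alignment with 6 columns (made-up sequences)
toy = {"s1": "ACDEFG", "s2": "ACDQFG", "s3": "TCDEYG"}
replicate = bootstrap_alignment(toy, random.Random(1))
for name, seq in replicate.items():
    print(name, seq)
```

The same column indices are used for every sequence, so each column of the replicate is an intact column of the original alignment. Only the selection of columns changes from replicate to replicate.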
This sampling with replacement is done a large
number of times (usually
500-1000), so that a lot of resampled alignments are produced. For each
of these alignments a tree is now constructed using neighbor joining.
Finally, a consensus tree is made from the large set of trees. Briefly,
the consensus tree consists of the taxonomic (monophyletic) groups that
occur most frequently in the large set of trees.
The consensus tree that is produced has a number on each
branch indicating how well supported that branch is (larger is better).
Specifically, the number indicates how many of the individual trees in the
large tree-set contained that particular branch. (It is perhaps
useful to think of each internal branch as defining a bi-partition of the
leaves. If two different trees each possess an internal branch that divides
the leaves into, say, one group with leaves A, B, and C, and another group
with leaves D and E, then both trees are said to have the same internal branch.)
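The bi-partition view of a branch translates directly into code. The Python sketch below counts in how many replicate trees a given branch occurs. It is a simplified illustration: each tree is represented only by the leaf groups defined by its internal branches, and the function names and toy trees are invented for the example.

```python
def canonical_split(group, all_leaves):
    """Describe a bi-partition by the side that lacks the alphabetically first leaf,
    so that the same branch is recognized whichever side it was written from."""
    group = frozenset(group)
    anchor = min(all_leaves)
    return frozenset(all_leaves) - group if anchor in group else group

def branch_support_count(split, replicate_trees, all_leaves):
    """Count the replicate trees that contain the given bi-partition.

    replicate_trees: list of trees, each given as a list of leaf groups
                     (one group per internal branch).
    """
    target = canonical_split(split, all_leaves)
    return sum(
        any(canonical_split(g, all_leaves) == target for g in tree)
        for tree in replicate_trees
    )

leaves = {"A", "B", "C", "D", "E"}
replicates = [
    [{"A", "B", "C"}, {"A", "B"}],   # branches ABC|DE and AB|CDE
    [{"D", "E"}, {"B", "C"}],        # branches ABC|DE and BC|ADE
    [{"A", "D"}, {"A", "D", "E"}],   # branches AD|BCE and ADE|BC
]
print(branch_support_count({"A", "B", "C"}, replicates, leaves))  # 2
```

Here the branch separating A, B, C from D, E is found in two of the three toy trees, regardless of which side of the branch was used to describe it.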
It has been found that the variation you get as a
result of this type
of bootstrapping is similar to what you get when collecting new data, and
it therefore gives you an idea about how well supported the individual
branches in your tree are. It is, however, not a real significance or
probability (for that purpose you might want to use Bayesian phylogeny,
which we will not discuss in this course).
View the bootstrapped tree:
Load the gp120.phb file in FigTree, make
sure that "Branch labels" is check-marked to the left and that "Display"
is set to "label" under this menu. You should now be
able to see the tree with the values
attached to all internal branches. Remember that the number tells how
often the data was
divided into the two groups present on either side of the branch.
Please note that this is
still an unrooted tree so you may still want to view it in the unrooted layout.
QUESTION 6: What is the bootstrap value on the
internal branch separating
the HIV1/SIVCZ cluster from the HIV2/SIVMK cluster?
Rooting a tree using an outgroup
In this part of the exercise you will use a data set of
POL-polyprotein sequences isolated from HIV-1, HIV-2, chimpanzee SIV, and sooty
mangabey SIV. The Pol gene encodes three different polypeptides: integrase,
reverse transcriptase, and protease. It is expressed as a single
polyprotein and is
subsequently cleaved by protease into its three separate parts.
First, you will construct a neighbor-joining tree like
before and investigate
whether this new, independent data set confirms the conclusions you
made based on
the alignment of gp120 sequences. Then you will add a POL-polyprotein sequence from
HTLV-1 to the data set and construct a new tree, which you can then root using the
HTLV sequence as an outgroup. (HTLV-1 is another member of the family of
retroviruses and is thus more distantly related to HIV, which was
originally named HTLV-3, by the way.)
- Download and have a look at the POL sequence file:
Download the hiv-siv-pol.fasta
file to the working directory, and inspect the sequences with a text editor.
As mentioned, this file contains POL-polyprotein
sequences from HIV-1, HIV-2,
chimpanzee SIV, and sooty mangabey SIV.
- Align the POL sequences using ClustalX:
Re-open the ClustalX window and load the sequence
file hiv-siv-pol.fasta. Now, start the alignment by
choosing "Do complete
alignment" from the "Alignment" menu as before.
- Construct a neighbor-joining (NJ) tree with no outgroup:
Make sure "Clustering Algorithm" is set to "NJ". From the "Trees" menu choose "Draw Tree". Make
sure the tree will be named "hiv-siv-pol.ph".
- Inspect the unrooted tree:
Open FigTree and load the PHYLIP tree file hiv-siv-pol.ph.
This tree has been constructed from an entirely
independent set of sequences.
Does it support the conclusions that could be made from the gp120-based
tree? (again, remember to use the unrooted layout).
QUESTION 7: Create a screenshot (again you may create a new file for this) indicating
the position of the HIV1-cluster, the HIV2-cluster, the SIVCZ sequence
and the sooty mangabey SIV sequences.
- Have a look at the POL sequence file with an added outgroup:
Download the htlv-hiv-siv-pol.fasta
file to the working directory, and inspect the data set with a text editor.
This file contains the same sequences as the file hiv-siv-pol.fasta
plus an additional
POL-sequence from the
related virus HTLV-1 (the first sequence in this file). The HTLV POL
sequence will be used as an outgroup in this part of the exercise.
- Align the outgroup-containing data set using
ClustalX and create neighbor-joining tree:
Re-open the ClustalX window and load the sequence file htlv-hiv-siv-pol.fasta.
Now, start the alignment by choosing "Do complete
alignment" from the
"Alignment" menu as before.
From the "Trees" menu choose "Draw Tree" (make sure that clustering algorithm is set to "NJ"). Make
sure the tree will be named "htlv-hiv-siv-pol.ph".
- Inspect the unrooted tree:
Load the treefile htlv-hiv-siv-pol.ph in FigTree,
and in the Layout menu choose the button furthest to the right, with
the unrooted tree on it.
Observe how the outgroup HTLV is located quite
far from the other sequences.
- Root the tree by using an outgroup:
We will now construct a
rooted tree, using the HTLV sequence as an outgroup. The outgroup will
be used to place the root of the tree.
The rationale is as follows: our data set consists of sequences from HIV-1, HIV-2, SIV
and HTLV. We know from other evidence that the lineage leading to HTLV
branched off before any of the remaining viruses diverged from each
other. The root of the tree connecting the organisms investigated here
must therefore be located between the HTLV sequence (the "outgroup") and
the rest (the "ingroup"). This way of finding a root is called "outgroup
rooting", and constructs a tree where the outgroup is a monophyletic
sister group to the ingroup.
In FigTree select the branch leading to the HTLV sequence by clicking on it.
In the "Tree" menu choose "Root on Branch...". This places the root on the
branch leading to the HTLV sequence. But we won't see any change since the tree
is being viewed as an unrooted tree.
In the "Layout" option choose the button furthest to the left to view the tree as a rooted tree.
- Have a look at the outgroup-rooted tree:
QUESTION 8: On the screenshot you made before,
indicate which branch the
root is located on. Was this where you expected it?
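The idea behind outgroup rooting can also be expressed as a small Python sketch. It is an illustration only, not what FigTree does internally: the unrooted tree is assumed to be given as a dictionary of neighbours, branch lengths are ignored, and the function name and the neutral leaf names (OUT, A, B, C, D) are invented. The root is placed on the branch leading to the outgroup, and the rest of the tree is then read off as nested groups.

```python
def root_on_outgroup(adjacency, outgroup):
    """Root an unrooted tree on the branch leading to the outgroup leaf.

    adjacency: dict mapping each node -> list of its neighbours
    Returns a nested tuple (outgroup, ingroup_subtree).
    """
    def subtree(node, parent):
        children = [n for n in adjacency[node] if n != parent]
        if not children:  # no neighbours besides the parent: this is a leaf
            return node
        return tuple(subtree(child, node) for child in children)

    (neighbour,) = adjacency[outgroup]  # a leaf has exactly one neighbour
    return (outgroup, subtree(neighbour, outgroup))

# Toy unrooted tree: x, y, z are internal nodes, the rest are leaves
unrooted = {
    "OUT": ["x"],
    "A": ["y"], "B": ["y"],
    "C": ["z"], "D": ["z"],
    "x": ["OUT", "y", "z"],
    "y": ["A", "B", "x"],
    "z": ["C", "D", "x"],
}
print(root_on_outgroup(unrooted, "OUT"))  # ('OUT', (('A', 'B'), ('C', 'D')))
```

In the result the outgroup sits alone on one side of the root and all ingroup sequences form a single group on the other side, which is the situation described in the rationale for outgroup rooting.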
Additional information, online resources