Multiple alignment and phylogeny
Anders Gorm Pedersen (firstname.lastname@example.org)
This exercise has two parts. During the first (fairly small) part you will:
- Compare the performance of 4 different multiple alignment methods (mafft, muscle, dialign, clustalw)
for aligning a set of proteins
During the second part you will:
- Perform a multiple alignment of gp120 protein sequences from HIV and SIV
- Construct an unrooted tree from the alignment of gp120 sequences
(using the "neighbor joining" algorithm in ClustalX)
- Visualize the gp120-based tree using the program njplot
- Consider the evolutionary implications of the gp120-based tree.
- Investigate the robustness of your tree by bootstrapping
- Perform a second multiple alignment based on POL protein sequences from HIV and SIV.
- Construct a new neighbor joining tree from the POL alignment
- Investigate whether the POL-based tree supports the conclusions from the gp120-based analysis
- Perform a multiple alignment of the same POL sequences and a POL
sequence from HTLV-1.
- Root the POL-based tree using the HTLV sequence as an outgroup
Comparison of multiple alignment programs
In this part of the exersize you will use a data set of 11 alternatively-spliced gene
products from the human erythrocyte membrane protein band 4.1 (EPB). In addition to the
unaligned data set we also have a pre-aligned optimal alignment. The goal of this
short exercise is to compare how well four different popular multiple alignment programs
perform when attempting to align a set of proteins that are identical except for having
- Log on to CBS system:
Using what you learned yesterday, log on to one the CBS computers (organism or cell).
- Construct working directory, copy files:
cp -r /home/people/gorm/malignment ./malignment
- Log on to other computer where programs are installed, cd to malignment dir:
The alignment programs we will use are all installed on a computer called interaction, so
the first step is to log on to this machine using secure shell:
ssh -Y interaction
(Use the same password as for organism and cell). You can get away with a much shorter version
of the ssh command since you are already logged in on one of our machines.
- Align the EPB sequences using mafft:
mafft EPB.fasta > EPB_mafft_aligned.fasta
- Align the EPB sequences using muscle:
muscle -in EPB.fasta -out EPB_muscle_aligned.fasta
- Align the EPB sequences using dialign:
dialign2-2 -fn EPB_dialign_aligned -fa EPB.fasta
This method will take slightly longer to finish. Wait for the prompt to re-appear.
- Align the EPB sequences using clustalw:
- Log out of "ibiology" and back on to your original machine:
You are now back on organism or cell where you started today's exercises.
Make sure you're still in the malignment dir, or cd there if you're not (you can do it!)
- Have a look at the original, unaligned data:
clustalx EPB.fasta &
This is the unaligned set of sequences. The sequences are displayed on the screen with
names on the left hand side, and the sequences themselves on the right. Residues are colored
according to amino acid characteristics. It is possible to resize the window by "pulling" at
the edges, so you can fit all lines in one window. Make sure to pull the window so it's as
wide as possible. A scroll-bar at the bottom of the window allows you to move along the
alignment (the sequences are too long to fit in the horizontal direction).
Beneath the sequences there is a ruler starting at 1 for the
first residue position. Below this is a graphical indication of the degree of
conservation in each column of the alignment. A high score indicates a
well-conserved column; a low score indicates low conservation. Since these
sequences are not aligned (they are all just lined up starting with
their first residues), most values are quite low.
NOTE: In this part of the exercise we are only using clustalx as a
sequence/alignment viewer, not as an alignment program.
- Have a look at the optimally aligned data:
clustalx EPB_IDEAL_alignment.fasta &
This is the optimally aligned alignment. Again drag your window and scroll along to see the entire
alignment. Keep this window hanging around somewhere so you can compare the
other alignments to it.
- Have a look at the four different alignments:
Note that clustalx windows have no indication of what file you have opened, so it can be
difficult to tell them apart once they are opened. You may want to open one of these at a time,
compare it to the ideal alignment, and then close it again.
clustalx EPB.aln &
clustalx EPB_dialign_aligned.fa &
clustalx EPB_muscle_aligned.fasta &
clustalx EPB_mafft_aligned.fasta &
Q1: Which, if any, of the four alignment methods got the alignment entirely
You should note that this was just one particular form of test. On a different problem
the relative performance of the alignment methods could well be different. However, you should also
note that this was a fairly simple problem, and one where you could easily see artefacts. That will
not be the case for most real biological data sets.
Phylogeny of HIV
Background: AIDS, HIV1, HIV2, and SIV
Acquired Immune Deficiency Syndrome (AIDS) is caused by two divergent viruses, Human Immunodeficiency
Virus one (HIV-1) and Human Immunodeficiency Virus two (HIV-2). HIV-1 is responsible for the global pandemic,
while HIV-2 has, until recently, been restricted to West Africa and appears to be less virulent in its effects.
Viruses related to HIV have been found in many species of non-human primates (monkeys, apes, ...) and have been named Simian
Immunodeficiency Virus, SIV.
These primate viruses are lentiviruses, a subfamily of the retroviruses. Retroviruses have RNA genomes but
are unique among RNA viruses because they have a replication cycle that involves the reverse transcription of
their RNA genome into DNA (this is the opposite direction compared to the usual flow of information from DNA to RNA).
The reverse-transcribed viral DNA is stably incorporated into the genomic DNA
of an infected cell and subsequent transcription can then create multiple
copies of mRNA encoding new viral material.
Like other retroviruses, particles of HIV are made up of 2 copies of the single-stranded RNA genome
packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for
the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived
from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of
HIV: gp120; and, gp41. The gp120 protein is crucial for binding of the virus particle to target cells. It is the
specific affinity of gp120 for the CD4 protein that targets HIV to those cells of
the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).
Purpose of exercise, description of data
In this part of the exercise you are going to investigate the phylogenetic relationship
between HIV and SIV. You will do this using two different data sets:
- a set consisting of 27 different gp120 protein sequences from isolates of
HIV1, HIV2, chimpanzee SIV and macaque monkey SIV.
- a set consisting of 20 different POL-polyprotein sequences from HIV1, HIV2,
chimpanzee SIV and sooty mangabey SIV.
(Note for enthusiasts: a number of lines of evidence have indicated that
macaques are not naturally infected with SIV and that they have acquired their SIV
infection while in captivity by cross-species transmission of SIV from sooty
mangabeys. This means that both the macaque SIVs and the sooty mangabey SIVs
originate from sooty mangabeys).
Have a look at what else today's data directory contains:
First, we will use the file gp120.fasta, which contains 27 gp120 envelope
protein sequences from isolates of HIV-1, HIV-2, and SIV in fasta-format. Take
a look at its contents:
nedit gp120.fasta &
In this file, all HIV-1 sequences have names starting HV1. All HIV-2 sequences
have names starting HV2. SIVCZ was isolated from chimpanzee. SIVMK, SIVM1, and
SIVML were isolated from macaques.
We will use the program ClustalX to make a multiple alignment. (ClustalX
is actually a graphical front-end to the command line-based program CLUSTALW that you
used before.), and take the opportunity to have a closer look at some of the details.
The program is interactive with a typical, relatively user friendly,
windows-style interface. ClustalX has online help available from the pull-down
Start ClustalX with the following command (be sure to include the
"&" which will make the program run in the background, so we can start
more programs from the same shell window.)
This opens the program in a separate window.
The first thing you have to do is load the sequences. In the "File" menu
choose "Load sequences", and select gp120.fasta from the file list
that appears. Again, make sure to pull the window so it is as wide as possible,
and scroll along to see the entire length of the sequences.
Have a look at the alignment parameters:
From the "Alignment" pull-down menu choose "Alignment parameters", and then
"Pairwise alignment parameters". In this window you are able to change
gap-penalties and substitution matrix for the initial, pairwise part of the
Q2: Note the substitution matrix, the gap opening penalty and the
gap elongation penalty.
You can also specify whether you want the pairwise alignments performed using a
slower but accurate method, or a faster but approximate method. For now just leave
everything as is. Exit the window by clicking the tab labeled "Close".
There is a similar window for changing the multiple alignment parameters
("Alignment", "Alignment parameters", "Multiple alignment parameters".) Have a
look, but keep the default values for now.
Finally, there is also a window for changing a special set of gap parameters
used by ClustalX ("Alignment", "Alignment parameters", "Protein gap
Q3: Check what the parameters are set to in the "Protein gap
parameters" window and note the values. Explanations of the
various parameters are listed below
- RESIDUE SPECIFIC PENALTIES: are amino acid specific gap penalties that reduce
or increase the gap opening penalties at each position in the alignment or
sequence. See the documentation for details. As an example, positions that
are rich in glycine are more likely to have an adjacent gap than positions that
are rich in valine.
- HYDROPHILIC GAP PENALTIES are used to increase the chances of a gap within
a run (5 or more residues) of hydrophilic amino acids; these are likely to
be loop or random coil regions where gaps are more common. The residues that
are "considered" to be hydrophilic can be entered in HYDROPHILIC RESIDUES.
- GAP SEPARATION DISTANCE tries to decrease the chances of gaps being too close
to each other. Gaps that are less than this distance apart are penalised more
than other gaps. This does not prevent close gaps; it makes them less frequent,
promoting a block-like appearance of the alignment.
- END GAP SEPARATION treats end gaps just like internal gaps for the purposes of
avoiding gaps that are too close (set by GAP SEPARATION DISTANCE above). If
you turn this off, end gaps will be ignored for this purpose. This is useful
when you wish to align fragments where the end gaps are not biologically
Start the multiple alignment:
From the "Alignment" menu choose "Do complete alignment". This opens a
window giving you the opportunity to rename output files. Accept the suggested
names and start the alignment by clicking the tab labeled "Align".
You may be able to get a glimpse of how ClustalX is working in the bottom
part of the window: first, it does all the pairwise alignments.
These alignments are then used to construct a "guide
tree". (The guide tree should not be confused with the phylogenetic tree we
will construct later. The guide tree is entirely based on the pairwise
alignments, and is used to guide the construction of the multiple alignment).
Finally, ClustalX constructs the multiple alignment by progressively "aligning
algnments" following the guide tree.
Inspect the alignment:
When the alignment is done you can inspect the result by scrolling along the
sequences in the ClustalX window. You will notice that the conservation graph at
the bottom of the window now has several peaks and plateaus corresponding to the
conserved regions of gp120. An additional summary of conservation is given above
the sequences. "*" indicates positions which have a single, fully
conserved residue, ":" indicates positions with strong functional
conservation, and "." indicates positions with weaker functional
conservation. Don't close ClustalX after inspecting the alignment (you can iconize
it if it clutters the screen).
Computing an unrooted tree
In this part of the exercise we will use ClustalX to produce a phylogenetic
tree. The tree is built with the neighbour joining algorithm, and is
based on distances computed from the multiple alignment you just
Select output of treefile and distance matrix:
From the "Trees" menu choose "Output format options" and select "PHYLIP
format tree" and "PHYLIP distance matrix". Exit the window by clicking the
tab labeled "close".
Construct the tree:
From the "Trees" menu choose "Draw N-J tree". This gives you a window where you can change
the name of the tree-file and the distance matrix. Make sure they are named "gp120.ph" and
"gp120.phdst" respectively, and tnen click "OK" to calculate the tree. ClustalX uses the
multiple alignment to calculate a distance matrix with all pairwise distances between the 27
sequences, and then constructs a tree by progressively clustering sequences that are close to
each other (using the neighbor-joining algorithm). (Note: the construction of the NJ-tree is
so fast that its practically finished the second you have clicked "Draw N-J tree" - so no
need to sit around and wait for the result).
Inspect the output files:
nedit gp120.ph &
This treefile is in the text-based format that we discussed in class, where tree structure is
represented by nested parentheses and commadelimited names. The format is
obviously mostly meant for computers. Also look at the file containing
all pairwise distances between all sequences:
nedit gp120.phdst &
These are the numbers that were used by the neighbor joining algorithm for
constructing the tree. For each sequence the distance to all other sequences are
listed on a number of consecutive lines (there are too many distances to fit on
a single line). The format is such that you should imagine a table where the
order on the (unlabeled) horizontal axis is the same as on the vertical axis (i.e.,
the first sequence is HV1EL, the second is HV1Z2, etc.)
Q4: Scroll down to the entry for the sequence named HV2BE.
Note the first and last distances (these are the distances to
the sequences HV1EL and SIVM1, respectively).
- View a plot of the unrooted tree:
unrooted gp120.ph &
The UNIX-program unrooted is a very simple viewer for
unrooted trees. (Note: there are several programs for viewing trees, and
constructing tree plots. One good free one is
Q5: include a screendump of the tree, or simply describe using words
how the HIV1 cluster, the HIV2 cluster,
the SIVCZ sequence, and the SIVMK sequences are located with respect to each other).
- Think for a minute about the implications:
What does this tree tell us about the phylogenetic
relationship of HIV-1, HIV-2 and SIV? Notice especially where the two different
groups of SIV cluster compared to the two different groups of HIV.
When you've thought about the problem, you can read a
that I've prepared.
Bootstrapping a neighbor joining tree
ClustalX also has the possibility of bootstrapping your neighbor
Reopen your alignment from before (described above) and in
the Trees menu choose "Bootstrap NJ-tree".
This gives you a window where you can change the number of resampled data sets.
The default is 1000, but you can change this to a lower number if you become
Make sure the tree file will be named "gp120.phb" and click OK to start the
The idea in bootstrapping is to assess how much support there is for
each branch in your tree, based on the data at hand. Imagine you have an
alignment with 135 columns. ClustalX performs bootstrapping by randomly
picking one of the columns 135 times in a row, thereby generating a new permuted
alignment having the same size as the original alignment (135 columns). A
column may be chosen more than once (this is normally termed "sampling
with replacement"). In the newly generated, permuted alignment some
columns will be present multiple times, while some of the original columns
will be absent.
This sampling with replacement is done a large number of times (usually
500-1000), so that a lot of permuted alignments are produced. For each
of these alignments a tree is now constructed using neighborjoining.
Finally, a consensus tree is made from the large set of trees. Basically
the consensus tree consists of the monophyletic groups that
occur most frequently in the large set of trees.
The consensus tree that is produced has a number on each internal
branch indicating how well supported that branch is (larger is better).
Specifically, the number indicates how many of the individual trees in the
large tree-set that contained that particular branch. (It is perhaps
useful to think of each internal branch as defining a bi-partition of the
leafs. If two different trees each possess an internal branch that divides
the leafs into, say, one group with leaf A, B, and C, and another group
with leaf D and E, then both trees are said to have the same internal
It has been found that the variation you get as a result of this type
of bootstrapping is similar to what you get when collecting new data, and
it therefore gives you an idea about how well supported the individual
branches in your tree are. It is, however, not a real significance or
probability (for that purpose you might want to use Bayesian phylogeny
which we will, however, not discuss in this course).
View the bootstrapped tree:
njplot gp120.phb &
Select "Bootstrap values", and you should now be able to see the tree with the values
attached to all internal branches. Remember that the number tells how often the data was
divided into the two groups present on either side of the branch. Please note that this is
still an unrooted tree, you are just using a rooted viewer since that is the only one we have
easily available that will display bootstrap values....
Q6: What is the bootstrap value on the internal branch separating
the HIV1/SIVCZ cluster from the HIV2/SIVMK cluster?
Rooting a tree using an outgroup
In this part of the exercise you will use a data set of 20 different
POL-polyprotein sequences isolated from HIV-1, HIV-2, chimpanzee SIV, and sooty
mangabey SIV. (The Pol gene encodes three different polypeptides: integrase,
reverse transcriptase, and protease. It is expressed as a single polyprotein and is
subsequently cleaved by protease into its three separate parts).
First, you will construct a neighbor-joining tree like before and investigate
whether this new, independent data set confirms the conclusions you made based on
the alignment of gp120 sequences. Then you will add a POL-polyprotein sequence from
HTLV-1 to the data set and construct a new tree, that you can then root using the
HTLV sequence as an outgroup. (HTLV-1 is another member of the family of
retroviruses and is thus more distantly related to HIV - which was originally named
HTLV-3 by the way)
- Have a look at the POL sequence file:
nedit hiv-siv-pol.fasta &
As mentioned, this file contains POL-polyprotein sequences from HIV-1, HIV-2,
chimpanzee SIV, and sooty mangabey SIV.
- Align the POL sequences using ClustalX:
Re-open the ClustalX window and load the sequence file
hiv-siv-pol.fasta. Now, start the alignment by choosing "Do complete
alignment" from the "Alignment" menu as before.
- Construct a neighbor-joining tree with no outgroup:
From the "Trees" menu choose "Draw N-J tree". Make sure the tree will be named
- Inspect the unrooted tree:
unrooted hiv-siv-pol.ph &
This tree has been constructed from an entirely independent set of sequences.
Does it support the conclusions that could be made from the gp120-based tree?
Q7: Include a screendump of the POL-based tree (or again, just briefly describe
the position of the HIV1-cluster, the HIV2-cluster, the HIVCZ sequence and
the HIVSmanga sequences).
- Have a look at the POL sequence file with an added outgroup:
nedit htlv-hiv-siv-pol.fasta &
This file contains the same sequences as the file
hiv-siv-pol.fasta plus an additional POL-sequence from the
related virus HTLV-1 (the first sequence in this file). The HTLV POL
sequence will be used as an outgroup in this part of the exercise.
- Align the outgroup-containing data set using ClustalX:
Re-open the ClustalX window and load the sequence file
From the "Alignment" pull-down menu choose "Output format options", de-select
"CLUSTAL format" and select "NBRF/PIR format". This format will be used by the
PAUP* program that we will use later.
Now, start the alignment by choosing "Do complete alignment" from the
"Alignment" menu as before. Make sure the output file will be named
Quit ClustalX when you are done: we will now use the program PAUP*
for constructing trees with the outgroup.
- Start the PAUP* program:
This will give you a prompt ("paup>") where you can write commands to
- Convert data file to nexus format:
- Load the nexus file:
This command loads the sequences from the nexus file into memory.
- Select distance-based tree-reconstruction:
PAUP* can perform many different types of analyses and reconstruct phylogenies
using several different methods. Here we will use neighbor-joining which is a
distance matrix method.
- Construct an unrooted neighbor joining tree:
- Save tree to file:
savetrees file=unrooted-htlvtree.ph format=phylip
- Clear current tree from memory:
- Define outgroup:
We will now use the same data for constructing a rooted tree, using
the HTLV sequence as a way of defining where to place the root.
This defines an outgroup consisting of the HTLV sequence. ("HTLV" is the actual
name found in the htlv-hiv-siv-pol.nexus file).
The outgroup will be used to place the root of the tree. The rationale
is as follows: our data set consists of sequences from HIV-1, HIV-2, SIV
and HTLV. We know from other evidence that the lineage leading to HTLV
branched off before any of the remaining viruses diverged from each
other. The root of the tree connecting the organisms investigated here,
must therefore be located between the HTLV sequence (the "outgroup") and
the rest (the "ingroup"). This way of finding a root is called "outgroup
- Set outgroup rooting and select options for tree
set root=outgroup outroot=monophyl
This makes PAUP* use the outgroup we defined above for the purpose
of rooting the tree. The "outroot=monophyl" command makes PAUP
construct a tree where the outgroup is a monophyletic sister group to
- Construct a rooted neighbor joining tree:
Again, we are constructing a neighbor joining tree, but this time we
have defined how to root it.
- Save the rooted tree to file:
savetrees file=rooted-htlvtree.ph format=phylip
- Quit program:
- Inspect the unrooted tree:
unrooted unrooted-htlvtree.ph &
Observe how the outgroup HTLV is located quite distantly from the other
- Have a look at the outgroup-rooted tree:
njplot rooted-htlvtree.ph &
In this tree we have used the outgroup to place the root.
Q8: Compare the location of the root to the unrooted tree you constructed earlier.
Does this rooting support the (slightly premature) interpretation of the evolution of HIV that we presented earlier?