Exercise: Pairwise alignment
Exercise written by: Rasmus
In this exercise we will be working with
pairwise alignment of protein sequences. As explained in todays lecture
pairwise alignment is performed using an algorithm known as Dynamic Programming
(DP). In this
exercise we will not go into further details about the matematics
behind the algorithm, but rather focus on the pratical use of
Pairwise Alignment, and how to critically evaluate the result.
Keep the following in mind as we go through the exercise (as mentioned
in the lecture):
- The quality of an
alignment is assessed through it's alignment
- The alignment score is
calculated based on match/mis-match of the amino-acids via look-ups in
an alignment matrix (for
example BLOSUM62). The pernalty for introducing gaps in the alignment is governed
by two parameters: Gap opening
("expensive" - big pernalty, e.g. 10 points) and Gap elongation ("cheap", e.g. 1
- The Alignment Matrices
is based on emperical data
for the BLOSUM-series and simulated
data for the PAM-series (we don't cover DNA matrices in this
- There are two main variants of the DP algorithm:
- Needleman-Wunsch (Global
alignment - the sequences are forced to be aligned across the
- Smith-Waterman (Local
alignment - only the best matching part of the sequences are
Part 1 - basic use
For starters will will work with a pair of procaryotic serin-proteases (both retrieved
from UniProt). The first one (P29600) is the very same thermo-stabile
protease that the company Novozymes
sells to the detergent
(washing-powder) industry under the trade-name "Savinase". The other sequence is
likewise a thermostabile serin-protease, but originates from a
different species of Bacillus.
- Open the alignemnt webpage at EBI: http://www.ebi.ac.uk/emboss/align/
- Observe that a EBI has a page which offers detailed help of how
to the use of
the alignemnt program, and how to tweak the parameters. Just click the "Embross align help" (direct link: http://www.ebi.ac.uk/emboss/align/help.html).
Subtilisin Savinase - Bacillus lentus
Thermostable alkaline protease precursor - Bacillus halodurans
Question 1: Which sequence
format are the two sequences listed in?
- Simply paste in the sequences - including the header (one in each
searchboxe). Make sure the options are set for global protein alignment using the BLOSUM62 matrix, and submit the
Inspect the result and observe the following:
Question 2: Report the
following values / observations from the alignment.
- The similarity between the aminoacids is depicted with "|" indicating perfect match and two categories to
mis-match: ":" for a "close" mis-match
(positive value on the BLOSUM matrix - the aminoacids share some
physio-chemical properties) and "."
for a "far" mis-match (negative value on the BLOSUM matrix).
- This particular implementation of the global alignment algorithm ignores
gaps at the verys ends of the alignments.
Perform the alignment once again (for preference in a new browser
tab/window, so it's easy to compare), but this time as a local alignment.
- Alignment score
- Alignment length
- % and fraction Identity (Perfect match only)
- % and fraction Similarity (Perfect match + "close" mismatch)
Let's go a bit deeping into why the two sequences differ in the
N-terminal part: Look up both entries in UniProt (http://www.uniprot.org)
- remember you can turn on extented
view to inspect all the annotated information (this type of
information is also available in the "flat
- Report the same value as above (Alignment score etc), and
conclude on which method YOU belive to produce the most biologically relevant alignment of
the two sequences (or do they both contribute valueable information?)
- How were the proteins sequenced?
- Subcelluar localization: Where in (or outside) the cell does
the enzymes have thier function?
- The feature
table contains details about the regions/domains of the protein - try
and do a comparision to spot the differences between the two UniProt
entries (you can safely ignore super-detailed information about
secondary structure at this point - e.g. "TURN, HELIX and STRAND").
- By using the combined
evidence from the alignment and the investigation of the UniProt
entries, argue if the P41363 protein can be used for
detergents (washing-powder) or not - (why? / why not?).
Part 2 - about gaps and dubious
- So far we have been comparing sequences that are relatively
closely related. It's now the time to compare our procaryotic Savinase (P29600) protease with a human peptidase (a
Tripeptidyl Peptidase), and see if they are related at all.
tripeptidyl peptidase II
- Question 6: Compare
Savinase to the human peptidase by global
alinment and report the following:
Question 7: Repeat the
alignment - this time using the local
alignment algorithm - and report the same results as above.
- Alignment score.
- Identity and Similarity
- How large a part of the alignment is gaps?
Now we need to conclude if we believe the two peptidases are
truely evolutionary related.
Before discussing confidence intervals of alignment score / homology,
let's have a look at how an alignment between two unrealted
sequences looks like.
- What is you conclusion comparing the gobal and local alignments?
- Which type of alignemnt will be the best for finding similar
parts of distantly related proteins?
The sequence below is a UniPrto entry on Alpa Globin from Sus scrofa (the domestic pig).
Hemoglobin subunit alpha - Sus scrofa
- Align Savinase and the Alpha Globin sequence using global alignment.
Repeat the alignment using local
- How does the alinment look? (Alignment score etc).
Question 11: Comparing
the Savinase/Alpha Globin alignemnts to the previous Savinase / Human
Peptidase alignment - how will you judge the peptidase alignments now?
- How does this alignment look compared to the global alignment?
- How does the length of
the alignment compare to the length of the two proteins we're trying to
- Judging from these two alignments, will you conclude that the
sequences are related? (Why / why not).
In this case it turn out we have access to additional information about the
human peptidase. The protein originates from a translation of the
GenBank entry: M55169. Look up this entry in GenBank
and follow the
links to the protein product and see if you can find out if any
well-known protein domains (or
other relevant information) are found in this entry.
Notice: As a rule of
thumb the limit for detecting if two protein sequences are related
through a simple pairwise alignment
is around 25-30% similarity
over at least 100 aminoacids.
- Can you find any additional information that will
increase/decrease you confidence in the alignment?
Release the Gaps!: As
the last part of the comparison of unrealed sequences - let's allow as
many gaps as possible, by making it almost "free" for the alorithm in
Align Savinase + Alpha Globin once more.
- Set the Gap opening penalty to 1.0
(smallest possible value).
- Set the Gap elongation penalty to 0.1 (smallest possible value).
- Choose globalt alignment.
- How does the quality paramters look this time? (Score, gaps,
- Is this alignment biological meaningful at all?
remember to reset the alignment parameters before continueing.
Part 3 - alignment matrices
Concerning alignment matrices:
EBI has a webpage with an exellent review of the contruction and use of
alignment matrices linked from their alignment page (direct link:
Quote from this page - with
"It is assumed that the sequences
being sought have an evolutionary ancestral
sequence in common with the
query sequence. The best guess at the actual path of evolution is the
path that requires the fewest evolutionary
events. All substitutions are not equally likely and should be weighted to account for
this. Insertions and deletions are less likely than substitutions and
should be weighted to account for this. It is necessary to consider
that the choice of search algorithm influences the sensitivity and
selectivity of the search. The choice of similarity matrix determines
both the pattern and the extent of substitutions in the sequences the
database search is most likely to discover."
Now it's time to experiment with some of the alignment
matrices offered for alignment at the EBI. Notice that virtualy all
protein alignment programs default to use the BLOSUM62 matrix. This is
simply due to the fact that BLOSUM62 is one of the best all-around
matrices to use, when no prior knowledge of homology can be assumed.
The numerical part of the name of a BLOSUM matrix is a reference to the
% homology of the sequences used to construct the alignment matrix.
BLOSUM80 is thus derived from a BLOCK alignment with an avarage
homology to 80%.
The PAM matrices is contructed from fundamentaly different data than
the BLOSUM matrices (see the webpage linked above for details). It's
possible to "convert" between the matrices using the table below. Will
will need this later, since the EBI server only offers a limited numer
of BLOSUM matrices.
PAM120 ==> Blosum80
PAM160 ==> Blosum60
PAM200 ==> Blosum52
PAM250 ==> Blosum45
- Let's revisit the alignment of the two prokaryotic proteases
(Savinase + P41363). We
already know from the first alignments we made using the default
that the sequences are very similair. Let's go for a "higher" matrix
Construct a new local alignment
using the PAM100 matrix.
Let's have a look at the more divergent sequences now: Savinase +
the human Peptidase.
- Report the details about the alignment (score, length, %ident,
- WHY do we get different values now? Is the length affected?
Construct a new local alignment
using the BLOSUM40 matrix,
- Report the alignment details (yet again).
- Is the alignment significantly different from the one before?
- Was the effect of changeing the alignment matrix greater for
the high-homology sequence set or the low-homology sequence set?