Exercise: Pairwise alignment

Exercise written by: Rasmus Wernersson

In this exercise we will be working with pairwise alignment of protein sequences. As explained in todays lecture pairwise alignment is performed using an algorithm known as Dynamic Programming (DP). In this exercise we will not go into further details about the matematics behind the algorithm, but rather focus on the pratical use of Pairwise Alignment, and how to critically evaluate the result.

Keep the following in mind as we go through the exercise (as mentioned in the lecture):
  • The quality of an alignment is assessed through it's alignment score.

  • The alignment score is calculated based on match/mis-match of the amino-acids via look-ups in an alignment matrix (for example BLOSUM62). The pernalty for introducing gaps in the alignment is governed by two parameters: Gap opening ("expensive" - big pernalty, e.g. 10 points) and Gap elongation ("cheap", e.g. 1 point).

  • The Alignment Matrices is based on emperical data for the BLOSUM-series and simulated data for the PAM-series (we don't cover DNA matrices in this exercise).

  • There are two main variants of the DP algorithm:
    • Needleman-Wunsch (Global alignment - the sequences are forced to be aligned across the entire length).
    • Smith-Waterman (Local alignment - only the best matching part of the sequences are shown).

Embross@ebi webpage

Part 1 - basic use

  1. Open the alignemnt webpage at EBI: http://www.ebi.ac.uk/emboss/align/

  2. Observe that a EBI has a page which offers detailed help of how to the use of the alignemnt program, and how to tweak the parameters. Just click the "Embross align help" (direct link: http://www.ebi.ac.uk/emboss/align/help.html).
    • Notice that quite a few sequence formats are supported (e.g. FASTA).
    • Notice that there is a link to a detailed description of how gaps are handled (direct link: http://www.ebi.ac.uk/help/gaps_frame.html).

  3. For starters will will work with a pair of procaryotic serin-proteases (both retrieved from UniProt). The first one (P29600) is the very same thermo-stabile protease that the company Novozymes sells to the detergent (washing-powder) industry under the trade-name "Savinase". The other sequence is likewise a thermostabile serin-protease, but originates from a different species of Bacillus.
>P29600|SUBS_BACLE Subtilisin Savinase - Bacillus lentus

>P41363|ELYA_BACHD Thermostable alkaline protease precursor - Bacillus halodurans

Question 1: Which sequence format are the two sequences listed in?

  1. Simply paste in the sequences - including the header (one in each searchboxe). Make sure the options are set for global protein alignment using the BLOSUM62 matrix, and submit the alignment job.

    Inspect the result and observe the following:

    • The similarity between the aminoacids is depicted with "|" indicating perfect match and two categories to mis-match: ":" for a "close" mis-match (positive value on the BLOSUM matrix - the aminoacids share some physio-chemical properties) and "." for a "far" mis-match (negative value on the BLOSUM matrix).

    • This particular implementation of the global alignment algorithm ignores gaps at the verys ends of the alignments.

    Question 2: Report the following values / observations from the alignment.
    • Alignment score
    • Alignment length
    • % and fraction Identity (Perfect match only)
    • % and fraction Similarity (Perfect match + "close" mismatch)

  2. Perform the alignment once again (for preference in a new browser tab/window, so it's easy to compare), but this time as a local alignment.

    Question 3:
    • Report the same value as above (Alignment score etc), and conclude on which method YOU belive to produce the most biologically relevant alignment of the two sequences (or do they both contribute valueable information?)

  3. Let's go a bit deeping into why the two sequences differ in the N-terminal part: Look up both entries in UniProt (http://www.uniprot.org) - remember you can turn on extented view to inspect all the annotated information (this type of information is also available in the "flat file" format).

    Question 4:
    • How were the proteins sequenced?

    • Subcelluar localization: Where in (or outside) the cell does the enzymes have thier function?

    • The feature table contains details about the regions/domains of the protein - try and do a comparision to spot the differences between the two UniProt entries (you can safely ignore super-detailed information about secondary structure at this point - e.g. "TURN, HELIX and STRAND").

    Question 5:
    • By using the combined evidence from the alignment and the investigation of the UniProt entries, argue if the P41363 protein can be used for detergents (washing-powder) or not - (why? / why not?).

Part 2 - about gaps and dubious alignments

  1. So far we have been comparing sequences that are relatively closely related. It's now the time to compare our procaryotic Savinase  (P29600) protease with a human peptidase (a Tripeptidyl Peptidase), and see if they are related at all.
>gi|339878|gb|AAA63263.1| tripeptidyl peptidase II

  1. Question 6: Compare Savinase to the human peptidase by global alinment and report the following:
    • Alignment score.
    • Identity and Similarity
    • How large a part of the alignment is gaps?

  2. Question 7: Repeat the alignment - this time using the local alignment algorithm - and report the same results as above.

  3. Question 8:
    • What is you conclusion comparing the gobal and local alignments?
    • Which type of alignemnt will be the best for finding similar parts of distantly related proteins?

  4. Now we need to conclude if we believe the two peptidases are truely evolutionary related. Before discussing confidence intervals of alignment score / homology, let's have a look at how an alignment between two unrealted sequences looks like.

    The sequence below is a UniPrto entry on Alpa Globin  from Sus scrofa (the domestic pig).
>P01965|HBA_PIG Hemoglobin subunit alpha - Sus scrofa
  1.  Align Savinase and the Alpha Globin sequence using global alignment.

    Question 9:
    • How does the alinment look? (Alignment score etc).

  2. Repeat the alignment using local alignment.

    Question 10:
    • How does this alignment look compared to the global alignment?
    • How does the length of the alignment compare to the length of the two proteins we're trying to align?

    • Judging from these two alignments, will you conclude that the sequences are related? (Why / why not).

  3. Question 11: Comparing the Savinase/Alpha Globin alignemnts to the previous Savinase / Human Peptidase alignment - how will you judge the peptidase alignments now? (More/Less confidence?).

  4. In this case it turn out we have access to additional information about the human peptidase. The protein originates from a translation of the GenBank entry: M55169. Look up this entry in GenBank and follow the links to the protein product and see if you can find out if any well-known protein domains (or other relevant information) are found in this entry.

    Question 12:
    • Can you find any additional information that will increase/decrease you confidence in the alignment?

  5. Notice: As a rule of thumb the limit for detecting if two protein sequences are related through a simple pairwise alignment is around 25-30% similarity over at least 100 aminoacids.

  6. Release the Gaps!: As the last part of the comparison of unrealed sequences - let's allow as many gaps as possible, by making it almost "free" for the alorithm in insert gaps.
    • Set the Gap opening penalty to 1.0 (smallest possible value).
    • Set the Gap elongation penalty to 0.1 (smallest possible value).
    • Choose globalt alignment.

  7. Align Savinase + Alpha Globin once more.

    Question 13:
    • How does the quality paramters look this time? (Score, gaps, similarity etc).
    • Is this alignment biological meaningful at all?

Important: remember to reset the alignment parameters before continueing.

Part 3 - alignment matrices

Concerning alignment matrices:

EBI has a webpage with an exellent review of the contruction and use of alignment matrices linked from their alignment page (direct link:  http://www.ebi.ac.uk/help/matrix_frame.html).

Quote from this page - with my higlighting:
"It is assumed that the sequences being sought have an evolutionary ancestral sequence in common with the query sequence. The best guess at the actual path of evolution is the path that requires the fewest evolutionary events. All substitutions are not equally likely and should be weighted to account for this. Insertions and deletions are less likely than substitutions and should be weighted to account for this. It is necessary to consider that the choice of search algorithm influences the sensitivity and selectivity of the search. The choice of similarity matrix determines both the pattern and the extent of substitutions in the sequences the database search is most likely to discover."

Now it's time to experiment with some of the alignment matrices offered for alignment at the EBI. Notice that virtualy all protein alignment programs default to use the BLOSUM62 matrix. This is simply due to the fact that BLOSUM62 is one of the best all-around matrices to use, when no prior knowledge of homology can be assumed.

The numerical part of the name of a BLOSUM matrix is a reference to the % homology of the sequences used to construct the alignment matrix. BLOSUM80 is thus derived from a BLOCK alignment with an avarage homology to 80%.

The PAM matrices is contructed from fundamentaly different data than the BLOSUM matrices (see the webpage linked above for details). It's possible to "convert" between the matrices using the table below. Will will need this later, since the EBI server only offers a limited numer of BLOSUM matrices.

PAM100 ==> Blosum90
PAM120 ==> Blosum80
PAM160 ==> Blosum60
PAM200 ==> Blosum52
PAM250 ==> Blosum45
  1. Let's revisit the alignment of the two prokaryotic proteases (Savinase + P41363). We already know from the first alignments we made using the default matrix, that the sequences are very similair. Let's go for a "higher" matrix this time.

    Construct a new local alignment using the PAM100 matrix.

    Question 14:
    • Report the details about the alignment (score, length, %ident, % similarity)

    • WHY do we get different values now? Is the length affected?

  2. Let's have a look at the more divergent sequences now: Savinase + the human Peptidase.

    Construct a new local alignment using the BLOSUM40 matrix,

    Question 15:

    • Report the alignment details (yet again).

    • Is the alignment significantly different from the one before?
    • Was the effect of changeing the alignment matrix greater for the high-homology sequence set or the low-homology sequence set?