Pairwise alignment exercises

Global alignment

Below, you see two protein sequences.  They are both globins from a midge, Chironomus thummi thummi, a very small and annoying insect.

>GLB7_CHITH 145  P02226 GLOBIN CTT-VIIA. 
APLSADQASLVKSTWAQVRNSEVEILAAVFTAYPDIQARFPQFAGKDVAS
IKDTGAFATHAGRIVGFVSEIIALIGNESNAPAVQTLVGQLAASHKARGI
SQAQFNEFRAGLVSYVSSNVAWNAAAESAWTAGLDNIFGLLFAAL

>GLBP_CHITH 152  P11582 GLOBIN CTT-E/E' PRECURSOR. 
MKFIILALCVAAASALSGDQIGLVQSTYGKVKGDSVGILYAVFKADPTIQ
AAFPQFVGKDLDAIKGGAEFSTHAGRIVGFLGGVIDDLPNIGKHVDALVA
THKPRGVTHAQFNNFRAAFIAYLKGHVDYTAAVEAAWGATFDAFFGAVFA
KM

These sequences are in the FASTA format, a very widely used input format for bioinformatics programs: a line beginning with ">" contains the name of a sequence plus optional comments, while the following lines, up to the next ">", contain the sequence itself.
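If you would like to handle FASTA files in your own scripts, the format rule described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name parse_fasta is made up for this example, and real-world files may need more careful handling (for instance of unusual line endings or comment conventions).

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            # A new header line closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence lines are concatenated until the next ">" line.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>GLB7_CHITH 145  P02226 GLOBIN CTT-VIIA.
APLSADQASLVKSTWAQVRNSEVEILAAVFTAYPDIQARFPQFAGKDVAS
IKDTGAFATHAGRIVGFVSEIIALIGNESNAPAVQTLVGQLAASHKARGI
SQAQFNEFRAGLVSYVSSNVAWNAAAESAWTAGLDNIFGLLFAAL
"""

for header, seq in parse_fasta(example):
    name = header.split()[0]  # the first word of the header is the sequence name
    print(name, len(seq))
```

The printed length can be compared with the number given in the header line as a quick check that the sequence was copied completely.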

Do a global alignment of these two protein sequences, using the ALIGN service at the GENESTREAM network server, IGH, Montpellier, France.  Hint: you can copy the sequences and sequence names from this page and paste them into the input windows at the French site.

Take a look at the result.  Note that there is a gap in GLBP_CHITH.  What is the corresponding sequence in GLB7_CHITH?  This is an authentic example (nature truly is fascinating...).  If you don't believe me, retrieve the original database entries for GLBP_CHITH and GLB7_CHITH from the SWISS-PROT database.
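To see what a global alignment program does behind the scenes, here is a minimal sketch of the classic Needleman-Wunsch dynamic programming method. It is not the implementation used by the ALIGN server: as a simplifying assumption it scores with a plain match/mismatch value and a linear gap penalty, whereas protein alignment servers typically use a substitution matrix and separate gap-open and gap-extension penalties. The function name global_align and the default scores are chosen for this example only.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a

    # Trace back from the bottom-right corner to the top-left corner.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


# Two short fragments copied from GLB7_CHITH and GLBP_CHITH above.
s, top, bottom = global_align("THAGRIVGFVSEIIALIGNESNAPAV",
                              "THAGRIVGFLGGVIDDLPNIGKHVDALV")
print(s)
print(top)
print(bottom)
```

Because a global alignment must span both sequences from end to end, every residue appears in the output, paired either with a residue or with a gap character.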

Local alignment

Now try a local alignment of the same two sequences, using the LALIGN service instead.  Compare the output with that of ALIGN. You will get the ten best-scoring local alignments, sorted by decreasing similarity score.  Instead of LALIGN, you can also try the SIM alignment tool for protein sequences, or the ACNUC/LFasta tool for nucleic acid sequences.
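The difference between global and local alignment can be illustrated with a minimal sketch of the Smith-Waterman method. Again, this is not the code behind LALIGN or SIM: it assumes simple match/mismatch scores and a linear gap penalty, and, unlike LALIGN, which reports the ten best-scoring local alignments, it returns only the single best one. The name local_align and the default scores are made up for this example.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b) for one best-scoring local alignment.
    """
    n, m = len(a), len(b)
    # score[i][j] is the best score of a local alignment ending at a[i-1], b[j-1].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                          # start a new alignment here
                              score[i - 1][j - 1] + sub,  # extend diagonally
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until the score drops to zero.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# Only the well-matching core is reported; the flanking residues are left out.
print(local_align("TTTACGTTT", "GGACGGG"))
```

The two differences from the global method are the zero in the maximum, which lets an alignment start anywhere, and the traceback, which starts at the best cell rather than at the corner of the table.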

More examples of pairwise alignments

Repeat the above procedure, using ALIGN and LALIGN (or SIM or ACNUC/LFasta) on the following pairs of protein and nucleotide sequences.  Below, sequences in FASTA format are included as links to local files.

GLPA_HUMAN and GLP_HORSE
(Glycophorin from Man and his Horse).  Note that the overall similarity is lower than between the two midge proteins.  How does that affect the relationship between global and local alignment?

CEPPINS and ATRINS_cDNA
Insulin gene and cDNA (coding part only) from two monkey species.  Note the intron.

HUMHBA4 and HSAGL1
One of these sequences is much longer than the other.  Can you see what the relationship is between these two?  You can find the answer in the database annotations for HUMHBA4 and HSAGL1.  Are both global and local alignments relevant in this case?

Matrix and gap penalty choice

At the SIM and ACNUC/LFasta servers, you can set the gap penalties, and at SIM you can also choose between a number of PAM and BLOSUM substitution matrices.  Try repeating some of the local alignments you have just made while varying these alignment parameters.  Observe how the alignment length and % identity of the local alignments depend on the matrix entropy (the relative entropy of the substitution matrix, i.e. the average information per aligned position).  Tip: To avoid drowning in output, you can set the number of alignments to be computed to just 1.
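If the term "matrix entropy" is unfamiliar, the following sketch may help. It assumes the term refers to the relative entropy H of a log-odds substitution matrix, H = sum over all residue pairs of q(a,b) * log2( q(a,b) / (p(a) * p(b)) ), where q(a,b) is the frequency with which a and b are paired in true alignments and p(a) is the background frequency of a. The two-letter alphabet and its frequencies below are toy values invented for illustration; they are not taken from any real PAM or BLOSUM matrix.

```python
import math


def relative_entropy(q, p):
    """Relative entropy H, in bits per aligned pair, of a substitution model.

    q maps an ordered residue pair (a, b) to its target frequency;
    p maps a residue to its background frequency.
    """
    h = 0.0
    for (a, b), q_ab in q.items():
        if q_ab > 0:  # pairs that never occur contribute nothing
            h += q_ab * math.log2(q_ab / (p[a] * p[b]))
    return h


def toy_model(identity):
    """Hypothetical two-letter model in which aligned pairs are identical
    with probability `identity` (toy values, for illustration only)."""
    same = identity / 2
    diff = (1 - identity) / 2
    q = {("A", "A"): same, ("B", "B"): same,
         ("A", "B"): diff, ("B", "A"): diff}
    p = {"A": 0.5, "B": 0.5}
    return q, p


for identity in (0.5, 0.7, 0.9, 0.99):
    q, p = toy_model(identity)
    print(identity, round(relative_entropy(q, p), 3))
```

In this toy model the entropy grows as the expected identity of aligned pairs grows. Real matrices behave similarly: those built for closely related sequences (low-numbered PAM, high-numbered BLOSUM) have higher entropy and tend to give shorter local alignments with higher % identity, which is the trend to look for in the exercise.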