
Pairwise alignments


Your first alignment

  1. Do a global alignment of two protein sequences:

    align GLBE_CHITH.aa GLB7A_CHITH.aa

    These sequences are two globins from a midge, a very small and annoying insect.

    Q1: Note the length of the longest sequence, the percent identity, and the alignment score for the alignment. Also note that there is a gap in GLBE_CHITH.aa. What is the corresponding six-amino-acid sequence in GLB7A_CHITH.aa? Make a note of the sequence. This is an authentic example! Nature truly is fascinating...

  2. Try local alignment:

    lalign GLBE_CHITH.aa GLB7A_CHITH.aa

    and compare the output with that of align. You will get the ten best-scoring local alignments (10 is the default), sorted by decreasing similarity score. The output will fill several screens. You can "pipe" the output into the program "less" and view it one page at a time:

    lalign GLBE_CHITH.aa GLB7A_CHITH.aa | less

    (Press the spacebar to go forward, "b" to go backward, and "q" to quit.) Alternatively, you can control the number of local alignments shown by appending a number as the third argument to lalign, e.g.

    lalign GLBE_CHITH.aa GLB7A_CHITH.aa 1

    will give you only the best local alignment.

    Q2: Note the length of the overlap, the percent identity, and the alignment score for the best local alignment.

  3. Make a dot plot:

    The program plalign makes dot plots in the PostScript format.

    plalign GLBE_CHITH.aa GLB7A_CHITH.aa > globin-dotplot.ps

    The PostScript output is saved to the file globin-dotplot.ps. You can now view the PostScript output with the command:

    gv globin-dotplot.ps &
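The difference between the global alignment reported by align and the local alignments reported by lalign can be sketched with a small dynamic-programming example. This is only an illustration in Python, not the algorithm as implemented in either program: the function name alignment_score, the match/mismatch scores, and the linear gap penalty are assumptions made for the sketch (the programs use a substitution matrix and separate gap opening and elongation penalties), and only the score is returned, not the alignment itself.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score by dynamic programming with a linear gap penalty.

    local=False: global alignment, both sequences aligned end to end.
    local=True:  local alignment, best-scoring pair of subsequences.
    The scoring values are toy assumptions, not BLOSUM50.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # In a global alignment, leading gaps must be paid for.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            if local:
                # A local alignment may start anywhere, so never go below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell

    return best if local else score[-1][-1]


# Two toy sequences that share only the segment "ACGT".
seq1 = "TTTTACGT"
seq2 = "ACGTCCCC"
print("global:", alignment_score(seq1, seq2))
print("local: ", alignment_score(seq1, seq2, local=True))
```

For sequences that are similar over their whole length the two scores are close, whereas for sequences that share only a short region the local score stays high while the global score drops.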


More examples of pairwise alignments of protein and nucleotide sequences

Repeat the above procedure, using align, lalign, and plalign, on the following pairs of sequences instead of GLBE_CHITH.aa and GLB7A_CHITH.aa:

GLPA_HUMAN.aa and GLP_HORSE.aa

Glycophorin from human and horse.

align GLPA_HUMAN.aa GLP_HORSE.aa

Q3: Note the length of the longest sequence, the percent identity, and the alignment score for the global GLP-alignment.

lalign GLPA_HUMAN.aa GLP_HORSE.aa

Q4: Note the length of the overlap, the percent identity, and the alignment score for the best local GLP-alignment.

plalign GLPA_HUMAN.aa GLP_HORSE.aa > GLP-dotplot.ps
...and then (of course) view the result with gv, like before.

Note that the overall similarity is lower than between the two midge proteins. How does that affect the relationship between global and local alignment?

X61092.nuc and BT006808.nuc

Insulin gene and cDNA (coding part only) from a monkey and a human. Note the intron.

align X61092.nuc BT006808.nuc

Q5: Note the start and stop positions of the intron in the monkey sequence.

lalign X61092.nuc BT006808.nuc

Q6: Note the start and stop positions of the two best local alignments relative to the monkey sequence.

plalign X61092.nuc BT006808.nuc > insulin-dotplot.ps
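To see why an intron shows up in a dot plot as a break in the diagonal, consider the following Python sketch. It is not how plalign works (plalign plots local alignments); it simply marks every position where a short word is identical in both sequences. The function name dot_plot, the word length, and the toy exon and intron sequences are made up for the illustration.

```python
def dot_plot(seq_x, seq_y, word=3):
    """Return one text row per position in seq_y.

    A '*' in row i, column j means seq_x[j:j+word] == seq_y[i:i+word].
    """
    rows = []
    for i in range(len(seq_y) - word + 1):
        row = ""
        for j in range(len(seq_x) - word + 1):
            row += "*" if seq_x[j:j + word] == seq_y[i:i + word] else "."
        rows.append(row)
    return rows


# Made-up toy sequences: a "gene" with one intron and the matching "cDNA".
exon1 = "ATGGCCCTGTGG"
intron = "GTAAGTTTTCAG"
exon2 = "CGCCTGCTGAAC"
gene = exon1 + intron + exon2   # plotted along the x axis (columns)
cdna = exon1 + exon2            # plotted along the y axis (rows)

for row in dot_plot(gene, cdna):
    print(row)
```

The output shows two diagonal segments, one per exon. The second segment is shifted to the right by the length of the intron, which is the same pattern to look for in insulin-dotplot.ps.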

HUMHBA4.nuc and V00488.nuc

align HUMHBA4.nuc V00488.nuc

Q7: Note the length of the two sequences.

Q8: One of these sequences is much longer than the other. Can you see what the relationship is between these two? You can find the answer in the database annotations by searching for HUMHBA4 and V00488 in GenBank. Specifically, read the DEFINITION field in the two database entries and write a brief explanation of what the two sequences are.

lalign HUMHBA4.nuc V00488.nuc

Q9: Note the position of the best local alignment in HUMHBA4 numbering. What does the best local alignment correspond to? (In the GenBank entry for HUMHBA4, scroll down to the feature table and look up the corresponding part of the sequence.)

plalign HUMHBA4.nuc V00488.nuc > HBA-dotplot.ps


Impact of Substitution Matrix

In your working directory, there are also a number of substitution matrix files with the extension .mat. These can be used instead of the default BLOSUM50 matrix in the local alignments. To select the BLOSUM30 matrix, for instance, write:

lalign -s blosum30.mat GLPA_HUMAN.aa GLP_HORSE.aa 1

(The "1" at the end makes lalign show only the best alignment - we recommend that you use this setting in this part of the exercise).

Compare the lalign results you obtain with blosum30, blosum50, and blosum90 on the glycophorin sequences:

Q10: For each of the matrices below, note alignment length, percent identity, and alignment score of the best local alignment.

lalign -s blosum30.mat GLPA_HUMAN.aa GLP_HORSE.aa 1

lalign -s blosum50.mat GLPA_HUMAN.aa GLP_HORSE.aa 1

lalign -s blosum90.mat GLPA_HUMAN.aa GLP_HORSE.aa 1

Observe how the alignment length, percent identity, and alignment score of the local alignments depend on the substitution matrix. (Recall that a low BLOSUM number means the matrix was derived from proteins with very low similarity, and a high number means the underlying proteins were very similar.)
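The reason the BLOSUM number matters is that each score in such a matrix is a rounded log-odds ratio: how often a pair of amino acids is observed aligned in the underlying proteins, compared with how often it would be expected by chance. The Python sketch below illustrates the idea only. The function name log_odds_score, the scale factor, and all frequencies are made-up assumptions, not values from any real BLOSUM matrix, and the expected frequency is simplified to the product of the two background frequencies.

```python
import math


def log_odds_score(observed, freq_a, freq_b, scale=2):
    """Rounded log-odds score for one amino acid pair.

    observed: how often the pair is seen aligned (hypothetical value)
    freq_a, freq_b: background frequencies of the two amino acids
    scale: units per bit; 2 (half-bit units) is an assumption here,
           real matrices differ in their scaling
    """
    expected = freq_a * freq_b
    return round(scale * math.log2(observed / expected))


# The same substitution in two hypothetical training sets.
# In divergent proteins the substitution is seen fairly often;
# in very similar proteins it is rare.
divergent = log_odds_score(observed=0.002, freq_a=0.05, freq_b=0.05)
similar = log_odds_score(observed=0.0005, freq_a=0.05, freq_b=0.05)

print("score from divergent proteins:", divergent)
print("score from similar proteins:  ", similar)
```

In this toy calculation the substitution is penalized only mildly when the frequencies come from divergent proteins and much more heavily when they come from very similar proteins. A matrix built from similar proteins therefore tends to favor shorter local alignments with higher percent identity.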


Impact of Gap Penalties

In order to illustrate the effects of different gap penalties, we have constructed three versions of the BLOSUM50 matrix with modified gap penalties.

Matrix File Name         Gap Opening   Gap Elongation
blosum50.mat                     -12               -2
blosum50-nogap.mat             -1200            -1000
blosum50-longgap.mat             -12                0
blosum50-cheapgap.mat             -6               -2
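To get a feeling for what these numbers mean, the Python sketch below computes the total penalty for a single gap of a given length under each of the four settings. It assumes that the gap opening penalty pays for the first position of the gap and that the elongation penalty is added for each further position; whether lalign counts gaps in exactly this way is an assumption, and the function name gap_cost is made up for the illustration.

```python
# (gap opening, gap elongation) for each matrix file, as listed above.
PENALTIES = {
    "blosum50.mat": (-12, -2),
    "blosum50-nogap.mat": (-1200, -1000),
    "blosum50-longgap.mat": (-12, 0),
    "blosum50-cheapgap.mat": (-6, -2),
}


def gap_cost(length, opening, elongation):
    """Penalty for one gap of the given length.

    Assumed convention: the opening penalty covers the first gap
    position, each additional position costs the elongation penalty.
    """
    if length < 1:
        return 0
    return opening + (length - 1) * elongation


for name, (opening, elongation) in PENALTIES.items():
    costs = [gap_cost(n, opening, elongation) for n in (1, 3, 10)]
    print(f"{name:24s} gap of 1, 3, 10 positions: {costs}")
```

With blosum50-longgap.mat a gap costs the same no matter how long it is, while with blosum50-nogap.mat even a single gap position costs far more than any stretch of matching residues is likely to earn.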

Compare the results you obtain by using blosum50 with different gap-penalties:

Q11: For each of the matrices below, note the alignment length and the number of gaps in the best local alignment.

lalign -s blosum50.mat GLPA_HUMAN.aa GLP_HORSE.aa 1

lalign -s blosum50-nogap.mat GLPA_HUMAN.aa GLP_HORSE.aa 1

lalign -s blosum50-longgap.mat GLPA_HUMAN.aa GLP_HORSE.aa 1

lalign -s blosum50-cheapgap.mat GLPA_HUMAN.aa GLP_HORSE.aa 1

Feel free to experiment with gap penalties on the other matrices as well. Just make a copy of the matrix file you want to modify, and edit the copy. The gap penalties (gap opening and gap elongation) are the two numbers on the third line of the matrix file.