Pairwise alignments
Your first alignment
- Do a global alignment of two protein sequences:
align GLBE_CHITH.aa GLB7A_CHITH.aa
These sequences are two globins from a midge, a very small and annoying
insect.
Q1: Note the length of the longest sequence, the percent identity,
and the alignment score for the alignment. Also note that there is a gap
in GLBE_CHITH.aa. What is the corresponding 6-amino-acid
sequence of GLB7A_CHITH.aa? Make a note of the sequence. This
is an authentic example! Nature truly is fascinating...
- Try local alignment:
lalign GLBE_CHITH.aa GLB7A_CHITH.aa
and compare the output with that of align. You will get the
ten best-scoring local alignments (10 is the default), sorted by decreasing
similarity score. The output will fill several screens.
You can "pipe" the output into the program "less" and view it one page at a
time:
lalign GLBE_CHITH.aa GLB7A_CHITH.aa | less
(Press the spacebar to go forward, "b" to go backward, and "q" to quit.) Alternatively, you can
control the number of local alignments shown by appending a number
as the third argument to lalign, e.g.
lalign GLBE_CHITH.aa GLB7A_CHITH.aa 1
will give you only the best local alignment.
Q2: Note the length of the overlap, the percent identity, and the alignment
score for the best local alignment.
- Make a dot plot:
The program plalign makes dot plots in the PostScript
format.
plalign GLBE_CHITH.aa GLB7A_CHITH.aa > globin-dotplot.ps
The PostScript output is saved to the file globin-dotplot.ps. You
can now view the PostScript output with the command:
gv globin-dotplot.ps &
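To get a feel for what a dot plot encodes, here is a minimal Python sketch. It is not how plalign works internally; it simply marks every pair of positions where a short window of residues is identical in both sequences. The toy sequences, the window size, and the function names are made up for illustration.

```python
def dot_matrix(seq_a, seq_b, window=3):
    """Return the set of (i, j) where seq_a[i:i+window] == seq_b[j:j+window]."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            if seq_a[i:i + window] == seq_b[j:j + window]:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, window=3):
    """Draw the matrix as text: rows follow seq_a, columns follow seq_b."""
    dots = dot_matrix(seq_a, seq_b, window)
    lines = ["  " + seq_b]
    for i, residue in enumerate(seq_a):
        row = "".join("*" if (i, j) in dots else "." for j in range(len(seq_b)))
        lines.append(residue + " " + row)
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy sequences (hypothetical): the second one carries a 3-residue insertion.
    print(render("MKTAYIAKQR", "MKTAGGGYIAKQR"))
```

In the printed matrix the diagonal is broken and shifted where the insertion sits, which is the same pattern to look for in the PostScript plots.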
More examples of pairwise alignments of protein and nucleotide sequences
Repeat the above procedure, using align, lalign,
and plalign, on the following pairs of sequences instead of
GLBE_CHITH.aa and GLB7A_CHITH.aa:
- GLPA_HUMAN.aa and GLP_HORSE.aa
Glycophorin from human and horse.
align GLPA_HUMAN.aa GLP_HORSE.aa
Q3: Note the length of the longest sequence, the percent
identity, and the alignment score for the global GLP-alignment.
lalign GLPA_HUMAN.aa GLP_HORSE.aa
Q4: Note the length of the overlap, the percent identity, and
the alignment score for the best local GLP-alignment.
plalign GLPA_HUMAN.aa GLP_HORSE.aa > GLP-dotplot.ps
...and then (of course) view the result with gv, like before.
Note that the overall similarity is lower than between the two midge
proteins. How does that affect the relationship between global and local
alignment?
- X61092.nuc and BT006808.nuc
Insulin gene and cDNA (coding part only) from a monkey and a human. Note
the intron.
align X61092.nuc BT006808.nuc
Q5: Note the start and stop positions of the intron in the monkey
sequence.
lalign X61092.nuc BT006808.nuc
Q6: Note the start and stop positions of the two best local alignments
relative to the monkey sequence.
plalign X61092.nuc BT006808.nuc > insulin-dotplot.ps
- HUMHBA4.nuc and V00488.nuc
align HUMHBA4.nuc V00488.nuc
Q7: Note the length of the two sequences.
Q8: One of these sequences is much longer than the other. Can you see what
the relationship is between these two? You can find the answer in the database
annotations by searching for HUMHBA4 and V00488 in
GenBank.
Specifically, read the DEFINITION field in the two database entries
and write a brief explanation of what the two sequences are.
lalign HUMHBA4.nuc V00488.nuc
Q9: Note the position of the best local alignment relative to
HUMHBA4 numbering. What does the best local alignment correspond to?
(Scroll down to the feature table and look up the corresponding part of the HUMHBA4 sequence.)
plalign HUMHBA4.nuc V00488.nuc > HBA-dotplot.ps
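The gene/cDNA pairs above show why global and local alignment can disagree: a global alignment has to pay for the gap that spans an intron, while a local alignment can report each exon on its own. The following Python sketch illustrates this under stated assumptions: the sequences are made up, and the scoring (match +2, mismatch -1, linear gap penalty of -2 per residue) is a deliberate simplification, not the matrices or gap penalties used in this exercise.

```python
def global_score(a, b, match=2, mismatch=-1, gap=-2):
    """Needleman-Wunsch score with a linear gap penalty (end gaps are penalised)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman score: same recurrence, but a cell never drops below zero."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


# Toy data (hypothetical): two 6-base "exons" separated by a 12-base "intron".
EXON1, INTRON, EXON2 = "ACGGCA", "T" * 12, "GACCAG"
GENE = EXON1 + INTRON + EXON2
CDNA = EXON1 + EXON2

if __name__ == "__main__":
    print("global:", global_score(GENE, CDNA))  # 0: 12 matches cancelled by a 12-base gap
    print("local: ", local_score(GENE, CDNA))   # 12: one exon matched end to end
```

With these toy values the global score is dragged down to zero by the intron gap, while the best local alignment covers a single exon.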
Impact of Substitution Matrix
In your working directory, there are also a number of substitution matrix
files with the extension .mat.
These can be used instead of the default BLOSUM50 matrix in the local
alignments. To select the BLOSUM30 matrix, for instance, write:
lalign -s blosum30.mat GLPA_HUMAN.aa GLP_HORSE.aa 1
(The "1" at the end makes lalign show only the best
alignment - we recommend that you use this setting in this part of the
exercise).
Compare the lalign results you obtain with blosum30, blosum50, and blosum90 on the glycophorin
sequences:
Q10: For each of the matrices below, note alignment length, percent identity,
and alignment score of the best local alignment.
lalign -s blosum30.mat GLPA_HUMAN.aa GLP_HORSE.aa 1
lalign -s blosum50.mat GLPA_HUMAN.aa GLP_HORSE.aa 1
lalign -s blosum90.mat GLPA_HUMAN.aa GLP_HORSE.aa 1
Observe how the alignment length, percent identity, and alignment score of the local
alignments depend on the substitution matrix (recall that a low BLOSUM number means the
matrix is derived from proteins with very low similarity, and a high number means the
underlying proteins are very similar).
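One way to build intuition for this effect is the small Python sketch below. It assumes two toy scoring schemes, a lenient one (mismatch -1) and a strict one (mismatch -5), as rough stand-ins for matrices derived from distant and from closely related proteins. Real BLOSUM matrices assign a separate score to every residue pair, and lalign also allows gaps; the sketch only searches for the best ungapped segment of two made-up, equal-length sequences.

```python
def best_ungapped_segment(a, b, match, mismatch):
    """Return (score, start, end) of the highest-scoring contiguous run of columns.

    a and b must have equal length and are compared position by position;
    `end` is exclusive.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    best = (0, 0, 0)
    running, start = 0, 0
    for i, (x, y) in enumerate(zip(a, b)):
        running += match if x == y else mismatch
        if running <= 0:
            # A segment that has fallen to zero is never worth extending.
            running, start = 0, i + 1
        elif running > best[0]:
            best = (running, start, i + 1)
    return best


def percent_identity(a, b, start, end):
    """Identical columns as a percentage of the segment length."""
    same = sum(x == y for x, y in zip(a[start:end], b[start:end]))
    return 100.0 * same / (end - start)


if __name__ == "__main__":
    # Toy sequences (hypothetical): two conserved blocks split by two mismatches.
    seq_a, seq_b = "ACDEWWGHIKL", "ACDEYYGHIKL"
    for label, mismatch in (("lenient", -1), ("strict", -5)):
        score, start, end = best_ungapped_segment(seq_a, seq_b, 2, mismatch)
        identity = percent_identity(seq_a, seq_b, start, end)
        print(f"{label}: score={score} length={end - start} identity={identity:.1f}%")
```

In this toy case the lenient scheme bridges the mismatches and reports one long segment with lower identity, while the strict scheme reports a shorter segment with 100% identity.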
Impact of Gap Penalties
In order to illustrate the effects of different gap penalties, we have
constructed three versions of the BLOSUM50 matrix with modified gap penalties.
| Matrix File Name | Gap Opening | Gap Elongation |
|---|---|---|
| blosum50.mat | -12 | -2 |
| blosum50-nogap.mat | -1200 | -1000 |
| blosum50-longgap.mat | -12 | 0 |
| blosum50-cheapgap.mat | -6 | -2 |
Compare the results you obtain by using blosum50 with different gap penalties:
Q11: For each of the matrices below, note the alignment length and the number of gaps in the best local
alignment.
lalign -s blosum50.mat GLPA_HUMAN.aa GLP_HORSE.aa 1
lalign -s blosum50-nogap.mat GLPA_HUMAN.aa GLP_HORSE.aa 1
lalign -s blosum50-longgap.mat GLPA_HUMAN.aa GLP_HORSE.aa 1
lalign -s blosum50-cheapgap.mat GLPA_HUMAN.aa GLP_HORSE.aa 1
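To see why these four settings behave so differently, it can help to tabulate what a single gap of a given length costs. The Python sketch below assumes one common convention: the first gapped residue costs the opening penalty and every further residue costs the elongation penalty. Whether lalign counts the first residue this way is an assumption here; under the alternative convention (opening plus elongation for every residue) the numbers shift, but the comparison between the files stays the same.

```python
def gap_cost(length, opening, elongation):
    """Score contribution of one gap spanning `length` residues.

    Assumed convention: the first gapped residue costs `opening`,
    every additional residue costs `elongation`.
    """
    if length < 1:
        raise ValueError("a gap spans at least one residue")
    return opening + (length - 1) * elongation


# Penalties copied from the table above.
PENALTIES = {
    "blosum50.mat": (-12, -2),
    "blosum50-nogap.mat": (-1200, -1000),
    "blosum50-longgap.mat": (-12, 0),
    "blosum50-cheapgap.mat": (-6, -2),
}

if __name__ == "__main__":
    for name, (opening, elongation) in PENALTIES.items():
        costs = [gap_cost(n, opening, elongation) for n in (1, 5, 20)]
        print(f"{name:24s} gap of 1, 5, 20 residues: {costs}")
```

With blosum50-longgap.mat a 20-residue gap costs no more than a 1-residue gap, while blosum50-nogap.mat makes any gap far more expensive than an alignment of this size could ever earn back.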
Feel free to experiment with gap penalties on the other matrices as well.
Just make a copy of the matrix file you want to modify, and edit the copied
file. The gap penalties (gap opening and gap elongation) are
the two numbers on the third line of the matrix file.
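If you prefer to script the copy-and-edit step, the Python sketch below shows one way to do it. It assumes that the third line of a matrix file contains nothing but the two penalties separated by whitespace; check one of your .mat files before relying on it. The function name and the output file name are hypothetical.

```python
from pathlib import Path


def copy_with_gap_penalties(source, target, opening, elongation):
    """Copy a matrix file, replacing its third line with two new penalties.

    Assumption: the third line holds only the two gap penalties,
    separated by whitespace. Returns the old penalties as strings.
    """
    lines = Path(source).read_text().splitlines()
    if len(lines) < 3:
        raise ValueError(f"{source} has fewer than three lines")
    old = lines[2].split()
    if len(old) != 2:
        raise ValueError(f"unexpected third line in {source}: {lines[2]!r}")
    lines[2] = f"{opening} {elongation}"
    Path(target).write_text("\n".join(lines) + "\n")
    return tuple(old)


if __name__ == "__main__":
    # Hypothetical output name; the original file is left untouched.
    previous = copy_with_gap_penalties("blosum50.mat", "blosum50-mygap.mat", -20, -1)
    print("replaced penalties", previous)
```

The new file can then be passed to lalign with the -s option in the same way as the prepared matrices.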