Pairwise Alignment and Database Searching
Presentation
My presentation is available in PowerPoint format here.
Pairwise alignment exercise
Global alignment
Below, you see two protein sequences. They are both globins from a midge, Chironomus thummi
thummi, a very small and annoying insect.
>GLB7_CHITH 145 P02226 GLOBIN CTT-VIIA.
APLSADQASLVKSTWAQVRNSEVEILAAVFTAYPDIQARFPQFAGKDVAS
IKDTGAFATHAGRIVGFVSEIIALIGNESNAPAVQTLVGQLAASHKARGI
SQAQFNEFRAGLVSYVSSNVAWNAAAESAWTAGLDNIFGLLFAAL
>GLBP_CHITH 152 P11582 GLOBIN CTT-E/E' PRECURSOR.
MKFIILALCVAAASALSGDQIGLVQSTYGKVKGDSVGILYAVFKADPTIQ
AAFPQFVGKDLDAIKGGAEFSTHAGRIVGFLGGVIDDLPNIGKHVDALVA
THKPRGVTHAQFNNFRAAFIAYLKGHVDYTAAVEAAWGATFDAFFGAVFA
KM
These sequences are given in the FASTA format, a widely used format for input to
bioinformatics programs: a line beginning with a ">" contains the name of a sequence plus
optional comments, while the other lines until the next ">" contain the sequence itself.
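The format rule above can be sketched as a small parser. This is a minimal illustration in Python, not part of the exercise; the function name `parse_fasta` is hypothetical, and the code assumes plain text input with no blank lines inside a sequence.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated until the next ">".
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>GLB7_CHITH 145 P02226 GLOBIN CTT-VIIA.
APLSADQASLVKSTWAQVRNSEVEILAAVFTAYPDIQARFPQFAGKDVAS
IKDTGAFATHAGRIVGFVSEIIALIGNESNAPAVQTLVGQLAASHKARGI
SQAQFNEFRAGLVSYVSSNVAWNAAAESAWTAGLDNIFGLLFAAL
"""

records = parse_fasta(example)
for header, seq in records:
    # First word of the header is the sequence name.
    print(header.split()[0], len(seq))
```

The first word of the header line is taken as the sequence name; everything after it is treated as free-text comment.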
Do a global alignment of these two protein sequences, using the ALIGN service at the GENESTREAM network
server, IGH, Montpellier, France.
Hint: You can copy the sequences and sequence names from this page and paste them into the
input windows at the French site. Note that only the sequences (not the header lines) should be
pasted.
Take a look at the result. Note that there is a gap in GLBP_CHITH - what is the
corresponding sequence of GLB7_CHITH? (This is an authentic example - you are welcome to
retrieve the original database entries for GLBP_CHITH and GLB7_CHITH
from the SWISS-PROT database.)
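To see what a global alignment program does in principle, here is a minimal sketch of the Needleman-Wunsch dynamic programming algorithm in Python. It is not the ALIGN implementation: it assumes a simple match/mismatch score and a linear gap penalty instead of a substitution matrix with affine gaps, and the name `global_align` and its default parameters are chosen only for illustration.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b). Scoring values are illustrative.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = n, m
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


total, row_a, row_b = global_align("ACGT", "AGT")
print(total)
print(row_a)
print(row_b)
```

Because every residue of both sequences must appear in the result, the alignment always runs from the first to the last position, inserting gaps where needed.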
Local alignment
Now try a local alignment of the same two sequences, using the LALIGN service instead. Compare the output
with that of ALIGN. You will get the ten best-scoring local alignments, sorted by decreasing
similarity score. Note that by using LALIGN the alignment is truncated compared to the
global alignment.
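The truncation comes from the way local (Smith-Waterman) alignment is scored. The sketch below, in Python, computes only the best local score; it again assumes a simple match/mismatch scheme with a linear gap penalty rather than the scoring used by LALIGN, and `local_score` is a hypothetical name.

```python
def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (illustrative scoring)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores are floored at zero, so an alignment may start anywhere.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            # The best cell anywhere in the matrix ends the alignment.
            best = max(best, cur[j])
        prev = cur
    return best


print(local_score("TTTTGATTACA", "CCGATTACACC"))
```

The two differences from the global recurrence are the floor at zero and taking the maximum over the whole matrix instead of the bottom-right corner; poorly matching ends are simply left out of the alignment.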
Question: Does global or local alignment yield the higher alignment score? Why?
Question: The alignment program used BLOSUM50 to align the sequences. Does that make sense
given the alignments obtained?
Database search exercise
BLAST searches
Below are two protein sequences in FASTA format.
Perform a BLAST search against the SWISS-PROT database. Use the BLAST service at the
GENESTREAM server for this. NOTE: Make sure to select the correct database (swissprot)
and alignment method (blastp) in the drop-down menus!
Question 1: Which functions would you assign to these two proteins based on your
BLAST results?
Question 2: Try using different substitution matrices when performing the
BLAST searches. How does this affect the expectation values (E-values)? (For instance,
compare the E-values for the database hit "ADH3_ECOLI" using BLOSUM45, BLOSUM62, and
BLOSUM80.)
>SOME_PROTEIN
STAGKVIKCKAAVLWEVKKPFSIEDVEVAPPKAYEVRIKMVAVGICRTDDHVVSGNLVTP
LPVILGHEAAGIVESVGEGVTTVKPGDKVIPLFTPQCGKCRVCKNPESNYCLKNDLGNPR
GTLQDGTRRFTCRGKPIHHFLGTSTFSQYTVVDENAVAKIDAASPLEKVCLIGCGFSTGY
GSAVNVAKVTPGSTCAVFGLGGVGLSAVMGCKAAGAARIIAVDINKDKFAKAKELGATEC
INPQDYKKPIQEVLKEMTDGGVDFSFEVIGRLDTMMASLLCCHEACGTSVIVGVPPASQN
LSINPMLLLTGRTWKGAVYGGFKSKEGIPKLVADFMAKKFSLDALITHVLPFEKINEGFD
LLHSGKSIRTVLTF
>LAST_ECOLI
MRITIILVAPARAENIGAAARAMKTMGFSDLRIVDSQAHLEPATRWVAHGSGDIIDNIKV
FPTLAESLHDVDFTVATTARSRAKYHYYATPVELVPLLEEKSSWMSHAALVFGREDSGLT
NEELALADVLTGVPMVADYPSLNLGQAVMVYCYQLATLIQQPAKSDATADQHQLQALRER
AMTLLTTLAVADDIKLVDWLQQRLGLLEQRDTAMLHRLLHDIEKNITK
Take-home messages:
- E-values are not absolute measures of how good a database hit is. In theory,
E-values depend on the sequence, the database, and the substitution matrix/scoring system.
(In practice, BLAST uses pre-computed score distributions, so the score statistics behind
BLAST E-values depend only on the substitution matrix; this means the significance of a hit
is sometimes overestimated!)
- It is generally safe to assign function X to an unknown protein if it has many
strong hits to proteins with function X in the database. HOWEVER, be cautious when
a sequence only has hits to proteins with putative functions!
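As a rough illustration of why an E-value is not an absolute measure, the Karlin-Altschul relation E = K·m·n·exp(-λS) can be evaluated directly. The Python sketch below is only that, a sketch: the function names are hypothetical, and the default values of λ and K are illustrative assumptions for a gapped BLOSUM62 search, not values read from any particular BLAST run.

```python
import math


def expected_hits(raw_score, query_len, db_len, lam=0.267, k=0.041):
    """Karlin-Altschul E-value: E = K * m * n * exp(-lambda * S).

    lam and k are ASSUMED, illustrative parameters; in a real search they
    are determined by the substitution matrix and gap penalties.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam=0.267, k=0.041):
    """Normalised (bit) score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Same raw score and query, two hypothetical database sizes (in residues).
small = expected_hits(100, 250, 10_000_000)
large = expected_hits(100, 250, 20_000_000)
print(small, large)
```

With these assumptions the same raw score gives an E-value twice as large when the database is twice as big, and changing the scoring system changes λ and K and therefore the E-value as well.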
FASTA searches
Redo the analysis of LAST_ECOLI, this time using FASTA3_T with the BLOSUM62 matrix to
search the SWISS-PROT database. Use the FASTA3 service at the GENESTREAM server for
this. NOTE: Again, make sure to select the correct database (swissprot) and substitution matrix
(BLOSUM62).
Question: How do the E-values compare to those obtained using BLAST
for a given substitution matrix? (For instance, note the E-value for YFHQ_ECOLI
for BLAST vs. FASTA, using BLOSUM62).
Take-home message: FASTA gives a better estimate of the real E-value of a database hit
than BLAST does, since it takes into account the actual score distribution of the current
database search. For some reason, E-values computed by FASTA are usually worse (i.e., larger) than
E-values computed by BLAST.
Links to web-based tools
- DOTLET - an applet for making dotplots
- LALIGN - a tool for performing local (Smith-Waterman) alignment
- SIM - alternative local alignment tool
- FASTA - fast database search tool
- BLAST - faster database search tool
- CD-BLAST - Fast search of sequence against profile database
Links to online tutorials
You are not required to read any of the material below. But if you are looking for more information on sequence alignment, these are definitely good places to start: