
Pairwise Alignment and Database Searching


Presentation

My presentation is available in Powerpoint here


Pairwise alignment exercise


Global alignment

Below, you see two protein sequences. They are both globins from a midge, Chironomus thummi thummi, a very small and annoying insect.

>GLB7_CHITH 145  P02226 GLOBIN CTT-VIIA. 
APLSADQASLVKSTWAQVRNSEVEILAAVFTAYPDIQARFPQFAGKDVAS
IKDTGAFATHAGRIVGFVSEIIALIGNESNAPAVQTLVGQLAASHKARGI
SQAQFNEFRAGLVSYVSSNVAWNAAAESAWTAGLDNIFGLLFAAL

>GLBP_CHITH 152  P11582 GLOBIN CTT-E/E' PRECURSOR. 
MKFIILALCVAAASALSGDQIGLVQSTYGKVKGDSVGILYAVFKADPTIQ
AAFPQFVGKDLDAIKGGAEFSTHAGRIVGFLGGVIDDLPNIGKHVDALVA
THKPRGVTHAQFNNFRAAFIAYLKGHVDYTAAVEAAWGATFDAFFGAVFA
KM

These sequences are given in the FASTA format, a widely used input format for bioinformatics programs: a line beginning with ">" contains the name of a sequence plus optional comments, while the following lines, up to the next ">", contain the sequence itself.
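To make the format concrete, here is a minimal sketch of a FASTA reader in Python. The function name parse_fasta is made up for this illustration and is not part of any tool used in the exercise; it assumes the whole input fits in memory as a single string.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines between records carry no information
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence lines are simply concatenated.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = ">seq1 a toy record\nACDE\nFGH\n\n>seq2\nKLMN\n"
    for name, seq in parse_fasta(example):
        print(name, len(seq))
```

Pasting the two globin records from this page into parse_fasta should give back two entries whose sequences have the line breaks removed.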

Do a global alignment of these two protein sequences, using the ALIGN service at the GENESTREAM network server, IGH, Montpellier, France.

Hint: You can copy the sequences and sequence names from this page and paste them into the input windows at the French site. Note that only the sequences (not the header lines) should be pasted.

Take a look at the result. Note that there is a gap in GLBP_CHITH - what is the corresponding sequence of GLB7_CHITH? (This is an authentic example - you are welcome to retrieve the original database entries for GLBP_CHITH and GLB7_CHITH from the SWISS-PROT database.)
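If you want to see what a global alignment program computes, the following is a minimal sketch of the Needleman-Wunsch dynamic programming recursion. It is not the ALIGN implementation: for brevity it assumes a simple match/mismatch score instead of a substitution matrix and a linear gap penalty, and it returns only the optimal score, not the alignment itself. The function name and the default parameter values are chosen for this illustration only.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b (Needleman-Wunsch).

    Simplified scoring: identical residues score `match`, different
    residues score `mismatch`, and every gap position costs `gap`.
    """
    # Row 0: aligning the empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] against the empty prefix of b.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + s,    # align a[i-1] with b[j-1]
                prev[j] + gap,      # a[i-1] against a gap in b
                curr[j - 1] + gap,  # b[j-1] against a gap in a
            ))
        prev = curr
    # Global alignment: both sequences must be aligned end to end.
    return prev[-1]


if __name__ == "__main__":
    print(global_score("ACGT", "AGT"))
```

Because the score is read from the last cell of the table, every residue of both sequences has to be accounted for, including end regions that do not match well.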


Local alignment

Now try a local alignment of the same two sequences, using the LALIGN service instead. Compare the output with that of ALIGN. You will get the ten best-scoring local alignments, sorted by decreasing similarity score. Note that by using LALIGN the alignment is truncated compared to the global alignment.

Question: Does global or local alignment yield the higher alignment score? Why?

Question: The alignment program used BLOSUM50 to align the sequences. Does that make sense given the alignments obtained?
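For comparison, here is a minimal sketch of the Smith-Waterman recursion for local alignment. As a simplification it assumes match/mismatch scoring and a linear gap penalty rather than BLOSUM50 with the gap penalties LALIGN uses, and it reports only the single best score, not the ten best alignments. The function name and default values are chosen for this illustration.

```python
def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (Smith-Waterman).

    Simplified scoring: identical residues score `match`, different
    residues score `mismatch`, and every gap position costs `gap`.
    """
    best = 0
    # Row 0 is all zeros: a local alignment may start anywhere for free.
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                0,                  # start a fresh alignment here
                prev[j - 1] + s,    # align a[i-1] with b[j-1]
                prev[j] + gap,      # a[i-1] against a gap in b
                curr[j - 1] + gap,  # b[j-1] against a gap in a
            )
            curr.append(cell)
            # A local alignment may also end anywhere, so track the maximum.
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    print(local_score("XXXGATTACAYYY", "GATTACA"))
```

The recursion differs from the global one in only two places: cell values are never allowed to drop below zero, and the result is the largest cell in the table rather than the last one. Keeping those two differences in mind may help with the first question above.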


Database search exercise


BLAST searches

Below are two protein sequences in FASTA format. Perform a BLAST search against the SWISS-PROT database. Use the BLAST service at the GENESTREAM server for this. NOTE: Make sure to select the correct database (swissprot) and alignment method (blastp) in the drop-down menus!

Question 1: Which functions would you assign to these two proteins based on your BLAST results?

Question 2: Try using different substitution matrices when performing the BLAST searches. How does this affect the expectation scores? (For instance, note the E-values for the database hit "ADH3_ECOLI" using BLOSUM45, BLOSUM62, and BLOSUM80 and compare).

>SOME_PROTEIN
STAGKVIKCKAAVLWEVKKPFSIEDVEVAPPKAYEVRIKMVAVGICRTDDHVVSGNLVTP
LPVILGHEAAGIVESVGEGVTTVKPGDKVIPLFTPQCGKCRVCKNPESNYCLKNDLGNPR
GTLQDGTRRFTCRGKPIHHFLGTSTFSQYTVVDENAVAKIDAASPLEKVCLIGCGFSTGY
GSAVNVAKVTPGSTCAVFGLGGVGLSAVMGCKAAGAARIIAVDINKDKFAKAKELGATEC
INPQDYKKPIQEVLKEMTDGGVDFSFEVIGRLDTMMASLLCCHEACGTSVIVGVPPASQN
LSINPMLLLTGRTWKGAVYGGFKSKEGIPKLVADFMAKKFSLDALITHVLPFEKINEGFD
LLHSGKSIRTVLTF
>LAST_ECOLI
MRITIILVAPARAENIGAAARAMKTMGFSDLRIVDSQAHLEPATRWVAHGSGDIIDNIKV
FPTLAESLHDVDFTVATTARSRAKYHYYATPVELVPLLEEKSSWMSHAALVFGREDSGLT
NEELALADVLTGVPMVADYPSLNLGQAVMVYCYQLATLIQQPAKSDATADQHQLQALRER
AMTLLTTLAVADDIKLVDWLQQRLGLLEQRDTAMLHRLLHDIEKNITK

Take-home messages:

  1. E-values are not absolute measures of how good a database hit is. In theory, E-values depend on the sequence, the database, and the substitution matrix/scoring system. (In practice, BLAST uses pre-computed score distributions, so BLAST E-values depend only on the substitution matrix - this means they are sometimes overestimated!)
  2. It is generally safe to assign function X to an unknown protein if it has many strong hits to proteins with function X in the database. HOWEVER, be cautious when a sequence only has hits to proteins with putative functions!
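The dependence on the scoring system can be illustrated with the Karlin-Altschul form of the E-value, E = K * m * n * exp(-lambda * S), where S is the raw alignment score, m and n are the query and database lengths, and K and lambda are statistical parameters tied to the substitution matrix and gap penalties. The sketch below is only an illustration: the function name is made up, and the parameter sets in the example are hypothetical placeholders, not the values BLAST uses for any particular matrix.

```python
import math


def expect_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance hits with a score of at least raw_score.

    Uses the Karlin-Altschul form E = K * m * n * exp(-lambda * S).
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


if __name__ == "__main__":
    # Hypothetical (lambda, K) pairs standing in for two scoring systems.
    # They are placeholders for illustration, not real BLAST parameters.
    scoring_systems = {
        "system A": (0.25, 0.04),
        "system B": (0.30, 0.10),
    }
    for name, (lam, k) in scoring_systems.items():
        e = expect_value(raw_score=100, query_len=250, db_len=5_000_000,
                         lam=lam, k=k)
        print(f"{name}: E = {e:.3g}")
```

The same raw score gives different E-values under the two parameter sets, and a larger database gives a proportionally larger E-value, which is why E-values should not be compared across searches as if they were absolute.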


FASTA searches

Redo the analysis of LAST_ECOLI, this time using FASTA3_T with the BLOSUM62 matrix to search the SWISS-PROT database. Use the FASTA3 service at the GENESTREAM server for this. NOTE: again, make sure to select the correct database (swissprot) and substitution matrix (BLOSUM62).

Question: How do the E-values compare to those obtained using BLAST for a given substitution matrix? (For instance, note the E-value for YFHQ_ECOLI for BLAST vs. FASTA, using BLOSUM62).

Take-home message: FASTA gives a better estimate of the real E-value of a database hit than BLAST does, since it takes into account the actual score distribution of the current database search. For some reason, E-values computed by FASTA are usually worse (i.e., larger) than E-values computed by BLAST.


Links to web-based tools

  • DOTLET - an applet for making dotplots
  • LALIGN - a tool for performing local (Smith-Waterman) alignment
  • SIM - alternative local alignment tool
  • FASTA - fast database search tool
  • BLAST - faster database search tool
  • CD-BLAST - Fast search of sequence against profile database

Links to online tutorials

You are not required to read any of the material below. But if you are looking for more information on sequence alignment, these are definitely good places to start: