Events News Research CBS CBS Publications Bioinformatics
Staff Contact About Internal CBS CBS Other

Database searches


Heuristic vs. non-heuristic methods

Pairwise alignment methods can be used to search databases: a query sequence is aligned to every single sequence in a database, and the alignments that have a significantly high score (the "hits") are then assumed to be homologous to the query. This is especially useful when you are interested in learning something about the structure and function of a hitherto uncharacterized protein.

For pairwise alignment it is always possible to use an alignment method that is guaranteed to find the optimal alignment (given a substitution matrix and gap penalty scores). However, if one wants to search a large database (say, hundreds of thousands of sequences), then even dynamic programming may be too slow. For this purpose it is often necessary to use a fast heuristic method, i.e., a method that is not guaranteed to find the optimal alignment, but which has been found empirically to yield reasonable results.


Sensitivity, specificity, and run time

You will first search a dataset of 122560 amino acid sequences (from the protein database SWISSPROT) using three different methods - one non-heuristic, and two based on different heuristics:

ssearch

Is non-heuristic and performs a full Smith/Waterman alignment of the query sequence to every sequence in the database:

date; ssearch -Q GLP_HORSE.aa b > GLP_HORSE.ssearch.results ; date

The command date, which is given before and after the ssearch command, prints the current time to the screen allowing you to time the run.

In the ssearch command, the letter "b" is an alias for the database file you are searching (its full path is "/usr/cbs/databases/userdb/blast/swiss.2003.fasta" which is a bit tedious to type repeatedly...). We have chosen to use a saved version from 2003, so that we know exactly which results to expect (this makes the searches somewhat faster than using the most up-to-date database).

The redirection symbol ">" ensures that the output from your search (there will be a lot of it) is saved to a file named "GLP_HORSE.ssearch.results".

Calculate the time it took for ssearch to finish (subtract start time from end time).

fasta

Employs a heuristic short-cut for selecting limited regions in a limited set of database sequences and then performs a Smith/Waterman alignment within these regions:

date; fasta -Q GLP_HORSE.aa b > GLP_HORSE.fasta.results ; date

Calculate the time it took for fasta to finish.

blastp

Uses another heuristic that enables it to skip most of the database. It then searches for high-scoring pairs of words and connects them with a gapped alignment:

date; blastall -p blastp -i GLP_HORSE.aa -d swiss2003 > GLP_HORSE.blast.results; date

Calculate the time it took for blastp to finish.

You can learn a lot about the three search methods by comparing these result files. The easiest way may be to open all three result files in separate nedit windows (e.g., "nedit GLP_HORSE.fasta.results &"). First take a look at the outputs and sort out what is what: the score distribution, the score and expectation values, the alignments.

Now answer the following questions:

  • Q1: Compare the running times you noted. Did the heuristic methods finish sooner?

  • Q2: There are 8 glycophorins in the database - how many do you find with the three different methods? Do you find all of them with any method?

  • Q3: Do the different search methods agree on E-values? Compare the three E-values for the hit to macaque glycophorin, "GLP_MACFU").

  • Q4: How many matches with E-values < 1 are reported with the three methods? Are all of these hits trustworthy? (GLP_HORSE.aa is a horse glycophorin - a major membrane protein in red blood cells. Do all the hits with E<1 look like they are related to this protein?)


Low complexity filter

Have a second look at the "GLP_HORSE.blast.results" file. If you inspect the alignments you will find that parts of the query sequence has been replaced by runs of X'es. BLAST does this by invoking another program, seg, which analyses the compositional complexity of a sequence. If a segment is found to be of low complexity it is "masked" (i.e., not used in the alignment). The purpose with this is to avoid spurious hits to other low complexity regions.

Run blast without this filter by using option -F F:

blastall -p blastp -F F -i GLP_HORSE.aa -d swiss2003 >GLP_HORSE.blast.no-seg.results

Again, inspect the results to see the effects of turning off the low-complexity filter for the BLAST search

Q5: Answer the following questions: How many glycophorins did you find now? What is the E-value for the hit to macaque glycophorin, "GLP_MACFU"? Did you get more spurious hits with the filter off (what is the total number of hits with E<1)?

More information is available in the seg documentation.


Matrix choice

Try to search the database for homologs to GLP_HORSE.aa using alternative matrices with the fasta program:

fasta -Q -s blosum90.mat GLP_HORSE.aa b > GLP_HORSE.fasta-blo90.results

Note the alignment score ("opt") and E-value for the mouse glycophorin, GLP_MOUSE.

fasta -Q -s blosum30.mat GLP_HORSE.aa b > GLP_HORSE.fasta-blo30.results

Note again the alignment score and E-value for the mouse glycophorin, GLP_MOUSE.

Q6: What is the impact of matrix choice on the significance of the GLP_MOUSE hit?


Impact of database size

You will now use the GLP_HORSE.aa sequence to search two different versions of the swissprot database: swiss90 is how the swissprot database looked in 1990 (17,994 sequences), while swiss2003 is the same database as it looked at the beginning of the year 2003 (122,560 sequences).

blastall -p blastp -i GLP_HORSE.aa -d swiss90 > GLP_HORSE.blast.swiss90.results

Note the E-value for the hit to GLP_MACFU (Macaque glycophorin).

blastall -p blastp -i GLP_HORSE.aa -d swiss2003 > GLP_HORSE.blast.swiss2003.results

Again, note the E-value for the hit to GLP_MACFU (Macaque glycophorin).

Q7: Compare the results. What is the impact of database size on E-values? Why?


DNA vs. DNA and Protein vs. DNA

Search a set of primate DNA sequences from the GenBank database with BT006808.nuc as the query sequence:

blastall -p blastn -i BT006808.nuc -d gbpri -e 2 -v 1000 > BT006808.blast.results

(The "-e 2" option tells BLAST to only report hits with E-values less than 2. The "-v 1000" option tells BLAST to report at most 1000 hits).

The protein sequence BT006808.aa is the translation of BT006808.nuc.  You can use this as a query sequence against a DNA database by using a version of BLAST that translates the nucleotide database while searching:

blastall -p tblastn -i BT006808.aa -d gbpri -e 2 -v 1000 > BT006808.Tblast.results

Q8: Which of the two approaches is able to find a match between insulin and human insulin-like growth factor? Why do you think this is so?

There are three useful BLAST versions for different types of nucleotide/protein comparison (fasta comes with the same options):

  • blastx: Nucleotide query - Protein database.
  • tblastn: Protein query - Translated nucleotide database.
  • tblastx: Nucleotide query - Translated nucleotide database (!!)

Web interface

The BLAST programs are of course also accesible on the web:

NCBI blastp interface

You may want to try to see if you can figure out where you control all the options we experimented with above.