|
Database searches
Heuristic vs. non-heuristic methods
Pairwise alignment methods can be used to search databases: a query
sequence is aligned to every single sequence in a database, and the
alignments that have a significantly high score (the "hits") are then
assumed to be homologous to the query. This is especially useful when
you are interested in learning something about the structure and
function of a hitherto uncharacterized protein.
For pairwise alignment it is always possible to use an alignment method that
is guaranteed to find the optimal alignment (given a substitution matrix and
gap penalty scores). However, if one wants to search a large database (say,
hundreds of thousands of sequences), then even dynamic programming may be too
slow. For this purpose it is often necessary to use a fast heuristic method,
i.e., a method that is not guaranteed to find the optimal alignment, but
which has been found empirically to yield reasonable results.
Sensitivity, specificity, and run time
You will first search a dataset of 122560 amino acid sequences (from the
protein database SWISSPROT) using three different methods - one non-heuristic,
and two based on different heuristics:
ssearch Is non-heuristic and performs a full
Smith/Waterman alignment of the query sequence to every
sequence in the database:
date; ssearch -Q GLP_HORSE.aa b > GLP_HORSE.ssearch.results ; date
The command date, which is given before and after the
ssearch command, prints the current time to the screen allowing
you to time the run.
In the ssearch command, the letter "b" is an alias for
the database file you are searching (its full path is
"/usr/cbs/databases/userdb/blast/swiss.2003.fasta" which is a
bit tedious to type repeatedly...). We have chosen to use a saved
version from 2003, so that we know exactly which results to expect (this
makes the searches somewhat faster than using the most up-to-date
database).
The redirection symbol ">" ensures that the output from
your search (there will be a lot of it) is saved to a file named
"GLP_HORSE.ssearch.results".
Calculate the time it took for ssearch to finish (subtract
start time from end time).
fasta
Employs a heuristic short-cut for
selecting limited regions in a limited set of database sequences
and then performs a Smith/Waterman alignment within these
regions:
date; fasta -Q GLP_HORSE.aa b > GLP_HORSE.fasta.results ; date
Calculate the time it took for fasta to finish.
blastp
Uses another heuristic that enables it to skip most of the
database. It then searches for high-scoring pairs of words and connects
them with a gapped alignment:
date; blastall -p blastp -i GLP_HORSE.aa -d swiss2003 > GLP_HORSE.blast.results; date
Calculate the time it took for blastp to finish.
You can learn a lot about the three search methods by comparing
these result files. The easiest way may be to open all three result
files in separate nedit windows (e.g., "nedit
GLP_HORSE.fasta.results &"). First take a look at the outputs
and sort out what is what: the score distribution, the score and
expectation values, the alignments.
Now answer the following questions:
Q1: Compare the running times you noted. Did the heuristic
methods finish sooner?
Q2: There are 8 glycophorins in the database - how many do
you find with the three different methods? Do you find all of them with
any method?
Q3: Do the different search methods agree on E-values?
Compare the three E-values for the hit to macaque glycophorin,
"GLP_MACFU").
Q4: How many matches with E-values < 1 are reported
with the three methods? Are all of these hits trustworthy? (GLP_HORSE.aa
is a horse glycophorin - a major membrane protein in red blood cells. Do
all the hits with E<1 look like they are related to this protein?)
Low complexity filter
Have a second look at the "GLP_HORSE.blast.results" file. If
you inspect the alignments you will find that parts of the query
sequence has been replaced by runs of X'es. BLAST does this by invoking
another program, seg, which analyses the compositional
complexity of a sequence. If a segment is found to be of low complexity
it is "masked" (i.e., not used in the alignment). The purpose with this
is to avoid spurious hits to other low complexity regions.
Run blast without this filter by using option -F F:
blastall -p blastp -F F -i GLP_HORSE.aa -d swiss2003 >GLP_HORSE.blast.no-seg.results
Again, inspect the results to see the effects of turning off the
low-complexity filter for the BLAST search
Q5: Answer the following questions: How many glycophorins did
you find now? What is the E-value for the hit to macaque glycophorin,
"GLP_MACFU"? Did you get more spurious hits with the filter off (what is
the total number of hits with E<1)?
More information is available in the seg documentation.
Matrix choice
Try to search the database for homologs to GLP_HORSE.aa
using alternative matrices with the fasta program:
fasta -Q -s blosum90.mat GLP_HORSE.aa b > GLP_HORSE.fasta-blo90.results
Note the alignment score ("opt") and E-value for the mouse glycophorin, GLP_MOUSE.
fasta -Q -s blosum30.mat GLP_HORSE.aa b > GLP_HORSE.fasta-blo30.results
Note again the alignment score and E-value for the mouse
glycophorin, GLP_MOUSE.
Q6: What is the impact of matrix choice on the significance
of the GLP_MOUSE hit?
Impact of database size
You will now use the GLP_HORSE.aa sequence to search two different
versions of the swissprot database: swiss90 is how the swissprot
database looked in 1990 (17,994 sequences), while swiss2003 is the same
database as it looked at the beginning of the year 2003 (122,560
sequences).
blastall -p blastp -i GLP_HORSE.aa -d swiss90 > GLP_HORSE.blast.swiss90.results
Note the E-value for the hit to GLP_MACFU (Macaque glycophorin).
blastall -p blastp -i GLP_HORSE.aa -d swiss2003 > GLP_HORSE.blast.swiss2003.results
Again, note the E-value for the hit to GLP_MACFU (Macaque glycophorin).
Q7: Compare the results. What is the impact of database size
on E-values? Why?
DNA vs. DNA and Protein vs. DNA
Search a set of primate DNA sequences from the GenBank database with
BT006808.nuc as the query sequence:
blastall -p blastn -i BT006808.nuc -d gbpri -e 2 -v 1000 > BT006808.blast.results
(The "-e 2" option tells BLAST to only report hits with E-values less
than 2. The "-v 1000" option tells BLAST to report at most 1000
hits).
The protein sequence BT006808.aa is the translation of
BT006808.nuc. You can use this as a query sequence
against a DNA database by using a version of BLAST
that translates the nucleotide database while searching:
blastall -p tblastn -i BT006808.aa -d gbpri -e 2 -v 1000 > BT006808.Tblast.results
Q8: Which of the two approaches is able to find a match
between insulin and human insulin-like growth factor? Why
do you think this is so?
There are three useful BLAST versions for different types of
nucleotide/protein comparison (fasta comes with the same options):
- blastx: Nucleotide query - Protein database.
- tblastn: Protein query - Translated nucleotide database.
- tblastx: Nucleotide query - Translated nucleotide database (!!)
Web interface
The BLAST programs are of course also accesible on the web:
NCBI blastp interface
You may want to try to see if you can figure out where you
control all the options we experimented with above.
|