BLAST is a service of the National Center for Biotechnology Information (NCBI). A nucleotide or protein sequence sent to the BLAST server is compared against databases at the NCBI and a summary of matches is returned to the user.
The www BLAST server can be accessed through the home page of the NCBI at www.ncbi.nlm.nih.gov. Stand-alone BLAST binaries can be obtained from the NCBI FTP site. See the Stand-Alone Blast section for details.
The BLAST 2.0 release differs significantly from the BLAST 1.4 release. The changes include significant performance enhancements, the addition of 'gapping' routines and position-specific iterated BLAST (see the PSI-Blast section), extensive changes to the text report (see below), and a new format for the databases (see the Stand-Alone Blast section). The options available and their command-line appearance have also changed substantially.
The BLAST 2.0 programs are described in a Nucleic Acids Research article. Please cite this reference if you publish the results of your BLAST query.
The BLAST family of programs allows all combinations of DNA or protein query sequences with searches against DNA or protein databases:
blastp   compares an amino acid query sequence against a protein sequence database.
blastn   compares a nucleotide query sequence against a nucleotide sequence database.
blastx   compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.
tblastn  compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).
tblastx  compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
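The "six-frame translation" used by blastx, tblastn, and tblastx can be illustrated with a short sketch. This is not the BLAST implementation; it assumes the standard genetic code, and the frame labels (+1..+3, -1..-3) and the function names are chosen here for illustration only.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an A/C/G/T string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons (e.g., with N) become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Translate three frames on each strand (labels are illustrative)."""
    frames = {}
    rc = reverse_complement(dna)
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(dna[offset:])
        frames["-%d" % (offset + 1)] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in sorted(six_frame_translations("ATGGCCATTGTAATGGGCC").items()):
        print(label, protein)
```

Each of the six conceptual products is then compared against the protein database (blastx) or against the translated database (tblastx).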
The default matrix for all protein-protein comparisons is BLOSUM62.
Version 2.0 of BLAST allows the introduction of gaps (deletions and insertions) into alignments. With a gapped alignment tool, homologous domains do not have to be broken into several segments. Also, the scoring of gapped results tends to be more biologically meaningful than ungapped results.
The blastn and blastp programs offer fully gapped alignments. blastx and tblastn have 'in-frame' gapped alignments and use sum statistics to link alignments from different frames. tblastx provides only ungapped alignments.
The sequence sent to the BLAST server should be in FASTA format, described in http://www.ncbi.nlm.nih.gov/BLAST/fasta.html.
A number of databases are also available. They are described in http://www.ncbi.nlm.nih.gov/BLAST/blast_databases.html.
The BLAST report consists of a number of sections. The descriptions below are for a blastp comparison, but the format for the other programs is analogous.
The BLAST report is not intended to be a parseable document. It is subject to change with little or no notice.
The BLAST report starts with some header information that lists the type of program (here blastp), the version (here 2.0.1), and a release date. Also listed are a reference to the BLAST program, the query definition line, and a summary of the database used.
BLASTP 2.0.1 [Aug-20-1997]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped
BLAST and PSI-BLAST: a new generation of protein database search programs",
Nucleic Acids Res. 25:3389-3402.
Query= gi|129295|sp|P01013|OVAX_CHICK gene X protein - chicken (fragment)
(232 letters)
Database: Non-redundant SwissProt sequences
59,576 sequences; 21,219,450 total letters
One-line descriptions of the database matches found are presented next. These
include a database sequence identifier, the corresponding definition line, as
well as the score (in bits) and the statistical significance ('E value') for this
match (please see the section on statistics for an explanation of bits and
significance). Consider the output below, from a gapped blastp comparison of
SwissProt accession P01013 against the SwissProt database.
                                                                    High  E
Sequences producing significant alignments:                        Score Value
sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)              442  e-124
sp|P01014|OVAY_CHICK GENE Y PROTEIN (OVALBUMIN-RELATED)              353  9e-98
sp|P01012|OVAL_CHICK OVALBUMIN (PLAKALBUMIN) (ALLERGEN GAL D II)     278  5e-75
sp|P19104|OVAL_COTJA OVALBUMIN                                       268  5e-72
sp|P48595|BOMA_HUMAN BOMAPIN (PROTEASE INHIBITOR 10)                 199  2e-51
sp|P29508|SCC1_HUMAN SQUAMOUS CELL CARCINOMA ANTIGEN 1 (SCCA-1) ...  198  5e-51
sp|P80229|ILEU_PIG LEUKOCYTE ELASTASE INHIBITOR (LEI) (LEUCOCYTE...  197  1e-50
sp|P48594|SCC2_HUMAN SQUAMOUS CELL CARCINOMA ANTIGEN 2 (SCCA-2) ...  196  2e-50
sp|P50453|PTI9_HUMAN CYTOPLASMIC ANTIPROTEINASE 3 (CAP3) (PROTEA...  195  6e-50
sp|P05619|ILEU_HORSE LEUKOCYTE ELASTASE INHIBITOR (LEI)              193  2e-49
The first match, in this case, is the actual query sequence. The identifiers shown here are all from SwissProt, so they all have 'sp' in the first field, followed by the accession, and then a locus name. The syntax of these identifiers is discussed in more detail in the appendices of ftp://ncbi.nlm.nih.gov/blast/db/README. The definition lines are taken from the definition line in the database, with an ellipsis (e.g., for P29508) indicating that the definition line was too long for the space available.
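The three-field identifier syntax described above ('sp', accession, locus name) can be split with a few lines of code. This is a minimal sketch for SwissProt-style identifiers only; the function name and the returned field names are illustrative, and other identifier types described in the README appendices are not handled.

```python
def parse_swissprot_id(identifier):
    """Split an identifier such as 'sp|P01013|OVAX_CHICK' into its fields."""
    fields = identifier.split("|")
    if len(fields) != 3 or fields[0] != "sp":
        raise ValueError("not a three-field 'sp' identifier: %r" % identifier)
    # Field names below are illustrative, not an official schema.
    return {"database": fields[0], "accession": fields[1], "locus": fields[2]}


if __name__ == "__main__":
    print(parse_swissprot_id("sp|P01013|OVAX_CHICK"))
```

Note that this parses the identifier only; the report as a whole is not intended to be a parseable document.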
Ungapped alignments and results from blastx and tblastn will have an additional column ('N'), displaying the number of different segment pairs used to produce the alignment, according to the Karlin-Altschul statistics.
Each alignment is preceded by the sequence identifier, the full definition line and the length of the database sequence. Next come the score (in bits, with the raw score in parentheses) and the statistical significance of the match, followed by the number of identities and positive matches according to the scoring system (e.g., BLOSUM62) and, if applicable, the number of gaps in the alignment. Finally the actual alignment is shown, with the query on top and the database match labeled as 'Sbjct'. Between the two sequences, the residue is shown if it is conserved and a '+' is shown if there is a positive match. One or more dashes, '-', indicate insertions or deletions. The example below is the third sequence listed in the one-line descriptions above.
>sp|P01012|OVAL_CHICK OVALBUMIN (PLAKALBUMIN) (ALLERGEN GAL D II)
Length = 386
Score = 278 bits (744), Expect = 5e-75
Identities = 149/231 (64%), Positives = 182/231 (78%), Gaps = 2/231 (0%)
Query 2 IKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNS 61
I+++L SS D T +VLVNAI FKG+W+ AF EDT+ MPF VT+QESKPVQMM
Sbjct 158 IRNVLQPSSVDSQTAMVLVNAIVFKGLWEKAFKDEDTQAMPFRVTEQESKPVQMMYQIGL 217
Query 62 FNVATLPAEKMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKR 121
F VA++ +EKMKILELPFASG +SMLVLLPDEVS LE++E INFEKLTEWT+ N ME+R
Sbjct 218 FRVASMASEKMKILELPFASGTMSMLVLLPDEVSGLEQLESIINFEKLTEWTSSNVMEER 277
Query 122 RVKVYLPQMKIEEKYNLTSVLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSE 181
++KVYLP+MK+EEKYNLTSVLMA+G+TD+F SANL+GISSAESLKISQAVH A E++E
Sbjct 278 KIKVYLPRMKMEEKYNLTSVLMAMGITDVFSSSANLSGISSAESLKISQAVHAAHAEINE 337
Query 182 DGIEMAGSTGVIEDIKHSPESEQFRADHPFLFLIKHNPTNTIVYFGRYWSP 232
G E+ GS + + SE+FRADHPFLF IKH TN +++FGR SP
Sbjct 338 AGREVVGSAEA--GVDAASVSEEFRADHPFLFCIKHIATNAVLFFGRCVSP 386
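The 'Identities' and 'Gaps' counts in the alignment header can be recomputed from the aligned rows. The sketch below counts them for one pair of rows; it is an illustration rather than BLAST code, and it does not compute 'Positives', which would require the scoring matrix (e.g., BLOSUM62).

```python
def alignment_stats(query_row, sbjct_row):
    """Return (identities, gaps, aligned_length) for two aligned rows."""
    if len(query_row) != len(sbjct_row):
        raise ValueError("aligned rows must have the same length")
    pairs = list(zip(query_row, sbjct_row))
    # An identity is the same residue in both rows (a dash never counts).
    identities = sum(1 for q, s in pairs if q == s and q != "-")
    # A gap column has a dash in either row.
    gaps = sum(1 for q, s in pairs if q == "-" or s == "-")
    return identities, gaps, len(pairs)


if __name__ == "__main__":
    # Last block of the example alignment (Query 182-232, Sbjct 338-386).
    query = "DGIEMAGSTGVIEDIKHSPESEQFRADHPFLFLIKHNPTNTIVYFGRYWSP"
    sbjct = "AGREVVGSAEA--GVDAASVSEEFRADHPFLFCIKHIATNAVLFFGRCVSP"
    print(alignment_stats(query, sbjct))
```

Summing these counts over all four blocks of the example gives the totals shown in the header line (the denominator, 231, is the total number of alignment columns, including the two gap columns).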
The last section lists specifics about the database searched as well as statistical and search parameters used:
Database: Non-redundant SwissProt sequences
Posted date: Aug 14, 1997 9:52 AM
Number of letters in database: 21,219,450
Number of sequences in database: 59,576
Lambda K H
0.317 0.132 0.377
Gapped
Lambda K H
0.255 0.0350 0.190
Matrix: BLOSUM62
Gap Penalties: Existence: 10, Extension: 1
Number of Hits to DB: 8938654
Number of Sequences: 59576
Number of extensions: 335248
Number of successful extensions: 1188
Number of sequences better than 10: 116
Number of HSP's better than 10.0 without gapping: 106
Number of HSP's successfully gapped in prelim test: 10
Number of HSP's that attempted gapping in prelim test: 868
Number of HSP's gapped (non-prelim): 120
length of query: 232
length of database: 21219450
effective HSP length: 52
effective length of query: 180
effective length of database: 18121498
effective search space: -1033097656
T: 11
A: 40
X1: 16 ( 7.3 bits)
X2: 40 (14.7 bits)
X3: 67 (24.6 bits)
S1: 41 (21.7 bits)
S2: 64 (28.4 bits)
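The effective lengths in the report above are consistent with simple arithmetic on the listed values. The relations below are inferred from the sample numbers, not stated in this document, so treat them as an assumption. Under that assumption, the negative 'effective search space' is what a 32-bit signed integer would show for the product of the two effective lengths.

```python
def effective_lengths(query_len, db_len, num_seqs, hsp_len):
    """Assumed relations: subtract the effective HSP length once from the
    query and once per database sequence from the database length."""
    eff_query = query_len - hsp_len
    eff_db = db_len - num_seqs * hsp_len
    return eff_query, eff_db


def wrap_int32(value):
    """Interpret an integer as a 32-bit two's-complement signed value."""
    return (value + 2**31) % 2**32 - 2**31


if __name__ == "__main__":
    eff_query, eff_db = effective_lengths(232, 21219450, 59576, 52)
    search_space = eff_query * eff_db
    print(eff_query, eff_db)         # 180 18121498
    print(search_space)              # 3261869640
    print(wrap_int32(search_space))  # -1033097656
```

The wrapped value matches the figure printed in the report, which suggests the printed number is a display overflow rather than a meaningful negative quantity.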
One may judge the results of a blast search by two numbers. One is the 'bit' score, which is defined as:
S' (bits) = [lambda * S (raw) - ln K] / ln 2
where lambda and K are Karlin-Altschul parameters. Expressing the score in bits makes it independent of the scoring system used (i.e., which matrix). The other is the Expect value, which estimates the statistical significance of the match by specifying the number of matches with a given score that are expected purely by chance in a search of a database of this size. An Expect value of two would indicate that two matches with this score are expected purely by chance. The Expect value changes with the size of the database (in a larger database, more chance matches with a given score are expected) and is the most intuitive way to rank results or to compare the results of one query run against two different databases.
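The bit-score formula can be checked against the sample report: the third match has a raw score of 744, and the gapped parameters are Lambda = 0.255 and K = 0.0350. The sketch below also uses the standard Karlin-Altschul relation E = N * 2^(-S'), where N is the search space; that relation is not given in this document, and the value of N used here (180 * 18121498, the product of the two effective lengths in the report) is an assumption.

```python
import math


def bit_score(raw_score, lam, k):
    """S' (bits) = [lambda * S (raw) - ln K] / ln 2"""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, search_space):
    """Assumed relation (not from this document): E = N * 2^(-S')."""
    return search_space * 2.0 ** (-bits)


if __name__ == "__main__":
    bits = bit_score(744, 0.255, 0.0350)
    print("bits   = %.1f" % bits)  # close to the 278 bits in the report
    print("expect = %.1e" % expect_value(bits, 180 * 18121498))
    # Cutoff S2 from the report: raw score 64 with the gapped parameters.
    print("S2     = %.1f bits" % bit_score(64, 0.255, 0.0350))
```

Because Lambda and K are printed with only three significant digits, the recomputed values agree with the report only approximately.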
This section is only applicable if a user wishes to run stand-alone BLAST at their own institution. One reason to do so might be the wish to use private databases not available at the NCBI.
Users of www or network BLAST do not need to read these sections.
BLAST binaries are provided for IRIX6.2, Solaris2.5, DEC OSF1 (ver. 4), and Win32 systems. We will attempt to produce binaries for other platforms upon request.
Stand-alone binaries are available from ftp://ncbi.nlm.nih.gov/blast/executables.
The source code for BLAST 2.0 is part of the NCBI toolkit. The NCBI toolkit may be obtained from ftp://ncbi.nlm.nih.gov/toolbox/ncbi_tools. Use the makedemo Makefiles to build blastpgp, blastall, and formatdb after compiling the rest of the toolkit (see Compiling Blast).
Please remember to FTP in binary mode.
A new tool, formatdb, should be used to format FASTA databases, both protein and DNA, for BLAST 2.0. This must be done before blastall or blastpgp can be run locally. The format of the databases has been changed substantially from the BLAST 1.4 release. A major improvement in this format over the old one is that ambiguity information for DNA sequences is now retrieved from the files produced by formatdb, rather than from the original FASTA file. The original FASTA file is no longer needed for the BLAST runs. Formatdb may be obtained along with the other BLAST binaries from the executables directory (see above). Usage of formatdb may be obtained by executing formatdb and a dash:
formatdb -
formatdb arguments:
  -t  Title for database file [String]  Optional
  -i  Input file for formatting (this parameter must be set) [File In]
  -l  Logfile name: [File Out]  Optional
      default = formatdb.log
  -p  Type of FASTA file
        T - protein
        F - nucleotide [T/F]  Optional
      default = T
  -o  Parse options
        T - True: Parse SeqId and create indexes.
        F - False: Do not parse SeqId. Do not create indexes.
      [T/F]  Optional
      default = F
If the "-o" option is TRUE, then the database identifiers must follow the convention described in the appendices of ftp://ncbi.nlm.nih.gov/blast/db/README. It is always advantageous to use the '-o' option if the database identifiers are in the format specified above. If the database identifiers are in the parseable format, formatdb produces additional indices allowing retrieval from the databases by identifier. The databases on the NCBI FTP site contain parseable identifiers. If the first word on a FASTA definition line is a unique identifier (e.g., ">3091 Alcohol de...") then it is sufficient to insert "lcl|" between the ">" and the word (e.g., ">lcl|3091 Alcohol de...").
The "-o" option must be TRUE in any of the following cases:
1.) ASN.1 is to be produced from blastall or blastpgp.
2.) Master-slave alignments are desired (i.e., the '-m' option with a non-zero value is used).
3.) The gi's are desired as part of the output (i.e., '-I' is used).
4.) fastacmd is used to fetch sequences from the database by accession or gi.
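Inserting "lcl|" into the definition lines of a private FASTA file, as described above, is easy to script. The sketch below is one possible approach, not an NCBI tool. It assumes that a first word already containing a '|' is a parseable identifier and leaves it alone; that heuristic and the function name are choices made for this example, and the sketch does not check that the identifiers are unique.

```python
def add_local_prefix(lines):
    """Yield FASTA lines, inserting 'lcl|' after '>' on definition lines
    whose first word does not already look like a parseable identifier."""
    for line in lines:
        if line.startswith(">"):
            words = line[1:].split(None, 1)
            # Heuristic (an assumption): a '|' in the first word means the
            # identifier already follows the database|identifier syntax.
            if words and "|" not in words[0]:
                yield ">lcl|" + line[1:]
                continue
        yield line


if __name__ == "__main__":
    example = [">3091 Alcohol de...", "MSTAGKVIKCKAAVLWE"]
    for out in add_local_prefix(example):
        print(out)
```

The rewritten file could then be formatted with the parse option enabled, for example 'formatdb -i FILE -p T -o T' for a protein file, using the arguments listed above.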
Blastall may be used to perform all five flavors of blast comparison. One may obtain the blastall options by executing 'blastall -' (note the dash). A typical blastall command line to perform a blastn search (nucl. vs. nucl.) of a file called QUERY would be:
blastall -p blastn -d nr -i QUERY -o out.QUERY
The output is placed into the output file out.QUERY and the search is performed against the 'nr' database. If a protein vs. protein search is desired, then 'blastn' should be replaced with 'blastp' etc.
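When many searches are scripted, the command line above can be assembled programmatically. The following is a minimal sketch that uses only the four options shown in the example (-p, -d, -i, -o); the function names are illustrative, and running it requires a local blastall binary and a formatted database.

```python
import subprocess

# The five programs listed earlier in this document.
PROGRAMS = ("blastp", "blastn", "blastx", "tblastn", "tblastx")


def blastall_command(program, database, query_file, output_file):
    """Build the argument list for a blastall run."""
    if program not in PROGRAMS:
        raise ValueError("unknown BLAST program: %r" % program)
    return ["blastall", "-p", program, "-d", database,
            "-i", query_file, "-o", output_file]


def run_blastall(program, database, query_file, output_file):
    """Run blastall and raise CalledProcessError on a non-zero exit status."""
    subprocess.run(
        blastall_command(program, database, query_file, output_file),
        check=True,
    )


if __name__ == "__main__":
    # Prints the command from the example above without executing it.
    print(" ".join(blastall_command("blastn", "nr", "QUERY", "out.QUERY")))
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.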
Blastpgp performs gapped blastp searches and can be used to perform iterative searches in psi-blast mode. See the PSI-Blast section for a description of this binary. The options may be obtained by executing 'blastpgp -'.
Blast 2.0 uses threads to perform multi-processing searches. The OS requirements are IRIX 6 on SGI's (with the relevant threads patches, see below), any Solaris version, or a version of DEC UNIX. IRIX 5 may be used if multi-processing is not enabled.
SGI recommends the following threads patches on IRIX6 systems:
For 6.2 systems, install SG0001404, SG0001645, SG0002000, SG0002420 and SG0002458 (in that order).
For 6.3 systems, install SG0001645, SG0002420 and SG0002458 (in that order).
For 6.4 systems, install SG0002194, SG0002420 and SG0002458 (in that order).
These patches can be obtained by calling SGI customer service or from the web: http://support.sgi.com/
BLAST uses memory-mapped files (on UNIX and NT systems), so it runs best if it can read the entire BLAST database into memory and then keep using it there. The resources consumed reading a database into memory can easily outweigh the cost of a BLAST search, so the memory of a machine is normally more important than the CPU speed. This means that one should have sufficient memory for the largest BLAST database one will use, run all the searches against this database in serial, and then run the queries against another database in serial. This ensures that each database is read into memory only once.
As of this date (Aug. 1997) the EST FASTA file is about 500 Meg, which translates to about 170-200 Meg of BLAST database. At least another 100-200 Meg should be allowed for memory consumed by the actual BLAST program. All of the FASTA databases together are about 1.5 Gig, and the BLAST databases produced from them will probably be about another Gig or so. 4 Gig of disk space, to make room for software and output, is probably a pretty good bet.
BLAST needs to know where the NCBI data is. This is specified by the main configuration file for the NCBI toolkit (".ncbirc" on UNIX systems, ncbi.ini on Windows, analogous names on other platforms). If BLAST is the ONLY NCBI application that will be used, it is sufficient to have the following two-line configuration file:
[NCBI]
Data=/am/ncbiapdata/data
BLAST looks for the files 'seqcode.val', 'gc.val', and 'BLOSUM62' in the "Data" directory (e.g., "/am/ncbiapdata/data/seqcode.val"). A directory different from "/am/ncbiapdata/data" can be used if this is desired. The files seqcode.val, gc.val, and BLOSUM62 can be found in the data directory of the toolbox (i.e., ncbi/data). The .ncbirc should be either in the directory from which BLAST is called, the user's home directory, or in the directory set by the environment variable "NCBI".
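A quick check of this configuration can save a failed run. The sketch below lists the three places where the .ncbirc may live, reads the "Data" entry from the [NCBI] section, and reports which of the expected data files are missing. It is a helper written for this example, not part of the toolkit: the order in which the toolkit itself searches these locations is not specified here, and the file name gc.val follows the listing of the toolbox data directory.

```python
import configparser
import os


def candidate_config_paths(cwd, home, ncbi_dir=None, name=".ncbirc"):
    """List the documented locations; the toolkit's own search order
    is not specified in this document, so this order is an assumption."""
    dirs = [cwd, home]
    if ncbi_dir:
        dirs.append(ncbi_dir)
    return [os.path.join(d, name) for d in dirs]


def read_data_dir(config_path):
    """Return the Data entry of the [NCBI] section."""
    parser = configparser.ConfigParser()
    with open(config_path) as handle:
        parser.read_file(handle)
    return parser["NCBI"]["Data"]


def missing_data_files(data_dir, names=("seqcode.val", "gc.val", "BLOSUM62")):
    """Return the expected data files that are absent from data_dir."""
    return [n for n in names if not os.path.isfile(os.path.join(data_dir, n))]


if __name__ == "__main__":
    paths = candidate_config_paths(
        os.getcwd(), os.path.expanduser("~"), os.environ.get("NCBI")
    )
    for path in paths:
        if os.path.isfile(path):
            data_dir = read_data_dir(path)
            print(path, "->", data_dir, "missing:", missing_data_files(data_dir))
            break
    else:
        print("no .ncbirc found in:", paths)
```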
On UNIX systems, environment variables can be set (e.g., with setenv) to specify the directories of the databases (BLASTDB) and matrices (BLASTMAT).
On non-UNIX systems it is currently necessary to run BLAST from the same directory as the databases, or explicitly write out the path. BLAST will soon read the NCBI configuration file for database directory information.
BLAST 2.0 uses the dust low-complexity filter for blastn and seg for the other programs. 'dust' is an integral part of the NCBI toolkit and is accessed automatically. 'seg' is a stand-alone program written only for UNIX. It may be obtained from ftp://ncbi.nlm.nih.gov/pub/seg/seg/. The environment variable for filters is BLASTFILTER.
The FASTA files used by the NCBI to produce BLAST databases are available on the NCBI FTP site in ftp://ncbi.nlm.nih.gov/blast/db/. Please see the README for details.
This section provides abbreviated instructions on building BLAST
for some popular platforms. It also provides guidance on how to
build the toolkit in a threaded manner, so that multi-threaded
BLAST may be run. It is still recommended that the README provided
with the NCBI toolkit be referred to.
To make BLAST it is first necessary to make the standard NCBI libraries (these
actually contain most of the BLAST source code). It is then necessary to
compile the demos, which contain blastall, blastpgp, and formatdb. BLAST
does not require the network or vibrant libraries.
===============================================================================
Solaris 2.5
===============================================================================
1.) Obtain the toolkit archive from the NCBI FTP site, download in binary mode,
uncompress, and untar.
2.) cd ncbi/build
3.) cp ../make/*.unx .
4.) mv makeall.unx makefile
5.) Make with:
make LCL=sol CC=gcc OTHERLIBS="-lm -lthread"
6.) Make demos with:
make -f makedemo.unx CC=gcc OTHERLIBS="-lm -lthread" THREAD_OBJ="ncbithr.o"
===============================================================================
IRIX6
===============================================================================
1.) Obtain the toolkit archive from the NCBI FTP site, download in binary mode,
uncompress, and untar.
2.) cd ncbi/build
3.) cp ../make/*.unx .
4.) mv makeall.unx makefile
5.) Make with:
make LCL=sgi OTHERLIBS="-lm -lPW -lpthread" CFLAGS1="-c -O -DPOSIX_THREADS_AVAIL"
6.) Make demos with:
make -f makedemo.unx LCL=sgi OTHERLIBS="-lm -lPW -lpthread" THREAD_OBJ="ncbithr.o"
===============================================================================
DEC Alpha (v. 4.0)
===============================================================================
1.) Obtain the toolkit archive from the NCBI FTP site, download in binary mode,
uncompress, and untar.
2.) cd ncbi/build
3.) cp ../make/*.unx .
4.) mv makeall.unx makefile
5.) Make with:
make LCL=alf CC=cc RAN=ranlib OTHERLIBS="-lm -pthread"
6.) Make demos with:
make -f makedemo.unx LCL=alf CC=cc RAN=ranlib OTHERLIBS="-lm -pthread" THREAD_OBJ="ncbithr.o"
The format of the BLAST databases has changed for the 2.0 release and is not compatible with the databases used in the 1.4 release. The change was made to eliminate an unpleasant feature of the 1.4 databases: ambiguity information for nucleotide sequences was not stored in the compressed file, so the original FASTA file had to be accessed for this information. This led to significant slow-downs in BLAST comparisons for databases, such as dbest, that contain a large number of ambiguity characters.
The blastpgp program can do an iterative search in which
sequences found in one round of searching are used to build
a score model for the next round of searching. In this usage,
the program is called Position-Specific Iterated BLAST, or PSI-BLAST.
As explained in the accompanying paper, the BLAST algorithm is
not tied to a specific score matrix. Traditionally, it has been
implemented using an AxA substitution matrix where A is the alphabet size.
PSI-BLAST instead uses a QxA matrix, where Q is the length of the query
sequence; at each position the cost of a letter depends on the position
w.r.t. the query and the letter in the subject sequence.
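The difference between an AxA substitution matrix and a QxA position-specific matrix can be made concrete with a toy example. The sketch below scores an ungapped alignment both ways over a two-letter alphabet; the matrix values and function names are invented for illustration and have nothing to do with BLOSUM62 or with the way blastpgp derives its scores.

```python
def score_with_substitution_matrix(query, subject, matrix):
    """AxA scoring: the score depends on the pair of letters only."""
    return sum(matrix[q][s] for q, s in zip(query, subject))


def score_with_position_matrix(pssm, subject):
    """QxA scoring: row i holds the scores for query position i."""
    return sum(row[s] for row, s in zip(pssm, subject))


def position_matrix_from_substitution_matrix(query, matrix):
    """Build a QxA matrix that reproduces plain AxA scoring for this query."""
    return [dict(matrix[q]) for q in query]


if __name__ == "__main__":
    toy = {"A": {"A": 2, "B": -1}, "B": {"A": -1, "B": 3}}  # made-up values
    query, subject = "ABA", "ABB"
    pssm = position_matrix_from_substitution_matrix(query, toy)
    print(score_with_substitution_matrix(query, subject, toy))  # 4
    print(score_with_position_matrix(pssm, subject))            # 4
    # A position-specific model can score 'B' differently at position 3 only.
    pssm[2]["B"] = 1
    print(score_with_position_matrix(pssm, subject))            # 6
```

The last step shows what a single AxA matrix cannot express: the same letter pair receives different scores at different query positions.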
The position-specific matrix for round i+1 is built from a constrained
multiple alignment among the query and the sequences found with
sufficiently low e-value in round i. The top part of the output for
each round divides the sequences into two groups: sequences found
previously and used in the score model, and sequences not used in the
score model. The output currently includes many diagnostics
requested by users at NCBI. To skip quickly from the output of
one round to the next, search for the string "producing", which is
part of the header for each round and likely does not appear elsewhere
in the output. PSI-BLAST "converges" and stops if all sequences
found at round i+1 below the e-value threshold were already in
the model at the beginning of the round.
There are three blastpgp parameters specifically for PSI-BLAST:
-j is the maximum number of rounds (default 1; i.e., regular BLAST)
-e is the e-value threshold for including sequences in the
score matrix model (default 0.01)
-c is the "constant" used in the pseudocount formula specified in the
paper (default 10)
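The round structure and the convergence rule described above can be summarized as a loop. In the sketch below, 'search' is a hypothetical stand-in supplied by the caller that returns a mapping from sequence identifier to e-value for the current model; it is not a blastpgp interface. The defaults mirror the -j and -e defaults listed above, and the pseudocount constant (-c) is not modeled.

```python
def psi_blast_rounds(search, max_rounds=1, e_threshold=0.01):
    """Iterate until max_rounds is reached or no new sequence qualifies.

    search(model_ids) is a hypothetical callable returning {seq_id: e_value}.
    """
    model_ids = set()  # sequences used in the current score model
    rounds = []
    for _ in range(max_rounds):
        hits = search(frozenset(model_ids))
        rounds.append(hits)
        included = {sid for sid, e in hits.items() if e < e_threshold}
        # Converged: every qualifying sequence was already in the model
        # at the beginning of this round.
        if included <= model_ids:
            break
        model_ids |= included
    return rounds


if __name__ == "__main__":
    def toy_search(model_ids):
        # Made-up behavior: a third sequence appears once the model has two.
        hits = {"seqA": 1e-50, "seqB": 1e-5}
        if len(model_ids) >= 2:
            hits["seqC"] = 1e-3
        return hits

    for number, hits in enumerate(psi_blast_rounds(toy_search, max_rounds=10), 1):
        print("round", number, sorted(hits))
```

With max_rounds=1 the loop runs a single search, which corresponds to the regular BLAST behavior of the -j default.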
References
Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database
search programs", Nucleic Acids Res. 25:3389-3402.
Karlin, Samuel and Stephen F. Altschul (1990), "Methods for
assessing the statistical significance of molecular sequence
features by using general scoring schemes", Proc. Natl. Acad.
Sci. USA 87:2264-2268.
Karlin, Samuel and Stephen F. Altschul (1993), "Applications
and statistics for multiple high-scoring segments in molecular
sequences", Proc. Natl. Acad. Sci. USA 90:5873-5877.