Exercise: BLAST
Exercise written by: Rasmus
Wernersson
Introduction
In this exercise we will be using BLAST (Basic Local Alignment Search Tool) for searching sequence
databases such as GenBank (DNA data) and UniProt (protein). When using
BLAST for sequence searches it is of utmost importance to be able to
critically evalutate the statistical
significance of the results returned.
The BLAST software package is free to use (Open Source) and be be
installed on any local system - it's originally written for UNIX type
Operating Systems. The package contains both programs for performing
the actual sequence searches against preexisting databases (e.g. "blastn" for DNA databases and "blastp" for protein databases),
aswell as a tool for creating new databases from stratch (the "fortmatdb"
program).
In this exercise we will be using the Web-interface to BLAST hosted by
the NCBI. For our purpose there are several advantages to this approach:
- We don't have to mess around with a UNIX command promt.
- NCBI offers direct access to preformatted BLAST databases of all
the data that they host:
- GenBank (+ derivates)
- Full Genome database
- Protein database (Both from translated GenBank and UniProt)
It should be noted that running BLAST locally (for example at the
super-computer cluster at CBS/DTU) offers much more fine-grained
control of DATA and workflow (everything can be scripted/automated)
than running BLAST through a web-interface.
Links:
Part 1- Assesing the statistical significance of BLAST hits.
As discussed in the lecture, there will be a risk of getting false positive results (hits to
sequences that are not related to our input sequence) by pure
stocastical means. In this first part of the exercise we will be
investigating this further, by examining what happens when we submit randomly generated sequence to BLAST
searches.
Rather than giving out a set of pre-generated DNA/Peptide sequences
where you only have my word for their randomness, you'll be generating
you own random sequence by throwing
dice - four sided dice ("d4")
for DNA sequences and twenty-sided dice ("d20") for protein sequences (see the
tables below). If you make this exercise at home, and don't have access
to d4/d20 dice, you can use this small Java program instead: "Remote
Roller".
STEP 1 - DNA sequences and BLASTN:
- Generate 3 DNA seqeunces of length 25bp using the table below.

1 : A
2 : C
3 : G
4 : T
- QUESTION 1: Report the
three sequences in FASTA format (give them short UNIQUE names, e.g. "seq1", "seq2",
"seq3").
- We now need to do a BLASTN
search at NCBI.
- Follow the "Nucleotide BLAST"
link fro the main BLAST page.
- In the section "Program
Selection"
select the option "Somewhat similar
sequences (blastn)"
- Choose "Nucletide
Collection NR/NT" as the search database. NR is the "Non
Redundancy" database, which contains all non-redundant (non-identical)
sequences from GenBank and the full Genome databases.
- VERY
IMPORTANT:
For this special situation where we BLAST small artificial sequences we
need to turn off some the automatics NCBI incorporate when short
sequences are detected. Otherwise we'll not be able to see the intended
results.
Extend the "Algorithm parameters"
section (see the screen shot below) in order to gain access to
fine-tuning the options.
- Deselect the "Automaticlly adjust parameters for short
input sequences" option
Notice: It's also
possible to E-value cutoff ("Expect
threshold") and BLASTN "word
size" here.
- Paste in you three sequences in FASTA format and start the
BLAST search.

- Inspect the results.
Notice that NCBI supplies a summary table at the very top of the page,
and the individual alignment can be seen further down.
QUESTION 2:
- How big (in basepairs) is the database we used for the BLAST
search?
- Do you find any sequences that looks like your input
sequences
(you can paste in a few in your report).
- What is the typical length
of the hits (the alignment length)?
- What is the typical %
identity?
- In what range is the bit-scores
("max score)?
- Notice: This is conceptually the same as the "alignment score" we have already
met in the pairwise alignment exercise.
- What is the range of the E-values?
QUESTION 3:
- What is the biological
significance of these hits?
- Now let's try to perform a search against a different database. Open a new window/tab for this -
you'll need to compare the results in a moment.
This time choose the "Human genomic
plus transcripts" database (and remember to set the Algorithm parameters - same as
above), and run the BLAST search.
QUESTION 4:
- Report the same basic statistics as in question 2.
QUESTION 5:
- All human sequences are also
found in the NR database. In case you had any hits against human
sequences in the first run:
- Has any of the statistics changed (Bit score, E-value, etc.).
- Why would they change?
STEP 2 - Protein sequences and BLASTP:
- Now it's time to work with a set of protein sequences - generate
three sequences of length 25 aa using the table below.
Notice 1: The distribution
of aminoacids will be equal (5% prob) and this is different from true
biological sequneces - however this is not important for this first
part of the exercise.
Notice 2: Please recall
from the lecture that the way BLASTP selecets candidate sequences for
full Smith-Watermann alignment is different from BLASTN. (BLASTN - a
single short (11 bp +) perfect match hit is needed. BLASTP - a pair of "near match" hits of 3 aa with-in a
40 aa window is needed).
QUESTION 6: Report the
sequences in FASTA format (once again use short UNIQUE names).

1
: A
2 : R
3 : N
4 : D
5 : C
6 : Q
7 : E
8 : G
9 : H
10 : I
11 : L
12 : K
13 : M
14 : F
15 : P
16 : S
17 : T
18 : W
19 : Y
20 : V
- Locate the "Protein BLAST"
page at NCBI and choose BLASTP
as the algorithm to use.
Paste in your sequences in FASTA format, and choose the "NR" database (this is the protein
version, consisting of translated CDS'es, UniProt etc).
VERY
IMPORTANT: We also need to tweak the parameters this time - in
the "Algorithm Parameter"
section select BLOSUM62 as the
alignment matrix to use and set the "Expect
treshold" to 1000
(default:
10) - and DISABLE the "Short sequence" parameters as we did in the DNA
search a moment ago - otherwise our carefully tweaked parameters will
be ignored.
Perform the BLAST search.
- Inspect
the results:
QUESTION 7:
- How big is the database this time?
- What is the typical length of the alignment and do they contain
gaps?
- What is the range of E-values?
- Try to inspect a few of the alignments in details ("+" means
similar) - do you find any that look plausible, if we for a moment
ignore the length - (you can qoute a few in your report)?
- If we have used the default E-value cutoff of 10 would any hits have been found?
- What returns the most junk - DNA Blast or Protein Blast (take
the E-value into account).
Part 2 - using BLAST to transfer functional information by finding
homologs
One of the most common ways to use BLAST as a tool, is in the
situation where you have a sequence of unknown
function, and want to find out what function it has. Since a
large amount of sequence data has been gathered during the years,
changes are that an evolutionary
related sequence with known function has already been
identified. In general such a related sequence is known as a "homolog".
Homo-,
Ortho- and Paralogs:
- A Homolog is a general term that describe a sequence that is
related by any evolutionary means.
- An Ortholog ("Ortho" = True) is a sequence that is "the same gene" in a different
organism: The sequences shared a single
common ancestor sequence, and has now diverged through
speciation (e.g. the Alpha globin
gene in Human and Mouse).
- A Paralog arises due to a gene
duplication with-in a species. For example Alpha- and Beta-globin are each others
paralogs.
Notice that in both cases
it's possible to transfer information, for example information about
gene family / protein domains.
We have already touch upon comparison of (potential) evolutionary
related sequences in the pairwise alignment exercise. However, this
time we do not start out with two sequences we assume are related, but
we rather start out with a single sequence ("query sequence") which we will use
to search the databases for homologs (we often informally speak of "BLAST hits", when discussing the
sequences found).
Lets start out with a sequence that will produce some good hits in
the database. The sequence below is a full-legth transcript (mRNA) from
a prokaryote. Let's find out what it is.
>Unknown_transcript01
CCACTTGAAACCGTTTTAATCAAAAACGAAGTTGAGAAGATTCAGTCAACTTAACGTTAATATTTGTTTC
CCAATAGGCAAATCTTTCTAACTTTGATACGTTTAAACTACCAGCTTGGACAAGTTGGTATAAAAATGAG
GAGGGAACCGAATGAAGAAACCGTTGGGGAAAATTGTCGCAAGCACCGCACTACTCATTTCTGTTGCTTT
TAGTTCATCGATCGCATCGGCTGCTGAAGAAGCAAAAGAAAAATATTTAATTGGCTTTAATGAGCAGGAA
GCTGTTAGTGAGTTTGTAGAACAAGTAGAGGCAAATGACGAGGTCGCCATTCTCTCTGAGGAAGAGGAAG
TCGAAATTGAATTGCTTCATGAATTTGAAACGATTCCTGTTTTATCCGTTGAGTTAAGCCCAGAAGATGT
GGACGCGCTTGAACTCGATCCAGCGATTTCTTATATTGAAGAGGATGCAGAAGTAACGACAATGGCGCAA
TCAGTGCCATGGGGAATTAGCCGTGTGCAAGCCCCAGCTGCCCATAACCGTGGATTGACAGGTTCTGGTG
TAAAAGTTGCTGTCCTCGATACAGGTATTTCCACTCATCCAGACTTAAATATTCGTGGTGGCGCTAGCTT
TGTACCAGGGGAACCATCCACTCAAGATGGGAATGGGCATGGCACGCATGTGGCCGGGACGATTGCTGCT
TTAAACAATTCGATTGGCGTTCTTGGCGTAGCGCCGAGCGCGGAACTATACGCTGTTAAAGTATTAGGGG
CGAGCGGTTCAGGTTCGGTCAGCTCGATTGCCCAAGGATTGGAATGGGCAGGGAACAATGGCATGCACGT
TGCTAATTTGAGTTTAGGAAGCCCTTCGCCAAGTGCCACACTTGAGCAAGCTGTTAATAGCGCGACTTCT
AGAGGGGTTCTTGTTGTAGCGGCATCTGGGAATTCAGGTGCAGGCTCAATCAGCTATCCGGCCCGTTATG
CGAACGCAATGGCAGTCGGAGCGACTGACCAAAACAACAACCGCGCCAGCTTTTCACAGTATGGCGCAGG
GCTTGACATTGTCGCACCAGGTGTAAACGTGCAGAGCACATACCCAGGTTCAACGTATGCCAGCTTAAAC
GGTACATCGATGGCTACTCCTCATGTTGCAGGTGCAGCAGCCCTTGTTAAACAAAAGAACCCATCTTGGT
CCAATGTACAAATCCGCAATCATCTAAAGAATACGGCAACGAGCTTAGGAAGCACGAACTTGTATGGAAG
CGGACTTGTCAATGCAGAAGCGGCAACACGCTAATCAATAATAATAGGAGCTGTCCCAAAAGGTCATAGA
TAAATGACCTTTTGGGGTGGCTTTTTTACATTTGGATAAAAAAGCACAAAAAAATCGCCTCATCGTTTAA
AATGAAGGTACC
- Peform a BLAST search in the NR/NT
database (BLASTN) using default
settings.
QUESTION 8:
- Do we get any significant hits?
- What kind of genes (function) do we find?
- How is the distribution of E-values - a broad spectrum from
good to bad, or a sharp division into good/bad scores?
- Now let's try to do the same at the protein level.
- Find the longest ORF using VirtualRibosome
(hint: remeber to search all positive reading frames) and save of copy the sequence in
FASTA format.
- BLAST the sequence (BLASTP)
against the NR database.
QUESTION 9:
- Do we find any conserved protein-domains? (Indicated at the
very top of the result page, and during the search). Identifying known protein domains can
provide important clues to the function of an unknown protein.
- Do we find any significant hits? (E-value?)
- Are all the best hits the same category of enzymes?
- How does the distribution of E-values looks compared to the
DNA search?
- From what you have seen, what is best for identifying
intermediate quality
hits - DNA or Protein BLAST?
- In the previous section we have been cheating a bit by using a
sequence
that was already in the database - let's move to the following sequence
instead.
The sequence is a DNA fragment
fron an unknown non-cultivatable microorganism. It was cloned and
sequenced directly from DNA extracted from a soil-sample, and it goes
by the poetic name "CLONE12". It was amplified using degenerated PCR
primers that target the middle ("core
cloning") of the sequence of a group of known enzymes. (I can guarantee
this particular sequence is not in the BLAST databases, since I have
cloned and sequenced it myself, and it has never been submitted to
GenBank).
LOCUS
CLONE12.DNA 609 BP
DS-DNA
UPDATED 06/14/98
DEFINITION UWGCG file capture
ACCESSION -
KEYWORDS -
SOURCE -
COMMENT
Non-sequence data from original file:
BASE
COUNT 174 A 116
C 162 G 157
T 0 OTHER
ORIGIN ?
clone12.dna Length:
609 Jun 13, 1998 - 03:39 PM Check: 6014 ..
1 AACGGGCACG GGACGCATGT AGCTGGAACA GTGGCAGCCG TAAATAATAA TGGTATCGGA
61 GTTGCCGGGG TTGCAGGAGG AAACGGCTCT ACCAATAGTG GAGCAAGGTT AATGTCCACA
121
CAAATTTTTA ATAGTGATGG GGATTATACA AATAGCGAAA CTCTTGTGTA CAGAGCCATT
181
GTTTATGGTG CAGATAACGG AGCTGTGATC TCGCAAAATA GCTGGGGTAG TCAGTCTCTG
241
ACTATTAAGG AGTTGCAGAA AGCTGCGATC GACTATTTCA TTGATTATGC AGGAATGGAC
301
GAAACAGGAG AAATACAGAC AGGCCCTATG AGGGGAGGTA TATTTATAGC TGCCGCCGGA
361
AACGATAACG TTTCCACTCC AAATATGCCT TCAGCTTATG AACGGGTTTT AGCTGTGGCC
421
TCAATGGGAC CAGATTTTAC TAAGGCAAGC TATAGCACTT TTGGAACATG GACTGATATT
481
ACTGCTCCTG GCGGAGATAT TGACAAATTT GATTTGTCAG AATACGGAGT TCTCAGCACT
541
TATGCCGATA ATTATTATGC TTATGGAGAG GGAACATCCA TGGCTTGTCC ACATGTCGCC
601
GGCGCCGCC
//
- QUESTION 10:
Your task is now to find out what
kind of enzyme this sequence is likely to encode, using the methods you
have learned.
Notice: The sequence
is (more or less) in GenBank format and the NCBI BLAST server expects
the input to be in FASTA format, or to be "raw" unformatted sequence.
There are two solutions to this:
- Copy the
sequence into a text-editor and manually create a FASTA file ("search and replace" is useful for
the reformatting).
This is the most
robust solution: it will always work.
- Hope the
creaters of the web-server you're using were kind enough to
automatically remove non-DNA letters (paste in ONLY the DNA lines) -
this turns out to be the case for both NCBI BLAST and VirtualRibosome, but it cannot be universally
relied upon.
Consider the following before you start on solving this task:
- Based on the information given: is the sequence
protein-coding?
- If it is, can you trust it will contain both a START and STOP
codon?
- Do we know if the sequence is sense of anti-sense?
Significance: Let's put the
criteria for significance at 10-10 (remeber: the higher the E-value,
the worse the significance).
Cover the following in
your answer:
- What tool(s) and database will be relevant to use?
- Document the results from the different BLAST searches - what
works and what does not work?
- You can copy in small snippets of the result to document
what you observe.
- In conclusion: What kind of enzyme CLONE12? Gather as much
evidence as possible.
Part 3 - BLAST'ing Genomes
So far we have been using BLAST to search in the big broad databases
that covers at huge set of sequence from a large range of organisms. In
this final part of the exercise we will be doing some more focused
searches in smaller databases by trageting specific genomes.
Typically this will be useful if you have a gene of known function from
one organism (say a cell-cycle controlling gene from Yeast, Saccharomyces
cerevisiae) and want to find the human homolog/ortholog to this
gene (genes that control cell division are often involved in cancer).
When you have been performing the BLAST searches, you have probably
already noticed, that's it possible to search specifically in the Human
and Mouse genomes (these database only
contains sequences from Human/Mouse). It's also possible to restrict
the
output from searches in the large databases (e.g. NR) to specific
organisms.
A growing number of organisms have been fully sequenced, and the
research teams resposible for a large scale genome project typically
put up their
own Web resouces for accessing the data. For example the Yeast genome
is principally hosted in the Saccharomyces
Genome Database
(SGD - www.yeastgenome.org)
- it should be noted that SGD also offers BLAST as a means to search
the database.
For the purpose of this
exercise we will be using the genome resources hosted at the NCBI
(with a short digression to SGD):
http://www.ncbi.nlm.nih.gov/Genomes/
- Let's do a small study of the relationship between the histones found in Yeast and in Human
(evolutionary distance: ~1-1.5 billion years).
Look up the HTA2 gene in SGD
(use the Quick Search box). Notice that a brief description about the
function of the gene and it's protein product is displayed (a huge
amout of additional information can be found further down the page -
much of it Yeast specific).
QUESTION 11: What
information is given about the relationship between this gene and the
gene "HTA1"?
Browse the page and locate the link to the protein sequence - keep the window open, or save the sequence
to a file, we'll need it in a moment.
- Now return to the NCBI Genome page.
Notice that an overview of the organisms for which genomes are
available is
shown in a box to the right (section: "Organism-specific")
- for each organism the information available is shown using a single
letter code ("B" = BLAST). You
can use this to open a BLAST page dedicated to that specific genome
(you can search both DNA and proteins translated from the genes).
Before we start looking in the human genome, let's find out if we can
locate the HTA2 gene in the NCBI version of the Yeast genome.
- Go to the BLAST page for Yeast (Click "B").
- Choose "RefSeq protein" as the database
- Use the HTA2 protein sequence as query.
QUESTION 12:
- How many high-confidence hits do we get?
- Does the hits make sense, from what you have read about HTA2
at the SGD webpage?
- The next step is to search the translated version of the human
genome .
Go to the dedicated BLAST page for Human (click "B").
- Choose "RefSeq protein" as the database.
- Notice: a larger number of databases are offered compared
to Yeast. This is simply due to the fact that the identification of the
genes in the human genome is much more troublesome than in Yeast - and
therefore a number of alternative interpretations of the
genome/proteome is offered. (In Yeast virtually
all protein coding genes has been experimentally verfied).
- Choose "BLASTP" as the method - and start you search for HTA2
homologs.
QUESTION 13:
- How many high-confidence hits are found?
- These protein originates from a number of genes - but how
many UNIQUE genes?
- Hint: Some of the proteins are iso-forms that originates from alternative splicing (one gene
-> multiple iso-forms).
Notice: In the BLAST
alignments - some parts of the sequences are maked in grey - these are low-complexity regions, that BLAST
by default ignores in the comparision.
- As the very last thing today, try to disbale the low-complexity
filter,
and re-run the search (in a new windows/tab):
QUESTION 14: Do we get
longer alignments this time?
- Concluding remarks: Today we have been using BLAST to
find a number of homologues genes (and protein-products). If we want to
go even deeper into the analysis of the homologs, the next logical step
would be to build a dataset of the full-length versions of the
sequences we have found (not just the part found by the local alignment
in BLAST).
A further analysis could consist of a series of pairwise alignments (for finding
out what is similar/different between pairs of sequences) or a multiple alignment which could form
the basis of establishing the evolutionary relationship between the
entire set of seqeunces.
BLAST can also be used as way to build a dataset of sequences base on a
known "seed" sequence. As we saw in the GenBank exercise, free-text
searching in the GenBank can be difficult, and if we for instance
wanted to build a dataset of variants of the insulin gene, an
easiy way to go around this would be to BLAST the normal version of the
insulin against the sequence database of choice, and pick the best
matching hits from here.
|