Day 3: Blast and Blast Matrices
Today you will be making a blast matrix, and you will also do some
blast searches at NCBI. You will first start the blast matrix script,
and then go on to doing the blast search. Running the blastmatrix
scripts can take a lot of time, so that is why we are doing it this way.
You will be both looking at files and editing files today. This is why
you need to get a a program which can transfer images from the CBS
computers to your computer, and which will let you interact with it.
Those of you who have a CBS laptop have this installed already, but
those of you who are working on your own computer need to install it.
Asli has kindly created a how-to guide for this that you can access here:
How to get X up and working
Once you have got X working you can start to look at files, and to edit
files. There are two programs that you will use to do this with.
this is a program that will let you edit a text file and save it. You
can use the mouse to click inside of the window and to interact with
this is a program which will let you look at a postscript file.
Postscript is actually a language for describing graphics which was
developed for getting pictures out on a printer. Postscript files are
acutally textfiles which you can look at and - strictly speaking - edit
(not recommended unless you really know what you are doing). Ghostview
takes the postscript and interprets how it would look on a page and
displays it on your screen.
The blast matrix perl script
comparison of multiple organisms. For every organism, it calculates how
proteins are homologous to all organism in the comparison. Including 4
in a single comparison will leave 4 x 4 = 16 cells in the matrix.
The numbers that appear in each square are as follows: the first
number is the percent of proteins in the total set of gene families
that are identical. The first of the numbers below is the number of
gene families that are identical in the two, while the second number is
the total number of gene families. I.e., the first is the intersection
of the two sets of gene familes, whereas the second is the union of the
of the matrix represent a special case, since this is equal to the
compared to itself. Naturally, when aligning a given gene with itself,
have perfect match alignment, and these hits are therefore excluded
diagonal. This leaves the diagonal as a measure of the number of
(homology within genomes) whereas all other cells represent the
(homology between genomes)
The input for the script is an XML
file. We have provided a template for this file, and shall add only a
section defining which genomes to
Log in to the CBS computers
Find a program which will let you log into the
Computer name: login.cbs.dtu.dk
User name: studXXX
You will get your password from the teachers.
After that you need to log into the computer
where we will do the exercises, which is named sbiology.
# log into CBS
ssh -Y sbiology
setenv MAKEFILES /home/people/pfh/bin/Makefile
a mistake made last time.
Last time we did something that needs to be fixed this time around.
Prodigal does not predict proteins, it predicts orfs, that is, the
sequence in the files are DNA, not protein sequences. If we use this
for making a blast matrix, this will cause trouble.
PS: IMPORTANT: the rest of this exercise depends on the proteins being
stored in a directory called data/prodigal. If they are not there, I
recommend that you go back and do that part from Exercise 2.
You are free to have data in other places, but then you will
have to change things below as needed.
# convert Prodigal predictions into protein
Can you tell what the program 6pack is
likely to be doing?
foreach i (*.proteins.fsa)
mv $i $i:r.orfs.fsa
cat $i:r.orfs.fsa | 6pack -1 > $i
directory to save the blast matix in.
# make directory which the blastmatrix should live in
- Look at the file and answer the questions below:
this plot, can you identify genomes which share homology?
Can you find genomes, which has a high degree of paralogs (homology
within the genome?)
you identify the least related proteomes?
Sequence searching and BLAST
The goal is to
sequence data from the NCBI. The Basic Local Alignment Search Tool
is an essential tool for comparing a DNA or protein sequence to other
sequences in various organisms. Two of the most common uses are to a)
determine the identity of a particular sequence and b) identify closely
related organisms that also contain this particular DNA sequence.
A Slide Show Introduction
Begin by linking to a BLAST for
beginners tutorial that is simple and easy to follow
Let the slide show guide your learning by clicking
on the bright green
arrow to proceed through the pages.
Using BLAST to identify a fake sequence and a
Begin by linking to the NCBI
homepage. Select BLAST. With
your new knowledge of Sequence Searching and BLAST, let's begin with a
sequence you make up and then your one of your fasta sequences.
- Select nucleotide BLAST
under the basic BLAST category.
- Input your own nucleotides
(A,T,G,C) that fill one complete line into the Search
Box. This is referred to as the query sequence.
- VERY IMPORTANT - Click on the
circle for 'Others (nr etc.) under Choose Search Set
- Select BLAST! at
end of page. A new window appears.
- Wait for the results page to
launch. The wait time depends on the type of search you are doing and
how many other researchers are using the NCBI website at the same time
- Did your fake sequence produce a
hit? (probably not since a significant hit is below E-10 usually)If
yes, how many?
- How many sequences did it search
- How many nucleotide letters did
it search in
- How long (query length) is the sequence that
you used to search the database?
- What is the E-value and bit score of the best hit (in this
case, the first matching sequence)?
- What is the most likely identity
sequence? (click on the blue link to the left of the top hit) What is
the title of the scientific publication that reported this sequence
(click on the PUBMED link)
- Go back twice when you're done.
- Select Home at
the top of the BLAST page.
- Select nucleotide BLAST
under the Basic BLAST category.
- Now enter only the first TWO ROWS
base pairs of the sequence below into the Search box.
What do you observe about the
E-values? What is
the E-value and score of the best hit (the first matching sequence)?
Is the identity of the best hit
when you used the complete nucleotide sequence? Is it the same gene as
From the two BLAST searches, what
deduce about how the length of a query sequence affects your confidence
in the sequence search?