Events News Research CBS CBS Publications Bioinformatics
Staff Contact About Internal CBS CBS Other
CBS >> CBS Courses >> Scientific Communication of Comparative Genomics >> Course Programme >> Day 3

Day 3: Blast and Blast Matrices

Today's program

Today you will be making a blast matrix, and you will also do some blast searches at NCBI. You will first start the blast matrix script, and then go on to doing the blast search. Running the blastmatrix scripts can take a lot of time, so that is why we are doing it this way.

X client

You will be both looking at files and editing files today. This is why you need to get a a program which can transfer images from the CBS computers to your computer, and which will let you interact with it. Those of you who have a CBS laptop have this installed already, but those of you who are working on your own computer need to install it.  Asli has kindly created a how-to guide for this that you can access here:

How to get X up and working


Once you have got X working you can start to look at files, and to edit files. There are two programs that you will use to do this with.
  • nedit: this is a program that will let you edit a text file and save it. You can use the mouse to click inside of the window and to interact with the menues. 
  • ghostview: this is a program which will let you look at a postscript file. Postscript is actually a language for describing graphics which was developed for getting pictures out on a printer. Postscript files are acutally textfiles which you can look at and - strictly speaking - edit (not recommended unless you really know what you are doing). Ghostview takes the postscript and interprets how it would look on a page and displays it on your screen. 

Blast matrices

The blast matrix perl script performs an all-against-all BLAST comparison of multiple organisms. For every organism, it calculates how many proteins are homologous to all organism in the comparison. Including 4 organisms in a single comparison will leave 4 x 4 = 16 cells in the matrix.  The numbers that appear in each square are as follows: the first number is the percent of proteins in the total set of gene families that are identical. The first of the numbers below is the number of gene families that are identical in the two, while the second number is the total number of gene families. I.e., the first is the intersection of the two sets of gene familes, whereas the second is the union of the two. The diagonal of the matrix represent a special case, since this is equal to the genome compared to itself. Naturally, when aligning a given gene with itself, it will have perfect match alignment, and these hits are therefore excluded from the diagonal. This leaves the diagonal as a measure of the number of paraloges (homology within genomes) whereas all other cells represent the orthologs (homology between genomes)

The input for the script is an XML file. We have provided a template for this file, and shall add only a section defining which genomes to be compared.
  • Log in to the CBS computers

    Find a program which will let you log into the CBS computers. 

    Computer name:
    User name: studXXX

    You will get your password from the teachers.

    After that you need to log into the computer where we will do the exercises, which is named sbiology.

    # log into CBS
    ssh -Y sbiology
    setenv MAKEFILES /home/people/pfh/bin/Makefile
    umask 022
  • Fix a mistake made last time.

    Last time we did something that needs to be fixed this time around. Prodigal does not predict proteins, it predicts orfs, that is, the sequence in the files are DNA, not protein sequences. If we use this for making a blast matrix, this will cause trouble.

    PS: IMPORTANT: the rest of this exercise depends on the proteins being stored in a directory called data/prodigal. If they are not there, I recommend that you go back and do that part from Exercise 2.  You are free to have data in other places, but then you will have to change things below as needed.

    # convert Prodigal predictions into protein
    cd ~/data/prodigal
    foreach i (*.proteins.fsa)
    mv $i $i:r.orfs.fsa
    cat $i:r.orfs.fsa | 6pack -1 > $i
    QUESTION: Can you tell what the program 6pack is likely to be doing?
  • Make directory to save the blast matix in.
    # make directory which the blastmatrix should live in
    cd ~/
    mkdir blastmatrix
    cd blastmatrix
  • Create configuration file.

    You create the configuration file by running the shell script - feel free to examine it using less).

    The scripts simply finds all the proteins.fsa files in your data directory and builds the configuration file accordingly.

    if prodigal files are not running fine, replace prodigal with refseq in the script below.
    # run script to make configuration file
    sh ~karinl/scripts/makebm/ ../data/prodigal> blastmatrix.xml
    # look at the file:
    less blastmatrix.xml
  • Edit the file

    You will now use the program nedit to order the genomes the way you want them. You do this by copy-pasting a section starting and end with the words <source> and </source> - both the 'sources' need to be included in the copy-paste.

    # edit your file 
    nedit blastmatrix.xml
    # use the menu system to save and close the window.
  • Running blast matrix program Running the BLAST matrix program scales badly with the number of proteomes (n2) and therefore takes a long time to finish for all your genomes. Luckily the program features a caching system which enables it to reuse results from searches it has already performed. For those of you who are interrested, the program uses MD5 checksums of the input proteomes to keep track of the results - this is a very convinient way to keep track of changes in large data files! This can take a long time to run, go do the BLAST exercise below while you wait for it to finish.
    ~pfh/scripts/blastmatrix/blastmatrix blastmatrix.xml >
  • Look at the results

    Use the ghostview program to look at the file.

    # view the matrix
  • Look at the file and answer the questions below:

    QUESTION: From this plot, can you identify genomes which share homology? Can you find genomes, which has a high degree of paralogs (homology within the genome?)
    QUESTION: Can you identify the least related proteomes?

Sequence searching and BLAST

The goal is to retrieve genetic sequence data from the NCBI. The Basic Local Alignment Search Tool (BLAST) is an essential tool for comparing a DNA or protein sequence to other sequences in various organisms. Two of the most common uses are to a) determine the identity of a particular sequence and b) identify closely related organisms that also contain this particular DNA sequence.

A Slide Show Introduction

Begin by linking to a BLAST for beginners tutorial that is simple and easy to follow

Let the slide show guide your learning by clicking on the bright green arrow to proceed through the pages.

Using BLAST to identify a fake sequence and a real sequence

Begin by linking to the NCBI homepage. Select BLAST. With your new knowledge of Sequence Searching and BLAST, let's begin with a sequence you make up and then your one of your fasta sequences. sequence.

  • Select nucleotide BLAST under the basic BLAST category.
  • Input your own nucleotides (A,T,G,C) that fill one complete line into the Search Box. This is referred to as the query sequence.
  • VERY IMPORTANT - Click on the circle for 'Others (nr etc.) under Choose Search Set
  • Select BLAST! at end of page. A new window appears.
  • Wait for the results page to automatically launch. The wait time depends on the type of search you are doing and how many other researchers are using the NCBI website at the same time you are!
  1. Did your fake sequence produce a significant hit? (probably not since a significant hit is below E-10 usually)If yes, how many?
  2. How many sequences did it search in the database?
  3. How many nucleotide letters did it search in the database?
  • Select Home at the top of the BLAST page
  • Select nucleotide BLAST under the Basic BLAST category
  • Enter your sequence below into the Search box. Here everyone will BLAST the same sequence provided to you below.
  • Select BLAST! A new window appears
  • Select Format! and you will have to wait for the results page to appear.
  1. How long (query length) is the sequence that you used to search the database?
  2. What is the E-value and bit score of the best hit (in this case, the first matching sequence)?
  3. What is the most likely identity of this sequence? (click on the blue link to the left of the top hit) What is the title of the scientific publication that reported this sequence (click on the PUBMED link)
  • Go back twice when you're done.
  • Select Home at the top of the BLAST page.
  • Select nucleotide BLAST under the Basic BLAST category.
  • Now enter only the first TWO ROWS base pairs of the sequence below into the Search box.

  1. What do you observe about the E-values? What is the E-value and score of the best hit (the first matching sequence)?
  2. Is the identity of the best hit different from when you used the complete nucleotide sequence? Is it the same gene as identified before?
  3. From the two BLAST searches, what can you deduce about how the length of a query sequence affects your confidence in the sequence search?