Events News Research CBS CBS Publications Bioinformatics
Staff Contact About Internal CBS CBS Other
CBS >> CBS Courses >> Comparative Microbial Genomics Analysis >> Day 3

Day 3: Blast matrices and genome atlases

Blast matrices

The blast matrix perl script performs an all-against-all BLAST comparison of multiple organisms. For every organism, it calculates how many proteins are homologous to all organism in the comparison. Including 4 organisms in a single comparison will leave 4 x 4 = 16 cells in the matrix.  The numbers that appear in each square are as follows: the first number is the percent of proteins in the total set of gene families that are identical. The first of the numbers below is the number of gene families that are identical in the two, while the second number is the total number of gene families. I.e., the first is the intersection of the two sets of gene familes, whereas the second is the union of the two. The diagonal of the matrix represent a special case, since this is equal to the genome compared to itself. Naturally, when aligning a given gene with itself, it will have perfect match alignment, and these hits are therefore excluded from the diagonal. This leaves the diagonal as a measure of the number of paraloges (homology within genomes) whereas all other cells represent the orthologs (homology between genomes)

The input for the script is an XML file. We have provided a template for this file, and shall add only a section defining which genomes to be compared.
  • Log in to the CBS computers

    Find a program which will let you log into the CBS computers. 

    Computer name:
    User name: studXXX

    You will get your password from the teachers.

    After that you need to log into the computer where we will do the exercises, which is named sbiology.

    # log into CBS
    ssh -Y ibiology
    setenv MAKEFILES /home/people/pfh/bin/Makefile
    umask 022

  • Make directory to save the blast matix in.
    # make directory which the blastmatrix should live in
    cd ~/
    mkdir blastmatrix
    cd blastmatrix
  • Create configuration file.

    You create the configuration file by running the shell script - feel free to examine it using less).

    The scripts simply finds all the proteins.fsa files in your data directory and builds the configuration file accordingly.

    if prodigal files are not running fine, replace prodigal with refseq in the script below.
    # run script to make configuration file
    sh ~karinl/scripts/makebm/ ../data/prodigal> blastmatrix.xml
    # look at the file:
    less blastmatrix.xml
  • Edit the file

    You will now use the program nedit to order the genomes the way you want them. You do this by copy-pasting a section starting and end with the words <source> and </source> - both the 'sources' need to be included in the copy-paste.

    # edit your file 
    nedit blastmatrix.xml
    # use the menu system to save and close the window.
  • Running blast matrix program Running the BLAST matrix program scales badly with the number of proteomes (n2) and therefore takes a long time to finish for all your genomes. Luckily the program features a caching system which enables it to reuse results from searches it has already performed. For those of you who are interrested, the program uses MD5 checksums of the input proteomes to keep track of the results - this is a very convinient way to keep track of changes in large data files! This can take a long time to run, go do the BLAST exercise below while you wait for it to finish.
    ~pfh/scripts/blastmatrix/blastmatrix blastmatrix.xml >
  • Look at the results

    Use the ghostview program to look at the file.

    # view the matrix
  • Look at the file and answer the questions below:

    QUESTION: From this plot, can you identify genomes which share homology? Can you find genomes, which has a high degree of paralogs (homology within the genome?)
    QUESTION: Can you identify the least related proteomes?


Many of the properties that we have shown you today can be presented in a genome atlas, a circular map of the genome. You have seen several of them in the lectures up till now.

We have prepared atlases for many genomes already.

Access the webpages with the prepared atlases. Select ONE of your genomes to examine. 

Things to look at:

BASE Atlas
  • Is your organism AT or GC rich? Does that correspond to what you found earlier?
  • Where is the replication origin in your organism? (HINT: look at the distribution of Gs and As).
  • Are there any regions in your genome that are more AT or GC rich than the rest of the genome?
  • Can you identify the leading and the lagging strand of your genome?
  • Which strand are the genes on? Is there any tendencies for the genes to be either on the leading or the lagging strand, or are they randomly distributed?
  • Can you find any regions that might be highly expressed (HINT: very flexible regions). Can you tell why this region could be higly expressed? What happens if this region also easily melts - would this help or hinder expression?
  • Can you find any regions that can mutate easily (HINT: AT rich regions that can melt easily).
  • Can you find any regions that might be protected against mutation (HINT: rigid regions that won't melt easily).

  • Can you find any globally repeated sequences, either direct or inverted, in this genome? How are these repeats located in relation to the genes in the genome?
  • Are there more local than global repeats, in your opinion?

  • How many rRNA genes does it have? Where are they located (close or far away from the origin of replication, randomly distributed or something else?).
  • Do any of the rRNAs have tRNAs in them?
  • Do the rRNA genes have any special features in the structural parameters? Are there repeats in this region, is the DNA especially flexible, or something else?