Events News Research CBS CBS Publications Bioinformatics
Staff Contact About Internal CBS CBS Other
CBS >> CBS Courses >> Brasil Workshop 2009 >> Course Programme >> Day 2

Day 2: Blast, Blast matrices and 16S trees

Blast matrices

IMPORTANT: This part depends on your gbk files being in the directory called Data, and that they have names which will enable you to identify them in a tree. If they are not, go back to Day and rename the gbk files.
Blast matrix

The blast matrix perl script performs an all-against-all BLAST comparison of multiple organisms. For every organism, it calculates how many proteins are homologous to all organism in the comparison. Including 4 organisms in a single comparison will leave 4 x 4 = 16 cells in the matrix. The diagonal of the matrix represent a special case, since this is equal to the genome compared to itself. Naturally, when aligning a given gene with itself, it will have perfect match alignment, and these hits are therefore excluded from the diagonal. This leaves the diagonal as a measure of the number of paraloges (homology within genomes) whereas all other cells represent the orthologs (homology between genomes)

The input for the script is an XML file. We have provided a template for this file, and shall add only a section defining which genomes to be compared.
  1. Log in again if you need to:
    # log in to the computers again as, then:
    ssh -Y <your_username>
    ssh -Y sbiology
    setenv MAKEFILES /home/people/pfh/bin/Makefile
  2. Make directory to save the blast matix in.
    # make directory which the blastmatrix should live in
    cd ~/
    mkdir blastmatrix
    cd blastmatrix
  3. Create configuration file. This is done by the shell script - feel free to examine it using less).

    The scripts simply finds all the proteins.fsa files in your data directory and builds the configuration file accordingly.

    # run script to make configuration file
    sh ~karinl/scripts/makebm/ > blastmatrix.xml
    # look at the file:
    less blastmatrix.xml
  4. Running blast matrix program Running the BLAST matrix program scales badly with the number of proteomes (n2) and therefore takes a long time to finish for all your genomes. Luckily the program features a caching system which enables it to reuse results from searches it has already performed. For those of you who are interrested, the program uses MD5 checksums of the input proteomes to keep track of the results - this is a very convinient way to keep track of changes in large data files! This can take a long time to run, go do the BLAST exercise below while you wait for ti to finish.
    ~pfh/scripts/blastmatrix/blastmatrix blastmatrix.xml >
  5. Transfer this file to your computer:

    Open a new Console window. In this window do the following:

    scp <yourUserName> .
  6. Look at the file and answer the questions below:

    QUESTION: From this plot, can you identify genomes, which appear to share homology? Can you find genomes, which has a high degree of paralogs (homology within the genome?)
    QUESTION Can you identify the least related proteomes?

Sequence searching and BLAST

The goal is to retrieve genetic sequence data from the NCBI. The Basic Local Alignment Search Tool (BLAST) is an essential tool for comparing a DNA or protein sequence to other sequences in various organisms. Two of the most common uses are to a) determine the identity of a particular sequence and b) identify closely related organisms that also contain this particular DNA sequence.

A Slide Show Introduction

Begin by linking to a BLAST for beginners tutorial that is simple and easy to follow

Let the slide show guide your learning by clicking on the bright green arrow to proceed through the pages.

Using BLAST to identify a fake sequence and a real sequence

Begin by linking to the NCBI homepage. Select BLAST. With your new knowledge of Sequence Searching and BLAST, let's begin with a sequence you make up and then your one of your fasta sequences. sequence.

  • Select nucleotide BLAST under the basic BLAST category.
  • Input your own nucleotides (A,T,G,C) that fill one complete line into the Search Box. This is referred to as the query sequence.
  • VERY IMPORTANT - Click on the circle for 'Others (nr etc.) under Choose Search Set
  • Select BLAST! at end of page. A new window appears.
  • Wait for the results page to automatically launch. The wait time depends on the type of search you are doing and how many other researchers are using the NCBI website at the same time you are!
  1. Did your fake sequence produce a significant hit? (probably not since a significant hit is below E-10 usually)If yes, how many?
  2. How many sequences did it search in the database?
  3. How many nucleotide letters did it search in the database?
    • Select Home at the top of the BLAST page
    • Select nucleotide BLAST under the Basic BLAST category
    • Enter your sequence below into the Search box. Here everyone will BLAST the same sequence provided to you below.

    • Select BLAST! A new window appears
    • Select Format! and you will have to wait for the results page to appear.

  4. How long (query length) is the sequence that you used to search the database?
  5. What is the E-value and bit score of the best hit (in this case, the first matching sequence)?
  6. What is the most likely identity of this sequence? (click on the blue link to the left of the top hit) What is the title of the scientific publication that reported this sequence (click on the PUBMED link)
    • Go back twice when you're done.
    • Select Home at the top of the BLAST page.
    • Select nucleotide BLAST under the Basic BLAST category.
    • Now enter only the first TWO ROWS base pairs of the sequence below into the Search box.

  7. What do you observe about the E-values? What is the E-value and score of the best hit (the first matching sequence)?
  8. Is the identity of the best hit different from when you used the complete nucleotide sequence? Is it the same gene as identified before?
  9. From the two BLAST searches, what can you deduce about how the length of a query sequence affects your confidence in the sequence search?

  10.  Done!

Predicting 16S rRNA genes

In some cases the annotations of rRNAs can be inaccurate. Ribosomal RNAs can be predicted using a program called RNAmmer. This program can be either used on the web, through the website mentioned above, or through a web servics script that uses this program via the internet.
  1. Log in if you need to:

    # log in to the computers again as, then:
    ssh -Y <your_username>
    ssh -Y sbiology
    setenv MAKEFILES /home/people/pfh/bin/Makefile
  2. Create a directory to store things in
    # create directory for holding the rRNA predictions, and go into that directory
    cd ~/
    mkdir rRNA
    cd ~/rRNA
  3. Predict rRNAs
    # get just the fasta sequence for these genomes and predict the rRNAs
    foreach i (../data/*refseq.gbk)
    gmake $i:r.fsa
    perl ~karinl/scripts/rnammer/ bac ssu < $i:r.fsa > $i.rrna.fsa
    cp $i.rrna.fsa .
  4. Have a look at one of the files.
    # look at one of the files
    less <filename>.rrna.fsa 

    You should have a set of sequences in FASTA format in your file, each approximately 1500 nt long.

  5. Count the number of rRNAs

    # count the number of rRNAs within each of these
    foreach i (*.rrna.fsa)
    echo $i
    grep -c ">" $i

Making a 16S tree

The 16S rRNA phylogenetic tree is one of the most commonly used ways of elucidating the relationships between organisms. This gene is one of three rRNA genes which together with around 50? proteins form the ribosome, the molecular machine responsible for translating mRNAs into proteins. The 16S gene contains around 1500 nucleotides in prokaryotes (~1800 nt  in eukaryotes; 18S), and is thus larger than the around 120 nucletotide 5S, and smaller than the almost 3000 nt long 23S gene. This means that the gene is long enough to contain both variable regions - which enables us to discern between closely related genomes, and conserved regions - which enable us to evaluate more long range relationships. It should be noted that the difference between 16S sequences should preferrably be above 3% to ensure a valid phylogenetic signal. That is, the 16S rRNA gene sequences can only give good discriminatory resolution for sequences which vary with more than 3%.

In order to make 16S trees you will need 16S sequences, an alignment program to make an alignment of your 16S sequences, a program that can make trees from the alignments, and a program to view the trees with. We will be using the program Jalview for this part of the course, since it can do all of this, from making an alignment to viewing the finished tree.
  1. Getting your sequences

    First you will need to get to gather together one 16S sequence from each of your genomes.
    This is done by opening each of the rRNA files you created above, finding the first 16S sequence, and copy-pasting it into a text document. Save this file as <sequenceGroupName>.fasta.
    Use less to look at your rRNA files, as you saw yesterday. The commands to use less are shown below again
    SPACE : one screenful forward 
    b : one screenful backward
    RETURN : one line forward
    k : one line backward
    g : top (beginning) of file
    G : bottom (end) of file
    q : leave the session

    Once you have copied your 16S sequences into your document, add the name of the genome in the line that starts wiith ">".

  2. Open Jalview

    Open the Jalview website in a new web browser window. Go to Download, press Start with Webstart. There might appear a window which asks you to allow you to open this website.Click Allow

    A window will appear on your screen, allow it to open. It will next load some examples. Wait until this has completed. Then, close the examples windows, but not the program window itself.

  3. Load fasta file

    Go to File Input Alignment  - From File, and select the fasta file you created earlier. 

  4. Look at the fasta sequences

    Look at the fasta sequences. Experiment with having various color options turned on. Which ones make sense for nucleotide sequences?
  5. Align the sequences

    Go to Web Services - MAFFT. A new window will appear, telling you that the alignment is being processed. Wait until it is done - the aligned sequences will appear in a new window with an additional ??something in the title of the window, telling you that this is an alignment of your previous sequences.

    Go to Save, and save this new alignment under the name <sequenceGroupName_aligned>,fasta

    Use the same colors for this new alignment as for the original unaligned sequences. Can you see any differences between the two?

    Close the original file, so as to avoid confusion.
  6. Create tree

    Go to Calculate - Neighbor Joining using % Identity. You will get a new window displaying your tree. Save this tree by going to File in that window, and clicking Save As - PNG.

    Question: Does your tree make biological sense? How does this tree compare to the blast matrix that you found above? If you were to sort the genomes by what you saw in the blast matrix, would this sorting be different than what you see in the 16S tree?
There are very many methods and various programs for making phylogenetic trees. Which program you should use, and which method to choose, depends to a large extent on your data. The method presented above, Neighbor Joining, is in fact one of the methods that are not that good for this kind of analysis, but it is very often used anyway due to it being very fast.