Day 2: Blast, Blast matrices and 16S trees
Blast matrices
IMPORTANT:
This part depends on your
gbk files being in the directory called Data, and that they have names
which will enable you to identify them in a tree. If they are not, go
back to Day and rename the gbk files.
Blast matrix
The blast matrix perl script
performs an
all-against-all BLAST
comparison of multiple organisms. For every organism, it calculates how
many
proteins are homologous to all organism in the comparison. Including 4
organisms
in a single comparison will leave 4 x 4 = 16 cells in the matrix. The
diagonal
of the matrix represent a special case, since this is equal to the
genome
compared to itself. Naturally, when aligning a given gene with itself,
it will
have perfect match alignment, and these hits are therefore excluded
from
the
diagonal. This leaves the diagonal as a measure of the number of
paraloges
(homology within genomes) whereas all other cells represent the
orthologs
(homology between genomes)
The input for the script is an XML
file. We have provided a template for this file, and shall add only a
section defining which genomes to
be compared.
- Log in again if you need to:
# log in to the computers again as, then: ssh -Y <your_username>@login.cbs.dtu.dk ssh -Y sbiology setenv MAKEFILES /home/people/pfh/bin/Makefile
- Make directory to save the blast matix in.
# make directory which the blastmatrix should live in cd ~/ mkdir blastmatrix cd blastmatrix
- Create configuration file. This is done by the
shell script makebm.sh - feel free to
examine
it using less).
The scripts simply finds all the
proteins.fsa
files in your
data directory and builds the configuration file accordingly.
# run script to make configuration file sh ~karinl/scripts/makebm/makebm.sh > blastmatrix.xml # look at the file: less blastmatrix.xml
- Running blast matrix
program
Running the BLAST matrix program scales badly with the number of
proteomes (n2) and
therefore takes a long time to finish for all your genomes. Luckily the
program features a
caching system which enables it to reuse results from searches it has
already
performed. For those of you who are interrested, the program uses MD5 checksums
of the input proteomes
to keep track of the results - this is a very convinient way to keep
track of
changes in large data files! This can take a long time to run, go do
the BLAST exercise below while you wait for ti to finish.
~pfh/scripts/blastmatrix/blastmatrix blastmatrix.xml > blastmatrix.ps
- Transfer this file to your computer:
Open a new Console window. In this window do the following:
scp <yourUserName>@login.cbs.dtu.dk:blastmatrix/blastmatrix.ps . gv blastmatrix.ps
- Look at the file and answer the questions below:
QUESTION:
From
this plot, can you identify genomes, which appear to share homology?
Can you find genomes, which has a high degree of paralogs (homology
within the genome?)
QUESTION
Can
you identify the least related proteomes?
Sequence searching and BLAST
The goal is to
retrieve genetic
sequence data from the NCBI. The Basic Local Alignment Search Tool
(BLAST)
is an essential tool for comparing a DNA or protein sequence to other
sequences in various organisms. Two of the most common uses are to a)
determine the identity of a particular sequence and b) identify closely
related organisms that also contain this particular DNA sequence.
A Slide Show Introduction
Begin by linking to a BLAST for
beginners tutorial that is simple and easy to follow
http://www.digitalworldbiology.com/dwb/Tutorials/Entries/2009/1/26_BLAST_for_Beginners.html.
Let the slide show guide your learning by clicking
on the bright green
arrow to proceed through the pages.
Using BLAST to identify a fake sequence and a
real sequence
Begin by linking to the NCBI
homepage. Select BLAST. With
your new knowledge of Sequence Searching and BLAST, let's begin with a
sequence you make up and then your one of your fasta sequences.
sequence.
- Select nucleotide BLAST
under the basic BLAST category.
- Input your own nucleotides
(A,T,G,C) that fill one complete line into the Search
Box. This is referred to as the query sequence.
- VERY IMPORTANT - Click on the
circle for 'Others (nr etc.) under Choose Search Set
- Select BLAST! at
end of page. A new window appears.
- Wait for the results page to
automatically
launch. The wait time depends on the type of search you are doing and
how many other researchers are using the NCBI website at the same time
you are!
- Did your fake sequence produce a
significant
hit? (probably not since a significant hit is below E-10 usually)If
yes, how many?
- How many sequences did it search
in the
database?
- How many nucleotide letters did
it search in
the database?
- How long (query length) is the sequence that
you used to search the database?
- What is the E-value and bit score of the best hit (in this
case, the first matching sequence)?
- What is the most likely identity
of this
sequence? (click on the blue link to the left of the top hit) What is
the title of the scientific publication that reported this sequence
(click on the PUBMED link)
- Go back twice when you're done.
- Select Home at
the top of the BLAST page.
- Select nucleotide BLAST
under the Basic BLAST category.
- Now enter only the first TWO ROWS
base pairs of the sequence below into the Search box.
- What do you observe about the
E-values? What is
the E-value and score of the best hit (the first matching sequence)?
- Is the identity of the best hit
different from
when you used the complete nucleotide sequence? Is it the same gene as
identified before?
- From the two BLAST searches, what
can you
deduce about how the length of a query sequence affects your confidence
in the sequence search?
- Done!
Predicting 16S rRNA genes
In some
cases the annotations of rRNAs can be inaccurate. Ribosomal RNAs can be
predicted using a program called RNAmmer. This program can be either
used on the web, through the website mentioned above, or through a
web servics script that uses this program via the internet.
- Log in if you need to:
# log in to the computers again as, then: ssh -Y <your_username>@login.cbs.dtu.dk ssh -Y sbiology setenv MAKEFILES /home/people/pfh/bin/Makefile
- Create a directory to store things in
# create directory for holding the rRNA predictions, and go into that directory cd ~/ mkdir rRNA cd ~/rRNA
- Predict rRNAs
# get just the fasta sequence for these genomes and predict the rRNAs foreach i (../data/*refseq.gbk) gmake $i:r.fsa perl ~karinl/scripts/rnammer/rnammer.pl bac ssu < $i:r.fsa > $i.rrna.fsa cp $i.rrna.fsa . end
- Have a look at one of the files.
# look at one of the files less <filename>.rrna.fsa
You should have a set of sequences in FASTA format in your file, each
approximately 1500 nt long.
- Count the number of rRNAs
# count the number of rRNAs within each of these foreach i (*.rrna.fsa) echo $i grep -c ">" $i end
Making a 16S tree
The 16S rRNA phylogenetic tree is one of the
most commonly
used ways of elucidating the relationships between organisms. This gene
is one of three rRNA genes which together with around 50? proteins form
the ribosome, the molecular machine responsible for translating mRNAs
into proteins. The 16S gene contains around 1500 nucleotides in
prokaryotes (~1800 nt in eukaryotes; 18S), and is thus larger
than the around 120 nucletotide 5S, and smaller than the almost 3000 nt
long 23S gene. This means that the gene is long enough to contain both
variable regions - which enables us to discern between closely related
genomes, and conserved regions - which enable us to evaluate more long
range relationships. It should be noted that the difference between 16S
sequences should preferrably be above 3% to ensure a valid phylogenetic
signal. That is, the 16S rRNA gene sequences can only give good
discriminatory resolution for sequences which vary with more than 3%.
In order to make 16S trees you will need 16S sequences, an alignment
program to make an alignment of your 16S sequences, a program that can
make trees from the alignments, and a program to view the trees with.
We will be using the program Jalview for this part of the course, since
it can do all of this, from making an alignment to viewing the finished
tree.
- Getting
your
sequences
First you will need to get to gather together one 16S sequence from
each of your genomes.
This is done by opening each of the rRNA files you created above,
finding the first 16S sequence, and copy-pasting it into a text
document. Save this file as
<sequenceGroupName>.fasta.
Use less to look at your rRNA files, as you saw yesterday. The commands
to use less are shown below again
SPACE : one screenful forward b : one screenful backward RETURN : one line forward k : one line backward g : top (beginning) of file G : bottom (end) of file q : leave the session
Once you
have copied your 16S sequences into your document, add the name of the
genome in the line that starts wiith ">".
- Open
Jalview
Open the Jalview
website in a new web browser window. Go to Download,
press Start with Webstart.
There might appear a window which asks you to allow you to open this
website.Click Allow.
A window will appear on your screen, allow it to open. It will next
load some examples. Wait until this has completed. Then, close the
examples windows, but not the program window itself.
- Load
fasta
file
Go to File – Input Alignment - From File, and select the fasta
file you created earlier.
- Look
at the
fasta
sequences
Look at the fasta sequences. Experiment with having various color
options turned on. Which ones make sense for nucleotide sequences?
- Align
the
sequences
Go to Web Services - MAFFT. A new window will appear,
telling you that the alignment is being processed. Wait until it is
done - the aligned sequences will appear in a new window with an
additional ??something in the title of the window, telling you that
this is an alignment of your previous sequences.
Go to Save, and save
this new alignment under the name <sequenceGroupName_aligned>,fasta
Use the same colors for this new alignment as for the original
unaligned sequences. Can you see any differences between the two?
Close the original file, so as to avoid confusion.
- Create
tree
Go to Calculate - Neighbor Joining using % Identity.
You will get a new window displaying your tree. Save this tree by going
to File in that window,
and clicking Save As - PNG.
Question:
Does your tree make biological sense? How does this tree compare to the
blast matrix that you found above? If you were to sort the genomes by
what you saw in the blast matrix, would this sorting be different than
what you see in the 16S tree?
There are very many methods and various
programs for making
phylogenetic trees. Which program you should use, and which method to
choose, depends to a large extent on your data. The method presented
above, Neighbor Joining, is in fact one of the methods that are not
that good for this kind of analysis, but it is very often used anyway
due to it being very fast.
|