Day 3: Blast matrices and genome
atlases
Blast matrices
The blast matrix perl script
performs an
all-against-all BLAST
comparison of multiple organisms. For every organism, it calculates how
many
proteins are homologous to all organism in the comparison. Including 4
organisms
in a single comparison will leave 4 x 4 = 16 cells in the matrix.
The numbers that appear in each square are as follows: the
first
number is the percent of proteins in the total set of gene families
that are identical. The first of the numbers below is the number of
gene families that are identical in the two, while the second number is
the total number of gene families. I.e., the first is the intersection
of the two sets of gene familes, whereas the second is the union of the
two. The
diagonal
of the matrix represent a special case, since this is equal to the
genome
compared to itself. Naturally, when aligning a given gene with itself,
it will
have perfect match alignment, and these hits are therefore excluded
from
the
diagonal. This leaves the diagonal as a measure of the number of
paraloges
(homology within genomes) whereas all other cells represent the
orthologs
(homology between genomes)
The input for the script is an XML
file. We have provided a template for this file, and shall add only a
section defining which genomes to
be compared.
-
Log in to the CBS computers
Find a program which will let you log into the
CBS computers.
Computer name: login.cbs.dtu.dk
User name: studXXX
You will get your password from the teachers.
After that you need to log into the computer
where we will do the exercises, which is named sbiology.
# log into CBS ssh -Y ibiology setenv MAKEFILES /home/people/pfh/bin/Makefile umask 022
- Look at the file and answer the questions below:
QUESTION:
From
this plot, can you identify genomes which share homology?
Can you find genomes, which has a high degree of paralogs (homology
within the genome?)
QUESTION: Can
you identify the least related proteomes?
Atlas
Many of the properties that we have shown you today can be presented in
a genome atlas, a circular map of the genome. You have seen several of
them in the lectures up till now.
We have prepared atlases for many genomes already.
Access
the webpages with the prepared atlases. Select ONE of your
genomes to examine.
Things to look at:
BASE Atlas
- Is your organism AT or GC rich? Does that
correspond to what you found earlier?
- Where is the replication origin in your
organism? (HINT: look at the distribution of Gs and As).
- Are there any regions in your genome that are
more AT or GC rich than the rest of the genome?
- Can you identify the leading and the lagging
strand of your genome?
- Which strand are the genes on? Is there any
tendencies for the genes to be either on the leading or the lagging
strand, or are they randomly distributed?
STRUCTURE
Atlas
- Can you find any regions that might be highly
expressed (HINT: very flexible regions). Can you tell why this region
could be higly expressed? What happens if this region also easily melts
- would this help or hinder expression?
- Can you find any regions that can mutate
easily
(HINT: AT rich regions that can melt easily).
- Can you find any regions that might be
protected against mutation (HINT: rigid regions that won't melt easily).
REPEAT Atlas
- Can you find any globally repeated sequences,
either direct or inverted, in this genome? How are these repeats
located in relation to the genes in the genome?
- Are there more local than global repeats, in
your opinion?
GENOME Atlas
- How many rRNA genes does it have? Where are
they located (close or far away from the origin of replication,
randomly distributed or something else?).
- Do any of the rRNAs have tRNAs in them?
- Do the rRNA genes have any special features
in
the structural parameters? Are there repeats in this region, is the DNA
especially flexible, or something else?
|