Day 3: Blast matrices and genome atlases
The blast matrix Perl script makes a
comparison of multiple organisms. For every organism, it calculates how
many proteins are homologous to each organism in the comparison. Including 4
organisms in a single comparison will give 4 x 4 = 16 cells in the matrix.
The numbers that appear in each cell are as follows: the top
number is the percentage of the total set of gene families that is
shared by the two organisms. The first of the numbers below it is the
number of gene families shared by the two, while the second number is
the total number of gene families. I.e., the first is the intersection
of the two sets of gene families, whereas the second is the union of the
two sets. The diagonal of the matrix represents a special case, since
here an organism is compared to itself. Naturally, when aligning a given
gene with itself, you will have a perfect-match alignment, and these hits
are therefore excluded from the diagonal. This leaves the diagonal as a
measure of the number of paralogs (homology within genomes), whereas all
other cells represent the orthologs (homology between genomes).
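The intersection and union arithmetic for an off-diagonal cell can be sketched in Python. This is only a minimal illustration, not the Perl script itself: it assumes each organism has already been reduced to a set of gene-family identifiers (the names genome_a, genome_b and fam1 to fam5 are made up), and it does not model the special handling of the diagonal.

```python
def matrix_cell(families_a, families_b):
    """Return (percent, shared, total) for one off-diagonal cell.

    families_a, families_b: sets of gene-family identifiers (hypothetical
    input; the real script derives families from BLAST hits).
    """
    shared = len(families_a & families_b)  # intersection of the two sets
    total = len(families_a | families_b)   # union of the two sets
    percent = 100.0 * shared / total if total else 0.0
    return percent, shared, total


# Toy data, invented for illustration only.
genome_a = {"fam1", "fam2", "fam3", "fam4"}
genome_b = {"fam2", "fam3", "fam5"}

percent, shared, total = matrix_cell(genome_a, genome_b)
print(f"{percent:.1f}%  {shared} / {total}")  # prints "40.0%  2 / 5"
```

Here two families (fam2 and fam3) occur in both organisms and five distinct families occur overall, so the cell would read 40.0% with 2 / 5 below it.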
The input for the script is an XML
file. We have provided a template for this file, and you shall add only a
section defining which genomes to compare.
Log in to the CBS computers
Find a program which will let you log into the CBS computers.
Computer name: login.cbs.dtu.dk
User name: studXXX
You will get your password from the teachers.
After that you need to log into the computer
where we will do the exercises, which is named sbiology.
# log into CBS
ssh -Y ibiology
setenv MAKEFILES /home/people/pfh/bin/Makefile
- Look at the file and answer the questions below:
From this plot, can you identify genomes which share homology?
Can you find genomes which have a high degree of paralogs (homology
within the genome)?
Can you identify the least related proteomes?
Many of the properties that we have shown you today can be presented in
a genome atlas, a circular map of the genome. You have seen several of
them in the lectures up till now.
We have prepared atlases for many genomes already.
Go to the webpages with the prepared atlases. Select ONE of your
genomes to examine.
Things to look at:
- Is your organism AT or GC rich? Does that
correspond to what you found earlier?
- Where is the replication origin in your
organism? (HINT: look at the distribution of Gs and As).
- Are there any regions in your genome that are
more AT or GC rich than the rest of the genome?
- Can you identify the leading and the lagging
strand of your genome?
- Which strand are the genes on? Are there any
tendencies for the genes to be either on the leading or the lagging
strand, or are they randomly distributed?
- Can you find any regions that might be highly
expressed (HINT: very flexible regions)? Can you tell why this region
could be highly expressed? What happens if this region also melts
easily: would this help or hinder expression?
- Can you find any regions that can mutate easily
(HINT: AT rich regions that can melt easily).
- Can you find any regions that might be
protected against mutation (HINT: rigid regions that won't melt easily).
- Can you find any globally repeated sequences,
either direct or inverted, in this genome? How are these repeats
located in relation to the genes in the genome?
- Are there more local than global repeats, in
- How many rRNA genes does it have? Where are
they located (close or far away from the origin of replication,
randomly distributed or something else?).
- Do any of the rRNAs have tRNAs in them?
- Do the rRNA genes have any special features in
the structural parameters? Are there repeats in this region, is the DNA
especially flexible, or something else?
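
Two of the questions above (AT versus GC richness, and the base distribution around the replication origin) can also be checked numerically. The following Python sketch assumes you have the genome sequence as a plain string of A, C, G and T. The function names and the toy sequences are hypothetical, and windowed GC skew is a commonly used measure that is not necessarily how the atlas computes its base-distribution lanes.

```python
def at_content(seq):
    """Fraction of A and T among the unambiguous bases in seq."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return at / (at + gc) if at + gc else 0.0


def gc_skew(seq, window):
    """(G - C) / (G + C) for consecutive, non-overlapping windows."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


# Toy sequences, invented for illustration only.
print(at_content("ATGC"))      # prints 0.5
print(gc_skew("GGGCGCCC", 4))  # prints [0.5, -0.5]
```

On a real genome, a change of sign in the skew along the sequence is often taken as a hint to where the origin or terminus of replication lies; compare any such position with what you see in the atlas.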