Day 3: Pan and core genome plots, EMBOSS and Blast atlases
Pan and core genome plots
Pan and core genome plots are graphs that display to what extent gene
families are conserved within a set of genomes. Conservation is
evaluated by first BLASTing the proteomes of the genomes against each
other. This is done in a fixed order: every proteome is searched with
BLAST against all previous proteomes in the input list.
The result is a set of numbers for each proteome, in the order of the
input list, showing:
- Number of new genes
- Number of new families
- Size of core genome
- Size of pan genome
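The quantities listed above can be illustrated with a minimal sketch. It assumes that each proteome has already been reduced to a set of gene family identifiers; the names `pan_core_table` and `famA` ... `famE` are hypothetical and are not taken from the course script. The sketch tracks families only, so the number of new genes (which can exceed the number of new families) is left out.

```python
def pan_core_table(family_sets):
    """Return one row per genome: (new_families, core_size, pan_size).

    family_sets is a list of sets of gene family identifiers,
    given in the same order as the input list of proteomes.
    """
    rows = []
    pan = set()    # every family seen so far
    core = None    # families present in every genome so far
    for families in family_sets:
        new = families - pan
        pan |= families
        core = set(families) if core is None else core & families
        rows.append((len(new), len(core), len(pan)))
    return rows


# Toy input: three genomes described by made-up family identifiers.
example = [
    {"famA", "famB", "famC"},
    {"famA", "famB", "famD"},
    {"famA", "famD", "famE"},
]

for number, row in enumerate(pan_core_table(example), start=1):
    print(number, *row)
```

With each added genome the pan genome can only grow or stay the same, while the core genome can only shrink or stay the same, which is the general shape to expect in the plot.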
Two genes are considered to belong to the same gene family if the two
are more than 50% identical over more than 50% of the length of the
longest of the two genes.
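The family rule above can be written as a small test. This is only a sketch of the stated 50%/50% criterion: the function name `same_family` and its arguments are hypothetical, and it assumes that "over more than 50% of the length" is measured as the alignment length relative to the longer gene. The course script may compute coverage differently.

```python
def same_family(percent_identity, alignment_length, length_a, length_b):
    """Apply the 50% identity / 50% length rule to one pairwise BLAST hit.

    percent_identity: identity of the alignment, in percent (0-100)
    alignment_length: number of aligned positions (assumed coverage measure)
    length_a, length_b: lengths of the two genes
    """
    longest = max(length_a, length_b)
    return percent_identity > 50.0 and alignment_length > 0.5 * longest


# A 180-residue alignment at 62% identity between genes of length 300 and 320
print(same_family(62.0, 180, 300, 320))
# The same identity, but the alignment covers too little of the longer gene
print(same_family(62.0, 150, 300, 320))
```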
We have prepared a script which produces such a pan and core genome
plot from a list of proteomes.
The script will accept a number of proteomes
(pr1, pr2, .. prN) and perform a BLAST search of each proteome against
all the previous ones:
- pr2 against pr1
- pr3 against pr1+pr2
- pr4 against pr1+pr2+pr3
- ...
- prN against pr1+pr2+pr3 ... pr[N-1]
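The search order listed above can be generated programmatically. The sketch below only builds the list of (query, database) pairs; it does not run BLAST, and the function name `search_schedule` is hypothetical.

```python
def search_schedule(proteomes):
    """Pair every proteome (from the second on) with all previous ones."""
    return [(proteomes[i], proteomes[:i]) for i in range(1, len(proteomes))]


for query, database in search_schedule(["pr1", "pr2", "pr3", "pr4"]):
    print(f"{query} against {'+'.join(database)}")
```

Because each database is built from the proteomes that come earlier in the list, the set of searches depends on the input order.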
After these searches, the program will derive the number of core and
pan proteins for each proteome. The output list will then be redirected
into an R script, which plots all the core/pan values as a function of
the proteome number. Just like the BLAST matrix script you tried
yesterday, this script will cache all the BLAST results. If you change
the order of the input proteomes, all BLAST searches must be carried
out again. Therefore, we have prepared two runs for you:
- First, log in.
# Log in to the computers again, then:
ssh -Y <your_username>@login.cbs.dtu.dk
ssh -Y sbiology
setenv MAKEFILES /home/people/pfh/bin/Makefile
- Create a directory where this work will be done.
# Ensure you are in the right place
cd ~/
mkdir coregenome
cd coregenome
- Create the configuration file for this program, then run the program.
# Create config file
sh ~karinl/scripts/core/coregenome.sh > pancoregenomelist.txt
# Run the script and write the plot to a PostScript file
perl ~pfh/scripts/coregenome/coregenome pancoregenomelist.txt > pancoregenomeplot.ps
- Transfer the file to your own computer and view it.
Open a new Console window. In this window do the following:
# Transfer file and view it.
scp <yourUserName>@login.cbs.dtu.dk:coregenome/pancoregenomeplot.ps .
gv pancoregenomeplot.ps
Look at the plot. Can you tell approximately how many gene families
your genomes have in common? How many gene families are there in total
for your genomes?
EMBOSS
EMBOSS is a collection of software tools that are freely available at
http://emboss.bioinformatics.nl/.
Go and have a look: scroll through the list of tools to get a feel for
what you can do with EMBOSS. If the site or an analysis is too slow,
ask me and I will tell you what to do.
- Using dottup from the EMBOSS package, see how colinear your three
genomes are. Select two that you think are very similar, and another
that should be very different from the other two.
Find the RefSeq ID for each of your three genomes. Enter your IDs in
the box where it says "To access a sequence from a database, enter the
USA here:". The format of what you enter here is
refseq:<refseq_id>
Also set the word size to 18.
Based on the results, which pair of genomes is most similar? Does this
agree with what you expected?
- What happens if you change a few of the dottup parameters (e.g. word
size)?
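To get a feel for what the word size does before trying it in dottup, the idea behind a word-match dot plot can be sketched in a few lines. This is an illustration of the general technique, not dottup's implementation; the function `word_matches` and the two toy sequences are made up for the example.

```python
def word_matches(seq_a, seq_b, word_size):
    """Return (i, j) start positions where seq_a and seq_b share an exact word."""
    # Index every word of seq_b by its start position.
    positions = {}
    for j in range(len(seq_b) - word_size + 1):
        positions.setdefault(seq_b[j:j + word_size], []).append(j)
    # Look up every word of seq_a in that index; each hit is one dot.
    matches = []
    for i in range(len(seq_a) - word_size + 1):
        for j in positions.get(seq_a[i:i + word_size], []):
            matches.append((i, j))
    return matches


a = "ACGTACGTTAGC"
b = "TTACGTACGAAC"
for w in (2, 4, 6):
    print("word size", w, "->", len(word_matches(a, b, w)), "dots")
```

Every match of a longer word is also a match of each shorter word starting at the same position, so a smaller word size can only add dots (more sensitivity, but also more background noise), and a larger word size can only remove them.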
- Using polydot from the EMBOSS package, see how colinear your genomes
are (set word size = 20).
- Determine the following for your three genomes, using programs from
the EMBOSS package: the %GC content and the dinucleotide relative
abundance.
Based on the results, which pair of sequences is most similar?
Also, check the “deltarho-website” at
http://deltarho.amc.nl/cgi-bin/bin/index.cgi.
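As a reference for interpreting the EMBOSS output, the two quantities can be sketched directly. The dinucleotide relative abundance is taken here as the observed dinucleotide frequency divided by the product of the two mononucleotide frequencies, computed on a single strand; published definitions often also count the reverse-complement strand, which this sketch omits. The function names are hypothetical, and the toy sequence is not real data.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def gc_percent(seq):
    """Percentage of G and C among the unambiguous bases of seq."""
    seq = seq.upper()
    total = sum(seq.count(base) for base in BASES)
    return 100.0 * (seq.count("G") + seq.count("C")) / total


def dinucleotide_rho(seq):
    """Return rho[xy] = f(xy) / (f(x) * f(y)) for each dinucleotide xy."""
    seq = seq.upper()
    mono = Counter(base for base in seq if base in BASES)
    di = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if set(seq[i:i + 2]) <= set(BASES)
    )
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    if n_di == 0:
        return {}
    rho = {}
    for x, y in product(BASES, repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono)
        if expected > 0:
            rho[x + y] = (di[x + y] / n_di) / expected
    return rho


toy = "ACGTACGT"
print(gc_percent(toy))
print(dinucleotide_rho(toy)["AC"])
```

Values of rho near 1 mean the dinucleotide occurs about as often as the base composition alone would predict. Genomes can be compared by how similar their sixteen rho values are.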
Blast atlases
Precalculated genome atlases
Configurable blast atlases
Exercises created by Karin Lagesen, Peter F.
Hallin and Tom Coenye