Day 4: Pan and core genome plots and Blast atlases
Pan and core genome plots
Pan and core genome plots are graphs that display to what extent gene
families are conserved within a set of genomes. Conservation is
evaluated by first BLASTing the proteomes of the genomes against each
other. This is done in a fixed order: every proteome is searched with
BLAST against all previous proteomes.
For each proteome, taken in the order of the input list, the result is
a set of numbers describing the situation at that point:
- Number of new genes
- Number of new families
- Size of core genome
- Size of pan genome
Two genes are considered to belong to the same gene family if they
are more than 50% identical over more than 50% of the length of the
longer of the two genes.
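As a minimal sketch of the rule above, assuming the BLAST alignment gives a percent identity and an alignment length, the check could look like this. The function name and its parameters are hypothetical and are not taken from the course script; coverage is read here as alignment length divided by the length of the longer gene.

```python
def same_family(percent_identity, alignment_length, length_a, length_b):
    """Return True if two genes pass the 50%/50% rule described above.

    percent_identity: identity of the alignment, in percent (0-100).
    alignment_length: number of aligned positions.
    length_a, length_b: lengths of the two sequences.
    This is one interpretation of the rule, not the course script itself.
    """
    longer = max(length_a, length_b)
    coverage = alignment_length / longer
    # "More than" is taken literally: exactly 50% does not qualify.
    return percent_identity > 50.0 and coverage > 0.5


print(same_family(80.0, 300, 400, 350))  # coverage 0.75, identity 80%
print(same_family(80.0, 150, 400, 350))  # coverage 0.375, too short
print(same_family(45.0, 390, 400, 350))  # identity below the threshold
```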
We have prepared a script that produces such a pan and core genome
plot from a list of proteomes.
The script accepts a number of proteomes
(pr1, pr2, .. prN) and performs a BLAST search of each proteome against
all the previous ones:
- pr2 against pr1
- pr3 against pr1+pr2
- pr4 against pr1+pr2+pr3
- ...
- prN against pr1+pr2+pr3 ... pr[N-1]
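The order of searches listed above can be sketched as follows. This is only an illustration of the pairing; the function name is hypothetical and no BLAST search is actually run.

```python
def search_plan(proteomes):
    """Yield (query, database) pairs: each proteome against all earlier ones."""
    for i in range(1, len(proteomes)):
        yield proteomes[i], proteomes[:i]


for query, database in search_plan(["pr1", "pr2", "pr3", "pr4"]):
    print(query, "against", "+".join(database))
```

Because every database is built from the proteomes that come before the query, the set of searches depends on the order of the input list.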
After these searches, the program derives the number of core and pan
proteins for each proteome. The output list is then passed to an
R script, which plots all the core and pan values as a function of the
proteome number. Just like the BLAST matrix script you tried last week,
this script caches all the BLAST results. If you change the order of
the input proteomes, all BLAST searches must be carried out again.
However, since you made a BLAST matrix last week, all of these results
are still stored, so changing the order should not be a problem this
time.
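To make the running core and pan numbers concrete, here is a small sketch. It assumes the gene families have already been assigned, so each proteome is given as a mapping from gene name to family id. The data layout, the function name and the example values are all made up for illustration; the real script derives the families from the BLAST results.

```python
def pan_core_counts(proteomes):
    """proteomes: list of dicts mapping gene name -> family id, in input order.

    Returns one dict per proteome with the four running numbers.
    """
    pan = set()    # families seen in at least one proteome so far
    core = None    # families seen in every proteome so far
    rows = []
    for genes in proteomes:
        families = set(genes.values())
        new_families = families - pan
        # A gene counts as new if its family was not seen in earlier proteomes.
        new_genes = sum(1 for fam in genes.values() if fam in new_families)
        pan |= families
        core = set(families) if core is None else core & families
        rows.append({
            "new_genes": new_genes,
            "new_families": len(new_families),
            "core": len(core),
            "pan": len(pan),
        })
    return rows


# Hypothetical example: three tiny proteomes with pre-assigned families.
example = [
    {"a1": "F1", "a2": "F2", "a3": "F2"},
    {"b1": "F1", "b2": "F3"},
    {"c1": "F1", "c2": "F2", "c3": "F4", "c4": "F4"},
]
for number, row in enumerate(pan_core_counts(example), start=1):
    print(number, row)
```

In this toy input the pan genome grows from 2 to 3 to 4 families, while the core genome shrinks from 2 to 1 and then stays at 1.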
- First, log in and create a directory for this
work. You will also need X to look at the results. See the previous
exercise for how to do this.
# log in to the computers again, then
ssh -Y ibiology
umask 022
setenv MAKEFILES /home/people/pfh/bin/Makefile
- Create a directory where this work will be
done.
# Ensure you are in the right place
cd ~/
mkdir coregenome
cd coregenome
- Create a configuration file for this program
# create config file
sh ~karinl/scripts/core/coregenome.sh ../data/prodigal > pancoregenomelist.txt
- Look at this file using nedit (remember, you
need to have X activated!)
The order in which the organisms are listed in the file decides the
order of the organisms in the plot. The field on the left is the name
of the organism, while the protein file for this organism is listed on
the right.
# look at, and maybe edit using nedit
nedit pancoregenomelist.txt
Save the file.
- Run the pan and core genome plot program.
# run the program.
perl ~pfh/scripts/coregenome/coregenome pancoregenomelist.txt > pancoregenomeplot.ps
- Examine the plot:
# View the plot
gv pancoregenomeplot.ps
Look at the plot. Can you tell how many gene families, approximately,
your genomes have in common? How many gene families are there in total
for your genomes?
Blast atlases
Blast atlases are similar to the genome atlases that you looked at
during Day 3, but in addition to showing genomic properties they also
show BLAST hits to the target genome.
A blast atlas is always made with a reference organism in the
'middle'. All genomic properties that are shown in the atlas relate to
this one organism. Next, other organisms that you wish to compare to
the reference organism are searched for genes that are similar to those
found in the reference organism. These hits are then shown in the atlas
as lines where regions in the reference organism have been found to
have a match in the searched organism. One lane per searched organism
is shown.
Note: the genes in the reference organism are shown in the order in
which they occur in that organism. The hits to a gene are shown where
the reference gene is; that is, no inference can be made about the
location of the matching gene in the searched genomes.
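The note above can be illustrated with a short sketch of how one lane could be laid out. The function, the gene names and the coordinates are hypothetical; the point is only that the positions come from the reference genome, while the position of the match in the searched genome is never used.

```python
def lane_segments(reference_genes, genes_with_hits):
    """Return the (start, end) stretches of the reference to mark in one lane.

    reference_genes: dict of gene name -> (start, end) on the reference genome.
    genes_with_hits: names of reference genes that had a match in one
                     searched organism.
    """
    segments = [reference_genes[name] for name in genes_with_hits
                if name in reference_genes]
    return sorted(segments)


# Hypothetical reference genes and hits for one searched organism.
reference = {"geneA": (100, 900), "geneB": (1200, 2000), "geneC": (2500, 3100)}
print(lane_segments(reference, {"geneC", "geneA"}))
```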
Zoomable web version
These can be found here: Zoomable atlases
In this version, you find and choose your reference organism first, and
then add 'BLAST LANES', one for each of the other organisms you wish to
display. Then you press 'submit' and wait a bit. Note: this will only
work if you have the latest Java version installed.
Getting your files to your computer
You now have several PostScript files in your directories that you
might want to have on your own computer.
If you have a Mac, you can use the ps files directly. If you have a Windows computer,
you need to do a bit of conversion first. Here is what you do:
Use the command ps2epsi like this:
ps2epsi <filename>.ps
You then have a <filename>.epsi file in your directory.
This file needs to be renamed to <filename>.eps:
mv <filename>.epsi <filename>.eps
You can then transfer this file to your computer.
Transfer
If you have a Mac, use something like Fugu.
If you have Windows, use something like WinSCP.
Both of these are graphical secure copy programs. Install them, and
connect to login.cbs.dtu.dk with your stud-account.
You can then get the ps or eps files to your computer, and you can then
insert them into your documents.