Day 4: Pan and core genome plots
Pan and core genome plots are graphs that display to what extent gene
families are conserved within a set of genomes. Conservation is
evaluated by first BLASTing the proteomes of the genomes against each
other. This is done in a fixed order: every proteome is searched with
BLAST against all the proteomes that precede it in the input list.
Two genes are considered to belong to the same gene family if the two
are more than 50% identical over more than 50% of the length of the
longer of the two genes.
The result, for each proteome in the order of the input list, is a set
of numbers specific to that point in the sequence, showing:
- Number of new genes
- Number of new families
- Size of core genome
- Size of pan genome
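The gene family criterion stated above can be written as a small test on one BLAST hit. This is a minimal sketch of one reading of the rule, not the course script itself: it assumes that "50% identical" refers to the percent identity of the alignment and that "50% of the length" refers to the alignment length relative to the longer sequence. The function name and its parameters are hypothetical.

```python
def same_family(percent_identity, alignment_length, length_a, length_b):
    """Return True if a hit passes the 50%/50% gene family rule.

    Assumed interpretation (not confirmed by the course script):
    identity must exceed 50%, and the aligned region must cover
    more than half of the longer of the two sequences.
    """
    longest = max(length_a, length_b)
    return percent_identity > 50.0 and alignment_length > 0.5 * longest


# 62% identity over 180 residues, sequences of 300 and 250 residues:
# 180 is more than half of 300, so the pair passes.
print(same_family(62.0, 180, 300, 250))

# High identity, but the alignment covers only a third of the longer gene.
print(same_family(90.0, 100, 300, 250))
```

Measuring coverage against the longer gene means that a short protein matching only one domain of a much longer protein is not placed in the same family.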
We have prepared a script which produces such a pan- and core-genome
plot, given a list of proteomes. The script outputs one such set of
numbers per proteome, in the order of the input list.
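To make the four numbers concrete, here is a rough sketch of how they could be tallied once every gene has been assigned to a family. It is an illustration under assumptions, not the script's actual code: the input format (one dictionary per proteome mapping a gene ID to a family ID) and the function name are hypothetical. A "new gene" is taken to be a gene whose family has not been seen in any earlier proteome, and the core genome is taken to be the families present in every proteome so far.

```python
def pan_core_counts(proteomes):
    """Tally (new genes, new families, core size, pan size) per proteome.

    `proteomes` is a list of dicts mapping gene ID -> family ID,
    given in the same order as the input list (hypothetical format).
    """
    pan = set()    # every family seen so far
    core = None    # families present in all proteomes so far
    rows = []
    for genes in proteomes:
        families = set(genes.values())
        new_genes = sum(1 for fam in genes.values() if fam not in pan)
        new_families = len(families - pan)
        pan |= families
        core = set(families) if core is None else core & families
        rows.append((new_genes, new_families, len(core), len(pan)))
    return rows


# Three toy proteomes; gene and family IDs are made up for illustration.
example = [
    {"a1": "F1", "a2": "F2", "a3": "F2"},
    {"b1": "F1", "b2": "F3"},
    {"c1": "F1", "c2": "F3", "c3": "F4", "c4": "F4"},
]
for number, row in enumerate(pan_core_counts(example), start=1):
    print(number, *row)
```

In this toy data the pan genome grows from 2 to 3 to 4 families while the core genome shrinks from 2 to 1, which is the general shape to expect in the plot: the pan curve can only rise and the core curve can only fall as proteomes are added.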
The script will accept a number of proteomes
(pr1, pr2, .. prN) and perform a BLAST search of each proteome against
all the previous ones:
- pr2 against pr1
- pr3 against pr1+pr2
- pr4 against pr1+pr2+pr3
- prN against pr1+pr2+pr3 ... pr[N-1]
After these searches, the program will derive the number of core and
pan proteins for each proteome. The output list will then be redirected
into a program that plots all the core/pan values as a function of the
proteome number. Just like the script you tried last week, this script
will cache all the BLAST results. In the event you change the order of
the input proteomes, all BLAST searches must be run again. However,
since you did a BLAST matrix last week, all of these results are still
stored, so changing the order should not be a problem this time.
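The search order listed above, and the reason a complete BLAST matrix makes reordering cheap, can be sketched as follows. This is only an illustration: it assumes the cache stores results per (query, subject) pair of proteomes, which the text does not state explicitly, and the function names are hypothetical.

```python
from itertools import permutations


def blast_schedule(order):
    """Pair each proteome with the list of proteomes that precede it."""
    return [(order[i], order[:i]) for i in range(1, len(order))]


def pairs_needed(order):
    """Set of (query, subject) proteome pairs required for a given order."""
    return {(query, subject)
            for query, previous in blast_schedule(order)
            for subject in previous}


names = ["pr1", "pr2", "pr3", "pr4"]
for query, previous in blast_schedule(names):
    print(f"{query} against {'+'.join(previous)}")

# A full BLAST matrix contains every ordered pair of distinct proteomes,
# so it covers the pairs needed for any ordering of the same proteomes.
matrix = set(permutations(names, 2))
reordered = ["pr3", "pr1", "pr4", "pr2"]
print(pairs_needed(reordered) <= matrix)
```

A different input order needs a different set of pairs, which is why reordering normally triggers new searches; with the full matrix already cached, every such pair is available.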
- First, log in and create a directory for this
work. You will also need X to look at the results. See the previous exercise for how to do this.
# log in to the computers again, then
ssh -Y sbiology
setenv MAKEFILES /home/people/pfh/bin/Makefile
- Create a directory where this work will be done.
# Ensure you are in the right place
- Create a configuration file for this program
# create config file
sh ~karinl/scripts/core/coregenome.sh ../data/prodigal > pancoregenomelist.txt
- Look at this file using nedit (remember, you need to have X activated!)
The order in which the organisms are listed in the file decides the order of
the organisms in the plot. The field on the left is the name of the
organism, while the protein file for this organism is listed on the
right.
# look at, and maybe edit using nedit
Save the file.
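If you reorder the list by hand, it can help to check afterwards that every line still has both fields. The sketch below is an assumption-laden illustration, not part of the course scripts: it assumes the two fields are separated by whitespace (tab or spaces) and that the protein file path itself contains no spaces. The function name and the sample organism names and paths are hypothetical.

```python
def read_proteome_list(text):
    """Parse 'organism name <whitespace> protein file' lines, in file order.

    Assumes the last whitespace-separated token on each line is the
    protein file and everything before it is the organism name.
    """
    entries = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            raise ValueError(f"line {line_number}: expected a name and a file")
        entries.append((parts[0], parts[1]))
    return entries


# Made-up sample content; real names and paths will differ.
sample = "Organism A\tgenomeA.fsa\nOrganism B\tgenomeB.fsa\n"
for position, (name, path) in enumerate(read_proteome_list(sample), start=1):
    print(position, name, path)
```

The position printed for each organism is the proteome number it will get on the x-axis of the plot.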
- Run the pan coregenome plot program.
# run the program.
perl ~pfh/scripts/coregenome/coregenome pancoregenomelist.txt > pancoregenomeplot.ps
- Examine the plot:
# View the plot
Look at the plot. Can you tell how many gene families, approximately, your
genomes have in common? How many gene families are there in total for