The objective of this exercise is to learn how to use a simple method to
estimate core and pan genomes. The method used in this exercise is dependent on
the order in which the proteomes are presented to the method. The method can
be generalized, however, by randomly sampling all the possible combinations.
We will not do this in the exercise, since it is very computationally
intensive. If you are interested in this, however, you are welcome to ask about it.
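
To give an idea of what sampling the proteome order could look like, here is a minimal sketch. It is not part of the exercise. The function name `shuffle_lists` and the output file names are made up for illustration, and it assumes that GNU `shuf` is available and that the genome list has one genome per line.

```shell
# Hypothetical helper: write N randomly ordered copies of a genome list.
# shuffle_lists list.txt 3  ->  list.perm1.txt list.perm2.txt list.perm3.txt
shuffle_lists() {
    list=$1
    n=$2
    i=1
    while [ "$i" -le "$n" ]; do
        # shuf (GNU coreutils) prints the input lines in random order
        shuf "$list" > "${list%.txt}.perm$i.txt"
        i=$((i + 1))
    done
}
```

Each shuffled list could then be run through the core genome script separately and the resulting curves compared or averaged. With many genomes this multiplies the running time by the number of shuffles, which is why it is skipped here.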
Key tools used in this exercise:
Run the fixR script
Do this by downloading the script, opening a terminal and typing::
Then R should work nicely and be integrated with the emacs editor.
To make typing easier
You can consider running the fix_keyboard script again.
How to save data to a usb-stick
This is provided as a service, if you feel you need it. It is not something
you need for the exercise.
After inserting your usb stick, you can type the following to access it.
mount -w -o users /dev/sdb1 /mnt/tmp
Now you can access the contents of the usb stick at /mnt/tmp.
Before removing the usb stick, type:
Estimating the pan and core genomes
As has been discussed in a number of the talks, groups of organisms seem to
have some core genes, which are shared between all the genomes, and some
genes that are specific to a smaller subgroup.
Download this example list of genomes: list.txt.
Now open the list in an editor, emacs for instance.
emacs list.txt &
In the list you see each proteins.fsa file preceded by the organism name.
This list configures the core genome script. The script takes each of the fasta
files in turn and uses blast to find which proteins are new (those are added to the pan genome), and which have been found in all the genomes (these constitute the core genome).
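
The bookkeeping behind the two curves can be sketched as follows. This is not how coregenome-1.1 is implemented: the real script decides with blast whether a protein is new, whereas this sketch assumes that step has already been done and that each genome is given as a file with one gene-family ID per line. The function name `pan_core` and the input format are assumptions made for illustration.

```shell
# pan_core FILE...
# Each FILE is assumed to hold one gene-family ID per line (hypothetical input).
# After each genome, in the order given, prints: genomes_so_far pan_size core_size
pan_core() {
    awk '
        function report(   f, pan, core) {
            pan = 0; core = 0
            for (f in count) {
                pan++                          # family seen in at least one genome
                if (count[f] == n) core++      # family seen in every genome so far
            }
            print n, pan, core
        }
        FNR == 1 { if (n > 0) report(); n++ }  # first line of the next genome file
        NF > 0 && !(($1, n) in seen) {         # count a family once per genome
            seen[$1, n] = 1
            count[$1]++
        }
        END { if (n > 0) report() }
    ' "$@"
}
```

For three genomes with families A B C, B C D and C D E, the pan genome grows from 3 to 4 to 5 while the core genome shrinks from 3 to 2 to 1.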
As you can imagine, the result depends heavily on the order in which the
genomes are presented, so you should take care to sort the list.
A reasonable way to do this is to group the genomes by similarity and
to put the biggest genomes first.
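
Putting the biggest genomes first can be done by hand in the editor, but it can also be sketched in the shell. The sketch below assumes that each line of the list is the organism name, a tab, and then the path to the fasta file. Check this against your actual list.txt before using it, since the real separator may differ. The function name `sort_by_size` is made up for illustration, and genome size is approximated by the number of proteins.

```shell
# sort_by_size LIST
# LIST is assumed to be tab-separated: organism name, then fasta file path.
# Prints the same lines, ordered by number of proteins, largest first.
sort_by_size() {
    tab=$(printf '\t')
    while IFS="$tab" read -r name fasta; do
        # each protein in a fasta file starts with a ">" header line
        n=$(grep -c '^>' "$fasta" || true)
        printf '%s\t%s\t%s\n' "$n" "$name" "$fasta"
    done < "$1" | sort -t "$tab" -k1,1nr | cut -f2-
}
```

A possible use is `sort_by_size list.txt > list.sorted.txt`. Note that this only handles size. Grouping similar genomes together still has to be done by hand.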
When you are done editing the list, save it by pressing Ctrl+x Ctrl+s, and quit
emacs with Ctrl+x Ctrl+c.
You are now ready to start the script. It might take quite a long time to run.
coregenome-1.1 list.txt > data.dat
Status of the blasts will be printed continuously, so you can see how the
program is progressing.
While you are waiting, you should register at the CCAMMERA webpage for the
exercise tomorrow. The webpage is:
You can take a look at the zoomable atlas page as well.
After a long wait, possibly more than half an hour, the blasts will be done,
and the results are written to the data.dat file.
You are now ready to make the plot by typing:
R --vanilla < ~/bin/coreplot.R
Notice that this is yet another way to run R scripts.
The R script assumes that the output is called data.dat and is present in the
directory from where the script is run.
The output is placed in a file called plot.ps, which you can view with the
Does the plot look as you would expect?
If the proteome order is properly sampled, it is possible to estimate the size
of the core and pan genome, as the asymptotes of the core and pan genome graphs.