Exercise M22


The objective of this exercise is to learn a simple method for estimating core and pan genomes. The result of this method depends on the order in which the proteomes are presented to it. The method can be generalized, however, by randomly sampling the possible orderings. We will not do this in the exercise, since it is computationally very intensive. If you are interested in this, you are welcome to ask about it.

Key tools used in this exercise:
Run the fixR script
Do this by downloading the script, opening a terminal and typing:
sh fixR
R should then work nicely and be integrated with the emacs editor.

To make typing easier
You can consider running the fix_keyboard script again.
sh fix_keyboard
How to save data to a USB stick
This is provided as a convenience; it is not needed for the exercise. After inserting your USB stick, you can type the following to access it:
mount -w -o users /dev/sdb1 /mnt/tmp
Now you can access the contents of the USB stick at /mnt/tmp.
Before removing the USB stick, type:
umount /mnt/tmp
  1. Estimating the pan and core genomes

    As has been discussed in a number of the talks, a group of organisms seems to have some core genes, which are shared between all the genomes, and some genes that are specific to a smaller subgroup.
    Download this example list of genomes: list.txt.
    Now open the list in an editor, emacs for instance.
    emacs list.txt &
    In the list you see each proteins.fsa file preceded by the organism name. This list configures the core genome script. The script takes each of the fasta files in turn and uses BLAST to find which proteins are new (those are added to the pan genome), and which have been found in all the genomes so far (these constitute the core genome).
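    The bookkeeping behind this procedure can be illustrated with a minimal sketch. It is not the coregenome script: it assumes, hypothetically, that each proteome has already been reduced to a set of gene-family labels, so that the BLAST comparison becomes simple set membership. The function name `accumulate` and the toy labels are made up for illustration.

```python
def accumulate(genomes):
    """Return (pan_sizes, core_sizes) after adding each genome in turn.

    `genomes` is a list of sets of gene-family labels. The labels are a
    hypothetical stand-in for the BLAST matches used by the real script.
    """
    pan = set()
    core = None  # undefined until the first genome has been seen
    pan_sizes, core_sizes = [], []
    for families in genomes:
        pan |= families  # families not seen before grow the pan genome
        # only families present in every genome so far stay in the core
        core = set(families) if core is None else core & families
        pan_sizes.append(len(pan))
        core_sizes.append(len(core))
    return pan_sizes, core_sizes


genomes = [
    {"a", "b", "c", "d"},
    {"a", "b", "c", "e"},
    {"a", "b", "f"},
]
pan_sizes, core_sizes = accumulate(genomes)
print(pan_sizes)   # [4, 5, 6]
print(core_sizes)  # [4, 3, 2]
```

    In this simplified set model the final sizes are the same for every order, but the intermediate points of the two curves are not: presenting the same three genomes in reverse gives pan sizes 3, 5, 6 and core sizes 3, 2, 2.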
    As you can imagine, the result of this procedure depends strongly on the order in which the genomes are presented, so you should take care to sort the list.
    A reasonable way to do this is to group the genomes by similarity and to put the biggest genomes first.
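    The "biggest genomes first" part of this advice can be sketched as follows. This is only an illustration under assumptions: proteome size is taken to be the number of FASTA records (header lines starting with ">"), the proteomes are given as in-memory text rather than read from list.txt, and the names `count_proteins` and `largest_first` are hypothetical. Grouping by similarity is not covered and still has to be done by hand.

```python
def count_proteins(fasta_text):
    """Count FASTA records: each record starts with a '>' header line."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


def largest_first(proteomes):
    """Sort (organism, fasta_text) pairs by protein count, largest first."""
    return sorted(proteomes, key=lambda item: count_proteins(item[1]), reverse=True)


# Toy proteomes; real proteins.fsa files hold thousands of records.
proteomes = [
    ("small", ">p1\nMKV\n"),
    ("large", ">p1\nMKV\n>p2\nMAL\n>p3\nMTT\n"),
    ("medium", ">p1\nMKV\n>p2\nMAL\n"),
]
for organism, text in largest_first(proteomes):
    print(organism, count_proteins(text))
```

    The loop prints the organisms in the order large, medium, small, which is the order in which they would then be entered in the list.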
    When you are done editing the list, save it by pressing Ctrl+x Ctrl+s, and quit emacs with Ctrl+x Ctrl+c. You are now ready to start the script. It might take quite a long time to run.
    coregenome-1.1 list.txt > data.dat
    The status of the BLAST runs is printed continuously, so you can see how the program is progressing.

    While you are waiting, you should register at the CCAMMERA webpage for the exercise tomorrow. The webpage is:


    You can take a look at the zoomable atlas page as well.

    http://www.cbs.dtu.dk/services/GenomeAtlas/suppl/zoomatlas/

    After a long wait, possibly more than half an hour, the blasts will be done, and the results are written to the data.dat file. You are now ready to make the plot by typing:
    R --vanilla < ~/bin/coreplot.R
    Notice that this is yet another way to run R scripts.
    The R script assumes that the output is called data.dat and is present in the directory from which the script is run.

    The output is placed in a file called plot.ps, which you can view with the command:
    gv plot.ps
    Does the plot look as you would expect?

    If the proteome order is properly sampled, it is possible to estimate the sizes of the core and pan genomes as the asymptotes of the core and pan genome curves.
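    What sampling the proteome order could look like is sketched below. The sketch rests on the same hypothetical simplification of proteomes as sets of gene-family labels, not on BLAST, and the names `curves` and `mean_curves` as well as the parameters `n_orders` and `seed` are assumptions made for illustration. It shuffles the genome order many times and averages the pan and core curves point by point.

```python
import random


def curves(genomes):
    """Pan and core sizes after each genome, for one fixed order."""
    pan, core = set(), None
    pan_sizes, core_sizes = [], []
    for families in genomes:
        pan |= families
        core = set(families) if core is None else core & families
        pan_sizes.append(len(pan))
        core_sizes.append(len(core))
    return pan_sizes, core_sizes


def mean_curves(genomes, n_orders=200, seed=0):
    """Average the pan and core curves over randomly shuffled orders."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(genomes)
    pan_total = [0] * n
    core_total = [0] * n
    for _ in range(n_orders):
        order = genomes[:]  # shuffle a copy, leave the input untouched
        rng.shuffle(order)
        pan_sizes, core_sizes = curves(order)
        for i in range(n):
            pan_total[i] += pan_sizes[i]
            core_total[i] += core_sizes[i]
    mean_pan = [total / n_orders for total in pan_total]
    mean_core = [total / n_orders for total in core_total]
    return mean_pan, mean_core


genomes = [
    {"a", "b", "c", "d"},
    {"a", "b", "c", "e"},
    {"a", "b", "f"},
]
mean_pan, mean_core = mean_curves(genomes)
print(mean_pan)
print(mean_core)
```

    With only three toy genomes the averaged curves are far from flat, so they say little about asymptotes. The point is only that the averaged curves no longer depend on one particular order. With real proteomes every shuffled order would need its own set of BLAST runs, which is why this is so computationally intensive.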