Exercise in Pangenomics

Written by: Carsten Friis
Adapted from an exercise by: Kristoffer Kiil


The objective of this exercise is to learn a simple method for estimating the core and pan genomes of selected organisms. The result of this method depends on the order in which the proteomes are presented to it. The method can be generalized by randomly sampling the possible orderings, but we will not do this in the exercise, since it is very computationally intensive. If you are interested, you are welcome to ask about it.
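The idea of sampling orderings can be illustrated with a minimal Python sketch. This is not the script used in the exercise: genomes are represented as toy sets of made-up gene family labels instead of BLAST-compared proteomes, and the function names `cumulative_sizes` and `sample_orderings` are hypothetical. In this simplified set-based version only the intermediate points of the curves depend on the order; the final sizes do not.

```python
import random

def cumulative_sizes(genomes):
    """Return (pan_sizes, core_sizes) after adding each genome in the given order."""
    pan, core = set(), None
    pan_sizes, core_sizes = [], []
    for genes in genomes:
        pan |= genes                                    # anything new joins the pan genome
        core = set(genes) if core is None else core & genes  # core keeps only what is in all genomes so far
        pan_sizes.append(len(pan))
        core_sizes.append(len(core))
    return pan_sizes, core_sizes

def sample_orderings(genomes, n_samples, seed=0):
    """Average the pan and core curves over randomly shuffled genome orders."""
    rng = random.Random(seed)
    n = len(genomes)
    pan_sum, core_sum = [0] * n, [0] * n
    for _ in range(n_samples):
        order = list(genomes)
        rng.shuffle(order)
        pan_sizes, core_sizes = cumulative_sizes(order)
        for i in range(n):
            pan_sum[i] += pan_sizes[i]
            core_sum[i] += core_sizes[i]
    return ([s / n_samples for s in pan_sum],
            [s / n_samples for s in core_sum])

# Toy data: made-up gene family labels, not real proteomes.
toy_genomes = [
    {"a", "b", "c", "d"},
    {"a", "b", "c", "e"},
    {"a", "b", "f"},
]
mean_pan, mean_core = sample_orderings(toy_genomes, n_samples=200)
print("mean pan sizes: ", mean_pan)
print("mean core sizes:", mean_core)
```

Averaging over many shuffled orders smooths out the effect of any single ordering, which is why the full procedure becomes expensive when every step requires BLAST runs.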


Estimating the pan and core genomes

As has been discussed, groups of closely related organisms seem to share some core genes, which are present in all the genomes, while other genes are specific to smaller subgroups.
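In set terms, the core genome is the intersection of the genomes and the pan genome is their union. The following is a minimal Python sketch under the assumption that each genome has already been reduced to a set of gene family labels; the strain names and labels are made up for illustration.

```python
# Toy genomes as sets of made-up gene family labels (hypothetical data).
genomes = {
    "strain_1": {"a", "b", "c", "d"},
    "strain_2": {"a", "b", "c", "e"},
    "strain_3": {"a", "b", "f"},
}

# Core genome: gene families present in every genome.
core = set.intersection(*genomes.values())

# Pan genome: gene families present in at least one genome.
pan = set.union(*genomes.values())

# Families shared by a subgroup (strain_1 and strain_2) but absent from the core.
subgroup_specific = (genomes["strain_1"] & genomes["strain_2"]) - core

print("core:", sorted(core))
print("pan:", sorted(pan))
print("subgroup-specific:", sorted(subgroup_specific))
```

In practice, deciding whether two proteins belong to the same family is the hard part; the exercise uses BLAST for that step.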

  1. Start by opening a terminal window to our servers.

    How to do this depends on your computer. Do not hesitate to contact a teacher for assistance if needed.

  2. Now copy the data directory to your home directory:

    cp -r /home/projects/carsten/MVNexercises/Pangenomics ~/

  3. Then change your working directory to the data directory:

    cd ~/Pangenomics

  4. Open the file list.txt in an editor, nedit for instance. This file contains a list of genomes whose core and pan genomes we wish to investigate.

    nedit list.txt &

    In the list you see each *.proteins.fsa file preceded by the organism name. This list configures the core genome script. The script takes each of the FASTA files in turn and uses BLAST to find which proteins are new (these are added to the pan genome) and which have been found in all the previous genomes (these constitute the core genome).

    This procedure depends strongly on the order in which the genomes are presented, so you should take care to sort the list. A reasonable way to do this is to group the genomes by taxonomic similarity and to put the biggest genomes first. If you are unfamiliar with the organisms, you can look them up under Complete Microbial Genomes at NCBI.

    When you are done editing the list, save it.

  5. You are now ready to start the script:

    ./coregenome-1.1 list.txt > data.dat

    Note: It might take a while to run and may produce several warning messages, which you can safely ignore. The status of the BLAST runs is printed continuously, so you can see how the program is progressing.

    While you are waiting, you can take a look at the zoomable atlas page:

    http://www.cbs.dtu.dk/services/GenomeAtlas/suppl/zoomatlas/

    Eventually the BLAST runs will finish, and the results are written to the file data.dat.

  6. You are now ready to make a core/pan-genome plot:

    R --vanilla < coreplot.R

    You may notice that this command executes an R script. R is a statistical environment that you will hear more about later, and which is widely used for microarray analysis these days. You will also get to run R interactively.

    For now, just execute the command and know that the R script reads the data from the file data.dat, which must be present in the directory from which the script is run. The resulting plot is placed in a file called plot.ps.

  7. You can view the plot with the command:

    gv plot.ps &

    If the proteome order is properly sampled, it is possible to estimate the sizes of the core and pan genomes as the asymptotes of the core and pan genome graphs.

    • Do the core and pan genomes look as you would expect?

  8. If you have time, try changing the order of the entries in list.txt and rerun the exercise.

    • What impact did it have on the plot?
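
One quick way to try a different order is to shuffle the entries of the list automatically. The following is a minimal Python sketch, assuming list.txt holds one genome entry per line; the helper name `shuffle_list_file` and the output file name list.shuffled.txt are hypothetical and not part of the exercise material.

```python
import random
from pathlib import Path

def shuffle_list_file(src, dst, seed=None):
    """Write the non-empty lines of src to dst in random order; return the new order."""
    # Assumption: each non-empty line of the list describes exactly one genome.
    lines = [ln for ln in Path(src).read_text().splitlines() if ln.strip()]
    random.Random(seed).shuffle(lines)
    Path(dst).write_text("\n".join(lines) + "\n")
    return lines

if __name__ == "__main__":
    # Only runs when list.txt is present in the current directory.
    if Path("list.txt").exists():
        for line in shuffle_list_file("list.txt", "list.shuffled.txt", seed=1):
            print(line)
```

The original list.txt is left untouched. The shuffled file can then be given to the script in place of list.txt, keeping data.dat as the output name because that is the file the R script reads.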