As discussed, groups of closely related organisms appear to share a set of core genes, which are present in all the genomes, alongside other genes that are specific to smaller subgroups.
- Start by opening a terminal window to our servers
How to do this depends on your computer. Do not hesitate to contact a teacher for assistance if needed.
- Now copy the data directory to your working directory
cp -r /home/projects/carsten/MVNexercises/Pangenomics ~/
- Then change your working directory to the data directory
cd ~/Pangenomics
- Open the file list.txt in an editor, nedit for instance
This file contains a list of genomes whose core- and pan-genomes we wish to investigate
nedit list.txt &
In the list you see each *.proteins.fsa file preceded by the organism name. This list configures the core genome script. The script takes each of the fasta files in turn and uses blast to find which proteins are new (those are added to the pan genome), and which have been found in all the previous genomes (those constitute the core genome).
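The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration and not the coregenome script itself: each proteome is assumed to be already reduced to a set of hypothetical gene family labels, so an exact set lookup stands in for the blast comparison, and the strain names and labels are made up.

```python
def accumulate_pan_core(genomes):
    """Walk through the genomes in list order and track pan and core sizes.

    genomes: ordered list of (name, set_of_family_labels) pairs.
    The family labels are a stand-in (assumption) for blast hits.
    Returns one (name, n_new, pan_size, core_size) tuple per genome.
    """
    pan = set()    # every family seen so far
    core = None    # families present in all genomes so far
    rows = []
    for name, families in genomes:
        new = families - pan                 # not seen in any previous genome
        pan |= families                      # new families extend the pan genome
        core = set(families) if core is None else core & families
        rows.append((name, len(new), len(pan), len(core)))
    return rows


# Made-up example data: three strains with hypothetical family labels.
genomes = [
    ("strain_A", {"f1", "f2", "f3", "f4"}),
    ("strain_B", {"f1", "f2", "f3", "f5"}),
    ("strain_C", {"f1", "f2", "f6"}),
]

for row in accumulate_pan_core(genomes):
    print(*row, sep="\t")
```

The pan genome can only grow and the core genome can only shrink as genomes are added, which is the shape you should expect to see in the plot later on.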
The result of this procedure depends heavily on the order in which the genomes are presented, so you should take care to sort the list. A reasonable way to do this is to group the genomes by taxonomic similarity and to put the biggest genomes first. If you are unfamiliar with the organisms, you can look them up in the Complete Microbial Genomes at NCBI.
When you are done editing the list, save it.
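If you prefer to work out the order programmatically rather than by hand, the suggested ordering could be sketched as follows. The organism names, group labels and protein counts are made up for illustration, and the tuple layout is an assumption, not the format of list.txt. Proteome size is measured here by counting fasta header lines, which start with ">".

```python
def count_fasta_records(lines):
    """Count the sequences in fasta text: each record starts with a '>' line."""
    return sum(1 for line in lines if line.startswith(">"))


def order_genomes(entries):
    """Sort (organism, group_label, n_proteins) tuples.

    Genomes with the same group label stay together, and within each group
    the largest proteome comes first. Groups themselves are ordered
    alphabetically by label, which is an arbitrary choice.
    """
    return sorted(entries, key=lambda e: (e[1], -e[2], e[0]))


# Hypothetical entries: (organism, taxonomic group, number of proteins).
entries = [
    ("org_a", "group2", 1800),
    ("org_b", "group1", 4200),
    ("org_c", "group1", 5100),
    ("org_d", "group2", 2500),
]

for organism, group, n_proteins in order_genomes(entries):
    print(organism, group, n_proteins, sep="\t")
```

You would still have to write the sorted entries back into list.txt in the format the script expects.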
- You are now ready to start the script
./coregenome-1.1 list.txt > data.dat
Note: The script may take a while to run and may produce several warning messages, which you can safely ignore. The status of the blasts is printed continuously, so you can see how the program is progressing.
While you are waiting, you can take a look at the zoomable atlas page
Eventually the blasts will finish, and the results will have been written to the data.dat file.
- You are now ready to make a core/pan-genome plot
R --vanilla < coreplot.R
You may notice that this command actually executes an R script. R is a statistical environment that you will hear more about later, and which is very widely used in microarray analysis these days. You will also get to run R interactively.
For now, though, just execute the command and know that the R script reads the data in the file data.dat, which must be present in the directory from which the script is run. The resulting plot is placed in a file called plot.ps.
- You can view the plot with the command:
gv plot.ps &
If the proteome order is properly sampled, the sizes of the core genome and the pan genome can be estimated as the asymptotes of their respective curves in the plot.
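One way to put a number on such an asymptote is to fit a simple saturating curve to the counts. The sketch below assumes the model count(n) = a + b*exp(-k*n), where n is the number of genomes added so far and a is the asymptote. That model is an assumption made for this illustration and is not something the exercise scripts do. The fit tries a grid of k values and solves for a and b by ordinary least squares at each one. The example data are synthetic, generated from the model itself.

```python
import math


def fit_plateau(counts, k_grid=None):
    """Fit counts[i] ~ a + b * exp(-k * (i + 1)) and return (a, b, k).

    For every k on the grid the model is linear in a and b, so those two
    are found by simple linear regression; the k with the smallest sum of
    squared errors wins. The value a is the estimated asymptote.
    """
    if len(counts) < 3:
        raise ValueError("need at least three points to fit three parameters")
    if k_grid is None:
        k_grid = [i / 100 for i in range(1, 301)]   # k = 0.01 ... 3.00
    n = len(counts)
    y_mean = sum(counts) / n
    best = None
    for k in k_grid:
        x = [math.exp(-k * (i + 1)) for i in range(n)]
        x_mean = sum(x) / n
        sxx = sum((xi - x_mean) ** 2 for xi in x)
        if sxx == 0.0:
            continue                                # degenerate, skip this k
        sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, counts))
        b = sxy / sxx
        a = y_mean - b * x_mean
        sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, counts))
        if best is None or sse < best[0]:
            best = (sse, a, b, k)
    if best is None:
        raise ValueError("no usable k value on the grid")
    return best[1], best[2], best[3]


# Synthetic core genome curve: decays towards 1000 genes (made-up numbers).
core_counts = [1000 + 2000 * math.exp(-0.5 * n) for n in range(1, 9)]
a, b, k = fit_plateau(core_counts)
print("estimated core genome size:", round(a))
```

The same function can be applied to a pan genome curve, where b comes out negative because the curve rises towards its plateau. If the pan genome is still growing steadily after the last genome, the fitted asymptote is not meaningful, so check the plot before trusting the number.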