|
Exercise - Module M18 - Core- and Pan genomes. Metagenomics.
Setup exercise
- Set environment and copy files required for the exercise
umask 022
cp -rL ~www/pub/CBS/courses/thaiworkshop08/m18 ~/
cd ~/m18
BLASTatlas for Burkholderia
The previous BLAST atlas you constructed shows the homology between your own bacterium and
those of the other groups, all of which belong to different bacterial genera. Now,
we will use the same method to compare multiple Burkholderia species.
We have prepared a set of files containing the proteomes of all the currently
sequenced Burkholderia genomes. Burkholderia genomes span several chromosomes (two or three,
depending on the species), and all replicons, including plasmids, have been merged into a
single file for each strain.
- Examine the file list
less ./Burkholderia_blastatlas.conf
- Download genome sequence and annotated proteins using web services
perl getseq.pl BX571965 > BX571965.fsa
perl getprot.pl BX571965 > BX571965.proteins.fsa
- Get genbank record and extract annotations
getgene BX571965 | saco_convert -I genbank -O annotation > BX571965.ann
less BX571965.ann
- Running BLASTatlas web service
perl genomeatlas.pl -ref BX571965.fsa -t "B. pseudomallei K96243, chr. I" \
-dnap "Intrinsic Curvature,Stacking Energy,Position Preference,Percent AT" \
-proteins BX571965.proteins.fsa -ann BX571965.ann -blastcfg Burkholderia_blastatlas.conf > Burkholderia_blastatlas.pdf
Open the PDF file: m18/Burkholderia_blastatlas.pdf
P. marinus BLAST atlas
- Download genome sequence and annotated proteins using web services
perl getseq.pl CP000111 > CP000111.fsa
perl getprot.pl CP000111 > CP000111.proteins.fsa
- Get genbank record and extract annotations
getgene CP000111 | saco_convert -I genbank -O annotation > CP000111.ann
- Running BLASTatlas web service
perl genomeatlas.pl -ref CP000111.fsa -t "P. marinus str. MIT 9312" \
-dnap "Intrinsic Curvature,Stacking Energy,Position Preference,Percent AT" \
-proteins CP000111.proteins.fsa -ann CP000111.ann -blastcfg Pmarinus_blastatlas.conf > Pmarinus_blastatlas.pdf
Open the PDF file: m18/Pmarinus_blastatlas.pdf
Core- and pan genome for Burkholderia
We have prepared a script that performs a series of BLAST searches, given a
list of proteomes. Each proteome in the input is searched with BLAST against
all proteomes that precede it in the list. For every proteome, in the order of
the input list, the script reports a set of numbers describing the state at
that point:
- Number of new genes
- Number of new families
- Size of core genome
- Size of pan genome
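As a rough illustration of how three of these values relate to each other, the sketch below accumulates them with plain set operations. It is not the course script: it assumes every gene has already been assigned a family label (in the real run, BLAST hits decide this), and the function name core_pan and the labels famA, famB, ... are made up for the example. Each argument is one proteome, written as a space-separated list of family labels, and at least one label per proteome is assumed.

```shell
# Toy core/pan accumulation with sort and comm (no BLAST involved).
# Prints one line per proteome: step, new families, core size, pan size.
core_pan() {
    tmp=$(mktemp -d)
    n=0
    for proteome in "$@"; do
        n=$((n + 1))
        # One family label per line, sorted and de-duplicated
        # (word splitting of $proteome is intentional).
        printf '%s\n' $proteome | sort -u > "$tmp/this"
        if [ "$n" -eq 1 ]; then
            # The first proteome defines the initial core and pan sets.
            cp "$tmp/this" "$tmp/new"
            cp "$tmp/this" "$tmp/core"
            cp "$tmp/this" "$tmp/pan"
        else
            # New families: present here, absent from the pan set so far.
            comm -23 "$tmp/this" "$tmp/pan" > "$tmp/new"
            # Core: families shared with every proteome seen so far.
            comm -12 "$tmp/this" "$tmp/core" > "$tmp/core.next"
            mv "$tmp/core.next" "$tmp/core"
            # Pan: union of all families seen so far.
            sort -u "$tmp/this" "$tmp/pan" > "$tmp/pan.next"
            mv "$tmp/pan.next" "$tmp/pan"
        fi
        echo "$n $(wc -l < "$tmp/new" | tr -d ' ') $(wc -l < "$tmp/core" | tr -d ' ') $(wc -l < "$tmp/pan" | tr -d ' ')"
    done
    rm -rf "$tmp"
}

core_pan "famA famB famC" "famA famB famD" "famA famC famE famF"
```

With these three toy proteomes the function prints "1 3 3 3", "2 1 2 4" and "3 2 1 6": the core set can only shrink or stay the same as proteomes are added, while the pan set can only grow or stay the same.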
The script accepts a number of proteomes
(pr1, pr2, .. prN) and performs a BLAST search of each proteome against all the preceding ones:
- pr2 against pr1
- pr3 against pr1+pr2
- pr4 against pr1+pr2+pr3
- ...
- prN against pr1+pr2+pr3 ... pr[N-1]
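The search schedule above can be sketched as a small shell function. This is only an illustration of the ordering logic, not part of the course script: the function name schedule is hypothetical, the arguments are placeholder proteome names, and nothing is actually BLASTed.

```shell
# Print which cumulative searches would be run for an ordered list of
# proteome names. The first proteome has nothing to be searched against.
schedule() {
    previous=""
    for name in "$@"; do
        if [ -n "$previous" ]; then
            echo "$name against $previous"
        fi
        # Add the current proteome to the growing search database.
        previous="${previous:+$previous+}$name"
    done
}

schedule pr1 pr2 pr3 pr4
```

Calling the function with the same names in another order, for example schedule pr2 pr1 pr3 pr4, prints a different first search (pr1 against pr2), which shows that the set of searches depends on the order of the input list.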
After these searches, the program derives the number of core and pan
proteins for each proteome. The output list is then passed to an R script which plots all the
core/pan values as a function of the proteome number. Just like the BLAST matrix
script you tried yesterday, this script caches all the BLAST results. If you
change the order of the input proteomes, all BLAST searches must be carried out
again. Therefore, we have prepared two runs for you:
- Run A (burkholderia.listA)
less burkholderia.listA
perl coregenome-1.2 < burkholderia.listA > data.dat
less data.dat
R --vanilla < coreplot.R
mv plot.ps coreplotA.ps
gmake coreplotA.pdf
Open the PDF file: m18/coreplotA.pdf
- Run B (burkholderia.listB)
less burkholderia.listB
perl coregenome-1.2 < burkholderia.listB > data.dat
less data.dat
R --vanilla < coreplot.R
mv plot.ps coreplotB.ps
gmake coreplotB.pdf
Open the PDF file: m18/coreplotB.pdf
QUESTION What is the difference between the two input lists, and what is the difference between the two output plots?
|