|
Exercise M18
OBJECTIVES
In this exercies, we are going to wrap up the results you have been generating over the previous exercises. During the
Taxonomy lectures, we have been discussing alternative phylogenetic markers by which tree could be established. In this exercise
we will try and build heatmaps and dendrograms based on the basic properties like CDS count, tRNA count, Codon usage etc.
The results from this will be compared to a basic 16S rRNA tree which you are going to build.
Second, we will make an all-against-all comparison of all the proteins which you have predicted and finally, we will look at
protin localization using a DNA atlas for visualization.
|
-
LOG IN
Open a session on 'genome.cbs.dtu.dk', and from this login to ibiology and create directories for you to work in:
ssh -X ibiology
mkdir M18
cd M18
mkdir source
-
COMPARE NUMBER OF GENES
As was pointed out in lecture M8, the number of predicted coding sequences may vary from what is annotated. We will start out
by making a relational query to the CBS genome atlas database, and compare our predicted number of genes with what
is annotated. The atlas database keep the number of predicted CDS in the column 'NGENES'.
|
mysql
mysql>use brazilworkshop06
mysql>select genomes.organism, atlasdb.accession,count(*) as EasyGene_predicted,ngenes as GBK_Annotated,
(atlasdb.ngenes-count(*)) as Dif from genomes,features,genomeatlas.atlasdb where genomes.accession = features.accession and
features.user like 'phd01%' and featuretype='CDS' and atlasdb.accession = features.accession and genomes.user=features.user and
atlasdb.segment = 'main' group by atlasdb.accession;
mysql>exit
Comment on your result.
-
CONSTRUCT TREE BASED ON 16s rRNA
|
First, we will extract the first predicted 16s rRNA gene for every genome. The code is very similar
as what you used in exercise M11: Here, we are just looking for featuretypes of 'rRNA' and in the end
for the perl script we place a 'last'-statement which tells perl to skip the rest, once the first gene
has been processed.
When done, try and view the fasta file you produce (press 'q' to quit)
|
-
rm -f source/all.16s.fsa
foreach accession (AE017042 AP008232 AP009048 BA000021 CP000034 CP000036 CP000243 CP000247)
mysql -B -N -D brazilworkshop06 -e "select substr(seq,start,stop-start+1) as sequence,\
replace(substring_index(genomes.organism,' ',4),' ','_') as name,dir from genomes,features where \
genomes.accession = features.accession and genomes.user = features.user and \
features.user = user() and featuretype='rRNA' and features.accession = '$accession'" | \
perl -ane '$F[0] = reverse ( $F[0] ) if $F[2] eq "-";$F[0] =~ tr/ATGC/TACG/ if $F[2] eq "-";print ">$F[1]\n"; for ( $x = 0 ; $x < length ( $F[0] ) ; $x = $x+60) { print substr ( $F[0] , $x , 60),"\n"; } last;' >> source/all.16s.fsa
end
less source/all.16s.fsa
|
Next, you will be using 'clustal' to produce an alignment of all the 16s genes and generate a tree (press 'q' to quit)
|
-
clustalw source/all.16s.fsa
echo "4\n1\nsource/all.16s.aln\n4\nall.16s.ph\n\nx\n" | clustalw
less all.16s.ph
|
Start an SSH session to 'organism' and go to the M18 directory. From here, you can start the program to view the tree
|
-
njplot all.16s.ph &
-
COMPARE AMINO ACID USAGE
|
We will now reuse the codon/amino acid usage data which you have already generated in Exercise Module 11 to produce a 2 dimensional cluster (a so-called heatmap). To do this, we will be using the statistical package R
|
This code will collect the data, and produce a TAB-delimited file containing
all Amino acids, their frequences of all genomes.
-
cat ../M11/source/CP000247.codon-usage.aa.list | perl -ane 'next unless $F[1] =~ /^[A-Z]+/;print "\t$F[1]"; ' > aa.dat
mysql -B -N -e 'select accession,organism from brazilworkshop06.genomes where user=USER()' > organism.names
foreach accession (CP000247 BA000021 AP008232 CP000034 AE017042 CP000036 CP000243 AP009048)
cat ../M11/source/$accession.codon-usage.aa.list | perl -ane 'BEGIN { foreach (`cat organism.names`) {chomp; @A = split (/\t+/,$_);$NAMES{$A[0]} = $A[1];} } next unless $F[1] =~ /^[A-Z]+/;$FREQ{$F[1]} = $F[3];push @AA,$F[1]; END { print "\n"; print "$NAMES{'$accession'}"; foreach $aa (@AA){print "\t",$FREQ{$aa};}}' >> aa.dat
end
Before proceding, you should view the output file you have just generated (press 'q' to quit)
-
less aa.dat
Start R. In the following code, the '>' indicate that you are
at the R prompt and you should therefore not include '>'
when you are executing the commands (just as is was case with the mysql client)
-
# Don't forget to start R!:
R
# define output file
postscript ('aa.ps')
# colors for the Amino Acid classes
aaclass <- c('#CC33FF','#CC33FF','#CC33FF','#CC33FF','#996699','#996699','#99CC99','#99CC99','#99CC99','#99CC99','#666633','#666633','#FF0000','#FF0000','#0000FF','#0000FF','#99CCCC','#99CCCC','#99CCCC','#99CCCC')
# load a custom version of the R program 'heatmap' - this document also contain the myPalette function
source ( "/usr/opt/www/pub/CBS/courses/brazilworkshop/files/heatmap.R" )
# read in the data that you have generated above
data <- as.matrix ( read.table ( 'aa.dat' , row.names=1, header= TRUE,sep = "\t" ) )
# generate the color palette
pal <- maPalette(low = rgb(10,10,10,maxColorValue = 10), high = rgb(0,0,10,maxColorValue = 10), mid= rgb(5,5,10,maxColorValue = 10),, k = 1024)
# set margins of the plot
par(oma=c(0, 0, 0, 0.5))
# draw the heatmap
new_heatmap ( data , cexCol=1,cexRow=1,colors = pal,ColSideColors=aaclass,scale="none")
# draw the legend
legend (0,1,c('Aliphatic','Structural','Aromatic','Sulfur','+','-','Polar'),fill=c('#CC33FF','#996699','#99CC99','#666633','#FF0000','#0000FF','#99CCCC','#99CCCC'))
dev.off()
# leave the program
quit(save="no")
Open the produced PostScript file using ghostview. You should open a seperate SSH session to 'organism' in order to start ghostview
-
ghostview aa.ps &
The dendrograms on the side of the plots show the correlation between properties you are
comparing ( amino acid frequences )
Does the organisms cluster as you would expcet?
-
COMPARE CODON USAGE
We will do the exact same thing as above with only minor changes. Now we will collect
the codon usage lists rather than the amino acid usage lists
|
This code will extract all codon usage usage information which you have generated in Exercise Module 10
-
cat ../M11/source/CP000247.codon-usage.triplets.list | perl -ane 'next unless $F[0] =~ /^[A-Z]+/;print "\t$F[0]"; ' > triplets.dat
mysql -B -N -e 'select accession,organism from brazilworkshop06.genomes where user=USER()' > organism.names
foreach accession (CP000247 BA000021 AP008232 CP000034 AE017042 CP000036 CP000243 AP009048)
cat ../M11/source/$accession.codon-usage.triplets.list | perl -ane 'BEGIN { foreach (`cat organism.names`) {chomp; @A = split (/\t+/,$_);$NAMES{$A[0]} = $A[1];} } next unless $F[0] =~ /^[A-Z]+/;$FREQ{$F[0]} = $F[1];push @TRI,$F[0]; END { print "\n"; print "$NAMES{'$accession'}"; foreach $tri (@TRI){print "\t",$FREQ{$tri};}}' >> triplets.dat
end
Start R. In the following code, the '>' indicate that you are within the R prompt
and you should therefor not include '>' when you are executing the commands.
-
# Don't forget:
R
# defined output file
postscript ('triplets.ps')
# load a custom version of the R program 'heatmap' - this document also contain the myPalette function
source ( "/usr/opt/www/pub/CBS/courses/brazilworkshop/files/heatmap.R" )
# read in the data that you have generated above
data <- as.matrix ( read.table ( 'triplets.dat' , row.names=1, header= TRUE,sep = "\t" ) )
# generate the color palette
pal <- maPalette(low = rgb(10,10,10,maxColorValue = 10), high = rgb(10,0,0,maxColorValue = 10), mid= rgb(10,5,5,maxColorValue = 10),, k = 1024)
# set margins of the plot
par(oma=c(0, 0, 0, 0.5))
# draw the heatmap
new_heatmap ( data , cexCol=0.7,cexRow=1,colors = pal,scale="none")
# draw the legend
dev.off()
# leave the program
quit(save="no")
On your organism session, open the produced PostScript file using ghostview:
-
ghostview triplets.ps &
The dendrograms on the side of the plots show the correlation between properties you are
comparing ( codon frequences)
Does the organisms cluster as you would expcet?
-
VISUALIZING PROTEIN HOMOLOGY - THE BLAST MATRIX
We have now used R to get an overview of the properties of multiple genomes. R is good
for manipulating large amounts of statistical material, graphics, microarray analysis etc. In the next section
we will focus on finding genome similarities based on protein homology. You have already in
Exercise Module 4 carried out BLAST searches from your prompt. We will introduce a tool which was developed in our
group at CBS which generates a comparison of n genomes by BLASTing all genomes
against each other (generating n2 blast reports). This task is computationally intensive, and the
exercies may take a while to do. As most bioinformaticians has experienced, there exist many good opertunities to take a coffee-break!
You will however be offered to use an alternative pre-generated dataset. This data set should (if you have done everything correctly until now)
be identical to your dataset.
When you have finished this part you will end up with a graphic - a matrix - showing protein homology between
all genomes, visualized by a color scale.
The program take a large amount of input parameters, so we decided to place the entire setup of the search
in a structured XML document (click here to see what XML means).
|
First, copy the template XML-file into you directory:
-
cp /usr/opt/www/pub/CBS/courses/brazilworkshop/files/matrix.xml my_matrix.xml
Second, copy the proteins which you have predicted into a new directory, called 'prot':
-
mkdir prot
cp ../M11/source/*.proteins.fsa prot
|
So far, we have been nice and added references to all the protein files which you have
produced. However, you should go through the file and see how it looks. You may perhaps want to change
title and subtitle of your chart - or if you added other genomes in exercies M2, you need to specify them here.
Try also and locate where the E-value cutoff and ALR (alignment length ratio)
is specified and where references to your protein data is specified.
|
-
jpico my_matrix.xml
|
Save the file (CTRL-O, return), and close the editor (CTRL-X) when your done. If the servers will get too overloaded, you can as mentioned, use pre-generated data.
You do this by copying blastreports and protein sequences from a previous run. The way these results
are being stored and reused, is by generating an MD5 checksum for each input sequence. An example of a
checksum could be '3c50ae755e4931a3577cc5b50222c7ef'. The MD5 checksum serves as a fingerprint and it (within an overwhelming
statistical likelihood) garantee that if a change occurs in the input sequence, the MD5 checksum
will also change. The blast report of genome A against genome B will therefor be stored in the file aa-bb.blastout.gz where aa and bb are the MD5 checksum of
genomes A and B, respectively. This is an important take-home-message; it's a nice approach for maintaining data integrity!
As a note, we can tell you, that MySQL offers MD5 checksum generation on the fly, through the command 'MD5(<string>)'.
|
Unless you insist not to, please work on the pre-generated data set, copy the BLAST reports into the working directory of the program:
-
mkdir temp
cp /usr/opt/www/pub/CBS/courses/brazilworkshop/docs/blastmatrix/*.gz temp/
cp /usr/opt/www/pub/CBS/courses/brazilworkshop/docs/blastmatrix/*.proteins.fsa prot/
Save the file and close the editor. Once you are done with the XML file, you are ready to launch the program:
-
perl /home/people/pfh/scripts/blastmatrix/blastmatrix my_matrix.xml > my_matrix.ps
Examine the PostScript by opening it on your 'organism' session and try and make sense out of it!
-
VISUALIZING PROTEIN HOMOLOGY AND LOCALIZATION - THE BLAST ATLAS
We will now make use of the GeneWiz which was developed at CBS. The program was made
for visualizeing DNA properties on a per-basepair level.
We now want to use the program for mapping protein homology within a reference genome, which in this case
will be the largest of the E.coli genomes (E.coli UTI89, around 4,500 genes, 5.07MB).
|
We first copy the gene annotations which you have made by doing a MySQL query. We will create a *.ann file for each genome.
Also, in order to calculate some of the genomic properties (AT-content, and GC-skew, tab files is being created.
-
cp ../M11/source/CP000243.proteins.fsa .
cp ../M2/source/CP000243.fsa .
saco_convert -I fasta -O tab CP000243.fsa > CP000243.tab
mysql -N -B -e "select 'source','1',length(seq),'+','source' from brazilworkshop06.genomes where accession = 'CP000243' and user=USER()" | ~pfh/scripts/genewiz/tabann2ann.pl > CP000243.ann
mysql -N -B -e "select featuretype,start,stop,dir,featuretype from brazilworkshop06.features where accession = 'CP000243' and user=USER() order by start" | ~pfh/scripts/genewiz/tabann2ann.pl >> CP000243.ann
Copy a template genewiz configuration and open it in nedit:
-
cp /usr/opt/www/pub/CBS/courses/brazilworkshop/files/CP000243.baseatlas.cf .
jpico CP000243.baseatlas.cf
Take a good look at the file! This is a very basic configuration: It instructs the GeneWiz program to
draw a circular representation of the genome including AT content and GC-skew only. We are going to
modify this file to include BLAST search hits in the other 7 genomes. To do this, we will use the C
preprocessor (cpp): Everywhere it says '# include >filename<' the content of file
>filename< will be inserted at this point. This is a convinient way of automatically
generate the content of the configuration file (remember you have seven genomes to reference).
Find the places in the file where it say '#include'. This is where 'cpp' will insert
GeneWiz-code which you are about to generate. You will place the code in three different files called
DATA, CIRCLES, and FILES. Now, mysql can do many things - and why shouldn't it also
be able to generate a bit of text for our GeneWiz configuration file? We will now start to generate the three files:
|
-
mysql -B -N -e "select concat('file ',accession,'.proteins.blastatlas.genomemap0.gz dat;') from brazilworkshop06.genomes where user=user() and accession != 'CP000243'" > FILES
mysql -B -N -e 'select concat("circle ",genomes.accession,".proteins.blastatlas.genomemap0.gz 1 \"",color,".cm2\" by 0.0 10.0;") from brazilworkshop06.genomes,brazilworkshop06.colors where user=user() and colors.accession=genomes.accession and genomes.accession != "CP000243" order by organism' > CIRCLES
mysql -B -N -e 'select concat("dat ",accession,".proteins.blastatlas.genomemap0.gz 1 0.0 0.0 0.0 \"",organism,"\" boxfilter 10;") from brazilworkshop06.genomes where user=user() and accession != "CP000243" ' > DATA
Try and have a look at the new individual files:
-
less DATA
less FILES
less CIRCLES
Use cpp to merge them, and try have a look at the result
-
cpp -w -P CP000243.baseatlas.cf > CP000243.baseatlas.cpp.cf
less CP000243.baseatlas.cpp.cf
Now we need to create files which instructs the BLAST scripts
which genomes should be BLASTed against which. Each such 'queryblast' file contains just
a link to the reference genome (asuming there is a file called CP000243.proteins.fsa
-
foreach accession (CP000247 BA000021 AP008232 CP000034 AE017042 CP000036 AP009048)
cp ../M11/source/$accession.proteins.fsa .
echo './CP000243' > $accession.proteins.queryblast
end
We feel confidennt that you are now familiar with how things are done at the prompt,
and we are going to leave the 'engine room' for a while.
We will let a program do all the boring stuff, and execute blasta. Basically, what will happen is, that the program will build maps of the E. coli UTI89 genomes
containing E-values for the hits found within the other genomes.
Execute a single script which would make all the data-files which are required by
your GeneWiz configuration file. We don't have time however, to go into details about this - it's a complete chapter of it's own.
|
-
perl /home/people/pfh/scripts/genewiz/cf_data_create.pl CP000243.baseatlas.cpp.cf
/usr/cbs/tools/lib/make/bin/genewiz.Linux -p blastatlas.ps CP000243.baseatlas.cpp.cf
Go to your 'organism' session and open the PostScript file in ghostview. Does the visual representation give you the same impression of genome similarity? Can you find larger regions where genome lack
similarity?
-
CONSTRUCT TREE BASED GENOME PROPERTIES
|
We will now collect all the results you have been generating so fare (number sigma factors, number of coding genes, tRNAs rRNAs, genome length, and TCS.
Just like be did before, will will let R generate a heatmap. This time, we will need to scale the data. This is done in the R-script by the function 'scale'.
|
-
mysql -D brazilworkshop06 -N -B -e "select accession,organism from genomes where user=user() order by accession" > col1
mysql -D brazilworkshop06 -B -e "select accession,count(*) as CDS from features where user=user() and featuretype='CDS' group by accession order by accession" > col2
mysql -D brazilworkshop06 -B -e "select accession,count(*) as tRNA from features where user=user() and featuretype='tRNA' group by accession order by accession" > col3
mysql -D brazilworkshop06 -B -e "select accession,count(*) as rRNA from features where user=user() and featuretype='rRNA' group by accession order by accession" > col4
mysql -D brazilworkshop06 -B -e "select accession,count(*) as s54 from features where user=user() and featuretype='CDS' and note like '%s54%' group by accession order by accession" > col5
mysql -D brazilworkshop06 -B -e "select accession,count(*) as s70 from features where user=user() and featuretype='CDS' and note like '%s70%' group by accession order by accession" > col6
mysql -D brazilworkshop06 -B -e "select accession,count(*) as ecf from features where user=user() and featuretype='CDS' and note like '%ecf%' group by accession order by accession" > col7
mysql -D brazilworkshop06 -B -e "select accession,count(*) as RRreciever from features where user=user() and featuretype='CDS' and note like '%RRreciever%' group by accession order by accession" > col8
mysql -D brazilworkshop06 -B -e "select accession,count(*) as HisKA_1 from features where user=user() and featuretype='CDS' and note like '%HisKA_1%' group by accession order by accession" > col9
mysql -D brazilworkshop06 -B -e "select accession,count(*) as HisKA_3 from features where user=user() and featuretype='CDS' and note like '%HisKA_3%' group by accession order by accession" > col10
mysql -D brazilworkshop06 -B -e "select accession,length(seq) as Length from genomes where user=user() group by accession order by accession" > col11
perl -e 'foreach $i (1..11) { open IN , "col$i"; while (< IN >) { chomp;($a , $c) = split (/\t+/,$_); $DATA{$a}{$i} = $c;} close IN;} END {foreach $a ( reverse sort keys %DATA) {foreach $i (1..11) {$DATA{$a}{$i} = 0 if $DATA{$a}{$i} eq "" and $i > 1; print $DATA{$a}{$i}; print "\t" if $i < 11;}print "\n";}}' > final.dat
R
postscript ('final.ps')
source ( "/usr/opt/www/pub/CBS/courses/brazilworkshop/files/heatmap.R" )
data <- as.matrix ( read.table ( 'final.dat' , row.names=1, header= TRUE,sep = "\t" ) )
pal <- maPalette(low = rgb(10,10,10,maxColorValue = 10), high = rgb(0,0,10,maxColorValue = 10), mid= rgb(5,5,10,maxColorValue = 10),, k = 1024)
par(oma=c(0, 0, 0, 0.5))
data <- scale(data)
new_heatmap ( data , cexCol=1,cexRow=1,colors = pal)
dev.off()
quit(save="no")
View the output postscript file on organism, and compare it with the rRNA tree that you made in the beginning of the exercise
|