|
Exercise 5
OBJECTIVES
In this exercies, we are going to wrap up the results you have been generating over the previous exercises. During the
Taxonomy lectures, we have been discussing alternative phylogenetic markers by which tree could be established. In this exercise
we will try and build heatmaps and dendrograms based on the basic properties like CDS count, tRNA count, Codon usage etc.
The results from this will be compared to a basic 16S rRNA tree which you are going to build.
Second, we will make an all-against-all comparison of all the proteins which you have predicted and finally, we will look at
protin localization using a DNA atlas for visualization.
|
-
LOG IN
Open a session on 'organism.cbs.dtu.dk', and from this login to ibiology and create directories for you to work in:
ssh -X ibiology
cd projects
mkdir Ex5
cd Ex5
mkdir source
-
COMPARE NUMBER OF GENES
The number of predicted coding sequences may vary from what is annotated. We will start out
by making a relational query to the CBS genome atlas database, and compare our predicted number of genes with what
is annotated. The atlas database keep the number of predicted CDS in the column 'NGENES'.
|
mysql
mysql>use cmp_genomics;
mysql>select genomes.organism, atlasdb.accession,count(*) as EasyGene_predicted,ngenes as GBK_Annotated,(atlasdb.ngenes-count(*)) as Dif
from genomes,features,genomeatlas.atlasdb
where genomes.accession = features.accession and features.user like 'phd21%' and featuretype='CDS'
and atlasdb.accession = features.accession and genomes.user=features.user and atlasdb.segment = 'main'
group by atlasdb.accession;
mysql>exit
Comment on your result.
-
CONSTRUCT TREE BASED ON 16s rRNA
|
First, we will extract the first predicted 16s rRNA gene for every genome. The code is very similar
as what you used in exercise 3: Here, we are just looking for featuretypes of 'rRNA' and in the end
for the perl script we place a 'last'-statement which tells perl to skip the rest, once the first gene
has been processed.
When done, try and view the fasta file you produce (press 'q' to quit)
|
-
rm -f source/all.16s.fsa
foreach accession (AE017042 AE016879 AL111168 AL645882 AP008232 AP009048 BA000021 CP000034)
mysql -B -N -D cmp_genomics -e "select substr(seq,start,stop-start+1) as sequence,\
replace(substring_index(genomes.organism,' ',4),' ','_') as name,dir from genomes,features where \
genomes.accession = features.accession and genomes.user = features.user and \
features.user = user() and featuretype='rRNA' and features.accession = '$accession'" | \
perl -ane '$F[0] = reverse ( $F[0] ) if $F[2] eq "-";$F[0] =~ tr/ATGC/TACG/ if $F[2] eq "-";\
print ">$F[1]\n"; \
for ( $x = 0 ; $x < length ( $F[0] ) ; $x = $x+60) { print substr ( $F[0] , $x , 60),"\n"; } last;' \
>> source/all.16s.fsa
end
less source/all.16s.fsa
|
Next, you will be using 'clustal' to produce an alignment of all the 16s genes and generate a tree (press 'q' to quit)
|
-
clustalw source/all.16s.fsa
echo "4\n1\nsource/all.16s.aln\n4\nall.16s.ph\n\nx\n" | clustalw
less all.16s.ph
|
Start an SSH session to 'organism' and go to the Ex5 directory. From here, you can start the program to view the tree
|
-
njplot all.16s.ph &
-
COMPARE AMINO ACID USAGE
|
We will now reuse the codon/amino acid usage data which you have already generated in Exercise 3 to produce a 2 dimensional cluster (a so-called heatmap).
To do this, we will be using the statistical package R
|
This code will collect the data, and produce a TAB-delimited file containing
all Amino acids, their frequences of all genomes.
-
cat ~/Ex3/source/CP000034.codon-usage.aa.list \
| perl -ane 'next unless $F[1] =~ /^[A-Z]+/;print "\t$F[1]"; ' \
> aa.dat
mysql -B -N -e 'select accession,organism from cmp_genomics.genomes where user=USER()' \
> organism.names
foreach accession (AE017042 AE016879 AL111168 AL645882 AP008232 AP009048 BA000021 CP000034)
cat ~/Ex3/source/$accession.codon-usage.aa.list \
| perl -ane 'BEGIN { foreach (`cat organism.names`) \
{chomp; @A = split (/\t+/,$_);$NAMES{$A[0]} = $A[1];} } \
next unless $F[1] =~ /^[A-Z]+/;$FREQ{$F[1]} = $F[3]; push @AA,$F[1]; \
END { print "\n"; print "$NAMES{'$accession'}"; foreach $aa (@AA){print "\t",$FREQ{$aa};}}' \
>> aa.dat
end
Before proceding, you should view the output file you have just generated (press 'q' to quit)
-
less aa.dat
Start R. In the following code, the '>' indicate that you are
at the R prompt and you should therefore not include '>'
when you are executing the commands (just as is was case with the mysql client)
-
# Don't forget to start R!:
R
# define output file
postscript ('aa.ps')
# colors for the Amino Acid classes
aaclass <- c('#CC33FF','#CC33FF','#CC33FF','#CC33FF','#996699','#996699','#99CC99','#99CC99','#99CC99','#99CC99','#666633','#666633','#FF0000','#FF0000','#0000FF','#0000FF','#99CCCC','#99CCCC','#99CCCC','#99CCCC')
# load a custom version of the R program 'heatmap' - this document also contain the myPalette function
source ( "/home/people/kiil/Exercises/files/heatmap.R" )
# read in the data that you have generated above
data <- as.matrix ( read.table ( 'aa.dat' , row.names=1, header= TRUE,sep = "\t" ) )
# generate the color palette
pal <- maPalette(low = rgb(10,10,10,maxColorValue = 10), high = rgb(0,0,10,maxColorValue = 10), mid= rgb(5,5,10,maxColorValue = 10),, k = 1024)
# set margins of the plot
par(oma=c(0, 0, 0, 0.5))
# draw the heatmap
new_heatmap ( data , cexCol=1,cexRow=1,colors = pal,ColSideColors=aaclass,scale="none")
# draw the legend
legend (0,1,c('Aliphatic','Structural','Aromatic','Sulfur','+','-','Polar'),fill=c('#CC33FF','#996699','#99CC99','#666633','#FF0000','#0000FF','#99CCCC','#99CCCC'))
dev.off()
# leave the program
quit(save="no")
Open the produced PostScript file using ghostview. You should open a seperate SSH session to 'organism' in order to start ghostview
-
ghostview aa.ps &
The dendrograms on the side of the plots show the correlation between properties you are
comparing ( amino acid frequences )
Does the organisms cluster as you would expect?
-
COMPARE CODON USAGE
We will do the exact same thing as above with only minor changes. Now we will collect
the codon usage lists rather than the amino acid usage lists
|
This code will extract all codon usage usage information which you have generated in Exercise 3
-
cat ~/Ex3/source/CP000034.codon-usage.triplets.list \
| perl -ane 'next unless $F[0] =~ /^[A-Z]+/;print "\t$F[0]"; ' \
> triplets.dat
mysql -B -N -e 'select accession,organism from cmp_genomics.genomes where user=USER()' \
> organism.names
foreach accession (AE017042 AE016879 AL111168 AL645882 AP008232 AP009048 BA000021 CP000034)
cat ~/Ex3/source/$accession.codon-usage.triplets.list \
| perl -ane 'BEGIN { foreach (`cat organism.names`) \
{chomp; @A = split (/\t+/,$_);$NAMES{$A[0]} = $A[1];} } \
next unless $F[0] =~ /^[A-Z]+/;$FREQ{$F[0]} = $F[1];push @TRI,$F[0]; \
END { print "\n"; print "$NAMES{'$accession'}"; \
foreach $tri (@TRI){print "\t",$FREQ{$tri};}}' \
>> triplets.dat
end
Start R. In the following code, the '>' indicate that you are within the R prompt
and you should therefor not include '>' when you are executing the commands.
-
# Don't forget:
R
# defined output file
postscript ('triplets.ps')
# load a custom version of the R program 'heatmap' - this document also contain the myPalette function
source ( "/home/people/kiil/Exercises/files/heatmap.R" )
# read in the data that you have generated above
data <- as.matrix ( read.table ( 'triplets.dat' , row.names=1, header= TRUE,sep = "\t" ) )
# generate the color palette
pal <- maPalette(low = rgb(10,10,10,maxColorValue = 10), high = rgb(10,0,0,maxColorValue = 10), mid= rgb(10,5,5,maxColorValue = 10),, k = 1024)
# set margins of the plot
par(oma=c(0, 0, 0, 0.5))
# draw the heatmap
new_heatmap ( data , cexCol=0.7,cexRow=1,colors = pal,scale="none")
# draw the legend
dev.off()
# leave the program
quit(save="no")
On your organism session, open the produced PostScript file using ghostview:
-
ghostview triplets.ps &
The dendrograms on the side of the plots show the correlation between properties you are
comparing ( codon frequences)
Does the organisms cluster as you would expect?
|