The CBS web site is under hardware upgrade. Please report broken functionality to webmaster@cbs.dtu.dk. For a quick fix, replace 'www' with 'genome' in CBS URLs.
Events News Research CBS CBS Publications Bioinformatics
Staff Contact About Internal CBS CBS Other

Exercise 5


OBJECTIVES

In this exercies, we are going to wrap up the results you have been generating over the previous exercises. During the Taxonomy lectures, we have been discussing alternative phylogenetic markers by which tree could be established. In this exercise we will try and build heatmaps and dendrograms based on the basic properties like CDS count, tRNA count, Codon usage etc. The results from this will be compared to a basic 16S rRNA tree which you are going to build.

Second, we will make an all-against-all comparison of all the proteins which you have predicted and finally, we will look at protin localization using a DNA atlas for visualization.

  1. LOG IN

    Open a session on 'organism.cbs.dtu.dk', and from this login to ibiology and create directories for you to work in:
    ssh -X ibiology
    cd projects
    mkdir Ex5
    cd Ex5
    mkdir source
    
  2. COMPARE NUMBER OF GENES

    The number of predicted coding sequences may vary from what is annotated. We will start out by making a relational query to the CBS genome atlas database, and compare our predicted number of genes with what is annotated. The atlas database keep the number of predicted CDS in the column 'NGENES'.


    mysql
    mysql>use cmp_genomics;
    mysql>select genomes.organism, atlasdb.accession,count(*) as EasyGene_predicted,ngenes as GBK_Annotated,(atlasdb.ngenes-count(*)) as Dif 
          from genomes,features,genomeatlas.atlasdb  
          where genomes.accession = features.accession and features.user like 'phd21%' and featuretype='CDS' 
                and atlasdb.accession = features.accession and genomes.user=features.user and atlasdb.segment = 'main' 
          group by atlasdb.accession;
    mysql>exit
     
    Comment on your result.
  3. CONSTRUCT TREE BASED ON 16s rRNA

      First, we will extract the first predicted 16s rRNA gene for every genome. The code is very similar as what you used in exercise 3: Here, we are just looking for featuretypes of 'rRNA' and in the end for the perl script we place a 'last'-statement which tells perl to skip the rest, once the first gene has been processed. When done, try and view the fasta file you produce (press 'q' to quit)
    1. rm -f source/all.16s.fsa
      foreach accession (AE017042 AE016879 AL111168 AL645882 AP008232 AP009048 BA000021 CP000034)
         mysql -B -N -D cmp_genomics -e "select substr(seq,start,stop-start+1) as sequence,\
           replace(substring_index(genomes.organism,' ',4),' ','_') as name,dir from genomes,features where \
           genomes.accession = features.accession and genomes.user = features.user and \
           features.user = user() and featuretype='rRNA' and features.accession = '$accession'" | \
         perl -ane '$F[0] = reverse ( $F[0] ) if $F[2] eq "-";$F[0] =~ tr/ATGC/TACG/ if $F[2] eq "-";\
                    print ">$F[1]\n"; \
                    for ( $x = 0 ; $x < length ( $F[0] ) ; $x = $x+60) { print substr ( $F[0] , $x , 60),"\n"; } last;' \
         >> source/all.16s.fsa
      end
      less source/all.16s.fsa
      
    2. Next, you will be using 'clustal' to produce an alignment of all the 16s genes and generate a tree (press 'q' to quit)
    3. clustalw source/all.16s.fsa
      echo "4\n1\nsource/all.16s.aln\n4\nall.16s.ph\n\nx\n" | clustalw
      less all.16s.ph
      
    4. Start an SSH session to 'organism' and go to the Ex5 directory. From here, you can start the program to view the tree
    5. njplot all.16s.ph &
      
  4. COMPARE AMINO ACID USAGE

    We will now reuse the codon/amino acid usage data which you have already generated in Exercise 3 to produce a 2 dimensional cluster (a so-called heatmap). To do this, we will be using the statistical package R

      This code will collect the data, and produce a TAB-delimited file containing all Amino acids, their frequences of all genomes.
    1. cat ~/Ex3/source/CP000034.codon-usage.aa.list \
      | perl -ane 'next unless $F[1] =~ /^[A-Z]+/;print "\t$F[1]"; ' \
      > aa.dat
      mysql -B -N -e 'select accession,organism from cmp_genomics.genomes where user=USER()' \
      > organism.names
      foreach accession (AE017042 AE016879 AL111168 AL645882 AP008232 AP009048 BA000021 CP000034)
        cat ~/Ex3/source/$accession.codon-usage.aa.list \
        | perl -ane 'BEGIN { foreach (`cat organism.names`) \
                             {chomp; @A = split (/\t+/,$_);$NAMES{$A[0]} = $A[1];} } \
                     next unless $F[1] =~ /^[A-Z]+/;$FREQ{$F[1]} = $F[3]; push @AA,$F[1]; \
                     END { print "\n"; print "$NAMES{'$accession'}"; foreach $aa (@AA){print "\t",$FREQ{$aa};}}' \
        >> aa.dat
      end
      
    2. Before proceding, you should view the output file you have just generated (press 'q' to quit)
    3. less aa.dat
    4. Start R. In the following code, the '>' indicate that you are at the R prompt and you should therefore not include '>' when you are executing the commands (just as is was case with the mysql client)
    5. # Don't forget to start R!:
      R
      # define output file
      postscript ('aa.ps')
      # colors for the Amino Acid classes
      aaclass <- c('#CC33FF','#CC33FF','#CC33FF','#CC33FF','#996699','#996699','#99CC99','#99CC99','#99CC99','#99CC99','#666633','#666633','#FF0000','#FF0000','#0000FF','#0000FF','#99CCCC','#99CCCC','#99CCCC','#99CCCC')
      # load a custom version of the R program 'heatmap' - this document also contain the myPalette function
      source ( "/home/people/kiil/Exercises/files/heatmap.R" )
      # read in the data that you have generated above
      data <- as.matrix ( read.table ( 'aa.dat' , row.names=1, header= TRUE,sep = "\t" ) )
      # generate the color palette
      pal <- maPalette(low = rgb(10,10,10,maxColorValue = 10), high = rgb(0,0,10,maxColorValue = 10), mid= rgb(5,5,10,maxColorValue = 10),, k = 1024)
      # set margins of the plot
      par(oma=c(0, 0, 0, 0.5))
      # draw the heatmap
      new_heatmap ( data , cexCol=1,cexRow=1,colors = pal,ColSideColors=aaclass,scale="none")
      # draw the legend
      legend (0,1,c('Aliphatic','Structural','Aromatic','Sulfur','+','-','Polar'),fill=c('#CC33FF','#996699','#99CC99','#666633','#FF0000','#0000FF','#99CCCC','#99CCCC'))
      dev.off()
      # leave the program
      quit(save="no")
       
    6. Open the produced PostScript file using ghostview. You should open a seperate SSH session to 'organism' in order to start ghostview
    7. ghostview aa.ps &
      
    8. The dendrograms on the side of the plots show the correlation between properties you are comparing ( amino acid frequences )
      Does the organisms cluster as you would expect?
  5. COMPARE CODON USAGE

    We will do the exact same thing as above with only minor changes. Now we will collect the codon usage lists rather than the amino acid usage lists
      This code will extract all codon usage usage information which you have generated in Exercise 3
    1. cat ~/Ex3/source/CP000034.codon-usage.triplets.list \
      | perl -ane 'next unless $F[0] =~ /^[A-Z]+/;print "\t$F[0]"; ' \
      > triplets.dat
      mysql -B -N -e 'select accession,organism from cmp_genomics.genomes where user=USER()' \
      > organism.names
      foreach accession (AE017042 AE016879 AL111168 AL645882 AP008232 AP009048 BA000021 CP000034)
        cat ~/Ex3/source/$accession.codon-usage.triplets.list \
        | perl -ane 'BEGIN { foreach (`cat organism.names`) \
                             {chomp; @A = split (/\t+/,$_);$NAMES{$A[0]} = $A[1];} } \
                     next unless $F[0] =~ /^[A-Z]+/;$FREQ{$F[0]} = $F[1];push @TRI,$F[0]; \
                     END { print "\n"; print "$NAMES{'$accession'}"; \
                           foreach $tri (@TRI){print "\t",$FREQ{$tri};}}' \
        >> triplets.dat
      end
       
    2. Start R. In the following code, the '>' indicate that you are within the R prompt and you should therefor not include '>' when you are executing the commands.
    3. # Don't forget:
      R
      # defined output file
      postscript ('triplets.ps')
      # load a custom version of the R program 'heatmap' - this document also contain the myPalette function
      source ( "/home/people/kiil/Exercises/files/heatmap.R" )
      # read in the data that you have generated above
      data <- as.matrix ( read.table ( 'triplets.dat' , row.names=1, header= TRUE,sep = "\t" ) )
      # generate the color palette
      pal <- maPalette(low = rgb(10,10,10,maxColorValue = 10), high = rgb(10,0,0,maxColorValue = 10), mid= rgb(10,5,5,maxColorValue = 10),, k = 1024)
      # set margins of the plot
      par(oma=c(0, 0, 0, 0.5))
      # draw the heatmap
      new_heatmap ( data , cexCol=0.7,cexRow=1,colors = pal,scale="none")
      # draw the legend
      dev.off()
      # leave the program
      quit(save="no")
       
    4. On your organism session, open the produced PostScript file using ghostview:
    5. ghostview triplets.ps &
      
    6. The dendrograms on the side of the plots show the correlation between properties you are comparing ( codon frequences)
      Does the organisms cluster as you would expect?