Events News Research CBS CBS Publications Bioinformatics
Staff Contact About Internal CBS CBS Other

Exercise M18


OBJECTIVES

In this exercies, we are going to wrap up the results you have been generating over the previous exercises. During the Taxonomy lectures, we have been discussing alternative phylogenetic markers by which tree could be established. In this exercise we will try and build heatmaps and dendrograms based on the basic properties like CDS count, tRNA count, Codon usage etc. The results from this will be compared to a basic 16S rRNA tree which you are going to build.

Second, we will make an all-against-all comparison of all the proteins which you have predicted and finally, we will look at protin localization using a DNA atlas for visualization.

  1. LOG IN

    Open a session on 'genome.cbs.dtu.dk', and from this login to ibiology and create directories for you to work in:
    ssh -X ibiology
    mkdir M18
    cd M18
    mkdir source
    
  2. COMPARE NUMBER OF GENES

    As was pointed out in lecture M8, the number of predicted coding sequences may vary from what is annotated. We will start out by making a relational query to the CBS genome atlas database, and compare our predicted number of genes with what is annotated. The atlas database keep the number of predicted CDS in the column 'NGENES'.


    mysql
    mysql>use brazilworkshop06
    mysql>select genomes.organism, atlasdb.accession,count(*) as EasyGene_predicted,ngenes as GBK_Annotated,
    (atlasdb.ngenes-count(*)) as Dif from genomes,features,genomeatlas.atlasdb  where genomes.accession = features.accession and 
    features.user like 'phd01%' and featuretype='CDS' and atlasdb.accession = features.accession and genomes.user=features.user and 
    atlasdb.segment = 'main' group by atlasdb.accession;
    mysql>exit
     
    Comment on your result.
  3. CONSTRUCT TREE BASED ON 16s rRNA

      First, we will extract the first predicted 16s rRNA gene for every genome. The code is very similar as what you used in exercise M11: Here, we are just looking for featuretypes of 'rRNA' and in the end for the perl script we place a 'last'-statement which tells perl to skip the rest, once the first gene has been processed. When done, try and view the fasta file you produce (press 'q' to quit)
    1. rm -f source/all.16s.fsa
      foreach accession (AE017042 AP008232 AP009048 BA000021 CP000034 CP000036 CP000243 CP000247)
         mysql -B -N -D brazilworkshop06 -e "select substr(seq,start,stop-start+1) as sequence,\
           replace(substring_index(genomes.organism,' ',4),' ','_') as name,dir from genomes,features where \
           genomes.accession = features.accession and genomes.user = features.user and \
           features.user = user() and featuretype='rRNA' and features.accession = '$accession'" | \
         perl -ane '$F[0] = reverse ( $F[0] ) if $F[2] eq "-";$F[0] =~ tr/ATGC/TACG/ if $F[2] eq "-";print ">$F[1]\n"; for ( $x = 0 ; $x < length ( $F[0] ) ; $x = $x+60) { print substr ( $F[0] , $x , 60),"\n"; } last;' >> source/all.16s.fsa
      end
      less source/all.16s.fsa
      
    2. Next, you will be using 'clustal' to produce an alignment of all the 16s genes and generate a tree (press 'q' to quit)
    3. clustalw source/all.16s.fsa
      echo "4\n1\nsource/all.16s.aln\n4\nall.16s.ph\n\nx\n" | clustalw
      less all.16s.ph
      
    4. Start an SSH session to 'organism' and go to the M18 directory. From here, you can start the program to view the tree
    5. njplot all.16s.ph &
      
  4. COMPARE AMINO ACID USAGE

    We will now reuse the codon/amino acid usage data which you have already generated in Exercise Module 11 to produce a 2 dimensional cluster (a so-called heatmap). To do this, we will be using the statistical package R

      This code will collect the data, and produce a TAB-delimited file containing all Amino acids, their frequences of all genomes.
    1. cat ../M11/source/CP000247.codon-usage.aa.list | perl -ane 'next unless $F[1] =~ /^[A-Z]+/;print "\t$F[1]"; ' > aa.dat
      mysql -B -N -e 'select accession,organism from brazilworkshop06.genomes where user=USER()' > organism.names
      foreach accession (CP000247 BA000021 AP008232 CP000034 AE017042 CP000036 CP000243 AP009048)
        cat ../M11/source/$accession.codon-usage.aa.list | perl -ane 'BEGIN { foreach (`cat organism.names`) {chomp; @A = split (/\t+/,$_);$NAMES{$A[0]} = $A[1];} } next unless $F[1] =~ /^[A-Z]+/;$FREQ{$F[1]} = $F[3];push @AA,$F[1]; END { print "\n"; print "$NAMES{'$accession'}"; foreach $aa (@AA){print "\t",$FREQ{$aa};}}' >> aa.dat
      end
       
    2. Before proceding, you should view the output file you have just generated (press 'q' to quit)
    3. less aa.dat
    4. Start R. In the following code, the '>' indicate that you are at the R prompt and you should therefore not include '>' when you are executing the commands (just as is was case with the mysql client)
    5. # Don't forget to start R!:
      R
      # define output file
      postscript ('aa.ps')
      # colors for the Amino Acid classes
      aaclass <- c('#CC33FF','#CC33FF','#CC33FF','#CC33FF','#996699','#996699','#99CC99','#99CC99','#99CC99','#99CC99','#666633','#666633','#FF0000','#FF0000','#0000FF','#0000FF','#99CCCC','#99CCCC','#99CCCC','#99CCCC')
      # load a custom version of the R program 'heatmap' - this document also contain the myPalette function
      source ( "/usr/opt/www/pub/CBS/courses/brazilworkshop/files/heatmap.R" )
      # read in the data that you have generated above
      data <- as.matrix ( read.table ( 'aa.dat' , row.names=1, header= TRUE,sep = "\t" ) )
      # generate the color palette
      pal <- maPalette(low = rgb(10,10,10,maxColorValue = 10), high = rgb(0,0,10,maxColorValue = 10), mid= rgb(5,5,10,maxColorValue = 10),, k = 1024)
      # set margins of the plot
      par(oma=c(0, 0, 0, 0.5))
      # draw the heatmap
      new_heatmap ( data , cexCol=1,cexRow=1,colors = pal,ColSideColors=aaclass,scale="none")
      # draw the legend
      legend (0,1,c('Aliphatic','Structural','Aromatic','Sulfur','+','-','Polar'),fill=c('#CC33FF','#996699','#99CC99','#666633','#FF0000','#0000FF','#99CCCC','#99CCCC'))
      dev.off()
      # leave the program
      quit(save="no")
       
    6. Open the produced PostScript file using ghostview. You should open a seperate SSH session to 'organism' in order to start ghostview
    7. ghostview aa.ps &
      
    8. The dendrograms on the side of the plots show the correlation between properties you are comparing ( amino acid frequences )
      Does the organisms cluster as you would expcet?
  5. COMPARE CODON USAGE

    We will do the exact same thing as above with only minor changes. Now we will collect the codon usage lists rather than the amino acid usage lists
      This code will extract all codon usage usage information which you have generated in Exercise Module 10
    1. cat ../M11/source/CP000247.codon-usage.triplets.list | perl -ane 'next unless $F[0] =~ /^[A-Z]+/;print "\t$F[0]"; ' > triplets.dat
      mysql -B -N -e 'select accession,organism from brazilworkshop06.genomes where user=USER()' > organism.names
      foreach accession (CP000247 BA000021 AP008232 CP000034 AE017042 CP000036 CP000243 AP009048)
        cat ../M11/source/$accession.codon-usage.triplets.list | perl -ane 'BEGIN { foreach (`cat organism.names`) {chomp; @A = split (/\t+/,$_);$NAMES{$A[0]} = $A[1];} } next unless $F[0] =~ /^[A-Z]+/;$FREQ{$F[0]} = $F[1];push @TRI,$F[0]; END { print "\n"; print "$NAMES{'$accession'}"; foreach $tri (@TRI){print "\t",$FREQ{$tri};}}' >> triplets.dat
      end
       
    2. Start R. In the following code, the '>' indicate that you are within the R prompt and you should therefor not include '>' when you are executing the commands.
    3. # Don't forget:
      R
      # defined output file
      postscript ('triplets.ps')
      # load a custom version of the R program 'heatmap' - this document also contain the myPalette function
      source ( "/usr/opt/www/pub/CBS/courses/brazilworkshop/files/heatmap.R" )
      # read in the data that you have generated above
      data <- as.matrix ( read.table ( 'triplets.dat' , row.names=1, header= TRUE,sep = "\t" ) )
      # generate the color palette
      pal <- maPalette(low = rgb(10,10,10,maxColorValue = 10), high = rgb(10,0,0,maxColorValue = 10), mid= rgb(10,5,5,maxColorValue = 10),, k = 1024)
      # set margins of the plot
      par(oma=c(0, 0, 0, 0.5))
      # draw the heatmap
      new_heatmap ( data , cexCol=0.7,cexRow=1,colors = pal,scale="none")
      # draw the legend
      dev.off()
      # leave the program
      quit(save="no")
       
    4. On your organism session, open the produced PostScript file using ghostview:
    5. ghostview triplets.ps &
      
    6. The dendrograms on the side of the plots show the correlation between properties you are comparing ( codon frequences)
      Does the organisms cluster as you would expcet?
  6. VISUALIZING PROTEIN HOMOLOGY - THE BLAST MATRIX

    We have now used R to get an overview of the properties of multiple genomes. R is good for manipulating large amounts of statistical material, graphics, microarray analysis etc. In the next section we will focus on finding genome similarities based on protein homology. You have already in Exercise Module 4 carried out BLAST searches from your prompt. We will introduce a tool which was developed in our group at CBS which generates a comparison of n genomes by BLASTing all genomes against each other (generating n2 blast reports). This task is computationally intensive, and the exercies may take a while to do. As most bioinformaticians has experienced, there exist many good opertunities to take a coffee-break!

    You will however be offered to use an alternative pre-generated dataset. This data set should (if you have done everything correctly until now) be identical to your dataset.

    When you have finished this part you will end up with a graphic - a matrix - showing protein homology between all genomes, visualized by a color scale.

    The program take a large amount of input parameters, so we decided to place the entire setup of the search in a structured XML document (click here to see what XML means).


      First, copy the template XML-file into you directory:
    1. cp /usr/opt/www/pub/CBS/courses/brazilworkshop/files/matrix.xml my_matrix.xml
      
    2. Second, copy the proteins which you have predicted into a new directory, called 'prot':
    3. mkdir prot
      cp ../M11/source/*.proteins.fsa prot
      
    4. So far, we have been nice and added references to all the protein files which you have produced. However, you should go through the file and see how it looks. You may perhaps want to change title and subtitle of your chart - or if you added other genomes in exercies M2, you need to specify them here. Try also and locate where the E-value cutoff and ALR (alignment length ratio) is specified and where references to your protein data is specified.


    5. jpico my_matrix.xml
      
    6. Save the file (CTRL-O, return), and close the editor (CTRL-X) when your done. If the servers will get too overloaded, you can as mentioned, use pre-generated data. You do this by copying blastreports and protein sequences from a previous run. The way these results are being stored and reused, is by generating an MD5 checksum for each input sequence. An example of a checksum could be '3c50ae755e4931a3577cc5b50222c7ef'. The MD5 checksum serves as a fingerprint and it (within an overwhelming statistical likelihood) garantee that if a change occurs in the input sequence, the MD5 checksum will also change. The blast report of genome A against genome B will therefor be stored in the file aa-bb.blastout.gz where aa and bb are the MD5 checksum of genomes A and B, respectively. This is an important take-home-message; it's a nice approach for maintaining data integrity! As a note, we can tell you, that MySQL offers MD5 checksum generation on the fly, through the command 'MD5(<string>)'.


      Unless you insist not to, please work on the pre-generated data set, copy the BLAST reports into the working directory of the program:
    7. mkdir temp
      cp /usr/opt/www/pub/CBS/courses/brazilworkshop/docs/blastmatrix/*.gz temp/
      cp /usr/opt/www/pub/CBS/courses/brazilworkshop/docs/blastmatrix/*.proteins.fsa prot/
       
      
    8. Save the file and close the editor. Once you are done with the XML file, you are ready to launch the program:
    9. perl /home/people/pfh/scripts/blastmatrix/blastmatrix my_matrix.xml > my_matrix.ps
      
    10. Examine the PostScript by opening it on your 'organism' session and try and make sense out of it!
  7. VISUALIZING PROTEIN HOMOLOGY AND LOCALIZATION - THE BLAST ATLAS

    We will now make use of the GeneWiz which was developed at CBS. The program was made for visualizeing DNA properties on a per-basepair level.

    We now want to use the program for mapping protein homology within a reference genome, which in this case will be the largest of the E.coli genomes (E.coli UTI89, around 4,500 genes, 5.07MB).

      We first copy the gene annotations which you have made by doing a MySQL query. We will create a *.ann file for each genome. Also, in order to calculate some of the genomic properties (AT-content, and GC-skew, tab files is being created.
    1. cp ../M11/source/CP000243.proteins.fsa .
      cp ../M2/source/CP000243.fsa .
      saco_convert -I fasta -O tab CP000243.fsa > CP000243.tab
      mysql -N -B -e "select 'source','1',length(seq),'+','source' from brazilworkshop06.genomes where accession = 'CP000243' and user=USER()" | ~pfh/scripts/genewiz/tabann2ann.pl > CP000243.ann
      mysql -N -B -e "select featuretype,start,stop,dir,featuretype from brazilworkshop06.features where accession = 'CP000243' and user=USER() order by start" | ~pfh/scripts/genewiz/tabann2ann.pl >> CP000243.ann
      
    2. Copy a template genewiz configuration and open it in nedit:
    3. cp /usr/opt/www/pub/CBS/courses/brazilworkshop/files/CP000243.baseatlas.cf .
      jpico CP000243.baseatlas.cf
       
      
      Take a good look at the file! This is a very basic configuration: It instructs the GeneWiz program to draw a circular representation of the genome including AT content and GC-skew only. We are going to modify this file to include BLAST search hits in the other 7 genomes. To do this, we will use the C preprocessor (cpp): Everywhere it says '# include >filename<' the content of file >filename< will be inserted at this point. This is a convinient way of automatically generate the content of the configuration file (remember you have seven genomes to reference).

      Find the places in the file where it say '#include'. This is where 'cpp' will insert GeneWiz-code which you are about to generate. You will place the code in three different files called DATA, CIRCLES, and FILES. Now, mysql can do many things - and why shouldn't it also be able to generate a bit of text for our GeneWiz configuration file? We will now start to generate the three files:
    4. mysql -B -N -e "select concat('file ',accession,'.proteins.blastatlas.genomemap0.gz dat;') from brazilworkshop06.genomes where user=user() and accession != 'CP000243'" > FILES
      mysql -B -N -e 'select concat("circle ",genomes.accession,".proteins.blastatlas.genomemap0.gz 1 \"",color,".cm2\" by 0.0 10.0;") from brazilworkshop06.genomes,brazilworkshop06.colors where user=user() and colors.accession=genomes.accession and genomes.accession != "CP000243" order by organism' > CIRCLES
      mysql -B -N -e 'select concat("dat ",accession,".proteins.blastatlas.genomemap0.gz 1 0.0 0.0 0.0 \"",organism,"\" boxfilter 10;") from brazilworkshop06.genomes where user=user() and accession != "CP000243" ' > DATA
      
    5. Try and have a look at the new individual files:
    6. less DATA
      less FILES
      less CIRCLES
      
    7. Use cpp to merge them, and try have a look at the result
    8. cpp -w -P CP000243.baseatlas.cf > CP000243.baseatlas.cpp.cf
      less CP000243.baseatlas.cpp.cf
      
    9. Now we need to create files which instructs the BLAST scripts which genomes should be BLASTed against which. Each such 'queryblast' file contains just a link to the reference genome (asuming there is a file called CP000243.proteins.fsa
    10. foreach accession (CP000247 BA000021 AP008232 CP000034 AE017042 CP000036 AP009048)
      cp ../M11/source/$accession.proteins.fsa .
      echo './CP000243' > $accession.proteins.queryblast
      end
      
    11. We feel confidennt that you are now familiar with how things are done at the prompt, and we are going to leave the 'engine room' for a while. We will let a program do all the boring stuff, and execute blasta. Basically, what will happen is, that the program will build maps of the E. coli UTI89 genomes containing E-values for the hits found within the other genomes.

      Execute a single script which would make all the data-files which are required by your GeneWiz configuration file. We don't have time however, to go into details about this - it's a complete chapter of it's own.
    12. perl /home/people/pfh/scripts/genewiz/cf_data_create.pl CP000243.baseatlas.cpp.cf
      /usr/cbs/tools/lib/make/bin/genewiz.Linux -p blastatlas.ps CP000243.baseatlas.cpp.cf
      
      
    13. Go to your 'organism' session and open the PostScript file in ghostview. Does the visual representation give you the same impression of genome similarity? Can you find larger regions where genome lack similarity?
  8. CONSTRUCT TREE BASED GENOME PROPERTIES

      We will now collect all the results you have been generating so fare (number sigma factors, number of coding genes, tRNAs rRNAs, genome length, and TCS. Just like be did before, will will let R generate a heatmap. This time, we will need to scale the data. This is done in the R-script by the function 'scale'.
    1. mysql -D brazilworkshop06 -N -B -e "select accession,organism from genomes where user=user() order by accession" > col1
      mysql -D brazilworkshop06 -B -e "select accession,count(*) as CDS from features where user=user() and featuretype='CDS' group by accession order by accession" > col2
      mysql -D brazilworkshop06 -B -e "select accession,count(*) as tRNA from features where user=user() and featuretype='tRNA' group by accession order by accession"  > col3
      mysql -D brazilworkshop06 -B -e "select accession,count(*) as rRNA from features where user=user() and featuretype='rRNA' group by accession order by accession" > col4
      mysql -D brazilworkshop06 -B -e "select accession,count(*) as s54 from features where user=user() and featuretype='CDS' and note like '%s54%' group by accession order by accession" > col5
      mysql -D brazilworkshop06 -B -e "select accession,count(*) as s70 from features where user=user() and featuretype='CDS' and note like '%s70%' group by accession order by accession" > col6
      mysql -D brazilworkshop06 -B -e "select accession,count(*) as ecf from features where user=user() and featuretype='CDS' and note like '%ecf%' group by accession order by accession" > col7
      mysql -D brazilworkshop06 -B -e "select accession,count(*) as RRreciever from features where user=user() and featuretype='CDS' and note like '%RRreciever%' group by accession order by accession" > col8
      mysql -D brazilworkshop06 -B -e "select accession,count(*) as HisKA_1 from features where user=user() and featuretype='CDS' and note like '%HisKA_1%' group by accession order by accession" > col9
      mysql -D brazilworkshop06 -B -e "select accession,count(*) as HisKA_3 from features where user=user() and featuretype='CDS' and note like '%HisKA_3%' group by accession order by accession" > col10
      mysql -D brazilworkshop06 -B -e "select accession,length(seq) as Length from genomes where user=user() group by accession order by accession" > col11
      
      perl -e 'foreach $i (1..11) { open IN , "col$i"; while (< IN >) { chomp;($a , $c) = split (/\t+/,$_); $DATA{$a}{$i} = $c;} close IN;} END {foreach $a ( reverse sort keys %DATA) {foreach $i (1..11) {$DATA{$a}{$i} = 0 if $DATA{$a}{$i} eq "" and $i > 1; print $DATA{$a}{$i}; print "\t" if $i < 11;}print "\n";}}' > final.dat
      
      R
      postscript ('final.ps')
      source ( "/usr/opt/www/pub/CBS/courses/brazilworkshop/files/heatmap.R" )
      data <- as.matrix ( read.table ( 'final.dat' , row.names=1, header= TRUE,sep = "\t" ) )
      pal <- maPalette(low = rgb(10,10,10,maxColorValue = 10), high = rgb(0,0,10,maxColorValue = 10), mid= rgb(5,5,10,maxColorValue = 10),, k = 1024)
      par(oma=c(0, 0, 0, 0.5))
      data  <- scale(data)
      new_heatmap ( data , cexCol=1,cexRow=1,colors = pal)
      dev.off()
      quit(save="no")
      
      
      View the output postscript file on organism, and compare it with the rRNA tree that you made in the beginning of the exercise