Copy files

  1. Copy the files needed for the exercise
    umask 022
    setenv MAKEFILES /home/databases/genomeatlasdb-3.0_cur/make/Makefile
    set accession = CP000025
    cp -rL ~www/pub/CBS/courses/thaiworkshop08/m12 ~/
    cp ~/m8/$accession.ann ~/m12
    cp ~/m8/$accession.proteins.fsa ~/m12
    cp ~/m8/$accession.fsa ~/m12
    cp ~/m8/$accession.orfs.fsa ~/m12
    cd ~/m12

DNA physical/chemical properties

  1. Calculate global AT content
    We start by measuring the AT content of the entire genome.
    set accession = CP000025
    perl -ne 'uc ; next unless /^([A-Za-z]+)/;for($x=0;$x< length($1);$x++){\
      $AT++ if ( substr($1,$x,1) eq "A" or substr($1,$x,1) eq "T"); $TOT++;} END {\
      printf ("%0.4f\n",(($AT) / $TOT));}' < $accession.fsa > $accession.ATcontent
    set ATcontent = `cat $accession.ATcontent`
    mysql -B -e "update MicrobialGenomicsGroup.thaiworkshop08 set ATcontent = $ATcontent where accession = '$accession'"

    QUESTION We are using the scripting language Perl, which is widely used in the field of bioinformatics. The code is what we call a Perl one-liner, referring to it being written as a single command line (specified by the -e). You would normally write this as a script instead, unless you have a really good overview ;-) Below is a nicer version of the same script: Try to understand what is going on in the script:

  2. Codon and amino acid usage
    We shall now use your prediction of open reading frames and their translation to produce a circular representation of the codon- and amino acid usage.
    perl < $accession.proteins.fsa > $accession.aa.png
    perl < $accession.orfs.fsa > $accession.codon.png
    Open the PDF file:m12/CP000025.aa.png
    Open the PDF file:m12/CP000025.codon.png
    QUESTION How is the AT content of the organism reflected in the codons used? (hint:base wobbling)

  3. Profile of AT content
    This gmake command will execute many commands. As was presented during tuesdays presentations, we are generating an AT content profile of all the 400 bp upstream/downstream regions relative to the genes you have predicted using EasyGene. This is possible, since you have created the annotation file yesterday (*.ann)
    QUESTION Can you identify the base wobbling?

    saco_convert -I fasta -O tab $accession.fsa > $
    gmake -j $accession.ATcontentBits.400profile400.pdf
    Open the PDF file:m12/CP000025.ATcontentBits.400profile400.pdf
    QUESTION What does saco_convert do? (hint:base wobbling)
  4. Profile of structural parameters

  5. In a similar way we will now generate some of the DNA parameters you have already heard about, and extract values located around the translation start:
    gmake -j $accession.promoter.400profile400.gauss5.pdf
    Open the PDF file:m12/CP000025.promoter.400profile400.gauss5.pdf
    QUESTION Are there changes in the structural properties which you can account for - consult the text book

  6. Fetch pre-calculated SIDD profiles

  7. We will now from a Web Services repository of SIDD values, download the free energy (dG) measures at various super helical densities and observe how the vary (if at all) near the translation start.
    perl $accession '-0.025' > $accession.sidd25dG &
    perl $accession '-0.035' > $accession.sidd35dG &
    perl $accession '-0.045' > $accession.sidd45dG &
    perl $accession '-0.055' > $accession.sidd55dG &
    gmake -j $accession.sidddG.400profile400.gauss5.pdf
    Open the PDF file:m12/CP000025.sidddG.400profile400.gauss5.pdf
QUESTION Can you explain why you observe no sharp peaks in the structural measures (curvature, stacking energy, and position preference) or in the SIDD measure near the translation start site - that is, why do you observe very broad and soft peaks?

Making a Genome Atlas

  1. Construct a standard 'Genome Atlas'
    The BLAST atlas web service allows you to create a circular representation of the structural parameters, in a simple and easy way.

    set organism = `getgene $accession | perl -ne 'if (/^  ORGANISM  /) {s/^  ORGANISM  //; s/^([^ ])[^ ]* ([^ 0-9]+( .*)?)$/$1. $2/; print;}'`
    echo $organism
    perl -ref $accession.fsa -t "$organism" \
     -dnap "Intrinsic Curvature,Stacking Energy,Position Preference,Global Direct Repeats,Global Inverted Repeats,GC Skew,Percent AT" \
     -ann $accession.ann > genomeatlas.pdf
    Open the PDF file:m12/genomeatlas.pdf

    Here, the options refer to: -t being the title for the plot.-ref being the DNA sequence of which we plot the properties, -dnap specifying which DNA properties to include, and -ann is pointing to the annotation file you made - the annotations placed here will be plotted on the atlas.

    QUESTION Identify the rRNA genes you have predicted (see the legend on page two to see the color). Do you find any level of direct or inverted repeats in this area? Why / why not? (hint: count how many rRNA genes did you find? You may want to look in your *.rrna.fsa file to see which rRNA molecules you predicted)

  2. Get all the predicted translations of the other groups
    We will now construct a small configuration file, which instructs the script to blast the predicted proteins of your personal bacterium against all the proteins of all the bacteria of the other groups!
    rm -f blast.cfg
    foreach file ( `ls -1 ~stud125/upload/*proteins.fsa | sort` )
    if ( -s $file > 0 ) then
      set acc = `echo $file | perl -ne 'print $1 if /([^\/]+)\.proteins\.fsa/;'`
      set organism = `getgene $acc | perl -ne 'if (/^  ORGANISM  /) {s/^  ORGANISM  //; s/^([^ ])[^ ]* ([^ 0-9]+( .*)?)$/$1. $2/; print;}'`
      echo "$organism\tblastp\t5,5,0\t0.5\t$file"  >> blast.cfg
    less blast.cfg
    QUESTION What do you thing the double redirect ('>>') means?

  3. Make BLAST atlas
    We use the same script to generate the atlas:
    perl -ref $accession.fsa -t "$organism" \
     -dnap "Intrinsic Curvature,Stacking Energy,Position Preference,Global Direct Repeats,Global Inverted Repeats,GC Skew,Percent AT" \
     -proteins $accession.proteins.fsa -ann $accession.ann -blastcfg blast.cfg > blastatlas.pdf
    Open the PDF file:m12/blastatlas.pdf

    Here, we have added two other options: -proteins being a file containing the proteins of the reference genome and -blastcfg being the configuration file you just created, containing the setup of BLAST lanes included in the atlas.

    QUESTION Can you identify areas that are highly conserved - feature your organism possess which are also found the organisms of the other groups? Can you identify areas that are unique to your organism?

  4. Add custom lanes: SIDD
    We will now demonstrate how you can add any type of numerical data to your atlas. In this case we use the SIDD measure at sigma -0.035. You should realize, that this data could have any form - being microarray data mapped to gene positions or individual probe intensities for individual experiments.

    echo "SIDD -0.035\t0,0,10\t10,10,10\t5\t10\t$accession.sidd35dG" > custom.cfg
    perl -ref $accession.fsa -t "$organism" \
     -dnap "Intrinsic Curvature,Stacking Energy,Position Preference,Global Direct Repeats,Global Inverted Repeats,GC Skew,Percent AT" \
     -proteins $accession.proteins.fsa -ann $accession.ann -customcfg custom.cfg -blastcfg blast.cfg > customatlas.pdf
    Open the PDF file:m12/customatlas.pdf

    Here, we have added two other options: -customcfg to setup the SIDD lane you just created.
    QUESTION Why does one of the BLAST lanes have a dark color intensity almost throughout the circle? And how would you account for the gaps that exist in this lane?

  5. Zooming
    The script allows you to zoom in on regions which you find interesting. Here, we have chosen an arbitrary region. Find the options to the program from the command below, and feel free to change them if you like. You can use the previous atlases to identify a region you want to look at.
    perl -ref $accession.fsa -t "$organism" \
     -begin 1 -end 10000 -dnap "Intrinsic Curvature,Stacking Energy,Position Preference,Global Direct Repeats,Global Inverted Repeats,GC Skew,Percent AT" \
     -proteins $accession.proteins.fsa -ann $accession.ann -customcfg custom.cfg -blastcfg blast.cfg > zoomatlas.pdf
    Open the PDF file:m12/zoomatlas.pdf
    CHALLENGE: Try to identify a region where your organism contain unique properties (or not very common) and make a zoom around the window (not more than 20,000 nucleotides). Yesterday, you learned how to make a BLAST search of your own predicted proteins against Swiss Prot and you should now use the same approach to see if you can find the function of the unique features. Hint: use saco_convert -I fasta -O tab < $accession.protein.fsa to convert this into a tab file, and after this you can for example use grep to find sequences having the coordinate of the genes you are looking for. Instead of writing this to 'zoomatlas.pdf', write the output to 'challenge.pdf'

    Open the PDF file:m12/challenge.pdf

Sequence based promoter prediction

When prokaryotic organisms transcribe DNA into RNA it involves in many cases the binding of a sigma factor upstream of the transcription start site. This guides the RNA polymerase (RNAP) into position to facilitate the transcription process. Sigma factors have an important regulatory functions and many different factors acts to ensure differential regulation of different classes of genes under different conditions. In this exercise we will look at σ70 (RpoD) which is the 'housekeeping' sigma factor responsible for regulating most genes in E. coli, and which you predicted the presence of yesterday. It has two distinct binding sites -10 and -35 upstream of the transcription start site, often denoted by the consensus sequences: TATAAT and TTGACA. Since these hexamers rarely match the consensus, information theory is often applied to describe the variation and to measure how well a given promoter fits a model. This measure is provided in bits of information.


Counting the frequency of ATGC in aligned, experimentally verified σ70 minus 35 hexamers, Huerta and co-workers have established the following weight matrix:

Since we are 'surprised' when observing sites with low variability, we define the information content iw (in bits) for each column i as:

-where j iterates over the four bases ATGC, J = 4 is the alphabet length, and w is the position/column in the alignment.

A so-called logo plot can then be constructed which visualizes both the base frequencies as well as the information content at each position: The stack height corresponds to the information content at each position, and the relative heights of the letters correspond to the frequencies of the bases ( see above ). The lower the total stack height, the more equal distribution between ATGC is observed at that position; A stack ofheight 2 corresponds 100% conservation of a single letter, in a four-letter alphabet whereas a stack height of 0 means equal distribution.

To obtain the bit score of how a query sequence fits the model you first have to construct the matrix Rb,p and calculate the sum Rtot:

Additionally a method was proposed by Shultzaberger and co-workers which incorporates an information theory based quantification of the helical phasing between different binding sites to account for various spacers. Since the two hexamers of the -10 and -35 binds to the same molecule, it's spacing is essential for subsequent transcription - not only distance-wise but adding 5 bp to a spacer will change the helical phasing by 180 degrees, making the box inaccessible to the sigma factor.

σ70 promters of rRNA genes

The ribosomal RNA (rRNA) genes encode essential products for the translation machinery of the bacteria; without the ribosome (and its components) no translation of RNA into protein (=no life!) In other words, the rRNA genes are needed constantly by the cell, however required at different amounts depending on the growth rate of the cell. The rRNA genes are regulated by two σ70 promoters, P1 and P2 upstream of the 16S rRNA (which is the first gene of the operon) The redundancy of having multiple promoters allows the cell to differentially express the genes depending on the conditions. The E. coli genome contains seven rRNA operons.

Using Web Services, we shall construct a workflow to investigate how this promoter structure is conserved throughout different bacterial strains. We will not go into details with the infscan tool, as it simply serves the purpose of demonstrating how bioinformatic tools integrate into bigger workflows.

Information of P1 / P2 in E. coli

  1. Extract 500 bp upstream of the 16S rRNA genes in E. coli K12
    perl U00096 | perl bac ssu > K1216S.u500.fsa

  2. Generate information profiles
    We have provided three Perl scripts (,, and which reads in DNA in fasta format, and submits it to a small Web Services based tool which scans for the presence of minus10 boxes, P1, and P2 promoters using information theory. The models are hardcoded into the scripts. The script only models the first -10 hexamer. Try running this Web Service, and preview the output using 'less' (press 'q' to quit less)
  3. First, briefly observe the script
    In the script, identify where the matrices are submitted
    less K1216S.u500.fsa

  4. Scan the upstream sequence with the minus 10 model
    perl < K1216S.u500.fsa > K1216S.u500.minus10.out
    less K1216S.u500.minus10.out
    perl < K1216S.u500.minus10.out >
    gmake K1216S.u500.minus10.pdf
    Open the PDF file:m12/K1216S.u500.minus10.pdf
    As you may have observed, this model is inadequate to model P1 and P2. We will now try running the same sequence using the full P1 and P2 models:
  5. Now, again briefly, observe the script
    In the script, identify where the matrices are submitted
  6. Scan the upstream sequence with the P1 model
    perl < K1216S.u500.fsa > K1216S.u500.P1.out
    less K1216S.u500.P1.out
    perl < K1216S.u500.P1.out >
    gmake K1216S.u500.P1.pdf
    Open the PDF file:m12/K1216S.u500.P1.pdf
  7. Scan the upstream sequence with the P2 model
    perl < K1216S.u500.fsa > K1216S.u500.P2.out
    less K1216S.u500.P2.out
    perl < K1216S.u500.P2.out >
    gmake K1216S.u500.P2.pdf
    Open the PDF file:m12/K1216S.u500.P2.pdf
  8. Obtaining the -10 and -35, UP element, and FIS binding sites
    We have provided a small Perl scripts, which sorts the profile output, and list the match having the highest total score. This position corresponds to the peaks of the plots you have just made. Source code:
    perl < K1216S.u500.fsa | perl
    perl < K1216S.u500.fsa | perl

rRNA promoter in your bacterium

  1. Make profile of P1
    set accession = CP000025
    perl $accession | perl bac ssu > $accession.500.fsa
    perl < $accession.500.fsa > $accession.p1.profile
    perl < $accession.p1.profile > $
    less $
    perl < $accession.p1.profile > $
    gmake $accession.p1.profile.pdf
    Open the PDF file:m12/CP000025.p1.profile.pdf
  2. Make profile of P2
    set accession = CP000025
    perl < $accession.500.fsa > $accession.p2.profile
    perl < $accession.p2.profile > $
    less $
    perl < $accession.p2.profile > $
    gmake $accession.p2.profile.pdf
    Open the PDF file:m12/CP000025.p2.profile.pdf
QUESTION You may have identified some problems identifing P1 and P2 in your organism. Explain why you think this method does not apply to all bacteria.