NOTE: This exercise was held on Feb 2, 2008 at 10:15am.



This exercise requires Perl, SOAP::Lite (version 0.69), R statistical package (for plotting) and any recent SoapUI version installed locally. We recommend you log in to the CBS servers sbiology, ibiology or life while doing this exercise. In the folder 'ex2' of your home directory you will find the files relevant for this exercise.



When prokaryotic organisms transcribe DNA into RNA it involves in many cases the binding of a sigma factor upstream of the transcription start site. This guides the RNA polymerase (RNAP) into position to facilitate the transcription process. Sigma factors also have important regulatory functions and many different factors acts to ensure differential regulation of different classes of genes under different conditions. In this exercise we will look at σ70 (or RpoD) which is the 'housekeeping' sigma factor responsible for regulating most genes in the prokaryotic cell. It has two distinct binding sites -10 and -35 upstream of the transcription start site, often denoted by the consensus sequences: TATAAT and TTGACA. Since the hexamers rarely match the consensus, information theory is often used to describe the observed variation and to measure how well a given promoter fits a model. This measure is provided in bits of information.


Counting the frequency of ATGC in aligned, experimentally verified σ70 minus 35 hexamers, Huerta and co-workers have established the following weight matrix:

Since we are 'surprised' when observing sites with low variability, we define the information content iw (in bits) for each column i as:

-where j iterates over the four bases ATGC, J = 4 is the alphabet length, and w is the position/column in the alignment.

A so-called logo plot can then be constructed which visualizes both the base frequencies as well as the information content at each position: The stack height corresponds to the information content at each position, and the relative heights of the letters correspond to the frequencies of the bases ( see above ). The lower the total stack height, the more equal distribution between ATGC is observed at that position; A stack ofheight 2 corresponds 100% conservation of a single letter, in a four-letter alphabet whereas a stack height of 0 means equal distribution.

To obtain the bit score of how a query sequence fits the model you first have to construct the matrix Rb,p and calculate the sum Rtot:

Additionally a method was proposed by Shultzaberger and co-workers which incorporates an information theory based quantification of the helical phasing between different binding sites to account for various spacers. Since the two hexamers of the -10 and -35 binds to the same molecule, it's spacing is essential for subsequent transcription - not only distance-wise but adding 5 bp to a spacer will change the helical phasing by 180 degrees, making the box inaccessible to the sigma factor.

σ70 promters of rRNA genes

The ribosomal RNA (rRNA) genes encode essential products for the translation machinery of the bacteria; without the ribosome (and its components) no translation of RNA into protein (=no life!) In other words, the rRNA genes are needed constantly by the cell, however required at different amounts depending on the growth rate of the cell. The rRNA genes are regulated by two σ70 promoters, P1 and P2 upstream of the 16S rRNA (which is the first gene of the operon) The redundancy of having multiple promoters allows the cell to differentially express the genes depending on the conditions. The E. coli genome contains seven rRNA operons.

Using Web Services, we shall construct a workflow to investigate how this promoter structure is conserved throughout different bacterial strains.

Enough biology - let's move on!

During the exercise we will use three CBS Web Services:
  • RNAmmer: A tool for predicting 5S, 16S, and 23S rRNA genes in prokaryotic genomes, using Hidden Markov Models build from structurally aligned sequences.
  • GenomeAtlas: A database resource for obtaining genomes of fully sequence prokaryotic organisms.
  • infscan: A small tool for scanning DNA using information theory by providing weight matrices

We will not go into details with the infscan tool, as it simply serves the purpose of demonstrating how bioinformatic tools integrate into bigger workflows. The two other services will first be demonstrated using SoapUI to let you get familiar with the SOAP operations in the input/output objects.

Exploring services with SoapUI

  1. New SoapUI project
    Open SoapUI and create a new project and call it 'exercise2'.

  2. Load Genome Atlas WSDL
    Add the Genome Atlas WSDL file: Similar services can be found other places as well, e.g. at NCBI or EBI.

  3. Obtain genome sequence
    Locate the operation getSeq, and request the genome sequence of the E. coli K12 genome, which has genbank accession no. U00096

  4. Load RNAmmer WSDL
    Add the WSDL file to your existing project, exercise2.

  5. Submit genome sequence to RNAmmer
    Transfer the genome sequence from the getSeq response into the RNAmmer runService default request (hint: double click on the sequence to select the entire element content - be aware this line is over 4 mill. characters long!) Set element 'kingdom'='bac' (using bacterial models) and set 'mol'='ssu' (predict only 16S, Small Submit Unit). Your request should look this this:

  6. Poll the queue until the job is finished, and obtain the result using 'fetchResult' and inspect the output

We hope you are now familiar with the basic functionality of these Web Services. We will now try and implement those in a programmatic workflow.

The programmatic way

  1. Request genome sequences
    The workflow just described can be done with simple Perl or Python scripts. If you log into the CBS servers, the scripts are created for you. Source code: | | |
    perl u00096 | perl bac ssu
  2. Extracting promoter regions
    The previous step returned the 16S gene sequences, but we are interested in analyzing the 500bp upstream region of the gene. Make the appropriate changes to the script and save it under and re-run the workflow. We have been considerably kind to you by providing the code given as comments in the scripts.
    perl u00096 | perl bac ssu > 16S_m500.fsa
  3. Generate information profiles
    We have provided three Perl scripts (,, and which reads in DNA in fasta format, and submits it to a small Web Services based tool which scans for the presence of minus10 boxes, P1, and P2 promoters using information theory. The models are hardcoded into the scripts. The script only models the first -10 hexamer. Try running this Web Service, and preview the output using 'less' (press 'q' to quit less)
    perl < 16S_m500.fsa > minus10.out
    less minus10.out
    perl < minus10.out >
    As you may have observed, this model is inadequate to model P1 and P2. We will now try running the same sequence using the full P1 and P2 models:
    perl < 16S_m500.fsa | perl >
    perl < 16S_m500.fsa | perl >
  4. Obtaining the -10 and -35, UP element, and FIS binding sites
    We have provided a small Perl scripts, which sorts the profile output, and list the match having the highest total score. This position corresponds to the peaks of the plots you have just made. Source code:
    perl < 16S_m500.fsa | perl
    perl < 16S_m500.fsa | perl
  5. Combining everything into a single workflow
    If time allows, you can implement the above steps into a single Perl or Python script, e.g. with some (or all) of these specifications:
    1. Read one or more genbank accession numbers from the argument list, e.g. MyWorkflow CP000247 CP000468 AE014075
    2. Download each of them using the GenomeAtlas getSeq operation
    3. Predict the location of the 16S rRNA genes using RNAmmer
    4. Extract the 500 bp regions upstream of the 16S rRNA genes
    5. Using the infscan operation to find the location where P1 and P2 models have the highest total information score, and return the predicted binding motifs
    6. Your workflow could also write a PostScript file of the P1 and P2 profiles, by doing a system call to, or you can insert the code into your workflow CP000247 CP000468 AE014075
    #16S 3536868..3538397	P2 starts at 304	TAATAT	TTGACT	AAAGAGAAAAAATCCTGAAATT
    CP000247Escherichia coli 536
    CP000468Escherichia coli APEC O1
    AE014075Escherichia coli CFT073
    CP000800Escherichia coli E24377A
    CP000802Escherichia coli HS
    U00096Escherichia coli K12
    AE005174Escherichia coli O157:H7 EDL933
    BA000007Escherichia coli O157:H7 str. Sakai
    CP000243Escherichia coli UTI89
    AP009048Escherichia coli W3110
    CP000880Salmonella enterica subsp. arizonae serovar 62:z4,z23:--
    AE017220Salmonella enterica subsp. enterica serovar Choleraesuis str. SC-B67
    CP000026Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC 9150
    CP000886Salmonella enterica subsp. enterica serovar Paratyphi B str. SPB7
    AL513382Salmonella enterica subsp. enterica serovar Typhi str. CT18
    AE014613Salmonella enterica subsp. enterica serovar Typhi str. Ty2
    AE006468Salmonella typhimurium LT2
    CP000036Shigella boydii Sb227
    CP000034Shigella dysenteriae Sd197
    AE014073Shigella flexneri 2a str. 2457T
    AE005674Shigella flexneri 2a str. 301
    CP000266Shigella flexneri 5 str. 8401
    CP000038Shigella sonnei Ss046
    AM286415Yersinia enterocolitica subsp. enterocolitica 8081
    CP000901Yersinia pestis Angola
    CP000308Yersinia pestis Antiqua
    AE017042Yersinia pestis biovar Microtus str. 91001
    AL590842Yersinia pestis CO92
    AE009952Yersinia pestis KIM
    CP000305Yersinia pestis Nepal516
    CP000668Yersinia pestis Pestoides F
    CP000720Yersinia pseudotuberculosis IP 31758
    BX936398Yersinia pseudotuberculosis IP 32953



Peter F. Hallin,