NOTE: This exercise was held on Feb 2, 2008 at 10:15am.
EXERCISE 2: DNA BINDING SITES AND INFORMATION CONTENT
This exercise requires Perl, SOAP::Lite (version 0.69), R statistical package
(for plotting) and any recent SoapUI version installed locally. We recommend you
log in to the CBS servers sbiology, ibiology or life while doing this exercise. In
the folder 'ex2' of your home directory you will find the files relevant for this exercise.
When prokaryotic organisms transcribe DNA into RNA it involves in many cases the
binding of a sigma factor upstream of the transcription start site. This guides
the RNA polymerase (RNAP) into position to facilitate the transcription process.
Sigma factors also have important regulatory functions and many different
factors acts to ensure differential regulation of different classes of genes
under different conditions. In this exercise we will look at
σ70 (or RpoD) which is the 'housekeeping' sigma factor
responsible for regulating most genes in the prokaryotic cell. It has two
distinct binding sites -10 and -35 upstream of the transcription start site,
often denoted by the consensus sequences: TATAAT and TTGACA. Since the hexamers
rarely match the consensus, information theory is often used to describe
the observed variation and to measure how well a given promoter fits a model. This
measure is provided in bits of information.
Counting the frequency of ATGC in aligned, experimentally verified
σ70 minus 35 hexamers, Huerta and co-workers have established
the following weight matrix:
Since we are 'surprised' when observing sites with low variability, we define the information content
iw (in bits) for each column i as:
-where j iterates over the four bases ATGC, J = 4 is the alphabet
length, and w is the position/column in the alignment.
A so-called logo plot can then be constructed which visualizes both the base
frequencies as well as the information content at each position: The stack
height corresponds to the information content at each position, and the relative
heights of the letters correspond to the frequencies of the bases ( see above ).
The lower the total stack height, the more equal distribution between ATGC is
observed at that position; A stack ofheight 2 corresponds 100% conservation of
a single letter, in a four-letter alphabet whereas a stack height of 0 means
To obtain the bit score of how a query sequence fits the model you first have to construct
the matrix Rb,p and calculate the sum Rtot:
Additionally a method was proposed by Shultzaberger and co-workers which
incorporates an information theory based quantification of the helical phasing
between different binding sites to account for various spacers. Since the two
hexamers of the -10 and -35 binds to the same molecule, it's spacing is
essential for subsequent transcription - not only distance-wise but adding 5 bp
to a spacer will change the helical phasing by 180 degrees, making the box
inaccessible to the sigma factor.
σ70 promters of rRNA genes
The ribosomal RNA (rRNA) genes encode essential products for the translation
machinery of the bacteria; without the ribosome (and its components) no
translation of RNA into protein (=no life!) In other words, the rRNA genes are
needed constantly by the cell, however required at different amounts depending
on the growth rate of the cell. The rRNA genes are regulated by two
σ70 promoters, P1 and P2 upstream of the 16S rRNA (which
is the first gene of the operon) The redundancy of having multiple promoters
allows the cell to differentially express the genes depending on the conditions.
The E. coli genome contains seven rRNA operons.
Using Web Services, we shall construct a workflow to investigate how this promoter
structure is conserved throughout different bacterial strains.
Enough biology - let's move on!
During the exercise we will use three CBS Web Services:
- RNAmmer: A tool for predicting 5S, 16S, and 23S rRNA genes in prokaryotic genomes, using
Hidden Markov Models build from structurally aligned sequences.
- GenomeAtlas: A database resource for obtaining genomes of fully sequence prokaryotic organisms.
- infscan: A small tool for scanning DNA using information theory by providing weight matrices
We will not go into details with the infscan tool, as it simply serves the purpose of demonstrating
how bioinformatic tools integrate into bigger workflows. The two other services will first be demonstrated
using SoapUI to let you get familiar with the SOAP operations in the input/output objects.
Exploring services with SoapUI
We hope you are now familiar with the basic functionality of these Web Services. We will now try and
implement those in a programmatic workflow.
- New SoapUI project
Open SoapUI and create a new project and call it 'exercise2'.
- Load Genome Atlas WSDL
Add the Genome Atlas WSDL file: http://www.cbs.dtu.dk/ws/GenomeAtlas/GenomeAtlas_3_3.wsdl
Similar services can be found other places as well, e.g. at NCBI or EBI.
- Obtain genome sequence
Locate the operation getSeq, and request the genome sequence of
the E. coli K12 genome, which has genbank accession no. U00096
- Load RNAmmer WSDL
Add the WSDL file http://www.cbs.dtu.dk/ws/RNAmmer/RNAmmer_1_2a.wsdl
to your existing project, exercise2.
- Submit genome sequence to RNAmmer
Transfer the genome sequence from the getSeq response into the RNAmmer runService default
request (hint: double click on the sequence to select
the entire element content - be aware this line is over 4 mill. characters long!)
Set element 'kingdom'='bac' (using bacterial models) and set 'mol'='ssu'
(predict only 16S, Small Submit Unit). Your request should look this this:
Poll the queue until the job is finished, and obtain the result using 'fetchResult'
and inspect the output
The programmatic way
- Request genome sequences
The workflow just described can be done with simple Perl or Python scripts. If you log into the CBS servers,
the scripts are created for you. Source code: rnammer.pl | getseq.pl | rnammer.py | getseq.py
perl getseq.pl u00096 | perl rnammer.pl bac ssu
Extracting promoter regions
The previous step returned the 16S gene sequences, but we are interested in
analyzing the 500bp upstream region of the gene. Make the appropriate
changes to the rnammer.pl/.py script and save it under
rnammer_upstream.pl/py and re-run the workflow. We have been considerably
kind to you by providing the code given as comments in the rnammer.pl/.py scripts.
perl getseq.pl u00096 | perl rnammer_upstream.pl bac ssu > 16S_m500.fsa
- Generate information profiles
We have provided three Perl scripts (minus10.pl, P1.pl, and P2.pl)
which reads in DNA in fasta format, and submits it to a small Web Services based tool which
scans for the presence of minus10 boxes, P1, and P2 promoters using information theory. The models are hardcoded
into the scripts. The minus10.pl script only models the first -10 hexamer. Try running this Web Service,
and preview the output using 'less' (press 'q' to quit less)
perl minus10.pl < 16S_m500.fsa > minus10.out
perl xyplot.pl < minus10.out > 16SrRNA.m10.ps
As you may have observed, this model is inadequate to model P1 and P2. We will now try running the same sequence
using the full P1 and P2 models:
perl P1.pl < 16S_m500.fsa | perl xyplot.pl > 16SrRNA.p1.ps
perl P2.pl < 16S_m500.fsa | perl xyplot.pl > 16SrRNA.p2.ps
Obtaining the -10 and -35, UP element, and FIS binding sites
We have provided a small Perl scripts, which sorts the profile output, and list the match having the highest
total score. This position corresponds to the peaks of the plots you have just made. Source code: infsort.pl.
perl P1.pl < 16S_m500.fsa | perl infsort.pl
perl P2.pl < 16S_m500.fsa | perl infsort.pl
- Combining everything into a single workflow
If time allows, you can implement the above steps into a single Perl or Python script,
e.g. with some (or all) of these specifications:
- Read one or more genbank accession numbers from the argument list, e.g. MyWorkflow CP000247 CP000468 AE014075
- Download each of them using the GenomeAtlas getSeq operation
- Predict the location of the 16S rRNA genes using RNAmmer
- Extract the 500 bp regions upstream of the 16S rRNA genes
- Using the infscan operation to find the location where P1 and P2 models have the highest
total information score, and return the predicted binding motifs
- Your workflow could also write a PostScript file of the P1 and P2 profiles, by doing a system call to xyplot.pl,
or you can insert the xyplot.pl code into your workflow
ex2-workflow.pl CP000247 CP000468 AE014075
#16S 3536868..3538397 P1 starts at 196 TATAAT TTGTGC AGAAAAAAAGATCAAAAAAGTA AAAGGGTGAAAAAGCAACAAA
#16S 3536868..3538397 P2 starts at 304 TAATAT TTGACT AAAGAGAAAAAATCCTGAAATT
|CP000247||Escherichia coli 536|
|CP000468||Escherichia coli APEC O1|
|AE014075||Escherichia coli CFT073|
|CP000800||Escherichia coli E24377A|
|CP000802||Escherichia coli HS|
|U00096||Escherichia coli K12|
|AE005174||Escherichia coli O157:H7 EDL933|
|BA000007||Escherichia coli O157:H7 str. Sakai|
|CP000243||Escherichia coli UTI89|
|AP009048||Escherichia coli W3110|
|CP000880||Salmonella enterica subsp. arizonae serovar 62:z4,z23:--|
|AE017220||Salmonella enterica subsp. enterica serovar Choleraesuis str. SC-B67|
|CP000026||Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC 9150|
|CP000886||Salmonella enterica subsp. enterica serovar Paratyphi B str. SPB7|
|AL513382||Salmonella enterica subsp. enterica serovar Typhi str. CT18|
|AE014613||Salmonella enterica subsp. enterica serovar Typhi str. Ty2|
|AE006468||Salmonella typhimurium LT2|
|CP000036||Shigella boydii Sb227|
|CP000034||Shigella dysenteriae Sd197|
|AE014073||Shigella flexneri 2a str. 2457T|
|AE005674||Shigella flexneri 2a str. 301|
|CP000266||Shigella flexneri 5 str. 8401|
|CP000038||Shigella sonnei Ss046|
|AM286415||Yersinia enterocolitica subsp. enterocolitica 8081|
|CP000901||Yersinia pestis Angola|
|CP000308||Yersinia pestis Antiqua|
|AE017042||Yersinia pestis biovar Microtus str. 91001|
|AL590842||Yersinia pestis CO92|
|AE009952||Yersinia pestis KIM|
|CP000305||Yersinia pestis Nepal516|
|CP000668||Yersinia pestis Pestoides F|
|CP000720||Yersinia pseudotuberculosis IP 31758|
|BX936398||Yersinia pseudotuberculosis IP 32953|
Peter F. Hallin,