|
Exercise - Module M12 - Codon usage, promoters, BLAST atlases
Copy files
- Copy the files needed for the exercise
umask 022
setenv MAKEFILES /home/databases/genomeatlasdb-3.0_cur/make/Makefile
set accession = BA000028
cp -rL ~www/pub/CBS/courses/thaiworkshop08/m12 ~/
cp ~/m8/$accession.ann ~/m12
cp ~/m8/$accession.proteins.fsa ~/m12
cp ~/m8/$accession.fsa ~/m12
cp ~/m8/$accession.orfs.fsa ~/m12
cd ~/m12
DNA physical/chemical properties
- Calculate global AT content
We start by measuring the AT content of the entire genome.
set accession = BA000028
perl -ne 'uc ; next unless /^([A-Za-z]+)/;for($x=0;$x< length($1);$x++){\
$AT++ if ( substr($1,$x,1) eq "A" or substr($1,$x,1) eq "T"); $TOT++;} END {\
printf ("%0.4f\n",(($AT) / $TOT));}' < $accession.fsa > $accession.ATcontent
set ATcontent = `cat $accession.ATcontent`
mysql -B -e "update MicrobialGenomicsGroup.thaiworkshop08 set ATcontent = $ATcontent where accession = '$accession'"
QUESTION We are using the scripting language Perl, which is widely
used in the field of bioinformatics. The code is what we call a Perl one-liner,
referring to it being written as a single command line (specified by the
-e). You would normally write this as a script instead, unless you have
a really good overview ;-) Below is a nicer version of the same script: Try to
understand what is going on in the script:
- Codon and amino acid usage
We shall now use your prediction of open reading frames and their translation
to produce a circular representation of the codon- and amino acid usage.
perl aa.pl < $accession.proteins.fsa > $accession.aa.png
perl codon.pl < $accession.orfs.fsa > $accession.codon.png
Open the PDF file:m12/BA000028.aa.png
Open the PDF file:m12/BA000028.codon.png
QUESTION How is the AT content of the organism reflected in the codons used?
(hint:base wobbling)
- Profile of AT content
This gmake command will execute many commands. As was presented during tuesdays presentations,
we are generating an AT content profile of all the 400 bp upstream/downstream regions relative to the
genes you have predicted using EasyGene. This is possible, since you have created the
annotation file yesterday (*.ann)
QUESTION Can you identify the base wobbling?
saco_convert -I fasta -O tab $accession.fsa > $accession.tab
gmake -j $accession.ATcontentBits.400profile400.pdf
Open the PDF file:m12/BA000028.ATcontentBits.400profile400.pdf
QUESTION What does saco_convert do?
(hint:base wobbling)
- Profile of structural parameters
In a similar way we will now generate some of the DNA parameters you have already heard about,
and extract values located around the translation start:
gmake -j $accession.promoter.400profile400.gauss5.pdf
Open the PDF file:m12/BA000028.promoter.400profile400.gauss5.pdf
QUESTION Are there changes in the structural properties which you can
account for - consult the text book
- Fetch pre-calculated SIDD profiles
We will now from a Web Services repository of SIDD values, download the free energy (dG) measures
at various super helical densities and observe how the vary (if at all) near the translation start.
perl getSIDD.pl $accession '-0.025' > $accession.sidd25dG &
perl getSIDD.pl $accession '-0.035' > $accession.sidd35dG &
perl getSIDD.pl $accession '-0.045' > $accession.sidd45dG &
perl getSIDD.pl $accession '-0.055' > $accession.sidd55dG &
wait
gmake -j $accession.sidddG.400profile400.gauss5.pdf
Open the PDF file:m12/BA000028.sidddG.400profile400.gauss5.pdf
QUESTION Can you explain why you observe no sharp peaks in the
structural measures (curvature, stacking energy, and position preference) or in the SIDD measure near
the translation start site - that is, why do you observe very broad and soft peaks?
Making a Genome Atlas
- Construct a standard 'Genome Atlas'
The BLAST atlas web service allows you to create a circular representation of the structural parameters,
in a simple and easy way.
set organism = `getgene $accession | perl -ne 'if (/^ ORGANISM /) {s/^ ORGANISM //; s/^([^ ])[^ ]* ([^ 0-9]+( .*)?)$/$1. $2/; print;}'`
echo $organism
perl genomeatlas.pl -ref $accession.fsa -t "$organism" \
-dnap "Intrinsic Curvature,Stacking Energy,Position Preference,Global Direct Repeats,Global Inverted Repeats,GC Skew,Percent AT" \
-ann $accession.ann > genomeatlas.pdf
Open the PDF file:m12/genomeatlas.pdf
Here, the options refer to: -t being the title for the
plot.-ref being the DNA sequence of which we plot the properties,
-dnap specifying which DNA properties to include, and
-ann is pointing to the annotation file you made - the annotations
placed here will be plotted on the atlas.
QUESTION Identify the rRNA genes you have predicted (see the legend on page two to see the color).
Do you find any level of direct or inverted repeats in this area? Why / why not? (hint: count how many rRNA
genes did you find? You may want to look in your *.rrna.fsa file to see which rRNA molecules you predicted)
- Get all the predicted translations of the other groups
We will now construct a small configuration file, which instructs the genomeatlas.pl script
to blast the predicted proteins of your personal bacterium against all the proteins of all the bacteria of the other groups!
rm -f blast.cfg
foreach file ( `ls -1 ~stud125/upload/*proteins.fsa | sort` )
if ( -s $file > 0 ) then
set acc = `echo $file | perl -ne 'print $1 if /([^\/]+)\.proteins\.fsa/;'`
set organism = `getgene $acc | perl -ne 'if (/^ ORGANISM /) {s/^ ORGANISM //; s/^([^ ])[^ ]* ([^ 0-9]+( .*)?)$/$1. $2/; print;}'`
echo "$organism\tblastp\t5,5,0\t0.5\t$file" >> blast.cfg
endif
end
less blast.cfg
QUESTION What do you thing the double redirect ('>>') means?
- Make BLAST atlas
We use the same genomeatlas.pl script to generate the atlas:
perl genomeatlas.pl -ref $accession.fsa -t "$organism" \
-dnap "Intrinsic Curvature,Stacking Energy,Position Preference,Global Direct Repeats,Global Inverted Repeats,GC Skew,Percent AT" \
-proteins $accession.proteins.fsa -ann $accession.ann -blastcfg blast.cfg > blastatlas.pdf
Open the PDF file:m12/blastatlas.pdf
Here, we have added two other options: -proteins being a file containing the
proteins of the reference genome and -blastcfg being the configuration file you just created,
containing the setup of BLAST lanes included in the atlas.
QUESTION Can you identify areas that are highly conserved - feature your organism possess which
are also found the organisms of the other groups? Can you identify areas that are unique to your organism?
- Add custom lanes: SIDD
We will now demonstrate how you can add any type of numerical data to your atlas. In this
case we use the SIDD measure at sigma -0.035. You should realize, that this data could have any form -
being microarray data mapped to gene positions or individual probe intensities for individual experiments.
echo "SIDD -0.035\t0,0,10\t10,10,10\t5\t10\t$accession.sidd35dG" > custom.cfg
perl genomeatlas.pl -ref $accession.fsa -t "$organism" \
-dnap "Intrinsic Curvature,Stacking Energy,Position Preference,Global Direct Repeats,Global Inverted Repeats,GC Skew,Percent AT" \
-proteins $accession.proteins.fsa -ann $accession.ann -customcfg custom.cfg -blastcfg blast.cfg > customatlas.pdf
Open the PDF file:m12/customatlas.pdf
Here, we have added two other options: -customcfg to setup the SIDD lane you just created.
QUESTION Why does one of the BLAST lanes have a dark color intensity almost throughout the circle? And how would you account
for the gaps that exist in this lane?
- Zooming
The genomeatlas.pl script allows you to zoom in on regions which you find interesting. Here,
we have chosen an arbitrary region. Find the options to the program from the command below, and feel free to change
them if you like. You can use the previous atlases to identify a region you want to look at.
perl genomeatlas.pl -ref $accession.fsa -t "$organism" \
-begin 1 -end 10000 -dnap "Intrinsic Curvature,Stacking Energy,Position Preference,Global Direct Repeats,Global Inverted Repeats,GC Skew,Percent AT" \
-proteins $accession.proteins.fsa -ann $accession.ann -customcfg custom.cfg -blastcfg blast.cfg > zoomatlas.pdf
Open the PDF file:m12/zoomatlas.pdf
CHALLENGE: Try to identify a region where your organism
contain unique properties (or not very common) and make a zoom around the window
(not more than 20,000 nucleotides). Yesterday, you learned how to make a BLAST
search of your own predicted proteins against Swiss Prot and you should now use the
same approach to see if you can find the function of the unique features. Hint:
use saco_convert -I fasta -O tab < $accession.protein.fsa to convert
this into a tab file, and after this you can for example use grep to
find sequences having the coordinate of the genes you are looking for. Instead
of writing this to 'zoomatlas.pdf', write the output to 'challenge.pdf'
Open the PDF file:m12/challenge.pdf
Sequence based promoter prediction
When prokaryotic organisms transcribe DNA into RNA it involves in many cases the
binding of a sigma factor upstream of the transcription start site. This guides
the RNA polymerase (RNAP) into position to facilitate the transcription process.
Sigma factors have an important regulatory functions and many different factors
acts to ensure differential regulation of different classes of genes under
different conditions. In this exercise we will look at σ70
(RpoD) which is the 'housekeeping' sigma factor responsible for regulating most
genes in E. coli, and which you predicted the presence of yesterday. It
has two distinct binding sites -10 and -35 upstream of the transcription start
site, often denoted by the consensus sequences: TATAAT and TTGACA. Since these
hexamers rarely match the consensus, information theory is often applied to
describe the variation and to measure how well a given promoter fits a
model. This measure is provided in bits of information.
Example
Counting the frequency of ATGC in aligned, experimentally verified
σ70 minus 35 hexamers, Huerta and co-workers have established
the following weight matrix:

Since we are 'surprised' when observing sites with low variability, we define the information content
iw (in bits) for each column i as:
-where j iterates over the four bases ATGC, J = 4 is the alphabet
length, and w is the position/column in the alignment.
|
|
A so-called logo plot can then be constructed which visualizes both the base
frequencies as well as the information content at each position: The stack
height corresponds to the information content at each position, and the relative
heights of the letters correspond to the frequencies of the bases ( see above ).
The lower the total stack height, the more equal distribution between ATGC is
observed at that position; A stack ofheight 2 corresponds 100% conservation of
a single letter, in a four-letter alphabet whereas a stack height of 0 means
equal distribution.
To obtain the bit score of how a query sequence fits the model you first have to construct
the matrix Rb,p and calculate the sum Rtot:
Additionally a method was proposed by Shultzaberger and co-workers which
incorporates an information theory based quantification of the helical phasing
between different binding sites to account for various spacers. Since the two
hexamers of the -10 and -35 binds to the same molecule, it's spacing is
essential for subsequent transcription - not only distance-wise but adding 5 bp
to a spacer will change the helical phasing by 180 degrees, making the box
inaccessible to the sigma factor.
σ70 promters of rRNA genes
The ribosomal RNA (rRNA) genes encode essential products for the translation
machinery of the bacteria; without the ribosome (and its components) no
translation of RNA into protein (=no life!) In other words, the rRNA genes are
needed constantly by the cell, however required at different amounts depending
on the growth rate of the cell. The rRNA genes are regulated by two
σ70 promoters, P1 and P2 upstream of the 16S rRNA (which
is the first gene of the operon) The redundancy of having multiple promoters
allows the cell to differentially express the genes depending on the conditions.
The E. coli genome contains seven rRNA operons.
Using Web Services, we shall construct a workflow to investigate how this
promoter structure is conserved throughout different bacterial strains. We will
not go into details with the infscan tool, as it simply serves the
purpose of demonstrating how bioinformatic tools integrate into bigger
workflows.
Information of P1 / P2 in E. coli
- Extract 500 bp upstream of the 16S rRNA genes in E. coli K12
perl getseq.pl U00096 | perl rnammer_upstream.pl bac ssu > K1216S.u500.fsa
- Generate information profiles
We have provided three Perl scripts (minus10.pl, P1.pl, and P2.pl)
which reads in DNA in fasta format, and submits it to a small Web Services based tool which
scans for the presence of minus10 boxes, P1, and P2 promoters using information theory. The models are hardcoded
into the scripts. The minus10.pl script only models the first -10 hexamer. Try running this Web Service,
and preview the output using 'less' (press 'q' to quit less)
- First, briefly observe the minus10.pl script
In the script, identify where the matrices are submitted
less K1216S.u500.fsa
less minus10.pl
- Scan the upstream sequence with the minus 10 model
perl minus10.pl < K1216S.u500.fsa > K1216S.u500.minus10.out
less K1216S.u500.minus10.out
perl xyplot.pl < K1216S.u500.minus10.out > K1216S.u500.minus10.ps
gmake K1216S.u500.minus10.pdf
Open the PDF file:m12/K1216S.u500.minus10.pdf
As you may have observed, this model is inadequate to model P1 and P2. We will now try running the same sequence
using the full P1 and P2 models:
- Now, again briefly, observe the P1.pl script
In the script, identify where the matrices are submitted
less P1.pl
- Scan the upstream sequence with the P1 model
perl P1.pl < K1216S.u500.fsa > K1216S.u500.P1.out
less K1216S.u500.P1.out
perl xyplot.pl < K1216S.u500.P1.out > K1216S.u500.P1.ps
gmake K1216S.u500.P1.pdf
Open the PDF file:m12/K1216S.u500.P1.pdf
- Scan the upstream sequence with the P2 model
perl P2.pl < K1216S.u500.fsa > K1216S.u500.P2.out
less K1216S.u500.P2.out
perl xyplot.pl < K1216S.u500.P2.out > K1216S.u500.P2.ps
gmake K1216S.u500.P2.pdf
Open the PDF file:m12/K1216S.u500.P2.pdf
- Obtaining the -10 and -35, UP element, and FIS binding sites
We have provided a small Perl scripts, which sorts the profile output, and list the match having the highest
total score. This position corresponds to the peaks of the plots you have just made. Source code: infsort.pl.
perl P1.pl < K1216S.u500.fsa | perl infsort.pl
perl P2.pl < K1216S.u500.fsa | perl infsort.pl
rRNA promoter in your bacterium
- Make profile of P1
set accession = BA000028
perl getseq.pl $accession | perl rnammer_upstream.pl bac ssu > $accession.500.fsa
perl P1.pl < $accession.500.fsa > $accession.p1.profile
perl infsort.pl < $accession.p1.profile > $accession.p1.top
less $accession.p1.top
perl xyplot.pl < $accession.p1.profile > $accession.p1.profile.ps
gmake $accession.p1.profile.pdf
Open the PDF file:m12/BA000028.p1.profile.pdf
- Make profile of P2
set accession = BA000028
perl P2.pl < $accession.500.fsa > $accession.p2.profile
perl infsort.pl < $accession.p2.profile > $accession.p2.top
less $accession.p2.top
perl xyplot.pl < $accession.p2.profile > $accession.p2.profile.ps
gmake $accession.p2.profile.pdf
Open the PDF file:m12/BA000028.p2.profile.pdf
QUESTION You may have identified some problems
identifing P1 and P2 in your organism. Explain why you think this
method does not apply to all bacteria.
LITERATURE
|