$what = "Exercise - Module M12 - Codon usage, promoters, BLAST atlases"; standard_head($what); # Create CBS menu. Format is: # Keyword, Path, Name_of_this_page # Keyword is the keyword for the menu color/area. # Name_of_this_page is what this page is called in the hieraki # Path format is a number of comma separated entries in parenthesis # showing the path to this page; (services/,'CBS Prediction Servers') standard_menu("CBSCO","(courses.html,'CBS Courses'),(courses/thaiworkshop08/index.php,'Comparative Microbial Genomics Workshop - 2008 Thailand'),(courses/thaiworkshop08/programme.php,'Course Programme')",$what); ?>
umask 022 setenv MAKEFILES /home/databases/genomeatlasdb-3.0_cur/make/Makefile set accession = AE017334 cp -rL ~www/pub/CBS/courses/thaiworkshop08/m12 ~/ cp ~/m8/$accession.ann ~/m12 cp ~/m8/$accession.proteins.fsa ~/m12 cp ~/m8/$accession.fsa ~/m12 cp ~/m8/$accession.orfs.fsa ~/m12 cd ~/m12
set accession = AE017334 perl -ne 'uc ; next unless /^([A-Za-z]+)/;for($x=0;$x< length($1);$x++){\ $AT++ if ( substr($1,$x,1) eq "A" or substr($1,$x,1) eq "T"); $TOT++;} END {\ printf ("%0.4f\n",(($AT) / $TOT));}' < $accession.fsa > $accession.ATcontent set ATcontent = `cat $accession.ATcontent` mysql -B -e "update MicrobialGenomicsGroup.thaiworkshop08 set ATcontent = $ATcontent where accession = '$accession'"
perl aa.pl < $accession.proteins.fsa > $accession.aa.png perl codon.pl < $accession.orfs.fsa > $accession.codon.png Open the PDF file:m12/AE017334.aa.png Open the PDF file:m12/AE017334.codon.pngQUESTION How is the AT content of the organism reflected in the codons used? (hint:base wobbling)
saco_convert -I fasta -O tab $accession.fsa > $accession.tab gmake -j $accession.ATcontentBits.400profile400.pdf Open the PDF file:m12/AE017334.ATcontentBits.400profile400.pdfQUESTION What does saco_convert do? (hint:base wobbling)
gmake -j $accession.promoter.400profile400.gauss5.pdf Open the PDF file:m12/AE017334.promoter.400profile400.gauss5.pdfQUESTION Are there changes in the structural properties which you can account for - consult the text book
perl getSIDD.pl $accession '-0.025' > $accession.sidd25dG & perl getSIDD.pl $accession '-0.035' > $accession.sidd35dG & perl getSIDD.pl $accession '-0.045' > $accession.sidd45dG & perl getSIDD.pl $accession '-0.055' > $accession.sidd55dG & wait gmake -j $accession.sidddG.400profile400.gauss5.pdf Open the PDF file:m12/AE017334.sidddG.400profile400.gauss5.pdf
set organism = `getgene $accession | perl -ne 'if (/^ ORGANISM /) {s/^ ORGANISM //; s/^([^ ])[^ ]* ([^ 0-9]+( .*)?)$/$1. $2/; print;}'` echo $organism perl genomeatlas.pl -ref $accession.fsa -t "$organism" \ -dnap "Intrinsic Curvature,Stacking Energy,Position Preference,Global Direct Repeats,Global Inverted Repeats,GC Skew,Percent AT" \ -ann $accession.ann > genomeatlas.pdf Open the PDF file:m12/genomeatlas.pdf
rm -f blast.cfg foreach file ( `ls -1 ~stud125/upload/*proteins.fsa | sort` ) if ( -s $file > 0 ) then set acc = `echo $file | perl -ne 'print $1 if /([^\/]+)\.proteins\.fsa/;'` set organism = `getgene $acc | perl -ne 'if (/^ ORGANISM /) {s/^ ORGANISM //; s/^([^ ])[^ ]* ([^ 0-9]+( .*)?)$/$1. $2/; print;}'` echo "$organism\tblastp\t5,5,0\t0.5\t$file" >> blast.cfg endif end less blast.cfgQUESTION What do you thing the double redirect ('>>') means?
perl genomeatlas.pl -ref $accession.fsa -t "$organism" \ -dnap "Intrinsic Curvature,Stacking Energy,Position Preference,Global Direct Repeats,Global Inverted Repeats,GC Skew,Percent AT" \ -proteins $accession.proteins.fsa -ann $accession.ann -blastcfg blast.cfg > blastatlas.pdf Open the PDF file:m12/blastatlas.pdf
echo "SIDD -0.035\t0,0,10\t10,10,10\t5\t10\t$accession.sidd35dG" > custom.cfg perl genomeatlas.pl -ref $accession.fsa -t "$organism" \ -dnap "Intrinsic Curvature,Stacking Energy,Position Preference,Global Direct Repeats,Global Inverted Repeats,GC Skew,Percent AT" \ -proteins $accession.proteins.fsa -ann $accession.ann -customcfg custom.cfg -blastcfg blast.cfg > customatlas.pdf Open the PDF file:m12/customatlas.pdf
perl genomeatlas.pl -ref $accession.fsa -t "$organism" \ -begin 1 -end 10000 -dnap "Intrinsic Curvature,Stacking Energy,Position Preference,Global Direct Repeats,Global Inverted Repeats,GC Skew,Percent AT" \ -proteins $accession.proteins.fsa -ann $accession.ann -customcfg custom.cfg -blastcfg blast.cfg > zoomatlas.pdf Open the PDF file:m12/zoomatlas.pdfCHALLENGE: Try to identify a region where your organism contain unique properties (or not very common) and make a zoom around the window (not more than 20,000 nucleotides). Yesterday, you learned how to make a BLAST search of your own predicted proteins against Swiss Prot and you should now use the same approach to see if you can find the function of the unique features. Hint: use saco_convert -I fasta -O tab < $accession.protein.fsa to convert this into a tab file, and after this you can for example use grep to find sequences having the coordinate of the genes you are looking for. Instead of writing this to 'zoomatlas.pdf', write the output to 'challenge.pdf'
Counting the frequency of ATGC in aligned, experimentally verified
σ70 minus 35 hexamers, Huerta and co-workers have established
the following weight matrix:
![]() Since we are 'surprised' when observing sites with low variability, we define the information content iw (in bits) for each column i as: ![]() -where j iterates over the four bases ATGC, J = 4 is the alphabet length, and w is the position/column in the alignment. |
![]() |
A so-called logo plot can then be constructed which visualizes both the base frequencies as well as the information content at each position: The stack height corresponds to the information content at each position, and the relative heights of the letters correspond to the frequencies of the bases ( see above ). The lower the total stack height, the more equal distribution between ATGC is observed at that position; A stack ofheight 2 corresponds 100% conservation of a single letter, in a four-letter alphabet whereas a stack height of 0 means equal distribution.
To obtain the bit score of how a query sequence fits the model you first have to construct the matrix Rb,p and calculate the sum Rtot:
Additionally a method was proposed by Shultzaberger and co-workers which incorporates an information theory based quantification of the helical phasing between different binding sites to account for various spacers. Since the two hexamers of the -10 and -35 binds to the same molecule, it's spacing is essential for subsequent transcription - not only distance-wise but adding 5 bp to a spacer will change the helical phasing by 180 degrees, making the box inaccessible to the sigma factor.
The ribosomal RNA (rRNA) genes encode essential products for the translation machinery of the bacteria; without the ribosome (and its components) no translation of RNA into protein (=no life!) In other words, the rRNA genes are needed constantly by the cell, however required at different amounts depending on the growth rate of the cell. The rRNA genes are regulated by two σ70 promoters, P1 and P2 upstream of the 16S rRNA (which is the first gene of the operon) The redundancy of having multiple promoters allows the cell to differentially express the genes depending on the conditions. The E. coli genome contains seven rRNA operons.
Using Web Services, we shall construct a workflow to investigate how this promoter structure is conserved throughout different bacterial strains. We will not go into details with the infscan tool, as it simply serves the purpose of demonstrating how bioinformatic tools integrate into bigger workflows.
perl getseq.pl U00096 | perl rnammer_upstream.pl bac ssu > K1216S.u500.fsa
less K1216S.u500.fsa less minus10.pl
perl minus10.pl < K1216S.u500.fsa > K1216S.u500.minus10.out less K1216S.u500.minus10.out perl xyplot.pl < K1216S.u500.minus10.out > K1216S.u500.minus10.ps gmake K1216S.u500.minus10.pdf Open the PDF file:m12/K1216S.u500.minus10.pdfAs you may have observed, this model is inadequate to model P1 and P2. We will now try running the same sequence using the full P1 and P2 models:
less P1.pl
perl P1.pl < K1216S.u500.fsa > K1216S.u500.P1.out less K1216S.u500.P1.out perl xyplot.pl < K1216S.u500.P1.out > K1216S.u500.P1.ps gmake K1216S.u500.P1.pdf Open the PDF file:m12/K1216S.u500.P1.pdf
perl P2.pl < K1216S.u500.fsa > K1216S.u500.P2.out less K1216S.u500.P2.out perl xyplot.pl < K1216S.u500.P2.out > K1216S.u500.P2.ps gmake K1216S.u500.P2.pdf Open the PDF file:m12/K1216S.u500.P2.pdf
perl P1.pl < K1216S.u500.fsa | perl infsort.pl perl P2.pl < K1216S.u500.fsa | perl infsort.pl
set accession = AE017334 perl getseq.pl $accession | perl rnammer_upstream.pl bac ssu > $accession.500.fsa perl P1.pl < $accession.500.fsa > $accession.p1.profile perl infsort.pl < $accession.p1.profile > $accession.p1.top less $accession.p1.top perl xyplot.pl < $accession.p1.profile > $accession.p1.profile.ps gmake $accession.p1.profile.pdf Open the PDF file:m12/AE017334.p1.profile.pdf
set accession = AE017334 perl P2.pl < $accession.500.fsa > $accession.p2.profile perl infsort.pl < $accession.p2.profile > $accession.p2.top less $accession.p2.top perl xyplot.pl < $accession.p2.profile > $accession.p2.profile.ps gmake $accession.p2.profile.pdf Open the PDF file:m12/AE017334.p2.profile.pdf