This lecture is about ways of looking at DNA sequences in complete genomes and chromosomes, in terms of symmetry elements. There are three parts to this talk. In Part 1, I will discuss briefly the rice genome sequence, which was published last Friday. From here, I will go on to the fact that we simply have "Too Much Information" becoming available, and the problem will only get worse in the near future. There are ways of cataloging and organising the data, of course. I have found that the true diversity of genome sizes in Nature is often neglected, so we'll talk for a few minutes about the "C-value paradox", along with some possible ideas for WHY certain organisms have so much DNA.
In Part 2, I will introduce "DNA Atlases", which are a way of visualising information in completely sequenced genomes. There are two ways in which we can try to deal with the explosion of information: machine learning approaches (neural nets, HMMs, etc.) and visualisation methods.
In Part 3, I will introduce the idea of DNA structures as a source of genomic information. I would like to think that one way of dealing with the explosion of DNA sequence information is to think about it in biological terms, in particular in physical-chemical terms: the structure and function of symmetry elements. For example, there are specific DNA sequences which "code" for a telomere, and different DNA sequences which are specific for centromeres. Specific DNA sequences, their structures, and biological functions will be discussed.
Brevis esse laboro, obscurus fio ("I strive to be brief, and I become obscure"). - Horace
The information in GenBank is doubling every year (or maybe even faster).
What are the implications of this?
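To get a feel for the implications, here is a small sketch of what steady doubling means. The one-year doubling time is the figure quoted above; the function name and the chosen time horizons are only illustrative assumptions.

```python
def fold_increase(years, doubling_time_years=1.0):
    """Fold increase in database size after `years` of steady doubling."""
    return 2 ** (years / doubling_time_years)


# Hypothetical horizons, chosen only for illustration.
for years in (1, 5, 10):
    print(f"after {years:2d} years: {fold_increase(years):,.0f}-fold more sequence")
```

If the doubling time really stays at one year, a decade brings roughly a thousand-fold more data, which is why cataloguing and visualisation matter.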
A look at genome sequencing since 1994:
| YEAR | # GENOMES Sequenced |
|---|---|
| 1994 | |
| 1995 | |
| 1996 | |
| 1997 | |
| 1998 | |
| 1999 | |
| 2000 | |
| 2001 | |
Although the number of genomes being sequenced is increasing rapidly, one has to put this into perspective - the organisms can be placed into four different classes:
| Organism group | Size (bp) | No. sequenced |
|---|---|---|
| viruses | ~300 bp to ~350,000 bp | 728 |
| prokaryotes | ~250,000 to ~15,000,000 bp | 112 |
| single-celled eukaryotes | ~5,000,000 to ~600,000,000,000 bp | 6 |
| multi-celled eukaryotes | ~20,000,000 to ~500,000,000,000 bp | 8 |

| Drosophila species | Genome Size (in base pairs) |
|---|---|
| D. americana | ~300,000,000 bp |
| D. arizonensis | ~225,000,000 bp |
| D. eohydei (male) | ~234,000,000 bp |
| D. eohydei (female) | ~246,000,000 bp |
| D. funebris | ~255,000,000 bp |
| D. hydei | ~202,000,000 bp |
| D. melanogaster | ~180,000,000 bp (~138,000,000 bp sequenced) |
| D. miranda | ~300,000,000 bp |
| D. nasutoides | ~800,000,000 bp |
| D. neohydei | ~192,000,000 bp |
| D. simulans | ~127,500,000 bp |
| D. virilis | ~345,000,000 bp |
In summary, the genome sizes of the Drosophila species that have been examined so far range from about 127 million bp to about 800 million bp, a roughly six-fold spread. At present we SUSPECT that they all contain roughly the same number of genes, although it is possible (even likely) that some contain duplicated regions or perhaps entire duplicated chromosomes; the larger genomes have ample space for one or more extra copies of the entire genome. They also contain various types of repeats, known as "selfish DNA".
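The spread in the table can be checked with a few lines of arithmetic. This is only a sketch using the approximate sizes listed above; the dictionary and variable names are illustrative.

```python
# Approximate genome sizes (bp), copied from the Drosophila table above.
drosophila_bp = {
    "D. americana": 300_000_000,
    "D. arizonensis": 225_000_000,
    "D. eohydei (male)": 234_000_000,
    "D. eohydei (female)": 246_000_000,
    "D. funebris": 255_000_000,
    "D. hydei": 202_000_000,
    "D. melanogaster": 180_000_000,
    "D. miranda": 300_000_000,
    "D. nasutoides": 800_000_000,
    "D. neohydei": 192_000_000,
    "D. simulans": 127_500_000,
    "D. virilis": 345_000_000,
}

smallest = min(drosophila_bp, key=drosophila_bp.get)
largest = max(drosophila_bp, key=drosophila_bp.get)
fold = drosophila_bp[largest] / drosophila_bp[smallest]

print(f"smallest: {smallest} ({drosophila_bp[smallest]:,} bp)")
print(f"largest:  {largest} ({drosophila_bp[largest]:,} bp)")
print(f"fold difference within one genus: {fold:.1f}")
```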

One way of dealing with the problem of how to display so much sequence information is to have a look at the whole chromosome at once, smoothing over a large window. The entire bacterial chromosome is displayed as a circle, with different colours representing various parameters. First, as an introduction to atlases, we will look at base composition. Then we will have a look at levels of expression of mRNA and proteins throughout the chromosome. As an example, I will use the first "organism" to be sequenced, Escherichia coli bacteriophage phi-X174. This was sequenced about 30 years ago; Fred Sanger got the Nobel prize for sequencing this (actually, for developing a faster way of sequencing DNA). It took him a couple of years to sequence this; to put it into perspective, at the same rate, it would take more than a MILLION years to sequence the human genome!
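The "million years" figure follows from simple scaling. The sketch below assumes that "a couple of years" means two years and that the human genome is roughly 3 billion bp; both numbers are assumptions made for illustration, and only the 5386 bp genome length is taken from the phi-X174 GenBank record.

```python
PHIX174_BP = 5386                 # genome length given in the GenBank record
YEARS_FOR_PHIX174 = 2             # assumption: "a couple of years" = 2
HUMAN_GENOME_BP = 3_000_000_000   # assumption: a round figure of ~3 billion bp

rate_bp_per_year = PHIX174_BP / YEARS_FOR_PHIX174
years_for_human = HUMAN_GENOME_BP / rate_bp_per_year

print(f"rate: {rate_bp_per_year:,.0f} bp per year")
print(f"human genome at that rate: about {years_for_human:,.0f} years")
```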
LOCUS NC_001422 5386 bp ss-DNA circular PHG 31-JUL-2001 DEFINITION Coliphage phiX174, complete genome. ACCESSION NC_001422 VERSION NC_001422.1 GI:9626372 KEYWORDS . SOURCE coliphage phiX174. ORGANISM coliphage phiX174 Viruses; ssDNA viruses; Microviridae; Microvirus. REFERENCE 1 (bases 2370 to 2421) AUTHORS Robertson,H.D., Barrell,B.G., Weith,H.L. and Donelson,J.E. TITLE Isolation and sequence analysis of a ribosome protected fragment from bacteriophage phi-X 174 DNA JOURNAL Nature New Biol. 241, 38-40 (1973) MEDLINE 73161742 REFERENCE 2 (bases 1047 to 1094) AUTHORS Ziff,E.B., Sedat,J.W. and Galibert,F. TITLE Determination of the nucleotide sequence of a fragment of bacteriophage phi-X 174 DNA JOURNAL Nature New Biol. 241, 34-37 (1973) MEDLINE 73161741 REFERENCE 3 (bases 2370 to 2420) AUTHORS Barrell,B.G., Weith,H.L., Donelson,J.E. and Robertson,H.D. TITLE Sequence analysis of the ribosome-protected bacteriophage phi-X174 DNA fragment containing the gene G initiation site JOURNAL J. Mol. Biol. 92, 377-393 (1975) MEDLINE 75192039 REFERENCE 4 (bases 2365 to 2591) AUTHORS Air,G.M., Blackburn,E.H., Sanger,F. and Coulson,A.R. TITLE The nucleotide and amino acid sequences of the N (5') terminal region of gene G of bacteriophage phi-X174 JOURNAL J. Mol. Biol. 96, 703-719 (1975) MEDLINE 76072037 REFERENCE 5 (bases 4137 to 4207) AUTHORS van Mansfeld,A.D.M., Vereijken,J.M. and Jansz,H.S. TITLE The nucleotide sequence of a DNA fragment, 71 base pairs in length, near the origin of DNA replication of bacteriophage phi-X174 JOURNAL Nucleic Acids Res. 3, 2827-2844 (1976) MEDLINE 77057432 REFERENCE 6 (bases 2395 to 2922) AUTHORS Air,G.M., Sanger,F. and Coulson,A.R. TITLE Nucleotide and amino acid sequences of gene G of phi-X174 JOURNAL J. Mol. Biol. 108, 519-533 (1976) MEDLINE 77121207 REFERENCE 7 (bases 1017 to 1762) AUTHORS Air,G.M., Blackburn,E.H., Coulson,A.R., Galibert,F., Sanger,F., Sedat,J.W. and Ziff,E.B. TITLE Gene F of bacteriophage phi-x174. 
Correlation of nucleotide sequences from the DNA and amino acid sequences from the gene product JOURNAL J. Mol. Biol. 107, 445-458 (1976) MEDLINE 77074163 REFERENCE 8 (bases 730 to 903) AUTHORS Blackburn,E.H. TITLE Transcription and sequence analysis of a fragment of bacteriophage phi-X174 DNA JOURNAL J. Mol. Biol. 107, 417-431 (1976) MEDLINE 77074161 REFERENCE 9 (bases 1017 to 1081) AUTHORS Sedat,J., Ziff,E. and Galibert,F. TITLE Direct determination of DNA nucleotide sequences: Structure of large specific fragments of bacteriophage phi-X174 DNA JOURNAL J. Mol. Biol. 107, 391-416 (1976) MEDLINE 77074160 REFERENCE 10 (bases 2263 to 2421) AUTHORS Fiddes,J.C. TITLE Nucleotide sequence of the intercistronic region between genes G and F in bacteriophage phi-X174 DNA JOURNAL J. Mol. Biol. 107, 1-24 (1976) MEDLINE 77074135 REFERENCE 11 (bases 1 to 5375) AUTHORS Sanger,F., Air,G.M., Barrell,B.G., Brown,N.L., Coulson,A.R., Fiddes,J.C., Hutchison,C.A., Slocombe,P.M. and Smith,M. TITLE Nucleotide sequence of bacteriophage phi-X174 DNA JOURNAL Nature 265, 687-695 (1977) MEDLINE 77171175 REFERENCE 12 (bases 4505 to 5374) AUTHORS Brown,N.L. and Smith,M. TITLE The sequence of a region of bacteriophage phi-X174 DNA coding for parts of genes A and B JOURNAL J. Mol. Biol. 116, 1-30 (1977) MEDLINE 78069208 REFERENCE 13 (sites) AUTHORS Fiddes,J.C. TITLE The nucleotide sequence of a viral DNA JOURNAL Sci. Am. 237, 54-67 (1977) MEDLINE 78054683 REFERENCE 14 (bases 5022 to 5132) AUTHORS Brown,N.L. and Smith,M. TITLE DNA sequence of a region of the phi-X174 genome coding for a ribosome binding site JOURNAL Nature 265, 695-698 (1977) MEDLINE 77171176 REFERENCE 15 (bases 5346 to 5386; 1 to 159) AUTHORS Smith,M., Brown,N.L., Air,G.M., Barrell,B.G., Coulson,A.R., Hutchison,C.A.I.I.I. and Sanger,F. 
TITLE DNA sequence at the C termini of the overlapping genes A and B in bacteriophage phi-X174 JOURNAL Nature 265, 702-705 (1977) MEDLINE 77171178 REFERENCE 16 (bases 1 to 5386) AUTHORS Sanger,F., Coulson,A.R., Friedmann,T., Air,G.M., Barrell,B.G., Brown,N.L., Fiddes,J.C., Hutchison,C.A., Slocombe,P.M. and Smith,M. TITLE The nucleotide sequence of bacteriophage phi-X174 JOURNAL J. Mol. Biol. 125, 225-246 (1978) MEDLINE 79091185 REFERENCE 17 (bases 1290 to 1302; 1340 to 1430; 1510 to 1570; 1600 to 1750) AUTHORS Air,G.M., Coulson,A.R., Fiddes,J.C., Friedmann,T., Hutchison,C.A., Sanger,F., Slocombe,P.M. and Smith,A.J. TITLE Nucleotide sequence of the F protein coding region of bacteriophage phi-X174 and the amino acid sequence of its product JOURNAL J. Mol. Biol. 125, 247-254 (1978) MEDLINE 79091186 REFERENCE 18 (bases 4256 to 4317) AUTHORS Langeveld,S.A., van Mansfeld,A.D.M., de Winter,J.M. and Weisbeek,P.J. TITLE Cleavage of single-stranded DNA by the A and A* proteins of bacteriophage phi-X174 JOURNAL Nucleic Acids Res. 7, 2177-2188 (1979) MEDLINE 80101074 REFERENCE 19 (bases 4248 to 4332) AUTHORS Heidekamp,F., Langeveld,S.A., Baas,P.D. and Jansz,H.S. TITLE Studies of the recognition sequence of phi-X174 gene A protein. Cleavage site of phi-X gene A protein in St-1 RFI DNA JOURNAL Nucleic Acids Res. 8, 2009-2021 (1980) MEDLINE 81053861 REFERENCE 20 (bases 436 to 490; 630 to 669; 930 to 979) AUTHORS Takeshita,M., Kappen,L.S., Grollman,A.P., Eisenberg,M. and Goldberg,I.H. TITLE Strand scission of deoxyribonucleic acid by neocarzinostatin, auromomycin, and bleomycin: studies on base release and nucleotide sequence specificity JOURNAL Biochemistry (N.Y.) 20 (26), 7599-7606 (1981) MEDLINE 82113627 PUBMED 6173064 REFERENCE 21 (bases 1064 to 1757) AUTHORS Melville,M.-P., Piette,J., Lopez,M., Decuyper,J. and van de Vorst,A. TITLE Termination sites of the in vitro DNA sysnthesis on single-stranded DNA photosensitized by promazines JOURNAL J. Biol. Chem. 
259, 15069-15077 (1984) MEDLINE 85079985 REFERENCE 22 (bases 449 to 482; 504 to 598; 1047 to 1111) AUTHORS Ueda,K., Morita,J. and Komano,T. TITLE Sequence specificity of heat-labile sites in DNA induced by mitomycin C JOURNAL Biochemistry (N.Y.) 23 (8), 1634-1640 (1984) MEDLINE 84203526 PUBMED 6232949 REFERENCE 23 (bases 2380 to 2512; 2593 to 2786; 2788 to 2947) AUTHORS Air,G.M., Els,M.C., Brown,L.E., Laver,W.G. and Webster,R.G. TITLE Location of antigenic sites in the three-dimensional structure of the influenza N2 virus neuraminidase JOURNAL Virology 145, 237-248 (1985) MEDLINE 85274373 COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The reference sequence was derived from J02482. [8] intermittent sequences. [15] review; discussion of complete genome. Double checked with sumex tape. Single-stranded circular DNA which codes for eleven proteins. Replicative form is duplex, icosahedron, related to s13 & g4. [21] indicates that mitomycin C reduced with sodium borohydride induced heat-labile sites in DNA most preferentially at dinucleotide sequence 'gt' (especially 'Pu-g-t'). Bacteriophage phi-X174 single stranded DNA molecules were irradiated with near UV light in the presence of promazine derivatives, after priming with restriction fragments or synthetic primers [22]. The resulting DNA fragments were used as templates for in vitro complementary chain synthesis by E.coli DNA polymerase I [22]. More than 90% of the observed chain terminations were mapped one nucleotide before a guanine residue [22]. Photoreaction occurred more predominantly with guanine residues localized in single-stranded parts of the genome [22]. These same guanine residues could also be damaged when the reaction was performed in the dark, in the presence of promazine cation radicals [22]. 
FEATURES Location/Qualifiers source 1..5386 /organism="coliphage phiX174" /specific_host="Escherichia coli" /db_xref="taxon:10847" variation 23 /note="c in wt; t in am18 and am35 [14]" variation 25 /note="g in wt; c in ts116 [14]" CDS 51..221 /note="K (function unknown)" /codon_start=1 /transl_table=11 /protein_id="NP_040706.1" /db_xref="GI:9626376" /translation="MSRKIILIKQELLLLVYELNRSGLLAENEKIRPILAQLEKLLLC DLSPSTNDSVKN" variation 57 /note="c in wt; t in am6 [14]" variation 117 /note="g in wt; a in am6 [14]" CDS 133..393 /note="C (DNA maturation)" /codon_start=1 /transl_table=11 /protein_id="NP_040707.1" /db_xref="GI:9626377" /translation="MRKFDLSLRSSRSSYFATFRHQLTILSKTDALDEEKWLNMLGTF VKDWFRYESHFVHGRDSLVDILKERGLLSESDAVQPLIGKKS" mRNA 358..3975 /note="mRNA (major alt.)" mRNA 358..991 /note="mRNA (minor alt.)" CDS 390..848 /note="D (capsid morphogenesis)" /codon_start=1 /transl_table=11 /protein_id="NP_040708.1" /db_xref="GI:9626378" /translation="MSQVTEQSVRFQTALASIKLIQASAVLDLTEDDFDFLTSNKVWI ATDRSRARRCVEACVYGTLDFVGYPRFPAPVEFIAAVIAYYVHPVNIQTACLIMEGAE FTENIINGVERPVKAAELFAFTLRVRAGNTDVLTDAEENVRQKLRAEGVM" CDS 568..843 /note="E (cell lysis)" /codon_start=1 /transl_table=11 /protein_id="NP_040709.1" /db_xref="GI:9626379" /translation="MVRWTLWDTLAFLLLLSLLLPSLLIMFIPSTFKRPVSSWKALNL RKTLLMASSVRLKPLNCSRLPCVYAQETLTFLLTQKKTCVKNYVRKE" CDS 848..964 /note="J (core protein, DNA condensation)" /codon_start=1 /transl_table=11 /protein_id="NP_040710.1" /db_xref="GI:9626380" /translation="MSKGKKRSGARPGRPQPLRGTKGKRKGARLWYVGGQQF" CDS 1001..2284 /note="F (major coat protein)" /codon_start=1 /transl_table=11 /protein_id="NP_040711.1" /db_xref="GI:9626381" /translation="MSNIQTGAERMPHDLSHLGFLAGQIGRLITISTTPVIAGDSFEM DAVGALRLSPLRRGLAIDSTVDIFTFYVPHRHVYGEQWIKFMKDGVNATPLPTVNTTG YIDHAAFLGTINPDTNKIPKHLFQGYLNIYNNYFKAPWMPDRTEANPNELNQDDARYG FRCCHLKNIWTAPLPPETELSRQMTTSTTSIDIMGLQAAYANLHTDQERDYFMQRYHD VISSFGGKTSYDADNRPLLVMRSNLWASGYDVDGTDQTSLGQFSGRVQQTYKHSVPRF 
FVPEHGTMFTLALVRFPPTATKEIQYLNAKGALTYTDIAGDPVLYGNLPPREISMKDV FRSGDSSKKFKIAEGQWYRYAPSYVSPAYHLLEGFPFIQEPPSGDLQERVLIRHHDYD QCFQSVQLLQWNSQVKFNVTVYRNLPTTRDSIMTS" CDS 2395..2922 /note="G (major spike protein)" /codon_start=1 /transl_table=11 /protein_id="NP_040712.1" /db_xref="GI:9626382" /translation="MFQTFISRHNSNFFSDKLVLTSVTPASSAPVLQTPKATSSTLYF DSLTVNAGNGGFLHCIQMDTSVNAANQVVSVGADIAFDADPKFFACLVRFESSSVPTT LPTAYDVYPLNGRHDGGYYTVKDCVTIDVLPRTPGNNVYVGFMVWSNFTATKCRGLVS LNQVIKEIICLQPLK" CDS 2931..3917 /note="H (minor spike protein, adsorption)" /codon_start=1 /transl_table=11 /protein_id="NP_040713.1" /db_xref="GI:9626383" /translation="MFGAIAGGIASALAGGAMSKLFGGGQKAASGGIQGDVLATDNNT VGMGDAGIKSAIQGSNVPNPDEAAPSFVSGAMAKAGKGLLEGTLQAGTSAVSDKLLDL VGLGGKSAADKGKDTRDYLAAAFPELNAWERAGADASSAGMVDAGFENQKELTKMQLD NQKEIAEMQNETQKEIAGIQSATSRQNTKDQVYAQNEMLAYQQKESTARVASIMENTN LSKQQQVSEIMRQMLTQAQTAGQYFTNDQIKEMTRKVSAEVDLVHQQTQNQRYGSSHI GATAKDISNVVTDAASGVVDIFHGIDKAVADTWNNFWKDGKADGIGSNLSRK" misc_feature 3962 /note="transcription start site" CDS join(3981..5386,1..136) /note="A (rf replication, viral strand synthesis)" /codon_start=1 /transl_table=11 /protein_id="NP_040703.1" /db_xref="GI:9626373" /translation="MVRSYYPSECHADYFDFERIEALKPAIEACGISTLSQSPMLGFH KQMDNRIKLLEEILSFRMQGVEFDNGDMYVDGHKAASDVRDEFVSVTEKLMDELAQCY NVLPQLDINNTIDHRPEGDEKWFLENEKTVTQFCRKLAAERPLKDIRDEYNYPKKKGI KDECSRLLEASTMKSRRGFAIQRLMNAMRQAHADGWFIVFDTLTLADDRLEAFYDNPN ALRDYFRDIGRMVLAAEGRKANDSHADCYQYFCVPEYGTANGRLHFHAVHFMRTLPTG SVDPNFGRRVRNRRQLNSLQNTWPYGYSMPIAVRYTQDAFSRSGWLWPVDAKGEPLKA TSYMAVGFYVAKYVNKKSDMDLAAKGLGAKEWNNSLKTKLSLLPKKLFRIRMSRNFGM KMLTMTNLSTECLIQLTKLGYDATPFNQILKQNAKREMRLRLGKVTVADVLAAQPVTT NLLKFMRASIKMIGVSNLQSFIASMTQKLTLSDISDESKNYLDKAGITTACLRIKSKW TAGGK" rep_origin 4306 /note="origin of viral strand synthesis" CDS join(4497..5386,1..136) /note="A* (shut off host DNA synthesis)" /codon_start=1 /transl_table=11 /protein_id="NP_040704.1" /db_xref="GI:9626374" /translation="MKSRRGFAIQRLMNAMRQAHADGWFIVFDTLTLADDRLEAFYDN 
PNALRDYFRDIGRMVLAAEGRKANDSHADCYQYFCVPEYGTANGRLHFHAVHFMRTLP TGSVDPNFGRRVRNRRQLNSLQNTWPYGYSMPIAVRYTQDAFSRSGWLWPVDAKGEPL KATSYMAVGFYVAKYVNKKSDMDLAAKGLGAKEWNNSLKTKLSLLPKKLFRIRMSRNF GMKMLTMTNLSTECLIQLTKLGYDATPFNQILKQNAKREMRLRLGKVTVADVLAAQPV TTNLLKFMRASIKMIGVSNLQSFIASMTQKLTLSDISDESKNYLDKAGITTACLRIKS KWTAGGK" misc_feature 4899 /note="transcription start site" CDS join(5075..5386,1..51) /note="B (capsid morphogenesis)" /codon_start=1 /transl_table=11 /protein_id="NP_040705.1" /db_xref="GI:9626375" /translation="MEQLTKNQAVATSQEAVQNQNEPQLRDENAHNDKSVHGVLNPTY QAGLRRDAVQPDIEAERKKRDEIEAGKSYCSRRFGGATCDDKSAQIYARFDKNDWRIQ PAEFYRFHDAEVNTFGYF" BASE COUNT 1291 a 1157 c 1254 g 1684 t ORIGIN 1 gagttttatc gcttccatga cgcagaagtt aacactttcg gatatttctg atgagtcgaa 61 aaattatctt gataaagcag gaattactac tgcttgttta cgaattaaat cgaagtggac 121 tgctggcgga aaatgagaaa attcgaccta tccttgcgca gctcgagaag ctcttacttt 181 gcgacctttc gccatcaact aacgattctg tcaaaaactg acgcgttgga tgaggagaag 241 tggcttaata tgcttggcac gttcgtcaag gactggttta gatatgagtc acattttgtt 301 catggtagag attctcttgt tgacatttta aaagagcgtg gattactatc tgagtccgat 361 gctgttcaac cactaatagg taagaaatca tgagtcaagt tactgaacaa tccgtacgtt 421 tccagaccgc tttggcctct attaagctca ttcaggcttc tgccgttttg gatttaaccg 481 aagatgattt cgattttctg acgagtaaca aagtttggat tgctactgac cgctctcgtg 541 ctcgtcgctg cgttgaggct tgcgtttatg gtacgctgga ctttgtggga taccctcgct 601 ttcctgctcc tgttgagttt attgctgccg tcattgctta ttatgttcat cccgtcaaca 661 ttcaaacggc ctgtctcatc atggaaggcg ctgaatttac ggaaaacatt attaatggcg 721 tcgagcgtcc ggttaaagcc gctgaattgt tcgcgtttac cttgcgtgta cgcgcaggaa 781 acactgacgt tcttactgac gcagaagaaa acgtgcgtca aaaattacgt gcggaaggag 841 tgatgtaatg tctaaaggta aaaaacgttc tggcgctcgc cctggtcgtc cgcagccgtt 901 gcgaggtact aaaggcaagc gtaaaggcgc tcgtctttgg tatgtaggtg gtcaacaatt 961 ttaattgcag gggcttcggc cccttacttg aggataaatt atgtctaata ttcaaactgg 1021 cgccgagcgt atgccgcatg acctttccca tcttggcttc cttgctggtc agattggtcg 1081 tcttattacc atttcaacta ctccggttat cgctggcgac tccttcgaga 
tggacgccgt 1141 tggcgctctc cgtctttctc cattgcgtcg tggccttgct attgactcta ctgtagacat 1201 ttttactttt tatgtccctc atcgtcacgt ttatggtgaa cagtggatta agttcatgaa 1261 ggatggtgtt aatgccactc ctctcccgac tgttaacact actggttata ttgaccatgc 1321 cgcttttctt ggcacgatta accctgatac caataaaatc cctaagcatt tgtttcaggg 1381 ttatttgaat atctataaca actattttaa agcgccgtgg atgcctgacc gtaccgaggc 1441 taaccctaat gagcttaatc aagatgatgc tcgttatggt ttccgttgct gccatctcaa 1501 aaacatttgg actgctccgc ttcctcctga gactgagctt tctcgccaaa tgacgacttc 1561 taccacatct attgacatta tgggtctgca agctgcttat gctaatttgc atactgacca 1621 agaacgtgat tacttcatgc agcgttacca tgatgttatt tcttcatttg gaggtaaaac 1681 ctcttatgac gctgacaacc gtcctttact tgtcatgcgc tctaatctct gggcatctgg 1741 ctatgatgtt gatggaactg accaaacgtc gttaggccag ttttctggtc gtgttcaaca 1801 gacctataaa cattctgtgc cgcgtttctt tgttcctgag catggcacta tgtttactct 1861 tgcgcttgtt cgttttccgc ctactgcgac taaagagatt cagtacctta acgctaaagg 1921 tgctttgact tataccgata ttgctggcga ccctgttttg tatggcaact tgccgccgcg 1981 tgaaatttct atgaaggatg ttttccgttc tggtgattcg tctaagaagt ttaagattgc 2041 tgagggtcag tggtatcgtt atgcgccttc gtatgtttct cctgcttatc accttcttga 2101 aggcttccca ttcattcagg aaccgccttc tggtgatttg caagaacgcg tacttattcg 2161 ccaccatgat tatgaccagt gtttccagtc cgttcagttg ttgcagtgga atagtcaggt 2221 taaatttaat gtgaccgttt atcgcaatct gccgaccact cgcgattcaa tcatgacttc 2281 gtgataaaag attgagtgtg aggttataac gccgaagcgg taaaaatttt aatttttgcc 2341 gctgaggggt tgaccaagcg aagcgcggta ggttttctgc ttaggagttt aatcatgttt 2401 cagactttta tttctcgcca taattcaaac tttttttctg ataagctggt tctcacttct 2461 gttactccag cttcttcggc acctgtttta cagacaccta aagctacatc gtcaacgtta 2521 tattttgata gtttgacggt taatgctggt aatggtggtt ttcttcattg cattcagatg 2581 gatacatctg tcaacgccgc taatcaggtt gtttctgttg gtgctgatat tgcttttgat 2641 gccgacccta aattttttgc ctgtttggtt cgctttgagt cttcttcggt tccgactacc 2701 ctcccgactg cctatgatgt ttatcctttg aatggtcgcc atgatggtgg ttattatacc 2761 gtcaaggact gtgtgactat tgacgtcctt ccccgtacgc cgggcaataa cgtttatgtt 
2821 ggtttcatgg tttggtctaa ctttaccgct actaaatgcc gcggattggt ttcgctgaat 2881 caggttatta aagagattat ttgtctccag ccacttaagt gaggtgattt atgtttggtg 2941 ctattgctgg cggtattgct tctgctcttg ctggtggcgc catgtctaaa ttgtttggag 3001 gcggtcaaaa agccgcctcc ggtggcattc aaggtgatgt gcttgctacc gataacaata 3061 ctgtaggcat gggtgatgct ggtattaaat ctgccattca aggctctaat gttcctaacc 3121 ctgatgaggc cgcccctagt tttgtttctg gtgctatggc taaagctggt aaaggacttc 3181 ttgaaggtac gttgcaggct ggcacttctg ccgtttctga taagttgctt gatttggttg 3241 gacttggtgg caagtctgcc gctgataaag gaaaggatac tcgtgattat cttgctgctg 3301 catttcctga gcttaatgct tgggagcgtg ctggtgctga tgcttcctct gctggtatgg 3361 ttgacgccgg atttgagaat caaaaagagc ttactaaaat gcaactggac aatcagaaag 3421 agattgccga gatgcaaaat gagactcaaa aagagattgc tggcattcag tcggcgactt 3481 cacgccagaa tacgaaagac caggtatatg cacaaaatga gatgcttgct tatcaacaga 3541 aggagtctac tgctcgcgtt gcgtctatta tggaaaacac caatctttcc aagcaacagc 3601 aggtttccga gattatgcgc caaatgctta ctcaagctca aacggctggt cagtatttta 3661 ccaatgacca aatcaaagaa atgactcgca aggttagtgc tgaggttgac ttagttcatc 3721 agcaaacgca gaatcagcgg tatggctctt ctcatattgg cgctactgca aaggatattt 3781 ctaatgtcgt cactgatgct gcttctggtg tggttgatat ttttcatggt attgataaag 3841 ctgttgccga tacttggaac aatttctgga aagacggtaa agctgatggt attggctcta 3901 atttgtctag gaaataaccg tcaggattga caccctccca attgtatgtt ttcatgcctc 3961 caaatcttgg aggctttttt atggttcgtt cttattaccc ttctgaatgt cacgctgatt 4021 attttgactt tgagcgtatc gaggctctta aacctgctat tgaggcttgt ggcatttcta 4081 ctctttctca atccccaatg cttggcttcc ataagcagat ggataaccgc atcaagctct 4141 tggaagagat tctgtctttt cgtatgcagg gcgttgagtt cgataatggt gatatgtatg 4201 ttgacggcca taaggctgct tctgacgttc gtgatgagtt tgtatctgtt actgagaagt 4261 taatggatga attggcacaa tgctacaatg tgctccccca acttgatatt aataacacta 4321 tagaccaccg ccccgaaggg gacgaaaaat ggtttttaga gaacgagaag acggttacgc 4381 agttttgccg caagctggct gctgaacgcc ctcttaagga tattcgcgat gagtataatt 4441 accccaaaaa gaaaggtatt aaggatgagt gttcaagatt gctggaggcc tccactatga 4501 
aatcgcgtag aggctttgct attcagcgtt tgatgaatgc aatgcgacag gctcatgctg 4561 atggttggtt tatcgttttt gacactctca cgttggctga cgaccgatta gaggcgtttt 4621 atgataatcc caatgctttg cgtgactatt ttcgtgatat tggtcgtatg gttcttgctg 4681 ccgagggtcg caaggctaat gattcacacg ccgactgcta tcagtatttt tgtgtgcctg 4741 agtatggtac agctaatggc cgtcttcatt tccatgcggt gcactttatg cggacacttc 4801 ctacaggtag cgttgaccct aattttggtc gtcgggtacg caatcgccgc cagttaaata 4861 gcttgcaaaa tacgtggcct tatggttaca gtatgcccat cgcagttcgc tacacgcagg 4921 acgctttttc acgttctggt tggttgtggc ctgttgatgc taaaggtgag ccgcttaaag 4981 ctaccagtta tatggctgtt ggtttctatg tggctaaata cgttaacaaa aagtcagata 5041 tggaccttgc tgctaaaggt ctaggagcta aagaatggaa caactcacta aaaaccaagc 5101 tgtcgctact tcccaagaag ctgttcagaa tcagaatgag ccgcaacttc gggatgaaaa 5161 tgctcacaat gacaaatctg tccacggagt gcttaatcca acttaccaag ctgggttacg 5221 acgcgacgcc gttcaaccag atattgaagc agaacgcaaa aagagagatg agattgaggc 5281 tgggaaaagt tactgtagcc gacgttttgg cggcgcaacc tgtgacgaca aatctgctca 5341 aatttatgcg cgcttcgata aaaatgattg gcgtatccaa cctgca // |

There are several things to notice in this plot. First, the genome is circular. The density of each of the four nucleotides is plotted in the four outermost circles. This density is not evenly distributed; although all four of the scales range from 0% (min., no colour) to 40% (max. colour intensity), it can be easily seen that the sequence is dominated by T's (red circle), and that there are relatively few G's (outermost turquoise circle) and C's (pink circle), and a few A-rich regions (green 2nd circle).
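As a rough cross-check on the plot, the BASE COUNT line of the GenBank record gives the whole-genome totals (1291 a, 1157 c, 1254 g, 1684 t). The sketch below turns these into percentages; the variable names are illustrative, and note that these are genome-wide averages, whereas the atlas shows local densities.

```python
# Totals copied from the BASE COUNT line of the phi-X174 GenBank record.
base_count = {"a": 1291, "c": 1157, "g": 1254, "t": 1684}

total = sum(base_count.values())
percent = {base: 100 * n / total for base, n in base_count.items()}
at_content = percent["a"] + percent["t"]

for base, value in percent.items():
    print(f"{base.upper()}: {value:.1f}%")
print(f"A+T content: {at_content:.1f}% of {total} bases")
```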
There are many genes which overlap (the genes are indicated in the "annotation circle", which is the fifth circle from the outside, with the blue bands representing genes in the forward direction). Note that all the genes are oriented in the same direction. Whilst this is true for some viruses, it is rarely true for organisms with larger genomes. There is a strong bias towards T's over A's (AT skew, which is simply #A's minus #T's over a given window) and a weaker bias of G's over C's (GC skew). Finally, the genome is generally AT rich (red in innermost circle).
Genome Atlas for E. coli Phi-X174 bacteriophage
Notice that some of the structural features (such as the perfect palindromes circle) are often found near the end of genes. A more detailed explanation for the various parameters will be given in the next lecture, but for now the important point is that there is much information in the sequence which can be visualised in the atlas, which is not readily apparent from merely looking at the GenBank file alone.
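The skew values described above can be sketched as a sliding-window count. This is a minimal illustration of "#A's minus #T's over a given window" on a circular sequence, not the code used to draw the atlas; the function name, window size and step are assumptions, and the demo string is just the first 40 bases of the phi-X174 record.

```python
def window_skew(seq, x, y, window, step):
    """Return (#x - #y) for each window along a circular sequence."""
    seq = seq.lower()
    n = len(seq)
    # Append the start of the sequence so windows can wrap around the circle.
    wrapped = seq + seq[:window - 1]
    skews = []
    for start in range(0, n, step):
        chunk = wrapped[start:start + window]
        skews.append(chunk.count(x) - chunk.count(y))
    return skews


# First 40 bases of the phi-X174 sequence (ORIGIN line 1 of the record).
demo = "gagttttatcgcttccatgacgcagaagttaacactttcg"

print("AT skew:", window_skew(demo, "a", "t", window=10, step=10))
print("GC skew:", window_skew(demo, "g", "c", window=10, step=10))
```

Negative AT skew values mean more T's than A's in that window. For a real atlas the window would be much larger than 10 bp, so that the signal is smoothed.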
Part 3: DNA Structures and Symmetry Elements
A Brief Introduction to [a few] Alternative Conformations of DNA
DNA symmetry elements defined
ONE possible way of trying to deal with all this information is to develop methods of visualising DNA structures within bacterial chromosomes. The method I have chosen to talk about today is based on two different groups of "DNA symmetry elements". The first group is simply various types of repeats. The second group is DNA helix families, whose conformations are favoured by certain stretches of purines (or pyrimidines) for A-DNA, and by certain stretches of alternating pyrimidines and purines for Z-DNA. The various conformations of these different sequences have putative biological functions, based in part on these structures. The repeats will be discussed first.
A. DNA Repeats
From a DNA sequence perspective, there are 4 types of repeats:
- Direct Repeats
  - Simple Tandem Repeats
  - (Longer) Tandem Repeats
  - Direct (non-tandem)
  - Phased Repeats
- Inverted Repeats
- Mirror Repeats
- Everted Repeats
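The four repeat types can be told apart by how the second copy relates to the first. The sketch below assumes the following conventions, which are not spelled out in this lecture: a direct repeat is the same sequence again, a mirror repeat is the sequence reversed, an inverted repeat is the reverse complement, and an everted repeat is the complement without reversal. The function names are hypothetical.

```python
# Map each base to its Watson-Crick complement.
COMPLEMENT = str.maketrans("acgt", "tgca")

KINDS = ("direct", "mirror", "inverted", "everted")


def partner(unit, kind):
    """Return the second copy expected for a repeat of the given kind."""
    unit = unit.lower()
    if kind == "direct":      # same sequence, same strand
        return unit
    if kind == "mirror":      # reversed, not complemented
        return unit[::-1]
    if kind == "inverted":    # reverse complement
        return unit.translate(COMPLEMENT)[::-1]
    if kind == "everted":     # complemented, not reversed (assumed definition)
        return unit.translate(COMPLEMENT)
    raise ValueError(f"unknown repeat kind: {kind}")


def classify(first, second):
    """List every repeat kind that relates `first` to `second`."""
    return [kind for kind in KINDS if partner(first, kind) == second.lower()]


print(classify("aagc", "aagc"))
print(classify("aagc", "cgaa"))
print(classify("aagc", "gctt"))
print(classify("aagc", "ttcg"))
```

A pair can belong to more than one class: for a homopolymer such as "aaaa", the direct and mirror partners are identical.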
Table of DNA sequence repeats, structures, and biological functions

| Repeat pattern | Possible structure | Biological function |
|---|---|---|
| Direct repeats, (N)n | recA triple-stranded DNA | homologous recombination; duplications |
| Mirror repeats, (R)n | intermolecular triplex; intramolecular triple-strands | recombination; replication |
| Inverted repeats, (N)n | cruciforms | deletions (in bacteria); insertion sequences |
| Everted repeats, (N)n | parallel stranded DNA | unknown; stabilisation of telomeres(?) |
B. DNA Helix Families
A-DNA (left), B-DNA (middle) and Z-DNA (right) -- 12 bp each
From Dickerson et al. in Cold Spring Harbor Symposium for Quantitative Biology (1982) v47 p13-24.
3 families of DNA helices:
A-DNA family - this is most common for double stranded RNA, RNA/DNA hybrids, as well as for certain DNA sequences, such as long stretches of purines. NMR studies have shown that as few as 5 bp of purines in a row can set up an A-type of helix. Most of the DNA inside of cells is likely to be a mixture of the A- and B-DNA conformations.
B-DNA family - the majority of DNA exists in the "B-DNA form" inside the cells of living organisms. This is the classical "Watson-Crick" structure, although there is considerable sequence-specific variation. Thus, for example, different sequences can have from 9 bp/turn of the helix to 12 bp/turn, depending on the sequence of the DNA! However, on AVERAGE, the DNA is about 10.5 bp/turn.
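The helical repeat translates directly into an average twist angle between neighbouring base pairs, since one full turn is 360 degrees. A small sketch, using the 9, 10.5 and 12 bp/turn figures quoted above (the function name is illustrative):

```python
def twist_per_bp(bp_per_turn):
    """Average rotation (degrees) between adjacent base pairs."""
    return 360.0 / bp_per_turn


for bp_per_turn in (9, 10.5, 12):
    print(f"{bp_per_turn:>4} bp/turn -> {twist_per_bp(bp_per_turn):.1f} degrees per bp")
```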
Z-DNA family - this is much rarer than the other two families, although certain sequences (such as runs of GC repeats, e.g. GCGCGC) can form Z-DNA easily. In eukaryotes, CpG islands can form Z-DNA, and methylated CpG islands can form Z-DNA readily in vivo. Furthermore, specific proteins have been isolated which will bind preferentially to the left-handed Z-DNA conformation.
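The sequence rules of thumb mentioned for the helix families (purine stretches for A-DNA, alternating pyrimidine/purine stretches for Z-DNA) can be turned into a simple scan. This is only a sketch for flagging candidate stretches, not a structure predictor: the 5 bp purine threshold comes from the NMR remark above, while the minimum of three alternating steps and all names are assumptions made for illustration.

```python
import re

# At least 5 purines in a row: candidate A-type stretch.
PURINE_RUN = re.compile(r"[ag]{5,}")
# At least 3 pyrimidine-purine steps in a row (assumed threshold):
# candidate Z-forming stretch.
ALT_PYR_PUR = re.compile(r"(?:[ct][ag]){3,}")


def find_stretches(seq, pattern):
    """Return (start, matched_sequence) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in pattern.finditer(seq.lower())]


# Hypothetical test sequence, invented for this example.
demo = "ccgcgcgcgaattaagaggacc"

print("purine runs:      ", find_stretches(demo, PURINE_RUN))
print("alternating Py/Pu:", find_stretches(demo, ALT_PYR_PUR))
```

Only one strand is scanned here; a pyrimidine run on this strand is a purine run on the complementary strand, so a fuller scan would also look for runs of c and t.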
REFERENCES
Papers relevant to this lecture (handed out in class)
Friday (6 April, 2001)
- David W. Ussery, "Genome Databases", The Encyclopedia of Genetics, in press, April, 2001.
- Ussery,D.W., Larsen,T.S., Wilkes,K.T., Friis,C., Worning,P., Krogh,A., Brunak,S. "Genome Organisation and Chromatin Structure in Escherichia coli", Biochimie,83:201-212, (2001).
- Carsten Friis, Lars Juhl Jensen, and David W. Ussery, "Visualisation of Pathogenicity Regions in Bacteria", Genetica, 108:47-51, 2000.
- David W. Ussery, "Bioinformatics2000 Meeting Report", Genome Biology, 1, (#3), 1-2, 2000.
Other references
- Richard R. Sinden, Christopher E. Pearson, Vladimir N. Potaman, and David W. Ussery, "DNA: Structure and Function", Advances in Genome Biology, 5A:1-141, (1998).
- Ussery,D.W., Higgins,C.F., and Bolshoy,A., "Environmental Influences on DNA Curvature", J. Biomolecular Structure & Dynamics,16:811-823, (1999).
To be handed out next lecture (Tuesday, 17 April, 2001)
- David W. Ussery, "DNA Structure: A-, B-, and Z-DNA Families", manuscript submitted to The Encyclopedia of Life Sciences, April, 2000.
- Anders Gorm Pedersen, Lars Juhl Jensen, Hans-Henrik Stærfeldt, Søren Brunak, and David W. Ussery, "A DNA Structural Atlas for Escherichia coli", Journal of Molecular Biology, 299 (#4), 907-930, (2000).
- Lars Juhl Jensen, Carsten Friis, and David W. Ussery, "Three Views of Microbial Genomes", Research in Microbiology, 150, pages 773-777, 1999.
- David W. Ussery, "DNA Denaturation", manuscript submitted to The Encyclopedia of Genetics, September, 2000.
Books about DNA:
- Watson, James D. "A PASSION FOR DNA: Genes, Genomes, and Society", (Oxford University Press, Oxford, 2000).
- Sinden, Richard R., "DNA: STRUCTURE and FUNCTION", (Academic Press, New York, 1994).
- Calladine,C.R., Drew,H.R., "Understanding DNA: The Molecule and How It Works", (2nd edition, Academic Press, San Diego, 1997).
Last modified Tuesday, 9 April, 2002 by David Ussery