DTU Course number 27011
Introduction to Bioinformatics
David Ussery
Tuesday, 9 April, 2002


DNA Symmetry Elements in Whole Genomes





Overview


  1. The Problem: too much information!
  2. DNA Atlases
  3. DNA Symmetry elements


This lecture is about ways of looking at DNA sequences in complete genomes and chromosomes in terms of symmetry elements. There are three parts to this talk. In Part 1, I will briefly discuss the rice genome sequence, which was published last Friday. From there, I will go on to the fact that we simply have "Too Much Information" becoming available, and the problem will only get worse in the near future. There are ways of cataloguing and organising the data, of course. I have found that the true diversity of genome sizes in Nature is often neglected, so we'll talk for a few minutes about the "C-value paradox", along with some possible ideas for WHY certain organisms have so much DNA.


In Part 2, I will introduce "DNA Atlases", which are a way of visualising information in completely sequenced genomes. There are two ways in which we can try to deal with the explosion of information: machine learning approaches (neural nets, HMMs, etc.) and visualisation methods.


In Part 3, I will introduce the idea of DNA structures as a source of genomic information. I would like to think that one way of dealing with the explosion of sequence information, in terms of DNA sequences, is to think about it in biological terms, in particular in physical-chemical terms of structure and function of symmetry elements. For example, there are specific DNA sequences which "code" for a telomere, and different DNA sequences which are specific for centromeres. Specific DNA sequences, their structures, and biological functions will be discussed.








Part 1: The Problem: Too Much Information


Brevis esse laboro, obscurus fio ("I strive to be brief, and I become obscure").     - Horace


Some philosophical thoughts about Information and the Size of Genomes.



The information in GenBank is doubling every year (or maybe even faster).
What are the implications of this?

[Figure: growth in GenBank]



A look at genome sequencing since 1994:

YEAR    # GENOMES SEQUENCED
1994      0
1995      2
1996      4
1997      9
1998     17
1999     30
2000     53
2001     95
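
The counts in the table above can be turned into a rough doubling time. The following is only a minimal sketch in Python: it assumes plain exponential growth between two chosen years (an assumption, not a fitted model), and the function name `doubling_time_years` is made up for illustration. Note that it describes the number of sequenced genomes, not the size of GenBank.

```python
import math

# Number of sequenced genomes per year, copied from the table above.
genomes_by_year = {
    1994: 0, 1995: 2, 1996: 4, 1997: 9,
    1998: 17, 1999: 30, 2000: 53, 2001: 95,
}


def doubling_time_years(counts, start, end):
    """Doubling time implied by the counts at two years.

    Assumes exponential growth between `start` and `end` (illustrative only).
    """
    doublings = math.log2(counts[end] / counts[start])
    return (end - start) / doublings


if __name__ == "__main__":
    t = doubling_time_years(genomes_by_year, 1995, 2001)
    print(f"Implied doubling time, 1995-2001: about {t:.2f} years")
```

With these numbers the implied doubling time comes out at a little over one year, which is in line with the "doubling every year" statement for GenBank.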














The "C-value paradox"

Although the number of genomes being sequenced is increasing rapidly, one has to put this into perspective - the organisms can be placed into four different classes:

Organism group              Size (bp)                                No. sequenced
viruses                     ~300 to ~350,000 bp                      728
prokaryotes                 ~250,000 to ~15,000,000 bp               112
single-celled eukaryotes    ~5,000,000 to ~600,000,000,000 bp        6
multi-celled eukaryotes     ~20,000,000 to ~500,000,000,000 bp       8



This level of variation often does not correlate at all with "biological complexity". For example, a simple amoeba has 600,000,000,000 bp of DNA in its genome, or 200x as much as in humans! As another example, genome sizes in insects range from 20,000,000 bp, or just a bit larger than a bacterial genome, to more than 10 BILLION bp, or much larger than the human genome. Here is a table of different Drosophila species:

Drosophila species       Genome size (in base pairs)
D. americana             ~300,000,000 bp
D. arizonensis           ~225,000,000 bp
D. eohydei (male)        ~234,000,000 bp
D. eohydei (female)      ~246,000,000 bp
D. funebris              ~255,000,000 bp
D. hydei                 ~202,000,000 bp
D. melanogaster          ~180,000,000 bp (~138,000,000 bp sequenced)
D. miranda               ~300,000,000 bp
D. nasutoides            ~800,000,000 bp
D. neohydei              ~192,000,000 bp
D. simulans              ~127,500,000 bp
D. virilis               ~345,000,000 bp

In summary, the genome sizes of the Drosophila species examined so far range from about 127 million bp to about 800 million bp. At present we SUSPECT that they all contain roughly the same number of genes, although it is possible (indeed likely) that the larger genomes contain duplicated regions, or perhaps even duplicated chromosomes; there is ample space for one or more extra copies of the entire genome. In addition, they contain various types of repeats, known as "selfish DNA".
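
To make the spread within this one genus concrete, here is a small sketch in Python that takes the approximate sizes from the Drosophila table and reports the smallest and largest genome and the ratio between them. The variable names are made up for illustration; the numbers are just the rounded values quoted above.

```python
# Approximate genome sizes in base pairs, copied from the Drosophila table.
drosophila_bp = {
    "D. americana": 300_000_000,
    "D. arizonensis": 225_000_000,
    "D. eohydei (male)": 234_000_000,
    "D. eohydei (female)": 246_000_000,
    "D. funebris": 255_000_000,
    "D. hydei": 202_000_000,
    "D. melanogaster": 180_000_000,
    "D. miranda": 300_000_000,
    "D. nasutoides": 800_000_000,
    "D. neohydei": 192_000_000,
    "D. simulans": 127_500_000,
    "D. virilis": 345_000_000,
}

# Species with the smallest and largest listed genome, and the fold difference.
smallest = min(drosophila_bp, key=drosophila_bp.get)
largest = max(drosophila_bp, key=drosophila_bp.get)
fold_range = drosophila_bp[largest] / drosophila_bp[smallest]

if __name__ == "__main__":
    print(f"smallest: {smallest}, largest: {largest}, ratio: {fold_range:.1f}x")
```

So even among closely related flies, the listed genome sizes differ by a factor of roughly six.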








Part 2: DNA Atlases

One way of dealing with the problem of how to display so much sequence information is to look at the whole chromosome at once, smoothing over a large window. The entire bacterial chromosome is displayed as a circle, with different colours representing various parameters. First, as an introduction to atlases, we will look at base composition. Then we will look at levels of expression of mRNA and proteins throughout the chromosome. As an example, I will use the first "organism" to be sequenced, Escherichia coli bacteriophage phi-X174. This was sequenced about 25 years ago (the complete sequence was published in 1977); Fred Sanger got the Nobel prize for sequencing this (actually, for developing a faster way of sequencing DNA). It took him a couple of years to sequence it - to put that into perspective, at the same rate, it would take more than a MILLION years to sequence the human genome!
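
The "million years" figure is simple arithmetic, and it can be checked with a short sketch in Python. Two of the inputs are assumptions of mine rather than numbers from this page: "a couple of years" is taken as 2 years, and the human genome is taken as roughly 3 billion bp. The phi-X174 length of 5386 bp is the one given in its GenBank record.

```python
PHIX174_BP = 5386          # genome length from the phi-X174 GenBank record
YEARS_FOR_PHIX174 = 2      # assumption: "a couple of years" taken as 2
HUMAN_GENOME_BP = 3.0e9    # assumption: roughly 3 billion base pairs


def years_at_same_rate(target_bp,
                       reference_bp=PHIX174_BP,
                       reference_years=YEARS_FOR_PHIX174):
    """Years needed to sequence `target_bp` at the phi-X174 sequencing rate."""
    rate_bp_per_year = reference_bp / reference_years
    return target_bp / rate_bp_per_year


if __name__ == "__main__":
    years = years_at_same_rate(HUMAN_GENOME_BP)
    print(f"About {years:,.0f} years at the phi-X174 rate")
```

Under these assumptions the answer is a little over one million years, which is where the figure in the text comes from.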



LOCUS       NC_001422               5386 bp ss-DNA     circular PHG 31-JUL-2001
DEFINITION  Coliphage phiX174, complete genome.
ACCESSION   NC_001422
VERSION     NC_001422.1  GI:9626372
KEYWORDS    .
SOURCE      coliphage phiX174.
  ORGANISM  coliphage phiX174
            Viruses; ssDNA viruses; Microviridae; Microvirus.
REFERENCE   1  (bases 2370 to 2421)
  AUTHORS   Robertson,H.D., Barrell,B.G., Weith,H.L. and Donelson,J.E.
  TITLE     Isolation and sequence analysis of a ribosome protected fragment
            from bacteriophage phi-X 174 DNA
  JOURNAL   Nature New Biol. 241, 38-40 (1973)
  MEDLINE   73161742
REFERENCE   2  (bases 1047 to 1094)
  AUTHORS   Ziff,E.B., Sedat,J.W. and Galibert,F.
  TITLE     Determination of the nucleotide sequence of a fragment of
            bacteriophage phi-X 174 DNA
  JOURNAL   Nature New Biol. 241, 34-37 (1973)
  MEDLINE   73161741
REFERENCE   3  (bases 2370 to 2420)
  AUTHORS   Barrell,B.G., Weith,H.L., Donelson,J.E. and Robertson,H.D.
  TITLE     Sequence analysis of the ribosome-protected bacteriophage phi-X174
            DNA fragment containing the gene G initiation site
  JOURNAL   J. Mol. Biol. 92, 377-393 (1975)
  MEDLINE   75192039
REFERENCE   4  (bases 2365 to 2591)
  AUTHORS   Air,G.M., Blackburn,E.H., Sanger,F. and Coulson,A.R.
  TITLE     The nucleotide and amino acid sequences of the N (5') terminal
            region of gene G of bacteriophage phi-X174
  JOURNAL   J. Mol. Biol. 96, 703-719 (1975)
  MEDLINE   76072037
REFERENCE   5  (bases 4137 to 4207)
  AUTHORS   van Mansfeld,A.D.M., Vereijken,J.M. and Jansz,H.S.
  TITLE     The nucleotide sequence of a DNA fragment, 71 base pairs in length,
            near the origin of DNA replication of bacteriophage phi-X174
  JOURNAL   Nucleic Acids Res. 3, 2827-2844 (1976)
  MEDLINE   77057432
REFERENCE   6  (bases 2395 to 2922)
  AUTHORS   Air,G.M., Sanger,F. and Coulson,A.R.
  TITLE     Nucleotide and amino acid sequences of gene G of phi-X174
  JOURNAL   J. Mol. Biol. 108, 519-533 (1976)
  MEDLINE   77121207
REFERENCE   7  (bases 1017 to 1762)
  AUTHORS   Air,G.M., Blackburn,E.H., Coulson,A.R., Galibert,F., Sanger,F.,
            Sedat,J.W. and Ziff,E.B.
  TITLE     Gene F of bacteriophage phi-x174. Correlation of nucleotide
            sequences from the DNA and amino acid sequences from the gene
            product
  JOURNAL   J. Mol. Biol. 107, 445-458 (1976)
  MEDLINE   77074163
REFERENCE   8  (bases 730 to 903)
  AUTHORS   Blackburn,E.H.
  TITLE     Transcription and sequence analysis of a fragment of bacteriophage
            phi-X174 DNA
  JOURNAL   J. Mol. Biol. 107, 417-431 (1976)
  MEDLINE   77074161
REFERENCE   9  (bases 1017 to 1081)
  AUTHORS   Sedat,J., Ziff,E. and Galibert,F.
  TITLE     Direct determination of DNA nucleotide sequences: Structure of
            large specific fragments of bacteriophage phi-X174 DNA
  JOURNAL   J. Mol. Biol. 107, 391-416 (1976)
  MEDLINE   77074160
REFERENCE   10 (bases 2263 to 2421)
  AUTHORS   Fiddes,J.C.
  TITLE     Nucleotide sequence of the intercistronic region between genes G
            and F in bacteriophage phi-X174 DNA
  JOURNAL   J. Mol. Biol. 107, 1-24 (1976)
  MEDLINE   77074135
REFERENCE   11 (bases 1 to 5375)
  AUTHORS   Sanger,F., Air,G.M., Barrell,B.G., Brown,N.L., Coulson,A.R.,
            Fiddes,J.C., Hutchison,C.A., Slocombe,P.M. and Smith,M.
  TITLE     Nucleotide sequence of bacteriophage phi-X174 DNA
  JOURNAL   Nature 265, 687-695 (1977)
  MEDLINE   77171175
REFERENCE   12 (bases 4505 to 5374)
  AUTHORS   Brown,N.L. and Smith,M.
  TITLE     The sequence of a region of bacteriophage phi-X174 DNA coding for
            parts of genes A and B
  JOURNAL   J. Mol. Biol. 116, 1-30 (1977)
  MEDLINE   78069208
REFERENCE   13 (sites)
  AUTHORS   Fiddes,J.C.
  TITLE     The nucleotide sequence of a viral DNA
  JOURNAL   Sci. Am. 237, 54-67 (1977)
  MEDLINE   78054683
REFERENCE   14 (bases 5022 to 5132)
  AUTHORS   Brown,N.L. and Smith,M.
  TITLE     DNA sequence of a region of the phi-X174 genome coding for a
            ribosome binding site
  JOURNAL   Nature 265, 695-698 (1977)
  MEDLINE   77171176
REFERENCE   15 (bases 5346 to 5386; 1 to 159)
  AUTHORS   Smith,M., Brown,N.L., Air,G.M., Barrell,B.G., Coulson,A.R.,
            Hutchison,C.A.I.I.I. and Sanger,F.
  TITLE     DNA sequence at the C termini of the overlapping genes A and B in
            bacteriophage phi-X174
  JOURNAL   Nature 265, 702-705 (1977)
  MEDLINE   77171178
REFERENCE   16 (bases 1 to 5386)
  AUTHORS   Sanger,F., Coulson,A.R., Friedmann,T., Air,G.M., Barrell,B.G.,
            Brown,N.L., Fiddes,J.C., Hutchison,C.A., Slocombe,P.M. and Smith,M.
  TITLE     The nucleotide sequence of bacteriophage phi-X174
  JOURNAL   J. Mol. Biol. 125, 225-246 (1978)
  MEDLINE   79091185
REFERENCE   17 (bases 1290 to 1302; 1340 to 1430; 1510 to 1570; 1600 to 1750)
  AUTHORS   Air,G.M., Coulson,A.R., Fiddes,J.C., Friedmann,T., Hutchison,C.A.,
            Sanger,F., Slocombe,P.M. and Smith,A.J.
  TITLE     Nucleotide sequence of the F protein coding region of bacteriophage
            phi-X174 and the amino acid sequence of its product
  JOURNAL   J. Mol. Biol. 125, 247-254 (1978)
  MEDLINE   79091186
REFERENCE   18 (bases 4256 to 4317)
  AUTHORS   Langeveld,S.A., van Mansfeld,A.D.M., de Winter,J.M. and
            Weisbeek,P.J.
  TITLE     Cleavage of single-stranded DNA by the A and A* proteins of
            bacteriophage phi-X174
  JOURNAL   Nucleic Acids Res. 7, 2177-2188 (1979)
  MEDLINE   80101074
REFERENCE   19 (bases 4248 to 4332)
  AUTHORS   Heidekamp,F., Langeveld,S.A., Baas,P.D. and Jansz,H.S.
  TITLE     Studies of the recognition sequence of phi-X174 gene A protein.
            Cleavage site of phi-X gene A protein in St-1 RFI DNA
  JOURNAL   Nucleic Acids Res. 8, 2009-2021 (1980)
  MEDLINE   81053861
REFERENCE   20 (bases 436 to 490; 630 to 669; 930 to 979)
  AUTHORS   Takeshita,M., Kappen,L.S., Grollman,A.P., Eisenberg,M. and
            Goldberg,I.H.
  TITLE     Strand scission of deoxyribonucleic acid by neocarzinostatin,
            auromomycin, and bleomycin: studies on base release and nucleotide
            sequence specificity
  JOURNAL   Biochemistry (N.Y.) 20 (26), 7599-7606 (1981)
  MEDLINE   82113627
   PUBMED   6173064
REFERENCE   21 (bases 1064 to 1757)
  AUTHORS   Melville,M.-P., Piette,J., Lopez,M., Decuyper,J. and van de
            Vorst,A.
  TITLE     Termination sites of the in vitro DNA sysnthesis on single-stranded
            DNA photosensitized by promazines
  JOURNAL   J. Biol. Chem. 259, 15069-15077 (1984)
  MEDLINE   85079985
REFERENCE   22 (bases 449 to 482; 504 to 598; 1047 to 1111)
  AUTHORS   Ueda,K., Morita,J. and Komano,T.
  TITLE     Sequence specificity of heat-labile sites in DNA induced by
            mitomycin C
  JOURNAL   Biochemistry (N.Y.) 23 (8), 1634-1640 (1984)
  MEDLINE   84203526
   PUBMED   6232949
REFERENCE   23 (bases 2380 to 2512; 2593 to 2786; 2788 to 2947)
  AUTHORS   Air,G.M., Els,M.C., Brown,L.E., Laver,W.G. and Webster,R.G.
  TITLE     Location of antigenic sites in the three-dimensional structure of
            the influenza N2 virus neuraminidase
  JOURNAL   Virology 145, 237-248 (1985)
  MEDLINE   85274373
COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The
            reference sequence was derived from J02482.
            [8]  intermittent sequences.
            [15]  review; discussion of complete genome.
            Double checked with sumex tape.
            Single-stranded circular DNA which codes for eleven proteins.
            Replicative form is duplex, icosahedron, related to s13 & g4. [21]
            indicates that mitomycin C reduced with sodium borohydride induced
            heat-labile sites in DNA most preferentially at dinucleotide
            sequence 'gt' (especially 'Pu-g-t').
            Bacteriophage phi-X174 single stranded DNA molecules were
            irradiated with near UV light in the presence of promazine
            derivatives, after priming with restriction fragments or synthetic
            primers [22].  The resulting DNA fragments were used as templates
            for in vitro complementary chain synthesis by E.coli DNA polymerase
            I [22].  More than 90% of the observed chain terminations were
            mapped one nucleotide before a guanine residue [22].  Photoreaction
            occurred more predominantly with guanine residues localized in
            single-stranded parts of the genome [22].  These same guanine
            residues could also be damaged when the reaction was performed in
            the dark, in the presence of promazine cation radicals [22].
FEATURES             Location/Qualifiers
     source          1..5386
                     /organism="coliphage phiX174"
                     /specific_host="Escherichia coli"
                     /db_xref="taxon:10847"
     variation       23
                     /note="c in wt; t in am18 and am35 [14]"
     variation       25
                     /note="g in wt; c in ts116 [14]"
     CDS             51..221
                     /note="K (function unknown)"
                     /codon_start=1
                     /transl_table=11
                     /protein_id="NP_040706.1"
                     /db_xref="GI:9626376"
                     /translation="MSRKIILIKQELLLLVYELNRSGLLAENEKIRPILAQLEKLLLC
                     DLSPSTNDSVKN"
     variation       57
                     /note="c in wt; t in am6 [14]"
     variation       117
                     /note="g in wt; a in am6 [14]"
     CDS             133..393
                     /note="C (DNA maturation)"
                     /codon_start=1
                     /transl_table=11
                     /protein_id="NP_040707.1"
                     /db_xref="GI:9626377"
                     /translation="MRKFDLSLRSSRSSYFATFRHQLTILSKTDALDEEKWLNMLGTF
                     VKDWFRYESHFVHGRDSLVDILKERGLLSESDAVQPLIGKKS"
     mRNA            358..3975
                     /note="mRNA (major alt.)"
     mRNA            358..991
                     /note="mRNA (minor alt.)"
     CDS             390..848
                     /note="D (capsid morphogenesis)"
                     /codon_start=1
                     /transl_table=11
                     /protein_id="NP_040708.1"
                     /db_xref="GI:9626378"
                     /translation="MSQVTEQSVRFQTALASIKLIQASAVLDLTEDDFDFLTSNKVWI
                     ATDRSRARRCVEACVYGTLDFVGYPRFPAPVEFIAAVIAYYVHPVNIQTACLIMEGAE
                     FTENIINGVERPVKAAELFAFTLRVRAGNTDVLTDAEENVRQKLRAEGVM"
     CDS             568..843
                     /note="E (cell lysis)"
                     /codon_start=1
                     /transl_table=11
                     /protein_id="NP_040709.1"
                     /db_xref="GI:9626379"
                     /translation="MVRWTLWDTLAFLLLLSLLLPSLLIMFIPSTFKRPVSSWKALNL
                     RKTLLMASSVRLKPLNCSRLPCVYAQETLTFLLTQKKTCVKNYVRKE"
     CDS             848..964
                     /note="J (core protein, DNA condensation)"
                     /codon_start=1
                     /transl_table=11
                     /protein_id="NP_040710.1"
                     /db_xref="GI:9626380"
                     /translation="MSKGKKRSGARPGRPQPLRGTKGKRKGARLWYVGGQQF"
     CDS             1001..2284
                     /note="F (major coat protein)"
                     /codon_start=1
                     /transl_table=11
                     /protein_id="NP_040711.1"
                     /db_xref="GI:9626381"
                     /translation="MSNIQTGAERMPHDLSHLGFLAGQIGRLITISTTPVIAGDSFEM
                     DAVGALRLSPLRRGLAIDSTVDIFTFYVPHRHVYGEQWIKFMKDGVNATPLPTVNTTG
                     YIDHAAFLGTINPDTNKIPKHLFQGYLNIYNNYFKAPWMPDRTEANPNELNQDDARYG
                     FRCCHLKNIWTAPLPPETELSRQMTTSTTSIDIMGLQAAYANLHTDQERDYFMQRYHD
                     VISSFGGKTSYDADNRPLLVMRSNLWASGYDVDGTDQTSLGQFSGRVQQTYKHSVPRF
                     FVPEHGTMFTLALVRFPPTATKEIQYLNAKGALTYTDIAGDPVLYGNLPPREISMKDV
                     FRSGDSSKKFKIAEGQWYRYAPSYVSPAYHLLEGFPFIQEPPSGDLQERVLIRHHDYD
                     QCFQSVQLLQWNSQVKFNVTVYRNLPTTRDSIMTS"
     CDS             2395..2922
                     /note="G (major spike protein)"
                     /codon_start=1
                     /transl_table=11
                     /protein_id="NP_040712.1"
                     /db_xref="GI:9626382"
                     /translation="MFQTFISRHNSNFFSDKLVLTSVTPASSAPVLQTPKATSSTLYF
                     DSLTVNAGNGGFLHCIQMDTSVNAANQVVSVGADIAFDADPKFFACLVRFESSSVPTT
                     LPTAYDVYPLNGRHDGGYYTVKDCVTIDVLPRTPGNNVYVGFMVWSNFTATKCRGLVS
                     LNQVIKEIICLQPLK"
     CDS             2931..3917
                     /note="H (minor spike protein, adsorption)"
                     /codon_start=1
                     /transl_table=11
                     /protein_id="NP_040713.1"
                     /db_xref="GI:9626383"
                     /translation="MFGAIAGGIASALAGGAMSKLFGGGQKAASGGIQGDVLATDNNT
                     VGMGDAGIKSAIQGSNVPNPDEAAPSFVSGAMAKAGKGLLEGTLQAGTSAVSDKLLDL
                     VGLGGKSAADKGKDTRDYLAAAFPELNAWERAGADASSAGMVDAGFENQKELTKMQLD
                     NQKEIAEMQNETQKEIAGIQSATSRQNTKDQVYAQNEMLAYQQKESTARVASIMENTN
                     LSKQQQVSEIMRQMLTQAQTAGQYFTNDQIKEMTRKVSAEVDLVHQQTQNQRYGSSHI
                     GATAKDISNVVTDAASGVVDIFHGIDKAVADTWNNFWKDGKADGIGSNLSRK"
     misc_feature    3962
                     /note="transcription start site"
     CDS             join(3981..5386,1..136)
                     /note="A (rf replication, viral strand synthesis)"
                     /codon_start=1
                     /transl_table=11
                     /protein_id="NP_040703.1"
                     /db_xref="GI:9626373"
                     /translation="MVRSYYPSECHADYFDFERIEALKPAIEACGISTLSQSPMLGFH
                     KQMDNRIKLLEEILSFRMQGVEFDNGDMYVDGHKAASDVRDEFVSVTEKLMDELAQCY
                     NVLPQLDINNTIDHRPEGDEKWFLENEKTVTQFCRKLAAERPLKDIRDEYNYPKKKGI
                     KDECSRLLEASTMKSRRGFAIQRLMNAMRQAHADGWFIVFDTLTLADDRLEAFYDNPN
                     ALRDYFRDIGRMVLAAEGRKANDSHADCYQYFCVPEYGTANGRLHFHAVHFMRTLPTG
                     SVDPNFGRRVRNRRQLNSLQNTWPYGYSMPIAVRYTQDAFSRSGWLWPVDAKGEPLKA
                     TSYMAVGFYVAKYVNKKSDMDLAAKGLGAKEWNNSLKTKLSLLPKKLFRIRMSRNFGM
                     KMLTMTNLSTECLIQLTKLGYDATPFNQILKQNAKREMRLRLGKVTVADVLAAQPVTT
                     NLLKFMRASIKMIGVSNLQSFIASMTQKLTLSDISDESKNYLDKAGITTACLRIKSKW
                     TAGGK"
     rep_origin      4306
                     /note="origin of viral strand synthesis"
     CDS             join(4497..5386,1..136)
                     /note="A* (shut off host DNA synthesis)"
                     /codon_start=1
                     /transl_table=11
                     /protein_id="NP_040704.1"
                     /db_xref="GI:9626374"
                     /translation="MKSRRGFAIQRLMNAMRQAHADGWFIVFDTLTLADDRLEAFYDN
                     PNALRDYFRDIGRMVLAAEGRKANDSHADCYQYFCVPEYGTANGRLHFHAVHFMRTLP
                     TGSVDPNFGRRVRNRRQLNSLQNTWPYGYSMPIAVRYTQDAFSRSGWLWPVDAKGEPL
                     KATSYMAVGFYVAKYVNKKSDMDLAAKGLGAKEWNNSLKTKLSLLPKKLFRIRMSRNF
                     GMKMLTMTNLSTECLIQLTKLGYDATPFNQILKQNAKREMRLRLGKVTVADVLAAQPV
                     TTNLLKFMRASIKMIGVSNLQSFIASMTQKLTLSDISDESKNYLDKAGITTACLRIKS
                     KWTAGGK"
     misc_feature    4899
                     /note="transcription start site"
     CDS             join(5075..5386,1..51)
                     /note="B (capsid morphogenesis)"
                     /codon_start=1
                     /transl_table=11
                     /protein_id="NP_040705.1"
                     /db_xref="GI:9626375"
                     /translation="MEQLTKNQAVATSQEAVQNQNEPQLRDENAHNDKSVHGVLNPTY
                     QAGLRRDAVQPDIEAERKKRDEIEAGKSYCSRRFGGATCDDKSAQIYARFDKNDWRIQ
                     PAEFYRFHDAEVNTFGYF"
BASE COUNT     1291 a   1157 c   1254 g   1684 t
ORIGIN      
        1 gagttttatc gcttccatga cgcagaagtt aacactttcg gatatttctg atgagtcgaa
       61 aaattatctt gataaagcag gaattactac tgcttgttta cgaattaaat cgaagtggac
      121 tgctggcgga aaatgagaaa attcgaccta tccttgcgca gctcgagaag ctcttacttt
      181 gcgacctttc gccatcaact aacgattctg tcaaaaactg acgcgttgga tgaggagaag
      241 tggcttaata tgcttggcac gttcgtcaag gactggttta gatatgagtc acattttgtt
      301 catggtagag attctcttgt tgacatttta aaagagcgtg gattactatc tgagtccgat
      361 gctgttcaac cactaatagg taagaaatca tgagtcaagt tactgaacaa tccgtacgtt
      421 tccagaccgc tttggcctct attaagctca ttcaggcttc tgccgttttg gatttaaccg
      481 aagatgattt cgattttctg acgagtaaca aagtttggat tgctactgac cgctctcgtg
      541 ctcgtcgctg cgttgaggct tgcgtttatg gtacgctgga ctttgtggga taccctcgct
      601 ttcctgctcc tgttgagttt attgctgccg tcattgctta ttatgttcat cccgtcaaca
      661 ttcaaacggc ctgtctcatc atggaaggcg ctgaatttac ggaaaacatt attaatggcg
      721 tcgagcgtcc ggttaaagcc gctgaattgt tcgcgtttac cttgcgtgta cgcgcaggaa
      781 acactgacgt tcttactgac gcagaagaaa acgtgcgtca aaaattacgt gcggaaggag
      841 tgatgtaatg tctaaaggta aaaaacgttc tggcgctcgc cctggtcgtc cgcagccgtt
      901 gcgaggtact aaaggcaagc gtaaaggcgc tcgtctttgg tatgtaggtg gtcaacaatt
      961 ttaattgcag gggcttcggc cccttacttg aggataaatt atgtctaata ttcaaactgg
     1021 cgccgagcgt atgccgcatg acctttccca tcttggcttc cttgctggtc agattggtcg
     1081 tcttattacc atttcaacta ctccggttat cgctggcgac tccttcgaga tggacgccgt
     1141 tggcgctctc cgtctttctc cattgcgtcg tggccttgct attgactcta ctgtagacat
     1201 ttttactttt tatgtccctc atcgtcacgt ttatggtgaa cagtggatta agttcatgaa
     1261 ggatggtgtt aatgccactc ctctcccgac tgttaacact actggttata ttgaccatgc
     1321 cgcttttctt ggcacgatta accctgatac caataaaatc cctaagcatt tgtttcaggg
     1381 ttatttgaat atctataaca actattttaa agcgccgtgg atgcctgacc gtaccgaggc
     1441 taaccctaat gagcttaatc aagatgatgc tcgttatggt ttccgttgct gccatctcaa
     1501 aaacatttgg actgctccgc ttcctcctga gactgagctt tctcgccaaa tgacgacttc
     1561 taccacatct attgacatta tgggtctgca agctgcttat gctaatttgc atactgacca
     1621 agaacgtgat tacttcatgc agcgttacca tgatgttatt tcttcatttg gaggtaaaac
     1681 ctcttatgac gctgacaacc gtcctttact tgtcatgcgc tctaatctct gggcatctgg
     1741 ctatgatgtt gatggaactg accaaacgtc gttaggccag ttttctggtc gtgttcaaca
     1801 gacctataaa cattctgtgc cgcgtttctt tgttcctgag catggcacta tgtttactct
     1861 tgcgcttgtt cgttttccgc ctactgcgac taaagagatt cagtacctta acgctaaagg
     1921 tgctttgact tataccgata ttgctggcga ccctgttttg tatggcaact tgccgccgcg
     1981 tgaaatttct atgaaggatg ttttccgttc tggtgattcg tctaagaagt ttaagattgc
     2041 tgagggtcag tggtatcgtt atgcgccttc gtatgtttct cctgcttatc accttcttga
     2101 aggcttccca ttcattcagg aaccgccttc tggtgatttg caagaacgcg tacttattcg
     2161 ccaccatgat tatgaccagt gtttccagtc cgttcagttg ttgcagtgga atagtcaggt
     2221 taaatttaat gtgaccgttt atcgcaatct gccgaccact cgcgattcaa tcatgacttc
     2281 gtgataaaag attgagtgtg aggttataac gccgaagcgg taaaaatttt aatttttgcc
     2341 gctgaggggt tgaccaagcg aagcgcggta ggttttctgc ttaggagttt aatcatgttt
     2401 cagactttta tttctcgcca taattcaaac tttttttctg ataagctggt tctcacttct
     2461 gttactccag cttcttcggc acctgtttta cagacaccta aagctacatc gtcaacgtta
     2521 tattttgata gtttgacggt taatgctggt aatggtggtt ttcttcattg cattcagatg
     2581 gatacatctg tcaacgccgc taatcaggtt gtttctgttg gtgctgatat tgcttttgat
     2641 gccgacccta aattttttgc ctgtttggtt cgctttgagt cttcttcggt tccgactacc
     2701 ctcccgactg cctatgatgt ttatcctttg aatggtcgcc atgatggtgg ttattatacc
     2761 gtcaaggact gtgtgactat tgacgtcctt ccccgtacgc cgggcaataa cgtttatgtt
     2821 ggtttcatgg tttggtctaa ctttaccgct actaaatgcc gcggattggt ttcgctgaat
     2881 caggttatta aagagattat ttgtctccag ccacttaagt gaggtgattt atgtttggtg
     2941 ctattgctgg cggtattgct tctgctcttg ctggtggcgc catgtctaaa ttgtttggag
     3001 gcggtcaaaa agccgcctcc ggtggcattc aaggtgatgt gcttgctacc gataacaata
     3061 ctgtaggcat gggtgatgct ggtattaaat ctgccattca aggctctaat gttcctaacc
     3121 ctgatgaggc cgcccctagt tttgtttctg gtgctatggc taaagctggt aaaggacttc
     3181 ttgaaggtac gttgcaggct ggcacttctg ccgtttctga taagttgctt gatttggttg
     3241 gacttggtgg caagtctgcc gctgataaag gaaaggatac tcgtgattat cttgctgctg
     3301 catttcctga gcttaatgct tgggagcgtg ctggtgctga tgcttcctct gctggtatgg
     3361 ttgacgccgg atttgagaat caaaaagagc ttactaaaat gcaactggac aatcagaaag
     3421 agattgccga gatgcaaaat gagactcaaa aagagattgc tggcattcag tcggcgactt
     3481 cacgccagaa tacgaaagac caggtatatg cacaaaatga gatgcttgct tatcaacaga
     3541 aggagtctac tgctcgcgtt gcgtctatta tggaaaacac caatctttcc aagcaacagc
     3601 aggtttccga gattatgcgc caaatgctta ctcaagctca aacggctggt cagtatttta
     3661 ccaatgacca aatcaaagaa atgactcgca aggttagtgc tgaggttgac ttagttcatc
     3721 agcaaacgca gaatcagcgg tatggctctt ctcatattgg cgctactgca aaggatattt
     3781 ctaatgtcgt cactgatgct gcttctggtg tggttgatat ttttcatggt attgataaag
     3841 ctgttgccga tacttggaac aatttctgga aagacggtaa agctgatggt attggctcta
     3901 atttgtctag gaaataaccg tcaggattga caccctccca attgtatgtt ttcatgcctc
     3961 caaatcttgg aggctttttt atggttcgtt cttattaccc ttctgaatgt cacgctgatt
     4021 attttgactt tgagcgtatc gaggctctta aacctgctat tgaggcttgt ggcatttcta
     4081 ctctttctca atccccaatg cttggcttcc ataagcagat ggataaccgc atcaagctct
     4141 tggaagagat tctgtctttt cgtatgcagg gcgttgagtt cgataatggt gatatgtatg
     4201 ttgacggcca taaggctgct tctgacgttc gtgatgagtt tgtatctgtt actgagaagt
     4261 taatggatga attggcacaa tgctacaatg tgctccccca acttgatatt aataacacta
     4321 tagaccaccg ccccgaaggg gacgaaaaat ggtttttaga gaacgagaag acggttacgc
     4381 agttttgccg caagctggct gctgaacgcc ctcttaagga tattcgcgat gagtataatt
     4441 accccaaaaa gaaaggtatt aaggatgagt gttcaagatt gctggaggcc tccactatga
     4501 aatcgcgtag aggctttgct attcagcgtt tgatgaatgc aatgcgacag gctcatgctg
     4561 atggttggtt tatcgttttt gacactctca cgttggctga cgaccgatta gaggcgtttt
     4621 atgataatcc caatgctttg cgtgactatt ttcgtgatat tggtcgtatg gttcttgctg
     4681 ccgagggtcg caaggctaat gattcacacg ccgactgcta tcagtatttt tgtgtgcctg
     4741 agtatggtac agctaatggc cgtcttcatt tccatgcggt gcactttatg cggacacttc
     4801 ctacaggtag cgttgaccct aattttggtc gtcgggtacg caatcgccgc cagttaaata
     4861 gcttgcaaaa tacgtggcct tatggttaca gtatgcccat cgcagttcgc tacacgcagg
     4921 acgctttttc acgttctggt tggttgtggc ctgttgatgc taaaggtgag ccgcttaaag
     4981 ctaccagtta tatggctgtt ggtttctatg tggctaaata cgttaacaaa aagtcagata
     5041 tggaccttgc tgctaaaggt ctaggagcta aagaatggaa caactcacta aaaaccaagc
     5101 tgtcgctact tcccaagaag ctgttcagaa tcagaatgag ccgcaacttc gggatgaaaa
     5161 tgctcacaat gacaaatctg tccacggagt gcttaatcca acttaccaag ctgggttacg
     5221 acgcgacgcc gttcaaccag atattgaagc agaacgcaaa aagagagatg agattgaggc
     5281 tgggaaaagt tactgtagcc gacgttttgg cggcgcaacc tgtgacgaca aatctgctca
     5341 aatttatgcg cgcttcgata aaaatgattg gcgtatccaa cctgca
//
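
A record like the one above is easy to read by eye for a 5 kb phage, but not for a whole bacterial chromosome. As a minimal sketch (plain Python, no bioinformatics library), the sequence can be pulled out of the ORIGIN block and the bases counted. The function names are made up for illustration, and the parser only handles the simple layout shown here, not every GenBank feature.

```python
def read_origin_sequence(genbank_text):
    """Return the sequence in the ORIGIN block of a GenBank record as one string."""
    bases = []
    in_origin = False
    for line in genbank_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.startswith("//"):   # end-of-record marker
            break
        if in_origin:
            # Each line is a position number followed by blocks of 10 bases;
            # keep the letters and drop the digits and spaces.
            bases.extend(ch for ch in line.lower() if ch.isalpha())
    return "".join(bases)


def base_counts(sequence):
    """Count a, c, g and t in a lowercase sequence."""
    return {base: sequence.count(base) for base in "acgt"}


# The first two sequence lines of the phi-X174 record, as a small test input.
example = """ORIGIN
        1 gagttttatc gcttccatga cgcagaagtt aacactttcg gatatttctg atgagtcgaa
       61 aaattatctt gataaagcag gaattactac tgcttgttta cgaattaaat cgaagtggac
//
"""

if __name__ == "__main__":
    seq = read_origin_sequence(example)
    print(len(seq), base_counts(seq))
```

Run on the complete record, the counts can be compared against the BASE COUNT line as a sanity check.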

Base-Composition Atlas for E. coli Phi-X174 bacteriophage



There are several things to notice in this plot. First, the genome is circular. The densities of the four nucleotides are plotted in the four outermost circles. These densities are not evenly distributed; although all four of the scales range from 0% (min., no colour) to 40% (max. colour intensity), it can easily be seen that the sequence is dominated by T's (red circle), and that there are relatively few G's (outermost turquoise circle) and C's (pink circle), and a few A-rich regions (green 2nd circle).

Many of the genes overlap (the genes are indicated in the "annotation circle", which is the fifth circle from the outside, with the blue bands representing genes in the forward direction). Note that all the genes are oriented in the same direction. Whilst this is true for some viruses, it is rarely true for organisms with larger genomes. There is a strong bias towards T's over A's (AT skew, which is simply the number of A's minus the number of T's over a given window), a weaker bias of G's over C's (GC skew), and finally the genome is generally AT rich (red in the innermost circle).
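
The skews described above can be sketched in a few lines of Python. This is only an illustration, not the code used to draw the atlases: the function name `window_skews`, the window and step sizes, and the way the windows wrap around the circular genome are all my assumptions. Skew is taken exactly as defined in the text, a plain difference in counts. The whole-genome numbers come from the BASE COUNT line of the GenBank record.

```python
def window_skews(sequence, window, step):
    """Yield (start, AT skew, GC skew) for windows along a circular sequence.

    Skew is a plain count difference: #A - #T and #G - #C in each window.
    """
    seq = sequence.lower()
    n = len(seq)
    if not 0 < window <= n:
        raise ValueError("window must be between 1 and the sequence length")
    circular = seq + seq[:window - 1]   # lets the last windows wrap past the origin
    for start in range(0, n, step):
        chunk = circular[start:start + window]
        yield (start,
               chunk.count("a") - chunk.count("t"),
               chunk.count("g") - chunk.count("c"))


# Whole-genome values from the BASE COUNT line of the phi-X174 record.
counts = {"a": 1291, "c": 1157, "g": 1254, "t": 1684}
total = sum(counts.values())
at_skew = counts["a"] - counts["t"]                  # negative: more T's than A's
gc_skew = counts["g"] - counts["c"]                  # positive: more G's than C's
at_fraction = (counts["a"] + counts["t"]) / total    # fraction of A+T

if __name__ == "__main__":
    print(f"AT skew {at_skew}, GC skew {gc_skew}, A+T {at_fraction:.1%}")
```

The signs match the atlas: a large excess of T's over A's, a smaller excess of G's over C's, and an AT content a little above 50%.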


Genome Atlas for E. coli Phi-X174 bacteriophage



Notice that some of the structural features (such as the perfect palindromes circle) are often found near the end of genes. A more detailed explanation for the various parameters will be given in the next lecture, but for now the important point is that there is much information in the sequence which can be visualised in the atlas, which is not readily apparent from merely looking at the GenBank file alone.





Part 3: DNA Structures and Symmetry Elements




A Brief Introduction to [a few] Alternative Conformations of DNA


DNA symmetry elements defined

ONE possible way of trying to deal with all this information is to develop methods of visualising DNA structures within bacterial chromosomes. The method I have chosen to talk about today is based on two different groups of "DNA symmetry elements". The first group is simply various types of repeats, and the second group is DNA helix families, which are favoured by certain stretches of purines (or pyrimidines) for A-DNA, and by certain stretches of alternating pyrimidines and purines for Z-DNA. The various conformations of these different sequences have putative biological functions, based in part on these structures. The repeats will be discussed first.





A. DNA Repeats

From a DNA sequence perspective, there are 4 types of repeats:

  • Direct Repeats

      • Simple Tandem Repeats

      • (Longer) Tandem Repeats

      • Direct (non-tandem)

      • Phased Repeats

  • Inverted Repeats

  • Mirror Repeats

  • Everted Repeats







    Table of DNA sequence repeats, structures, and biological functions

    Repeat             Pattern   Possible structure               Biological function
    Direct repeats     (N)n      recA triple-stranded DNA         homologous recombination; duplications
    Mirror repeats     (R)n      intermolecular triplex;          recombination; replication
                                 intramolecular triple-strands
    Inverted repeats   (N)n      cruciforms                       deletions (in bacteria); insertion sequences
    Everted repeats    (N)n      parallel-stranded DNA            unknown; stabilisation of telomeres(?)
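
The four repeat types can be thought of as four symmetry operations applied to a sequence read 5' to 3' along one strand. The sketch below, in Python, assumes the usual definitions, which this page does not spell out: direct = the same sequence again, mirror = reversed, everted = complemented but not reversed, inverted = reverse complement. The function names are made up for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def second_copy(unit, repeat_type):
    """Return the second copy of `unit` for a given repeat type (same strand, 5'->3')."""
    unit = unit.upper()
    if repeat_type == "direct":
        return unit                                # same sequence again
    if repeat_type == "mirror":
        return unit[::-1]                          # reversed
    if repeat_type == "everted":
        return unit.translate(COMPLEMENT)          # complemented, not reversed
    if repeat_type == "inverted":
        return unit.translate(COMPLEMENT)[::-1]    # reverse complement
    raise ValueError(f"unknown repeat type: {repeat_type}")


def classify_pair(first, second):
    """List every repeat type for which `second` is the matching copy of `first`."""
    return [t for t in ("direct", "mirror", "everted", "inverted")
            if second_copy(first, t) == second.upper()]


if __name__ == "__main__":
    for kind in ("direct", "mirror", "everted", "inverted"):
        print(f"{kind:>8}: AAGC ... {second_copy('AAGC', kind)}")
```

Note that a pair can belong to more than one class: a run of a single base such as AAAA is both a direct and a mirror repeat of itself.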










    B. DNA Helix Families

    A-, B-, and Z-DNAs
    A-DNA (left), B-DNA (middle) and Z-DNA (right) -- 12 bp each
    From Dickerson et al. in Cold Spring Harbor Symposium for Quantitative Biology (1982) v47 p13-24.





    3 families of DNA helices:



    A-DNA conformation

    A-DNA family - this is most common for double stranded RNA, RNA/DNA hybrids, as well as for certain DNA sequences, such as long stretches of purines. NMR studies have shown that as few as 5 bp of purines in a row can set up an A-type of helix. Most of the DNA inside of cells is likely to be a mixture of the A- and B-DNA conformations.
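
As a rough illustration of how such stretches can be located in a sequence, here is a minimal sketch in Python. The default threshold of 5 is simply the "as few as 5 bp" figure from the text, and the function name is made up; finding a run says nothing by itself about whether that region really adopts an A-type helix. Pyrimidine runs are reported as well, since a run of pyrimidines on one strand is a run of purines on the other.

```python
import re


def purine_tracts(sequence, min_length=5):
    """Return (start, end, tract) for runs of purines or of pyrimidines.

    Coordinates are 0-based and end-exclusive. A run of C/T on this strand
    corresponds to a run of purines on the complementary strand.
    """
    pattern = re.compile(r"[AG]{%d,}|[CT]{%d,}" % (min_length, min_length))
    return [(m.start(), m.end(), m.group())
            for m in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    for start, end, tract in purine_tracts("CAGGAAGCTCTTCA"):
        print(start, end, tract)
```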















    B-DNA conformation

    B-DNA family - the majority of DNA exists in the "B-DNA form" inside the cells of living organisms. This is the classical "Watson-Crick" structure, although there is considerable sequence-specific variation: depending on the sequence, the helix can have anywhere from 9 bp/turn to 12 bp/turn! On AVERAGE, however, DNA has about 10.5 bp/turn.
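
To get a feel for what these numbers mean, the sketch below (Python) converts a length in base pairs into a number of helical turns. It uses the 5386 bp of the double-stranded replicative form of phi-X174 as an example, and treats the whole molecule as if it had one uniform helical repeat, which is of course a simplification. The function name is made up for illustration.

```python
def helical_turns(length_bp, bp_per_turn=10.5):
    """Number of helical turns in a duplex of `length_bp` base pairs."""
    return length_bp / bp_per_turn


PHIX174_BP = 5386  # length of phi-X174, from its GenBank record

if __name__ == "__main__":
    # The two extremes and the average helical repeat quoted in the text.
    for bp_per_turn in (9.0, 10.5, 12.0):
        turns = helical_turns(PHIX174_BP, bp_per_turn)
        print(f"{bp_per_turn:>4} bp/turn -> about {turns:.0f} turns")
```

At the average of 10.5 bp/turn this small genome already has about 513 turns of the helix.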





















    Z-DNA conformation

    Z-DNA family - this is much rarer than the other two families, although certain sequences, such as runs of GC repeats (GCGCGC), can form Z-DNA easily. In eukaryotes, CpG islands can form Z-DNA, and methylated CpG islands can form Z-DNA readily in vivo. Furthermore, specific proteins have been isolated which will bind preferentially to the left-handed Z-DNA conformation.
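
Candidate Z-DNA-forming stretches can be found by looking for strictly alternating purines and pyrimidines. The following is a minimal sketch in Python; the minimum length of 6 (as in the GCGCGC example) is an arbitrary threshold of mine, the function name is made up, and the input is assumed to contain only A, C, G and T. It does not distinguish GC alternation from other alternating sequences such as AT, which behave differently.

```python
PURINES = set("AG")


def alternating_runs(sequence, min_length=6):
    """Return (start, end, run) for stretches where purines and pyrimidines alternate.

    Coordinates are 0-based and end-exclusive. Assumes only A, C, G and T.
    """
    seq = sequence.upper()
    runs = []
    start = 0
    for i in range(1, len(seq) + 1):
        # The run ends at the sequence end, or where two purines
        # (or two pyrimidines) sit next to each other.
        if i == len(seq) or (seq[i] in PURINES) == (seq[i - 1] in PURINES):
            if i - start >= min_length:
                runs.append((start, i, seq[start:i]))
            start = i
    return runs


if __name__ == "__main__":
    print(alternating_runs("AAGCGCGCTT"))
```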



























    REFERENCES

    Papers relevant to this lecture (handed out in class)

      Friday (6 April, 2001)

    1. David W. Ussery, "Genome Databases", The Encyclopedia of Genetics, in press, April, 2001.
    2. Ussery,D.W., Larsen,T.S., Wilkes,K.T., Friis,C., Worning,P., Krogh,A., Brunak,S. "Genome Organisation and Chromatin Structure in Escherichia coli", Biochimie,83:201-212, (2001).
    3. Carsten Friis, Lars Juhl Jensen, and David W. Ussery, "Visualisation of Pathogenicity Regions in Bacteria", Genetica, 108:47-51, 2000.
    4. David W. Ussery, "Bioinformatics2000 Meeting Report", Genome Biology, 1, (#3), 1-2, 2000.



    Other references

  • Richard R. Sinden, Christopher E. Pearson, Vladimir N. Potaman, and David W. Ussery, "DNA: Structure and Function", Advances in Genome Biology, 5A:1-141, (1998).
  • Ussery,D.W., Higgins,C.F., and Bolshoy,A., "Environmental Influences on DNA Curvature", J. Biomolecular Structure & Dynamics,16:811-823, (1999).




  • To be handed out next lecture (Tuesday, 17 April, 2001)

  • David W. Ussery, "DNA Structure: A-, B-, and Z-DNA Families", manuscript submitted to The Encyclopedia of Life Sciences, April, 2000.
  • Anders Gorm Pedersen, Lars Juhl Jensen, Hans-Henrik Stærfeldt, Søren Brunak, and David W. Ussery, "A DNA Structural Atlas for Escherichia coli", Journal of Molecular Biology, 299 (#4), 907-930, (2000).

  • Lars Juhl Jensen, Carsten Friis, and David W. Ussery, "Three Views of Microbial Genomes", Research in Microbiology, 150, pages 773-777, 1999.

  • David W. Ussery, "DNA Denaturation", manuscript submitted to The Encyclopedia of Genetics, September, 2000.





    Books about DNA:

    Watson, James D. "A PASSION FOR DNA: Genes, Genomes, and Society", (Oxford University Press, Oxford, 2000).

    Sinden, Richard R., "DNA: STRUCTURE and FUNCTION", (Academic Press, New York, 1994).

    Calladine,C.R., Drew,H.R., "Understanding DNA: The Molecule and How It Works", (2nd edition, Academic Press, San Diego, 1997).










    Last modified Tuesday, 9 April, 2002 by David Ussery