David Ussery
Friday, 23 November, 2001
Morning Lecture      Link to Afternoon lecture

Masterclass on Microbial Genomes
University of Groningen, The Netherlands


DNA Symmetry Elements in Bacterial Chromosomes


Overview

  1. What is bioinformatics?
    • Introduction
    • The Problem: too much information!


  2. DNA Symmetry Elements
    • DNA repeats
    • DNA helix families


  3. Comparative Genomics





Part 1: What is bioinformatics?

Bioinformatics is the application of computational methods to biological information. Part (but not all) of this information is in the form of sequences (e.g., DNA, RNA, proteins). In addition to sequences, there are other sources of data for bioinformatics analysis, such as microarray data, 2-D gel information, and image analysis.





DNA -> RNA -> Protein



Once a new sequence has been determined, there are various ways of trying to find the function:



In this talk (as well as the one this afternoon), I will focus on only the latter two approaches - that is, looking for patterns in DNA sequences and predicting local DNA structures based on the DNA sequence. Both talks will be about information in the DNA sequences within the context of sequenced bacterial chromosomes. Note that this is only one tiny fraction of the much larger subject area of bioinformatics.



Information depends on CONTEXT

For DNA sequences, there are several different types of information:

  1. Coding information for amino acid sequences in proteins.


  2. Coding information for RNA sequences.
    • tRNA
    • rRNA
    • snRNA
    • telomerase RNA
    • other RNAs

  3. Protein binding site information.

    • transcription factors
    • chromatin proteins
    • restriction enzymes

  4. DNA modification site information.
    • methylation sites
    • glycosylation sites
    • other modification sites

  5. Chromosome organisational information.
    • regions of highly expressed genes
    • origin and terminus of replication
    • mutational "hot spots"

  6. Physical/mechanical local structural information.
    • meltability
    • helix rigidity
    • intrinsic DNA curvature
    • nucleosomal positioning

  7. Repeat/symmetry elements.
    • repeats (direct, inverted, mirror, everted)
    • A-DNA (certain purine stretches) or Z-DNA (certain pyr/pur stretches)
    • structural periodicity patterns








Too Much Information!

Currently several hundred prokaryotic genomes have been sequenced, and nearly 100 genomes are publicly available for analysis. The flow of information is essentially the same as above, that is:




Genome -> Transcriptome -> Proteome




Link to a list of sequenced bacterial genomes

Some philosophical thoughts about Information and the Size of Genomes.



The information in GenBank is doubling every 10 months.
What are the implications of this?
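
As a rough illustration of what a 10-month doubling time implies, the short Python sketch below computes the growth factor over a given period. The function name and the example periods are illustrative assumptions; only the 10-month doubling time comes from the statement above, and steady exponential growth is assumed.

```python
def growth_factor(months, doubling_time_months=10.0):
    """Factor by which a quantity grows in `months`, assuming steady
    exponential growth with the given doubling time (an assumption)."""
    return 2.0 ** (months / doubling_time_months)


if __name__ == "__main__":
    # Example periods chosen purely for illustration.
    for years in (1, 5, 10):
        factor = growth_factor(12 * years)
        print(f"after {years:2d} year(s): about {factor:,.1f}x the data")
```

Under this assumption, ten years is 12 doublings, that is, a factor of 4096 more data than at the start.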

growth in GenBank



A look at genome sequencing since 1994:

YEAR    # GENOMES Sequenced    Running Total
1994              0                   0
1995              2                   2
1996              2                   4
1997              5                   9
1998              8                  17
1999             13                  30
2000             23                  53
2001            >50                >100


Genome Databases Links







Part 2: DNA Symmetry Elements




A Brief Introduction to [a few] Alternative Conformations of DNA


DNA symmetry elements defined

ONE possible way of dealing with all this information is to develop methods of visualising DNA structures within bacterial chromosomes. The method I have chosen to talk about today is based on two different groups of "DNA symmetry elements". The first group is simply the various types of repeats. The second group is the DNA helix families: certain stretches of purines (or pyrimidines) favour A-DNA, and certain stretches of alternating pyrimidines/purines favour Z-DNA. The various conformations of these different sequences have putative biological functions, based in part on these structures. The repeats will be discussed first.





A. DNA Repeats

From a DNA sequence perspective, there are 4 types of repeats:

  • Direct Repeats
    • Simple Tandem Repeats
    • (Longer) Tandem Repeats
    • Direct (non-tandem)
    • Phased Repeats

  • Inverted Repeats

  • Mirror Repeats

  • Everted Repeats







    Table of DNA sequence repeats, structures, and biological functions

    Repeat   Pattern            Possible structure               Biological function
    (N)n     Direct repeats     recA triple-stranded DNA         homologous recombination;
                                                                 duplications
    (R)n     Mirror repeats     intermolecular triplex;          recombination;
                                intramolecular triple strands    replication
    (N)n     Inverted repeats   cruciforms                       deletions (in bacteria);
                                                                 insertion sequences
    (N)n     Everted repeats    parallel-stranded DNA            unknown;
                                                                 stabilisation of telomeres(?)
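
The four repeat types can be made concrete with a small Python sketch. The definitions used here are assumptions, since the lecture notes do not spell them out: a direct repeat is the same sequence again, a mirror repeat is the sequence reversed, an everted repeat is the complement read in the same direction, and an inverted repeat is the reverse complement. All function names are hypothetical.

```python
# Complement table for the four standard DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def direct(seq):
    """Partner of a direct repeat: the same sequence again."""
    return seq


def mirror(seq):
    """Partner of a mirror repeat: the sequence read backwards."""
    return seq[::-1]


def everted(seq):
    """Partner of an everted repeat (assumed definition): the complement,
    read in the same direction."""
    return seq.translate(COMPLEMENT)


def inverted(seq):
    """Partner of an inverted repeat: the reverse complement."""
    return seq.translate(COMPLEMENT)[::-1]


PARTNERS = {"direct": direct, "inverted": inverted,
            "mirror": mirror, "everted": everted}


def classify_repeat(first, second):
    """Return every repeat type for which `second` is the partner of `first`.
    A list is returned because some sequences satisfy more than one type."""
    first, second = first.upper(), second.upper()
    return [name for name, partner in PARTNERS.items()
            if partner(first) == second]


if __name__ == "__main__":
    for other in ("GATTC", "GAATC", "CTTAG", "CTAAG"):
        print("GATTC", other, classify_repeat("GATTC", other))
```

For the example sequence GATTC, the four partners are all different, so each pair is assigned exactly one repeat type; a homopolymer such as AAAA would count as both a direct and a mirror repeat of itself.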










    B. DNA Helix Families

    A-, B-, and Z-DNAs
    A-DNA (left), B-DNA (middle) and Z-DNA (right) -- 12 bp each
    From Dickerson et al., Cold Spring Harbor Symposia on Quantitative Biology (1982) 47:13-24.





    3 families of DNA helices:



    A-DNA conformation

    A-DNA family - this is most common for double stranded RNA, RNA/DNA hybrids, as well as for certain DNA sequences, such as long stretches of purines. NMR studies have shown that as few as 5 bp of purines in a row can set up an A-type of helix. Most of the DNA inside of cells is likely to be a mixture of the A- and B-DNA conformations.















    B-DNA conformation

    B-DNA family - the majority of DNA exists in the "B-DNA form" inside the cells of living organisms. This is the classical "Watson-Crick" structure, although there is considerable sequence-specific variation. Thus, for example, different sequences can have from 9 bp/turn of the helix to 12 bp/turn, depending on the sequence of the DNA! However, on AVERAGE, the DNA is about 10.5 bp/turn.





















    Z-DNA conformation

    Z-DNA family - this is much rarer than the other two families, although certain sequences, such as runs of GC repeats (GCGCGC), can form Z-DNA easily. In eukaryotes, CpG islands can form Z-DNA, and methylated CpG islands can form Z-DNA readily in vivo. Furthermore, specific proteins have been isolated which bind preferentially to the left-handed Z-DNA conformation.
























    Part 3: Comparative Genomics

    What has been sequenced?

    As of 12 November, 2001. Link to an updated table

    Kingdom        Number of species   Number of chromosomes   Total bp sequenced
                   sequenced           sequenced *
    Archaea               13                   20                    26,409,849
    Bacteria              61                  165                   208,304,570
    Protoctista            4                   18                    18,053,080
    Fungi                  2                   23                    36,117,519
    Plants                 1                    7                    47,623,657
    Animals                3                   36                 2,979,841,298
    Viruses               53                  491                    12,279,171
    totals               137                  760                 3,328,629,144

    * Includes plasmids from sequenced genomes.





    What is missing?

    There are (at least) TWO things missing: genomes from ecologically abundant and diverse niches, and larger genomes.


    Phylogenetic Tree





    Although the number of genomes being sequenced is increasing rapidly, one has to put this into perspective - the organisms can be placed into four different classes:

    Organism group              Size (bp)                           No. sequenced
    viruses                     ~300 to ~350,000                    545
    prokaryotes                 ~250,000 to ~15,000,000             80 (public)
    single-celled eukaryotes    ~12,000,000 to ~600,000,000,000     8
    multi-celled eukaryotes     ~20,000,000 to ~500,000,000,000     3*

    * Note that NONE of the multi-cellular eukaryotic chromosomes have yet been completely sequenced (i.e., 1 contiguous piece of DNA, with no gaps).



    This level of variation often does not correlate at all with "biological complexity". For example, a simple amoeba has 600,000,000,000 bp of DNA in its genome, or 200x as much as in humans! As another example, genome size in insects ranges from 20,000,000 bp, or just a bit larger than a bacterial genome, to more than 10 BILLION bp, or more than three times larger than the human genome. So far (understandably!) the trend has been to sequence the SMALL genomes, and pretend they are representative of the larger ones. This is certainly reasonable, but one should keep the large genomes in the back of one's mind when trying to extrapolate from the sequence of the smallest genomes to the "real world" of biological complexity.





    Comparison of bacterial chromosomes.


    So we have lots of sequenced genomes. How can we compare them? A simple first approach is to look at average properties for the whole chromosomes. For example, the figure below shows the average AT content for 20 different Archaeal chromosomes. Note that MOST of the Archaeal genomes are AT RICH, which is contrary to the old dogma that they must be GC rich, because many of them are thermophiles. We now know that the genomes for many of these organisms can survive high temperatures because the DNA is positively supercoiled, rather than negatively supercoiled. This means that it takes much more energy to melt the helix.

    AT content of Archaeal Chromosomes
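
An average property such as AT content is simple to compute from a chromosome sequence. The sketch below is a minimal illustration, not the code used to make the figure; the function name is hypothetical, and ignoring ambiguous bases (such as N) is an assumption.

```python
def at_content(seq):
    """Fraction of A and T among the unambiguous bases (A, C, G, T) of `seq`.
    Ambiguous symbols such as N are ignored (an assumption)."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (seq.count("A") + seq.count("T")) / counted


if __name__ == "__main__":
    # Toy sequence for illustration only; a real chromosome would be
    # read from a sequence file.
    print(f"AT content: {at_content('AATTGC'):.1%}")
```

A value above 50% means the chromosome is AT rich in the sense used above.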




    As a further example, characteristics for 20 different Proteobacter chromosomes are compared on the following web page:

    http://www.cbs.dtu.dk/staff/dave/MScourse/ProteobacterOkt2001.html




    How Random is DNA?

    Although estimating the levels of A-DNA and Z-DNA might be difficult, one thing that is clear is that there is a strong bias in genomes towards an over-representation of purine stretches, as well as pyr/pur stretches. In addition, the patterns found in eukaryotic DNA are quite different from those in bacterial DNA. So, for example, in the two plots below, the occurrence of stretches of purines or pyr/pur tracts in E. coli is essentially the same as predicted by a "random" model of DNA, but in human chromosome 1 it is very different from that expected, even when taking into account the pentameric composition (a 4th-order Markov model). This implies that the DNA in eukaryotes is much less "random" than the DNA in bacteria - at least with respect to runs of purines or alternating pyr/pur tracts.

    Purine tracts in E.coli




    Purine tracts in human chromosome 1





    Fraction of purine and pyr/pur stretches of at least 10 bp in Sequenced Chromosomes From All 5 Kingdoms



    Fraction of purine tracts in all five kingdoms




    Organism                            Kingdom                  Size                                Purine      Pyr/Pur
                                                                                                     stretches   stretches
    E. coli K-12                        Prokaryotae (Bacteria)   4,639,221 bp (complete)             1.1%        1.4%
    P. furiosus                         Prokaryotae (Archaea)    1,908,523 bp (complete)             6.0%        0.2%
    L. major, chromosome 1              Protista (protozoa)      268,984 bp (~40 Mbp total)          6.0%        4.6%
    S. cerevisiae, all 16 chromosomes   Fungi (yeast)            12,057,849 bp (complete)            3.7%        0.8%
    A. thaliana, chromosome 1           Plantae (thale cress)    28,920,698 bp (~100 Mbp total)      4.6%        0.9%
    H. sapiens, chromosome 1            Animalia (humans)        282,193,664 bp (~3000 Mbp total)    5.3%        0.8%
    Expected values                     -                        n bp                                0.2%        0.2%

    Link to a table comparing more than 700 chromosomes
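
The fractions in the table can be approximated with a simple run-length scan. The Python sketch below is one possible way to do this and is not the method used for the table. Two things are assumptions: that a run of pyrimidines counts as a purine stretch (it is a purine stretch on the opposite strand), and that the fraction is the number of bases inside qualifying runs divided by the total sequence length. The function names are hypothetical.

```python
def base_classes(seq):
    """Map each base to 'R' (purine: A, G), 'Y' (pyrimidine: C, T),
    or 'N' (anything else)."""
    return "".join("R" if b in "AG" else "Y" if b in "CT" else "N"
                   for b in seq.upper())


def tract_fraction(seq, min_len=10, alternating=False):
    """Fraction of bases lying in runs of at least `min_len` bp.

    alternating=False: runs of a single class (all purines or all
    pyrimidines on this strand).
    alternating=True: runs in which purines and pyrimidines alternate.
    """
    classes = base_classes(seq)
    if not classes:
        raise ValueError("empty sequence")
    covered = 0   # bases inside qualifying runs
    run = 0       # length of the run currently being extended
    prev = None
    for c in classes:
        if c == "N" or prev is None or prev == "N":
            extends = False
        elif alternating:
            extends = c != prev
        else:
            extends = c == prev
        if extends:
            run += 1
        else:
            if run >= min_len:
                covered += run
            run = 0 if c == "N" else 1
        prev = c
    if run >= min_len:          # close the final run
        covered += run
    return covered / len(classes)


if __name__ == "__main__":
    toy = "AGAGAGAGAG" + "CT"   # 10 purines followed by 2 pyrimidines
    print(tract_fraction(toy), tract_fraction(toy, alternating=True))
```

Comparing the observed fraction with the fraction from a shuffled or Markov-generated sequence of the same composition would give the kind of "expected" baseline discussed above.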








    Comparison of Fraction of purine and pyr/pur tracts in Prokaryotic Chromosomes




    Archaea

    Purine tracts in Archaeal Genomes




    Proteobacteria

    Purine tracts in Proteobacter Genomes




    Firmicutes

    Purine tracts in Firmicute Genomes




    "Other" Bacterial Genomes

    Purine tracts in Other Genomes




    REFERENCES

    Photocopies of the following articles are provided in the course programme:

    1. David W. Ussery, Thomas S. Larsen, K. Trevor Wilkes, Carsten Friis, Peder Worning, Anders Krogh, and Søren Brunak, "Genome Organisation and Chromatin Structure in Escherichia coli", Biochimie, 83:201-212, (2001).
       Link to web page with supplemental information about this article.

    2. Anders Gorm Pedersen, Lars Juhl Jensen, Hans-Henrik Stærfeldt, Søren Brunak, and David W. Ussery, "A DNA Structural Atlas for Escherichia coli", Journal of Molecular Biology, 299 (#4), 907-930, (2000).
       Link to JMB online version of this article.

    3. Lars Juhl Jensen, Carsten Friis, and David W. Ussery, "Three Views of Microbial Genomes", Research in Microbiology, 150, pages 773-777, (1999).

    4. Carsten Friis, Lars Juhl Jensen, and David W. Ussery, "Visualisation of Pathogenicity Regions in Bacteria", Genetica, 108:47-51, (2000).
       Link to Yersinia pestis pPCP1 atlases.
       Link to S. typhimurium DT104 atlases.
       Link to E. coli pO157 atlases.





    Articles referred to in the lecture, but not handed out in class:


      An Overview of Where to Find More Information on Sequenced Genomes:

    • David W. Ussery, "Genome Databases", The Encyclopedia of Genetics, in press, September, 2001.


      Articles about A-DNA and Z-DNA in chromosomes:

    • David W. Ussery, "DNA Structure: A-, B-, and Z-DNA Families", manuscript submitted to The Encyclopedia of Life Sciences (to be published in autumn 2001).

    • David Ussery, Dikeos Mario Soumpasis, Søren Brunak, Hans-Henrik Stærfeldt, Peder Worning, and Anders Krogh, "Bias of Purine Stretches in Sequenced Genomes", Computers in Chemistry, in press, to be published in January, 2002.

    • Link to web page comparing fractions of purine and pyr/pur tracts in more than 700 chromosomes


      Cruciforms and Palindromes in Bacterial Chromosomes:

    • Richard R. Sinden, David W. Ussery, Peder Worning, and William Rosche, "Genome Gymnastics and Spontaneous Mutagenesis: Intermolecular Leading Strand Misalignments Lead to Quasipalindrome Correction", submitted as a "MicroReview" to Molecular Microbiology.


      Most Bacterial Genomes are Over-annotated (by as much as 50%!):

    • Marie Skovgaard, Lars Juhl Jensen, Søren Brunak, David W. Ussery, and Anders Krogh, "On the Total Number of Genes and Their Length Distribution in Complete Microbial Genomes", Trends in Genetics, 17:425-428, 2001.

    • Link to web page with supplemental information about this article.


      Some of the comparison of proteobacter genomes was included in the following manuscript:

    • Lise Petersen, Stephen L.W. On, and David Ussery, "Visualisation and Significance of DNA Structural Motifs in the Campylobacter jejuni genome", manuscript submitted to Genome Letters (to be published in spring, 2002).




    Other related articles:

    1. Richard R. Sinden, Christopher E. Pearson, Vladimir N. Potaman, and David W. Ussery, "DNA: Structure and Function", Advances in Genome Biology, 5A:1-141, (1998).

    2. David W. Ussery, Christopher F. Higgins, and Alexander Bolshoy, "Environmental Influences on DNA Curvature", J. Biomolecular Structure & Dynamics, 16:811-823, (1999).

    3. David W. Ussery, "DNA Denaturation", The Encyclopedia of Genetics, in press, September, 2001.

    4. David W. Ussery, "Bioinformatics2000 Meeting Report", GenomeBiology, 1:(#3), pages 1-2, (2000). On-line version at http://www.genomebiology.com/2000/1/3/reports/4014/


    Link to a list of recent papers and talks on DNA structures.



    Books about DNA:

    Watson, James D., "A PASSION FOR DNA: Genes, Genomes, and Society", (Oxford University Press, Oxford, 2000).

    Sinden, Richard R., "DNA: STRUCTURE and FUNCTION", (Academic Press, New York, 1994).

    Calladine, C.R., and Drew, H.R., "Understanding DNA: The Molecule and How It Works", (2nd edition, Academic Press, San Diego, 1997).



    A List of more than a thousand books about DNA







    Last modified Monday, 27 November, 2001 by David Ussery