30431 Introduktion til Bioinformatik
David Ussery
Tuesday, 9 November, 1999




DNA Symmetry Elements and their Meanings






Overview

This lecture is about ways of looking at DNA sequences in complete genomes and chromosomes, in terms of symmetry elements. There are two parts to this talk. In Part 1, I will discuss the fact that we simply have "Too Much Information" becoming available, and the problem will only get worse in the near future. There are ways of cataloging and organising the data, of course. However, many people don't appreciate the true diversity of genome sizes in Nature, so we'll talk for a few minutes about the "C-value paradox", along with some possible ideas for WHY certain organisms have so much DNA.

In Part 2 we get at the main subject of the lecture, which is a look at DNA symmetry elements and their biological meanings. Although you could have essentially an infinite variety of different possible DNA sequences, fortunately, there are only a limited number of DNA conformations. I would like to think that one way of dealing with all this information, in terms of DNA sequences, is to think about it in biological terms, in particular in physical-chemical terms of structure and function of symmetry elements. For example, there are specific DNA sequences which "code" for a telomere, and different DNA sequences which are specific for centromeres. Specific DNA sequences, their structures, and biological functions will be discussed.

I have also made a separate file, containing specific LEARNING OBJECTIVES for this lecture, as well as a "self-test quiz", which I recommend having a look at, BEFORE the lecture, if possible. I've incorporated the answers to questions 1 and 2 into PART 1 of the lecture notes.






Part 1: The Problem: Too Much Information and Too Many TLAs


Brevis esse laboro,     obscurus fio.     - Horace
("I strive to be brief, and I become obscure.")


Some philosophical thoughts about Information and the Size of Genomes.

The information in GenBank is doubling every 10 months.
What are the implications of this?
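
To get a feel for what a 10-month doubling time implies, here is a minimal sketch in Python. It assumes the doubling time stays constant, which is a simplification and not something the GenBank figures guarantee; the function name `growth_factor` is only for illustration.

```python
def growth_factor(months: float, doubling_months: float = 10.0) -> float:
    """Fold increase in data after `months`, assuming steady exponential doubling."""
    return 2.0 ** (months / doubling_months)


# How much more sequence data would there be after 1, 5 and 10 years?
for years in (1, 5, 10):
    fold = growth_factor(12 * years)
    print(f"{years:>2} years: about {fold:,.0f}-fold more sequence data")
```

Under that assumption, five years corresponds to six doublings (64-fold) and ten years to twelve doublings (4096-fold).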

[Figure: growth in GenBank]



A look at genome sequencing, from my lecture notes for the past four years:

1995: The Only Sequenced Genome:

(so far, as of 14 Sept., 1995)
from a "Journal club" presentation at the Institute of Molecular Medicine, John Radcliffe Hospital, University of Oxford, September, 1995.

1. Haemophilus influenzae
[Figure: Haemophilus genome]






1996: Genomes from various organisms:

(4 organisms have been sequenced as of 1 Nov., 1996)
from a "Workshop on DNA Structure and Function", given at the Norwegian veterinærhøgskole, November, 1996.

Organism                    % Coding   Size (bp)          # genes
Mycoplasma genitalium       88%        580,073            468
Haemophilus influenzae      86%        2,087,778          1,662
Methanococcus jannaschii    85%        1,660,000          1,997
Synechocystis sp.           80%        3,570,000          3,168
Escherichia coli            90%        ~3,000,000         ~3,400
Saccharomyces cerevisiae    ~50%       13,000,000         ~5,000
Homo sapiens                ~2%        ~3,000,000,000     ~70,000







1997: A List of Sequenced Genomes

(9 organisms have been sequenced so far, as of 30 September, 1997)
from a lecture to an Introductory Genetics course at Roanoke College, in Salem, Virginia, October, 1997.


Organism                   Type                             Size (Mbp)  Number of genes  Date sequenced
Haemophilus influenzae     Bacteria (Gm-)                   1.83        1703             July, 1995
Mycoplasma genitalium      Bacteria (Gm-)                   0.58        470              October, 1995
Synechocystis sp.          Bacteria ("blue-green algae")    3.57        3168             May, 1996
Methanococcus jannaschii   Archaebacteria                   1.66        1738             August, 1996
Mycoplasma pneumoniae      Bacteria (Gm-)                   0.81        677              November, 1996
Saccharomyces cerevisiae   Eukaryotic ("baker's yeast")     13          5885             January, 1997
Helicobacter pylori        Bacteria (Gm-)                   1.66        1590             August, 1997
Escherichia coli           Bacteria (Gm-)                   4.60        4288             September, 1997
Bacillus subtilis          Bacteria (Gm+)                   4.20        -                "submitted"
Archaeoglobus fulgidus     Archaebacteria                   2.20        -                "submitted"
Borrelia burgdorferi       Bacteria (Gm-)                   1.30        -                "submitted"








1998: A List of Sequenced Genomes

(17 so far, as of 1 September, 1998)
from last year's lecture, and also an "electronic poster" for the 2nd Annual Conference on Computational Genomics.

Organism                              #    Type                            Size (Mbp)  Number of genes  Date sequenced / published
Haemophilus influenzae                1    Bacteria (Gm-)                  1.83        1703             August, 1995
Mycoplasma genitalium                 2    Bacteria (Gm-)                  0.58        470              October, 1995
Saccharomyces cerevisiae              3    Eukaryotic ("baker's yeast")    13.00       5885             January, 1996
Methanococcus jannaschii              4    Archaebacteria                  1.66        1738             August, 1996
Synechocystis sp.                     5    Bacteria ("blue-green algae")   3.57        3168             September, 1996
Mycoplasma pneumoniae                 6    Bacteria (Gm-)                  0.81        677              November, 1996
Escherichia coli (Wisconsin, USA)     7a   Bacteria (Gm-)                  4.60        4288             (www; Jan. '97) October, 1997
Escherichia coli (Japan)              7b   Bacteria (Gm-)                  4.64        ~4000            January, 1997 (completed)
Methanobacterium thermoautotrophicum  8    Archaebacteria                  1.75        1918             May, 1997
Archaeoglobus fulgidus                9    Archaebacteria                  2.18        2436             June, 1997
Helicobacter pylori                   10   Bacteria (Gm-)                  1.66        1590             June, 1997
Borrelia burgdorferi                  11   Bacteria (Gm-)                  0.92        853              July, 1997
Treponema pallidum                    12   Bacteria (Gm-)                  1.05        ~1000            October, 1997
Bacillus subtilis                     13   Bacteria (Gm+)                  4.20        4100             November, 1997
Pyrococcus horikoshii                 14   Archaebacteria                  1.74        ~1700            January, 1998
Aquifex aeolicus                      15   Eubacteria                      1.55        1512             March, 1998
Mycobacterium tuberculosis            16   Bacteria (Gm+)                  4.41        3924             June, 1998
Treponema pallidum                    17   Bacteria (Gm-)                  1.14        1041             July, 1998
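
One pattern in the 1998 list above is how tightly packed these small genomes are. The following is a minimal sketch that computes gene density from a few rows of that list; the dictionary name `GENOMES` and the helper `genes_per_kbp` are hypothetical names chosen for illustration, and the numbers are simply copied from the list.

```python
# (size in Mbp, number of genes), copied from a few rows of the 1998 list
GENOMES = {
    "Mycoplasma genitalium": (0.58, 470),
    "Haemophilus influenzae": (1.83, 1703),
    "Archaeoglobus fulgidus": (2.18, 2436),
    "Mycobacterium tuberculosis": (4.41, 3924),
    "Escherichia coli": (4.60, 4288),
    "Saccharomyces cerevisiae": (13.00, 5885),
}


def genes_per_kbp(size_mbp: float, n_genes: int) -> float:
    """Average number of genes per 1000 bp of genome."""
    return n_genes / (size_mbp * 1000.0)


# Print the genomes from smallest to largest with their gene density.
for name, (size_mbp, n_genes) in sorted(GENOMES.items(), key=lambda kv: kv[1][0]):
    density = genes_per_kbp(size_mbp, n_genes)
    print(f"{name:<28} {size_mbp:>6.2f} Mbp   {density:.2f} genes/kbp")
```

On these numbers the bacteria and archaea come out at roughly one gene per kbp, while yeast is at about half that.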










1999: A List of Sequenced Genomes
        (30 so far, as of 1 November, 1999)

  • link to alphabetical list






  • The "C-value paradox"

    Although the number of genomes being sequenced is increasing rapidly, one has to put this into perspective - the genomes of organisms fall very roughly into four different classes:

    Organism group                                        Size (bp)                                                No. sequenced
    viruses                                               ~1,000 - 70,000                                          534
    bacteria                                              ~500,000 - 8,000,000                                     30
    "simple" eukaryotes                                   ~12,000,000 - 270,000,000                                2
    "complex" eukaryotes (most animals and some plants)   ~700,000,000 - ~10,000,000,000 (ave. ~3,000,000,000)     0
    "other" eukaryotes (plants and amoeba)                ~10,000,000,000 - 670,000,000,000                        0

    Discussion

    Why does an amoeba have more than 200 times as much DNA as a human?

    Think about it for a discussion in class. I have a possible explanation, although I'm not sure anyone really knows the answer to this, to be honest.
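
The size of the mismatch can be checked with a little arithmetic. This is only a sketch using the rough figures quoted in these notes (genome sizes from the size classes above, and the approximate "% coding" values from the 1996 list); the function names are hypothetical.

```python
# Rough figures quoted in these notes (all approximate)
HUMAN_BP = 3_000_000_000        # average for "complex" eukaryotes
LARGEST_BP = 670_000_000_000    # upper end of the "other" eukaryotes class
YEAST_BP = 13_000_000           # Saccharomyces cerevisiae
HUMAN_CODING = 0.02             # ~2% coding
YEAST_CODING = 0.50             # ~50% coding


def fold_difference(a_bp: float, b_bp: float) -> float:
    """How many times larger genome a is than genome b."""
    return a_bp / b_bp


def coding_bp(genome_bp: float, coding_fraction: float) -> float:
    """Approximate amount of coding DNA in a genome."""
    return genome_bp * coding_fraction


print(f"largest genome vs. human: {fold_difference(LARGEST_BP, HUMAN_BP):.0f}x")
print(f"human vs. yeast, total DNA: {fold_difference(HUMAN_BP, YEAST_BP):.0f}x")
ratio = coding_bp(HUMAN_BP, HUMAN_CODING) / coding_bp(YEAST_BP, YEAST_CODING)
print(f"human vs. yeast, coding DNA only: {ratio:.1f}x")
```

With these rough inputs, total DNA differs far more between organisms than coding DNA does, which is the heart of the paradox.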



    This brings us to the first question on the quiz:

    Answers to the self-test quiz which you are supposed to do BEFORE the lecture:

    1. The short answer - a very long time. About 2.4 x 10^12 years.
         That's about 160 times longer than the estimated age of the universe!



    2. The piece of paper would be quite thick - it would reach outside the earth's
         atmosphere and beyond the orbit of the planet Mars.



    Link to last year's introductory lecture




    Part 2: DNA Symmetry Elements and DNA Structures

    Background:

  • Introduction to DNA symmetry elements
  • DNA is like Coca-cola
  • Historical background - fiber diffraction vs. X-ray crystallography
  • Families of DNA helices
  • A Brief Introduction to Alternative Conformations of DNA



  • DNA symmetry elements defined

    From a DNA sequence perspective, there are 4 types of repeats:

  • Direct Repeats
      • Simple Tandem Repeats
      • (Longer) Tandem Repeats
      • Direct (non-tandem) Repeats
      • Phased Repeats

  • Inverted Repeats

  • Mirror Repeats

  • Everted Repeats
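
The four repeat types can be sketched in a few lines of Python. The definitions used here are an assumption, since they are not spelled out in these notes: the second copy is taken to be the same sequence (direct), the reversed sequence (mirror), the complement (everted), or the reverse complement (inverted). The names `TRANSFORMS`, `partner` and `classify` are hypothetical.

```python
# ASSUMED convention: the second copy of a repeat is related to the first by
# one of four simple transformations of the sequence as read on one strand.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq: str) -> str:
    """Base-by-base complement, without reversing."""
    return seq.upper().translate(_COMPLEMENT)


TRANSFORMS = {
    "direct": lambda s: s.upper(),               # same strand, same direction
    "mirror": lambda s: s.upper()[::-1],         # same strand, opposite direction
    "everted": complement,                       # opposite strand, same direction
    "inverted": lambda s: complement(s)[::-1],   # opposite strand, opposite direction
}


def partner(seq: str, kind: str) -> str:
    """Return the second copy that would complete a repeat of the given kind."""
    return TRANSFORMS[kind](seq)


def classify(first: str, second: str) -> list:
    """List every repeat type that relates `first` to `second`."""
    return [kind for kind in TRANSFORMS if partner(first, kind) == second.upper()]


unit = "GATTC"
for kind in TRANSFORMS:
    print(f"{kind:<9} {unit} ... {partner(unit, kind)}")
```

Note that very short or very symmetric units can satisfy more than one definition at once; for example, `classify("AT", "TA")` reports both mirror and everted.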







    On the "Biological Meanings" of Symmetry Elements

    Anatomy of chromosomes - there are four important parts in metaphase chromosomes (telomeres, centromeres, heterochromatin, and euchromatin):

    Figure 6.20b from Hartl & Jones, "GENETICS - Principles and Analysis", fourth edition (1998).

    There are two types of chromatin:

  • Heterochromatin - where the DNA is more condensed, and usually there is not much transcriptional activity.  Some heterochromatin will remain condensed throughout the cell cycle.


  • Euchromatin - this is where the "active" genes are - usually this region is much less condensed.

    Centromeric DNA

    Figure 6.25 from Hartl & Jones, "GENETICS - Principles and Analysis", fourth edition (1998).

    There are certain DNA sequences that are associated with the centromeres of chromosomes. Knowledge of this was essential in the construction of Yeast Artificial Chromosomes (or YACs, as they're usually called).


    Figure 6.15 from Hartl & Jones, "GENETICS - Principles and Analysis", fourth edition (1998).

    Here's an oversimplified view of the attachment of the kinetochore, which consists of several hundred microtubules bound together.



    Telomeric DNA

    Telomeric DNA can fold back on itself. This is necessary to allow for DNA synthesis at the ends of linear chromosomes (this isn't a problem for circular chromosomes!).

    The ends of the chromosomes get shorter every time the cells divide, because the very end of a linear DNA molecule cannot be fully copied during replication. Thus, after every round of replication, the chromosome gets a bit shorter. This is kind of like "planned obsolescence", where the cells basically have so many divisions and then they fall apart. However (fortunately!) the cells have a mechanism for extending the length of the telomeres - the name of the enzyme is TELOMERASE. It has been found that in many cases cancer cells have a mutation such that the telomerase gene is overexpressed, thus allowing the cells to "live forever". Early results from clinical trials show that specifically inhibiting the activity of the telomerase protein can slow or completely stop the growth of many types of cancer. More recently, the idea of using telomerase gene therapy to prevent people from getting old has received much attention in the media.

    Figure 6.26 from Hartl & Jones (page 250).
    Hartl & Jones, Figure 6-27





    Repetitive DNA

    Highly repetitive DNA

  • Dispersed - e.g., the Alu family
      • about 300 bp long
      • 500,000 copies in humans (about 5% of the human genome)
      • dispersed throughout the chromosomes

  • Localised highly repetitive sequences
      • about 2-10 bp long
      • present in millions of copies, often in large blocks (about 6% of the human genome)
      • associated with heterochromatin
      • usually very high A+T content

    Middle repetitive DNA

  • makes up more than 40% of the human genome
  • position varies due to transposable elements
  • includes the following types of sequences:
      • microsatellite DNA
      • dinucleotide repeats
      • trinucleotide repeats - associated with many diseases (e.g., Fragile X syndrome, myotonic dystrophy)
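
The percentages quoted for highly repetitive DNA follow from simple arithmetic. This is a small sketch, assuming a human genome of about 3,000,000,000 bp (the rough figure used in Part 1); the function name `genome_fraction` is hypothetical.

```python
HUMAN_GENOME_BP = 3_000_000_000  # assumed rough size, as used in Part 1


def genome_fraction(unit_bp: int, copies: int, genome_bp: int = HUMAN_GENOME_BP) -> float:
    """Fraction of the genome occupied by `copies` repeats of length `unit_bp`."""
    return unit_bp * copies / genome_bp


# Alu family: about 300 bp, about 500,000 copies
alu = genome_fraction(300, 500_000)
print(f"Alu: {300 * 500_000:,} bp, or {alu:.1%} of the genome")

# Localised repeats: about 6% of the genome, in units of 2-10 bp
localised_bp = 0.06 * HUMAN_GENOME_BP
for unit in (2, 10):
    print(f"{unit}-bp units: about {localised_bp / unit / 1e6:.0f} million copies")
```

The second calculation is a consistency check: 6% of the genome in units of 2-10 bp does indeed correspond to tens of millions of copies.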





  • Table of DNA sequences, structures, and biological functions

    Sequence motif                     Possible structure                                       Biological function
    (C3TA2)n                           4-stranded DNA                                           telomeres
    (ACA5GAGTGT3CA2...)n               curved DNA                                               associated with centromeres (171 bp alphoid repeat)
    (A3-5N5-7)n                        curved DNA                                               promoter regions
    (R)n, where n > 250 bp             A-DNA; stable intramolecular triplex DNA                 transposons; homologous recombination
    (ttcca)n, where n ~ 1,000,000 bp   A-DNA; stable intramolecular triplex DNA                 human Y chromosome
    (RY)n                              Z-DNA (>50% GC rich); cruciforms; slipped-mispair        induce mRNA editing; deletions (in bacteria); mutagenesis
    + (N)n                             recA triple-stranded DNA                                 homologous recombination
    + (R)n                             intermolecular triplex; intramolecular triple-strands    recombination; replication
    + (N)n                             cruciforms                                               deletions (in bacteria); insertion sequences
    + (N)n                             parallel-stranded DNA                                    unknown; stabilisation of telomeres(?)
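
Some of the motifs in the table can be searched for directly with regular expressions. The sketch below assumes the usual one-letter codes (R = purine, A or G; Y = pyrimidine, C or T) and reads the compact notation so that "C3TA2" means CCCTAA. The helper names `expand_motif` and `longest_match` are hypothetical, and length ranges such as A3-5 are deliberately not handled.

```python
import re


def expand_motif(compact: str) -> str:
    """Expand compact repeat notation, e.g. 'C3TA2' -> 'CCCTAA'."""
    if "-" in compact:
        raise ValueError("length ranges such as A3-5 are not handled in this sketch")
    # Each letter may be followed by a repeat count; no count means one copy.
    return "".join(base * int(count or 1)
                   for base, count in re.findall(r"([A-Za-z])(\d*)", compact))


def longest_match(pattern: str, seq: str) -> str:
    """Return the longest stretch of `seq` matching `pattern` ('' if none)."""
    hits = (m.group(0) for m in re.finditer(pattern, seq.upper()))
    return max(hits, key=len, default="")


PURINE_TRACT = r"[AG]+"              # (R)n: a run of purines
ALTERNATING_RY = r"(?:[AG][CT])+"    # (RY)n: alternating purine-pyrimidine

print(expand_motif("C3TA2") * 3)                    # three telomeric repeat units
print(longest_match(PURINE_TRACT, "CCAGGAAGTTT"))   # longest purine run
print(longest_match(ALTERNATING_RY, "TTCGCGCGTT"))  # longest alternating RY run
```

Finding a matching stretch only says that the sequence could adopt the structure listed in the table; whether it actually does depends on conditions that a sequence search cannot see.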


    References:

    Sinden,R.R., Pearson,C.E., Potaman,V.N., Ussery,D.W., "DNA: STRUCTURE and FUNCTION", Advances in Genome Biology, 5A:1-141, (1998).

    Calladine,C.R., Drew,H.R., "Understanding DNA: The Molecule and How It Works", (2nd edition, Academic Press, San Diego, 1997).

    References on DNA




    LINKS:

    "Official" CBS Bioinformatics links
    (an on-line and updated version of Chapter 12 from the Baldi & Brunak book)

    HMS Beagle report on NCBI Bioinformatics sites - this is a good place for a molecular biologist to start!

    The Human Genome Project Information page - this is put out by Los Alamos National Labs, and is updated regularly.




    Last modified Tuesday, 28 September, 1999 by David Ussery