30431 Introduktion til Bioinformatik
David Ussery
Fredag, 6 April, 2001




DNA Symmetry Elements in Whole Genomes







Overview

This lecture is about ways of looking at DNA sequences in complete genomes and chromosomes, in terms of symmetry elements. There are two parts to this talk. In Part 1, I will discuss the fact that we simply have "Too Much Information" becoming available, and the problem will only get worse in the near future. There are ways of cataloging and organising the data, of course. I have found that the true diversity of genome sizes in Nature is often neglected, so we'll talk for a few minutes about the "C-value paradox", along with some possible ideas for WHY certain organisms have so much DNA.

I would like to think that one way of dealing with the explosion of sequence information, in terms of DNA sequences, is to think about it in biological terms, in particular in physical-chemical terms of structure and function of symmetry elements. For example, there are specific DNA sequences which "code" for a telomere, and different DNA sequences which are specific for centromeres. Specific DNA sequences, their structures, and biological functions will be discussed.

In Part 2, I will introduce "DNA Atlases", first having a look at base composition throughout sequenced chromosomes, and then looking at gene expression throughout the whole genome.

I have also made a separate file containing specific LEARNING OBJECTIVES for this lecture, as well as a "self-test quiz", which I recommend having a look at BEFORE the lecture, if possible. I've incorporated the answers to questions 1 and 2 into PART 1 of the lecture notes.






Part 1: The Problem: Too Much Information


Brevis esse laboro, obscurus fio. ("I strive to be brief, and I become obscure.")     - Horace


Some philosophical thoughts about Information and the Size of Genomes.



The information in GenBank is doubling every 10 months.
What are the implications of this?
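One implication can be worked out directly from the doubling time. The sketch below assumes the 10-month doubling time stays constant, which is a simplification; the function name is only for illustration.

```python
def fold_increase(months_elapsed, doubling_time_months=10.0):
    """Fold-increase in data after `months_elapsed` at a constant doubling time."""
    return 2.0 ** (months_elapsed / doubling_time_months)


# How much more data would there be after 1, 5 and 10 years?
for years in (1, 5, 10):
    print(f"after {years:2d} years: {fold_increase(12 * years):,.1f}x the data")
```

Five years is six doublings (64-fold), and ten years is twelve doublings (4096-fold), if the trend were to hold that long.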

Figure: Growth in GenBank



A look at genome sequencing since 1994:

Year    Genomes sequenced
1994    0
1995    2
1996    4
1997    9
1998    17
1999    30
2000    53
2001    >100
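The counts in the table can be turned into a rough doubling time. The sketch below fits a straight line to log2(count) against year by ordinary least squares. It leaves out 1994 (zero genomes has no logarithm) and 2001 (">100" is only a lower bound); that choice of points is my assumption, and a different choice gives a somewhat different answer.

```python
import math

# Number of sequenced genomes per year, copied from the table above.
genomes = {1995: 2, 1996: 4, 1997: 9, 1998: 17, 1999: 30, 2000: 53}


def doubling_time_years(counts):
    """Inverse of the least-squares slope of log2(count) against year."""
    years = sorted(counts)
    logs = [math.log2(counts[y]) for y in years]
    mean_x = sum(years) / len(years)
    mean_y = sum(logs) / len(logs)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
             / sum((x - mean_x) ** 2 for x in years))
    return 1.0 / slope  # years per doubling


print(f"doubling time: about {12 * doubling_time_years(genomes):.0f} months")
```

With these six points the estimate comes out at roughly one year per doubling.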














The "C-value paradox"

Although the number of genomes being sequenced is increasing rapidly, one has to put this into perspective - the organisms can be placed into four different classes:

Organism group              Size (bp)                           No. sequenced
viruses                     ~300 to ~350,000                    545
prokaryotes                 ~250,000 to ~15,000,000             >100
single-celled eukaryotes    ~12,000,000 to ~600,000,000,000     4
multi-celled eukaryotes     ~20,000,000 to ~500,000,000,000     3



This level of variation often does not correlate at all with "biological complexity". For example, a simple amoeba has 600,000,000,000 bp of DNA in its genome, or 200x as much as humans! As another example, insect genomes range from 20,000,000 bp, just a bit larger than a bacterial genome, to more than 10 BILLION bp, much larger than the human genome. Here is a table of different Drosophila species:

Drosophila species      Genome size (bp)
D. americana            ~300,000,000
D. arizonensis          ~225,000,000
D. eohydei (male)       ~234,000,000
D. eohydei (female)     ~246,000,000
D. funebris             ~255,000,000
D. hydei                ~202,000,000
D. melanogaster         ~180,000,000 (~138,000,000 sequenced)
D. miranda              ~300,000,000
D. nasutoides           ~800,000,000
D. neohydei             ~192,000,000
D. simulans             ~127,500,000
D. virilis              ~345,000,000

In summary, the genome sizes of the Drosophila species that have been examined so far range from about 127 million bp to about 800 million bp. At present we SUSPECT that they all contain roughly the same number of genes. It is possible (even likely) that the larger genomes contain duplicated regions, or perhaps even duplicated chromosomes; there is ample space for one or more extra copies of the entire genome. In addition, they contain various types of repeats, known as "selfish DNA".
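To get a feel for these ratios, here is a small sketch using the approximate sizes quoted above. The human genome size of about 3 billion bp is my assumption here; it is the value implied by the "200x" comparison with the amoeba.

```python
HUMAN_BP = 3_000_000_000      # assumption: ~3 billion bp (implied by the 200x figure)
AMOEBA_BP = 600_000_000_000   # from the text above

# Approximate Drosophila genome sizes (bp), copied from the table above.
drosophila = {
    "D. americana": 300_000_000,
    "D. arizonensis": 225_000_000,
    "D. eohydei (male)": 234_000_000,
    "D. eohydei (female)": 246_000_000,
    "D. funebris": 255_000_000,
    "D. hydei": 202_000_000,
    "D. melanogaster": 180_000_000,
    "D. miranda": 300_000_000,
    "D. nasutoides": 800_000_000,
    "D. neohydei": 192_000_000,
    "D. simulans": 127_500_000,
    "D. virilis": 345_000_000,
}


def fold_range(sizes):
    """Ratio of the largest to the smallest genome in a {name: size} mapping."""
    return max(sizes.values()) / min(sizes.values())


print(f"amoeba / human: {AMOEBA_BP / HUMAN_BP:.0f}x")
print(f"largest / smallest Drosophila genome: {fold_range(drosophila):.1f}x")
```

Even within this one genus the sizes span more than a 6-fold range, from D. simulans to D. nasutoides.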



Discussion

Why does an amoeba have more than 200x as much DNA as humans?

Think about it for a discussion in class. I have a possible explanation, although I'm not sure anyone really knows the answer to this, to be honest.



This brings us to the first question on the quiz:

Answers to the self-test quiz which you are supposed to do BEFORE the lecture:

1. The short answer - a very long time. About 2.4 x 10^12 years.
     That's about 160 times longer than the estimated age of the universe!



2. The piece of paper would be quite thick - it would reach outside the earth's
     atmosphere and beyond the orbit of the planet Mars.








Part 2: DNA Symmetry Elements and DNA Structures

I will divide "DNA symmetry elements" into 5 different categories.

    Today's lecture will cover:

  1. Base-composition throughout the chromosome

  2. Organisation of the genome and Gene Expression

    Next Tuesday's lecture will cover:

  3. DNA helical properties

  4. DNA mechanical/structural properties

  5. DNA Repeats throughout chromosomes




Introduction to DNA Atlases.

One way of dealing with the problem of how to display so much sequence information is to have a look at the whole chromosome at once, smoothing over a large window. The entire bacterial chromosome is displayed as a circle, with different colours representing various parameters. First, as an introduction to atlases, we will look at base-composition. Then we will have a look at levels of expression of mRNA and proteins throughout the chromosome. As an example, I will use my very favourite organism, Escherichia coli K-12.
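The smoothing step can be sketched in a few lines. This is a minimal illustration, not the code used to draw the atlases: it computes the fraction of chosen bases in windows along a circular sequence, wrapping around the end so that the last windows continue at the start. The function name and default parameters are my own choices.

```python
def windowed_fraction(seq, bases="AT", window=50_000, step=None):
    """Fraction of `bases` in successive windows along a circular sequence."""
    seq = seq.upper()
    step = step or window
    # Append the start of the sequence so windows can run past the end
    # (the chromosome is a circle).
    extended = seq + seq[:window - 1]
    values = []
    for start in range(0, len(seq), step):
        chunk = extended[start:start + window]
        values.append(sum(chunk.count(b) for b in bases) / len(chunk))
    return values


# Toy example: a 16 bp "chromosome", 8 bp windows moved 4 bp at a time.
print(windowed_fraction("AAAATTTTGGGGCCCC", window=8, step=4))
```

Each value in the returned list would become one coloured segment of a circle in the atlas; a larger window gives a smoother plot at the cost of detail.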



Base-composition


Base-composition Atlas of E. coli K-12



There are several things to notice in this plot. First, the bases are not uniformly distributed throughout the genome; there are "clumps" or clusters where specific bases are a bit more concentrated. Also, the G's (turquoise) are clearly favoured on one half of the chromosome, whilst the C's (magenta) are favoured on the other half. This shows up in the "GC-skew" lane as well (2nd circle from the middle). I have labelled the entire terminus region, which ranges from TerE (around 1.08 million bp (Mbp)) to TerG (~2.38 Mbp) in Escherichia coli K-12. Finally, several genes corresponding to the darker bands (i.e., more biased nucleotide composition) are labelled.
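The GC skew is commonly defined as (G - C) / (G + C), counted on one strand within a window. The sketch below uses that definition together with a running sum of the window values. In many bacterial genomes the running sum reaches its minimum near the replication origin and its maximum near the terminus, but treat that as a rule of thumb. The function names and the window size are illustrative only.

```python
def gc_skew(seq, window=10_000):
    """(G - C) / (G + C) for consecutive non-overlapping windows of one strand."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


def cumulative_extremes(skews):
    """Window indices where the running sum of the skew is lowest and highest."""
    running, total = [], 0.0
    for s in skews:
        total += s
        running.append(total)
    return running.index(min(running)), running.index(max(running))


# Toy sequence: C-rich first half, G-rich second half.
toy = "CCCG" * 5 + "GGGC" * 5
print(gc_skew(toy, window=4))
print(cumulative_extremes(gc_skew(toy, window=4)))
```

In the toy sequence the skew changes sign halfway along, which is the same kind of switch that separates the two halves of the chromosome in the atlas.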

The same pattern can be seen for the other three Escherichia coli chromosomes which have been sequenced (so far!), as shown in the table below.

Escherichia coli strain      Source          %AT   Size (bp)    No. of genes   Coding density   bp/gene   Reference
K-12, isolate W3110          DDBJ            49    4,636,552    4085           79%              1135      -
K-12, isolate MG1655         U. Wisconsin    49    4,639,221    4397           87%              1055      Science 277:1453-1474, September, 1997
O157:H7, EDL933              U. Wisconsin    49    5,529,376    5283           86%              1047      Nature 409:529-533, January, 2001
O157:H7, RIMD 0509952        Miyazaki, Japan 49    5,498,450    5361           88%              1026      DNA Res. 8:11-22, February, 2001
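The bp/gene column in the table is simply the genome size divided by the number of genes. A short sketch that recomputes it from the sizes and gene counts given above (the tuple layout is just for illustration):

```python
# (strain, genome size in bp, number of genes), copied from the table above.
genomes = [
    ("K-12 W3110", 4_636_552, 4085),
    ("K-12 MG1655", 4_639_221, 4397),
    ("O157:H7 EDL933", 5_529_376, 5283),
    ("O157:H7 RIMD 0509952", 5_498_450, 5361),
]


def bp_per_gene(size_bp, n_genes):
    """Average number of base pairs per gene, rounded to the nearest bp."""
    return round(size_bp / n_genes)


for strain, size_bp, n_genes in genomes:
    print(f"{strain:22s} {bp_per_gene(size_bp, n_genes)} bp/gene")
```

Note that this is an average spacing of genes along the chromosome, not an average gene length; the coding density column is a separate measurement.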

In addition to showing overall global properties of the chromosome (such as replication origin and terminus), the base composition can also highlight regions different from the rest of the genome. For example, in the plasmid pO157 there are some regions which are much more AT rich; these probably came about as a result of horizontal gene transfer, which we will discuss again in the next lecture.


Note that the "toxB" gene is much more AT rich than the average for the rest of the plasmid. This COULD be due to the fact that this gene came from an organism with a more AT rich genome, or (more likely in my opinion) it is more AT rich because it is important for this gene to vary in sequence (i.e., to have a higher mutational frequency).
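One simple way to pick out regions like this is to flag windows whose AT content lies well above the average of all windows. The sketch below does this with a z-score; the window size and the cutoff of 2 standard deviations are arbitrary choices of mine. It can only say that a region is unusual, not which of the two explanations above is the right one.

```python
from statistics import mean, pstdev


def at_fraction(seq):
    """Fraction of A and T in a sequence."""
    seq = seq.upper()
    return (seq.count("A") + seq.count("T")) / len(seq)


def at_rich_windows(seq, window=1000, z_cutoff=2.0):
    """Start positions of windows whose AT fraction is more than
    `z_cutoff` standard deviations above the mean of all windows."""
    starts = range(0, len(seq) - window + 1, window)
    fractions = [at_fraction(seq[s:s + window]) for s in starts]
    mu, sigma = mean(fractions), pstdev(fractions)
    if sigma == 0:
        return []  # perfectly uniform composition: nothing stands out
    return [s for s, f in zip(starts, fractions) if (f - mu) / sigma > z_cutoff]


# Toy example: one pure-AT window in an otherwise 50% AT sequence.
toy = "GCAT" * 5 + "ATAT" + "GCAT" * 4
print(at_rich_windows(toy, window=4))
```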




Gene Expression in E. coli

Escherichia coli is probably the best characterised organism.

Some numbers:

  • There are 4085 predicted genes in Escherichia coli strain K-12 isolate W3110.


  • There are 4289 predicted genes in Escherichia coli strain K-12 isolate MG1655.


  • There are 5283 predicted genes in Escherichia coli strain O157:H7 isolate EDL933 (enterohemorrhagic pathogen).
  • There are about 5361 predicted genes in Escherichia coli strain O157:H7 substrain RIMD 0509952 (enterohemorrhagic pathogen).



  • Roughly 2600 genes have been found to be expressed in Escherichia coli strain K-12 cells, under standard laboratory growth conditions.



  • About 2100 spots can be seen on 2-D protein gels.



  • Very roughly 1000 different genes (only about 600 mRNA transcripts) are expressed at "detectable levels" in E. coli cells grown in LB media.



  • Only about 350 proteins exist at concentrations of > 100 copies per cell. (These make up 90% of the total protein in E. coli!)

  • Most (>90%) of the proteins are present in very low amounts (less than 100 copies per cell).



    What is the chromosomal location of the genes for the highly expressed proteins?


    It has been known since the 1960s that genes closer to the replication origin are more highly expressed. However, only in the past few years has technology allowed the simultaneous monitoring of ALL the genes in Escherichia coli. There are 4397 annotated genes in the E. coli K-12 genome. Shown below is an "Atlas plot" of the E. coli K-12 genome, with the outer circles representing the concentration of proteins (roughly in number of molecules/cell) and mRNA (again, roughly number of molecules/cell). Under these conditions (i.e., cells grown to late log phase, in minimal media), 2005 genes were expressed at detectable levels, and only 233 proteins were found to be "abundant" (i.e., very roughly more than 100 molecules per cell).


    For E. coli K-12 cells, grown in minimal media to late log phase:

    4397 annotated genes -> 2005 mRNAs expressed -> 233 abundant proteins



    (note that these numbers will vary for different experimental conditions....)
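Expressed as percentages, the funnel above looks like this. The numbers are the ones quoted in the text; the variable names are only for illustration.

```python
annotated_genes = 4397    # annotated genes in E. coli K-12
expressed_mrnas = 2005    # genes expressed at detectable levels
abundant_proteins = 233   # proteins found in abundant amounts


def percent(part, whole):
    """`part` as a percentage of `whole`."""
    return 100.0 * part / whole


print(f"expressed / annotated: {percent(expressed_mrnas, annotated_genes):.1f}%")
print(f"abundant  / annotated: {percent(abundant_proteins, annotated_genes):.1f}%")
print(f"abundant  / expressed: {percent(abundant_proteins, expressed_mrnas):.1f}%")
```

So under these particular conditions a bit under half of the annotated genes give a detectable mRNA, and only about one gene in twenty gives an abundant protein.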

    E. coli chromatin atlas


    In this picture, the outer lane represents the concentration of proteins (blue), the next lane the concentration of mRNA (green), and then the annotated genes.


    The inner three circles represent different aspects of the DNA base composition throughout the genome. The innermost circle (turquoise/violet) is the bias of G's towards one strand or the other (that is, a look at the mono-nucleotide distribution of the 4 DNA bases). The next lane is the density of purine (or pyrimidine) stretches of 10 bp or longer. Note that in both cases purines tend to favour the leading strand of the replicore, whilst pyrimidine tracts are more likely to occur on the lagging strand. Finally, the next circle (turquoise/red) is simply the AT content of the genome, averaged over a 50,000 bp window. Note that the terminus is slightly more AT rich, whilst the rest of the genome is slightly GC rich. (The AT content scale ranges from 45% to 55%.)
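Purine and pyrimidine stretches of 10 bp or longer are easy to find with a regular expression. The sketch below is one way to do it, assuming a stretch means an uninterrupted run of A/G (purines) or C/T (pyrimidines) on the strand being read; the function name is hypothetical and this is not necessarily how the atlas lane was computed.

```python
import re


def find_tracts(seq, bases="AG", min_length=10):
    """(start, end) positions of uninterrupted runs of `bases`
    that are at least `min_length` bp long (end is exclusive)."""
    pattern = re.compile(f"[{bases}]{{{min_length},}}")
    return [(m.start(), m.end()) for m in pattern.finditer(seq.upper())]


toy = "CC" + "AG" * 6 + "CTC" + "AGAGA"
print(find_tracts(toy))               # purine tracts (A/G)
print(find_tracts(toy, bases="CT"))   # pyrimidine tracts (C/T)
```

Counting the tracts that start inside each window along the chromosome would then give a density that can be drawn as one lane of an atlas.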




    There are "clumps" of highly expressed genes, and these are anti-correlated with regions of condensed chromatin.

    E. coli chromatin atlas







    REFERENCES

    Papers relevant to this lecture (handed out in class)

      Friday (6 April, 2001)

    1. David W. Ussery, "Genome Databases", The Encyclopedia of Genetics, in press, April, 2001.
    2. Ussery,D.W., Larsen,T.S., Wilkes,K.T., Friis,C., Worning,P., Krogh,A., Brunak,S. "Genome Organisation and Chromatin Structure in Escherichia coli", Biochimie,83:201-212, (2001).
    3. Carsten Friis, Lars Juhl Jensen, and David W. Ussery, "Visualisation of Pathogenicity Regions in Bacteria", Genetica, 108:47-51, 2000.
    4. David W. Ussery, "Bioinformatics2000 Meeting Report", Genome Biology, 1, (#3), 1-2, 2000.



    Other references

  • Richard R. Sinden, Christopher E. Pearson, Vladimir N. Potaman, and David W. Ussery, "DNA: Structure and Function", Advances in Genome Biology, 5A:1-141, (1998).
  • Ussery,D.W., Higgins,C.F., and Bolshoy,A., "Environmental Influences on DNA Curvature", J. Biomolecular Structure & Dynamics,16:811-823, (1999).




    To be handed out next lecture (Tuesday, 17 April, 2001)

  • David W. Ussery, "DNA Structure: A-, B-, and Z-DNA Families", manuscript submitted to The Encyclopedia of Life Sciences, April, 2000.
  • Anders Gorm Pedersen, Lars Juhl Jensen, Hans-Henrik Stærfeldt, Søren Brunak, and David W. Ussery, "A DNA Structural Atlas for Escherichia coli", Journal of Molecular Biology, 299 (#4), 907-930, (2000).
  • Lars Juhl Jensen, Carsten Friis, and David W. Ussery, "Three Views of Microbial Genomes", Research in Microbiology, 150, pages 773-777, 1999.
  • David W. Ussery, "DNA Denaturation", manuscript submitted to The Encyclopedia of Genetics, September, 2000.



    Books about DNA:

    Watson, James D. "A PASSION FOR DNA: Genes, Genomes, and Society", (Oxford University Press, Oxford, 2000).

    Sinden, Richard R., "DNA: STRUCTURE and FUNCTION", (Academic Press, New York, 1994).

    Calladine,C.R., Drew,H.R., "Understanding DNA: The Molecule and How It Works", (2nd edition, Academic Press, San Diego, 1997).










    Last modified Thursday, 9 November, 2000 by David Ussery