
- What is bioinformatics?
- Introduction
- The Problem: too much information!
- DNA Symmetry Elements
- DNA repeats
- DNA helix families
- Comparative Genomics
- What has been sequenced?
- What is missing?
- Comparison of bacterial chromosomes.
Bioinformatics is the application of machine learning methods to biological information. Part (but not all) of this information is in the form of sequences (e.g., DNA, RNA, proteins). In addition to sequences, there are other sources of data for bioinformatics analysis, such as micro-array data, 2-D gel information, image analysis, etc.
There are principally three types of biological sequences (DNA, RNA, and protein), with the information flowing as outlined in the Central Dogma of molecular biology:
DNA -> RNA -> Protein
Once a new sequence has been determined, there are various ways of trying to find the function:
- Through a comparison of how well it matches another sequence of known function (alignment methods).
- By looking for characteristic patterns within the sequence (de novo methods).
- Prediction of the sequence structure (ab initio methods).
In this talk (as well as the one this afternoon), I will focus on only the latter two approaches - that is, looking for patterns in DNA sequences and predicting local DNA structures based on the DNA sequence. Both talks will be about information in the DNA sequences within the context of sequenced bacterial chromosomes. Note that this is only one tiny fraction of the much larger subject area of bioinformatics.
Information depends on CONTEXT
For DNA sequences, there are several different types of information:
- Coding information for amino acid sequences in proteins.
- Coding information for RNA sequences.
  - tRNA
  - rRNA
  - snRNA
  - telomerase RNA
  - other RNAs
- Protein binding site information.
  - transcription factors
  - chromatin proteins
  - restriction enzymes
- DNA modification site information.
  - methylation sites
  - glycosylation sites
  - other modification sites
- Chromosome organisational information.
  - regions of highly expressed genes
  - origin and terminus of replication
  - mutational "hot spots"
- Physical/mechanical local structural information.
  - meltability
  - helix rigidity
  - intrinsic DNA curvature
  - nucleosomal positioning
- Repeat/symmetry elements.
  - repeats (direct, inverted, mirror, everted)
  - A-DNA (certain purine stretches) or Z-DNA (certain pyr/pur stretches)
  - structural periodicity patterns
Too Much Information!
Currently several hundred prokaryotic genomes have been sequenced, and nearly 100 genomes are publicly available for analysis. The flow of information is essentially the same as above, that is:
Genome -> Transcriptome -> Proteome
Link to a list of sequenced bacterial genomes
Some philosophical thoughts about Information and the Size of Genomes.
The information in GenBank is doubling every 10 months.
What are the implications of this?
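To get a feel for the scale, the doubling time quoted above can be turned into a growth factor. The short Python sketch below assumes that the 10-month doubling time stays constant, which is a simplification; the function name `growth_factor` is only illustrative.

```python
def growth_factor(months: float, doubling_time_months: float = 10.0) -> float:
    """Fold increase after `months`, assuming a constant doubling time."""
    return 2.0 ** (months / doubling_time_months)


# Growth after 1, 5 and 10 years if the 10-month doubling time holds.
for years in (1, 5, 10):
    print(f"{years:2d} years: about {growth_factor(12 * years):,.1f}-fold")
```

Under this assumption the database grows roughly 2.3-fold per year and 64-fold in five years, so analysis methods have to scale at least as fast as the data.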
A look at genome sequencing since 1994:
| Year | # Genomes sequenced | Running total |
|---|---|---|
| 1994 | 0 | 0 |
| 1995 | 2 | 2 |
| 1996 | 2 | 4 |
| 1997 | 5 | 9 |
| 1998 | 8 | 17 |
| 1999 | 13 | 30 |
| 2000 | 23 | 53 |
| 2001 | >50 | >100 |
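The running total is simply the cumulative sum of the yearly counts. A minimal Python check of the table, leaving out 2001 because only a lower bound (>50) is given for that year:

```python
from itertools import accumulate

years = [1994, 1995, 1996, 1997, 1998, 1999, 2000]
sequenced_per_year = [0, 2, 2, 5, 8, 13, 23]

# Cumulative sum of the yearly counts gives the running total column.
running_total = list(accumulate(sequenced_per_year))

for year, new, total in zip(years, sequenced_per_year, running_total):
    print(year, new, total)
```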
Genome Databases Links

ONE possible way of trying to deal with all this information is to develop methods of visualising DNA structures within bacterial chromosomes. The method I have chosen to talk about today is based on two different groups of "DNA symmetry elements". The first is simply various types of repeats, and the second is the DNA helix families, which are formed by certain stretches of purines (or pyrimidines) for A-DNA, and certain stretches of alternating pyrimidines/purines for Z-DNA. The conformations adopted by these sequences have putative biological functions, based in part on their structures. The repeats will be discussed first.
A. DNA Repeats
From a DNA sequence perspective, there are 4 types of repeats:
- Direct Repeats
  - Simple Tandem Repeats
  - (Longer) Tandem Repeats
  - Direct (non-tandem)
  - Phased Repeats
- Inverted Repeats
- Mirror Repeats
- Everted Repeats
| Repeat | Pattern | Possible structure | Biological function |
|---|---|---|---|
| Direct repeats | | recA triple-stranded DNA | homologous recombination; duplications |
| Mirror repeats | | intermolecular triplex; intramolecular triple-strands | recombination; replication |
| Inverted repeats | | cruciforms | deletions (in bacteria); insertion sequences |
| Everted repeats | | parallel-stranded DNA | unknown; stabilisation of telomeres(?) |
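In sequence terms, the four repeat types differ only in how the second copy relates to the first. The Python sketch below illustrates this under stated assumptions: a direct repeat is the same sequence again, a mirror repeat is the reverse, an inverted repeat is the reverse complement, and an everted repeat is the complement without reversal (a definition assumed here because it fits the parallel-stranded structure listed in the table). The function names and the example unit `GGATC` are made up for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def second_copy(unit: str, kind: str) -> str:
    """Return the second copy of `unit` for a repeat type (same strand, read 5'->3')."""
    unit = unit.upper()
    if kind == "direct":
        return unit
    if kind == "mirror":
        return unit[::-1]
    if kind == "everted":  # assumed definition: complement, not reversed
        return unit.translate(COMPLEMENT)
    if kind == "inverted":
        return unit.translate(COMPLEMENT)[::-1]
    raise ValueError(f"unknown repeat type: {kind}")


def classify(first: str, second: str) -> list[str]:
    """List every repeat type for which `second` is the matching copy of `first`."""
    kinds = ("direct", "mirror", "inverted", "everted")
    return [k for k in kinds if second_copy(first, k) == second.upper()]


for kind in ("direct", "mirror", "inverted", "everted"):
    print(f"{kind:8s} GGATC ... {second_copy('GGATC', kind)}")
```

Note that the categories can overlap for special sequences: a unit that equals its own reverse complement, such as `GATC`, is classified as both a direct and an inverted repeat of itself.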
B. DNA Helix Families
A-DNA (left), B-DNA (middle) and Z-DNA (right) -- 12 bp each
From Dickerson et al. in Cold Spring Harbor Symposium for Quantitative Biology (1982) v47 p13-24.
3 families of DNA helices:
A-DNA family - this is most common for double stranded RNA, RNA/DNA hybrids, as well as for certain DNA sequences, such as long stretches of purines. NMR studies have shown that as few as 5 bp of purines in a row can set up an A-type of helix. Most of the DNA inside of cells is likely to be a mixture of the A- and B-DNA conformations.
B-DNA family - the majority of DNA exists in the "B-DNA form" inside the cells of living organisms. This is the classical "Watson-Crick" structure, although there is considerable sequence-specific variation. Thus, for example, different sequences can have from 9 bp/turn of the helix to 12 bp/turn, depending on the sequence of the DNA! However, on AVERAGE, the DNA is about 10.5 bp/turn.
Z-DNA family - this is much rarer than the other two families, although certain sequences, such as runs of GC repeats (GCGCGC), can form Z-DNA easily. In eukaryotes, CpG islands can form Z-DNA, and methylated CpG islands can form Z-DNA readily in vivo. Furthermore, specific proteins have been isolated which will bind preferentially to the left-handed Z-DNA conformation.
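The bp/turn figures quoted above for B-DNA translate directly into an average twist angle per base-pair step, since one full turn of the helix is 360 degrees. A small Python illustration (the function name is only illustrative):

```python
def twist_per_bp(bp_per_turn: float) -> float:
    """Average helical twist, in degrees, per base-pair step."""
    return 360.0 / bp_per_turn


for bp_per_turn in (9.0, 10.5, 12.0):
    print(f"{bp_per_turn:4.1f} bp/turn -> {twist_per_bp(bp_per_turn):.1f} degrees per bp")
```

So the sequence-dependent range of 9 to 12 bp/turn corresponds to about 40 down to 30 degrees of twist per step, around an average of roughly 34 degrees.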
Part 3: Comparative Genomics
What has been sequenced?
As of 12 November, 2001. Link to an updated table
| Kingdom | Number of species sequenced | Number of chromosomes sequenced* | Total bp sequenced |
|---|---|---|---|
| Archaea | 13 | 20 | 26,409,849 |
| Bacteria | 61 | 165 | 208,304,570 |
| Protista | 4 | 18 | 18,053,080 |
| Fungi | 2 | 23 | 36,117,519 |
| Plants | 1 | 7 | 47,623,657 |
| Animals | 3 | 36 | 2,979,841,298 |
| Viruses | 53 | 491 | 12,279,171 |
| Totals | 137 | 760 | 3,328,629,144 |

*Includes plasmids from sequenced genomes.
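As a quick consistency check, the per-kingdom figures can be summed and compared with the totals. The Python sketch below copies the numbers from the table above; the variable names are only illustrative.

```python
# (species, chromosomes incl. plasmids, total bp) per kingdom, copied from the table
table = {
    "Archaea": (13, 20, 26_409_849),
    "Bacteria": (61, 165, 208_304_570),
    "Protista": (4, 18, 18_053_080),
    "Fungi": (2, 23, 36_117_519),
    "Plants": (1, 7, 47_623_657),
    "Animals": (3, 36, 2_979_841_298),
    "Viruses": (53, 491, 12_279_171),
}

# Sum each column across all kingdoms.
totals = tuple(sum(column) for column in zip(*table.values()))
print("species, chromosomes, bp:", totals)

# Share of all sequenced base pairs that comes from the animal genomes.
animal_share = table["Animals"][2] / totals[2]
print(f"Animals: {animal_share:.1%} of all sequenced bp")
```

The three column sums reproduce the totals given in the table, and close to nine-tenths of all sequenced base pairs come from just three animal genomes.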
What is missing?
There are (at least) TWO things missing: genomes from ecologically abundant and diverse niches, and larger genomes.
Although the number of genomes being sequenced is increasing rapidly, one has to put this into perspective - the organisms can be placed into four different classes:
| Organism group | Size (bp) | No. sequenced |
|---|---|---|
| viruses | ~300 to ~350,000 | 545 |
| prokaryotes | ~250,000 to ~15,000,000 | 80 (public) |
| single-celled eukaryotes | ~12,000,000 to ~600,000,000,000 | 8 |
| multi-celled eukaryotes | ~20,000,000 to ~500,000,000,000 | 3* |

*Note that NONE of the multi-cellular eukaryotic chromosomes have yet been completely sequenced (i.e., 1 contiguous piece of DNA, with no gaps).
This level of variation often does not correlate at all with "biological complexity". For example, a simple amoeba has 600,000,000,000 bp of DNA in its genome, or 200x as much as in humans! As another example, genome sizes in insects range from 20,000,000 bp, or just a bit larger than a bacterial genome, to more than 10 BILLION bp, or more than three times larger than the human genome. So far (understandably!) the trend has been to sequence the SMALL genomes, and pretend they are representative of the larger ones. This is certainly reasonable, but one should keep the large genomes in the back of one's mind when trying to extrapolate from the sequences of the smallest genomes to the "real world" of biological complexity.
Comparison of bacterial chromosomes.
So we have lots of sequenced genomes. How can we compare them? A simple first approach is to look at average properties for the whole chromosomes. For example, the figure below shows the average AT content for 20 different Archaeal chromosomes. Note that MOST of the Archaeal genomes are AT RICH, which is contrary to the old dogma that they must be GC rich, because many of them are thermophiles. We now know that the genomes for many of these organisms can survive high temperatures because the DNA is positively supercoiled, rather than negatively supercoiled. This means that it takes much more energy to melt the helix.
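A whole-chromosome average such as AT content is easy to compute once the sequence is in hand. The sketch below is a minimal Python version; it assumes the chromosome is available as a plain string and ignores ambiguous bases such as N, which is a choice made here rather than something specified in the lecture.

```python
def at_content(seq: str) -> float:
    """Fraction of A and T among the unambiguous bases (A, C, G, T) in `seq`."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if at + gc == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return at / (at + gc)


# Toy example; a real chromosome would be read from a sequence file.
example = "ATGAAATTTGCGCATTAANN"
print(f"AT content: {at_content(example):.1%}")
```

The same function can be applied to each chromosome in turn to build a comparison like the one described above.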
[Figure: average AT content for 20 different Archaeal chromosomes]
As a further example, characteristics for 20 different Proteobacter chromosomes are compared on the following web page: http://www.cbs.dtu.dk/staff/dave/MScourse/ProteobacterOkt2001.html
How Random is DNA?
Although estimating the levels of A-DNA and Z-DNA might be difficult, one thing that is clear is that there is a strong bias in genomes towards an over-representation of purine stretches, as well as pyr/pur stretches. In addition, the patterns found in eukaryotic DNA are quite different from those in bacterial DNA. So, for example, in the two plots below, the occurrence of stretches of purines or pyr/pur tracts in E. coli is essentially the same as predicted by a "random" model of DNA, but in eukaryotic DNA it is very different from that expected, even when taking into account the pentameric composition (a "4th order Markov model"). This implies that the DNA in eukaryotes is much less "random" than the DNA in bacteria - at least with respect to runs of purines or alternating pyr/pur tracts.
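One way to measure such tracts is to recode the sequence into purines (R) and pyrimidines (Y) and then look for long runs. The Python sketch below is only an illustration under stated assumptions: the 10 bp threshold is taken from the table caption in this section, pyrimidine runs are counted together with purine runs (they are purine runs on the other strand), and the names `to_ry`, `HOMO_RUN`, `ALTERNATING_RUN` and `fraction_in_tracts` are invented here. It is not the method used to produce the plots or the expected values, and it does not model a "random" genome.

```python
import re


def to_ry(seq: str) -> str:
    """Recode DNA as R (purine: A, G), Y (pyrimidine: C, T) or N (anything else)."""
    return "".join("R" if b in "AG" else "Y" if b in "CT" else "N" for b in seq.upper())


# Maximal runs of a single class (all purines or all pyrimidines).
HOMO_RUN = re.compile(r"R+|Y+")
# Maximal runs in which purines and pyrimidines strictly alternate.
ALTERNATING_RUN = re.compile(r"(?:RY)+R?|(?:YR)+Y?")


def fraction_in_tracts(seq: str, pattern: re.Pattern, min_len: int = 10) -> float:
    """Fraction of the sequence covered by matching tracts of at least `min_len` bp."""
    ry = to_ry(seq)
    if not ry:
        return 0.0
    covered = sum(len(m.group()) for m in pattern.finditer(ry) if len(m.group()) >= min_len)
    return covered / len(ry)


# Toy sequence: a 10 bp purine run followed by 10 bp of alternating pyr/pur.
example = "AGAGGAAGGA" + "CACACACACA"
print("purine/pyrimidine stretches:", fraction_in_tracts(example, HOMO_RUN))
print("alternating pyr/pur tracts: ", fraction_in_tracts(example, ALTERNATING_RUN))
```

A base at the boundary between two tract types can be counted by both patterns (here the last purine of the run also starts the alternating tract), so the two fractions are not mutually exclusive.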
[Two plots: occurrence of purine stretches and pyr/pur tracts compared with a "random" model of DNA]
Fraction of purine and pyr/pur stretches of at least 10 bp in Sequenced Chromosomes From All 5 Kingdoms
| Organism | Kingdom | Size | Purine stretches | Pyr/Pur stretches |
|---|---|---|---|---|
| E. coli K-12 | Prokaryotae (Bacteria) | 4,639,221 bp (complete) | 1.1% | 1.4% |
| P. furiosus | Prokaryotae (Archaea) | 1,908,523 bp (complete) | 6.0% | 0.2% |
| L. major chromosome 1 | Protista (protozoa) | 268,984 bp (~40 Mbp total) | 6.0% | 4.6% |
| S. cerevisiae, all 16 chromosomes | Fungi (yeast) | 12,057,849 bp (complete) | 3.7% | 0.8% |
| A. thaliana chromosome 1 | Plantae (thale cress) | 28,920,698 bp (~100 Mbp total) | 4.6% | 0.9% |
| H. sapiens chromosome 1 | Animalia (humans) | 282,193,664 bp (~3000 Mbp total) | 5.3% | 0.8% |
| Expected values | - | n bp | 0.2% | 0.2% |
Link to a table comparing more than 700 chromosomes
Comparison of Fraction of purine and pyr/pur tracts in Prokaryotic Chromosomes (figure panels: Archaea, Proteobacteria, Firmicutes, and "Other" Bacterial Genomes)
REFERENCES
Photocopies of the following articles are provided in the course programme:
- David W. Ussery, Thomas S. Larsen, K. Trevor Wilkes, Carsten Friis, Peder Worning, Anders Krogh, and Søren Brunak, "Genome Organisation and Chromatin Structure in Escherichia coli", Biochimie, 83:201-212, (2001).
Link to web page with supplemental information about this article.
- Anders Gorm Pedersen, Lars Juhl Jensen, Hans-Henrik Stærfeldt, Søren Brunak, and David W. Ussery, "A DNA Structural Atlas for Escherichia coli", Journal of Molecular Biology, 299 (#4), 907-930, (2000).
Link to JMB online version of this article.
- Lars Juhl Jensen, Carsten Friis, and David W. Ussery, "Three Views of Microbial Genomes", Research in Microbiology, 150:773-777, (1999).
- Carsten Friis, Lars Juhl Jensen, and David W. Ussery, "Visualisation of Pathogenicity Regions in Bacteria", Genetica, 108:47-51, (2000).
Link to Yersinia pestis pPCP1 atlases.
Link to S. typhimurium DT104 atlases.
Link to E. coli pO157 atlases.
Articles referred to in the lecture, but not handed out in class:
An Overview of Where to Find More Information on Sequenced Genomes:
- David W. Ussery, "Genome Databases", The Encyclopedia of Genetics, in press, September, 2001.
Articles about A-DNA and Z-DNA in chromosomes:
- David W. Ussery, "DNA Structure: A-, B-, and Z-DNA Families", manuscript submitted to The Encyclopedia of Life Sciences (to be published in autumn 2001).
- David Ussery, Dikeos Mario Soumpasis, Søren Brunak, Hans-Henrik Stærfeldt, Peder Worning, and Anders Krogh, "Bias of Purine Stretches in Sequenced Genomes", Computers in Chemistry, in press, to be published in January, 2002.
Link to web page comparing fractions purine and pyr/pur tracts in more than 700 chromosomes
Cruciforms and Palindromes in Bacterial Chromosomes:
- Richard R. Sinden, David W. Ussery, Peder Worning, and William Rosche, "Genome Gymnastics and Spontaneous Mutagenesis: Intermolecular Leading Strand Misalignments Lead to Quasipalindrome Correction", submitted as a "MicroReview" to Molecular Microbiology.
Most Bacterial Genomes are Over-annotated (by as much as 50%!):
- Marie Skovgaard, Lars Juhl Jensen, Søren Brunak, David W. Ussery, and Anders Krogh, "On the Total Number of Genes and Their Length Distribution in Complete Microbial Genomes", Trends in Genetics, 17:425-428, (2001).
Link to web page with supplemental information about this article.
Some of the comparison of proteobacter genomes was included in the following manuscript:
- Lise Petersen, Stephen L.W. On, and David Ussery, "Visualisation and Significance of DNA Structural Motifs in the Campylobacter jejuni genome", manuscript submitted to Genome Letters (to be published in spring, 2002).
Other related articles:
- Richard R. Sinden, Christopher E. Pearson, Vladimir N. Potaman, and David W. Ussery, "DNA: Structure and Function", Advances in Genome Biology, 5A:1-141, (1998).
- Ussery, D.W., Higgins, C.F., and Bolshoy, A., "Environmental Influences on DNA Curvature", J. Biomolecular Structure & Dynamics, 16:811-823, (1999).
- David W. Ussery, "DNA Denaturation", The Encyclopedia of Genetics, in press, September, 2001.
- David W. Ussery, "Bioinformatics2000 Meeting Report", GenomeBiology, 1:(#3), pages 1-2, (2000). On-line version at http://www.genomebiology.com/2000/1/3/reports/4014/
Link to a list of recent papers and talks on DNA structures.
Books about DNA:
Watson, James D., "A PASSION FOR DNA: Genes, Genomes, and Society", (Oxford University Press, Oxford, 2000).
Sinden, Richard R., "DNA: STRUCTURE and FUNCTION", (Academic Press, New York, 1994).
Calladine, C.R., Drew, H.R., "Understanding DNA: The Molecule and How It Works", (2nd edition, Academic Press, San Diego, 1997).
A List of more than a thousand books about DNA
Last modified Monday, 27 November, 2001 by David Ussery