
There are principally three types of biological sequences (DNA, RNA, and protein), with the information flowing from DNA to RNA to protein, as outlined in the Central Dogma of molecular biology.
Once a new sequence has been determined, there are various ways of trying to find the function:
- Through a comparison of how well it matches another sequence of known function (alignment methods).
- By looking for characteristic patterns within the sequence (de novo methods).
- Prediction of the sequence structure (ab initio methods).
In this talk (as well as the one after the break), I will focus on only the latter two approaches - that is, looking for patterns in DNA sequences and predicting local DNA structures based on the DNA sequence. Both talks will be about information in the DNA sequences within the context of sequenced bacterial chromosomes. Note that this is only one tiny fraction of the much larger subject area of bioinformatics.
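As a rough illustration of what "looking for patterns in DNA sequences" can mean in practice, here is a minimal Python sketch that reports every position at which a short motif occurs in a sequence. The function name `find_motif` and the example motif `TATAAT` are my own choices for illustration, not something taken from a particular tool; real pattern-finding methods are considerably more sophisticated.

```python
import re


def find_motif(sequence, motif):
    """Return the 0-based start positions of every occurrence of motif.

    A zero-width lookahead is used so that overlapping occurrences
    are reported as well.
    """
    pattern = re.compile(f"(?={re.escape(motif)})")
    return [match.start() for match in pattern.finditer(sequence.upper())]


# Hypothetical toy sequence, made up for this example.
positions = find_motif("GGTATAATCCTATAAT", "TATAAT")
print(positions)
```

Exact string matching like this only finds perfect copies of a motif; it is meant as a starting point, not as a description of the methods covered in the talk.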
A Brief History of Genomics
What is "genomics"?
genome (ˈdʒiːnəʊm). Biol. Formerly also genom (-nɒm). [a. G. genom (H. Winkler Verbreitung u. Ursache d. Parthenogenesis (1920) iv. 165), irreg. f. gen gene1 + chromosom chromosome.]
A haploid set of chromosomes; the sum-total of the genes in such a set.
- 1930 Cytologia I. 14 Chromosomes from different sets (or genoms) of Triticum vulgare show affinity toward each other.
- 1930 [see allopolyploidy].
- 1932 Proc. 6th Int. Congr. Genetics I. 275 The inviability of deficient genomes in the haploid generation serves to some extent as an alternative distinction between mutation and deficiency.
- 1932 Proc. 6th Int. Congr. Genetics II. 5 There are two species having genoms resembling C. neglecta.
- 1952 C. P. Blacker Eugenics x. 243 The appearance of such terms as gene-complex and genome (denoting a set of chromosomes as a working unity) testify to the movement towards holism in genetics.
- 1965 A. M. Srb et al. Gen. Genetics (ed. 2) vii. 190 Among organisms with chromosomes, each species has a characteristic set of genes, or genome. In diploids a genome is found in each normal gamete. It consists of a full set of the different kinds of chromosomes.
- 1970 Sci. Amer. Oct. 19/1 The human genome..consists of perhaps as many as 10 million genes.
A Few Words on the speed of DNA sequencing
I know this is a bit of a digression from MICROBIAL genomes, but I want to try and add a bit of historical perspective. In 1977, Fred Sanger sequenced the first bacteriophage (phiX174, 5386 bp long), for which he later won the Nobel prize. Although his method was a dramatic improvement over the conventional methods, it was still very slow relative to the amount of information in a single human cell.
About a decade later, the human genome project was launched; this was an international effort, and the U.S. would pay about $200,000,000 per year for 15 years! Most of this investment was in technology to speed up sequencing, and that speed-up has in fact been realised. Within a few years, it is likely that it will be possible to read the entire DNA sequence of a human cell in a few hours.

Two versions of the human genome sequence: Celera and "Public". Agilent Technologies has announced that it is developing "nanopore technology", which could allow the entire human genome to be sequenced in a few hours!
A look at genome sequencing since 1994 (including bacteria, archaea, and eukaryotes):
Currently (4 September, 2002) NCBI lists 166 BACTERIAL genomes in its database.
| Year | # Genomes sequenced | Running total |
|---|---|---|
| 1994 | 0 | 0 |
| 1995 | 2 | 2 |
| 1996 | 2 | 4 |
| 1997 | 5 | 9 |
| 1998 | 8 | 17 |
| 1999 | 13 | 30 |
| 2000 | 23 | 53 |
| 2001 | 42 | 95 |
| 2002 | >100 | >200 |
Note that I've only listed the PUBLICLY AVAILABLE genomes. There are probably more than a THOUSAND bacterial genomes which have been sequenced by various companies and which will never make it into the public domain.
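The "running total" column is simply the cumulative sum of the yearly counts. A minimal Python sketch of that arithmetic follows, using the 1994 to 2001 figures from the table (2002 is left out because the table only gives it as ">100"); the variable names are my own.

```python
from itertools import accumulate

# Yearly counts of newly sequenced genomes, copied from the table (1994-2001).
years = [1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001]
sequenced_per_year = [0, 2, 2, 5, 8, 13, 23, 42]

# accumulate() yields the running (cumulative) sum of the yearly counts.
running_total = list(accumulate(sequenced_per_year))

for year, new, total in zip(years, sequenced_per_year, running_total):
    print(f"{year}: {new:3d} new, {total:3d} in total")
```

The number of new genomes roughly doubles each year from 1998 onwards, which is what makes the running total grow so quickly.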
What is missing?
There are (at least) TWO things missing: genomes from ecologically abundant and diverse niches, and larger genomes.
On the size variation within bacterial genomes
Genome size varies considerably among bacteria, and even among strains of a single species. For example, consider the four Escherichia coli genomes which have been sequenced (so far!), as shown in the table below.
| Organism | Database links | %AT | Size (bp) | Atlas | Number of genes | Coding density | Reference |
|---|---|---|---|---|---|---|---|
| Escherichia coli, strain K-12, isolate W3110 | DDBJ, NCBI tax | 49 | 4,636,552 | Genome Atlas | 4085 | 79%, 1135 bp/gene | - |
| Escherichia coli, strain K-12, isolate MG1655 | U. Wisconsin, TIGR cmr, NCBI tax, NCBI entrez | 49 | 4,639,221 | Genome Atlas | 4397 | 87%, 1055 bp/gene | Science 277:1453-1474, September 1997 [PubMed] |
| Escherichia coli, strain O157:H7 (substrain EDL933) | U. Wisconsin, NCBI tax, NCBI entrez | 49 | 5,529,376 | Genome Atlas | 5283 | 86%, 1047 bp/gene | Nature 409:529-533, January 2001 [PubMed] |
| Escherichia coli, strain O157:H7 (substrain RIMD 0509952) | DDBJ, NCBI tax, NCBI entrez | 49 | 5,498,450 | Genome Atlas | 5361 | 88%, 1026 bp/gene | DNA Res. 8:11-22, February 2001 [PubMed] |
Link to a table with more E. coli Atlases.
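The "bp/gene" figures in the table appear to be the genome size divided by the number of annotated genes. The short Python sketch below reproduces them under that assumption; the dictionary layout and the function name `bp_per_gene` are mine, and the sizes and gene counts are copied from the table.

```python
# (genome size in bp, number of annotated genes), copied from the table.
ecoli_genomes = {
    "K-12 W3110": (4_636_552, 4085),
    "K-12 MG1655": (4_639_221, 4397),
    "O157:H7 EDL933": (5_529_376, 5283),
    "O157:H7 RIMD 0509952": (5_498_450, 5361),
}


def bp_per_gene(size_bp, n_genes):
    """Average number of base pairs of genome per annotated gene."""
    return size_bp / n_genes


for strain, (size_bp, n_genes) in ecoli_genomes.items():
    print(f"{strain}: {bp_per_gene(size_bp, n_genes):.0f} bp/gene")

# Size difference between the largest and smallest genome in the table.
sizes = [size for size, _ in ecoli_genomes.values()]
print(f"Largest minus smallest: {max(sizes) - min(sizes):,} bp")
```

Note that the two O157:H7 genomes are nearly 900,000 bp larger than the K-12 genomes, although all four belong to the same species.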
Too Much Information!
Currently several hundred prokaryotic genomes have been sequenced, and nearly 100 genomes are publicly available for analysis. The flow of information is essentially the same as above, that is:
Genome -> Transcriptome -> Proteome
Link to a list of sequenced bacterial genomes
The DNA sequence contains several different types of information:
- The DNA sequence can code for an amino acid sequence for proteins
  - Directly - e.g., it is "easy" to predict protein sequence from DNA sequence.
  - Indirectly - Scrambled genes in protozoa (changes at the DNA level)
  - Indirectly - RNA editing
  - Indirectly - RNA splicing
  - Indirectly - Protein splicing (e.g., Inteins)
- The DNA sequence can code for an RNA sequence
  - tRNA
  - rRNA
  - snRNA
  - telomerase RNA
  - other RNAs
- The DNA sequence can code for protein binding sites
- The DNA can code for architectural information
  - intrinsic DNA curvature
  - nucleosome positioning
- The DNA can code for structural / stability information
  - transcription initiation
  - origins of replication
  - mutational "hot spots"
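To show why the "direct" case is considered easy, here is a minimal Python sketch that translates a DNA coding sequence into a protein sequence with the standard genetic code. It assumes the sequence is already in the correct reading frame and on the coding strand; the names `CODON_TABLE` and `translate` and the toy input are my own, and the sketch ignores everything in the "indirect" cases listed above.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # codons starting with T
    "LLLLPPPPHHQQRRRR"  # codons starting with C
    "IIIMTTTTNNKKSSRR"  # codons starting with A
    "VVVVAAAADDEEGGGG"  # codons starting with G
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding sequence, codon by codon, up to the first stop codon.

    Codons containing letters other than A, C, G, T are reported as "X".
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical toy sequence: ATG GCC AAA TAA
print(translate("ATGGCCAAATAA"))
```

The hard part in a real genome is not this lookup but deciding where the genes start and stop and which strand and reading frame they are in.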
REFERENCES
Photocopies of the following article will be provided for this lecture:
- David W. Ussery, "Genome Databases", The Encyclopedia of Genetics (Academic Press, New York, 2001), pages 517-521. PDF file
Table with links to Genome Databases
Last modified Wednesday, 4 September, 2002 by David Ussery