David Ussery
Wednesday, 4 September, 2002
Comparative Microbial Genomics




Comparative Microbial Genomics:
A Bioinformatics Approach




Part 1   Introduction



"Those who wish to succeed must ask the right preliminary questions." - Aristotle







There are principally three types of biological sequences, with the information flowing as outlined in the Central Dogma of molecular biology:



DNA -> RNA -> Protein

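The flow DNA -> RNA -> Protein can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the input is taken to be the coding strand (so transcription is just T -> U), translation starts at the first base and stops at the first stop codon, and the function names and the toy sequence are made up for this example. The codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, written in the conventional TCAG ordering:
# the first base varies slowest, the third base fastest.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T -> U)."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")  # table is keyed on DNA codons
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCTAAATGA"  # toy sequence, not taken from any real genome
mrna = transcribe(dna)
protein = translate(mrna)
print(dna, "->", mrna, "->", protein)
# prints: ATGGCTAAATGA -> AUGGCUAAAUGA -> MAK
```

Real genes are of course messier than this (either strand, unknown reading frame, alternative start codons), which is part of why finding function from sequence is not trivial.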


Once a new sequence has been determined, there are various ways of trying to find the function:

  • Through a comparison of how well it matches another sequence of known function (alignment methods).
  • By looking for characteristic patterns within the sequence (de novo methods).
  • Prediction of the sequence structure (ab initio methods).



In this talk (as well as the one after the break), I will focus on only the latter two approaches - that is, looking for patterns in DNA sequences and predicting local DNA structures based on the DNA sequence. Both talks will be about information in the DNA sequences within the context of sequenced bacterial chromosomes. Note that this is only one tiny fraction of the much larger subject area of bioinformatics.







A Brief History of Genomics



What is "genomics"?

genome (dʒiːnəʊm). Biol. Formerly also genom (-nom). [a. G. genom (H. Winkler Verbreitung u. Ursache d. Parthenogenesis (1920) iv. 165), irreg. f. gen gene + chromosom chromosome.] A haploid set of chromosomes; the sum-total of the genes in such a set.
    The Oxford English Dictionary, 2d edition
1930 Cytologia I. 14 Chromosomes from different sets (or genoms) of Triticum vulgare show affinity toward each other.
1930 [see allopolyploidy]. 
1932 Proc. 6th Int. Congr. Genetics I. 275 The inviability of deficient genomes in the haploid generation serves to some extent as an alternative distinction between mutation and deficiency. 
1932 Proc. 6th Int. Congr. Genetics II. 5 There are two species having genoms resembling C. neglecta
1952 C. P. Blacker Eugenics x. 243 The appearance of such terms as gene-complex and genome (denoting a set of chromosomes as a working unity) testify to the movement towards holism in genetics. 
1965 A. M. Srb et al. Gen. Genetics (ed. 2) vii. 190 Among organisms with chromosomes, each species has a characteristic set of genes, or genome. In diploids a genome is found in each normal gamete. It consists of a full set of the different kinds of chromosomes. 
1970 Sci. Amer. Oct. 19/1 The human genome..consists of perhaps as many as 10 million genes.





A Few Words on the speed of DNA sequencing


I know this is a bit of a digression from MICROBIAL genomes, but I want to add a bit of historical perspective. In 1977, Fred Sanger's group sequenced the genome of a bacteriophage (phiX174, 5386 bp long), the first DNA genome to be sequenced; Sanger later won the Nobel prize for his sequencing method. Although this was a dramatic improvement over the conventional methods, it was still very slow compared to the amount of information in a single human cell.

About a decade later, the human genome project was launched; this was an international effort, and the U.S. would pay about $200,000,000 per year for 20 years!  Most of this investment went into technology to speed up sequencing, and that speed-up has in fact been realised.  Within a few years, it will likely be possible to read the entire DNA sequence of a human cell in a few hours.




A Timeline of The Human Genome Sequencing Project
YEAR   # human genes mapped to a       # years it would take to
       definite chromosome location    sequence the human genome
1967   none                            sequencing not possible yet
1977   3 genes mapped                  4,000,000 years to finish at 1977 rate
1987   12 genes mapped                 1000 years to finish at 1987 rate
1997   30,000 genes mapped             50 years to finish at 1997 rate
2001   ~45,000 genes mapped            Finished. (kind of) Two versions: Celera and "Public".

Agilent Technologies announces that they are developing "nanopore technology", which could allow the entire human genome to be sequenced in a few hours!




A look at genome sequencing since 1994 (including bacteria, archaea, and eukaryotes):

YEAR   # GENOMES Sequenced   Running Total
1994   0                     0
1995   2                     2
1996   2                     4
1997   5                     9
1998   8                     17
1999   13                    30
2000   23                    53
2001   42                    95
2002   >100                  >200

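The running total in the table above is simply the cumulative sum of the yearly counts. As a small sketch (the variable names are my own, and 2002 is left out because the table only gives a lower bound for that year):

```python
from itertools import accumulate

# Genomes sequenced per year, copied from the table above (1994-2001).
genomes_per_year = {
    1994: 0, 1995: 2, 1996: 2, 1997: 5,
    1998: 8, 1999: 13, 2000: 23, 2001: 42,
}

# accumulate() yields the cumulative sums in the same order as the input.
running_totals = dict(zip(genomes_per_year, accumulate(genomes_per_year.values())))

for year, total in running_totals.items():
    print(year, genomes_per_year[year], total)
# last line printed: 2001 42 95
```

Note how the yearly count roughly doubles each year from 1997 onwards, which is what makes the running total climb so quickly.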
Currently (4 September, 2002) NCBI lists 166 BACTERIAL genomes in its database.

Note that I've only listed the PUBLICLY AVAILABLE genomes. There are probably more than a THOUSAND bacterial genomes that have been sequenced by various companies and that will never make it into the public domain.








What is missing?

There are (at least) TWO things missing: genomes from ecologically abundant and diverse niches, and larger genomes.


[Figure: Phylogenetic Tree]









On the size variation within bacterial genomes



Genome size varies considerably among bacteria, even between strains of the same species. For example, consider the four Escherichia coli genomes which have been sequenced (so far!), as shown in the table below.

Organism and strain                                %AT  Size (bp)  Number    Coding density     Reference                                Database links
                                                                   of genes
Escherichia coli K-12, isolate W3110               49   4,636,552  4085      79%, 1135 bp/gene  -                                        DDBJ, NCBI tax
Escherichia coli K-12, isolate MG1655              49   4,639,221  4397      87%, 1055 bp/gene  Science 277:1453-1474 (September, 1997)  U. Wisconsin, TIGR cmr, NCBI tax, NCBI entrez
Escherichia coli O157:H7, substrain EDL933         49   5,529,376  5283      86%, 1047 bp/gene  Nature 409:529-533 (January, 2001)       U. Wisconsin, NCBI tax, NCBI entrez
Escherichia coli O157:H7, substrain RIMD 0509952   49   5,498,450  5361      88%, 1026 bp/gene  DNA Res. 8:11-22 (February, 2001)        DDBJ, NCBI tax, NCBI entrez
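The bp/gene figures in the table above are just the genome size divided by the number of genes, rounded to the nearest base pair. A short sketch that recomputes them from the sizes and gene counts in the table (the strain labels are abbreviated and the function name is my own):

```python
# (strain label, genome size in bp, number of genes), copied from the table.
ecoli_genomes = [
    ("K-12 W3110", 4_636_552, 4085),
    ("K-12 MG1655", 4_639_221, 4397),
    ("O157:H7 EDL933", 5_529_376, 5283),
    ("O157:H7 RIMD 0509952", 5_498_450, 5361),
]


def bp_per_gene(size_bp: int, n_genes: int) -> int:
    """Average number of base pairs per annotated gene, rounded."""
    return round(size_bp / n_genes)


for strain, size_bp, n_genes in ecoli_genomes:
    print(f"{strain:22s} {bp_per_gene(size_bp, n_genes)} bp/gene")

# Spread in genome size between the largest and smallest of the four.
sizes = [size_bp for _, size_bp, _ in ecoli_genomes]
print("size difference:", max(sizes) - min(sizes), "bp")
# prints: size difference: 892824 bp
```

So two strains of the same species can differ by nearly 900,000 bp, roughly a fifth of the smaller genome, while the average spacing of genes stays close to 1 gene per 1000 bp.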


Link to a table with more E. coli Atlases.









Too Much Information!

Currently several hundred prokaryotic genomes have been sequenced, and nearly 100 genomes are publicly available for analysis. The flow of information is essentially the same as above, that is:




Genome -> Transcriptome -> Proteome




Link to a list of sequenced bacterial genomes


The DNA sequence contains several different types of information:

  1. The DNA sequence can code for an amino acid sequence for proteins
    • Directly - e.g., it is "easy" to predict protein sequence from DNA sequence.
    • Indirectly - Scrambled genes in protozoa (changes at the DNA level)
    • Indirectly - RNA editing
    • Indirectly - RNA splicing
    • Indirectly - Protein splicing (e.g., Inteins)


  2. The DNA sequence can code for an RNA sequence
    • tRNA
    • rRNA
    • snRNA
    • telomerase RNA
    • other RNAs


  3. The DNA sequence can code for protein binding sites


  4. The DNA can code for architectural information
    • intrinsic DNA curvature
    • nucleosome positioning


  5. The DNA can code for structural / stability information
    • transcription initiation
    • origins of replication
    • mutational "hot spots"


REFERENCES



Photocopies of the following article will be provided for this lecture:

  1. David W. Ussery,
    "Genome Databases",
    The Encyclopedia of Genetics (Academic Press, New York, 2001), pages 517-521. (PDF file)
    Table with links to Genome Databases





    Last modified Wednesday, 4 September, 2002 by David Ussery