30431 Introduktion til Bioinformatik
 
Introduction to Bioinformatics
DNA Structures and Information
David Ussery
Tuesday, 8 September, 1998
 
 
 

I. Schrödinger and Morse Code

    In 1943, Erwin Schrödinger gave a famous series of lectures at Trinity College in Dublin, Ireland, where he speculated about the physics of biology.  He proposed two main ideas, which will be discussed briefly.

  • order from order
  • order from disorder

    In the former, Schrödinger postulated that the genetic material (then unknown, but thought to be protein) might be an "aperiodic solid" which contains coded information, perhaps somewhat like Morse code.  This idea is actually the basis for the "Central Dogma" of molecular biology.  Here a reductionist view has been quite successful in explaining much of biology in terms of genes.
     

     

     

    The flow of Genetic Information:
     

    DNA -> RNA -> protein
     
     
     
    This is known as:
    The Central Dogma of Molecular Biology
     
     
    Shown below is an illustration of the transcription of DNA into RNA and the translation of RNA into protein, which together form the backbone of molecular biology.

    [Figure: Central Dogma of Molecular Biology]

    Or in the words of Francis Crick:
    Once information has passed into protein, it cannot get out again.
     

    However, the "Central Dogma" has had to be revised a bit.  It turns out that one CAN go back from RNA to DNA, and that RNA can also make copies of itself.  It is still not possible to go from Proteins back to RNA or DNA, and no known mechanism has yet been demonstrated for proteins making copies of themselves.
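    The flow of information described above can be sketched in a few lines of Python. This is only a minimal illustration, not a real bioinformatics tool: it assumes the DNA string given is the coding strand (so transcription is just T -> U), it uses the standard genetic code, and the function names (transcribe, reverse_transcribe, translate) are made up for this example. Note that there is deliberately no function going from protein back to RNA or DNA.

```python
from itertools import product

# Standard genetic code, written compactly: codons are enumerated in the
# order T, C, A, G at each position, and "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """DNA -> RNA (assumes the coding strand is given, so only T -> U changes)."""
    return dna.upper().replace("T", "U")


def reverse_transcribe(rna: str) -> str:
    """RNA -> DNA, the step added in the revised Central Dogma."""
    return rna.upper().replace("U", "T")


def translate(rna: str) -> str:
    """RNA -> protein, reading codons from position 0 until a stop codon."""
    dna = reverse_transcribe(rna)  # the lookup table is keyed by DNA codons
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


gene = "ATGGCCAAGTAA"          # a made-up 12 bp "gene"
mrna = transcribe(gene)
print(mrna)                    # AUGGCCAAGUAA
print(translate(mrna))         # MAK
print(reverse_transcribe(mrna) == gene)
```

    The made-up gene encodes Met-Ala-Lys followed by a stop codon, so the one-letter protein sequence is MAK.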
     

    New (revised a bit) Central Dogma
     
     
     
     
    Back to Schrödinger and Morse Code

        There are two aspects in which Schrödinger's "order from disorder" also plays an important role in biology:

  • "negative entropy" - where an organism uses the energy obtained from burning food to offset the cost of storing information in the form of DNA, RNA, and protein sequences.

  • self-organisation - the process by which complex systems can spontaneously appear.  It is based on non-linear systems, far from equilibrium, and is hence difficult (if not impossible) to predict, although some aspects can be modelled.


    REFERENCES:

    "What is Life?" by Erwin Schrödinger (Cambridge University Press, 1944).

    "What is Life? The Next Fifty Years - Speculations on the Future of Biology", edited by Michael P. Murphy and Luke A.J. O'Neill (Cambridge University Press, 1995).

    "At Home in the Universe - The Search for the Laws of Self-organization and Complexity" by Stuart Kauffman (Oxford University Press, 1995).
     

     

     
     
      II. Biological Sequences as Information
    DNA sequence as information

    The DNA sequence contains several different types of information:

    1. The DNA sequence can code for an amino acid sequence for proteins

    2. The DNA sequence can code for an RNA sequence
  • tRNA
  • rRNA
  • snRNA
  • telomerase RNA
  • other RNAs

    3. The DNA sequence can code for protein binding sites

    4. The DNA can code for architectural information
  • intrinsic DNA curvature
  • nucleosome positioning

    5. The DNA can code for structural / stability information
  • transcription initiation
  • origins of replication
  • mutational "hot spots"
     

     
     
    RNA sequence as information

    The RNA sequence also contains several different types of information:

    1. The mRNAs can contain several different levels of information:
  • specifies amino acid sequence for proteins
  • localisation signals for WHERE the protein will be made
  • stability signals to determine HOW MUCH protein is made
  • splice sites
  • editing sites

    2. The tRNAs embody the genetic code - nearly the same in all living organisms (n.b. slightly different in mitochondria)

    3. The rRNAs code for the structures of ribosomes

    4. Other RNA/protein complexes have important biological functions
  • RNA template for the telomerase enzyme - necessary for maintaining chromosome ends, and implicated in cancer
  • snRNAs necessary for mRNA splicing
  • snoRNAs are small nucleolar RNAs.

     
     
    Protein sequence as information

    The PROTEIN sequence contains several different types of information:

    1. The protein sequence can code for an "active site" for enzymes

    2. The protein sequence can code for structural roles:
  • microtubules
  • myosin
  • collagen
  • etc.

    3. The protein sequence can code for ion channels/pumps

    4. The protein sequence can code for localisation information
  • within the cell
  • extra-cytoplasmic

    5. The protein sequence can code for modification sites
     

     
    III. A Few Words on the speed of DNA sequencing

     
    In 1977, Fred Sanger sequenced the first bacteriophage genome (φX174, about 5,400 bp long); he later won the Nobel prize for his sequencing method.  Although this was a dramatic improvement over earlier methods, it was still very slow compared to the amount of information in a single human cell.

    About a decade later, the Human Genome Project was launched; this was an international effort, and the U.S. would pay about $200,000,000 per year for 15 years!  Most of this investment was in technology to speed up sequencing, which has in fact been realised.  Within a year, it is likely that it will be possible to read the entire DNA sequence of a human cell in a few hours.
     
     

     
     
     
    A Timeline of The Human Genome Sequencing Project

    YEAR    # human genes mapped to a        # years it would take to
            definite chromosome location     sequence the human genome
    1967    none                             sequencing not possible yet
    1977    3 genes mapped                   4,000,000 years to finish at 1977 rate
    1987    12 genes mapped                  1000 years to finish at 1987 rate
    1997    30,000 genes mapped              50 years to finish at present rate
     
     NOTE: The genome project is actually ahead of schedule, and it is very likely that the first complete sequence of a human genome will be finished within 1 or 2 years from now (probably during the year 1999 or 2000).
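
     The "years to finish" column in the timeline can be turned around to give a rough sequencing rate for each year. The sketch below is only back-of-the-envelope arithmetic: it assumes a genome of 3,000,000,000 bp (the figure used elsewhere in these notes) and that each estimate means "whole genome divided by that year's rate". The names used are just for this example.

```python
GENOME_SIZE_BP = 3_000_000_000  # approximate human genome size used in these notes

# "Years to finish" estimates copied from the timeline table.
years_to_finish = {1977: 4_000_000, 1987: 1_000, 1997: 50}


def implied_rate(genome_bp: int, years: float) -> float:
    """Base pairs per year implied by a 'years to finish' estimate."""
    return genome_bp / years


rates = {year: implied_rate(GENOME_SIZE_BP, years)
         for year, years in years_to_finish.items()}

for year, rate in rates.items():
    print(f"{year}: about {rate:,.0f} bp per year")

# How much faster is the 1997 rate than the 1977 rate?
speedup = rates[1997] / rates[1977]
print(f"1997 vs 1977: {speedup:,.0f}-fold faster")
```

     Under these assumptions the implied rates are 750 bp per year in 1977, 3,000,000 bp per year in 1987 and 60,000,000 bp per year in 1997, an 80,000-fold increase over twenty years.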
     

     

     

     The human genome project has also had a major influence on the rest of biology, as other organisms are being sequenced as intermediate goals on the way to the ambitious final one: the 3,000,000,000 bp (or so) nucleotide sequence of the human genome.  In particular, the sequencing of complete bacterial genomes is revolutionizing the field of microbiology.   Presently, bacterial genomes are being sequenced at a rate slightly faster than one new genome every month!  As technology improves, this rate will increase.  It is estimated that within the next two years, we will know the complete genomic sequence of most major pathogenic bacteria.
     
     

    Organisms sequenced

    Year    # genomes sequenced
    1994    0
    1995    2
    1996    4
    1997    7
    1998    17 (est.)
     
    Reference: Tang,C.M., Hood,D.W., Moxon,E.R., "Haemophilus influence: the impact of whole genome sequencing on microbiology", Trends in Genetics, 13:399-404, (1997).
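
    The "one new genome every month" figure can be checked directly against the small table above. This sketch assumes the table gives the number of genomes finished in each year (not a running total), which is consistent with the list of sequenced genomes later in these notes; the variable names are only illustrative.

```python
# Genomes sequenced per year, from the table above (1998 is an estimate).
genomes_per_year = {1994: 0, 1995: 2, 1996: 4, 1997: 7, 1998: 17}

# Average number of new genomes per month in each year.
per_month = {year: n / 12 for year, n in genomes_per_year.items()}

# Running total of complete genomes at the end of each year.
cumulative = {}
running_total = 0
for year in sorted(genomes_per_year):
    running_total += genomes_per_year[year]
    cumulative[year] = running_total

for year in sorted(genomes_per_year):
    print(f"{year}: {per_month[year]:.2f} genomes/month, "
          f"{cumulative[year]} in total")
```

    With these numbers the 1998 estimate works out to about 1.4 genomes per month, and the total reaches 30 complete genomes by the end of 1998.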
     
     

     
    IV. On the relative sizes of genomes and proteomes
     

    What is "genomics"?

     
    genome (ˈdʒiːnəʊm). Biol. Formerly also genom (-nom). [a. G. genom (H. Winkler, Verbreitung u. Ursache d. Parthenogenesis (1920) iv. 165), irreg. f. gen gene + chromosom chromosome.] A haploid set of chromosomes; the sum-total of the genes in such a set.
      The Oxford English Dictionary, 2d edition

    1930 Cytologia I. 14 Chromosomes from different sets (or genoms) of Triticum vulgare show affinity toward each other. 
    1930 [see allopolyploidy]. 
    1932 Proc. 6th Int. Congr. Genetics I. 275 The inviability of deficient genomes in the haploid generation serves to some extent as an alternative distinction between mutation and deficiency. 
    1932 Proc. 6th Int. Congr. Genetics II. 5 There are two species having genoms resembling C. neglecta.
    1952 C. P. Blacker Eugenics x. 243 The appearance of such terms as gene-complex and genome (denoting a set of chromosomes as a working unity) testify to the movement towards holism in genetics. 
    1965 A. M. Srb et al. Gen. Genetics (ed. 2) vii. 190 Among organisms with chromosomes, each species has a characteristic set of genes, or genome. In diploids a genome is found in each normal gamete. It consists of a full set of the different kinds of chromosomes. 
    1970 Sci. Amer. Oct. 19/1 The human genome..consists of perhaps as many as 10 million genes.
     
     
     What is "Proteomics"?
       
     

    [Figure 9-21 from Hartl & Jones, 1998]

    Bacteriophage λ has a genome of about 50,000 bp.  If you were to print the entire sequence out, with roughly 25,000 bp per page, it would take about 2 pages.  (The sequence would be in a very small font, and you could barely read it!)

     
    The common bacterium Escherichia coli is perhaps the best-studied organism in all of biology.  However, when the complete genomic sequence of E. coli was published about a year ago, many were surprised that only about a third of the proteins had been well-characterised.  Another third was perhaps known about, based on DNA sequence analysis, but the remaining third of potential proteins was not expected.

    The yeast Saccharomyces cerevisiae was the first (and so far the only) eukaryote to be sequenced.  The yeast genome would occupy a thin volume of about 500 pages, or roughly twice the thickness of the E. coli volume.  I should mention that recent genetic analysis of the complete yeast genome has found that it likely arose from a duplication event - that is, yeast came from a more primitive organism which contained only half the number of chromosomes.

    The first "animal" to be sequenced is likely to be the nematode C. elegans, whose genome is about 100,000,000 bp long. The plant Arabidopsis thaliana is also being sequenced, and its genome is about the same size.  Both of these genomes are likely to be completed within the next year or so.
     
     

    FINALLY, the human genome, by comparison, is quite large.  Using the same analogy as above, the human genome would fill 80 volumes!
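
    The printed-page analogy is simple division, and can be sketched as below. The 25,000 bp per page figure and the genome sizes come from these notes (E. coli and yeast from the table of completed genomes, human from the 3,000,000,000 bp estimate); the function name pages_needed is just made up for this example.

```python
BP_PER_PAGE = 25_000  # very small print, as assumed in the text

# Approximate genome sizes in base pairs, taken from these notes.
genome_sizes_bp = {
    "Bacteriophage lambda": 50_000,
    "Escherichia coli": 4_600_000,
    "Saccharomyces cerevisiae": 13_000_000,
    "Human": 3_000_000_000,
}


def pages_needed(genome_bp: int, bp_per_page: int = BP_PER_PAGE) -> int:
    """Number of printed pages, rounding any partial page up."""
    return -(-genome_bp // bp_per_page)  # integer ceiling division


pages = {name: pages_needed(size) for name, size in genome_sizes_bp.items()}
for name, n in pages.items():
    print(f"{name:26s} {n:>8,} pages")

# If the human genome fills 80 volumes, how thick is each volume?
pages_per_volume = pages["Human"] / 80
print(f"Human genome in 80 volumes: {pages_per_volume:,.0f} pages per volume")
```

    With these inputs, lambda needs 2 pages, E. coli 184, yeast 520, and the human genome 120,000 pages, or 1,500 pages in each of 80 volumes.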
     
     

     
     



     

     A List of Genomes that have been Completely Sequenced:
    (so far, as of 1 Sept., 1998) 
     
     

    #   Organism                              Type                           Size (Mbp)  Number of genes  Date sequenced / published
    1   Haemophilus influenzae                Bacteria (Gm-)                 1.83        1703             August, 1995
    2   Mycoplasma genitalium                 Bacteria (Gm-)                 0.58        470              October, 1995
    3   Saccharomyces cerevisiae              Eukaryotic ("baker's yeast")   13.00       5885             January, 1996
    4   Methanococcus jannaschii              Archaebacteria                 1.66        1738             August, 1996
    5   Synechocystis sp.                     Bacteria ("blue-green algae")  3.57        3168             September, 1996
    6   Mycoplasma pneumoniae                 Bacteria (Gm-)                 0.81        677              November, 1996
    7   Escherichia coli                      Bacteria (Gm-)                 4.60        4288             January, 1997
    8   Methanobacterium thermoautotrophicum  Archaebacteria                 1.75        1918             May, 1997
    9   Archaeoglobus fulgidus                Archaebacteria                 2.18        2436             June, 1997
    10  Helicobacter pylori                   Bacteria (Gm-)                 1.66        1590             June, 1997
    11  Borrelia burgdorferi                  Bacteria (Gm-)                 0.92        853              July, 1997
    12  Treponema pallidum                    Bacteria (Gm-)                 1.05        ~1000            October, 1997
    13  Bacillus subtilis                     Bacteria (Gm+)                 4.20        4100             November, 1997
    14  Pyrococcus horikoshii                 Archaebacteria                 1.74        ~1700            January, 1998
    15  Aquifex aeolicus                      Eubacteria                     1.55        1512             March, 1998
    16  Mycobacterium tuberculosis            Bacteria (Gm+)                 4.41        3924             June, 1998
    17  Treponema pallidum                    Bacteria (Gm-)                 1.14        1041             July, 1998
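
    One quick way to compare the entries in this table is to compute the gene density, i.e. the number of genes per Mbp of genome. The sketch below uses a handful of rows copied from the table; the choice of rows and the function name genes_per_mbp are only for illustration.

```python
# (organism, genome size in Mbp, number of genes), copied from the table above.
GENOMES = [
    ("Haemophilus influenzae", 1.83, 1703),
    ("Mycoplasma genitalium", 0.58, 470),
    ("Saccharomyces cerevisiae", 13.00, 5885),
    ("Escherichia coli", 4.60, 4288),
    ("Mycobacterium tuberculosis", 4.41, 3924),
]


def genes_per_mbp(size_mbp: float, n_genes: int) -> float:
    """Gene density: number of genes per million base pairs."""
    return n_genes / size_mbp


density = {name: genes_per_mbp(size, genes) for name, size, genes in GENOMES}

for name, d in density.items():
    print(f"{name:28s} {d:6.0f} genes per Mbp")
```

    For these rows the bacteria come out at roughly 800 to 950 genes per Mbp (close to one gene per 1,000 bp), while yeast has about half that density.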
     
     

     
    Sequences either complete but not yet published (as of 1 Sept. '98),
    or anticipated to be complete within 1998
     
    #   Organism                                              Type                       Size (Mbp)  Number of genes  Date sequenced / published
    1   Bacillus sp. C-125                                    Bacteria (Gm+)             4.2         ~4000            Complete, not yet published
    2   Pseudomonas aeruginosa                                Bacteria (Gm-)             5.9         ~5000            Complete, not yet published
    3   Pyrobaculum aerophilum                                Archaebacteria             2.3         ~2000            Complete, not yet published
    4   Pyrococcus abyssi                                     Archaebacteria             1.9         ~1900            Complete, not yet published
    5   Rickettsia prowazekii                                 Bacteria (Gm-)             1.1         ~1100            Complete, not yet published
    6   Ureaplasma urealyticum                                Bacteria (Gm-)             0.75        ~800             Complete, not yet published
    7   Deinococcus radiodurans                               Bacteria (Gm-)             3.0         ~3000            Publication anticipated in 1998
    8   Mycobacterium tuberculosis CSU#93 (clinical isolate)  Bacteria (Gm+)             4.4         ~4000            Publication anticipated in 1998
    9   Rhodobacter capsulatus                                Bacteria (photosynthetic)  3.7         ~3500            Publication anticipated in 1998
    10  Streptococcus pneumoniae                              Bacteria (Gm+)             2.2         ~2000            Publication anticipated in 1998
    11  Thermotoga maritima                                   Archaebacteria             1.8         ~1800            Publication anticipated in 1998
    12  Ureaplasma urealyticum                                Bacteria                   0.75        ~800             Publication anticipated in 1998
    13  Vibrio cholerae                                       Bacteria (Gm-)             2.5         ~2500            Publication anticipated in 1998
     
     
     

    A list of microbial genomes which are being sequenced and are presently searchable through TIGR:
     

    Deinococcus radiodurans
    Enterococcus faecalis
    Mycobacterium tuberculosis CSU#93
    Neisseria meningitidis MC58 
    Plasmodium falciparum 
    Streptococcus pneumoniae
    Thermotoga maritima 
    Treponema pallidum
     Vibrio cholerae 
     
     
    note: the organisms on the above chart were classified according to the "three-kingdom" type of scheme:
    Figure 1-16 from Hartl & Jones, 1998
     
     


     
    Information also came from the Magpie Genome Sequencing project list.
     

    There are presently more than 100 organisms (including humans) whose genomes are in the process of being sequenced.
     
     



     
     
     
     

    980830 du