
DNA Structures in Whole Genomes



DTU Ph.D. Course number 27803
Biological Sequence Analysis and Protein Modelling
Link to the main course web page
David Ussery
Friday, 30 April, 2004
Link to part 2: Escherichia coli Genomes
Link to Atlas Web pages






Part 1: An Introduction to Bioinformatics of Bacterial Genomes





Bacillus anthracis pX01 Cruciform atlas
DNA atlas for Bacillus anthracis plasmid pX01. This is Figure 5 from Trends in Genetics, 19:365-369 (2003).




Overview


  1. INTRODUCTION

         --Break--


  2. Bacterial Chromatin and Gene Expression in Escherichia coli











Part 1   Introduction to Bioinformatics





"We are drowning in a sea of data but starving for knowledge." - Sydney Brenner








What is Bioinformatics?



Please accept my apology for offering a definition of the subject of this course on the last day of class, but it will help set the background for the rest of my talks.

Bioinformatics is "the science of information and information flow in biological systems, esp. of the use of computational methods in genetics and genomics.", according to the Oxford English Dictionary.

In my opinion, bioinformatics is about the application of machine learning methods to help us understand biological information. There is simply far too much information for us to understand without the help of computers! Part (but not all) of this information is in the form of sequences (e.g., DNA, RNA, proteins). In addition to sequences, there are other sources of data for bioinformatics analysis, such as micro-array data (which you have spent the last two days working with), 2-D gel information, image analysis, etc. See the recent review article by Mark Gerstein for a more detailed analysis of the question "What is Bioinformatics", from a medical perspective.

Note that the word "bioinformatics" was first used back in 1978, by Paulien Hogeweg, at the University of Utrecht, in The Netherlands (published in Simulation, 31:90-91), and the subject has been popular in Europe throughout the 1980s and 1990s (CBS was formed as a bioinformatics group in 1993). However, the term has only come into common usage in the U.S. in the past few years - hence an attempt to "define the word" nearly 25 years after it was first used in the scientific literature.



There are principally three types of biological sequences, with the information flowing as outlined in the Central Dogma of molecular biology:



DNA -> RNA -> Protein




Once a new sequence has been determined, there are various ways of trying to find the function:
  • Through a comparison of how well it matches another sequence of known function (alignment methods).
  • By looking for characteristic patterns within the sequence (de novo methods).
  • Prediction of the sequence structure (ab initio methods).




In this talk (as well as the two "case stories" after the break), I will focus on only the latter two approaches - that is, looking for patterns in DNA sequences and predicting local DNA structures based on the DNA sequence. Both talks will be about information in the DNA sequences within the context of sequenced bacterial chromosomes. Note that this is only one tiny fraction of the much larger subject area of bioinformatics.








A Brief History of Genomics



What is "genomics"?

genome (ˈdʒiːnəʊm). Biol. Formerly also genom (-nom). [a. G. genom (H. Winkler Verbreitung u. Ursache d. Parthenogenesis (1920) iv. 165), irreg. f. gen gene + chromosom chromosome.] A haploid set of chromosomes; the sum-total of the genes in such a set.
    The Oxford English Dictionary, 2nd edition
1930 Cytologia I. 14 Chromosomes from different sets (or genoms) of Triticum vulgare show affinity toward each other.
1930 [see allopolyploidy]. 
1932 Proc. 6th Int. Congr. Genetics I. 275 The inviability of deficient genomes in the haploid generation serves to some extent as an alternative distinction between mutation and deficiency. 
1932 Proc. 6th Int. Congr. Genetics II. 5 There are two species having genoms resembling C. neglecta
1952 C. P. Blacker Eugenics x. 243 The appearance of such terms as gene-complex and genome (denoting a set of chromosomes as a working unity) testify to the movement towards holism in genetics. 
1965 A. M. Srb et al. Gen. Genetics (ed. 2) vii. 190 Among organisms with chromosomes, each species has a characteristic set of genes, or genome. In diploids a genome is found in each normal gamete. It consists of a full set of the different kinds of chromosomes. 
1970 Sci. Amer. Oct. 19/1 The human genome..consists of perhaps as many as 10 million genes.






A Few Words on the speed of DNA sequencing


I know this is a bit of a digression from MICROBIAL genomes, but I want to try and add a bit of historical perspective. In 1977, Fred Sanger sequenced the first bacteriophage (phiX174, 5386 bp long), for which he later won the Nobel prize.  Although this was a dramatic improvement over the conventional methods, this was still very slow, compared to the amount of information in a single human cell.


About a decade later, the human genome project was launched; this was an international effort, although it was initially funded mostly by the U.S. Department of Energy - which would pay about $200,000,000 per year for 20 years!  Most of this investment was in technology to speed sequencing, which in fact has been realised.  Within a few years, it is likely that it will be possible to read the entire DNA sequence of a human cell, in a few hours.






A Timeline of The Human Genome Sequencing Project
YEAR   # human genes mapped to a        # years it would take to
       definite chromosome location     sequence the human genome
1967   none                             sequencing not possible yet
1977   3 genes mapped                   4,000,000 years to finish at 1977 rate
1987   12 genes mapped                  1000 years to finish at 1987 rate
1997   30,000 genes mapped              50 years to finish at 1997 rate
2001   ~45,000 genes mapped             Finished. (kind of)
                                        Two versions: Celera and "Public".

Agilent Technologies announces that they are developing "nanopore technology", which could allow the entire human genome to be sequenced in a few hours!





A look at genome sequencing since 1994 (including bacteria, archaea, and eukaryotes):

YEAR      # GENOMES     Running
          Sequenced     Total
1994           0            0
1995           2            2
1996           2            4
1997           5            9
1998           8           17
1999          13           30
2000          23           53
2001          42           95
2002          91          186
2003         128          314
2004a         29          343

a Note: 2004 numbers are for January through April only.
Also note that not all of these genomes (especially eukaryotes) have been fully sequenced!
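
As a quick sanity check, the running totals in the table above can be reproduced from the yearly counts. This is only a small sketch; the list names are made up for illustration, and the numbers are copied from the table.

```python
from itertools import accumulate

# Yearly counts copied from the table above.
# "2004a" covers January through April 2004 only.
years = ["1994", "1995", "1996", "1997", "1998", "1999",
         "2000", "2001", "2002", "2003", "2004a"]
sequenced = [0, 2, 2, 5, 8, 13, 23, 42, 91, 128, 29]

# The running total is simply the cumulative sum of the yearly counts.
running_total = list(accumulate(sequenced))

for year, n, total in zip(years, sequenced, running_total):
    print(f"{year:<6}{n:>6}{total:>8}")
```

The cumulative sum ends at 343, in agreement with the last row of the table.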

Currently (24 April, 2004) NCBI lists 276 BACTERIAL genomes in its database.


Note that I've only listed the PUBLICLY AVAILABLE genomes. There are probably more than a THOUSAND bacterial genomes which have been sequenced by various companies which will never make it into the public domain.







What is missing?


There are (at least) TWO things missing: genomes from ecologically abundant and diverse niches, and larger genomes.



Phylogenetic Tree




On the relative sizes of genomes





Many biologists still tend to apply the "traditional molecular genetics" approach, used to study many E. coli genes, to eukaryotic genes and genomes; by this I mean having a look at the sequence and seeing if one can make sense of it. This simply will not work for whole genomes; for example, it would take more than a hundred years just to read the HAPLOID sequence of a single human cell.





Organism                    # bp                 Time*        # genes    # bp/gene
phi-X174                    5,386 bp             1.5 hours          9          598
Escherichia coli            4,639,221 bp         54 days        4,288        1,072
Saccharomyces cerevisiae    12,057,849 bp        140 days       6,269        1,923
Caenorhabditis elegans      ~97,000,000 bp       3.1 years     19,099        5,079
Arabidopsis thaliana        ~125,000,000 bp      4 years       25,498        4,902
Drosophila melanogaster     ~180,000,000 bp      5.7 years     13,600       13,235
humans                      ~3,400,000,000 bp    108 years    ~30,000      113,333
*TIME = the amount of time to read the entire genome, at a rate of 1 bp per second.
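
The "Time" and "# bp/gene" columns are simple arithmetic, and can be reproduced with a few lines of Python. This is a minimal sketch: the function names are hypothetical, a year is assumed to be 365.25 days, and the genome sizes and gene counts are taken from the table above.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY  # assumption: Julian year

# (genome size in bp, number of genes), copied from the table above
genomes = {
    "phi-X174": (5_386, 9),
    "Escherichia coli": (4_639_221, 4_288),
    "Saccharomyces cerevisiae": (12_057_849, 6_269),
    "Caenorhabditis elegans": (97_000_000, 19_099),
    "Arabidopsis thaliana": (125_000_000, 25_498),
    "Drosophila melanogaster": (180_000_000, 13_600),
    "humans": (3_400_000_000, 30_000),
}

def reading_time(bp, rate=1.0):
    """Time needed to read `bp` bases at `rate` bases per second."""
    seconds = bp / rate
    if seconds < SECONDS_PER_DAY:
        return f"{seconds / SECONDS_PER_HOUR:.1f} hours"
    if seconds < SECONDS_PER_YEAR:
        return f"{seconds / SECONDS_PER_DAY:.0f} days"
    return f"{seconds / SECONDS_PER_YEAR:.1f} years"

def bp_per_gene(bp, genes):
    """Average number of base pairs per annotated gene."""
    return bp / genes

for name, (bp, genes) in genomes.items():
    print(f"{name:<26}{reading_time(bp):>12}{bp_per_gene(bp, genes):>12.0f}")
```

Small differences from the table (for example in the E. coli ratio) come from rounding of the genome size before dividing.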









Figure 9-21 from Hartl & Jones, 1998

Bacteriophage λ (lambda) has a genome of about 50,000 bp.  If you were to print out the entire sequence, with roughly 25,000 bp per page, it would take about 2 pages.  (The sequence would be in a very small font, and you could barely read it!)
 
Figure 9-21

The common bacterium Escherichia coli is perhaps the best-studied organism in all of biology.  However, when the complete genomic sequence of E. coli was published in 1997, many were surprised that only about a third of the proteins had been well characterised.  Another third was perhaps known about, based on DNA sequence analysis, but the remaining third of potential proteins was not expected.


Figure 9-21

The yeast Saccharomyces cerevisiae was the first eukaryote to be sequenced.  The yeast genome would occupy a thin volume of about 500 pages, or roughly twice the thickness of the E. coli volume.  Genetic analysis of the complete yeast genome has found that it has likely arisen from a duplication event - that is, yeast came from a more primitive organism which contained only half the number of chromosomes.


Figure 9-21

The first "animal" to be sequenced was the nematode C. elegans, which has a genome of about 97,000,000 bp. The plant Arabidopsis thaliana has also been sequenced, and it is only slightly larger (125 million bp).  Although both of these genomes have been "sequenced", there are many large gaps remaining to be filled.




FINALLY, the human genome, by comparison, is quite large.  Using the same analogy as above, the human genome would fill 80 volumes!




Figure 9-21








Although the number of genomes being sequenced is increasing rapidly, one has to put this into perspective - the organisms can be placed into four different classes:


Organism group              Size (bp)                                No. sequenced
viruses                     ~300 bp to ~350,000 bp                   1279
Prokaryotes                 ~250,000 to ~15,000,000 bp               227 (public)
single-celled eukaryotes    ~12,000,000 to ~600,000,000,000 bp       43
multi-celled eukaryotes     ~20,000,000 to ~50,000,000,000 bp        16*


*Note that NONE of the multi-cellular eukaryotic chromosomes has yet been completely sequenced (i.e., as 1 contiguous piece of DNA, with no gaps).









Some philosophical thoughts about Information and the Size of Genomes.





There seems to be a general trend in terms of size of genomes: the simple bacteriophage viruses have the smallest genomes, then bacteria, then simple eukaryotes (yeast), then simple animals & plants, then humans.



Although many people often do not realise it, there is a tendency for us to view organisms from a "human-centered" perspective. One popular view which still lurks in our culture is Aristotle's "Ladder of Nature".



Aristotle's ladder


HOWEVER, in fact there is not such a nice correlation between an organism's complexity and the size of its genome. Based on analysis of estimates of genome sizes from more than 8000 different species, the following outline can be made:


DOGS figure



Here's an expanded version of the shorter table presented above:
 
Organism                                 # bp                 # genes      ratio
                                                                           # bp / # genes
phi-X174                                 5386                 9            598
HIV-1                                    10,000               10           1000
Haemophilus influenzae                   1,830,000            1703         1075
Escherichia coli                         4,600,000            4288         1072
Methanococcus jannaschii                 1,660,000            1738         955
Synechocystis sp.                        3,570,000            3168         1123
Amoeba dubia                             ~670,000,000,000     ~5000?       134,000,000
Amoeba proteus                           ~270,000,000,000     ~5000?       54,000,000
Saccharomyces cerevisiae                 ~13,000,000          5885         2209
Erysiphe cichoracearum (fungus)          ~1,500,000,000       ~10,000?     150,000
Coscinodiscus asteromphalus (diatom)     ~25,000,000,000      ~5000?       5,000,000
Caenorhabditis elegans                   ~100,000,000         ~14,000      7000
Parascaris equorum (worm)                2,500,000,000        ~15,000      166,700
Drosophila melanogaster                  ~170,000,000         ~12,000      14,000
Arabidopsis thaliana                     ~120,000,000         ~10,000      12,000
Lilium formosanum (lily)                 ~36,000,000,000      ~15,000?     2,400,000
Ophioglossum petiolatum (fern)           ~160,000,000,000     ~20,000?     8,000,000
Zea mays                                 ~5,000,000,000       ~20,000?     250,000
Allium cepa (onion)                      ~18,000,000,000      ~20,000?     900,000
Amphiuma means (newt)                    ~84,000,000,000      ~40,000?     2,100,000
Protopterus aethiopicus (lungfish)       ~140,000,000,000     ~40,000?     3,500,000
humans                                   ~3,400,000,000       ~80,000      42,500










Too Much Information!



Currently several hundred prokaryotic genomes have been sequenced, and more than 100 genomes are publicly available for analysis. The flow of information is essentially the same as above, that is:




Genome -> Transcriptome -> Proteome





Link to a list of sequenced bacterial genomes

The PROBLEMS







The problem, in a nutshell, is simply TOO MUCH INFORMATION. For example, we have access to more information today in 24 hours than someone from the 16th century had in their ENTIRE LIFETIME!

GenBank is growing faster than the speed and memory size of computers. This means that it will continue to take longer and longer to search through databases, even if one were to purchase the most recent and fastest computers available.


ANOTHER (related) problem is WHY DO ORGANISMS HAVE SO MUCH DNA? One possible explanation is that DNA is playing a structural role, in addition to coding for information.









Biological Information and DNA sequences



Schrödinger and Morse Code


    In 1943, Erwin Schrödinger gave a famous series of lectures at Trinity College in Dublin, Ireland, where he speculated about the physics of biology.  He proposed two main ideas, which will be discussed briefly.
  • order from order
  • order from disorder

    In the former, Schrödinger postulated that perhaps the genetic material (then unknown, but thought to be protein) might be an "aperiodic solid", which contains coded information - perhaps somewhat like Morse code.  This idea is actually the basis for the "Central Dogma" of molecular biology.  In this situation, a reductionistic view has been quite successful in understanding much of biology in terms of genes.
     
     
     
    The flow of Genetic Information:
     
    DNA -> RNA -> protein
     
     
     
    This is known as:
    The Central Dogma of Molecular Biology
     
     
    Shown below is an illustration of the flow of information from DNA to RNA to protein, which forms the backbone of molecular biology.
    Central Dogma of Molecular Biology

    LEGEND
    • DNA codes for the production of RNA.
    • RNA codes for the production of protein.
    • Protein does not code for the production of protein, RNA or DNA.
    • The end.
    Or in the words of Francis Crick:
    Once information has passed into protein, it cannot get out again.
     


    However, the "Central Dogma" has had to be revised a bit.  It turns out that one CAN go back from RNA to DNA, and that RNA can also make copies of itself.  It is still not possible to go from Proteins back to RNA or DNA, and no known mechanism has yet been demonstrated for proteins making copies of themselves.
     
    New (revised a bit) Central Dogma
     
     
     
     
    Back to Schrödinger and Morse Code
        There are two aspects in which Schrödinger's "order from disorder" also plays an important role in biology:
  • "negative entropy" - where an organism uses the energy obtained from burning food to offset the cost of storing information in the form of DNA, RNA, and protein sequences.

  • self-organisation - this is the process by which complex systems can spontaneously appear.  It is based on non-linear systems, far from equilibrium, and hence difficult (if not impossible) to predict, although some aspects can be modelled.


    So what does all this have to do with DNA sequences? First, information is dependent on CONTEXT (see the E. coli lecture for more on genomic context). Thus, the DNA can use this information to "code for" cellular functions, such as a telomere or centromere or origin of replication.

    Second, if DNA is viewed as a computer programme, then one is left with the question of WHAT (or WHO) wrote the programme. This misconception is essentially at the heart of the "Intelligent Design" movement in the U.S.






    The DNA sequence contains several different types of information:


    1. The DNA sequence can code for an amino acid sequence for proteins
      • Directly - e.g., it is "easy" to predict protein sequence from DNA sequence.
      • Indirectly - Scrambled genes in protozoa (changes at the DNA level)
      • Indirectly - RNA editing
      • Indirectly - RNA splicing
      • Indirectly - Protein splicing (e.g., Inteins)


    2. The DNA sequence can code for an RNA sequence
      • tRNA
      • rRNA
      • snRNA
      • telomeraseRNA
      • other RNAs


    3. The DNA sequence can code for protein binding sites


    4. The DNA can code for architectural information
      • intrinsic DNA curvature
      • nucleosome positioning


    5. The DNA can code for structural / stability information
      • transcription initiation
      • origins of replication
      • mutational "hot spots"
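
    The "direct" case in item 1 above (predicting a protein sequence from a DNA sequence) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard genetic code, reads the given strand in the first reading frame, and stops at the first stop codon. The names `CODON_TABLE` and `translate` are made up for this example; none of the indirect mechanisms (scrambled genes, RNA editing, splicing, inteins) are handled.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"   # TTT ... TGG
               "LLLLPPPPHHQQRRRR"   # CTT ... CGG
               "IIIMTTTTNNKKSSRR"   # ATT ... AGG
               "VVVVAAAADDEEGGGG")  # GTT ... GGG

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a DNA string, codon by codon, up to the first stop codon.

    Codons containing anything other than A, C, G or T are shown as "X".
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("ATGTTTGGTTGA"))
```

    Note that the overlapping genes of phi-X174 show why even the "easy" case needs care: the same stretch of DNA can be read in more than one frame.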










    Part 1   Introduction to DNA Atlases






    How can we deal with such an overflow of information? The average person living in the 16th century had access to less information in their lifetime than we have today instantaneously, through the Internet.




    One way of dealing with the problem of how to display so much sequence information is to have a look at the whole chromosome at once, smoothing over a large window. The entire bacterial chromosome is displayed as a circle, with different colours representing various parameters. First, as an introduction to atlases, we will look at base composition. Then we will have a look at levels of expression of mRNA and proteins throughout the chromosome. As an example, I will use the first "organism" to be sequenced, Escherichia coli bacteriophage phi-X174. This was sequenced about 30 years ago; Fred Sanger got the Nobel prize for sequencing this (actually, for developing a faster way of sequencing DNA). It took him a couple of years to sequence this - to put it into perspective, at the same rate, it would take more than a MILLION years to sequence the human genome!





    >NC_001422
    GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAA
    AAATTATCTTGATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGAC
    TGCTGGCGGAAAATGAGAAAATTCGACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTT
    GCGACCTTTCGCCATCAACTAACGATTCTGTCAAAAACTGACGCGTTGGATGAGGAGAAG
    TGGCTTAATATGCTTGGCACGTTCGTCAAGGACTGGTTTAGATATGAGTCACATTTTGTT
    CATGGTAGAGATTCTCTTGTTGACATTTTAAAAGAGCGTGGATTACTATCTGAGTCCGAT
    GCTGTTCAACCACTAATAGGTAAGAAATCATGAGTCAAGTTACTGAACAATCCGTACGTT
    TCCAGACCGCTTTGGCCTCTATTAAGCTCATTCAGGCTTCTGCCGTTTTGGATTTAACCG
    AAGATGATTTCGATTTTCTGACGAGTAACAAAGTTTGGATTGCTACTGACCGCTCTCGTG
    CTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTGGGATACCCTCGCT
    TTCCTGCTCCTGTTGAGTTTATTGCTGCCGTCATTGCTTATTATGTTCATCCCGTCAACA
    TTCAAACGGCCTGTCTCATCATGGAAGGCGCTGAATTTACGGAAAACATTATTAATGGCG
    TCGAGCGTCCGGTTAAAGCCGCTGAATTGTTCGCGTTTACCTTGCGTGTACGCGCAGGAA
    ACACTGACGTTCTTACTGACGCAGAAGAAAACGTGCGTCAAAAATTACGTGCGGAAGGAG
    TGATGTAATGTCTAAAGGTAAAAAACGTTCTGGCGCTCGCCCTGGTCGTCCGCAGCCGTT
    GCGAGGTACTAAAGGCAAGCGTAAAGGCGCTCGTCTTTGGTATGTAGGTGGTCAACAATT
    TTAATTGCAGGGGCTTCGGCCCCTTACTTGAGGATAAATTATGTCTAATATTCAAACTGG
    CGCCGAGCGTATGCCGCATGACCTTTCCCATCTTGGCTTCCTTGCTGGTCAGATTGGTCG
    TCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGACTCCTTCGAGATGGACGCCGT
    TGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCTATTGACTCTACTGTAGACAT
    TTTTACTTTTTATGTCCCTCATCGTCACGTTTATGGTGAACAGTGGATTAAGTTCATGAA
    GGATGGTGTTAATGCCACTCCTCTCCCGACTGTTAACACTACTGGTTATATTGACCATGC
    CGCTTTTCTTGGCACGATTAACCCTGATACCAATAAAATCCCTAAGCATTTGTTTCAGGG
    TTATTTGAATATCTATAACAACTATTTTAAAGCGCCGTGGATGCCTGACCGTACCGAGGC
    TAACCCTAATGAGCTTAATCAAGATGATGCTCGTTATGGTTTCCGTTGCTGCCATCTCAA
    AAACATTTGGACTGCTCCGCTTCCTCCTGAGACTGAGCTTTCTCGCCAAATGACGACTTC
    TACCACATCTATTGACATTATGGGTCTGCAAGCTGCTTATGCTAATTTGCATACTGACCA
    AGAACGTGATTACTTCATGCAGCGTTACCATGATGTTATTTCTTCATTTGGAGGTAAAAC
    CTCTTATGACGCTGACAACCGTCCTTTACTTGTCATGCGCTCTAATCTCTGGGCATCTGG
    CTATGATGTTGATGGAACTGACCAAACGTCGTTAGGCCAGTTTTCTGGTCGTGTTCAACA
    GACCTATAAACATTCTGTGCCGCGTTTCTTTGTTCCTGAGCATGGCACTATGTTTACTCT
    TGCGCTTGTTCGTTTTCCGCCTACTGCGACTAAAGAGATTCAGTACCTTAACGCTAAAGG
    TGCTTTGACTTATACCGATATTGCTGGCGACCCTGTTTTGTATGGCAACTTGCCGCCGCG
    TGAAATTTCTATGAAGGATGTTTTCCGTTCTGGTGATTCGTCTAAGAAGTTTAAGATTGC
    TGAGGGTCAGTGGTATCGTTATGCGCCTTCGTATGTTTCTCCTGCTTATCACCTTCTTGA
    AGGCTTCCCATTCATTCAGGAACCGCCTTCTGGTGATTTGCAAGAACGCGTACTTATTCG
    CCACCATGATTATGACCAGTGTTTCCAGTCCGTTCAGTTGTTGCAGTGGAATAGTCAGGT
    TAAATTTAATGTGACCGTTTATCGCAATCTGCCGACCACTCGCGATTCAATCATGACTTC
    GTGATAAAAGATTGAGTGTGAGGTTATAACGCCGAAGCGGTAAAAATTTTAATTTTTGCC
    GCTGAGGGGTTGACCAAGCGAAGCGCGGTAGGTTTTCTGCTTAGGAGTTTAATCATGTTT
    CAGACTTTTATTTCTCGCCATAATTCAAACTTTTTTTCTGATAAGCTGGTTCTCACTTCT
    GTTACTCCAGCTTCTTCGGCACCTGTTTTACAGACACCTAAAGCTACATCGTCAACGTTA
    TATTTTGATAGTTTGACGGTTAATGCTGGTAATGGTGGTTTTCTTCATTGCATTCAGATG
    GATACATCTGTCAACGCCGCTAATCAGGTTGTTTCTGTTGGTGCTGATATTGCTTTTGAT
    GCCGACCCTAAATTTTTTGCCTGTTTGGTTCGCTTTGAGTCTTCTTCGGTTCCGACTACC
    CTCCCGACTGCCTATGATGTTTATCCTTTGAATGGTCGCCATGATGGTGGTTATTATACC
    GTCAAGGACTGTGTGACTATTGACGTCCTTCCCCGTACGCCGGGCAATAACGTTTATGTT
    GGTTTCATGGTTTGGTCTAACTTTACCGCTACTAAATGCCGCGGATTGGTTTCGCTGAAT
    CAGGTTATTAAAGAGATTATTTGTCTCCAGCCACTTAAGTGAGGTGATTTATGTTTGGTG
    CTATTGCTGGCGGTATTGCTTCTGCTCTTGCTGGTGGCGCCATGTCTAAATTGTTTGGAG
    GCGGTCAAAAAGCCGCCTCCGGTGGCATTCAAGGTGATGTGCTTGCTACCGATAACAATA
    CTGTAGGCATGGGTGATGCTGGTATTAAATCTGCCATTCAAGGCTCTAATGTTCCTAACC
    CTGATGAGGCCGCCCCTAGTTTTGTTTCTGGTGCTATGGCTAAAGCTGGTAAAGGACTTC
    TTGAAGGTACGTTGCAGGCTGGCACTTCTGCCGTTTCTGATAAGTTGCTTGATTTGGTTG
    GACTTGGTGGCAAGTCTGCCGCTGATAAAGGAAAGGATACTCGTGATTATCTTGCTGCTG
    CATTTCCTGAGCTTAATGCTTGGGAGCGTGCTGGTGCTGATGCTTCCTCTGCTGGTATGG
    TTGACGCCGGATTTGAGAATCAAAAAGAGCTTACTAAAATGCAACTGGACAATCAGAAAG
    AGATTGCCGAGATGCAAAATGAGACTCAAAAAGAGATTGCTGGCATTCAGTCGGCGACTT
    CACGCCAGAATACGAAAGACCAGGTATATGCACAAAATGAGATGCTTGCTTATCAACAGA
    AGGAGTCTACTGCTCGCGTTGCGTCTATTATGGAAAACACCAATCTTTCCAAGCAACAGC
    AGGTTTCCGAGATTATGCGCCAAATGCTTACTCAAGCTCAAACGGCTGGTCAGTATTTTA
    CCAATGACCAAATCAAAGAAATGACTCGCAAGGTTAGTGCTGAGGTTGACTTAGTTCATC
    AGCAAACGCAGAATCAGCGGTATGGCTCTTCTCATATTGGCGCTACTGCAAAGGATATTT
    CTAATGTCGTCACTGATGCTGCTTCTGGTGTGGTTGATATTTTTCATGGTATTGATAAAG
    CTGTTGCCGATACTTGGAACAATTTCTGGAAAGACGGTAAAGCTGATGGTATTGGCTCTA
    ATTTGTCTAGGAAATAACCGTCAGGATTGACACCCTCCCAATTGTATGTTTTCATGCCTC
    CAAATCTTGGAGGCTTTTTTATGGTTCGTTCTTATTACCCTTCTGAATGTCACGCTGATT
    ATTTTGACTTTGAGCGTATCGAGGCTCTTAAACCTGCTATTGAGGCTTGTGGCATTTCTA
    CTCTTTCTCAATCCCCAATGCTTGGCTTCCATAAGCAGATGGATAACCGCATCAAGCTCT
    TGGAAGAGATTCTGTCTTTTCGTATGCAGGGCGTTGAGTTCGATAATGGTGATATGTATG
    TTGACGGCCATAAGGCTGCTTCTGACGTTCGTGATGAGTTTGTATCTGTTACTGAGAAGT
    TAATGGATGAATTGGCACAATGCTACAATGTGCTCCCCCAACTTGATATTAATAACACTA
    TAGACCACCGCCCCGAAGGGGACGAAAAATGGTTTTTAGAGAACGAGAAGACGGTTACGC
    AGTTTTGCCGCAAGCTGGCTGCTGAACGCCCTCTTAAGGATATTCGCGATGAGTATAATT
    ACCCCAAAAAGAAAGGTATTAAGGATGAGTGTTCAAGATTGCTGGAGGCCTCCACTATGA
    AATCGCGTAGAGGCTTTGCTATTCAGCGTTTGATGAATGCAATGCGACAGGCTCATGCTG
    ATGGTTGGTTTATCGTTTTTGACACTCTCACGTTGGCTGACGACCGATTAGAGGCGTTTT
    ATGATAATCCCAATGCTTTGCGTGACTATTTTCGTGATATTGGTCGTATGGTTCTTGCTG
    CCGAGGGTCGCAAGGCTAATGATTCACACGCCGACTGCTATCAGTATTTTTGTGTGCCTG
    AGTATGGTACAGCTAATGGCCGTCTTCATTTCCATGCGGTGCACTTTATGCGGACACTTC
    CTACAGGTAGCGTTGACCCTAATTTTGGTCGTCGGGTACGCAATCGCCGCCAGTTAAATA
    GCTTGCAAAATACGTGGCCTTATGGTTACAGTATGCCCATCGCAGTTCGCTACACGCAGG
    ACGCTTTTTCACGTTCTGGTTGGTTGTGGCCTGTTGATGCTAAAGGTGAGCCGCTTAAAG
    CTACCAGTTATATGGCTGTTGGTTTCTATGTGGCTAAATACGTTAACAAAAAGTCAGATA
    TGGACCTTGCTGCTAAAGGTCTAGGAGCTAAAGAATGGAACAACTCACTAAAAACCAAGC
    TGTCGCTACTTCCCAAGAAGCTGTTCAGAATCAGAATGAGCCGCAACTTCGGGATGAAAA
    TGCTCACAATGACAAATCTGTCCACGGAGTGCTTAATCCAACTTACCAAGCTGGGTTACG
    ACGCGACGCCGTTCAACCAGATATTGAAGCAGAACGCAAAAAGAGAGATGAGATTGAGGC
    TGGGAAAAGTTACTGTAGCCGACGTTTTGGCGGCGCAACCTGTGACGACAAATCTGCTCA
    AATTTATGCGCGCTTCGATAAAAATGATTGGCGTATCCAACCTGCA
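
    A sequence in this FASTA format (a header line starting with ">", followed by lines of sequence) is easy to read into a program. The following is a minimal sketch, not the software used to make the atlases: `read_fasta` and `base_composition` are hypothetical helper names, only the first record is read, and the short sequence in the usage example is just the start of the phi-X174 sequence above.

```python
from collections import Counter

def read_fasta(text):
    """Return (header, sequence) for the first record in FASTA-formatted text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break            # a second record starts here; stop reading
            header = line[1:]
        elif header is not None:
            chunks.append(line.upper())
    return header, "".join(chunks)

def base_composition(seq):
    """Fraction of A, C, G and T in the sequence (0.0 for an empty sequence)."""
    counts = Counter(seq)
    total = len(seq)
    return {base: (counts[base] / total if total else 0.0) for base in "ACGT"}

example = """>NC_001422
GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAA
"""
header, seq = read_fasta(example)
print(header, len(seq), base_composition(seq))
```

    Running the same two functions on the complete record gives the overall base composition of the genome, which is what the atlas below displays position by position.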
    




    Base-Composition Atlas for E. coli Phi-X174 bacteriophage







    There are several things to notice in this plot. First, the genome is circular. The densities of the four nucleotides are plotted in the four outermost circles. These densities are not evenly distributed; although all four of the scales range from 0% (min., no colour) to 40% (max. colour intensity), it can easily be seen that the sequence is dominated by T's (red circle), and that there are relatively few G's (outermost turquoise circle) and C's (pink circle), and a few A-rich regions (green 2nd circle).

    There are many genes which overlap (the genes are indicated in the "annotation circle", which is the fifth circle from the outside - with the blue bands representing genes in the forward direction). Note that all the genes are oriented in the same direction. Whilst this is true for some viruses, it is rarely true for organisms with larger genomes. There is a strong bias towards T's over A's (AT skew - which is simply #A's minus #T's, over a given window), a less strong bias of G's over C's (GC skew) and finally the genome is generally AT rich (red in innermost circle).
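
    The quantities shown in the atlas can be approximated with a sliding window that wraps around the end of the circular genome. The sketch below follows the definition given above (AT skew as the number of A's minus the number of T's in a window) and assumes, by analogy, that GC skew is the number of G's minus the number of C's. The function names, window size and step are arbitrary choices for illustration, not the settings used for the published atlases.

```python
def circular_windows(seq, window, step):
    """Yield (start, subsequence) pairs, wrapping around a circular sequence."""
    n = len(seq)
    if not 0 < window <= n:
        raise ValueError("window must be between 1 and the sequence length")
    # Append the first window-1 bases so that windows near the end wrap around.
    extended = seq + seq[:window - 1]
    for start in range(0, n, step):
        yield start, extended[start:start + window]

def window_stats(seq, window=100, step=50):
    """Per-window AT skew, GC skew and AT content for a circular sequence."""
    rows = []
    for start, sub in circular_windows(seq.upper(), window, step):
        a, c, g, t = (sub.count(base) for base in "ACGT")
        rows.append({
            "start": start,
            "at_skew": a - t,               # negative where T's dominate
            "gc_skew": g - c,               # positive where G's dominate
            "at_content": (a + t) / len(sub),
        })
    return rows

# Toy example; for the real genome, pass the 5386 bp sequence instead.
for row in window_stats("AAAATTTTGGGGCCCC", window=4, step=4):
    print(row)
```

    Dividing each skew by the number of A+T (or G+C) in the window is a common alternative that makes windows of different composition comparable; it is left out here to stay close to the definition in the text.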





    LOCUS       NC_001422               5386 bp ss-DNA     circular PHG 31-JUL-2001
    DEFINITION  Coliphage phiX174, complete genome.
    ACCESSION   NC_001422
    VERSION     NC_001422.1  GI:9626372
    KEYWORDS    .
    SOURCE      coliphage phiX174.
      ORGANISM  coliphage phiX174
                Viruses; ssDNA viruses; Microviridae; Microvirus.
    REFERENCE   1  (bases 2370 to 2421)
      AUTHORS   Robertson,H.D., Barrell,B.G., Weith,H.L. and Donelson,J.E.
      TITLE     Isolation and sequence analysis of a ribosome protected fragment
                from bacteriophage phi-X 174 DNA
      JOURNAL   Nature New Biol. 241, 38-40 (1973)
      MEDLINE   73161742
    REFERENCE   2  (bases 1047 to 1094)
      AUTHORS   Ziff,E.B., Sedat,J.W. and Galibert,F.
      TITLE     Determination of the nucleotide sequence of a fragment of
                bacteriophage phi-X 174 DNA
      JOURNAL   Nature New Biol. 241, 34-37 (1973)
      MEDLINE   73161741
    REFERENCE   3  (bases 2370 to 2420)
      AUTHORS   Barrell,B.G., Weith,H.L., Donelson,J.E. and Robertson,H.D.
      TITLE     Sequence analysis of the ribosome-protected bacteriophage phi-X174
                DNA fragment containing the gene G initiation site
      JOURNAL   J. Mol. Biol. 92, 377-393 (1975)
      MEDLINE   75192039
    REFERENCE   4  (bases 2365 to 2591)
      AUTHORS   Air,G.M., Blackburn,E.H., Sanger,F. and Coulson,A.R.
      TITLE     The nucleotide and amino acid sequences of the N (5') terminal
                region of gene G of bacteriophage phi-X174
      JOURNAL   J. Mol. Biol. 96, 703-719 (1975)
      MEDLINE   76072037
    REFERENCE   5  (bases 4137 to 4207)
      AUTHORS   van Mansfeld,A.D.M., Vereijken,J.M. and Jansz,H.S.
      TITLE     The nucleotide sequence of a DNA fragment, 71 base pairs in length,
                near the origin of DNA replication of bacteriophage phi-X174
      JOURNAL   Nucleic Acids Res. 3, 2827-2844 (1976)
      MEDLINE   77057432
    REFERENCE   6  (bases 2395 to 2922)
      AUTHORS   Air,G.M., Sanger,F. and Coulson,A.R.
      TITLE     Nucleotide and amino acid sequences of gene G of phi-X174
      JOURNAL   J. Mol. Biol. 108, 519-533 (1976)
      MEDLINE   77121207
    REFERENCE   7  (bases 1017 to 1762)
      AUTHORS   Air,G.M., Blackburn,E.H., Coulson,A.R., Galibert,F., Sanger,F.,
                Sedat,J.W. and Ziff,E.B.
      TITLE     Gene F of bacteriophage phi-x174. Correlation of nucleotide
                sequences from the DNA and amino acid sequences from the gene
                product
      JOURNAL   J. Mol. Biol. 107, 445-458 (1976)
      MEDLINE   77074163
    REFERENCE   8  (bases 730 to 903)
      AUTHORS   Blackburn,E.H.
      TITLE     Transcription and sequence analysis of a fragment of bacteriophage
                phi-X174 DNA
      JOURNAL   J. Mol. Biol. 107, 417-431 (1976)
      MEDLINE   77074161
    REFERENCE   9  (bases 1017 to 1081)
      AUTHORS   Sedat,J., Ziff,E. and Galibert,F.
      TITLE     Direct determination of DNA nucleotide sequences: Structure of
                large specific fragments of bacteriophage phi-X174 DNA
      JOURNAL   J. Mol. Biol. 107, 391-416 (1976)
      MEDLINE   77074160
    REFERENCE   10 (bases 2263 to 2421)
      AUTHORS   Fiddes,J.C.
      TITLE     Nucleotide sequence of the intercistronic region between genes G
                and F in bacteriophage phi-X174 DNA
      JOURNAL   J. Mol. Biol. 107, 1-24 (1976)
      MEDLINE   77074135
    REFERENCE   11 (bases 1 to 5375)
      AUTHORS   Sanger,F., Air,G.M., Barrell,B.G., Brown,N.L., Coulson,A.R.,
                Fiddes,J.C., Hutchison,C.A., Slocombe,P.M. and Smith,M.
      TITLE     Nucleotide sequence of bacteriophage phi-X174 DNA
      JOURNAL   Nature 265, 687-695 (1977)
      MEDLINE   77171175
    REFERENCE   12 (bases 4505 to 5374)
      AUTHORS   Brown,N.L. and Smith,M.
      TITLE     The sequence of a region of bacteriophage phi-X174 DNA coding for
                parts of genes A and B
      JOURNAL   J. Mol. Biol. 116, 1-30 (1977)
      MEDLINE   78069208
    REFERENCE   13 (sites)
      AUTHORS   Fiddes,J.C.
      TITLE     The nucleotide sequence of a viral DNA
      JOURNAL   Sci. Am. 237, 54-67 (1977)
      MEDLINE   78054683
    REFERENCE   14 (bases 5022 to 5132)
      AUTHORS   Brown,N.L. and Smith,M.
      TITLE     DNA sequence of a region of the phi-X174 genome coding for a
                ribosome binding site
      JOURNAL   Nature 265, 695-698 (1977)
      MEDLINE   77171176
    REFERENCE   15 (bases 5346 to 5386; 1 to 159)
      AUTHORS   Smith,M., Brown,N.L., Air,G.M., Barrell,B.G., Coulson,A.R.,
                Hutchison,C.A.I.I.I. and Sanger,F.
      TITLE     DNA sequence at the C termini of the overlapping genes A and B in
                bacteriophage phi-X174
      JOURNAL   Nature 265, 702-705 (1977)
      MEDLINE   77171178
    REFERENCE   16 (bases 1 to 5386)
      AUTHORS   Sanger,F., Coulson,A.R., Friedmann,T., Air,G.M., Barrell,B.G.,
                Brown,N.L., Fiddes,J.C., Hutchison,C.A., Slocombe,P.M. and Smith,M.
      TITLE     The nucleotide sequence of bacteriophage phi-X174
      JOURNAL   J. Mol. Biol. 125, 225-246 (1978)
      MEDLINE   79091185
    REFERENCE   17 (bases 1290 to 1302; 1340 to 1430; 1510 to 1570; 1600 to 1750)
      AUTHORS   Air,G.M., Coulson,A.R., Fiddes,J.C., Friedmann,T., Hutchison,C.A.,
                Sanger,F., Slocombe,P.M. and Smith,A.J.
      TITLE     Nucleotide sequence of the F protein coding region of bacteriophage
                phi-X174 and the amino acid sequence of its product
      JOURNAL   J. Mol. Biol. 125, 247-254 (1978)
      MEDLINE   79091186
    REFERENCE   18 (bases 4256 to 4317)
      AUTHORS   Langeveld,S.A., van Mansfeld,A.D.M., de Winter,J.M. and
                Weisbeek,P.J.
      TITLE     Cleavage of single-stranded DNA by the A and A* proteins of
                bacteriophage phi-X174
      JOURNAL   Nucleic Acids Res. 7, 2177-2188 (1979)
      MEDLINE   80101074
    REFERENCE   19 (bases 4248 to 4332)
      AUTHORS   Heidekamp,F., Langeveld,S.A., Baas,P.D. and Jansz,H.S.
      TITLE     Studies of the recognition sequence of phi-X174 gene A protein.
                Cleavage site of phi-X gene A protein in St-1 RFI DNA
      JOURNAL   Nucleic Acids Res. 8, 2009-2021 (1980)
      MEDLINE   81053861
    REFERENCE   20 (bases 436 to 490; 630 to 669; 930 to 979)
      AUTHORS   Takeshita,M., Kappen,L.S., Grollman,A.P., Eisenberg,M. and
                Goldberg,I.H.
      TITLE     Strand scission of deoxyribonucleic acid by neocarzinostatin,
                auromomycin, and bleomycin: studies on base release and nucleotide
                sequence specificity
      JOURNAL   Biochemistry (N.Y.) 20 (26), 7599-7606 (1981)
      MEDLINE   82113627
       PUBMED   6173064
    REFERENCE   21 (bases 1064 to 1757)
      AUTHORS   Melville,M.-P., Piette,J., Lopez,M., Decuyper,J. and van de
                Vorst,A.
      TITLE     Termination sites of the in vitro DNA sysnthesis on single-stranded
                DNA photosensitized by promazines
      JOURNAL   J. Biol. Chem. 259, 15069-15077 (1984)
      MEDLINE   85079985
    REFERENCE   22 (bases 449 to 482; 504 to 598; 1047 to 1111)
      AUTHORS   Ueda,K., Morita,J. and Komano,T.
      TITLE     Sequence specificity of heat-labile sites in DNA induced by
                mitomycin C
      JOURNAL   Biochemistry (N.Y.) 23 (8), 1634-1640 (1984)
      MEDLINE   84203526
       PUBMED   6232949
    REFERENCE   23 (bases 2380 to 2512; 2593 to 2786; 2788 to 2947)
      AUTHORS   Air,G.M., Els,M.C., Brown,L.E., Laver,W.G. and Webster,R.G.
      TITLE     Location of antigenic sites in the three-dimensional structure of
                the influenza N2 virus neuraminidase
      JOURNAL   Virology 145, 237-248 (1985)
      MEDLINE   85274373
    COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The
                reference sequence was derived from J02482.
                [8]  intermittent sequences.
                [15]  review; discussion of complete genome.
                Double checked with sumex tape.
                Single-stranded circular DNA which codes for eleven proteins.
                Replicative form is duplex, icosahedron, related to s13 & g4. [21]
                indicates that mitomycin C reduced with sodium borohydride induced
                heat-labile sites in DNA most preferentially at dinucleotide
                sequence 'gt' (especially 'Pu-g-t').
                Bacteriophage phi-X174 single stranded DNA molecules were
                irradiated with near UV light in the presence of promazine
                derivatives, after priming with restriction fragments or synthetic
                primers [22].  The resulting DNA fragments were used as templates
                for in vitro complementary chain synthesis by E.coli DNA polymerase
                I [22].  More than 90% of the observed chain terminations were
                mapped one nucleotide before a guanine residue [22].  Photoreaction
                occurred more predominantly with guanine residues localized in
                single-stranded parts of the genome [22].  These same guanine
                residues could also be damaged when the reaction was performed in
                the dark, in the presence of promazine cation radicals [22].
    FEATURES             Location/Qualifiers
         source          1..5386
                         /organism="coliphage phiX174"
                         /specific_host="Escherichia coli"
                         /db_xref="taxon:10847"
         variation       23
                         /note="c in wt; t in am18 and am35 [14]"
         variation       25
                         /note="g in wt; c in ts116 [14]"
         CDS             51..221
                         /note="K (function unknown)"
                         /codon_start=1
                         /transl_table=11
                         /protein_id="NP_040706.1"
                         /db_xref="GI:9626376"
                         /translation="MSRKIILIKQELLLLVYELNRSGLLAENEKIRPILAQLEKLLLC
                         DLSPSTNDSVKN"
         variation       57
                         /note="c in wt; t in am6 [14]"
         variation       117
                         /note="g in wt; a in am6 [14]"
         CDS             133..393
                         /note="C (DNA maturation)"
                         /codon_start=1
                         /transl_table=11
                         /protein_id="NP_040707.1"
                         /db_xref="GI:9626377"
                         /translation="MRKFDLSLRSSRSSYFATFRHQLTILSKTDALDEEKWLNMLGTF
                         VKDWFRYESHFVHGRDSLVDILKERGLLSESDAVQPLIGKKS"
         mRNA            358..3975
                         /note="mRNA (major alt.)"
         mRNA            358..991
                         /note="mRNA (minor alt.)"
         CDS             390..848
                         /note="D (capsid morphogenesis)"
                         /codon_start=1
                         /transl_table=11
                         /protein_id="NP_040708.1"
                         /db_xref="GI:9626378"
                         /translation="MSQVTEQSVRFQTALASIKLIQASAVLDLTEDDFDFLTSNKVWI
                         ATDRSRARRCVEACVYGTLDFVGYPRFPAPVEFIAAVIAYYVHPVNIQTACLIMEGAE
                         FTENIINGVERPVKAAELFAFTLRVRAGNTDVLTDAEENVRQKLRAEGVM"
         CDS             568..843
                         /note="E (cell lysis)"
                         /codon_start=1
                         /transl_table=11
                         /protein_id="NP_040709.1"
                         /db_xref="GI:9626379"
                         /translation="MVRWTLWDTLAFLLLLSLLLPSLLIMFIPSTFKRPVSSWKALNL
                         RKTLLMASSVRLKPLNCSRLPCVYAQETLTFLLTQKKTCVKNYVRKE"
         CDS             848..964
                         /note="J (core protein, DNA condensation)"
                         /codon_start=1
                         /transl_table=11
                         /protein_id="NP_040710.1"
                         /db_xref="GI:9626380"
                         /translation="MSKGKKRSGARPGRPQPLRGTKGKRKGARLWYVGGQQF"
         CDS             1001..2284
                         /note="F (major coat protein)"
                         /codon_start=1
                         /transl_table=11
                         /protein_id="NP_040711.1"
                         /db_xref="GI:9626381"
                         /translation="MSNIQTGAERMPHDLSHLGFLAGQIGRLITISTTPVIAGDSFEM
                         DAVGALRLSPLRRGLAIDSTVDIFTFYVPHRHVYGEQWIKFMKDGVNATPLPTVNTTG
                         YIDHAAFLGTINPDTNKIPKHLFQGYLNIYNNYFKAPWMPDRTEANPNELNQDDARYG
                         FRCCHLKNIWTAPLPPETELSRQMTTSTTSIDIMGLQAAYANLHTDQERDYFMQRYHD
                         VISSFGGKTSYDADNRPLLVMRSNLWASGYDVDGTDQTSLGQFSGRVQQTYKHSVPRF
                         FVPEHGTMFTLALVRFPPTATKEIQYLNAKGALTYTDIAGDPVLYGNLPPREISMKDV
                         FRSGDSSKKFKIAEGQWYRYAPSYVSPAYHLLEGFPFIQEPPSGDLQERVLIRHHDYD
                         QCFQSVQLLQWNSQVKFNVTVYRNLPTTRDSIMTS"
         CDS             2395..2922
                         /note="G (major spike protein)"
                         /codon_start=1
                         /transl_table=11
                         /protein_id="NP_040712.1"
                         /db_xref="GI:9626382"
                         /translation="MFQTFISRHNSNFFSDKLVLTSVTPASSAPVLQTPKATSSTLYF
                         DSLTVNAGNGGFLHCIQMDTSVNAANQVVSVGADIAFDADPKFFACLVRFESSSVPTT
                         LPTAYDVYPLNGRHDGGYYTVKDCVTIDVLPRTPGNNVYVGFMVWSNFTATKCRGLVS
                         LNQVIKEIICLQPLK"
         CDS             2931..3917
                         /note="H (minor spike protein, adsorption)"
                         /codon_start=1
                         /transl_table=11
                         /protein_id="NP_040713.1"
                         /db_xref="GI:9626383"
                         /translation="MFGAIAGGIASALAGGAMSKLFGGGQKAASGGIQGDVLATDNNT
                         VGMGDAGIKSAIQGSNVPNPDEAAPSFVSGAMAKAGKGLLEGTLQAGTSAVSDKLLDL
                         VGLGGKSAADKGKDTRDYLAAAFPELNAWERAGADASSAGMVDAGFENQKELTKMQLD
                         NQKEIAEMQNETQKEIAGIQSATSRQNTKDQVYAQNEMLAYQQKESTARVASIMENTN
                         LSKQQQVSEIMRQMLTQAQTAGQYFTNDQIKEMTRKVSAEVDLVHQQTQNQRYGSSHI
                         GATAKDISNVVTDAASGVVDIFHGIDKAVADTWNNFWKDGKADGIGSNLSRK"
         misc_feature    3962
                         /note="transcription start site"
         CDS             join(3981..5386,1..136)
                         /note="A (rf replication, viral strand synthesis)"
                         /codon_start=1
                         /transl_table=11
                         /protein_id="NP_040703.1"
                         /db_xref="GI:9626373"
                         /translation="MVRSYYPSECHADYFDFERIEALKPAIEACGISTLSQSPMLGFH
                         KQMDNRIKLLEEILSFRMQGVEFDNGDMYVDGHKAASDVRDEFVSVTEKLMDELAQCY
                         NVLPQLDINNTIDHRPEGDEKWFLENEKTVTQFCRKLAAERPLKDIRDEYNYPKKKGI
                         KDECSRLLEASTMKSRRGFAIQRLMNAMRQAHADGWFIVFDTLTLADDRLEAFYDNPN
                         ALRDYFRDIGRMVLAAEGRKANDSHADCYQYFCVPEYGTANGRLHFHAVHFMRTLPTG
                         SVDPNFGRRVRNRRQLNSLQNTWPYGYSMPIAVRYTQDAFSRSGWLWPVDAKGEPLKA
                         TSYMAVGFYVAKYVNKKSDMDLAAKGLGAKEWNNSLKTKLSLLPKKLFRIRMSRNFGM
                         KMLTMTNLSTECLIQLTKLGYDATPFNQILKQNAKREMRLRLGKVTVADVLAAQPVTT
                         NLLKFMRASIKMIGVSNLQSFIASMTQKLTLSDISDESKNYLDKAGITTACLRIKSKW
                         TAGGK"
         rep_origin      4306
                         /note="origin of viral strand synthesis"
         CDS             join(4497..5386,1..136)
                         /note="A* (shut off host DNA synthesis)"
                         /codon_start=1
                         /transl_table=11
                         /protein_id="NP_040704.1"
                         /db_xref="GI:9626374"
                         /translation="MKSRRGFAIQRLMNAMRQAHADGWFIVFDTLTLADDRLEAFYDN
                         PNALRDYFRDIGRMVLAAEGRKANDSHADCYQYFCVPEYGTANGRLHFHAVHFMRTLP
                         TGSVDPNFGRRVRNRRQLNSLQNTWPYGYSMPIAVRYTQDAFSRSGWLWPVDAKGEPL
                         KATSYMAVGFYVAKYVNKKSDMDLAAKGLGAKEWNNSLKTKLSLLPKKLFRIRMSRNF
                         GMKMLTMTNLSTECLIQLTKLGYDATPFNQILKQNAKREMRLRLGKVTVADVLAAQPV
                         TTNLLKFMRASIKMIGVSNLQSFIASMTQKLTLSDISDESKNYLDKAGITTACLRIKS
                         KWTAGGK"
         misc_feature    4899
                         /note="transcription start site"
         CDS             join(5075..5386,1..51)
                         /note="B (capsid morphogenesis)"
                         /codon_start=1
                         /transl_table=11
                         /protein_id="NP_040705.1"
                         /db_xref="GI:9626375"
                         /translation="MEQLTKNQAVATSQEAVQNQNEPQLRDENAHNDKSVHGVLNPTY
                         QAGLRRDAVQPDIEAERKKRDEIEAGKSYCSRRFGGATCDDKSAQIYARFDKNDWRIQ
                         PAEFYRFHDAEVNTFGYF"
    BASE COUNT     1291 a   1157 c   1254 g   1684 t
    ORIGIN      
            1 gagttttatc gcttccatga cgcagaagtt aacactttcg gatatttctg atgagtcgaa
           61 aaattatctt gataaagcag gaattactac tgcttgttta cgaattaaat cgaagtggac
          121 tgctggcgga aaatgagaaa attcgaccta tccttgcgca gctcgagaag ctcttacttt
          181 gcgacctttc gccatcaact aacgattctg tcaaaaactg acgcgttgga tgaggagaag
          241 tggcttaata tgcttggcac gttcgtcaag gactggttta gatatgagtc acattttgtt
          301 catggtagag attctcttgt tgacatttta aaagagcgtg gattactatc tgagtccgat
          361 gctgttcaac cactaatagg taagaaatca tgagtcaagt tactgaacaa tccgtacgtt
          421 tccagaccgc tttggcctct attaagctca ttcaggcttc tgccgttttg gatttaaccg
          481 aagatgattt cgattttctg acgagtaaca aagtttggat tgctactgac cgctctcgtg
          541 ctcgtcgctg cgttgaggct tgcgtttatg gtacgctgga ctttgtggga taccctcgct
          601 ttcctgctcc tgttgagttt attgctgccg tcattgctta ttatgttcat cccgtcaaca
          661 ttcaaacggc ctgtctcatc atggaaggcg ctgaatttac ggaaaacatt attaatggcg
          721 tcgagcgtcc ggttaaagcc gctgaattgt tcgcgtttac cttgcgtgta cgcgcaggaa
          781 acactgacgt tcttactgac gcagaagaaa acgtgcgtca aaaattacgt gcggaaggag
          841 tgatgtaatg tctaaaggta aaaaacgttc tggcgctcgc cctggtcgtc cgcagccgtt
          901 gcgaggtact aaaggcaagc gtaaaggcgc tcgtctttgg tatgtaggtg gtcaacaatt
          961 ttaattgcag gggcttcggc cccttacttg aggataaatt atgtctaata ttcaaactgg
         1021 cgccgagcgt atgccgcatg acctttccca tcttggcttc cttgctggtc agattggtcg
         1081 tcttattacc atttcaacta ctccggttat cgctggcgac tccttcgaga tggacgccgt
         1141 tggcgctctc cgtctttctc cattgcgtcg tggccttgct attgactcta ctgtagacat
         1201 ttttactttt tatgtccctc atcgtcacgt ttatggtgaa cagtggatta agttcatgaa
         1261 ggatggtgtt aatgccactc ctctcccgac tgttaacact actggttata ttgaccatgc
         1321 cgcttttctt ggcacgatta accctgatac caataaaatc cctaagcatt tgtttcaggg
         1381 ttatttgaat atctataaca actattttaa agcgccgtgg atgcctgacc gtaccgaggc
         1441 taaccctaat gagcttaatc aagatgatgc tcgttatggt ttccgttgct gccatctcaa
         1501 aaacatttgg actgctccgc ttcctcctga gactgagctt tctcgccaaa tgacgacttc
         1561 taccacatct attgacatta tgggtctgca agctgcttat gctaatttgc atactgacca
         1621 agaacgtgat tacttcatgc agcgttacca tgatgttatt tcttcatttg gaggtaaaac
         1681 ctcttatgac gctgacaacc gtcctttact tgtcatgcgc tctaatctct gggcatctgg
         1741 ctatgatgtt gatggaactg accaaacgtc gttaggccag ttttctggtc gtgttcaaca
         1801 gacctataaa cattctgtgc cgcgtttctt tgttcctgag catggcacta tgtttactct
         1861 tgcgcttgtt cgttttccgc ctactgcgac taaagagatt cagtacctta acgctaaagg
         1921 tgctttgact tataccgata ttgctggcga ccctgttttg tatggcaact tgccgccgcg
         1981 tgaaatttct atgaaggatg ttttccgttc tggtgattcg tctaagaagt ttaagattgc
         2041 tgagggtcag tggtatcgtt atgcgccttc gtatgtttct cctgcttatc accttcttga
         2101 aggcttccca ttcattcagg aaccgccttc tggtgatttg caagaacgcg tacttattcg
         2161 ccaccatgat tatgaccagt gtttccagtc cgttcagttg ttgcagtgga atagtcaggt
         2221 taaatttaat gtgaccgttt atcgcaatct gccgaccact cgcgattcaa tcatgacttc
         2281 gtgataaaag attgagtgtg aggttataac gccgaagcgg taaaaatttt aatttttgcc
         2341 gctgaggggt tgaccaagcg aagcgcggta ggttttctgc ttaggagttt aatcatgttt
         2401 cagactttta tttctcgcca taattcaaac tttttttctg ataagctggt tctcacttct
         2461 gttactccag cttcttcggc acctgtttta cagacaccta aagctacatc gtcaacgtta
         2521 tattttgata gtttgacggt taatgctggt aatggtggtt ttcttcattg cattcagatg
         2581 gatacatctg tcaacgccgc taatcaggtt gtttctgttg gtgctgatat tgcttttgat
         2641 gccgacccta aattttttgc ctgtttggtt cgctttgagt cttcttcggt tccgactacc
         2701 ctcccgactg cctatgatgt ttatcctttg aatggtcgcc atgatggtgg ttattatacc
         2761 gtcaaggact gtgtgactat tgacgtcctt ccccgtacgc cgggcaataa cgtttatgtt
         2821 ggtttcatgg tttggtctaa ctttaccgct actaaatgcc gcggattggt ttcgctgaat
         2881 caggttatta aagagattat ttgtctccag ccacttaagt gaggtgattt atgtttggtg
         2941 ctattgctgg cggtattgct tctgctcttg ctggtggcgc catgtctaaa ttgtttggag
         3001 gcggtcaaaa agccgcctcc ggtggcattc aaggtgatgt gcttgctacc gataacaata
         3061 ctgtaggcat gggtgatgct ggtattaaat ctgccattca aggctctaat gttcctaacc
         3121 ctgatgaggc cgcccctagt tttgtttctg gtgctatggc taaagctggt aaaggacttc
         3181 ttgaaggtac gttgcaggct ggcacttctg ccgtttctga taagttgctt gatttggttg
         3241 gacttggtgg caagtctgcc gctgataaag gaaaggatac tcgtgattat cttgctgctg
         3301 catttcctga gcttaatgct tgggagcgtg ctggtgctga tgcttcctct gctggtatgg
         3361 ttgacgccgg atttgagaat caaaaagagc ttactaaaat gcaactggac aatcagaaag
         3421 agattgccga gatgcaaaat gagactcaaa aagagattgc tggcattcag tcggcgactt
         3481 cacgccagaa tacgaaagac caggtatatg cacaaaatga gatgcttgct tatcaacaga
         3541 aggagtctac tgctcgcgtt gcgtctatta tggaaaacac caatctttcc aagcaacagc
         3601 aggtttccga gattatgcgc caaatgctta ctcaagctca aacggctggt cagtatttta
         3661 ccaatgacca aatcaaagaa atgactcgca aggttagtgc tgaggttgac ttagttcatc
         3721 agcaaacgca gaatcagcgg tatggctctt ctcatattgg cgctactgca aaggatattt
         3781 ctaatgtcgt cactgatgct gcttctggtg tggttgatat ttttcatggt attgataaag
         3841 ctgttgccga tacttggaac aatttctgga aagacggtaa agctgatggt attggctcta
         3901 atttgtctag gaaataaccg tcaggattga caccctccca attgtatgtt ttcatgcctc
         3961 caaatcttgg aggctttttt atggttcgtt cttattaccc ttctgaatgt cacgctgatt
         4021 attttgactt tgagcgtatc gaggctctta aacctgctat tgaggcttgt ggcatttcta
         4081 ctctttctca atccccaatg cttggcttcc ataagcagat ggataaccgc atcaagctct
         4141 tggaagagat tctgtctttt cgtatgcagg gcgttgagtt cgataatggt gatatgtatg
         4201 ttgacggcca taaggctgct tctgacgttc gtgatgagtt tgtatctgtt actgagaagt
         4261 taatggatga attggcacaa tgctacaatg tgctccccca acttgatatt aataacacta
         4321 tagaccaccg ccccgaaggg gacgaaaaat ggtttttaga gaacgagaag acggttacgc
         4381 agttttgccg caagctggct gctgaacgcc ctcttaagga tattcgcgat gagtataatt
         4441 accccaaaaa gaaaggtatt aaggatgagt gttcaagatt gctggaggcc tccactatga
         4501 aatcgcgtag aggctttgct attcagcgtt tgatgaatgc aatgcgacag gctcatgctg
         4561 atggttggtt tatcgttttt gacactctca cgttggctga cgaccgatta gaggcgtttt
         4621 atgataatcc caatgctttg cgtgactatt ttcgtgatat tggtcgtatg gttcttgctg
         4681 ccgagggtcg caaggctaat gattcacacg ccgactgcta tcagtatttt tgtgtgcctg
         4741 agtatggtac agctaatggc cgtcttcatt tccatgcggt gcactttatg cggacacttc
         4801 ctacaggtag cgttgaccct aattttggtc gtcgggtacg caatcgccgc cagttaaata
         4861 gcttgcaaaa tacgtggcct tatggttaca gtatgcccat cgcagttcgc tacacgcagg
         4921 acgctttttc acgttctggt tggttgtggc ctgttgatgc taaaggtgag ccgcttaaag
         4981 ctaccagtta tatggctgtt ggtttctatg tggctaaata cgttaacaaa aagtcagata
         5041 tggaccttgc tgctaaaggt ctaggagcta aagaatggaa caactcacta aaaaccaagc
         5101 tgtcgctact tcccaagaag ctgttcagaa tcagaatgag ccgcaacttc gggatgaaaa
         5161 tgctcacaat gacaaatctg tccacggagt gcttaatcca acttaccaag ctgggttacg
         5221 acgcgacgcc gttcaaccag atattgaagc agaacgcaaa aagagagatg agattgaggc
         5281 tgggaaaagt tactgtagcc gacgttttgg cggcgcaacc tgtgacgaca aatctgctca
         5341 aatttatgcg cgcttcgata aaaatgattg gcgtatccaa cctgca
    //
    




    Genome Atlas for E. coli Phi-X174 bacteriophage







    Notice that some of the structural features (such as those in the perfect palindromes circle) are often found near the ends of genes. A more detailed explanation of the various parameters will be given in the next lecture, but for now the important point is that the sequence contains a great deal of information that can be visualised in the atlas but is not readily apparent from looking at the GenBank file alone.
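
    As a small illustration of how even simple quantities have to be computed from the record rather than read off it, the sketch below takes the four numbers on the BASE COUNT line of the phi-X174 entry and derives the genome length, the GC fraction and the purine fraction of the listed strand. The function name `summarize` is made up for this example.

```python
# Base counts copied from the BASE COUNT line of the phi-X174 record.
base_count = {"a": 1291, "c": 1157, "g": 1254, "t": 1684}


def summarize(counts):
    """Return (length, GC fraction, purine fraction) from a dict of base counts."""
    total = sum(counts.values())
    gc = (counts["g"] + counts["c"]) / total
    purines = (counts["a"] + counts["g"]) / total
    return total, gc, purines


length, gc_fraction, purine_fraction = summarize(base_count)
print(f"length = {length} bp")
print(f"GC fraction = {gc_fraction:.3f}")
print(f"purine (A+G) fraction of the listed strand = {purine_fraction:.3f}")
```

    The length obtained this way, 5386 bp, agrees with the "source 1..5386" feature in the record.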











    Part 1   A Brief Introduction to [a few] Alternative Conformations of DNA





    DNA symmetry elements defined


    One possible way of trying to deal with all this information is to develop methods of visualising DNA structures within bacterial chromosomes. The method I have chosen to talk about today is based on two different groups of "DNA symmetry elements". The first is simply various types of repeats. The second is DNA helix families: certain stretches of purines (or pyrimidines) favour A-DNA, and certain stretches of alternating pyrimidines and purines favour Z-DNA. The conformations adopted by these different sequences have putative biological functions, based in part on their structures. The repeats will be discussed first.






    DNA Repeats



    From a DNA sequence perspective, there are 4 types of repeats:

    Direct Repeats
  • Simple Tandem Repeats
  • (Longer) Tandem Repeats
  • Direct (non-tandem)
  • Phased Repeats

    Inverted Repeats

    Mirror Repeats

    Everted Repeats









    Table of DNA sequence repeats, structures, and biological functions

    Repeat   Pattern            Possible structure               Biological function
    (N)n     Direct repeats     recA triple-stranded DNA         homologous recombination; duplications
    (R)n     Mirror repeats     intermolecular triplex;          recombination; replication
                                intramolecular triple strands
    (N)n     Inverted repeats   cruciforms                       deletions (in bacteria); insertion sequences
    (N)n     Everted repeats    parallel-stranded DNA            unknown; stabilisation of telomeres(?)
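
    The four repeat classes can be told apart by how the second copy relates to the first on the same strand. The sketch below assumes the usual sequence-level definitions (direct = same sequence, mirror = reversed, inverted = reverse complement, everted = complement without reversal); these definitions, and the function names, are assumptions made for illustration and are not taken from the atlas software.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq):
    """Complement each base without changing the order."""
    return seq.translate(COMPLEMENT)


def repeat_partner(unit, kind):
    """Return the second copy that pairs with `unit` for a given repeat class."""
    if kind == "direct":
        return unit
    if kind == "mirror":
        return unit[::-1]
    if kind == "inverted":
        return complement(unit)[::-1]
    if kind == "everted":
        return complement(unit)
    raise ValueError(f"unknown repeat class: {kind}")


def find_repeats(seq, kind, unit_len, max_gap=0):
    """List (start, gap) for every unit followed, within max_gap bases, by its partner."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2 * unit_len + 1):
        partner = repeat_partner(seq[i:i + unit_len], kind)
        for gap in range(max_gap + 1):
            j = i + unit_len + gap
            if seq[j:j + unit_len] == partner:
                hits.append((i, gap))
    return hits


for kind in ("direct", "mirror", "inverted", "everted"):
    print(kind, "GAATC ->", repeat_partner("GAATC", kind))
```

    With max_gap=0 the two copies must be adjacent, so an inverted repeat found this way is a perfect palindrome; a larger max_gap allows a spacer, as in the stem and loop of a cruciform.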













    DNA Helix Families



    A-, B-, and Z-DNAs
    A-DNA (left), B-DNA (middle) and Z-DNA (right) -- 12 bp each
    From Dickerson et al. in Cold Spring Harbor Symposium for Quantitative Biology (1982) v47 p13-24.





    3 families of DNA helices:





    A-DNA conformation

    A-DNA family - this is the most common form for double-stranded RNA and RNA/DNA hybrids, as well as for certain DNA sequences, such as long stretches of purines. NMR studies have shown that as few as 5 bp of purines in a row can set up an A-type helix. Most of the DNA inside cells is likely to be a mixture of the A- and B-DNA conformations.
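
    Locating candidate A-DNA regions in a sequence amounts to finding runs of purines (or, read on the other strand, runs of pyrimidines). The following is a minimal sketch using a regular expression; the function name `tracts` is hypothetical, and the default of 5 bases simply mirrors the "5 bp of purines in a row" mentioned above. It says nothing about whether a given run actually adopts the A conformation.

```python
import re


def tracts(seq, bases="AG", min_len=5):
    """Return (start, end, tract) for each maximal run drawn only from `bases`.

    Positions are 0-based and `end` is exclusive.
    """
    pattern = re.compile("[%s]{%d,}" % (bases, min_len))
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(seq.upper())]


# First 30 bases of the phi-X174 sequence from the ORIGIN block of the record.
example = "gagttttatcgcttccatgacgcagaagtt"
print("purine runs:    ", tracts(example, "AG"))
print("pyrimidine runs:", tracts(example, "CT"))
```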















    B-DNA conformation

    B-DNA family - the majority of DNA exists in the "B-DNA form" inside the cells of living organisms. This is the classical "Watson-Crick" structure, although there is considerable sequence-specific variation. Thus, for example, different sequences can have from 9 bp/turn of the helix to 12 bp/turn, depending on the sequence of the DNA! On average, however, the DNA has about 10.5 bp/turn.
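
    The helical repeat can be converted into an average twist angle per base pair by dividing a full turn (360 degrees) by the number of base pairs per turn. The sketch below does this arithmetic for the three values quoted above; the function name is made up for the example.

```python
def twist_per_bp(bp_per_turn):
    """Average helical twist, in degrees per base pair, for a given helical repeat."""
    return 360.0 / bp_per_turn


for repeat in (9, 10.5, 12):
    print(f"{repeat} bp/turn -> {twist_per_bp(repeat):.1f} degrees per bp")
```

    So the quoted range of 9 to 12 bp/turn corresponds to roughly 40 down to 30 degrees of twist per base pair, with about 34 degrees at the average of 10.5 bp/turn.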





















    Z-DNA conformation

    Z-DNA family - this is much rarer than the other two families, although certain sequences, such as runs of GC repeats (e.g. GCGCGC), can form Z-DNA easily. In eukaryotes, CpG islands can form Z-DNA, and methylated CpG islands can form Z-DNA readily in vivo. Furthermore, specific proteins have been isolated which bind preferentially to the left-handed Z-DNA conformation.
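
    Candidate Z-DNA regions can be searched for as tracts in which purines and pyrimidines strictly alternate, of which GCGCGC is one example. The sketch below scans a sequence for maximal alternating runs; the function name and the minimum length of 6 are assumptions chosen for illustration, and an alternating tract is only a candidate, not evidence that Z-DNA forms there.

```python
def alternating_tracts(seq, min_len=6):
    """Return (start, end, tract) for maximal runs of strictly alternating pur/pyr.

    Positions are 0-based and `end` is exclusive. Bases other than A, C, G, T
    break a run.
    """
    seq = seq.upper()
    kinds = ["R" if b in "AG" else "Y" if b in "CT" else "N" for b in seq]
    found, start = [], 0
    for i in range(1, len(seq) + 1):
        # A run ends at the sequence end, at an ambiguous base,
        # or where two neighbouring bases belong to the same class.
        at_end = i == len(seq)
        if at_end or "N" in (kinds[i], kinds[i - 1]) or kinds[i] == kinds[i - 1]:
            if i - start >= min_len and "N" not in kinds[start:i]:
                found.append((start, i, seq[start:i]))
            start = i
    return found


print(alternating_tracts("AAGCGCGCAA"))
```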




























    How Random is DNA?



    Although estimating the levels of A-DNA and Z-DNA might be difficult, one thing that is clear is that there is a strong bias in genomes towards an over-representation of purine stretches, as well as pyr/pur stretches. In addition, the patterns found in eukaryotic DNA are quite different from those in bacterial DNA. So, for example, in the two plots below, the occurrence of stretches of purines or pyr/pur tracts in E. coli is essentially the same as predicted by a "random" model of DNA, but in human chromosome 1 it is very different from that expected, even when taking into account the pentameric composition ("6th order Markov Model"). This implies that the DNA in eukaryotes is much less "random" than the DNA in bacteria - at least with respect to runs of purines or alternating pyr/pur tracts.
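
    The idea of comparing observed tracts with a "random" expectation can be sketched as follows. Note the simplification: instead of the Markov model mentioned above, this example uses the crudest possible null model, shuffling the sequence so that only the base composition is preserved. The function names, the minimum tract length of 10 and the number of shuffles are all assumptions made for illustration.

```python
import random
import re


def count_purine_tracts(seq, min_len=10):
    """Count maximal runs of at least min_len purines (A or G)."""
    return len(re.findall("[AG]{%d,}" % min_len, seq.upper()))


def shuffled_expectation(seq, min_len=10, trials=100, seed=0):
    """Mean tract count over shuffled copies of seq (same base composition)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    bases = list(seq.upper())
    total = 0
    for _ in range(trials):
        rng.shuffle(bases)
        total += count_purine_tracts("".join(bases), min_len)
    return total / trials


def tract_bias(seq, min_len=10, trials=100, seed=0):
    """Return (observed, expected) purine tract counts for seq."""
    observed = count_purine_tracts(seq, min_len)
    expected = shuffled_expectation(seq, min_len, trials, seed)
    return observed, expected


# Toy sequence with all purines clustered together, to show the comparison.
toy = "AG" * 25 + "CT" * 25
observed, expected = tract_bias(toy, trials=50, seed=1)
print(f"observed = {observed}, expected under shuffling = {expected:.2f}")
```

    An observed count well above the shuffled expectation indicates over-representation of purine tracts. A null model that also preserves short oligomer frequencies, as the Markov model does, is a stricter test than this shuffle.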


    Purine tracts in E.coli





    Purine tracts in human chromosome 1








    Link to a table comparing bias in purine stretches in sequenced Archaeal genomes.

    Link to a table comparing bias in purine stretches in sequenced Bacterial genomes.







    REFERENCES



    Papers relevant to this lecture (included in your binder)



  • Anders Gorm Pedersen, Lars Juhl Jensen, Hans-Henrik Stærfeldt, Søren Brunak, and David W. Ussery,
    "A DNA Structural Atlas for Escherichia coli",
    Journal of Molecular Biology, 299 (#4), 907-930, (2000).
          
    [PubMed]        PDF file             [Cover]       [Description of cover figure]

  • Link to Escherichia coli Atlases table




  • Marie Skovgaard, Lars Juhl Jensen, Carsten Friis, Hans-Henrik Stærfeldt, Peder Worning, Søren Brunak, and David Ussery,
    "The Atlas Visualisation of Genome-wide Information",
    Methods in Microbiology, 33, 49-63, (2002).      PDF file





    Other references





    Background on Bioinformatics


  • N.M. Luscombe, D. Greenbaum, and M. Gerstein,
    "What is Bioinformatics? A Proposed Definition and Overview of the Field",
    Methods of Information in Medicine, 40:346-358, (2001).
    [PubMed]     Link to Mark Gerstein's lab publications (where you can find a PDF version of this).



  • David W. Ussery,
    "Bioinformatics2000 Meeting Report",
    Genome Biology, 1 (#3), pages 1-2, (2000).
           PDF file             On-Line Version at http://www.genomebiology.com/2000/1/3/reports/4014/





    Background on DNA Atlases


  • Lars Juhl Jensen, Carsten Friis, and David W. Ussery,
    "Three Views of Microbial Genomes",
    Research in Microbiology, 150, pages 773-777, (1999).
           [PubMed]        PDF file            [Cover]

  • Link to the Mycoplasma genitalium atlas page.


  • Carsten Friis, Lars Juhl Jensen, and David W. Ussery,
    "Visualisation of Pathogenicity Regions in Bacteria",
    Genetica, 108:47-51, (2000).
           [PubMed]        PDF file             [cover]
    Link to Yersinia pestis pPCP1 atlases.
    Link to S. typhimurium DT104 atlases.
    Link to E. coli pO157 atlases.


  • David W. Ussery, Thomas S. Larsen, K. Trevor Wilkes, Carsten Friis, Peder Worning, Anders Krogh, and Søren Brunak,
    "Genome Organisation and Chromatin Structure in Escherichia coli",
    Biochimie, 83:201-212, (2001).
           [PubMed]        PDF file             [cover]

  • Link to web page with supplemental information about this article.




    Background on DNA structures


  • Richard R. Sinden, Christopher E. Pearson, Vladimir N. Potaman, and David W. Ussery, "DNA: Structure and Function", Advances in Genome Biology, 5A:1-141, (1998).



  • Ussery, D.W., Higgins, C.F., and Bolshoy, A., "Environmental Influences on DNA Curvature", J. Biomolecular Structure & Dynamics, 16:811-823, (1999). [PubMed]




  • David W. Ussery,
    "DNA Denaturation",
    The Encyclopedia of Genetics, (Academic Press, New York, 2001), pages 550-553.        PDF file     


  • David W. Ussery,
    "DNA Structure: A-, B-, and Z-DNA Families", The Encyclopedia of Life Sciences, (Macmillan Publishers, London, 2002).        PDF file     



  • David Ussery, Dikeos Mario Soumpasis, Søren Brunak, Hans-Henrik Stærfeldt, Peder Worning, and Anders Krogh,
    "Bias of Purine Stretches in Sequenced Genomes",
    Computers in Chemistry, 26, 531-541, (2002).        PDF file
    Link to web page comparing fractions of purine and pyr/pur tracts in more than 700 chromosomes


  • Vera van Noort, Peder Worning, David W. Ussery, William Rosche, and Richard R. Sinden
    "Strand misalignments lead to quasipalindrome correction"
    Trends in Genetics, 19:365-369, (2003).        
    [PubMed]        PDF file     Reproduced with permission from Trends in Genetics.




  • Link to a list of recent papers and talks on DNA structures.







    Books about DNA:



    Watson, James D. "A PASSION FOR DNA: Genes, Genomes, and Society", (Oxford University Press, Oxford, 2000).      Amazon      Barnes&Noble

    Sinden, Richard R., "DNA: STRUCTURE and FUNCTION", (Academic Press, New York, 1994).      Amazon      Barnes&Noble

    Calladine,C.R., Drew,H.R., "Understanding DNA: The Molecule and How It Works", (2nd edition, Academic Press, San Diego, 1997).      Amazon      Barnes&Noble



    A List of more than a thousand books about DNA







    Link to the CBS Bacterial Genomes Atlas web page




    CORRESPONDENCE

    David W. Ussery: