- Introduction to DNA Atlases
- Genome Organisation
- Bacterial Chromatin
- Localisation of Gene Expression
One way of dealing with the problem of how to display so much sequence information is to have a look at the whole chromosome at once, smoothing over a large window. The entire bacterial chromosome is displayed as a circle, with different colours representing various parameters. First, as an introduction to atlases, we will look at the ``Genome Atlas'', which maps properties of the DNA sequence along the chromosome. For more information, the "DNA Structural Atlas" is described in Pedersen et al., 2000, and the "Genome Atlas" is a combination of the Structural Atlas plus information about global repeats and base-composition (Jensen, Friis, and Ussery, 1999).
After an introduction to atlases and a discussion of global properties of whole genomes, we will then have a look at localisation of chromatin-associated protein binding sites, and finally at the levels of expression of mRNA and proteins throughout the chromosome. As an example, I will use my very favourite organism, Escherichia coli K-12.
There are several things to notice in this plot. First, in the words of Douglas Adams, ``Don't Panic!''. It's not that bad, really! Once you get used to looking at atlases, they can be a very useful tool for visualising information about DNA sequences in whole genomes. The scales are plotted with the genome average values grey, and the extreme values (more than 3 standard deviations from average) are strongly coloured. Thus you don't SEE the "average" values, but only the regions which differ significantly from the average.
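The colouring rule described above can be sketched in a few lines of Python. This is only an illustration, not the actual atlas software: the function name `colour_strength` and the linear ramp between the average and 3 standard deviations are assumptions made here.

```python
import statistics

def colour_strength(values, max_sd=3.0):
    """Map each value to a signed strength in [-1, 1].

    0 corresponds to the genome average (drawn grey); -1 and +1 correspond
    to values at or beyond max_sd standard deviations (drawn in full colour).
    The linear ramp in between is an assumption of this sketch.
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        # A completely flat track has no extremes to highlight.
        return [0.0 for _ in values]
    strengths = []
    for v in values:
        z = (v - mean) / sd
        # Clip so that anything beyond max_sd SD gets the strongest colour.
        strengths.append(max(-1.0, min(1.0, z / max_sd)))
    return strengths
```

With such a mapping, most windows end up close to 0 (grey), and only the regions far from the genome average stand out.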
There are three different types of information in a genome atlas. The first, shown in the outer-most three circles, maps DNA structural/mechanical properties. This is followed by an annotation circle, showing where the annotated genes are located. The next two circles (blue and red) are global repeats (blue is direct repeats, and red is inverted repeats). Finally, the inner-most two circles contain base-composition information, about the GC-skew (purple/turquoise) and AT content (turquoise/red). A description of the Genome Atlas for E. coli shown here will be given below. In addition, here is a link to a web page giving a more detailed description of the Genome Atlas:
DNA Mechanical/Structural Properties (outer three circles)
The outermost circle (orange to blue) shows the relative magnitude of DNA curvature, smoothed over a large window. Note that in the E. coli K-12 genome, there are several dark blue regions, indicating areas which are much more curved than the average (grey), and that there are relatively few regions which are LESS curved than average (e.g., dark orange). Also note that in general the region around the replication terminus is more curved (bluish) than the region around the origin (greyish).
The second circle is the stacking energy, in kcal/mol. Less stacking energy (i.e., a smaller number) means that the helix will melt more readily, and is shaded red. This is correlated with (but not quite the same as) the AT content. Note that there are several regions which are dark red, indicating they will melt quite readily. For example, see the region around the rfaJ gene, near the replication origin. Again, note that there are more red regions (i.e., regions which would melt more easily) than there are green regions (which are more difficult to melt). The distribution for stacking energy is also skewed.
The third circle is ``position preference'', which is related to the flexibility of the DNA. The scale is such that green regions are MORE FLEXIBLE, whilst violet regions are more rigid. There seems to be a fairly good correlation between flexible (``green'') regions and clusters of highly expressed genes.
All of this is described in more detail elsewhere, as well as in our "DNA Structural Atlas" paper (see the paper by Pedersen et al. in the references below).
Annotation circle
Usually, the GenBank files for sequenced genomes will contain locations of predicted genes. The genes going in the ``forward'' (clockwise) direction are blue, whilst genes on the other strand are red. Also, rRNA and tRNA genes are shown, although for most bacterial chromosomes, the tRNA genes are too small to be visible. Notice that in the E. coli K-12 genome, the seven rRNA genes light up with distinctive structural features - they are generally more GC-rich, more flexible (and of course more highly expressed) and are repeated throughout the chromosome.
Global DNA Repeat circles
There are many different ways of calculating repeats. For the Genome Atlas plots, "Global Direct Repeats" are obtained from a blast search of the genome against itself. Only matches of 100 bp or longer are counted. The (obvious) perfect match of the genome to itself is taken away, and the rest of the best matches are recorded along the length of the chromosome. The scale is the negative log (base 10) of the expectation score or ``E value''. Thus, a value of 9 or greater is very significant (i.e., a match this good would be expected by chance less than 1 time in 1,000,000,000).
The inverted repeats are calculated in the same way, but only the hits on the opposite strand are taken into consideration. Note that there are far fewer significant hits on the other strand. This is true for many bacterial genomes, although in eukaryotic chromosomes, the number of global inverted repeats is often equal to the number of global direct repeats.
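As a small worked example of the repeat scale, the conversion from an E value to a plotted score can be written as below. I am assuming here that the scale is minus the base-10 logarithm of the E value, which is what a score of 9 for a 1 in 1,000,000,000 match implies; the function name `repeat_score` and the `floor` guard are hypothetical and are not taken from the atlas software.

```python
import math

def repeat_score(e_value, floor=1e-300):
    """Return -log10(E); a larger score means a more significant match.

    The floor avoids taking the logarithm of zero when a search tool
    reports an E value of 0.0 for a very strong hit (an assumption made
    for this sketch).
    """
    return -math.log10(max(e_value, floor))

# An E value of 1e-9 gives a score of about 9 on this scale.
score = repeat_score(1e-9)
```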
Base-Composition circles
The inner-most two circles contain information about the base-composition of the genome, smoothed over a large window. The ``GC-skew'' is simply the number of G's minus the number of C's, over a 10,000 bp window. Thus, if there are more G's than C's, a positive number results (turquoise), whilst more C's than G's will result in a negative number (violet). This can be useful in visualising the replication origin and terminus.
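The GC-skew as defined above (G's minus C's in a window) is easy to compute directly. The following is a minimal sketch, assuming the genome is available as a plain string and treating the chromosome as circular; the function name `gc_skew` and the `step` parameter are choices made for this illustration only.

```python
def gc_skew(sequence, window=10_000, step=10_000):
    """Return the number of G's minus the number of C's for each window.

    Windows that run past the end of the sequence wrap around to the
    start, because the bacterial chromosome is circular.
    """
    seq = sequence.upper()
    n = len(seq)
    window = min(window, n)  # a window cannot be longer than the genome
    skews = []
    for start in range(0, n, step):
        chunk = seq[start:start + window]
        if len(chunk) < window:
            chunk += seq[:window - len(chunk)]
        skews.append(chunk.count("G") - chunk.count("C"))
    return skews
```

Some tools divide this difference by the total number of G's and C's in the window; the raw difference is used here because that is how the skew is described in the text.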
The inner-most circle is the percent AT. In the case of E. coli, which is nearly 50% AT content, on average, the scale goes from 45% (turquoise) to 55% (red).
Similar patterns can be seen in Genome Atlases for the other three Escherichia coli chromosomes which have been sequenced (so far!), as shown in the table below. The global repeats can be used as "fingerprints" to see differences in closely related genomes - for example, have a look around the rpoD gene in the two K-12 strains, to see some repeats present in one isolate that are missing in the other.
Strain: K-12, isolate W3110
DDBJ NCBI tax
Strain: K-12, isolate MG1655
U. Wisconsin TIGR cmr NCBI tax NCBI entrez
Strain: O157:H7 (substrain EDL933)
U. Wisconsin NCBI tax NCBI entrez
Strain: O157:H7 (substrain RIMD 0509952)
DDBJ NCBI tax NCBI entrez
DNA Res. 8:11-22
In addition to showing overall global properties of the chromosome (such as replication origin and terminus), the Genome Atlas can also highlight regions different from the rest of the genome. For example, in the plasmid pO157, there are some regions which are much more AT rich (probably these came about as a result of horizontal gene transfer - I didn't have time to discuss this in very much detail during the lecture...)
Note that the marked genes are different from the other genes. For example, the ``L7095'' gene (upper left-hand side) is more curved, more readily melted and more rigid than the average for the 92,000 bp plasmid. In addition, it is flanked by two inverted repeats (which turn out to be Insertion Sequence or IS elements). Furthermore, this gene is much more AT rich than the average for the rest of the plasmid. This COULD be due to the fact that this gene came from an organism with a more AT rich genome, or (more likely in my opinion) it is more AT rich because it is important for this gene to vary in sequence (e.g., have a higher mutational frequency).
There are about a dozen chromatin proteins in E. coli, most of which can be quite abundant (around 200,000 molecules per cell!) at some time during growth. Of these, four are well characterised and are shown below. Two of the four (Fis and IHF) have been experimentally shown to have a fairly specific binding motif.
By using the data from more than 100 experimentally determined IHF binding sites (footprints), we have determined the following sequence preference. This is a ``logo plot''. More information about how to interpret logo plots can be found on Tom Schneider's web pages.
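To give a feeling for what the height of each column in a logo plot means, here is a minimal sketch that computes the information content (in bits) at each position of a set of aligned sites. It leaves out the small-sample correction used in real logos, and the function name `information_content` is an assumption of this example; it is not the code used to build the IHF model.

```python
import math
from collections import Counter

def information_content(aligned_sites):
    """Return the bits of information at each column of aligned DNA sites.

    For each column this is 2 minus the Shannon entropy of the base
    frequencies, so a perfectly conserved position scores 2 bits and a
    position with all four bases equally frequent scores 0 bits.
    No small-sample correction is applied in this sketch.
    """
    length = len(aligned_sites[0])
    bits = []
    for i in range(length):
        counts = Counter(site[i] for site in aligned_sites)
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log2(c / total)
                       for c in counts.values())
        bits.append(2.0 - entropy)
    return bits
```

Applied to a hypothetical toy alignment such as `["AT", "AC", "AG", "AA"]`, the first column is fully conserved and the second carries no information.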
Using this model of the IHF binding sites, we found roughly 1000 potential IHF sites in the E. coli genome. In addition to the general IHF site, there are also certain IHF binding sites which occur within REP (Repetitive Extragenic Palindromic) sequences, of which we predicted about 175 sites (essentially we found the same 88 that had previously been predicted by another group, and another 87 sites on the other strand of DNA which had been missed in the previous study!). We also found roughly 6000 potential Fis sites. The location of these sites, colour-coded for reliability score, is shown in the atlas below.
Although the binding sites are scattered throughout the genome, there appear to be certain regions of higher concentrations or "clumping" of the sites, rather than a homogeneous distribution. This can be more readily visualised by a smoothed version of the same atlas (the data in the figure below was smoothed over a 100,000 bp window).
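The kind of smoothing used to show this clumping can be sketched as a sliding-window count of sites around a circular chromosome. This is an illustration under stated assumptions rather than the actual atlas code: the function name `site_density`, the `step` parameter and the simple unweighted count are all choices made here.

```python
def site_density(positions, genome_length, window=100_000, step=10_000):
    """Count binding sites within a window centred at regular steps.

    Distances are measured the short way around the circle, so windows
    near position 0 also pick up sites from the end of the sequence.
    """
    half = window // 2
    densities = []
    for centre in range(0, genome_length, step):
        count = 0
        for p in positions:
            d = abs(p - centre)
            d = min(d, genome_length - d)  # shortest distance on the circle
            if d <= half:
                count += 1
        densities.append(count)
    return densities
```

Plotting these counts around the circle gives peaks where the sites cluster and troughs where they are sparse.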
Link to more information about these plots and other chromatin protein binding sites in the E. coli genome.
Note that there seem to be some regions of the chromosome which have relatively fewer chromatin protein binding sites. Some of the regions (particularly around the replication origin) with less dense IHF protein binding sites occur in more flexible regions of the chromosome, based on the "position preference" measure, which is related to anisotropic flexibility of the DNA. The IHF-REP sites seem to be concentrated in a few regions closer to the replication origin, and less abundant in the terminus regions. The Fis binding sites appear to be concentrated in two regions flanking the replication terminus.
Escherichia coli is probably the best characterised organism.
There are 4085 predicted genes in Escherichia coli strain K-12 isolate W3110.
There are 4289 predicted genes in Escherichia coli strain K-12 isolate MG1655.
There are 5283 predicted genes in Escherichia coli strain O157:H7 isolate EDL933 (enterohemorrhagic pathogen).
There are about 5361 predicted genes in Escherichia coli strain O157:H7 substrain RIMD 0509952 (enterohemorrhagic pathogen).
Roughly 2600 genes have been found to be expressed in Escherichia coli strain K-12 cells, under standard laboratory growth conditions.
About 2100 spots can be seen on 2-D protein gels.
Very roughly 1000 different genes (only about 600 mRNA transcripts) are expressed at "detectable levels" in E. coli cells grown in LB media.
Only about 350 proteins exist at concentrations of > 100 copies per cell. (These make up 90% of the total protein in E. coli!)
Most (>90%) of the proteins are present in very low amounts (less than 100 copies per cell).
It has been known since the 1960's that genes closer to the replication origin are more highly expressed. However, it has only been in the past few years that technology has allowed the simultaneous monitoring of ALL the genes in Escherichia coli. There are 4397 annotated genes in the E. coli K-12 genome. Shown below is an "Atlas plot" of the E. coli K-12 genome, with the outer circles representing the concentration of proteins (roughly in number of molecules/cell) and mRNA (again, roughly number of molecules/cell). Under these conditions (i.e., cells grown to late log phase, in minimal media), there were 2005 genes expressed at detectable levels, and only 233 proteins were found to exist at "abundant" levels (i.e., very roughly more than 100 molecules per cell).
For E. coli K-12 cells, grown in minimal media to late log phase:
4397 annotated genes -> 2005 mRNAs expressed -> 233 abundant proteins
(note that these numbers will vary for different experimental conditions....)
In this picture, the outer lane represents the concentration of proteins (blue), the next lane the concentration of mRNA (green), and then the annotated genes.
The inner three circles represent different aspects of the DNA base composition throughout the genome. The innermost circle (turquoise/violet) is the bias of G's towards one strand or the other (that is, a look at the mono-nucleotide distribution of the 4 DNA bases). The next lane is the density of purine (or pyrimidine) stretches of 10 bp or longer. Note that in both cases purines tend to favour the leading strand of the replicore, whilst pyrimidine tracts are more likely to occur on the lagging strand. Finally, the next circle (turquoise/red) is simply the AT content of the genome, averaged over a 50,000 bp window. Note that the terminus is slightly more AT rich, whilst the rest of the genome is slightly GC rich. (The AT content scale ranges from 45% to 55%).
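Finding the purine (or pyrimidine) stretches mentioned above is a simple pattern search. The sketch below assumes the sequence is a plain string of A, C, G and T; the function name `base_tracts` and its parameters are hypothetical, chosen only for this illustration.

```python
import re

def base_tracts(sequence, bases="AG", min_length=10):
    """Return (start, end) pairs for runs of the given bases.

    With bases="AG" this finds purine tracts; with bases="CT" it finds
    pyrimidine tracts on the same strand (which are purine tracts on the
    opposite strand). Coordinates are 0-based, with the end exclusive.
    """
    pattern = re.compile("[%s]{%d,}" % (bases, min_length))
    return [(m.start(), m.end()) for m in pattern.finditer(sequence.upper())]
```

The density plotted in the atlas would then be the number (or total length) of such tracts per window, counted along the chromosome.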
Link to more atlases for Escherichia coli genomes.
Link to the main "Genome Atlas" web page for Bacterial Genomes
Link to the main "Genome Atlas" web page for Archaeal Genomes
Li, "Molecular Evolution", (Sinauer Associates, Inc., Publishers, Sunderland, Massachusetts, USA,
"What is Life?" by Erwin Schrödinger (Cambridge University Press, 1944)
"What is Life? The Next Fifty Years - Speculations on the Future of Biology" (edited by Michael P. Murphy and Luke A.J. O'Neill, Cambridge University Press, 1995).
"At Home in the Universe - The Search for the Laws of Self-organization and Complexity", by Stuart Kauffman (Oxford University Press, 1995).
"THE LOGIC OF LIFE - A History of Heredity", by Francois Jacob (Vintage Books, A Division of Random House, New York, 1973, translated by Betty E. Spillman).
"ENTROPY - Toward a Unified Theory of Biology", by Daniel R. Brooks and E.O. Wiley (The University of Chicago Press, Chicago, 1988, 2nd Edition).
Photocopies of the following articles are provided in the course programme:
Other related articles:
Link to a list of recent papers and talks on DNA structures.
Watson, James D., "A PASSION FOR DNA: Genes, Genomes, and Society", (Oxford University Press, Oxford, 2000).
Sinden, Richard R., "DNA: STRUCTURE and FUNCTION", (Academic Press, New York, 1994).
Calladine, C.R., Drew, H.R., "Understanding DNA: The Molecule and How It Works", (2nd edition, Academic Press, San Diego, 1997).
A List of more than a thousand books about DNA