This lecture is about ways of looking at DNA sequences in complete genomes and chromosomes, in terms of symmetry elements. There are two parts to this talk. In Part 1, I will discuss the fact that we simply have "Too Much Information" becoming available, and the problem will only get worse in the near future. There are ways of cataloging and organising the data, of course. I have found that the true diversity of genome sizes in Nature is often neglected, so we'll talk for a few minutes about the "C-value paradox", along with some possible ideas for WHY certain organisms have so much DNA.
I would like to think that one way of dealing with the explosion of DNA sequence information is to think about it in biological terms, and in particular in physical-chemical terms: the structure and function of symmetry elements. For example, there are specific DNA sequences which "code" for a telomere, and different DNA sequences which are specific for centromeres. Specific DNA sequences, their structures, and biological functions will be discussed.
In Part 2, I will introduce "DNA Atlases", first having a look at base composition throughout sequenced chromosomes, and then looking at gene expression throughout the whole genome.
I have also made a separate file containing specific LEARNING OBJECTIVES for this lecture, as well as a "self-test quiz", which I recommend having a look at BEFORE the lecture, if possible. I've incorporated the answers to questions 1 and 2 into PART 1 of the lecture notes.
Brevis esse laboro, obscurus fio. ("I strive to be brief, and I become obscure.") - Horace
The information in GenBank is doubling every 10 months.
What are the implications of this?
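One way to get a feel for the implications is to turn the doubling time into a growth factor. The following is a minimal sketch, assuming the 10-month doubling time simply stays constant (which is an assumption, not a prediction); the function name `growth_factor` is made up for this illustration.

```python
def growth_factor(months, doubling_time_months=10.0):
    """Factor by which the data grows after `months`,
    assuming a constant doubling time (hypothetical helper)."""
    return 2.0 ** (months / doubling_time_months)


# How much bigger would the database be after 1, 5 and 10 years?
for years in (1, 5, 10):
    factor = growth_factor(12 * years)
    print(f"after {years:>2} years: about {factor:,.1f} times as much data")
```

Under this assumption, 5 years (60 months) is 6 doublings, a factor of 64, and 10 years is 12 doublings, a factor of 4096.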
A look at genome sequencing since 1994:
|YEAR||# GENOMES Sequenced|
Although the number of genomes being sequenced is increasing rapidly, one has to put this into perspective - the organisms can be placed into four different classes:
|Organism group||Size (bp)||No. sequenced|
|viruses||~300 bp to ~350,000 bp||545|
|prokaryotes||~250,000 to ~15,000,000 bp||>100|
|single-celled eukaryotes||~12,000,000 to ~600,000,000,000 bp||4|
|multi-celled eukaryotes||~20,000,000 to ~500,000,000,000 bp||3|
|Drosophila species||Genome Size (in base pairs)|
|D. americana||~300,000,000 bp|
|D. arizonensis||~225,000,000 bp|
|D. eohydei (male)||~234,000,000 bp|
|D. eohydei (female)||~246,000,000 bp|
|D. funebris||~255,000,000 bp|
|D. hydei||~202,000,000 bp|
|D. melanogaster||~180,000,000 bp (~138,000,000 bp sequenced)|
|D. miranda||~300,000,000 bp|
|D. nasutoides||~800,000,000 bp|
|D. neohydei||~192,000,000 bp|
|D. simulans||~127,500,000 bp|
|D. virilis||~345,000,000 bp|
In summary, the genome sizes of the Drosophila species that have been examined so far range from about 127 million bp to about 800 million bp. At present we SUSPECT that they all contain roughly the same number of genes. It is possible (indeed likely) that the larger genomes contain duplicated regions, or perhaps even entire duplicated chromosomes; there is ample space for one, two or more extra copies of the entire genome. In addition, they also contain various types of repeats, known as "selfish DNA".
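The spread in the table can be checked with a few lines of Python. This is only a sketch using the approximate sizes listed above; the dictionary name `genome_size_bp` is made up for this illustration.

```python
# Approximate genome sizes (bp), copied from the Drosophila table above.
genome_size_bp = {
    "D. americana": 300_000_000,
    "D. arizonensis": 225_000_000,
    "D. eohydei (male)": 234_000_000,
    "D. eohydei (female)": 246_000_000,
    "D. funebris": 255_000_000,
    "D. hydei": 202_000_000,
    "D. melanogaster": 180_000_000,
    "D. miranda": 300_000_000,
    "D. nasutoides": 800_000_000,
    "D. neohydei": 192_000_000,
    "D. simulans": 127_500_000,
    "D. virilis": 345_000_000,
}

smallest = min(genome_size_bp, key=genome_size_bp.get)
largest = max(genome_size_bp, key=genome_size_bp.get)
ratio = genome_size_bp[largest] / genome_size_bp[smallest]

print(f"smallest: {smallest}, largest: {largest}")
print(f"the largest genome is about {ratio:.1f} times the smallest")
```

So within a single genus the genome size varies by a factor of about 6, even though the number of genes is suspected to be roughly the same.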
Why does amoeba have more than 200x as much DNA as humans?
Think about it for a discussion in class. I have a possible explanation, although I'm not sure anyone really knows the answer to this, to be honest.
This brings us to the first question on the quiz:
Answers to the self-test quiz which you are supposed to do BEFORE the lecture:
1. The short answer - a very long time. About 2.4 x 10^12 years.
That's about 160 times longer than the estimated age of the universe!
2. The piece of paper would be quite thick - it would reach outside the earth's atmosphere and beyond the orbit of the planet Mars.
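The comparison in answer 1 is easy to check. This sketch assumes an age of the universe of about 15 billion years, which is the figure implied by the "160 times" above (more recent estimates are somewhat lower); the variable names are made up for this illustration.

```python
answer_years = 2.4e12            # answer to quiz question 1, from the notes
age_of_universe_years = 1.5e10   # ASSUMPTION: roughly 15 billion years

ratio = answer_years / age_of_universe_years
print(f"about {ratio:.0f} times the estimated age of the universe")
```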
Today's lecture will cover:
Next Tuesday's lecture will cover:
One way of dealing with the problem of how to display so much sequence information is to have a look at the whole chromosome at once, smoothing over a large window. The entire bacterial chromosome is displayed as a circle, with different colours representing various parameters. First, as an introduction to atlases, we will look at base composition. Then we will have a look at levels of expression of mRNA and proteins throughout the chromosome. As an example, I will use my very favourite organism, Escherichia coli K-12.
There are several things to notice in this plot. First, the bases are not uniformly distributed throughout the genome; there are "clumps" or clusters where specific bases are a bit more concentrated. Also, the G's (turquoise) are clearly favoured on one half of the chromosome, whilst the C's (magenta) are favoured on the other half. This shows up in the "GC-skew" lane as well (2nd circle from the middle). I have labelled the entire terminus region, which ranges from TerE (around 1.08 million bp, or Mbp) to TerG (~2.38 Mbp) in Escherichia coli K-12. Finally, several genes corresponding to the darker bands (i.e., more biased nucleotide composition) are labelled.
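To make the idea of a "GC-skew" lane concrete, here is a minimal sketch of a windowed GC skew calculation. It assumes the commonly used definition (G - C)/(G + C) and non-overlapping windows; the atlas plots may well use a different window size or smoothing, and the function name `gc_skew` is made up for this illustration.

```python
def gc_skew(seq, window=10_000):
    """Return (start, skew) for consecutive windows along `seq`,
    with skew = (G - C) / (G + C). Hypothetical helper."""
    seq = seq.upper()
    result = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        skew = (g - c) / (g + c) if (g + c) else 0.0
        result.append((start, skew))
    return result


# Toy "chromosome": first half G-rich, second half C-rich (made-up data).
toy = "GGGCAT" * 100 + "CCCGAT" * 100
skews = gc_skew(toy, window=300)
for start, skew in skews:
    print(f"{start:>5}  {skew:+.2f}")
```

On this toy sequence the skew is +0.50 in the first two windows and -0.50 in the last two; the sign change is the kind of switch seen at the replication origin and terminus in the real plot.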
The same pattern can be seen for the other three Escherichia coli chromosomes which have been sequenced (so far!), as shown in the table below.
Strain: K-12, isolate W3110
DDBJ NCBI tax
Strain: K-12, isolate MG1655
U. Wisconsin TIGR cmr NCBI tax NCBI entrez
Strain: O157:H7 (substrain EDL933)
U. Wisconsin NCBI tax NCBI entrez
Strain: O157:H7 (substrain RIMD 0509952)
Miyazaki, Japan NCBI tax NCBI entrez
||DNA Res. 8:11-22
In addition to showing overall global properties of the chromosome (such as the replication origin and terminus), the base composition can also highlight regions that differ from the rest of the genome. For example, in the plasmid pO157, there are some regions which are much more AT rich (these probably came about as a result of horizontal gene transfer - we will discuss this again in the next lecture).
Note that the "toxB" gene is much more AT rich than the average for the rest of the plasmid. This COULD be due to the fact that this gene came from an organism with a more AT rich genome, or (more likely in my opinion) it is more AT rich because it is important for this gene to vary in sequence (i.e., to have a higher mutational frequency).
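Regions like this can be picked out by comparing the AT content of each window with the average for the whole sequence. The following is a rough sketch only: the window size, the 10-percentage-point threshold and the names `at_content` and `at_rich_windows` are all assumptions made for this illustration, and the sequence is toy data, not pO157.

```python
def at_content(seq):
    """Fraction of A and T in `seq` (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("A") + seq.count("T")) / len(seq) if seq else 0.0


def at_rich_windows(seq, window, excess=0.10):
    """Return the overall AT fraction and the (start, AT fraction) of every
    non-overlapping window that exceeds it by at least `excess`."""
    overall = at_content(seq)
    hits = []
    for start in range(0, len(seq) - window + 1, window):
        frac = at_content(seq[start:start + window])
        if frac - overall >= excess:
            hits.append((start, frac))
    return overall, hits


# Toy sequence: 50% AT background with one 100 bp AT-only island.
toy = "GCAT" * 75 + "AATT" * 25 + "GCAT" * 75
overall, hits = at_rich_windows(toy, window=100)
print(f"overall AT content: {overall:.1%}")
print("AT-rich windows:", hits)
```

Only the window starting at position 300 is flagged. Note that a calculation like this can show WHERE the composition differs, but not WHY; the two explanations above cannot be told apart from base composition alone.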
Escherichia coli is probably the best characterised organism.
There are 4085 predicted genes in Escherichia coli strain K-12 isolate W3110.
There are 4289 predicted genes in Escherichia coli strain K-12 isolate MG1655.
There are 5283 predicted genes in Escherichia coli strain O157:H7 isolate EDL933 (enterohemorrhagic pathogen). There are about 5361 predicted genes in Escherichia coli strain O157:H7 substrain RIMD 0509952 (enterohemorrhagic pathogen).
Roughly 2600 genes have been found to be expressed in Escherichia coli strain K-12 cells, under standard laboratory growth conditions.
About 2100 spots can be seen on 2-D protein gels.
Very roughly 1000 different genes (only about 600 mRNA transcripts) are expressed at "detectable levels" in E. coli cells grown in LB media.
Only about 350 proteins exist at concentrations of > 100 copies per cell. (These make up 90% of the total protein in E. coli!)
Most (>90%) of the proteins are present in very low amounts (less than 100 copies per cell).
It has been known since the 1960s that genes closer to the replication origin are more highly expressed. However, only in the past few years has technology allowed the simultaneous monitoring of ALL the genes in Escherichia coli. There are 4397 annotated genes in the E. coli K-12 genome. Shown below is an "Atlas plot" of the E. coli K-12 genome, with the two outer circles representing the concentration of proteins (roughly in number of molecules/cell) and of mRNA (again, roughly number of molecules/cell). Under these conditions (i.e., cells grown to late log phase, in minimal media), 2005 genes were expressed at detectable levels, and only 233 proteins were found to be "abundant" (very roughly, more than 100 molecules per cell).
For E. coli K-12 cells, grown in minimal media to late log phase:
4397 annotated genes -> 2005 mRNAs expressed -> 233 abundant proteins
(note that these numbers will vary for different experimental conditions....)
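Expressed as fractions, the "funnel" above looks like this. The sketch uses only the three numbers quoted for this one experiment; the helper name `percent` is made up for this illustration.

```python
annotated_genes = 4397
expressed_mrnas = 2005
abundant_proteins = 233


def percent(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100.0 * part / whole, 1)


print(f"genes with detectable mRNA:       {percent(expressed_mrnas, annotated_genes)}%")
print(f"genes with an abundant protein:   {percent(abundant_proteins, annotated_genes)}%")
print(f"expressed genes that are abundant: {percent(abundant_proteins, expressed_mrnas)}%")
```

That is, a bit less than half of the annotated genes (about 45.6%) give detectable mRNA under these conditions, and only about 5.3% give an abundant protein.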
In this picture, the outer lane represents the concentration of proteins (blue), the next lane the concentration of mRNA (green), and then the annotated genes.
The inner three circles represent different aspects of the DNA base composition throughout the genome. The innermost circle (turquoise/violet) is the bias of G's towards one strand or the other (that is, a look at the mono-nucleotide distribution of the 4 DNA bases). The next lane is the density of purine (or pyrimidine) stretches of 10 bp or longer. Note that in both cases purines tend to favour the leading strand of the replichore, whilst pyrimidine tracts are more likely to occur on the lagging strand. Finally, the next circle (turquoise/red) is simply the AT content of the genome, averaged over a 50,000 bp window. Note that the terminus is slightly more AT rich, whilst the rest of the genome is slightly GC rich. (The AT content scale ranges from 45% to 55%.)
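For those who want to see what a "purine stretch of 10 bp or longer" means in practice, here is a small sketch that finds such tracts with a regular expression. It only locates the tracts on the strand given; turning the hits into a density lane (counts per window, per strand) is left out, and the names `PURINE_TRACT`, `PYRIMIDINE_TRACT` and `tracts` are made up for this illustration.

```python
import re

# Purines are A and G; pyrimidines are C and T.
PURINE_TRACT = re.compile(r"[AG]{10,}")
PYRIMIDINE_TRACT = re.compile(r"[CT]{10,}")


def tracts(seq, pattern):
    """Return (start, length) for every match of `pattern` in `seq`."""
    return [(m.start(), m.end() - m.start())
            for m in pattern.finditer(seq.upper())]


# Toy sequence: a 12 bp purine tract, two short runs, an 11 bp pyrimidine tract.
toy = "AGGAAGAGGAGA" + "CTC" + "AGAGGA" + "TCTTCCTCTCT"
print("purine tracts:    ", tracts(toy, PURINE_TRACT))
print("pyrimidine tracts:", tracts(toy, PYRIMIDINE_TRACT))
```

The short runs ("CTC" and "AGAGGA") fall below the 10 bp cut-off and are ignored, so one purine tract (at position 0, 12 bp long) and one pyrimidine tract (at position 21, 11 bp long) are reported.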
Link to more atlases for Escherichia coli genomes.
Link to the main "Genome Atlas" web page
Friday (6 April, 2001)
Link to a list of recent papers and talks on DNA structures.
Watson, James D., "A PASSION FOR DNA: Genes, Genomes, and Society" (Oxford University Press, Oxford, 2000).
Sinden, Richard R., "DNA: STRUCTURE and FUNCTION" (Academic Press, New York, 1994).
Calladine, C.R., Drew, H.R., "Understanding DNA: The Molecule and How It Works" (2nd edition, Academic Press, San Diego, 1997).
A List of more than a thousand books about DNA