Models of Evolution
Exercise written by: Anders Gorm Pedersen
Overview
In today's exercise we will explore a number of different, but closely related, models of
evolution. Using such models it is possible to estimate the number of unseen mutational events
and thereby obtain genetic distances that have been corrected for superimposed substitutions.
It is, however, important to realize that these corrections are based on the assumption that
we observe approximately the expected amount of change  if, for instance, 20 mutational
events end up leading to no observable changes then it is impossible to guess the actual
amount of change regardless of which correctional scheme we employ. Using more and longer sequences
helps ensuring that the observed change is closer to the expected change, so the correction
is more likely to be accurate with adequate amounts of data. As we will see later, the
same models also play an important role in phylogenetic reconstruction based on maximum
likelihood and Bayesian techniques.
Log in to your Unix shell account on "padawan".
Make sure to maximize the window: the analyses we will perform give lots of
output to the screen, so having a nice and large shell window makes it easier
to keep track of what happens.
Construct todays working directory:
mkdir models
Change to today's working directory
cd models
Copy files for today's exercise:
cp ~gorm/models/primatemitDNA.nexus ./primatemitDNA.nexus
cp ~gorm/models/titv.data ./titv.data
Inspect sequence file:
nedit primatemitDNA.nexus
This file contains an aligned set of mitochondrial DNA sequences from man, chimpanzee,
gorilla, orangutan and gibbon. Mitochondria are cellular organelles that are bounded by a
lipid membrane and contain their own genome. Mitochondrial DNA is related to certain
bacterial genomes, and it is believed that the original mitochondrium was a primitive
bacterial cell that was engulfed by an early ancestor of eukaryotic cells and that the pair
subsequently went on to form a constant symbiotic relationship.
Mitochondrial DNA has a higher rate of substitution than nuclear DNA. This
makes it useful for investigating phylogenetic relationships between closely related
species, such as the five primates included in the present data set.
Inspect additional data file:
nedit titv.data
This file contains a single header line and one column of numbers giving
estimated times of divergence between man and chimpanzee, man and gorilla, man and
orangutan, and man and gibbon. (Divergence times are in millions of years). This file
will be used later in the exercise when we investigate how various distance measures
increase over time.
The Jukes and Cantor model.
The Jukes and Cantor model of evolution has the following rate matrix:
 A C G T 

A   a a a 
 
C  a  a a 
 
G  a a  a 
 
T  a a a  

The total rate of substitution per site is u = 3a. We will now use the
gnuplot program to explore some features of evolution occurring according to this
model.
Start gnuplot program:
gnuplot
You have previously used the gnuplot program to plot data sets from a file. The program
can also be used to plot functions that are given in symbolic form.
Examine expected amount of change as a function of branch length:
For the Jukes and Cantor model the following equation gives the probability of observing
a change at a given site, expressed as a function of the substitution rate and
the elapsed time: D_s = 0.75 * (1  exp(1.33 * ut))
Where u is the substitution rate and t is the elapsed time. This is
also the expected fraction of sites showing observable change along a branch where
there has in fact been ut amount of change (if any single site has probability
P of changing, then on average P x L sites will have changed in a sequence
of length L). We will now explore how the expected amount of observable
change depends on the actual amount of change:
set xlab "Branch length"
set ylab "Observed difference"
plot [0:10] [0:1] 0.75*(1exp(1.33*x)) tit "Exp. observable differences" , x tit "Real distance", 0.75 tit "y=0.75"
The plot command above has to be entered on a single line! In this expression, x
represents the product ut, meaning the actual amount of change that has
occurred (the branch length). The curve we have plotted thus gives the expected observed difference as a
function of the actual amount of change.
Q1: Include the plot in your reply.
You will notice that the observed change increases from zero to an equilibrium value of
0.75. According to the Jukes and Cantor model, sequences can therefore become no more than
75% different, which is the same as being 25% identical. Twentyfive percent identity is of
course the similarity we would get from constructing two random sequences containing 25% of
each base and simply placing them side by side. You will also note that for low values of
x, the observable difference rises almost linearly. Try to inspect this in greater
detail:
plot [0:1] [0:1] 0.75*(1exp(1.33*x)) tit "Exp. observable differences" , x tit "Real distance", 0.75 tit "y=0.75"
Note how the expected observed difference is very close to the real distance up to
about x=0.1 changes per site. This means there is only a limited effect of JC
corrections when a pair of sequences differ at less than 10% of their sites.
Examine estimated branch length as a function of observed difference:
Above we examined how the (expected) observed distance depended on the real
distance. We will now examine how the real distance can be estimated from the observed
distance. This is done by solving the above equation for ut, giving us an
expression that allows us to estimate the real amount of change as a function of the
observed change: ut = 0.75 * ln(1  1.33 * D_s)
Note that this correction will only work if the observed difference is
approximately as expected. Consider this: In the dicerolling simulation we found
that if there has been 0.67 changes per site then the expected observed difference is
44%. However, as you saw in the simulation, the actual observed difference can be
different from the expected 44% (say, 33% or 58%). If the observed difference is not
the same as the expected observed difference, then we will obviously also get the
wrong number even after correction.
We will now plot (estimated) real change as a function of observed difference (this is
the inverse of what you did before):
set xlab "Observed difference"
set ylab "Real distance"
plot [0:0.8] 0.75*log(11.33*x) tit "Estimated real distance", x tit "Observed distance"
(The function "log" means the natural logarithm in gnuplot). Note how the correction
becomes increasingly more important as the observed distance increases. Also note that
this correction does not allow the observed distance to rise above 0.75, although that
situation may arise in real data. Above 75% difference the corrected distance is not
defined. When using JC corrected distances for phylogenetic reconstruction, you should
therefore beware of this situation.
Q2: Use the equation above to estimate the actual distance if the observed
distance is 0.1, 0.4, and 0.6 respectively.
The Kimura 2 parameter model.
The Kimura 2 parameter model of evolution has the following rate matrix:
 A C G T 

A   b a b 
 
C  b  b a 
 
G  a b  b 
 
T  b a b  

Note how transitions (A <> G and C <> T) have a different rate than transversions
(A <> C, A <> T, C <> G, and G <> T).
Also note that the total rate of substitution per site is u = a + 2b. Based on
this matrix, the expected ratio of transitions to transversions is:
R = a/2b
Meaning that if transitions and transversions had the same rate (Jukes and Cantor), then
we would expect:
R = 0.5
Empirically, this is typically not the case. In fact one often sees R ≥ 2 and for
mitochondrial DNA a typical value is R=10 (meaning that a is 20 times higher than b)! We
will now use the gnuplot program to explore some features of evolution occurring
according to this model.
It can be shown (Felsenstein equation 13.2) that under the K2P model, the chance of
observing a transition and a transversion respectively depends on R and
t in the following way:
P(transition) = 0.25  0.5 * exp(A*t) + 0.25 * exp(B*t)
P(transversion) = 0.5  0.5 * exp(B*t)
where A = (2R1)/(R+1) and B = (2)/(R+1). Note that in these
equations we have chosen to measure time in suitable units such that the overall rate of
substitution (u = a + 2b) has the value 1 substitution per site per unit time. (An
example: if u is 10^9 substitutions per site per year, then we would choose to measure
time in billions of years, instead of in years. The substitution rate would now be 1
substitution per site per billion years). This means that the amount of change
accumulated during t time units simply is: D = 1 * t = t
Examine expected amount of change as a function of branch length:
set xlab "Real distance"
set ylab "Observed difference"
We will now examine how the expected amount of transitions and transversions
change with time when R=10. Fortunately, gnuplot can be used to compute and keep
track of the values of R, A, and B. In the gnuplot window enter the following:
R = 10
A = (2*R1.0)/(R+1.0)
B = (2)/(R+1.0)
You can check the computed values of A and B by using the print
command:
print A
print B
You should have obtained values of approximately A = 1.909 and B = 0.1818.
You can now plot the curves showing how the expected amount of transitions and
transversions change with time:
plot [0:40] [0:1] 0.250.5*exp(A*x)+0.25*exp(B*x) tit "Transitions", 0.5  0.5 * exp(B*x) tit "Transversions", 0.25 tit "y=0.25", 0.5 tit "y=0.50", 0.75 tit "y=0.75", 0.250.5*exp(A*x)+0.25*exp(B*x) + 0.5  0.5 * exp(B*x) tit "Total, observed difference"
Several interesting things are going on in this plot. First of all, note that I have
added a third curve showing the total observed difference. This is simply the sum of the
observed transitions and transversions.
Second, as was the case for the Jukes and Cantor model, the total observed difference
increases to a maximum value of 75% (corresponding to 25% similarity). This is again
caused by the fact that the rate matrix implicitly forces the equilibrium frequencies of
all four bases to be 25%. This is most easily seen from figure 13.1 in Felsenstein: the
total rate of change away from the purines (A and G) is the same as the total rate of
change to the purines. This means that after enough time has elapsed, there will be the
same chance that a site ends up as a purine or a pyrimidine. Since the rates within the
two categories are also symmetrical, these will also have equal frequencies. All in all
the equilibrium frequencies will end up at 25% for each base.
Third, note that the expected amount of transitional differences first rise rapidly
and then decline slowly to an equilibrium value of 0.25. Transversional differences rise
slowly to an equilibrium value of 0.5. The equilibrium values are determined by the fact
that when sufficient time has passed sequence similarities will essentially be random;
since there are twice as many possible transversions as transitions, these will in the
end make up two thirds of all observed changes. Early on, before this stage is reached,
the much higher rate of transitions will cause them to make up the vast majority of all
observed changes, and only after considerable time has elapsed will the transversions
catch up.
Q3: From the plot, estimate the real distance (xaxis) at which the
transition and transversion lines cross.
Experiment with other transition/transversion rate ratos:
The exact behaviour of the relationship between the two types of change
depends on the relative rates of transition and transversion. You should now
repeat the above analysis with (1) R=2, and (2) R=0.5. Remember to recompute A
and B after entering the new value of R (You will probably find it useful that
previously entered gnuplot commands can be recalled by pressing the uparrow).
Recall that R=0.5 means that transitions and transversion occur with the
same rate, a=b. For each of these two cases rerun the plot command and consider
the changes. Make sure to include the plots in your report.
Q4: If possible, estimate the real distance (xaxis) at which the
transition and transversion lines cross.
Q5: When R=0.5, the Kimura 2 parameter model is in fact equivalent to
another model  which one? And why?
Examine how apparent transition/transversion ratio changes with time:
The apparent transition transversion ratio is simply the observed number of
transitions divided by the observed number of transversions. The following plot
command shows this number as a function of time for the case R=10 (I have simply
taken the expression for observed transitions and divided it by the expression
for observed transversions):
set ylab "Observed transition/transversion ratio"
plot [0:4] (0.250.5*exp(1.909*x)+0.25*exp(0.1818*x))/ (0.5  0.5 * exp(0.1818*x)) notit
Note how the apparent ratio is close to the real ratio (R=10) when not much change
has occurred (i.e., for small x).
Q6: What value does the apparent transition/transversion ratio approach
asymptotically? (You will need to plot with a wider xrange to see this).
The model of evolution that we have explored here is not a particularly complicated one
 in fact it only has two free parameters. Nevertheless, you will by now appreciate that
it is capable of displaying fairly unintuitive behaviour. Stating our hypothesis about
this biological system in explicit mathematical terms is what allowed us to explore this
thoroughly.
Analysis of mitochondrial data set
In this part of the exercise we will explore a real mitochondrial data set containing
sequences from man, chimpanzee, gorilla, orangutan, and gibbon. We will investigate how
the use of different models of evolution affects the estimated distance matrix. Since
mitochondrial DNA is known to have very different transition and transversion rates, we
will pay special attention to this aspect.
Prepare editor window:
nedit titv.data &
This file contains a header line and a column listing the estimated divergence time
between man and each of the other four primates (in millions of years). These estimates are
associated with a fair amount of uncertainty, but the implied branching order is almost
certainly correct. You will be using this file for entering various measures that
you compute from the data file.
Q7: Include the table in your report. When more values are determined below, you
should make sure to enter them both in your report AND in the file (which we will use for
plotting later).
Start PAUP*, load data file:
paup primatemitDNA.nexus
Define outgroup:
outgroup gibbon
Activate outgroup rooting and select how tree will be printed:
set root=outgroup outroot=monophyl
Select distancebased treereconstruction:
set criterion=distance
Select uncorrected distances under the least squares criterion with inversesquare weighting:
dset distance=p objective=lsfit power=2
Short digression on PAUP* online help system:
We interrupt this exercise for a brief announcement: By now you should be familiar with
many of the commands used in PAUP, but you probably do not have an overview of the long
list of possible options that can be specified. Fortunately, PAUP has a command that is
useful in this context:
dset ?
Here I have used dset as an example, but typing any command followed by a
question mark ("?") will give you a list of all the possible options for that command,
along with a list of the current values. This is very useful if you want to experiment
with different settings in an analysis. When you want to learn more about the individual
settings, you can download the command reference in PDF from this site:
PAUP* command reference document
One final thing that may be good to know: PAUP* accepts abbreviated commands as long as
the abbreviation is unambiguous. That means you can for instance write set
crit=dist instead of the full set criterion=distance, and desc instead of
describetrees.
Construct least squares tree:
alltrees
Inspect tree:
describetrees all/plot=phylogram
This tree reflects our current belief about how these organisms are related.
Print distance matrix, note distances from human:
showdist
The showdist command lists the distance matrix computed according to the
currently active distancesetting (as specified in the dset command above).
Q8: Copy the entries giving the pdistance between human and each of the other
four primates into the proper place in the titv.data file and into the
corresponding table in your report. (The numbers should all be in a single column under
the "p_dist" header).
Select uncorrected distances, counting only transitions:
dset subst=ti
The option subst=ti specifies that only transitional substitutions should be
counted. The previously issued "distance=p" is still the active setting. You can
verify this by typing "dset ?" and checking the value listed for
distance.
Print distance matrix, note transitional distances from human:
showdist
In this distance matrix only the transitions have been counted for each pair of taxa.
Q9: Again find the numbers for human vs. everything else, and note the numbers in
the column labeled "Transitions(P)" in the file and in your report.
Select uncorrected distances, counting only transversions:
dset subst=tv
Print distance matrix, note transversional distances from human:
showdist
Q10: Enter the distances from everything to human in the column labeled
Transversions(Q).
Compute JCcorrected distances:
As we saw above, it is possible to come up with modelbased corrections for the effect of multiple
substitutions that allow us to estimate the real amount of change from the observed amount of change.
Q11: For each of the four lines in the titv.data file, and based on the
numbers in the column labeled p_dist, compute the JCcorrected distance using equation 11.18
in Felsenstein's book. Enter the results in the column labeled "JC" in the titv.data
file and in your report. Note: it is possible to use the dset and showdist commands
to do this work for you. Feel free to do that (but perform at least one manual computation
to ensure you know how to use the formula. Include this computation in the report).
IMPORTANT: In case you
do use this approach, make sure to first issue the command dset subst=all so you are
using all substitutions for the computaiton (not just the transitions).
Compute K2P corrected distance:
As was the case for the JC model, we can also compute estimated real distances under the K2P model. This can
be done using the first equation 13.3 in Felsenstein's book.
Q12: Using the numbers in columns P and Q, you should now use this equation to
compute the K2Pcorrected distance estimates. Enter the results in the column labeled K2P
(in the file as well as in your report). Make sure to save the file after all results
have been entered. Note: it is again possible to use the dset and showdist commands
to do this work for you. Feel free to do that (but perform at least one manual computation
to ensure you know how to use the formula. Include this computation in the report)
Plot distances:
First, open a second shell window on padawan and cd to the models directory.
cd models
gnuplot
set xlab "Time since divergence (MY)"
set ylab "Genetic distance (substitutions/site)"
plot 'titv.data' u 2:3 tit "pdistance" w linespoi, 'titv.data' u 2:4 tit "Transitions (P)" w linespoi, 'titv.data' u 2:5 tit "Transversions (Q)" w linespoi, 'titv.data' u 2:6 tit "JC" w linespoi, 'titv.data' u 2:7 tit "K2P" w linespoi
We have here plotted the total difference, the observed transitional and transversional
difference, as well as the JC and K2Pcorrected distances as a function of estimated
divergence times.
Q13: Include the plot in your report. Do the two different correction schemes result
in the same estimates of the real distance?
