Models of Evolution

Exercise written by: Anders Gorm Pedersen


In today's exercise we will explore a number of different, but closely related, models of evolution. Using such models it is possible to estimate the number of unseen mutational events and thereby obtain genetic distances that have been corrected for superimposed substitutions. It is, however, important to realize that these corrections are based on the assumption that we observe approximately the expected amount of change - if, for instance, 20 mutational events end up leading to no observable changes then it is impossible to guess the actual amount of change regardless of which correctional scheme we employ. Using more and longer sequences helps ensuring that the observed change is closer to the expected change, so the correction is more likely to be accurate with adequate amounts of data. As we will see later, the same models also play an important role in phylogenetic reconstruction based on maximum likelihood and Bayesian techniques.

Getting started, biological background.

  1. Log in to your Unix shell account on "padawan".

    Make sure to maximize the window: the analyses we will perform give lots of output to the screen, so having a nice and large shell window makes it easier to keep track of what happens.

  2. Construct todays working directory:

    mkdir models
  3. Change to today's working directory

    cd models
  4. Copy files for today's exercise:

    cp ~gorm/models/ ./
    cp ~gorm/models/ ./
  5. Inspect sequence file:


    This file contains an aligned set of mitochondrial DNA sequences from man, chimpanzee, gorilla, orangutan and gibbon. Mitochondria are cellular organelles that are bounded by a lipid membrane and contain their own genome. Mitochondrial DNA is related to certain bacterial genomes, and it is believed that the original mitochondrium was a primitive bacterial cell that was engulfed by an early ancestor of eukaryotic cells and that the pair subsequently went on to form a constant symbiotic relationship.

    Mitochondrial DNA has a higher rate of substitution than nuclear DNA. This makes it useful for investigating phylogenetic relationships between closely related species, such as the five primates included in the present data set.

  6. Inspect additional data file:


    This file contains a single header line and one column of numbers giving estimated times of divergence between man and chimpanzee, man and gorilla, man and orangutan, and man and gibbon. (Divergence times are in millions of years). This file will be used later in the exercise when we investigate how various distance measures increase over time.

The Jukes and Cantor model.

The Jukes and Cantor model of evolution has the following rate matrix:

  | A   C   G   T |
A | -   a   a   a | 
  |		  |
C | a   -   a   a |
  |		  |
G | a   a   -   a |
  |		  |
T | a   a   a   - |

The total rate of substitution per site is u = 3a. We will now use the gnuplot program to explore some features of evolution occurring according to this model.

  1. Start gnuplot program:


    You have previously used the gnuplot program to plot data sets from a file. The program can also be used to plot functions that are given in symbolic form.

  2. Examine expected amount of change as a function of branch length:

    For the Jukes and Cantor model the following equation gives the probability of observing a change at a given site, expressed as a function of the substitution rate and the elapsed time: D_s = 0.75 * (1 - exp(-1.33 * ut))

    Where u is the substitution rate and t is the elapsed time. This is also the expected fraction of sites showing observable change along a branch where there has in fact been ut amount of change (if any single site has probability P of changing, then on average P x L sites will have changed in a sequence of length L). We will now explore how the expected amount of observable change depends on the actual amount of change:

    set xlab "Branch length"
    set ylab "Observed difference"
    plot [0:10] [0:1] 0.75*(1-exp(-1.33*x)) tit "Exp. observable differences" , x tit "Real distance", 0.75 tit "y=0.75"

    The plot command above has to be entered on a single line! In this expression, x represents the product ut, meaning the actual amount of change that has occurred (the branch length). The curve we have plotted thus gives the expected observed difference as a function of the actual amount of change.

    Q1: Include the plot in your reply.

    You will notice that the observed change increases from zero to an equilibrium value of 0.75. According to the Jukes and Cantor model, sequences can therefore become no more than 75% different, which is the same as being 25% identical. Twenty-five percent identity is of course the similarity we would get from constructing two random sequences containing 25% of each base and simply placing them side by side. You will also note that for low values of x, the observable difference rises almost linearly. Try to inspect this in greater detail:

    plot [0:1] [0:1] 0.75*(1-exp(-1.33*x)) tit "Exp. observable differences" , x tit "Real distance", 0.75 tit "y=0.75"

    Note how the expected observed difference is very close to the real distance up to about x=0.1 changes per site. This means there is only a limited effect of JC corrections when a pair of sequences differ at less than 10% of their sites.

  3. Examine estimated branch length as a function of observed difference:

    Above we examined how the (expected) observed distance depended on the real distance. We will now examine how the real distance can be estimated from the observed distance. This is done by solving the above equation for ut, giving us an expression that allows us to estimate the real amount of change as a function of the observed change: ut = -0.75 * ln(1 - 1.33 * D_s)

    Note that this correction will only work if the observed difference is approximately as expected. Consider this: In the dice-rolling simulation we found that if there has been 0.67 changes per site then the expected observed difference is 44%. However, as you saw in the simulation, the actual observed difference can be different from the expected 44% (say, 33% or 58%). If the observed difference is not the same as the expected observed difference, then we will obviously also get the wrong number even after correction.

    We will now plot (estimated) real change as a function of observed difference (this is the inverse of what you did before):

    set xlab "Observed difference"
    set ylab "Real distance"
    plot [0:0.8] -0.75*log(1-1.33*x) tit "Estimated real distance", x tit "Observed distance"

    (The function "log" means the natural logarithm in gnuplot). Note how the correction becomes increasingly more important as the observed distance increases. Also note that this correction does not allow the observed distance to rise above 0.75, although that situation may arise in real data. Above 75% difference the corrected distance is not defined. When using JC corrected distances for phylogenetic reconstruction, you should therefore beware of this situation.

    Q2: Use the equation above to estimate the actual distance if the observed distance is 0.1, 0.4, and 0.6 respectively.

The Kimura 2 parameter model.

    The Kimura 2 parameter model of evolution has the following rate matrix:

      | A   C   G   T |
    A | -   b   a   b | 
      |		  |
    C | b   -   b   a |
      |		  |
    G | a   b   -   b |
      |		  |
    T | b   a   b   - |

    Note how transitions (A <--> G and C <--> T) have a different rate than transversions (A <--> C, A <--> T, C <--> G, and G <--> T). Also note that the total rate of substitution per site is u = a + 2b. Based on this matrix, the expected ratio of transitions to transversions is:

    R = a/2b

    Meaning that if transitions and transversions had the same rate (Jukes and Cantor), then we would expect:

    R = 0.5

    Empirically, this is typically not the case. In fact one often sees R ≥ 2 and for mitochondrial DNA a typical value is R=10 (meaning that a is 20 times higher than b)! We will now use the gnuplot program to explore some features of evolution occurring according to this model.

    It can be shown (Felsenstein equation 13.2) that under the K2P model, the chance of observing a transition and a transversion respectively depends on R and t in the following way:

    P(transition) = 0.25 - 0.5 * exp(A*t) + 0.25 * exp(B*t)
    P(transversion) = 0.5 - 0.5 * exp(B*t)

    where A = (-2R-1)/(R+1) and B = (-2)/(R+1). Note that in these equations we have chosen to measure time in suitable units such that the overall rate of substitution (u = a + 2b) has the value 1 substitution per site per unit time. (An example: if u is 10^-9 substitutions per site per year, then we would choose to measure time in billions of years, instead of in years. The substitution rate would now be 1 substitution per site per billion years). This means that the amount of change accumulated during t time units simply is: D = 1 * t = t

  1. Examine expected amount of change as a function of branch length:

    set xlab "Real distance"
    set ylab "Observed difference"

    We will now examine how the expected amount of transitions and transversions change with time when R=10. Fortunately, gnuplot can be used to compute and keep track of the values of R, A, and B. In the gnuplot window enter the following:

    R = 10
    A = (-2*R-1.0)/(R+1.0)
    B = (-2)/(R+1.0)

    You can check the computed values of A and B by using the print command:

    print A
    print B

    You should have obtained values of approximately A = -1.909 and B = -0.1818. You can now plot the curves showing how the expected amount of transitions and transversions change with time:

    plot [0:40] [0:1] 0.25-0.5*exp(A*x)+0.25*exp(B*x) tit "Transitions", 0.5 - 0.5 * exp(B*x) tit "Transversions", 0.25 tit "y=0.25", 0.5 tit "y=0.50", 0.75 tit "y=0.75", 0.25-0.5*exp(A*x)+0.25*exp(B*x) + 0.5 - 0.5 * exp(B*x) tit "Total, observed difference"

    Several interesting things are going on in this plot. First of all, note that I have added a third curve showing the total observed difference. This is simply the sum of the observed transitions and transversions.

    Second, as was the case for the Jukes and Cantor model, the total observed difference increases to a maximum value of 75% (corresponding to 25% similarity). This is again caused by the fact that the rate matrix implicitly forces the equilibrium frequencies of all four bases to be 25%. This is most easily seen from figure 13.1 in Felsenstein: the total rate of change away from the purines (A and G) is the same as the total rate of change to the purines. This means that after enough time has elapsed, there will be the same chance that a site ends up as a purine or a pyrimidine. Since the rates within the two categories are also symmetrical, these will also have equal frequencies. All in all the equilibrium frequencies will end up at 25% for each base.

    Third, note that the expected amount of transitional differences first rise rapidly and then decline slowly to an equilibrium value of 0.25. Transversional differences rise slowly to an equilibrium value of 0.5. The equilibrium values are determined by the fact that when sufficient time has passed sequence similarities will essentially be random; since there are twice as many possible transversions as transitions, these will in the end make up two thirds of all observed changes. Early on, before this stage is reached, the much higher rate of transitions will cause them to make up the vast majority of all observed changes, and only after considerable time has elapsed will the transversions catch up.

    Q3: From the plot, estimate the real distance (x-axis) at which the transition and transversion lines cross.

  2. Experiment with other transition/transversion rate ratos:

    The exact behaviour of the relationship between the two types of change depends on the relative rates of transition and transversion. You should now repeat the above analysis with (1) R=2, and (2) R=0.5. Remember to recompute A and B after entering the new value of R (You will probably find it useful that previously entered gnuplot commands can be recalled by pressing the up-arrow). Recall that R=0.5 means that transitions and transversion occur with the same rate, a=b. For each of these two cases rerun the plot command and consider the changes. Make sure to include the plots in your report.

    Q4: If possible, estimate the real distance (x-axis) at which the transition and transversion lines cross.

    Q5: When R=0.5, the Kimura 2 parameter model is in fact equivalent to another model - which one? And why?

  3. Examine how apparent transition/transversion ratio changes with time:

    The apparent transition transversion ratio is simply the observed number of transitions divided by the observed number of transversions. The following plot command shows this number as a function of time for the case R=10 (I have simply taken the expression for observed transitions and divided it by the expression for observed transversions):

    set ylab "Observed transition/transversion ratio"
    plot [0:4] (0.25-0.5*exp(-1.909*x)+0.25*exp(-0.1818*x))/ (0.5 - 0.5 * exp(-0.1818*x)) notit

    Note how the apparent ratio is close to the real ratio (R=10) when not much change has occurred (i.e., for small x).

    Q6: What value does the apparent transition/transversion ratio approach asymptotically? (You will need to plot with a wider x-range to see this).

The model of evolution that we have explored here is not a particularly complicated one - in fact it only has two free parameters. Nevertheless, you will by now appreciate that it is capable of displaying fairly un-intuitive behaviour. Stating our hypothesis about this biological system in explicit mathematical terms is what allowed us to explore this thoroughly.

Analysis of mitochondrial data set

    In this part of the exercise we will explore a real mitochondrial data set containing sequences from man, chimpanzee, gorilla, orangutan, and gibbon. We will investigate how the use of different models of evolution affects the estimated distance matrix. Since mitochondrial DNA is known to have very different transition and transversion rates, we will pay special attention to this aspect.

  1. Prepare editor window:

    nedit &

    This file contains a header line and a column listing the estimated divergence time between man and each of the other four primates (in millions of years). These estimates are associated with a fair amount of uncertainty, but the implied branching order is almost certainly correct. You will be using this file for entering various measures that you compute from the data file.

    Q7: Include the table in your report. When more values are determined below, you should make sure to enter them both in your report AND in the file (which we will use for plotting later).

  2. Start PAUP*, load data file:

  3. Define outgroup:

    outgroup gibbon
  4. Activate outgroup rooting and select how tree will be printed:

    set root=outgroup outroot=monophyl
  5. Select distance-based tree-reconstruction:

    set criterion=distance
  6. Select uncorrected distances under the least squares criterion with inverse-square weighting:

    dset distance=p objective=lsfit power=2
  7. Short digression on PAUP* online help system:

    We interrupt this exercise for a brief announcement: By now you should be familiar with many of the commands used in PAUP, but you probably do not have an overview of the long list of possible options that can be specified. Fortunately, PAUP has a command that is useful in this context:

    dset ?

    Here I have used dset as an example, but typing any command followed by a question mark ("?") will give you a list of all the possible options for that command, along with a list of the current values. This is very useful if you want to experiment with different settings in an analysis. When you want to learn more about the individual settings, you can download the command reference in PDF from this site:

    PAUP* command reference document

    One final thing that may be good to know: PAUP* accepts abbreviated commands as long as the abbreviation is unambiguous. That means you can for instance write set crit=dist instead of the full set criterion=distance, and desc instead of describetrees.

  8. Construct least squares tree:

  9. Inspect tree:

    describetrees all/plot=phylogram

    This tree reflects our current belief about how these organisms are related.

  10. Print distance matrix, note distances from human:


    The showdist command lists the distance matrix computed according to the currently active distance-setting (as specified in the dset command above).

    Q8: Copy the entries giving the p-distance between human and each of the other four primates into the proper place in the file and into the corresponding table in your report. (The numbers should all be in a single column under the "p_dist" header).

  11. Select uncorrected distances, counting only transitions:

    dset subst=ti

    The option subst=ti specifies that only transitional substitutions should be counted. The previously issued "distance=p" is still the active setting. You can verify this by typing "dset ?" and checking the value listed for distance.

  12. Print distance matrix, note transitional distances from human:


    In this distance matrix only the transitions have been counted for each pair of taxa.

    Q9: Again find the numbers for human vs. everything else, and note the numbers in the column labeled "Transitions(P)" in the file and in your report.

  13. Select uncorrected distances, counting only transversions:

    dset subst=tv
  14. Print distance matrix, note transversional distances from human:


    Q10: Enter the distances from everything to human in the column labeled Transversions(Q).

  15. Compute JC-corrected distances:

    As we saw above, it is possible to come up with model-based corrections for the effect of multiple substitutions that allow us to estimate the real amount of change from the observed amount of change.

    Q11: For each of the four lines in the file, and based on the numbers in the column labeled p_dist, compute the JC-corrected distance using equation 11.18 in Felsenstein's book. Enter the results in the column labeled "JC" in the file and in your report. Note: it is possible to use the dset and showdist commands to do this work for you. Feel free to do that (but perform at least one manual computation to ensure you know how to use the formula. Include this computation in the report). IMPORTANT: In case you do use this approach, make sure to first issue the command dset subst=all so you are using all substitutions for the computaiton (not just the transitions).

  16. Compute K2P corrected distance:

    As was the case for the JC model, we can also compute estimated real distances under the K2P model. This can be done using the first equation 13.3 in Felsenstein's book.

    Q12: Using the numbers in columns P and Q, you should now use this equation to compute the K2P-corrected distance estimates. Enter the results in the column labeled K2P (in the file as well as in your report). Make sure to save the file after all results have been entered. Note: it is again possible to use the dset and showdist commands to do this work for you. Feel free to do that (but perform at least one manual computation to ensure you know how to use the formula. Include this computation in the report)

  17. Plot distances:

    First, open a second shell window on padawan and cd to the models directory.

    cd models
    set xlab "Time since divergence (MY)"
    set ylab "Genetic distance (substitutions/site)"
    plot '' u 2:3 tit "p-distance" w linespoi, '' u 2:4 tit "Transitions (P)" w linespoi, '' u 2:5 tit "Transversions (Q)" w linespoi, '' u 2:6 tit "JC" w linespoi, '' u 2:7 tit "K2P" w linespoi

    We have here plotted the total difference, the observed transitional and transversional difference, as well as the JC- and K2P-corrected distances as a function of estimated divergence times.

    Q13: Include the plot in your report. Do the two different correction schemes result in the same estimates of the real distance?