Events News Research CBS CBS Publications Bioinformatics
Staff Contact About Internal CBS CBS Other

Bootstrapping

Exercise written by: Anders Gorm Pedersen


Overview

In phylogenetic analysis (and everywhere else for that matter) it is important to be able to assess the uncertainty one has about estimated parameter values. In previous classes we have discussed how quantification of uncertainty about parameter values is an important aspect of the Bayesian approach, but also outside the Bayesian paradigm it is useful to assess uncertainty.

One way of empirically assessing the uncertainty is to collect several independent data sets, and estimate the parameter from each data set. (For random reasons different data sets will give slightly different estimates of a parameter value). There are, however, problems where collecting large amounts of new data is not feasible. The goal of bootstrapping is precisely to find out how large the stochastic variation is, without actually collecting new data sets. (Hence the name "bootstrap": we are pulling ourselves up by the straps of our boots). In bootstrapping this variation is instead approximated by using the original data set as the basis for constructing many "new" data sets that are then used to estimate the parameter. This provides an estimate of how certain we are about the values of our parameters.

Knowing the amount of stochastic variation in an estimated parameter value is obviously useful in its own right (think "confidence interval"). However, it also forms the basis for (frequentist) statistical testing of whether different estimates are significantly different. For instance, fitting a K2P model to a data set will always result in a better likelihood than fitting a JC model (since the JC model is nested within K2P, and K2P has an extra model parameter). But in order to find out whether a K2P model fits a data set significantly better than JC, we have to know something about the K2P fits we can expect to get for random reasons (i.e., when JC is in fact the correct model). Specifically, we have to know the distribution of the difference between the log likelihoods of the two models, when the null model is correct.

Think about that last statement for a minute before proceeding...

As we saw last week, twice the log likelihood difference will in fact follow a chi-squared distribution when models are nested and the null model is true. When the models are not nested in the proper manner we have to determine the distribution empirically, typically by bootstrapping.

We will mainly use programs from the PAML package for the exercise: evolver for simulating data sets, and basemlfor fitting models to simulated (and real) data. The procedures involved in performing parametric bootstrapping are a bit tedious and time-consuming, so we will not look at many examples.


Getting started.

  1. Log in to your Unix shell account on "padawan" and construct todays working directory:

    mkdir bootstrap
    ls -l
  2. Change to today's working directory

    cd bootstrap
  3. Copy files for today's exercise:

    cp ~gorm/bootstrap/primatemitDNA.nexus ./primatemitDNA.nexus
    cp ~gorm/bootstrap/primatemitDNA.phylip ./primatemitDNA.phylip
    cp ~gorm/bootstrap/primatetree.ph ./primatetree.ph
    cp ~gorm/bootstrap/MCbase.dat ./MCbase.dat
    cp ~gorm/bootstrap/baseml.ctl ./baseml.ctl

    You have analyzed the data file previously in this course. The remaining files will all be used to control aspects of how the PAML programs deal with the data.


(Non-parametric) Bootstrapping: Clade-support

Briefly, the idea in ordinary (or non-parametric) bootstrapping is to construct new data sets by sampling with replacement from the initial data set. Specifically, the process is as follows: Imagine you have an alignment with 130 columns. Bootstrapping is then performed by randomly picking one of these columns 130 times in a row, in a way so that the same column may be chosen more than once. This gives a permuted dataset having the same size as the original data set, and where some columns are present multiple times, while some columns are absent. This is repeated a large number of times (usually 500-1000), so that several of the permuted data sets are constructed. Finally, model parameters can be estimated from each of these data sets (referred to as "bootstrap replicates") and the uncertainty assessed. In this exercise we will focus on topology (clade support), but the technique can be used for other types of parameters also.

  1. Start paup:

    paup
  2. Load data file:

    execute primatemitDNA.nexus

    This data set contains mitochondrial DNA from 5 different primates.

  3. Define outgroup:

    outgroup Gibbon
  4. Activate outgroup rooting and select how tree will be printed:

    set root=outgroup outroot=monophyl
  5. Select distance criterion:

    set criterion=distance
  6. Select K2P model-corrected distances with gamma distributed rates:

    dset distance=k2p rates=gamma
  7. Bootstrap neighbor joining tree:

    bootstrap search=nj nreps=1000 grpfreq=no

    This command will use neighbor joining on a set of 1000 bootstrap replicates to investigate the support for different branches in the tree. It will take a short while to finish. What happens is that from each of the derived data sets a NJ tree is built using the current settings (i.e., we are still using the K2P + gamma parameters estimated above). A consensus tree is then built from the 1000 resulting trees. The frequency with which any bipartition occurs gives some idea about how well supported it is by the data. This is NOT a number with any clear statistical meaning, but as a rule of thumb values over 85% are fairly good evidence for a grouping, while you should probably be sceptical about values below approximately 70%.

    Q1: Which bipartitions are in the tree, and what is the support for each of them?

  8. Perform maximum likelihood bootstrap

    set criterion=likelihood
    lset nst=2 tratio=estimate rates=gamma
    bootstrap search=heuristic nreps=1000 grpfreq=no

    Q2: Which bipartitions are in this tree, are they the same as in the NJ tree, and what is the support for each of them?

    It is also possible to do bootstrapping using other criteria such as here where we have used likelihood (again using a K2P model with gamma distributed rates). As you can see, this is quite a bit more time consuming (about 3-4 minutes for this data set on padawan), so it is sometimes impractical to perform a full bootstrap under likelihood. In such cases you can save time by fixing the model parameters before the bootstrap. A useful alternative might also be to use Bayesian phylogeny estimation, since this automatically gives you clade-support values as part of the process. Depending on your statistical paradigmatic preferences, this can also be argued to have the benefit that Bayesian support values are actual probabilities.

    Bootstrapping can also be done in many other programs (ClustalX is an example), but PAUP provides a much wider choice of criteria and model selection opportunities so it is clearly preferable for most data sets.

  9. Quit program:

    quit


Parametric Bootstrapping: Comparing models of evolution

The main idea in parametric bootstrapping is to use a full model of evolution to construct random sequences. These sequences are constructed by evolving a random start sequence along a tree according to a set of specified substitution parameters. (hence the name "parametric").

Parametric bootstrapping can be used to compare alternative models of evolution and to find out whether the more complicated model is significantly better than the simpler (null) model. The approach is as follows: (1) Fit the null model to the data, note the log likelihood, also note the estimated tree and parameter values. (2) Fit the alternate model to the data, note the likelihood. Compute the difference in likelihoods. (3) Using the tree and other parameters estimated under the null model, construct a large (>1000 perhaps) set of simulated data sets by "evolving" random sequences on the tree according to the fitted substitution parameters. These simulated data are representative of how data sets would look IF the null model was correct. (4) For each of the simulated data sets, fit the null and alternative models to the data and compute the difference in likelihoods. This gives us a distribution of the log likelihood differences that we could expect by chance, even if the null model was true. (5) Compare the observed difference in log likelihoods, to the empirical null-distribution of log likelihood differences in order to see whether the observed difference is significantly larger than expected by chance. (If the difference on the real data is similar to the typical difference on the simulated data, then it seems that the null model might be true. If the real difference is much much higher than the average random difference, then we might infer that the complicated model is indeed better.)

(Note: An alternative way to compare non-nested models is the AIC approach that we discussed last week).
  1. Inspect data set:

    nedit primatemitDNA.phylip

    This file contains the same sequences as primatemitDNA.nexus, but in phylip format, which is required by the PAML programs.

  2. Inspect previously constructed tree:

    nedit primatetree.ph

    The programs in the PAML package that we will use in this exercise, are not well suited for tree-reconstruction so I have supplied you with one. (Normally you would first use PAUP or MrBayes to find a tree and then move on to this step).

  3. Set parameters in control file:

    nedit baseml.ctl

    Unlike PAUP, the PAML programs are not run interactively. Instead you have to list all settings in a "control file" prior to starting the program. This control file is used to set parameters for the baseml program that is used for fitting nucleotide substitution models. The control file has one parameter per line; a parameter name is followed by an equals sign and then the value of the parameter; lines starting with an asterisk are "commented out" and are ignored.

    You should set the following parameters:

    seqfile =  primatemitDNA.phylip 
    treefile = primatetree.ph
    outfile = primate.resultfile
    model = 4
    fix_alpha = 1
    alpha = 0
    

    These settings ensure that you will use the sequence file primatemitDNA.nexus, the tree file primatetree.ph, the output file primate.resultfile, the model that will be used for fitting is HKY ("number 4" in baseml), and you will not use different (gamma distributed) rates at different sites (alpha = 0). Save the file, quit nedit and proceed to running the model fitting.

  4. Fit HKY model to data:

    baseml baseml.ctl

    If all settings were entered correctly, baseml should now run using the settings in baseml.ctl.

  5. Inspect result file:

    When the run has finished (after 1 second or so) you can inspect the result file:

    nedit primate.resultfile &

    Q3: Note the following things: (1) The average nucleotide frequencies. (2) The log likelihood (the lnL is written on a line starting "lnL(ntime:" - it is the negative number right after the parenthesis). (3) The tree with branch lengths, in Newick (parenthesis) format. (4) The free rate matrix parameters (there is one free parameter in this case). This is noted below a line that starts like this: "Parameters in the rate matrix (HKY85)".

    Q4: What is the meaning of the rate parameter in this model?

  6. Prepare simulator program:

    Above, you fitted our null model (HKY) to the data, and got estimates of all parameter values. In a short while, we want to use these estimates as the basis for constructing simulated data sets. You should now open the control file for the the sequence simulator program and change the settings accordingly:

    nedit MCbase.dat &

    This file starts with a set of uncommented parameter values. It is NOT a user friendly format, but you should be able to figure it out: specifically, you will find an explanation of the format at the end of the file. You should change the following parameters: (1) Number of sequences should be set to the same as the number in the data set. (2) Number of nucleotide sites should be set to the length of the input data. (3) Number of replicates should be set to 1000. (4) The tree from your result file should be inserted instead of the tree that is already there. Make sure to get the tree WITH branch lengths. Also make sure to keep the blank line after the tree, or evolver wont run (!) (5) Model should be set to 4 (meaning HKY). (6) "kappa or rate parameters in model" should be set to the rate parameter value you found for HKY above. (7) "alpha" (the gamma shape parameter) should be set to zero. (8) The average base frequencies from above should also be entered.

    You have now prepared the simulator to run with the parameters of the null model (HKY). We will return to simulation shortly, but first you should close this window and move on to the fitting of the alternate model to the real data.

  7. Set GTR parameters in control file:

    nedit baseml.ctl

    You should now set the model parameter so it indicates the GTR model:

    model = 7
  8. Fit GTR model to data:

    baseml baseml.ctl
  9. Inspect result file:

    When the run has finished (after 1 second or so) you can inspect the result file:

    nedit primate.resultfile

    Q5: This time you should just note the log likelihood of the fit. Is it better or worse than the HKY fit?

  10. Construct simulated data sets:

    evolver 5 MCbase.dat

    This starts the program evolver, which will now construct 1000 simulated data sets of 5 sequences, each 896 nucleotides long. The sequences will be evolved on the tree you found above, and using the base frequencies and substitution parameter estimated above also. These 1000 data sets therefore represent the type of data we would collect if sequences had in fact evolved according to an HKY model with the specified parameters. The simulator will take a short while to finish.

  11. Inspect simulated data:

    nedit mc.paml

    Q6: The 1000 simulated data sets have been collected in a single file. Notice that all replicates are different: Note the first ten nucleotides from the "Human" sequence in the first two replicates.

  12. Prepare analysis of simulated data using HKY:

    nedit baseml.ctl

    You are now ready to analyze the simulated data using our two alternative models. First, you will prepare the control file for analyzing the data using the HKY model. Specifically, You should set the following parameters:

    seqfile =  mc.paml 
    model = 4
    ndata = 1000
    

    The last line is important: it tells baseml that the input data file has 1000 data sets which should be analyzed. NOTE: you have to delete the leading asterisk "*" before this option is activated!!!!!!

  13. Analyze simulated data using HKY:

    baseml baseml.ctl
  14. Inspect result file:

    nedit primate.resultfile

    The primate.resultfile file now contains a summary of parameter estimates for all 1000 data sets. Here, we are only interested in the log-likelihods.

    Q7: Note the fit (the log likelihood) of the model to the first two data sets (recall that the lnL value is given on the line starting "lnL(ntime:").

  15. Extract all log-likelihoods from result file:

    cat primate.resultfile | grep 'lnL(ntime:' | gawk '{print $5}' > HKY.lnL

    This command extracts HKY log-likelihood values for all 1000 data sets. (Specifically, the grep command extracts all lines starting with "lnL(ntime:", and the gawk command then prints the fifth field - the lnL value - on that line. Finally all the lnL values are written to the file HKY.lnl).

  16. Prepare analysis of simulated data using GTR:

    nedit baseml.ctl

    You will now prepare the control file for analyzing the data using the GTR model (the alternate model). Specifically, You should set the following parameters:

    seqfile =  mc.paml 
    model = 7
    ndata = 1000
    

    Most of this was presumably already set to the correct values.

  17. Analyze simulated data using the GTR model:

    baseml baseml.ctl

    The GTR model is more complicated, so this step will take a bit longer than the HKY step above (probably about twice as long).

    Q8: Again note the fit (the log likelihood) of the model to the first two data sets.

  18. Extract all log-likelihoods from result file:

    cat primate.resultfile | grep 'lnL(ntime:' | gawk '{print $5}' > GTR.lnL
  19. Compute likelihood difference:

    Q9: At the beginning of this exercise you found the log likelihood of the HKY model and of the GTR model, on the real data set. Now, compute twice the difference between these numbers and write it down. We now want to find out if this difference is significant.

  20. Determine null-distribution of likelihood difference:

    The idea with the simulation exercise above was to find out how large a log likelihood difference one can get by chance between the two models. (Recall, that the data was simulated according to an HKY model, and we have then fitted both HKY and GTR models to this data). We now want to compute the distribution of these random differences:

    paste GTR.lnL HKY.lnL | gawk '{print 2*($1-$2)}' > lnL-diff.txt

    This command computes twice the log-likelihood difference for all pairs of fitted HKY and GTR models, and saves the numbers to a file.

  21. Inspect distribution:

    histo -V -orientation=landscape -w 0.2 -b 0 -t 50 lnL-diff.txt

    This shows the distribution of likelihood differences (between HKY and GTR) that can arise by chance even when the sequences in fact evolve according to HKY. Where in the distribution is your observed value approximately? (Recall that we are here looking at two times the difference in log likelihood, so your original difference should be mutiplied by two).

    As you will recall, we last time mentioned that if the models were nested, then this distribution should be a chi-squared distribution (with 4 degrees of freedom in this case, since GTR has 4 extra parameters compared to HKY). Check with this web-site whether that seems to be about right:

    Chi-squared distribution

    Q10: Include screen dumps of both the empirical distribution (your lnL data) and the theoretical chi-squared curve. Do the distributions look similar?

  22. Determine significance of observed difference:

    You now have a distribution of 1000 simulated log likelihood differences. To determine the p-value you need to figure out how many of these random differences that are larger than your observed value. To do that we first sort the simulated numbers:

    cat lnL-diff.txt | sort -n

    Q11: Count how many numbers in the simulated distribution that are larger than your observed value (where "observed value" means your observed lnL-difference times two). Then divide this number by the total number of values in the distribution (1000). This is your p-value. Is it higher or lower than 5%?

    Recall that the p-value essentially tells you how likely it is to obtain your observed difference by chance (meaning: if the null model IS in fact correct). A large p-value means you are likely to obtain you observed value by chance, a small p-value means it is unlikely to obtain the observed value by chance.

    Note that in the case we investigated here we actually have a good idea about the null-distribution, and we would therefore normally NOT use parametric bootstrapping (we would simply compare to a chi-squared distrbution with 4 degrees of freedom). Parametric bootstrapping would mostly be used for more complicated cases, such as when comparing non-nested models.