Bootstrapping
Exercise written by: Anders Gorm Pedersen
Overview
In phylogenetic analysis (and everywhere else for that matter) it is important to be able to
assess the uncertainty one has about estimated parameter values. In previous classes we have
discussed how quantification of uncertainty about parameter values is an important aspect
of the Bayesian approach, but also outside the Bayesian paradigm it is useful to assess
uncertainty.
One way of empirically assessing the uncertainty is to collect several independent data
sets, and estimate the parameter from each data set. (For random reasons different data sets
will give slightly different estimates of a parameter value). There are, however, problems
where collecting large amounts of new data is not feasible. The goal of bootstrapping is
precisely to find out how large the stochastic variation is, without actually collecting new
data sets. (Hence the name "bootstrap": we are pulling ourselves up by the straps of our
boots). In bootstrapping this variation is instead approximated by using the original data set
as the basis for constructing many "new" data sets that are then used to estimate the
parameter. This provides an estimate of how certain we are about the values of our
parameters.
Knowing the amount of stochastic variation in an estimated parameter value is obviously
useful in its own right (think "confidence interval"). However, it also forms the basis for
(frequentist) statistical testing of whether different estimates are significantly different.
For instance, fitting a K2P model to a data set will always result in a better likelihood than
fitting a JC model (since the JC model is nested within K2P, and K2P has an extra model
parameter). But in order to find out whether a K2P model fits a data set significantly
better than JC, we have to know something about the K2P fits we can expect to get for
random reasons (i.e., when JC is in fact the correct model). Specifically, we have to know
the distribution of the difference between the log likelihoods of the two models, when the null
model is correct.
Think about that last statement for a minute before proceeding...
As we saw last week, twice the log likelihood difference will in fact follow a chi-squared
distribution when models are nested and the null model is true. When the models are not nested
in the proper manner we have to determine the distribution empirically, typically by
bootstrapping.
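As a reminder of how such a likelihood-ratio test works when the chi-squared approximation applies, here is a minimal sketch in Python. The two log likelihood values are made-up illustrative numbers (not from the exercise data), and since JC is nested in K2P with one extra parameter, the test uses one degree of freedom; for df = 1 the chi-squared survival function has the closed form erfc(sqrt(x/2)).

```python
import math

# Hypothetical log likelihoods (illustrative numbers, not from the exercise data)
lnL_JC  = -2568.3   # null model (Jukes-Cantor)
lnL_K2P = -2562.1   # alternative model (Kimura 2-parameter)

# Likelihood-ratio test statistic: twice the log likelihood difference
delta = 2 * (lnL_K2P - lnL_JC)

# For nested models differing by 1 parameter, delta follows a chi-squared
# distribution with 1 df under the null; its survival function is erfc(sqrt(x/2)).
p_value = math.erfc(math.sqrt(delta / 2))

print(f"2*dlnL = {delta:.2f}, p = {p_value:.4f}")
```

A p-value below 0.05 would indicate that the extra K2P parameter improves the fit more than expected by chance alone.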
We will mainly use programs from the PAML package for the exercise: evolver for
simulating data sets, and baseml for fitting models to simulated (and real) data.
The procedures involved in performing parametric bootstrapping are a bit tedious and
time-consuming, so we will not look at many examples.
Log in to your Unix shell account on "padawan" and construct today's working
directory:
mkdir bootstrap
ls -l
Change to today's working directory
cd bootstrap
Copy files for today's exercise:
cp ~gorm/bootstrap/primatemitDNA.nexus ./primatemitDNA.nexus
cp ~gorm/bootstrap/primatemitDNA.phylip ./primatemitDNA.phylip
cp ~gorm/bootstrap/primatetree.ph ./primatetree.ph
cp ~gorm/bootstrap/MCbase.dat ./MCbase.dat
cp ~gorm/bootstrap/baseml.ctl ./baseml.ctl
You have analyzed the data file previously in this course. The remaining files will
all be used to control aspects of how the PAML programs deal with the data.
(Non-parametric) Bootstrapping: Clade support
Briefly, the idea in ordinary (or nonparametric) bootstrapping is to construct new
data sets by sampling with replacement from the initial data set. Specifically, the
process is as follows: Imagine you have an alignment with 130 columns. Bootstrapping
is then performed by randomly picking one of these columns 130 times in a row, in a
way so that the same column may be chosen more than once. This gives a resampled
data set of the same size as the original, where some columns are
present multiple times, while other columns are absent. This is repeated a large number
of times (usually 500-1000), so that many such data sets are
constructed. Finally, model parameters can be estimated from each of these data sets
(referred to as "bootstrap replicates") and the uncertainty assessed. In this exercise
we will focus on topology (clade support), but the technique can be used for other
types of parameters also.
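The column-resampling step described above can be sketched in a few lines of Python; the tiny alignment and sequence names here are invented purely for illustration.

```python
import random

def bootstrap_replicate(alignment):
    """Resample alignment columns with replacement.

    alignment: dict mapping sequence name -> sequence string,
    all sequences the same length (i.e., aligned).
    """
    length = len(next(iter(alignment.values())))
    # Pick column indices with replacement: some columns will appear
    # several times, others not at all.
    cols = [random.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in cols)
            for name, seq in alignment.items()}

# Tiny made-up alignment for illustration
aln = {"Human": "ACGTACGT", "Chimp": "ACGTACGA", "Gibbon": "ACTTACGA"}
rep = bootstrap_replicate(aln)
print(rep)
```

Each replicate produced this way has the same dimensions as the original alignment, so any tree-building method can be applied to it unchanged.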
Start paup:
paup
Load data file:
execute primatemitDNA.nexus
This data set contains mitochondrial DNA from 5 different primates.
Define outgroup:
outgroup Gibbon
Activate outgroup rooting and select how tree will be printed:
set root=outgroup outroot=monophyl
Select distance criterion:
set criterion=distance
Select K2P modelcorrected distances with gamma distributed rates:
dset distance=k2p rates=gamma
Bootstrap neighbor joining tree:
bootstrap search=nj nreps=1000 grpfreq=no
This command will use neighbor joining on a set of 1000 bootstrap replicates to
investigate the support for different branches in the tree. It will take a short while
to finish. What happens is that from each of the derived data sets a NJ tree is built
using the current settings (i.e., we are still using the K2P + gamma parameters
estimated above). A consensus tree is then built from the 1000 resulting trees. The
frequency with which any bipartition occurs gives some idea about how well supported
it is by the data. This is NOT a number with any clear statistical meaning, but as a
rule of thumb values over 85% are fairly good evidence for a grouping, while you
should probably be sceptical about values below approximately 70%.
Q1: Which bipartitions are in the tree, and what is the support for
each of them?
Perform maximum likelihood bootstrap:
set criterion=likelihood
lset nst=2 tratio=estimate rates=gamma
bootstrap search=heuristic nreps=1000 grpfreq=no
Q2: Which bipartitions are in this tree, are they the same as
in the NJ tree, and what is the support for
each of them?
It is also possible to do bootstrapping using other criteria, as we have done here with
likelihood (again using a K2P model with gamma distributed rates). As you can see, this is quite a bit
more time-consuming (about 3-4 minutes for this data set on padawan), so it is sometimes impractical to
perform a full bootstrap under likelihood. In such cases you can save time by fixing the model
parameters before the bootstrap. A useful alternative might also be to use Bayesian phylogeny
estimation, since this automatically gives you clade-support values as part of the process. Depending
on your preferred statistical paradigm, this can also be argued to have the benefit that
Bayesian support values are actual probabilities.
Bootstrapping can also be done in many other programs (ClustalX is an example),
but PAUP provides a much wider choice of criteria and model selection opportunities
so it is clearly preferable for most data sets.
Quit program:
quit
Parametric Bootstrapping: Comparing models of evolution
The main idea in parametric bootstrapping is to use a full model of evolution to
construct random sequences. These sequences are constructed by evolving a random start
sequence along a tree according to a set of specified substitution parameters (hence
the name "parametric").
Parametric bootstrapping can be used to compare alternative models of evolution and
to find out whether the more complicated model is significantly better than the
simpler (null) model. The approach is as follows: (1) Fit the null model to the data,
note the log likelihood, also note the estimated tree and parameter values. (2) Fit
the alternate model to the data, note the likelihood. Compute the difference in
likelihoods. (3) Using the tree and other parameters estimated under the null model,
construct a large (>1000 perhaps) set of simulated data sets by "evolving" random
sequences on the tree according to the fitted substitution parameters. These
simulated data are representative of how data sets would look IF the null model was
correct. (4) For each of the simulated data sets, fit the null and alternative models
to the data and compute the difference in likelihoods. This gives us a distribution of
the log likelihood differences that we could expect by chance, even if the null model
was true. (5) Compare the observed difference in log likelihoods, to the empirical
null distribution of log likelihood differences in order to see whether the observed
difference is significantly larger than expected by chance. (If the difference on the
real data is similar to the typical difference on the simulated data, then it seems
that the null model might be true. If the real difference is much much higher than
the average random difference, then we might infer that the complicated model is
indeed better.) (Note: An alternative way to compare non-nested models is the AIC
approach that we discussed last week).
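The five steps above can be illustrated with a toy example that does not involve sequences at all: here the null model is a fair coin (p = 0.5) and the alternative model lets p be free, so the whole simulate-refit-compare loop fits in a few lines. All numbers are invented for illustration; with real data, steps (1)-(4) would each be a baseml or evolver run instead of a one-line function call.

```python
import math
import random

random.seed(1)

def loglik(heads, n, p):
    """Binomial log likelihood (dropping the constant binomial coefficient)."""
    if p <= 0.0:
        return 0.0 if heads == 0 else float("-inf")
    if p >= 1.0:
        return 0.0 if heads == n else float("-inf")
    return heads * math.log(p) + (n - heads) * math.log(1.0 - p)

def lrt_stat(heads, n):
    """Twice the log likelihood difference: free p (its MLE) vs. fixed p = 0.5."""
    return 2 * (loglik(heads, n, heads / n) - loglik(heads, n, 0.5))

# Steps (1)+(2): fit null and alternative model to the "real" data
n, observed_heads = 100, 62          # made-up observation
observed = lrt_stat(observed_heads, n)

# Steps (3)+(4): simulate 1000 data sets under the null and refit both models
null_diffs = [lrt_stat(sum(random.random() < 0.5 for _ in range(n)), n)
              for _ in range(1000)]

# Step (5): compare the observed difference to the empirical null distribution
p = sum(d >= observed for d in null_diffs) / len(null_diffs)
print(f"observed 2*dlnL = {observed:.2f}, bootstrap p = {p:.3f}")
```

The key point is that the simulated data sets are generated under the *null* model using its fitted parameters, so the resulting distribution of differences shows what "better fit by chance" looks like.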
Inspect data set:
nedit primatemitDNA.phylip
This file contains the same sequences as primatemitDNA.nexus, but in phylip format, which is
required by the PAML programs.
Inspect previously constructed tree:
nedit primatetree.ph
The programs in the PAML package that we will use in this exercise are not well suited
for tree reconstruction, so I have supplied you with one. (Normally you
would first use PAUP or MrBayes to find a tree and then move on to this step).
Set parameters in control file:
nedit baseml.ctl
Unlike PAUP, the PAML programs are not run interactively. Instead you have to list all
settings in a "control file" prior to starting the program. This control file is used to
set parameters for the baseml program that is used for fitting nucleotide
substitution models. The control file has one parameter per line; a parameter name is
followed by an equals sign and then the value of the parameter; lines starting
with an asterisk are "commented out" and are ignored.
You should set the following parameters:
seqfile = primatemitDNA.phylip
treefile = primatetree.ph
outfile = primate.resultfile
model = 4
fix_alpha = 1
alpha = 0
These settings ensure that you will use the sequence file
primatemitDNA.phylip, the tree file primatetree.ph, the output
file primate.resultfile, the model that will be used for fitting is
HKY ("number 4" in baseml), and you will not use different (gamma distributed)
rates at different sites (alpha = 0). Save the file, quit nedit and proceed to
running the model fitting.
Fit HKY model to data:
baseml baseml.ctl
If all settings were entered correctly, baseml should now run using the settings
in baseml.ctl.
Inspect result file:
When the run has finished (after 1 second or so) you can inspect the
result file:
nedit primate.resultfile &
Q3: Note the following things: (1) The average nucleotide
frequencies. (2) The log likelihood (the lnL is written on a line starting
"lnL(ntime:"; it is the negative number right after the parenthesis). (3) The
tree with branch lengths, in Newick (parenthesis) format. (4) The free
rate matrix parameters (there is one free parameter in this case). This is
noted below a line that starts like this: "Parameters in the rate matrix
(HKY85)".
Q4: What is the meaning of the rate parameter in this model?
Prepare simulator program:
Above, you fitted the null model (HKY) to the data and got estimates of
all parameter values. In a short while, we want to use these estimates as the basis for
constructing simulated data sets. You should now open the control file for the
sequence simulator program and change the settings accordingly:
nedit MCbase.dat &
This file starts with a set of uncommented parameter values. It is NOT a
user-friendly format, but you should be able to figure it out; specifically,
you will find an explanation of the format at the end of the file. You should
change the following parameters: (1) Number of sequences should be set to the
same as the number in the data set. (2) Number of nucleotide sites should be
set to the length of the input data. (3) Number of replicates should be set to
1000. (4) The tree from your result file should be inserted instead of the tree
that is already there. Make sure to get the tree WITH branch lengths. Also make
sure to keep the blank line after the tree, or evolver won't run (!) (5) Model
should be set to 4 (meaning HKY). (6) "kappa or rate parameters in model"
should be set to the rate parameter value you found for HKY above.
(7) "alpha" (the gamma shape parameter) should be set to zero. (8) The
average base frequencies from above should also be entered.
You have now prepared the simulator to run with the parameters of the null model (HKY). We will
return to simulation shortly, but first you should close this window and move on to the fitting
of the alternate model to the real data.
Set GTR parameters in control file:
nedit baseml.ctl
You should now set the model parameter so it indicates the GTR model:
model = 7
Fit GTR model to data:
baseml baseml.ctl
Inspect result file:
When the run has finished (after 1 second or so) you can inspect the
result file:
nedit primate.resultfile
Q5: This time you should just note the log likelihood of the fit. Is
it better or worse than the HKY fit?
Construct simulated data sets:
evolver 5 MCbase.dat
This starts the program evolver, which will now construct 1000 simulated data sets of 5
sequences, each 896 nucleotides long. The sequences will be evolved on the tree you found
above, using the base frequencies and the substitution parameter estimated above as well.
These 1000 data sets therefore represent the type of data we would collect if
sequences had in fact evolved according to an HKY model with the specified parameters. The
simulator will take a short while to finish.
Inspect simulated data:
nedit mc.paml
Q6: The 1000 simulated data sets have been collected in a single file.
Notice that all replicates are different: Note the first ten nucleotides
from the "Human" sequence in the first two replicates.
Prepare analysis of simulated data using HKY:
nedit baseml.ctl
You are now ready to analyze the simulated data using our two alternative models.
First, you will prepare the control file for analyzing the data using the HKY model.
Specifically, you should set the following parameters:
seqfile = mc.paml
model = 4
ndata = 1000
The last line is important: it tells baseml that the input data file contains
1000 data sets which should all be analyzed. NOTE: you have to delete the leading
asterisk "*" before this option is activated!
Analyze simulated data using HKY:
baseml baseml.ctl
Inspect result file:
nedit primate.resultfile
The primate.resultfile file now contains a summary of parameter estimates for all 1000 data
sets. Here, we are only interested in the log likelihoods.
Q7: Note the fit (the log likelihood) of the model to the first two data sets (recall
that the lnL value is given on the line starting "lnL(ntime:").
Extract all loglikelihoods from result file:
cat primate.resultfile | grep 'lnL(ntime:' | gawk '{print $5}' > HKY.lnL
This command extracts HKY log likelihood values for all 1000 data sets. (Specifically, the
grep command extracts all lines starting with "lnL(ntime:", and the gawk
command then prints the fifth field on that line, which is the lnL value. Finally all the lnL
values are written to the file HKY.lnL).
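If you prefer, the same extraction can be done with a short Python script instead of grep/gawk. The sample line below is illustrative; the assumption is that the lnL value is the fifth whitespace-separated field on each "lnL(ntime:" line, matching the layout of the result file you just inspected.

```python
# Extract lnL values from lines assumed to look like:
#   lnL(ntime:  7  np:  9):  -2562.140538      +0.000000
# (illustrative layout; the lnL is the fifth whitespace-separated field)

def extract_lnls(lines):
    values = []
    for line in lines:
        if line.startswith("lnL(ntime:"):
            # Field index 4 (0-based) corresponds to $5 in the gawk command
            values.append(float(line.split()[4]))
    return values

sample = [
    "lnL(ntime:  7  np:  9):  -2562.140538      +0.000000",
    "some other line",
    "lnL(ntime:  7  np:  9):  -2571.002113      +0.000000",
]
print(extract_lnls(sample))
```

To use it on the real file, you would read the lines with `open("primate.resultfile")` and write the returned values out one per line.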
Prepare analysis of simulated data using GTR:
nedit baseml.ctl
You will now prepare the control file for analyzing the data using the GTR model (the
alternate model). Specifically, you should set the following parameters:
seqfile = mc.paml
model = 7
ndata = 1000
Most of this was presumably already set to the correct values.
Analyze simulated data using the GTR model:
baseml baseml.ctl
The GTR model is more complicated, so this step will take a bit longer than the HKY step
above (probably about twice as long).
Q8: Again note the fit (the log likelihood) of the model to the first two data sets.
Extract all loglikelihoods from result file:
cat primate.resultfile | grep 'lnL(ntime:' | gawk '{print $5}' > GTR.lnL
Compute likelihood difference:
Q9: At the beginning of this exercise you found the log likelihood of the HKY model and of
the GTR model, on the real data set. Now, compute twice the difference between these numbers and
write it down. We now want to find out if this difference is significant.
Determine nulldistribution of likelihood difference:
The idea with the simulation exercise above was to find out how large a log likelihood
difference one can get by chance between the two models. (Recall that the data was
simulated according to an HKY model, and we have then fitted both HKY and GTR models
to this data). We now want to compute the distribution of these random differences:
paste GTR.lnL HKY.lnL | gawk '{print 2*($1-$2)}' > lnLdiff.txt
This command computes twice the loglikelihood difference for all pairs of fitted HKY
and GTR models, and saves the numbers to a file.
Inspect distribution:
histo -V orientation=landscape -w 0.2 -b 0 -t 50 lnLdiff.txt
This shows the distribution of likelihood differences (between HKY and GTR) that can
arise by chance even when the sequences in fact evolve according to HKY. Where in the
distribution is your observed value approximately? (Recall that we are here looking at two
times the difference in log likelihood, so your original difference should be multiplied by two).
As you will recall, we mentioned last time that if the models were nested, then this
distribution should be a chi-squared distribution (with 4 degrees of freedom in this case,
since GTR has 4 extra parameters compared to HKY). Check with this website whether that seems to be
about right:
Chi-squared distribution
Q10: Include screen dumps of both the empirical distribution (your lnL data) and
the theoretical chi-squared curve. Do the distributions look similar?
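If you want a quick numerical sanity check as well, the chi-squared survival function with 4 degrees of freedom has the closed form sf(x) = exp(-x/2)(1 + x/2) (this finite-sum form holds for any even df). The sketch below compares that formula against an empirical tail fraction from a simulated chi-squared(4) sample; the sample here is generated on the spot and is not your lnLdiff.txt data.

```python
import math
import random

def chi2_sf_df4(x):
    """Survival function P(X > x) for chi-squared with 4 degrees of freedom.
    For even df = 2k: sf(x) = exp(-x/2) * sum over i < k of (x/2)^i / i!"""
    return math.exp(-x / 2) * (1 + x / 2)

# Empirical check: a chi-squared(4) variate is the sum of squares of
# 4 independent standard normals.
random.seed(0)
sample = [sum(random.gauss(0, 1) ** 2 for _ in range(4)) for _ in range(10000)]

for x in (2.0, 5.0, 9.49):          # 9.49 is roughly the 5% critical value
    empirical = sum(v > x for v in sample) / len(sample)
    print(f"x={x:5.2f}  theory={chi2_sf_df4(x):.4f}  empirical={empirical:.4f}")
```

You can evaluate chi2_sf_df4 at your own observed 2*dlnL value to get the chi-squared-based p-value directly.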
Determine significance of observed difference:
You now have a distribution of 1000 simulated log likelihood differences. To determine
the p-value you need to figure out how many of these random differences are larger
than your observed value. To do that we first sort the simulated numbers:
cat lnLdiff.txt | sort -n
Q11: Count how many numbers in the simulated distribution are larger than your observed
value (where "observed value" means your observed lnL difference times two). Then divide this number
by the total number of values in the distribution (1000). This is your p-value. Is it higher or lower
than 5%?
Recall that the p-value essentially tells you how likely it is to obtain your observed
difference by chance (meaning: if the null model IS in fact correct). A large p-value means
you are likely to obtain your observed value by chance; a small p-value means it is unlikely
to obtain the observed value by chance.
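The counting step itself is only a couple of lines in Python; the observed value and the list of simulated differences below are made-up placeholders standing in for your own observed 2*dlnL and the 1000 values in lnLdiff.txt.

```python
# Hypothetical values: replace 3.1 with your own observed 2*dlnL, and the
# short list below with the 1000 simulated differences from lnLdiff.txt.
observed = 3.1
diffs = [0.4, 1.2, 2.8, 3.5, 0.1, 5.0, 2.2, 0.9, 4.1, 1.7]

# p-value: fraction of simulated differences larger than the observed one
p_value = sum(d > observed for d in diffs) / len(diffs)
print(f"p = {p_value:.3f}")
```

With the full file you would read the values via `[float(x) for x in open("lnLdiff.txt")]` and divide by 1000 instead of 10.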
Note that in the case we investigated here we actually have a good idea about the
null distribution, and we would therefore normally NOT use parametric bootstrapping (we would
simply compare to a chi-squared distribution with 4 degrees of freedom). Parametric bootstrapping
would mostly be used for more complicated cases, such as when comparing non-nested models.
