|
Bayesian Phylogenetic Analysis
Exercise written by: Anders Gorm Pedersen
Overview
Today's exercise will focus on phylogenetic analysis using Bayesian methods. We
already discussed Bayesian statistics and Bayesian phylogeny in class, but if you want to
refresh your memory, then you can read a brief introduction at the end of this page.
In this exercise we will explore how one can determine and use posterior probability
distributions over trees, over clades, and over substitution parameters. We will also
touch upon the difference between marginal and joint probability distributions.
Log in to your account.
You will need three shell windows for today's exercise. Two are used for running
the phylogeny program "MrBayes". The last one will be used for various analyses of
the files that are output by MrBayes.
Construct todays working directory:
mkdir bayes
ls -l
Change to today's working directory
cd bayes
Make sure that you cd to today's directory in all three shell windows.
Copy files for today's exercise:
cp ~gorm/bayes/primatemitDNA.nexus ./primatemitDNA.nexus
cp ~gorm/bayes/neanderthalsmall.nexus ./neanderthalsmall.nexus
cp ~gorm/bayes/hcvsmall.nexus ./hcvsmall.nexus
Start long run for later analysis:
Before we really start with today's exercise, you have to start a run that will
take quite long time to finish. The results will be analyzed later today, but since we
do not want to sit around waiting for it, we will start it now and discuss the results
later. For now you should just issue the commands given below without wondering what
they mean. I will explain them later in the exercise.
mb
This starts the program, giving you a prompt ("MrBayes> ") where you
can enter commands. The next set of commands will load the data file, specify
outgroup, specify the model, and start the run:
execute neanderthalsmall.nexus
outgroup PAN6030
lset nst=2 rates=gamma
mcmc ngen=500000 nchains=3 diagnfreq=10000 savebrlens=yes temp=0.05
The analysis will run for about 30 minutes. Once you have started the
run, you should make sure to minimize the window so it won't get in your way.
We will return and analyze the results later.
Posterior probability of trees
In today's exercise we will be using the program "MrBayes" to perform
Bayesian phylogenetic analysis. MrBayes is a program that, like PAUP*, can be
controlled by giving commands at a command line prompt. In fact, there is a
substantial overlap between the commands used to control MrBayes and the PAUP
command language. This should be a help when you are trying to understand how to
use the program.
Note that the command "help" will give you a list of all available
commands. Issuing "help command" will give you a more detailed
description of the specified command along with current option values. This is
similar to how "command ?" works in PAUP. There is also a very
useful MrBayes Wiki
with a manual and frequently asked questions.
Before you start the exercise, you should make sure that you are now in one of the
un-used shell windows, and that you have cd'ed to today's directory! We will now explore
how to find and use the posterior probability distribution over trees.
Start program:
mb
Get a quick overview of available commands:
help
Load your sequences:
execute primatemitDNA.nexus
This file contains mitochondrial DNA sequences from 5 different primates. Note that
MrBayes accepts input in nexus format, and that this is the same command that was
used to load sequences in PAUP*. In general, you can use many of the PAUP commands in
MrBayes also.
Inspect data set:
showmatrix
Define outgroup:
outgroup Gibbon
Specify your model of sequence evolution:
lset nst=2 rates=gamma
This command is again very much like the corresponding one in PAUP. You are
specifying that you want to use a model with two substitution types (nst=2),
and this is automatically taken to mean that you want to distinguish between
transitions and transversions. Furthermore, rates=gamma means that you want
the model to use a gamma distribution to account for different rates at different
sites in the sequence.
Start Markov chain Monte Carlo sampling:
Make sure to make the shell window as wide as possible and then issue this
command to start the run:
mcmc ngen=50000 samplefreq=100 nchains=4 savebrlens=yes diagnfreq=10000
What you are doing here is to use the method known as MCMCMC
("Metropolis-coupled Markov chain Monte Carlo") to empirically determine the
posterior probability distribution of trees, branch lengths and substitution
parameters. Recall that in the Bayesian framework this is how we learn about
parameter values: instead of finding the best point estimates, we typically want
to quantify the probability of the entire range of possible values. An
estimate of the time left is shown in the last column of output.
Let us examine the command in detail. First, ngen=50000
samplefreq=100 lets the search run for 50,000 steps ("generations") and
saves a tree once every 100 rounds (meaning that a total of 500 trees will be
saved). The option nchains=4 means that the MCMCMC sampling uses 4
parallel chains (but see below): one "cold" from which sampling takes place, and
three "heated" that move around in the parameter space more quickly to find
additional peaks in the probability distribution. As you might have guessed, savebrlens=yes means we want the
branch length information saved along with the sampled trees.
The option diagnfreq=10000 has to do with testing whether the
MrBayes run is succesful. Briefly, MrBayes will start two entirely independent
runs starting from different random trees. In the early phases of the run, the
two runs will sample very different trees but when they have reached convergence
(when they produce a good sample from the posterior probability distribution),
the two tree samples should be very similar. Every diagnfreq
generations, the program will compute a measure of how similar the tree-samples
are (specifically, the measure is the average standard deviation of split
frequencies). As a rule of thumb, you may want to run until this value is less
than 0.01.
During the run you will see reports about the progress of the two sets of
four chains. Each line of output lists the generation number and the log
likelihoods of the current tree/parameter combination for each of the two
groups of four chains (a column of asterisks separate the results for tho
independent runs). The cold chains are the ones enclosed in brackets [], while
the heated chains are enclosed in parentheses (). Occasionally the chains will
swap so one of the heated chains now becomes cold (and sampling then takes
place from this chain).
Continue run until parallel runs converge on same solution:
At the end of the run, Mrbayes will print the average standard deviation of
split frequencies (which is a measure of how similar the tree samples of the
two independent runs are). We recommend that you continue with the analysis
until the value gets below 0.01 (if the value is larger than 0.01 then you
should answer "yes" when the program asks "Continue the analysis?
(yes/no)".)
Once you have reached convergence you should note the total number of
generations and the average standard deviation of split frequencies on todays
results form.
Have a look at the resulting sample files:
Switch to your third shell-window (the one where you are not running
MrBayes). Open one of the parameter sampling files in an nedit window:
nedit primatemitDNA.nexus.run1.p
This file contains one line for each sampled point (you may want to turn off
line-wrapping in nedit under the preferences menu). Each row corresponds
to a certain sample time (or generation). Each column contains the
sampled values of one specific parameter. The first line contains headings
telling what the different columns are: "lnL" is the log likelihood
of the current parameter estimates, "TL" is the tree length (sum of
all branch lengths), "kappa" is the transition/transversion ratio,
"pi(A)" is the frequency of A (etc.), and "alpha" is the
shape parameter for the gamma distribution. (Column headings may be shifted relative
to their corresponding columns). Note how the values of most
parameters change a lot during the initial "burnin" period, before they settle
near their most probable values. Now, close the nedit window and have a look at
the file containing sampled trees:
nedit primatemitDNA.nexus.run1.t
Tree topology is also a parameter in our model, and exactly like for the
other parameters we also get samples from tree-space. One tree is printed per
line in the parenthetical format used by most phylogeny software. There are 5
taxa in the present data set, meaning that the tree-space consists of only 15
different possible trees. Since we have taken more than 15 sample points, there
must be several lines containing the same tree topology.
Examine "burn-in":
Recall, that the idea in MCMCMC sampling is to move around in parameter space
in such a way that the points will be visited according to their posterior
probability (i.e., a region with very high posterior probability will be visited
frequently). As mentioned in the lecture it takes a while after starting such a
process before the sampling gets "out of the foot-hills and into the mountains"
of the distribution, and only after this "burnin" period can the sample points
be trusted. Now,plot the lnL value for one of the run files:
mbplot primatemitDNA.nexus.run1.p 2 0 50000
mbplot is a small script written by me, that extracts the relevant columns
from the .p file and uses gnuplot to produce a plot of how the value changes
with generation number. The options mean that you will get a plot of the lnL
(column 2 in the file) from generation 0 to generation 50,000. Note how the
likelihood value (y-axis) increases through the generations (x-axis) during
the initial burnin, to finally reach a plateau. You can experiment with
plotting other columns as well if you want.
Investigate posterior probability distribution over trees:
MrBayes provides the sumt command to summarize the sampled trees. Before
using it, we need to decide on the burn-in. Since the convergence diagnostic we
used previously to determine when to stop the analysis discarded the first 25%
of the samples, it makes sense to also discard 25% of the samples obtained
during the analysis. Note that you have to tell MrBayes how many sample
points to discard, not how many generations. You get the number of
sample points by dividing the number of generations by 100 (recall that we only
sampled once every 100 generations). This means that if you ran the analysis
for 100,000 generations, then the burnin is: (100,000/100)x0.25 =
250 sample points
Return to the shell window where you have MrBayes running. In the command
below you should replace XX with the number of sample points you want to
discard.
sumt contype=halfcompat showtreeprobs=yes burnin=XX
(Scroll back so you can see the top of the output when the command is done).
This command gives you a summary of the trees that are in the file you
examined manually above. As mentioned, the option burnin=XX tells
MrBayes how many sample points to discard before analyzing the data. The option
contype=halfcompat requests that a majority rule consensus tree is
calculated from the set of trees that are left after discarding the burnin.
This consensus is the first tree plotted to the screen. Below the consensus
cladogram, a consensus phylogram is plotted. The branch lengths in this have
been averaged over the trees in which that branch was present (a particular
branch corresponds to a bi-partition of the data, and will typically not be
present in every sampled tree). The cladogram also has "clade credibility"
values. We will return to the meaning of these later in today's exercise.
What most interests us right now is the list of trees that is printed after
the phylogram. These trees are labeled "Tree 1", "Tree 2", etc, and are sorted
according to their posterior probability which is indicated by a lower-case p
after the tree number. (The upper-case P gives the cumulated probability of
trees shown so far, and is useful for constructing a credible set). This list
highlights how Bayesian phylogenetic analysis is different from maximum
likelihood: Instead of finding the best tree(s), we now get a full list of how
probable any possible tree is. How did the less probable
trees differ from the most likely one?
The list of trees and probabilities was printed because of the option
showtreeprobs=yes. Note that you probably do not want to issue that command
if you have much more than 5 taxa! In that case you could instead inspect the file
named primatemitDNA.nexus.trprobs which is now present in the same directory
as your other files (this file is automatically produced by the sumt
command).
Report results:
Make a sketch of the most probable tree and note its posterior
probability on today's results form.
Analysis of Neanderthal data (posterior probability of clades)
By now the run you started at the beginning of today's exercise should be
finished. Find the shell window and maximize it (you will need a wide window
to view the plotted trees). For the purpose of this exercise you don't need to
continue the analysis until the average standard deviation of split frequencies
becomes less than 0.01, but ordinarily you would want to continue the analysis
until this was achieved (you would probably need to run at least 1 million
generations and perhaps also use more chains).
Specifically, this data set consists of an alignment of mitochondrial DNA
from human (34 sequences), chimpanzee (1 sequence), and Neanderthal (3
sequences). The Neanderthal DNA was extracted from archaeological material,
specifically bones found at Feldhofer in Germany, Mezmaiskaya in Northern
Caucasus, and Vindija in Croatia. (Aligned mitochondrial DNA was kindly made
available by Professor Gabriel Gutierrez Pozo, Univ. Sevilla, Spain)
The classical view emerging from anatomical and archaeological studies places
Neanderthals as a different species from Homo sapiens. This is in agreement with
the "Out-of-Africa hypothesis", which states that Neanderthals coexisted without
mating with modern humans who originated in Africa somewhere between 100,000 to
200,000 years ago. There is, however, also anatomical and paleontological
research which supports the so-called "multi-regional hypothesis", which
propounds that some populations of archaic Homo evolved into modern human
populations in many geographical regions. Consequently, Neanderthals could have
contributed to the genetic pool of present-day Europeans. We will use the
present data set to consider this issue.
Find posterior probability of clades:
sumt contype=halfcompat showtreeprobs=no burnin=XX
Again remember to put your actual number of samples to discard instead of
the XX in the command line above, and again use 25% of the total number of
samples (not generations) as burnin.
Now examine the consensus tree that is plotted to screen: On the branches
that are resolved, you will notice that numbers have been plotted. These are
clade-credibility values, and are in fact the posterior probability that the
clade is real (based on the present data set). These numbers are different from
bootstrap values: unlike bootstrap support (which have no clear statistical
meaning) these are actual probabilities. Furthermore, they have been found
using a full probabilistic model, instead of neighbor joining, and have still
finished in a reasonable amount of time. These features make Bayesian phylogeny
very useful for assessing hypotheses about monophyly.
Based on your values, how much support does the present data set give to
the out-of-Africa hypothesis? (Are Neanderthals a monophyletic sister group to
homo?). Note the clade credibilities for Neanderthal and for Homo on today's
result form.
Probability distributions over other parameters
As the last thing today, we will now turn away from the tree topology, and
instead examine the other parameters that also form part of the probabilistic
model. We will do this using a reduced version of the Hepatitis C virus data
set that we have examined previously. Stay in the shell window where you
just performed the analysis of Neanderthal sequences.
Load data set:
execute hcvsmall.nexus
Define site partition:
charset 1stpos=1-.\3
charset 2ndpos=2-.\3
charset 3rdpos=3-.\3
partition bycodon = 3:1stpos,2ndpos,3rdpos
set partition=bycodon
prset ratepr=variable
This is an alternative way of specifying that different sites have different
rates. Instead of using a gamma distribution and learning which sites have what
rates from the data, we are instead using our prior knowledge about the
structure of the genetic code to specify that all 1st codon positions have the
same rate, all 2nd codon positions have the same rate, and all 3rd codon
positions have the same rate. Specifically, charset 1stpos=1-.\3 means
that we define a character set named "1stpos" which includes site 1 in the
alignment followed by every third site ("\3", meaning it includes sites
1, 4, 7, 11, ...) until the end of the alignment (here denoted ".").
Specify model:
lset nst=6
This specifies that we want to use a model of the General Time Reversible (GTR)
type, where all 6 substitution types have separate rate parameters.
When the lset command was discussed previously, a few issues were
glossed over. Importantly, and unlike PAUP, the lset command in MrBayes gives
no information about whether nucleotide frequencies are equal or not, and
whether they should be estimated from the data or not. In MrBayes this is
instead controlled by defining the prior probability of the nucleotide
frequencies (the command prset can be used to set priors). For
instance, a model with equal nucleotide frequencies correspond to having prior
probability 1 (one) for the frequency vector (A=0.25, C=0.25, G=0.25, T=0.25),
and zero prior probability for the infinitely many other possible vectors. As
you will see below, the default prior is not this limited, and the program will
therefore estimate the frequencies from the data.
Inspect model details:
showmodel
This command gives you a summary of the current model settings. You will also get a
summary of how the prior probabilities of all model parameters are set. You will for
instance notice that the nucleotide frequencies (parameter labeled "Statefreq")
have a "Dirichlet" prior. We will not go into the grisly details of what exactly the
Dirichlet distribution looks like, but merely note that it is a distribution over many
variables, and that depending on the exact parameters the distribution can be more or
less flat. The Dirichlet distribution is a handy way of specifying the prior probability
distribution of nucleotide (or amino acid) frequency vectors. The default statefreq prior
in MrBayes is the flat or un-informative prior dirichlet(1,1,1,1).
we will not go into the priors for the remaining parameters in any detail, but you may
notice that by default all topologies are taken to be equally likely (a flat prior on
trees).
Start MCMC sampling:
mcmc ngen=300000 samplefreq=100 diagnfreq=10000 nchains=4 savebrlens=yes
The run will take about 5 minutes to finish (you may want to ensure that the
average standard deviation of split frequencies is less than 0.01 before ending
the analysis).
Compute summary of parameter values:
sump burnin=XX
The sump command works much like the sumt command for the
non-tree parameters. Again, use 25% of the total number of samples as burnin
(and remember that number of samples is found by dividing the number of
generations by 100).
First, you get a plot of the lnL as a function of generation number. Values
from the two independent runs are labeled "1" and "2" respectively. If your
burnin has been chosen well, then the points should be randomly scattered over
a narrow lnL interval. (If you plotted lnL for the entire run, you would get
a plot similar to what you got with mbplot earlier, where the lnL rises
during the burnin and then levels off).
Secondly, the posterior probability
distribution of each parameter is summarized by giving the mean, variance,
median, and 95% credible interval.
Based on the reported posterior means, does it seem that r(C<->G)
is different from r(A<->C)?. Report the mean of both parameters on
today's results form. Here r(C<->G) is the relative substitution rate
between C and G. (Note that all substitution parameters are given as relative
values, with the GT rate arbitrarily set at a value of 1.00).
Marginal vs. joint distributions:
Strictly speaking the comparison above was not entirely appropriate. We first found the overall distribution
of the r(C<->G) parameter and then compared its mean to the mean of the overall distribution of the
r(A<->C) parameter. By doing things this way, we are ignoring the possibility that the two parameters
might be associated in some way. For instance, one parameter might always be larger than the other in any
individual sample, even though the total distributions have a substantial overlap. We should instead be looking
at the distribution over both parameters simultaneously. A probability distribution over several parameters
simultaneously is called a "joint distribution" over the parameters.
By looking at one parameter at a time, we are summing its probability over all values of
the other parameters. This is called the marginal distribution. The figure below
illustrates a joint distribution over two parameters, and shows how one can think of the
marginal distribution as what you get when you view the joint distribution from the side,
so to speak (hence the name "marginal").

Joint vs. marginal distribution. (Figure from Berry, "Statistics - a Bayesian perspective".)
Examine marginal and joint distributions:
Again find a shell window where MrBayes is not running and issue the following command:
cat hcvsmall.nexus.run1.p | grep -v '^Gen' | gawk '{if (NR > 750) print $6}' > rCG.data
This command takes the parameter file, removes the header line, and for all lines that were sampled after
the burnin period (set to 750 here) it prints the r(C<->G) parameter to the file named "rCG.data".
Now extract a few extra interesting columns in a similar manner.
cat hcvsmall.nexus.run1.p | grep -v '^Gen' | gawk '{if (NR > 750) print $9}' > rAC.data
cat hcvsmall.nexus.run1.p | grep -v '^Gen' | gawk '{if (NR > 750) print $14}' > rm1.data
cat hcvsmall.nexus.run1.p | grep -v '^Gen' | gawk '{if (NR > 750) print $15}' > rm2.data
cat hcvsmall.nexus.run1.p | grep -v '^Gen' | gawk '{if (NR > 750) print $16}' > rm3.data
We can now plot some interesting distributions:
histo -w 0.01 -L"CG":"AC" -m ll rCG.data rAC.data
We here use the histo program to plot the marginal distributions of rCG and rAC (issue
the command man histo if you want to learn how to use the histo program). Note
that while rAC tends to be larger than rCG, there is still a substantial overlap between
the distributions. We therefore need to examine the joint distribution to find the real
probability that rAC > rCG. Close the plot window and issue this command:
wc -l rCG.data
This gives us a count of how many points remain in the sample files after having
discarded the burnin. Write down the number on today's results form.
paste rAC.data rCG.data | gawk '{if ($1>$2) print $0}' |Â wc -l
This counts the number of sample points where rAC > rCG (note this number on the form also). You can now
find the probability that rAC > rCG by dividing the last number by the first (do this and note the result
on today's results form).
Note how examining the joint distribution provides you with information that you
could not get from simply comparing the marginal distributions. This very simple procedure can also be
performed using spread-sheet programs, and can obviously be used to answer many different questions.
We will finish off today's exercise by making one final plot:
histo -w 0.01 -L"1st":"2nd":"3rd" -m lll rm1.data rm2.data rm3.data
This command plots the relative mutation rates at the first, second, and third codon positions. How does
this fit with your knowledge of the structure of the genetic code? Make a sketch of the three distributions
on today's results form, and as soon as you have shown it to the teacher you are done. Have a nice
weekend!
Resources
- MrBayes:
Bayesian Inference of Phylogeny.
As was the case for likelihood methods, Bayesian analysis is founded on having a
probabilistic model of how the observed data is produced. (This means that, for a given
set of parameter values, you can compute the probability of any possible data set). You
will recall from today's lecture that in Bayesian statistics the goal is to obtain a full
probability distribution over all possible parameter values. To find this so-called
posterior probability distribution requires combining the likelihood and the prior
probability distribution.
The prior probability distribution shows your beliefs about the parameters before
seeing any data, while the likelihood shows what the data is telling about the
parameters. Specifically, the likelihood of a parameter value is the probability of
the observed data given that parameter value. (This is the measure we have previously
used to find the maximum likelihood estimate). If the prior probability distribution
is flat (i.e., if all possible parameter values have the same prior probability) then
the posterior distribution is simply proportional to the likelihood distribution, and
the parameter value with the maximum likelihood then also has the maximum posterior
probability. However, even in this case, using a Bayesian approach still allows one to
interpret the posterior as a probability distribution. If the prior is NOT flat, then
it may have a substantial impact on the posterior although this effect will diminish
with increasing amounts of data. A prior may be derived from the results of previous
experiments. For instance one can use the posterior of one analysis as the prior in a
new, independent analysis.
In Bayesian phylogeny the parameters are of the same kind as in maximum likelihood
phylogeny. Thus, typical parameters include tree topology, branch lengths, nucleotide
frequencies, and substitution model parameters such as for instance the
transition/transversion ratio or the gamma shape parameter. (The difference is that
while we want to find the best point estimates of parameter values in maximum
likelihood, the goal in Bayesian phylogeny is instead to find a full probability
distribution over all possible parameter values). The observed data is again usually
taken to be the alignment, although it would of course be more reasonable to say that
the sequences are what have been observed (and the alignment should then be inferred
along with the phylogeny).
|