Events News Research CBS CBS Publications Bioinformatics
Staff Contact About Internal CBS CBS Other

Bayesian Phylogenetic Analysis

Exercise written by: Anders Gorm Pedersen


Overview

Today's exercise will focus on phylogenetic analysis using Bayesian methods. We already discussed Bayesian statistics and Bayesian phylogeny in class, but if you want to refresh your memory, then you can read a brief introduction at the end of this page.

In this exercise we will explore how one can determine and use posterior probability distributions over trees, over clades, and over substitution parameters. We will also touch upon the difference between marginal and joint probability distributions.


Getting started.

  1. Log in to your account.

  2. Construct todays working directory:

    mkdir bayes
    ls -l
  3. Change to today's working directory

    cd bayes
  4. Copy files for today's exercise:

    cp ~gorm/bayes/primatemitDNA.nexus ./primatemitDNA.nexus
    cp ~gorm/bayes/neandertal_aligned_batch.nxs ./neandertal_aligned_batch.nxs
    cp ~gorm/bayes/neandertal_aligned.nxs ./neandertal_aligned.nxs
    cp ~gorm/bayes/hcvsmall.nexus ./hcvsmall.nexus
    cp ~gorm/bayes/mbplot ./mbplot

    You have analyzed (versions of) all these data files previously in this course. We will now use Bayesian phylogenetic analysis to complement what we learned in those analyses.


Posterior probability of trees

In today's exercise we will be using the program "MrBayes" to perform Bayesian phylogenetic analysis. MrBayes is a program that, like PAUP*, can be controlled by giving commands at a command line prompt. In fact, there is a substantial overlap between the commands used to control MrBayes and the PAUP command language. This should be a help when you are trying to understand how to use the program.

Note that the command "help" will give you a list of all available commands. Issuing "help command" will give you a more detailed description of the specified command along with current option values. This is similar to how "command ?" works in PAUP. There is also a very useful MrBayes Wiki with a manual and frequently asked questions.

  1. Start program:

    mb

    This starts the program, giving you a prompt ("MrBayes> ") where you can enter commands.

  2. Get a quick overview of available commands:

    help
  3. Load your sequences:

    execute primatemitDNA.nexus

    This file contains mitochondrial DNA sequences from 5 different primates. Note that MrBayes accepts input in nexus format, and that this is the same command that was used to load sequences in PAUP*. In general, you can use many of the PAUP commands in MrBayes also.

  4. Inspect data set:

    showmatrix
  5. Define outgroup:

    outgroup Gibbon
  6. Specify your model of sequence evolution:

    lset nst=2 rates=gamma

    This command is again very much like the corresponding one in PAUP. You are specifying that you want to use a model with two substitution types (nst=2), and this is automatically taken to mean that you want to distinguish between transitions and transversions. Furthermore, rates=gamma means that you want the model to use a gamma distribution to account for different rates at different sites in the sequence.

  7. Start Markov chain Monte Carlo sampling:

    Make sure to make the shell window as wide as possible and then issue this command to start the run:

    mcmc ngen=100000 samplefreq=100 nchains=3 diagnfreq=5000

    What you are doing here is to use the method known as MCMCMC ("Metropolis-coupled Markov chain Monte Carlo") to empirically determine the posterior probability distribution of trees, branch lengths and substitution parameters. Recall that in the Bayesian framework this is how we learn about parameter values: instead of finding the best point estimates, we typically want to quantify the probability of the entire range of possible values. An estimate of the time left is shown in the last column of output.

    Let us examine the command in detail. First, ngen=100000 samplefreq=100 lets the search run for 100,000 steps ("generations") and saves a tree once every 100 rounds (meaning that a total of 1000 trees will be saved). The option nchains=3 means that the MCMCMC sampling uses 3 parallel chains (but see below): one "cold" from which sampling takes place, and two "heated" that move around in the parameter space more quickly to find additional peaks in the probability distribution.

    The option diagnfreq=5000 has to do with testing whether the MrBayes run is succesful. Briefly, MrBayes will start two entirely independent runs starting from different random trees. In the early phases of the run, the two runs will sample very different trees but when they have reached convergence (when they produce a good sample from the posterior probability distribution), the two tree samples should be very similar. Every diagnfreq generations, the program will compute a measure of how similar the tree-samples are (specifically, the measure is the average standard deviation of split frequencies). As a rule of thumb, you may want to run until this value is less than 0.01.

    During the run you will see reports about the progress of the two sets of four chains. Each line of output lists the generation number and the log likelihoods of the current tree/parameter combination for each of the two groups of three chains (a column of asterisks separate the results for the independent runs). The cold chains are the ones enclosed in brackets [], while the heated chains are enclosed in parentheses (). Occasionally the chains will swap so one of the heated chains now becomes cold (and sampling then takes place from this chain).

  8. Continue run until parallel runs converge on same solution:

    At the end of the run, Mrbayes will print the average standard deviation of split frequencies (which is a measure of how similar the tree samples of the two independent runs are). We recommend that you continue with the analysis until the value gets below 0.01 (if the value is larger than 0.01 then you should answer "yes" when the program asks "Continue the analysis? (yes/no)".)

    Q1: Once you have reached convergence (and answered "no" to continue the analysis) you should note the total number of generations and the average standard deviation of split frequencies.

  9. Have a look at the resulting sample files:

    Open a new shell-window and cd to today's working directory. Open one of the parameter sampling files in an nedit window:

    nedit primatemitDNA.nexus.run1.p

    This file contains one line for each sampled point (you may want to turn off line-wrapping in nedit under the preferences menu). Each row corresponds to a certain sample time (or generation). Each column contains the sampled values of one specific parameter. The first line contains headings telling what the different columns are: "lnL" is the log likelihood of the current parameter estimates, "TL" is the tree length (sum of all branch lengths), "kappa" is the transition/transversion rate ratio, "pi(A)" is the frequency of A (etc.), and "alpha" is the shape parameter for the gamma distribution. (Column headings may be shifted relative to their corresponding columns). Note how the values of most parameters change a lot during the initial "burnin" period, before they settle near their most probable values. Now, close the nedit window and have a look at the file containing sampled trees:

    nedit primatemitDNA.nexus.run1.t

    Tree topology is also a parameter in our model, and exactly like for the other parameters we also get samples from tree-space. One tree is printed per line in the parenthetical format used by most phylogeny software. There are 5 taxa in the present data set, meaning that the tree-space consists of only 15 different possible trees. Since we have taken more than 15 sample points, there must be several lines containing the same tree topology.

  10. Examine "burn-in":

    Recall, that the idea in MCMCMC sampling is to move around in parameter space in such a way that the points will be visited according to their posterior probability (i.e., a region with very high posterior probability will be visited frequently). As mentioned in the lecture it takes a while after starting such a process before the sampling gets "out of the foot-hills and into the mountains" of the distribution, and only after this "burnin" period can the sample points be trusted. Now,plot the lnL value for one of the run files:

    mbplot primatemitDNA.nexus.run1.p 2 0 50000

    mbplot is a small script written by me, that extracts the relevant columns from the .p file and uses gnuplot to produce a plot of how the value changes with generation number. The options mean that you will get a plot of the lnL (column 2 in the file) from generation 0 to generation 50,000. Note how the likelihood value (y-axis) increases through the generations (x-axis) during the initial burnin, to finally reach a plateau. You can experiment with plotting other columns as well, or plotting only the initial generations if you want.

  11. Investigate posterior probability distribution over trees:

    MrBayes provides the sumt command to summarize the sampled trees. Before using it, we need to decide on the burn-in. Since the convergence diagnostic we used previously to determine when to stop the analysis discarded the first 25% of the samples, it makes sense to also discard 25% of the samples obtained during the analysis.

    Return to the shell window where you have MrBayes running. In the command below relburnin=yes and burninfrac=0.25 tells MrBayes to discard 25% of the samples as burnin (you could also have explicitly given the number of samples to discard - help sumt will give you details about the command and the current option settings).

    sumt contype=halfcompat conformat=simple showtreeprobs=yes relburnin=yes burninfrac=0.25

    (Scroll back so you can see the top of the output when the command is done). This command gives you a summary of the trees that are in the file you examined manually above. As mentioned, the option burnin=XX tells MrBayes how many sample points to discard before analyzing the data. The option contype=halfcompat requests that a majority rule consensus tree is calculated from the set of trees that are left after discarding the burnin. This consensus is the first tree plotted to the screen. Below the consensus cladogram, a consensus phylogram is plotted. The branch lengths in this have been averaged over the trees in which that branch was present (a particular branch corresponds to a bi-partition of the data, and will typically not be present in every sampled tree). The cladogram also has "clade credibility" values. We will return to the meaning of these later in today's exercise.

    What most interests us right now is the list of trees that is printed after the phylogram. These trees are labeled "Tree 1", "Tree 2", etc, and are sorted according to their posterior probability which is indicated by a lower-case p after the tree number. (The upper-case P gives the cumulated probability of trees shown so far, and is useful for constructing a credible set). This list highlights how Bayesian phylogenetic analysis is different from maximum likelihood: Instead of finding the best tree(s), we now get a full list of how probable any possible tree is.

    The list of trees and probabilities was printed because of the option showtreeprobs=yes. Note that you probably do not want to issue that command if you have much more than 5 taxa! In that case you could instead inspect the file named primatemitDNA.nexus.trprobs which is now present in the same directory as your other files (this file is automatically produced by the sumt command).

    NOTE ADDED APRIL 2013: Apparently there is a bug in the version of mrbayes we are using now, which means leaf names are not printed on the list of trees with probabilities. In order to investigate these tree probabilities you will instead need to do the following:

    • Copy the file primatemitDNA.nexus.trprobs to your own computer.
    • Open the file in a text editor - note that tree probabilitie are included as comments in this nexus file.
    • Also open the file in FigTree. The tree you will see is the first (most probable tree). Reroot on gibbon and inspect the topology of the tree.
    • To see other trees you first need to de-select user-rooting (dont know why this is so...): In the leftmost column of the figtree window, open the "Trees" section, and de-select the "Root tree" checkbox.
    • When user rooting has been de-selected you can click the arrow labeled "Prev/Next" at the top of the window to move to the next (less probable) tree. Again root on gibbon and inspect topology.
    • De-select user rooting again, move to the final tree, root on gibbon and inspect topology.
  12. Report results:

    Q2: Include a plot of the most probable tree and note its posterior probability. How did the less probable trees differ from the most likely one?


Analysis of Neanderthal data (posterior probability of clades)

The classical view emerging from anatomical and archaeological studies places Neanderthals as a different species from Homo sapiens. This is in agreement with the "Out-of-Africa hypothesis", which states that Neanderthals coexisted without mating with modern humans who originated in Africa somewhere between 100,000 to 200,000 years ago. There is, however, also anatomical and paleontological research which supports the so-called "multi-regional hypothesis", which propounds that some populations of archaic Homo evolved into modern human populations in many geographical regions. Consequently, Neanderthals could have contributed to the genetic pool of present-day Europeans. We will use the present data set to consider this issue.

  1. Load Neanderthal data set:

    execute neandertal_aligned.nxs
  2. Investigate data:

    showmatrix

    The data set you have analyzed here consists of an alignment of mitochondrial DNA from human (53 sequences), chimpanzee (1 sequence), and Neanderthal (1 sequence). The Neanderthal DNA was extracted from archaeological material, specifically bones found at Vindija in Croatia.

  3. Start analysis:

    outgroup Pan_troglodytes
    lset nst=2
    mcmc ngen=200000 nchains=3 diagnfreq=10000
    
  4. Find posterior probability of clades:

    sumt contype=halfcompat showtreeprobs=no relburnin=yes burninfrac=0.25

    Examine the consensus tree that is plotted to screen: On the branches that are resolved, you will notice that numbers have been plotted. These are clade-credibility values, and are in fact the posterior probability that the clade is real (based on the present data set). These numbers are different from bootstrap values: unlike bootstrap support (which have no clear statistical meaning) these are actual probabilities. Furthermore, they have been found using a full probabilistic model, instead of neighbor joining, and have still finished in a reasonable amount of time. These features make Bayesian phylogeny very useful for assessing hypotheses about monophyly.

    Q3: Also investigate the tree in FigTree: the result of the sumt command has been saved to a file named "neandertal_aligned.nxs.con.tre". Open this file in FigTree, root the tree on the chimpanzee, and select display of clade credibilities by putting a check mark in the box "Branch labels", expanding the "Branch labels" section by clicking on the disclosure triangle, and selecting "prob" in the "Display" drop-down menu. Include the plot in your report.

    Q4: Based on your values, how much support does the present data set give to the out-of-Africa hypothesis? (Are Homo sapiens a monophyletic group excluding the Neanderthal?). Note the clade credibility for Homo sapiens.


Probability distributions over other parameters

As the last thing today, we will now turn away from the tree topology, and instead examine the other parameters that also form part of the probabilistic model. We will do this using a reduced version of the Hepatitis C virus data set that we have examined previously. Stay in the shell window where you just performed the analysis of Neanderthal sequences.

  1. Load data set:

    execute hcvsmall.nexus
  2. Define site partition:

    charset 1stpos=1-.\3
    charset 2ndpos=2-.\3
    charset 3rdpos=3-.\3
    partition bycodon = 3:1stpos,2ndpos,3rdpos
    set partition=bycodon
    prset ratepr=variable
    

    This is an alternative way of specifying that different sites have different rates. Instead of using a gamma distribution and learning which sites have what rates from the data, we are instead using our prior knowledge about the structure of the genetic code to specify that all 1st codon positions have the same rate, all 2nd codon positions have the same rate, and all 3rd codon positions have the same rate. Specifically, charset 1stpos=1-.\3 means that we define a character set named "1stpos" which includes site 1 in the alignment followed by every third site ("\3", meaning it includes sites 1, 4, 7, 11, ...) until the end of the alignment (here denoted ".").

  3. Specify model:

    lset nst=6

    This specifies that we want to use a model of the General Time Reversible (GTR) type, where all 6 substitution types have separate rate parameters.

    When the lset command was discussed previously, a few issues were glossed over. Importantly, and unlike PAUP, the lset command in MrBayes gives no information about whether nucleotide frequencies are equal or not, and whether they should be estimated from the data or not. In MrBayes this is instead controlled by defining the prior probability of the nucleotide frequencies (the command prset can be used to set priors). For instance, a model with equal nucleotide frequencies corresponds to having prior probability 1 (one) for the frequency vector (A=0.25, C=0.25, G=0.25, T=0.25), and zero prior probability for the infinitely many other possible vectors. As you will see below, the default prior is not this limited, and the program will therefore estimate the frequencies from the data.

  4. Inspect model details:

    showmodel

    This command gives you a summary of the current model settings. You will also get a summary of how the prior probabilities of all model parameters are set. You will for instance notice that the nucleotide frequencies (parameter labeled "Statefreq") have a "Dirichlet" prior. We will not go into the grisly details of what exactly the Dirichlet distribution looks like, but merely note that it is a distribution over many variables, and that depending on the exact parameters the distribution can be more or less flat. The Dirichlet distribution is a handy way of specifying the prior probability distribution of nucleotide (or amino acid) frequency vectors. The default statefreq prior in MrBayes is the flat or un-informative prior dirichlet(1,1,1,1).

    we will not go into the priors for the remaining parameters in any detail, but you may notice that by default all topologies are taken to be equally likely (a flat prior on trees).

  5. Start MCMC sampling:

    mcmc ngen=300000 samplefreq=100 diagnfreq=10000 nchains=4

    The run will take about 2 minutes to finish (you may want to ensure that the average standard deviation of split frequencies is less than 0.01 before ending the analysis).

  6. Compute summary of parameter values:

    sump relburnin=yes burninfrac=0.25

    The sump command works much like the sumt command for the non-tree parameters. Again, we are using 25% of the total number of samples as burnin.

    First, you get a plot of the lnL as a function of generation number. Values from the two independent runs are labeled "1" and "2" respectively. If your burnin has been chosen well, then the points should be randomly scattered over a narrow lnL interval. (If you plotted lnL for the entire run, you would get a plot similar to what you got with mbplot earlier, where the lnL rises during the burnin and then levels off).

    Secondly, the posterior probability distribution of each parameter is summarized by giving the mean, variance, median, and 95% credible interval.

    Q5: Based on the reported posterior means, does it seem that r(C<->G) is different from r(A<->C)?. Report the mean of both parameters. Here r(C<->G) is the relative substitution rate between C and G.

  7. Marginal vs. joint distributions:

    Strictly speaking the comparison above was not entirely appropriate. We first found the overall distribution of the r(C<->G) parameter and then compared its mean to the mean of the overall distribution of the r(A<->C) parameter. By doing things this way, we are ignoring the possibility that the two parameters might be associated in some way. For instance, one parameter might always be larger than the other in any individual sample, even though the total distributions have a substantial overlap. We should instead be looking at the distribution over both parameters simultaneously. A probability distribution over several parameters simultaneously is called a "joint distribution" over the parameters.

    By looking at one parameter at a time, we are summing its probability over all values of the other parameters. This is called the marginal distribution. The figure below illustrates a joint distribution over two parameters, and shows how one can think of the marginal distribution as what you get when you view the joint distribution from the side, so to speak.


    Joint vs. marginal distribution.

    (Figure from Berry, "Statistics - a Bayesian perspective".)

  8. Examine marginal and joint distributions:

    Again find a shell window where MrBayes is not running and issue the following command:

    cat hcvsmall.nexus.run1.p | grep -v '^Gen' | gawk '{if (NR > 750) print $7}' > rCG.data
    

    This command takes the parameter file, removes the header line, and for all lines that were sampled after the burnin period (set to 750 here) it prints the r(C<->G) parameter to the file named "rCG.data". Now extract a few extra interesting columns in a similar manner.

    cat hcvsmall.nexus.run1.p | grep -v '^Gen' | gawk '{if (NR > 750) print $4}' > rAC.data
    cat hcvsmall.nexus.run1.p | grep -v '^Gen' | gawk '{if (NR > 750) print $14}' > rm1.data
    cat hcvsmall.nexus.run1.p | grep -v '^Gen' | gawk '{if (NR > 750) print $15}' > rm2.data
    cat hcvsmall.nexus.run1.p | grep -v '^Gen' | gawk '{if (NR > 750) print $16}' > rm3.data
    

    We can now plot some interesting distributions (Note: you may need to use a shell window on organism.cbs.dtu.dk or login.cbs.dtu.dk for the "histo" program to work):

    histo -w 0.01 -L"CG":"AC" -m ll rCG.data rAC.data

    Q6: Include the plot in your report.

    We here use the histo program to plot the marginal distributions of rCG and rAC (issue the command man histo if you want to learn how to use the histo program). Note that while rAC tends to be larger than rCG, there is still a substantial overlap between the distributions. We therefore need to examine the joint distribution to find the real probability that rAC > rCG. Close the plot window and issue this command:

    wc -l rCG.data

    Q7: This gives us a count of how many points remain in the sample files after having discarded the burnin. Write down the number.

    paste rAC.data rCG.data | gawk '{if ($1>$2) print $0}' | wc -l

    Q8: This counts the number of sample points where rAC > rCG (note this number also). You can now find the joint probability that rAC > rCG by dividing the last number by the first (do this and note the result).

    Note how examining the joint distribution provides you with information that you could not get from simply comparing the marginal distributions. This very simple procedure can also be performed using spread-sheet programs, and can obviously be used to answer many different questions.

    We will finish off today's exercise by making one final plot:

    histo -w 0.01 -L"1st":"2nd":"3rd" -X "Relative substitution rate" -m lll rm1.data rm2.data rm3.data

    Q9: This command plots the relative substitution rates at the first, second, and third codon positions. Since random mutations presumably hit all three codon positions with the same frequency, any differences must be caused by subsequent selection. How does the result fit with your knowledge of the structure of the genetic code? Include the plot in your report.


    Resources

    • MrBayes: Bayesian Inference of Phylogeny.

    Introduction to Bayesian phylogenetic analysis

    As was the case for likelihood methods, Bayesian analysis is founded on having a probabilistic model of how the observed data is produced. (This means that, for a given set of parameter values, you can compute the probability of any possible data set). You will recall from today's lecture that in Bayesian statistics the goal is to obtain a full probability distribution over all possible parameter values. To find this so-called posterior probability distribution requires combining the likelihood and the prior probability distribution.

    The prior probability distribution shows your beliefs about the parameters before seeing any data, while the likelihood shows what the data is telling about the parameters. Specifically, the likelihood of a parameter value is the probability of the observed data given that parameter value. (This is the measure we have previously used to find the maximum likelihood estimate). If the prior probability distribution is flat (i.e., if all possible parameter values have the same prior probability) then the posterior distribution is simply proportional to the likelihood distribution, and the parameter value with the maximum likelihood then also has the maximum posterior probability. However, even in this case, using a Bayesian approach still allows one to interpret the posterior as a probability distribution. If the prior is NOT flat, then it may have a substantial impact on the posterior although this effect will diminish with increasing amounts of data. A prior may be derived from the results of previous experiments. For instance one can use the posterior of one analysis as the prior in a new, independent analysis.

    In Bayesian phylogeny the parameters are of the same kind as in maximum likelihood phylogeny. Thus, typical parameters include tree topology, branch lengths, nucleotide frequencies, and substitution model parameters such as for instance the transition/transversion ratio or the gamma shape parameter. (The difference is that while we want to find the best point estimates of parameter values in maximum likelihood, the goal in Bayesian phylogeny is instead to find a full probability distribution over all possible parameter values). The observed data is again usually taken to be the alignment, although it would of course be more reasonable to say that the sequences are what have been observed (and the alignment should then be inferred along with the phylogeny).