Events News Research CBS CBS Publications Bioinformatics
Staff Contact About Internal CBS CBS Other

Population Growth, Fitness, and Selection

Exercise written by: Anders Gorm Pedersen


Overview

In this exercise you are going to use a simple computer program for simulating the growth of a population. The simulation will be based on an exponential growth model and will involve organisms having one of two different genotypes ('A' and 'a'). By performing the simulation with a range of different model parameter values you will sharpen your intuitive understanding of the dynamics of the model.

You will also get a brief introduction to assessing how well a model describes some observed data (the "fit" of a model), and learn how model parameters can be estimated from such data (the process known as "model fitting").

Finally, you will start learning how to work in the UNIX environment that we will use in the rest of the course (and that is alo used by bioinformaticians/systems biologists in the real world). If you are interested in learning more about UNIX you can browse this set of tutorials.


Getting started

  1. Log in to your Unix shell account on "padawan":

    Connect to the server "padawan" using the ssh program as outlined on the software page.

  2. Construct today's working directory:

    mkdir simulation
    cd simulation

    The mkdir (make directory) command constructs a new directory named "simulation". Subsequently the cd (change directory) command is used to select the newly constructed directory as "current working directory".

  3. Copy required files:

    cp /home/people/gorm/simulation/growth.py ./growth.py
    cp /home/people/gorm/simulation/modelfit.data ./modelfit.data
    ls -l

    The cp command copies the simulation program (growth.py) and a file containing experimental data (modelfit.data), to your current working directory (the entire path to the files was listed here).

    The ls (list) command with option -l (long) shows information about the files present in the current directory.

Links to data (use these if you'd like to use the data, but you are not enrolled in the course):


Population Growth, Fitness, and Selection

For this part of the exercise you will need to use two terminal windows simultaneously. Start by opening a second terminal window in the same way you did before (ssh to your account on padawan). Again cd to the simulation directory.

  1. Have a look at the simulation program:

    nedit growth.py &

    This command opens the file growth.py (which is a simple text file) in the nedit editor. Having an ampersand ("&") at the end of the command makes nedit run in the background.

    The growth.py program is written in the programming language Python (warmly recommended by the way). You will probably be able to understand most of what is going on even if you have no programming experience (Note that hash signs - "#" - indicate the beginning of a comment. Text after # on a line will not be executed.)

    Q1: what are the initial values of the parameters N0 (initial population size), p (initial frequency of 'A'), rate_A (growth rate, or fitness, of 'A'), and rate_a (growth rate of 'a').

  2. Run the simulation program:

    In one of your terminal windows, type the following:

    growth.py

    This command executes the program growth.py. For each generation the state of the population and the frequencies of the 'A' and 'a' alleles will be printed to the screen. The simulation runs for 20 generations, but you could change this by altering the max_gen parameter.

  3. Run the simulation again, saving output to file:

    growth.py > growth.res.1

    The symbol ">" causes the output to be "redirected" to a file named "growth.res.1" (you could have named it anything). Open the result file in nedit to verify that all output has been saved:

    nedit growth.res.1 &

    Q2: what is the total population size (Nt) at generation 0 and generation 20?

    Close the nedit window when you're done.

  4. Start the gnuplot program:

    We now want to plot the results of the first simulation. In the second terminal window type the following:

    gnuplot

    gnuplot is an interactive plotting program. You write commands at a prompt much like you do when entering UNIX commands. The program can be used for plotting data sets from a file and for plotting functions given in symbolic form.

  5. Quick tour of gnuplot features:

    plot x**2-20*x-40

    This plots the function y = x^2 - 20x - 40. As you can see, the default x-range for plotting is -10:10. Try changing this so we get the minimum of the curve in sight:

    plot [0:20] x**2-20*x-40

    (Note that, in gnuplot, previously entered commmands can be recalled by pressing the up-arrow. This is occasionally useful when you want to rerun, or slightly alter, previously issued plotting commands). The numbers enclosed in brackets specify the x-range (here 0 to 20). The y-range is automatically set based on the x-range, but may also be explicitly specified:

    plot [0:20][-160:0] x**2-20*x-40

    Finally, you also have the option of specifying legends and labels for the axes:

    set xlab "Branch length"
    plot [0:20][-160:0] x**2-20*x-40 title "Parabola 1"

    ("ylab" would have set the label for the y-axis). There are many many other commands that we will not go into here (type "help" for online help).

  6. Plot population sizes:

    At the gnuplot prompt type the following:

    set xlab "Generation no."
    plot 'growth.res.1' using 1:2 title "N" with linespoints, 'growth.res.1' using 1:3 title "N_A" with linespoints, 'growth.res.1' using 1:4 title "N_a" with linespoints

    The last command (which has to be entered on a single line) plots data from selected columns in the file 'growth.res.1'. For instance, "using 1:4" means that we should use the numbers in column 1 (generation number) for x-values, and the numbers in column 4 (number of organisms with allele 'a') for y-values.

    Specifically, we here plot total population size ("N"), the number of organisms with allele 'A' ("N_A"), and the number of organisms with allele 'a' ("N_a") for each generation. Recall that in this run the two alleles had the same fitness (rate_A = rate_a = 1.2) but different initial population sizes (p = 0.3, q = 0.7).

  7. Plot allele frequencies:

    At the gnuplot command type the following:

    plot [][0:1]'growth.res.1' u 1:5 tit "A freq" w linespoints, 'growth.res.1' u 1:6 tit "a freq" w linespoints

    Here we have plotted the frequencies of the two alleles for each simulated generation. The leading "[] [0:1]" indicate that we want to use the automatic x-range and that we want the y-axis to span the interval 0:1. Also note that gnuplot commands can be abbreviated ("u" for "using", "w" for "with", "tit" for "title").

    Q3: How do the frequencies of the two alleles behave for this case where the two alleles have the same fitness?

  8. Examine the impact of changing parameter values:

    Based on what you learned above, we now want you to solve the next problem on your own. Specifically, we want you to rerun the simulation and plot the population sizes and allele frequencies for each of the sets of parameter values listed below.

    A few helping remarks: use nedit to alter the relevant parameter values in the growth.py file and remember to save the file after making the alterations (File -> Save). Rerun the simulation program and save the output to a file with a unique name (this step should be performed in the second terminal window where you are not running gnuplot). Finally, plot population sizes and frequencies as above (remember to use the correct file name in the plot command).

    Q4: For each new set of parameter values write down the following (for start and stop values you can just read the approximate values off of the plot):

    1. run number
    2. the relative fitness of allele 'a' (using 'A' as a reference)
    3. the start (generation=0) and stop (generation=20) values of N_A and N_a
    4. the start and stop values of freq_A and freq_a.

    run nr.: 1	N0 = 50     rate_A = 1.2  rate_a = 1.2  p = 0.3
    run nr.: 2	N0 = 1000   rate_A = 0.7  rate_a = 0.7  p = 0.3
    run nr.: 3	N0 = 1000   rate_A = 2.0  rate_a = 1.5  p = 0.02
    run nr.: 4	N0 = 1000   rate_A = 1.2  rate_a = 0.9  p = 0.02
    run nr.: 5	N0 = 10000  rate_A = 0.8  rate_a = 0.6  p = 0.02
    

    Pay special attention to the frequency plots for run number 3, 4, and 5.

    Q5: plot the allele frequencies for run number 4 and 5 in a single graph and sketch this on your answer sheet.


Model Fit, Parameter Estimation

In this part of the exercise, we will briefly consider some aspects of the fit between models and reality.

  1. Have a look at the data file:

    nedit modelfit.data

    This file contains a set of (pseudo) empirical data: the size of a population (column labeled "N") for a number of generations (column labeled t).

  2. Plot data points:

    In the terminal window where you have gnuplot running, type the following:

    plot [-1:13] 'modelfit.data'
  3. Estimating parameters of exponential growth model from data:

    We will now assume that the population is growing exponentially according to this model: Nt = N0 * exp(r*t), where N0 is the initial population size, r is the instantaneous rate of increase, and t is the generation number. Our model thus has two "free parameters": N0 and r. You will now attempt to find a "good" set of values for the N0 and r parameters. We will take "good" values to be those that cause the theoretical curve to lie as close as possible to the observed data. This process is called model-fitting.

    There are actually several different ways of defining "as close as possible". One measure that turns out to be convenient is the "sum of squared errors" (SSE; the word "residual" is sometimes used instead of "error"). The approach is the following: for each x,y point in the data set, the difference between the observed y and the y-value predicted by the model is computed. This difference (the "error" or "residual") is then squared, and the sum of all the squared error terms is then taken to be an indication of how well the model fits the data. The best fitting model thus has the smallest possible SSE, and this approach is therefore referred to as "least squares model fitting". Another measure of model fit that we will return to later in the course is the model likelihood.

  4. Find a good set of parameter values:

    In order to fit a model to our data, we first have to define the model in the form of a function in gnuplot. We will then use gnuplot's built-in "fit" command to find the best (here meaning "least squares") estimates of the two parameters.

    First, define the model function by entering the following at your gnuplot prompt:

    f1(x) = N0 * exp(r * x)

    For the purpose of this run we will also provide initial guesses at the parameter values. (This is not strictly necessary - the gnuplot program will choose starting values if you don't provide them - but it is sometimes helpful if the fitting procedure gets stuck in a local optimum).

    N0 = 500
    r = 0.1
    plot [-1:13] 'modelfit.data', f1(x)

    The last command above plots the data along with our (not particularly well-fitting) first guess at a model that describes the data. You can try out a few different values of the "N0" and "r" parameters if you want to get a feeling for where the best values are.

    Now, fit the model to the data, and plot the resulting best-fit line, by issuing the following commands at your gnuplot prompt:

    fit f1(x) 'modelfit.data' via N0,r
    plot [-1:13] 'modelfit.data', f1(x)

    If you inspect the output you can see how the gnuplot program uses an iterative method to find the best parameter values (i.e., the parameter values that result in the smallest sum of squared errors). Notice how the line now is much closer (on average) to the data points.

    Q6:Report the "final set of parameters" and the "final sum of squares of residuals" of the fitted model (this is found near the end of the gnuplot output).

  5. Fit polynomial model:

    So far we have assumed that the exponential model was the best choice for describing the growth data. This model has two free parameters: "N0" and "r". Let us now consider an alternative model with 7 free parameters, namely the 6'th order polynomial f2(x)=a+bx+cx^2+dx^3+ex^4+fx^5+gx^6. The free parameters of this model are: a, b, c, d, e, f, and g. Note that we now have as many free parameters as data points.

    Using the same approach as we did above for an exponential model, you should now fit this polynomial model to the data (you may want to call this "f2" so it wont get mixed up with the exponential model, which we called f1). Note that in gnuplot you have to explicitly write mutiplication using "*", and exponentiation as "**"; for instance ex^4 would be written: e*x**4. As mentioned above, you do not have to specify initial guesses for the 7 parameter values, but you are allowed to if you want.

    Q7:Again, report the "final set of parameters" and the "final sum of squares of residuals" of the fitted model. Is the fit of the polynomial model better or worse than that of the exponential model?

    If you want to visually compare the fit of the two models, you can use the following command:

    plot 'modelfit.data', f1(x), f2(x)
  6. Assess predictive performance of polynomial and exponential models:

    Imagine that we have now collected one additional data point, and this shows us that the real observed population size at generation t=14 is N=11825. Now, use (1) the exponential model, f1, and (2) the polynomial model, f2, to predict the population size at generation 14. This is done by entering the value 14 for x in the model equations that you found by fitting. You can do this in gnuplot as follows (assuming that the model parameters have not been set to new values after fitting the two models):

    print f1(14)
    print f2(14)

    Q8:Report the predicted values. Based on the data point for t=14 which model do you now think has captured the biological reality best? SSE is a useful way of finding the best estimate of a set of parameter values, but do you think it is fair to directly compare the SSE's of two different types of model and why not?

    Later in the course we will deal with stringent ways of choosing a good model.