Population Growth, Fitness, and Selection
Exercise written by:
Anders Gorm Pedersen
Overview
In this exercise you are going to use a simple computer program for simulating the
growth of a population. The simulation will be based on an exponential growth model and will
involve organisms having one of two different genotypes ('A' and 'a'). By performing the
simulation with a range of different model parameter values you will sharpen your intuitive
understanding of the dynamics of the model.
You will also get a brief introduction to assessing how well a model
describes some observed data (the "fit" of a model), and learn how model
parameters can be estimated from such data (the process known as "model
fitting").
Finally, you will start learning how to work in the UNIX environment that we will
use in the rest of the course (and that is alo used by bioinformaticians/systems biologists in the
real world). If you are interested in learning more about UNIX you can browse this
set of tutorials.
Log in to your Unix shell account on "padawan":
Connect to the server "padawan" using the ssh program as outlined on the
software page.
Construct today's working directory:
mkdir simulation
cd simulation
The mkdir (make directory) command constructs a new directory named
"simulation". Subsequently the cd (change directory) command is used to select the
newly constructed directory as "current working directory".
Copy required files:
cp /home/people/gorm/simulation/growth.py ./growth.py
cp /home/people/gorm/simulation/modelfit.data ./modelfit.data
ls l
The cp command copies the simulation program (growth.py) and a file
containing experimental data (modelfit.data), to your current working directory
(the entire path to the files was listed here).
The ls (list) command with option l (long) shows information about the files
present in the current directory.
Links to data (use these if you'd like to use the data, but you are not enrolled in the course):
For this part of the exercise you will need to use two terminal windows simultaneously.
Start by opening a second terminal window in the same way you did before (ssh to your
account on padawan). Again cd to the simulation directory.
Have a look at the simulation program:
nedit growth.py &
This command opens the file growth.py (which is a simple text file) in the nedit
editor. Having an ampersand ("&") at the end of the command makes nedit run in the
background.
The growth.py program is written in the programming language Python (warmly recommended by the way). You will probably
be able to understand most of what is going on even if you have no programming experience
(Note that hash signs  "#"  indicate the beginning of a comment. Text after # on a line
will not be executed.)
Q1: what are the initial values of the parameters N0 (initial population
size), p (initial frequency of 'A'), rate_A (growth rate, or fitness, of 'A'),
and rate_a (growth rate of 'a').
Run the simulation program:
In one of your terminal windows, type the following:
growth.py
This command executes the program growth.py. For each generation the state of the
population and the frequencies of the 'A' and 'a' alleles will be printed to the screen. The
simulation runs for 20 generations, but you could change this by altering the
max_gen parameter.
Run the simulation again, saving output to file:
growth.py > growth.res.1
The symbol ">" causes the output to be "redirected" to a file named "growth.res.1"
(you could have named it anything). Open the result file in nedit
to verify that all output has been saved:
nedit growth.res.1 &
Q2: what is the total population size (Nt) at generation 0 and generation 20?
Close the nedit window when you're done.
Start the gnuplot program:
We now want to plot the results of the first simulation. In the second terminal window
type the following:
gnuplot
gnuplot is an interactive
plotting program. You write commands at a prompt much like you do when entering
UNIX commands. The program can be used for plotting data sets from a file and
for plotting functions given in symbolic form.
Quick tour of gnuplot features:
plot x**220*x40
This plots the function y = x^2  20x  40. As you can see, the default xrange for
plotting is 10:10. Try changing this so we get the minimum of the curve in sight:
plot [0:20] x**220*x40
(Note that, in gnuplot, previously entered commmands can be recalled by pressing the
uparrow. This is occasionally useful when you want to rerun, or slightly alter, previously
issued plotting commands). The numbers enclosed in brackets specify the xrange (here 0 to
20). The yrange is automatically set based on the xrange, but may also be explicitly
specified:
plot [0:20][160:0] x**220*x40
Finally, you also have the option of specifying legends and labels for the axes:
set xlab "Branch length"
plot [0:20][160:0] x**220*x40 title "Parabola 1"
("ylab" would have set the label for the yaxis). There are many
many other commands that we will not go into here (type "help" for online
help).
Plot population sizes:
At the gnuplot prompt type the following:
set xlab "Generation no."
plot 'growth.res.1' using 1:2 title "N" with linespoints,
'growth.res.1' using 1:3 title "N_A" with linespoints,
'growth.res.1' using 1:4 title "N_a" with linespoints
The last command (which has to be entered on a single line) plots data from selected
columns in the file 'growth.res.1'. For instance, "using 1:4" means that we
should use the numbers in column 1 (generation number) for xvalues, and the numbers in
column 4 (number of organisms with allele 'a') for yvalues.
Specifically, we here plot total population size ("N"), the number of organisms with
allele 'A' ("N_A"), and the number of organisms with allele 'a' ("N_a") for each
generation. Recall that in this run the two alleles had the same fitness (rate_A = rate_a =
1.2) but different initial population sizes (p = 0.3, q = 0.7).
Plot allele frequencies:
At the gnuplot command type the following:
plot [][0:1]'growth.res.1' u 1:5 tit "A freq" w linespoints, 'growth.res.1' u 1:6 tit "a freq" w linespoints
Here we have plotted the frequencies of the two alleles for each simulated
generation. The leading "[] [0:1]" indicate that we want to use the automatic
xrange and that we want the yaxis to span the interval 0:1. Also note that gnuplot
commands can be abbreviated ("u" for "using", "w" for "with", "tit" for "title").
Q3: How do the frequencies of the two alleles behave for this case where the two alleles have
the same fitness?
Examine the impact of changing parameter values:
Based on what you learned above, we now want you to solve the next problem on your own.
Specifically, we want you to rerun the simulation and plot the population sizes and allele
frequencies for each of the sets of parameter values listed below.
A few helping remarks: use nedit to alter the relevant parameter values in the
growth.py file and remember to save the file after making the alterations (File > Save). Rerun the
simulation program and save the output to a file with a unique name (this step should be
performed in the second terminal window where you are not running gnuplot). Finally, plot
population sizes and frequencies as above (remember to use the correct file name in the plot
command).
Q4: For each new set of parameter values write down the following (for
start and stop values you can just read the approximate values off of the plot):
 run number
 the relative fitness of allele 'a' (using 'A' as a reference)
 the start (generation=0) and stop (generation=20) values of N_A and N_a
 the start and stop values of freq_A and freq_a.
run nr.: 1 N0 = 50 rate_A = 1.2 rate_a = 1.2 p = 0.3
run nr.: 2 N0 = 1000 rate_A = 0.7 rate_a = 0.7 p = 0.3
run nr.: 3 N0 = 1000 rate_A = 2.0 rate_a = 1.5 p = 0.02
run nr.: 4 N0 = 1000 rate_A = 1.2 rate_a = 0.9 p = 0.02
run nr.: 5 N0 = 10000 rate_A = 0.8 rate_a = 0.6 p = 0.02
Pay special attention to the frequency plots for run number 3, 4, and 5.
Q5: plot the allele frequencies for run number 4 and 5 in a single graph and
sketch this on your answer sheet.
Model Fit, Parameter Estimation
In this part of the exercise, we will briefly consider some aspects of the fit between
models and reality.
Have a look at the data file:
nedit modelfit.data
This file contains a set of (pseudo) empirical data: the size of a population (column
labeled "N") for a number of generations (column labeled t).
Plot data points:
In the terminal window where you have gnuplot running, type the following:
plot [1:13] 'modelfit.data'
Estimating parameters of exponential growth model from data:
We will now assume that the population is growing exponentially according to this
model: Nt = N0 * exp(r*t), where N0 is the initial population
size, r is the instantaneous rate of increase, and t is the generation
number. Our model thus has two "free
parameters": N0 and r. You will now attempt to find a "good" set of values for
the N0 and r parameters. We will take "good" values to be those that
cause the theoretical curve to lie as close as possible to the observed data. This
process is called modelfitting.
There are actually several different ways of defining "as close as possible". One
measure that turns out to be convenient is the "sum of squared errors" (SSE; the word
"residual" is sometimes used instead of "error"). The
approach is the following: for each x,y point in the data set, the difference between
the observed y and the yvalue predicted by the model is computed. This difference (the
"error" or "residual") is then squared, and the sum of all the squared error terms is then taken to
be an indication of how well the model fits the data. The best fitting model thus has
the smallest possible SSE, and this approach is therefore referred to as "least squares
model fitting". Another measure of model fit that we will return to later in the course
is the model likelihood.
Find a good set of parameter values:
In order to fit a model to our data, we first have to define the model in the form of a function in
gnuplot. We will then use gnuplot's builtin "fit" command to find the best (here meaning "least
squares") estimates of the two parameters.
First, define the model function by entering the following at your gnuplot prompt:
f1(x) = N0 * exp(r * x)
For the purpose of this run we will also provide initial guesses at the parameter values. (This is not
strictly necessary  the gnuplot program will choose starting values if you don't provide them 
but it is sometimes helpful if the fitting procedure gets stuck in a local optimum).
N0 = 500
r = 0.1
plot [1:13] 'modelfit.data', f1(x)
The last command above plots the data along with our (not particularly wellfitting)
first guess at a model that describes the data. You can try out a few different values
of the "N0" and "r" parameters if you want to get a feeling for where the
best values are.
Now, fit the model to the data, and plot the resulting bestfit line, by issuing the
following commands at your gnuplot prompt:
fit f1(x) 'modelfit.data' via N0,r
plot [1:13] 'modelfit.data', f1(x)
If you inspect the output you can see how the gnuplot program uses an iterative method
to find the best parameter values (i.e., the parameter values that result in the smallest sum of
squared errors). Notice how the line now is much closer (on average)
to the data points.
Q6:Report the "final set of parameters" and the "final sum of squares of residuals"
of the fitted model (this is found near the end of the gnuplot output).
Fit polynomial model:
So far we have assumed that the exponential model was the best choice for
describing the growth data. This model has two free parameters: "N0"
and "r". Let us now consider an alternative model with 7 free
parameters, namely the 6'th order polynomial
f2(x)=a+bx+cx^2+dx^3+ex^4+fx^5+gx^6. The free parameters of this model
are: a, b, c, d, e, f, and g. Note that we now have as many free parameters
as data points.
Using the same approach as we did above for an exponential model, you should
now fit this polynomial model to the data (you may want to call this "f2" so it wont
get mixed up with the exponential model, which we called f1). Note that in gnuplot you
have to explicitly write mutiplication using "*", and exponentiation as "**"; for instance
ex^4 would be written: e*x**4. As mentioned above, you do not have to specify initial
guesses for the 7 parameter values, but you are allowed to if you want.
Q7:Again, report the "final set of parameters" and the "final sum of squares of residuals"
of the fitted model. Is the fit of the polynomial model better or worse than that of the
exponential model?
If you want to visually compare the fit of the two models, you can use the following command:
plot 'modelfit.data', f1(x), f2(x)
Assess predictive performance of polynomial and exponential models:
Imagine that we have now collected one additional data point, and this
shows us that the real observed population size at generation t=14 is N=11825. Now, use (1)
the exponential model, f1, and (2) the polynomial model, f2, to
predict the population size at generation 14. This is done by entering the value 14 for x in
the model equations that you found by fitting. You can do this in gnuplot as follows (assuming
that the model parameters have not been set to new values after fitting the two models):
print f1(14)
print f2(14)
Q8:Report the predicted values. Based on the data point for t=14 which model do you
now think has captured the biological reality best? SSE is a useful way of finding the best
estimate of a set of parameter values, but do you think it is fair to directly compare the SSE's
of two different types of model and why not?
Later in the course we will deal with stringent ways of choosing a good model.
