Coalescent simulation refers to the idea of simulating the evolution
of biological sequences like DNA by tracing their ancestry back in time.
The coala
package is an interface for calling a number of
commonly used coalescent simulators from R. It should be able to use the
simulator scrm
out of the box. Other simulators need to be
installed separately and must be activated explicitly. This is described
in the installation
vignette:
vignette("coala-install", package = "coala")
In this introduction we will stick to using scrm
.
In order to conduct simulations, we first need to specify the
components of the simulation model. The function coal_model
creates a basic coalescent model:
library(coala)
<- coal_model(sample_size = 3, loci_number = 1) model
This creates a basic model with one population of constant size. One genetic locus for three haplotypes is sampled from the population.
Printing the model gives a short summary of this content:
model
## Features: Sampling of 3 (pop 1) individuals with ploidy 1 at time `0`
##
## Parameter: None
##
## Loci: 1 locus of length 1000
##
## Summary Statistics: None
##
## Simulator: scrm
## Command: scrm 3 1
Models consist of
In order to simulate the model we need to add a few more features and at least one summary statistic.
For simulating sequences, we need to add mutations to the model. To
do so, we create a corresponding feature using
feat_mutation
and add it to the existing model using the
plus operator:
<- model + feat_mutation(rate = 1, model = "IFS")
model model
## Features:
## * Sampling of 3 (pop 1) individuals with ploidy 1 at time `0`
## * Mutations with rate `1` following a IFS mutation model
##
## Parameter: None
##
## Loci: 1 locus of length 1000
##
## Summary Statistics: None
##
## Simulator: scrm
## Command: scrm 3 1 -t 1
Now, mutations occur with rate 1 and according to an infinite-sites
mutation model (IFS). Details of rate and mutation model are given in
the help page of the feature (?feat_mutation
).
Coala currently support the features
## [1] "feat_growth" "feat_ignore_singletons" "feat_migration"
## [4] "feat_mutation" "feat_outgroup" "feat_pop_merge"
## [7] "feat_recombination" "feat_selection" "feat_size_change"
## [10] "feat_unphased"
However, not all combination of all features might be possible. Please refer to the features help pages for detailed information.
If we build a model consisting of multiple populations, we need to state the sample sizes as a vector of sample sizes for the different populations. The lines,
<- coal_model(sample_size = c(5, 2, 0), loci_number = 1) +
model feat_migration(rate = 0.5, symmetric = TRUE) +
feat_pop_merge(0.5, 3, 2) +
feat_pop_merge(0.8, 2, 1)
create a model of three populations, with a symmetric migration rate
of 0.5
between them. When viewed backwards in time,
population 3 merges into population 2 0.5
coalescent time
units in the past and population 2 into population 1 0.3
time units further into the past. Looking forwards in time, this
represents two speciation events with migration going on afterwards. At
time 0
five haploids are sampled from population 1 and two
from population 2. Please note that sample sizes for all populations
must be given, even if no haploid is sampled from a population, as it is
the case for population 3 here.
Adding summary statistics works in a similar fashion as adding features:
<- coal_model(3, 1) +
model feat_mutation(rate = 1) +
sumstat_seg_sites()
model
## Features:
## * Sampling of 3 (pop 1) individuals with ploidy 1 at time `0`
## * Mutations with rate `1` following a IFS mutation model
## * Generating Seg. Sites
##
## Parameter: None
##
## Loci: 1 locus of length 1000
##
## Summary Statistics: stat_segsites
##
## Simulator: scrm
## Command: scrm 3 1 -t 1
This adds the segregating sites summary statistic to the
model, which is a basic summary statistic in population genetics. Again,
refer to ?sumstat_seg_sites
for details.
Available summary statistics are:
## [1] "sumstat_class" "sumstat_dna" "sumstat_file"
## [4] "sumstat_four_gamete" "sumstat_ihh" "sumstat_jsfs"
## [7] "sumstat_mcmf" "sumstat_nucleotide_div" "sumstat_omega"
## [10] "sumstat_seg_sites" "sumstat_sfs" "sumstat_tajimas_d"
## [13] "sumstat_trees"
Now we can simulate the model. The printed output of a model contains information which program will be used for the simulation and which arguments will be used. As coala is in an early stage, please make sure to always check both.
The function simulate
will call the program with the
printed options, parse its output and calculated the added summary
statistics:
<- simulate(model, seed = 123) sumstats
The returned object sumstats
is a list, in which each
entry corresponds to one summary statistic. As there is only one summary
statistic in our model, the list has only one entry:
names(sumstats)
## [1] "seg_sites" "pars" "cmds" "simulator"
The structure in sumstats$seg_sites
is given by the
segregating sites statistic. It is again a list, where each entry
represents one locus. For each locus, it contains a matrix as specified
in ?sumstat_seg_sites
:
$seg_sites[[1]] sumstats
## 0.1041606 0.4223301
## [1,] 0 0
## [2,] 1 1
## [3,] 1 1
If we want to have more loci in a model, we can add them using the
locus_
functions. The most basic option is to add an
additional locus with a different length:
<- model + locus_single(500)
model model
## Features:
## * Sampling of 3 (pop 1) individuals with ploidy 1 at time `0`
## * Mutations with rate `1` following a IFS mutation model
## * Generating Seg. Sites
##
## Parameter: None
##
## Loci:
## * 1 locus of length 1000
## * 1 locus of length 500
##
## Summary Statistics: stat_segsites
##
## Simulator: scrm
## Command: scrm 3 2 -t 1
Now the model consists of two loci, the first with length 1000, the second with 500. Simulation now produces a segregating sites list with two entries corresponding to the loci:
<- simulate(model)
sumstats $seg_sites[[1]] sumstats
## 0.3144767 0.4034762
## [1,] 0 0
## [2,] 0 0
## [3,] 1 1
$seg_sites[[2]] sumstats
## 0.919158
## [1,] 1
## [2,] 0
## [3,] 0
Another possibility is to add multiple loci with the same length
using locus_averaged
, which gives better performance than
adding the loci one by one. For example
<- model + locus_averaged(2, 750)
model <- simulate(model)
sumstats length(sumstats$seg_sites)
## [1] 4
adds two more loci with length of 750bp to the model.
So far, we have used a model without parameters that can vary between simulations. In particular for fitting a model to data via ABC or Jaatha, it is useful to add parameters to a previous model instead of creating a new model for each simulation.
Named parameters values can be specified in the simulation command. If we want, for example, to launch simulations for a model with different values of the mutation rate, we can use a named parameter:
<- coal_model(5, 1) +
model feat_mutation(rate = par_named("theta")) +
sumstat_seg_sites()
<- simulate(model, pars = c(theta = 2.5))
sumstats1 <- simulate(model, pars = c(theta = 4.3)) sumstats2
A parameter distributed according to a prior can be specified using
the par_prior
function. The function’s first argument is a
name for the parameter, the second an expression that, when evaluated,
produces a sample from the prior distribution.
So if we want the mutation to follow a uniform distribution between
0
and 10
, we can use:
<- coal_model(5, 1) +
model feat_mutation(rate = par_prior("theta", runif(1, 0, 10))) +
sumstat_seg_sites()
<- simulate(model)
sumstats $pars sumstats
## theta
## 6.499852
<- simulate(model)
sumstats2 $pars sumstats2
## theta
## 4.778454
For simulations that will we used for parameter inference with the R
package jaatha
, you need to give a range of possible values
for each parameter. This is done using par_range
. For
instance
<- coal_model(5, 1) +
model feat_mutation(rate = par_range("theta", 0.1, 5)) +
sumstat_seg_sites()
sets a possible range from 0.1 to 5 for the
mutation rate. The actual rate is given in the simulate
function, just as with named parameters.
Finally, there is a very powerful type of parameters generated with
par_expr
. Similar to parameters with priors, the value of
the parameter is given as an R expression, which is evaluated before
simulation. Unlinke par_prior
, this expression can contain
other named parameters. For example
<- coal_model(4, 2) +
model feat_mutation(rate = par_named("theta")) +
feat_recombination(rate = par_expr(theta * 2))
creates a model with a a recombination rate that always is twice as high as the mutation rate.