Maximum likelihood inference of population size contractions from microsatellite data

Raphaël Leblois (a,b,f,∗), Pierre Pudlo (a,d,f), Joseph Néron (b), François Bertaux (b,e), Champak Reddy Beeravolu (a), Renaud Vitalis (a,f), François Rousset (c,f)

a INRA, UMR 1062 CBGP (INRA-IRD-CIRAD-Montpellier Supagro), Montpellier, France
b Muséum National d'Histoire Naturelle, CNRS, UMR OSEB, Paris, France
c Université Montpellier 2, CNRS, UMR ISEM, Montpellier, France
d Université Montpellier 2, CNRS, UMR I3M, Montpellier, France
e INRIA Paris-Rocquencourt, BANG team, Le Chesnay, France
f Institut de Biologie Computationnelle, Montpellier, France

∗ Corresponding author. Email address: [email protected] (Raphaël Leblois)

Preprint submitted as an article to Molecular Biology and Evolution, December 16, 2013

Abstract

Understanding the demographic history of populations and species is a central issue in evolutionary biology and molecular ecology. In the present work, we develop a maximum likelihood method for the inference of past changes in population size from microsatellite allelic data. Our method is based on importance sampling of gene genealogies, extended for new mutation models, notably the generalized stepwise mutation model (GSM). Using simulations, we test its performance to detect and characterize past reductions in population size. First, we test the estimation precision and confidence interval coverage properties under ideal conditions, then we compare the accuracy of the estimation with another available method (MsVar), and we finally test its robustness to misspecification of the mutational model and population structure. We show that our method is very competitive compared to alternative ones. Moreover, our implementation of a GSM allows more accurate analysis of microsatellite data, as we show that violations of a single step mutation assumption induce very high false bottleneck detection rates. However, our simulation tests also showed some important limits, most importantly large computation times for strong disequilibrium scenarios and a strong influence of some form of unaccounted population structure. This inference method is available in the latest implementation of the Migraine software package.

Keywords: demographic inference, maximum likelihood, coalescent, importance sampling, microsatellites, bottleneck, population structure, mutation processes

Non-standard abbreviations: IS (importance sampling), MCMC (Monte Carlo Markov Chain), ML (maximum likelihood), RMSE (root mean square error), KS (Kolmogorov-Smirnov test), LRT (likelihood ratio test), CI (confidence or credibility intervals), KAM (K-allele model), SMM (stepwise mutation model), GSM (generalized stepwise mutation model), SNP (single nucleotide polymorphism), IBD (isolation by distance), BDR (bottleneck detection rate), FBDR (false bottleneck detection rate), FEDR (false expansion detection rate)
1. Introduction

Understanding the demographic history of populations and species is a central issue in evolutionary biology and molecular ecology, e.g. for understanding the effects of environmental changes on the distribution of organisms. From a conservation perspective, a severe reduction in population size, often referred to as a "population bottleneck", increases the rate of inbreeding, the loss of genetic variation and the fixation of deleterious alleles, and thereby greatly reduces adaptive potential and increases the risk of extinction (Lande, 1988; Frankham et al., 2006; Keller and Waller, 2002; Reusch and Wood, 2007). However, characterizing the demographic history of a species with direct demographic approaches requires the monitoring of census data, which can be extremely difficult and time consuming (Williams, Nichols and Conroy, 2002; Schwartz, Luikart and Waples, 2007; Bonebrake et al., 2010). Moreover, direct approaches cannot give information about past demography from present-time data. A powerful alternative relies on population genetic approaches, which allow inferences on past demography from the observed present distribution of genetic polymorphism in natural populations (Schwartz, Luikart and Waples, 2007; Lawton-Rauh, 2008).
Until recently, most indirect methods were based on testing whether a given summary statistic (computed from genetic data) deviates from its expected value under an equilibrium demographic model (Cornuet and Luikart, 1996; Schneider and Excoffier, 1999; Garza and Williamson, 2001). Because of their simplicity, these methods have been widely used (see, e.g., Comps et al., 2001; Colautti et al., 2005, and the reviews of Spencer, Neigel and Leberg, 2000 and Peery et al., 2012). However, they estimate neither the severity of the bottleneck nor its age or duration.
Although much more mathematically difficult and computationally demanding, likelihood-based methods outperform these moment-based methods by considering all available information in the genetic data (see Felsenstein, 1992; Griffiths and Tavaré, 1994; Emerson, Paradis and Thébaud, 2001, and the review of Marjoram and Tavaré, 2006). Among others, the software package MsVar (Beaumont, 1999; Storz and Beaumont, 2002) has been increasingly used to infer past demographic changes. MsVar assumes a demographic model consisting of a single isolated population, which has undergone a change in effective population size at some time in the past. It is dedicated to the analysis of microsatellite loci that are assumed to follow a strict stepwise mutation model (SMM, Ohta and Kimura, 1973). In a recent study, Girod et al. (2011) evaluated the performance of MsVar by simulation. They showed that MsVar clearly outperforms moment-based methods in detecting past changes in population size, but appears only moderately robust to mis-specification of the mutational model: deviations from the SMM often induce "false" bottleneck detections on simulated samples from populations at equilibrium. Chikhi et al. (2010) also found a strong confounding effect of population structure on bottleneck detection using MsVar. Thus, departures from the mutational and demographic assumptions of the model appear to complicate the inference of past population size changes from genetic data.
The present work extends the importance sampling (IS) class of algorithms (Stephens and Donnelly, 2000; de Iorio and Griffiths, 2004a,b) to coalescent-based models of a single isolated population with past changes in population size. Moreover, in the spirit of de Iorio et al. (2005), we also provide explicit formulae for a generalized stepwise mutation model (GSM, Pritchard et al., 1999).
We conducted three simulation studies to test the efficiency of our methodology on past contractions (i.e., bottlenecks) and its robustness against mis-specifications of the model. The first study aims at showing the ability of the algorithm to detect bottlenecks and to recover the parameters of the model (i.e., the severity of the size change and its age) on a wide range of bottleneck scenarios. In the second study, we compare the accuracy of our IS implementation with the MCMC approach implemented in MsVar. The third study tests the robustness of our method against mis-specification of the mutation model, and against the existence of a population structure not considered in the model. All analyses in these studies were performed using the latest implementation of the Migraine software package, available at the web page kimura.univ-montp2.fr/~rousset/Migraine.htm.
2. New approaches

Our goal is to obtain maximum likelihood (ML) estimates for single population models with a past variation in population size, described in Subsection 2.1. To this end, we describe the successive steps of the inference algorithm (Subsections 2.2, 2.3).
2.1. Demographic model

We consider a single isolated population with past size changes (Fig. 1). We denote by N(t) the population size, expressed as the number of genes, t generations away from the sampling time t = 0. Population size at sampling time is N ≡ N(0). Then, going backward in time, the population size changes according to a deterministic exponential function until reaching an ancestral population size Nanc at time t = T. Then, N(t) remains constant, equal to Nanc for all t > T. More precisely,

\[
N(t) =
\begin{cases}
N \left( \dfrac{N_{\mathrm{anc}}}{N} \right)^{t/T}, & \text{if } 0 < t < T, \\
N_{\mathrm{anc}}, & \text{if } t > T.
\end{cases}
\tag{1}
\]

Figure 1: Representation of the demographic model used in the study. The Ns are population sizes, T is the time measured in generations since the present, and µ is the mutation rate of the marker used. These four parameters are the canonical parameters of the model. The θs and D are the inferred scaled parameters.
To ensure identifiability, the parameters of interest are scaled as θ ≡ 2Nµ, θanc ≡ 2Nancµ and D ≡ T/2N, where µ is the mutation rate per locus per generation. We are often interested in an extra composite parameter Nratio = θ/θanc, which is useful to characterize the strength of the bottleneck. Finally, in a few situations we also consider an alternative parameterization of the model using θ, θanc and D′ ≡ µT, for comparison between these two possible parameterizations.
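As a concrete reading of Eq. (1) and of the scaled parameters, the following Python sketch evaluates N(t) and converts canonical parameters into θ, θanc, D, Nratio and D′. The function names are hypothetical, and the mutation rate µ = 10⁻³ used in the example is an illustrative assumption, chosen so that N = 200 gives θ = 0.4.

```python
import math

def population_size(t, n_now, n_anc, t_change):
    """N(t) from Eq. (1): exponential change back to t_change, constant before."""
    if t < 0:
        raise ValueError("t is measured backward from sampling, so t >= 0")
    if t >= t_change:
        return float(n_anc)
    return n_now * (n_anc / n_now) ** (t / t_change)

def scaled_parameters(n_now, n_anc, t_change, mu):
    """Scaled parameters of Section 2.1 from the canonical ones."""
    theta = 2 * n_now * mu
    theta_anc = 2 * n_anc * mu
    return {
        "theta": theta,
        "theta_anc": theta_anc,
        "D": t_change / (2 * n_now),
        "Nratio": theta / theta_anc,
        "D_prime": mu * t_change,   # alternative time scaling D' = mu * T
    }

# Illustrative values only (mu is an assumption, not taken from the text).
params = scaled_parameters(n_now=200, n_anc=20_000, t_change=10, mu=1e-3)
```

With these inputs, Nratio is 0.01: the current population is one hundredth of the ancestral one, whatever the value assumed for µ.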
2.2. Computation of coalescent-based likelihood with importance sampling

Because the precise genetic history of the sample is not observed, the coalescent-based likelihood at a given point of the parameter space is an integral over all possible histories, i.e. genealogies with mutations, leading to the present genetic data. Following Stephens and Donnelly (2000) and de Iorio and Griffiths (2004a), the Monte Carlo scheme computing this integral is based here on importance sampling. The set of possible past histories is explored via an importance distribution depending on the demographic scenario and on the parameter values currently considered. The best proposal distribution to sample from is the importance distribution leading to a zero variance estimate of the likelihood. Here it amounts to the model-based distribution of gene histories conditional on the present genetic data, which corresponds to all backward transition rates between successive states of the histories. As computation of these backward transition rates is often too difficult, we substitute this conditional distribution with an importance distribution, and introduce a weight to correct for the discrepancy. Like the best proposal distribution, the actual importance distribution is a process describing changes in the ancestral sample configuration backward in time using absorbing Markov chains, but it does not lead to a zero variance estimate of the likelihood. A more efficient importance sampling proposal allows likelihoods to be estimated accurately from fewer histories for a given parameter value. Stephens and Donnelly (2000), de Iorio and Griffiths (2004a,b) and de Iorio et al. (2005) suggested efficient approximations that are easily computable. However, the efficiency of the importance distribution depends heavily on the demographic model and the current parameter value.
The first main difference between our algorithm and those described in de Iorio and Griffiths (2004a,b) is the time inhomogeneity induced by the disequilibrium of our demographic model. Indeed, the demographic models considered in the literature cited above and in Rousset and Leblois (2007, 2012) do not include any change in population size. To relax the assumption of time homogeneity in de Iorio and Griffiths (2004b), we modify their equations (see Tables 1 and 2 of de Iorio and Griffiths, 2004b), so that all quantities depending on the relative population sizes now vary over time because of the population size changes. Thus, we must keep track of time in the algorithm to assign the adequate value to those time-dependent quantities. To see how this is done, consider that the genealogy has been constructed until time T_k, and that, at this date, n ancestral lineages remain. Under the importance distribution, the occurrence rate of a mutation event is then nθ, and the occurrence rate of a coalescence event is n(n − 1)λ(t), where λ(t) = N/N(t) is the population size function introducing the disequilibrium. λ(t) corresponds to parameter 1/q in de Iorio and Griffiths (2004b). The total jump rate at time t ≥ T_k is then

\[ \Gamma(t) = n \left( (n - 1)\lambda(t) + \theta \right), \]

and the next event of the genealogy occurs at time T_{k+1}, whose distribution has density

\[ P(T_{k+1} \in dt) = \Gamma(t) \exp\left( -\int_{T_k}^{t} \Gamma(u)\, du \right) dt \quad \text{if } t \ge T_k. \]

Apart from these modifications, the outline of the IS scheme from de Iorio and Griffiths (2004b) is preserved (see section A.1 in the supplementary materials for more details).
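The waiting-time density above can be sampled by inverse transform: the integral of Γ from T_k to T_{k+1} follows a unit exponential distribution. The Python sketch below illustrates this for the exponential size change of Eq. (1). It is not the Migraine implementation; the function names are hypothetical, time is assumed to be measured on the same scale as D, and the root is found by plain bisection for simplicity.

```python
import math
import random

def cumulative_lambda(t, theta, theta_anc, D):
    """Integral of lambda(u) = N / N(u) over [0, t], time on the scale of D."""
    ratio = theta / theta_anc              # lambda(u) for u >= D
    b = math.log(theta_anc / theta) / D    # lambda(u) = exp(-b * u) for u < D

    def head(u):                           # integral over [0, u] for u <= D
        return u if b == 0.0 else (1.0 - math.exp(-b * u)) / b

    if t <= D:
        return head(t)
    return head(D) + ratio * (t - D)

def cumulative_rate(t_k, t, n, theta, theta_anc, D):
    """Integral of Gamma(u) = n * ((n - 1) * lambda(u) + theta) over [t_k, t]."""
    d_lambda = (cumulative_lambda(t, theta, theta_anc, D)
                - cumulative_lambda(t_k, theta, theta_anc, D))
    return n * ((n - 1) * d_lambda + theta * (t - t_k))

def next_event_time(t_k, n, theta, theta_anc, D, rng=random):
    """Draw T_{k+1} by solving cumulative_rate(t_k, t) = E, with E ~ Exp(1)."""
    target = rng.expovariate(1.0)
    # Gamma(u) >= n * theta, so the root lies below t_k + target / (n * theta).
    lo, hi = t_k, t_k + target / (n * theta)
    for _ in range(200):                   # bisection on a monotone function
        mid = 0.5 * (lo + hi)
        if cumulative_rate(t_k, mid, n, theta, theta_anc, D) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

When θ = θanc the rate is constant and the draw reduces to an ordinary exponential waiting time with rate n((n − 1) + θ), which gives a quick sanity check of the sketch.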
We also develop specific algorithms to analyze data under the generalized stepwise mutation model (GSM), with an infinite or finite number of alleles. This more realistic mutation model considers that multistep mutations occur, and the number of steps involved in each mutation can be modeled using a geometric distribution with parameter p. The original algorithm of Stephens and Donnelly (2000) covers any finite mutation model but requires numerical matrix inversions to solve a system of linear equations (see, e.g., Eqs (18) and (19) in Stephens and Donnelly, 2000). Time inhomogeneity requires matrix inversions each time the genealogy is updated by the IS algorithm. To bypass this difficulty, de Iorio et al. (2005) have successfully replaced the matrix inversions with Fourier analysis when considering a SMM with an infinite allele range. We extended this Fourier analysis to the case of a GSM with an infinite allele range. However, contrary to the SMM, the result of the Fourier analysis for the GSM is a very poor approximation for cases with a finite range of allelic states unless p is very small (e.g. p < 0.1). To consider a more realistic GSM with allele ranges of finite size, we propose to compute the relevant matrix inversions using a numerical decomposition in eigenvectors and eigenvalues of the mutation process matrix, P. Because the mutation model is not time-dependent, this last decomposition is performed only once for a given matrix P. See A.4 for details about the GSM implementation.
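To make the finite-range GSM concrete, the Python sketch below builds a mutation matrix P in which a mutation moves an allele k ≥ 1 repeats up or down with equal probability, with P(k) = (1 − p) p^(k−1). The symmetric direction, the boundary rule (steps leaving the allele range are dropped and the row is renormalized) and the function name are assumptions made for illustration; the actual treatment is described in A.4.

```python
def gsm_mutation_matrix(n_alleles, p):
    """Row-stochastic GSM matrix on alleles 0 .. n_alleles - 1.

    Step sizes are geometric, P(k) = (1 - p) * p**(k - 1), and the direction
    is up or down with probability 1/2. Steps that would leave the allele
    range are dropped and each row is renormalized (assumed boundary rule).
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    if n_alleles < 2:
        raise ValueError("at least two alleles are needed")
    matrix = []
    for i in range(n_alleles):
        row = [0.0] * n_alleles
        for j in range(n_alleles):
            if j != i:
                k = abs(i - j)                    # number of repeats moved
                row[j] = 0.5 * (1.0 - p) * p ** (k - 1)
        total = sum(row)
        matrix.append([x / total for x in row])   # renormalize the row
    return matrix

smm = gsm_mutation_matrix(5, 0.0)     # p = 0 gives single-step mutations
gsm = gsm_mutation_matrix(11, 0.22)   # a small multistep example
```

Because P does not depend on time, a decomposition P = V Λ V⁻¹ computed once gives (I − cP)⁻¹ = V (I − cΛ)⁻¹ V⁻¹ for any scalar c for which the inverse exists, so a change in c only changes a diagonal matrix. Whether the inversions needed by the algorithm take exactly this form is an assumption of this note.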
Finally, several approximations of the likelihood, using products of approximate conditional likelihoods (PAC, Cornuet and Beaumont, 2007) and analytical computation of the probability of the last pair of genes, have been successfully tested to speed up computation times (see section A.2 in the supplementary materials).
2.3. Inference method

Following Rousset and Leblois (2007, 2012), we first define a set of parameter points via a stratified random sample on the range of parameters provided by the user. Then, at each parameter point, the multilocus likelihood is the product of the likelihoods for each locus, which are estimated via the IS algorithm described above. The likelihood inferred at the different parameter points is then smoothed by a Kriging scheme (Cressie, 1993). After a first analysis of the smoothed likelihood surface, the algorithm can be repeated a second time to increase the density of the grid in the neighborhood of a first maximum likelihood estimate. Finally, one- and two-dimensional profile likelihood ratios are computed, to obtain confidence intervals and graphical outputs (e.g. Fig. 2). Section A.3 in the supplementary materials explains how we tuned the parameters of the algorithm, namely the range of parameters, the number of parameter points and the number of genealogical histories explored by the IS algorithm.
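The per-point likelihood computation described above can be sketched as follows: for each locus, the IS estimate is the average of the importance weights of the sampled histories, and the multilocus log-likelihood is the sum of the per-locus logs. This Python sketch assumes the weights are available on the log scale (a common choice to avoid underflow, not something stated in the text), and the function names are hypothetical.

```python
import math

def log_mean_exp(log_weights):
    """Log of the average of exp(log_weights), computed without underflow."""
    top = max(log_weights)
    total = sum(math.exp(w - top) for w in log_weights)
    return top + math.log(total / len(log_weights))

def multilocus_log_likelihood(log_weights_per_locus):
    """Sum over loci of the log of each locus' IS likelihood estimate."""
    return sum(log_mean_exp(w) for w in log_weights_per_locus)

# Two loci with two sampled histories each (made-up weights).
example = [
    [math.log(0.2), math.log(0.4)],   # IS estimate 0.3
    [math.log(0.1), math.log(0.1)],   # IS estimate 0.1
]
log_lik = multilocus_log_likelihood(example)
```

Summing logs is equivalent to taking the product of the per-locus likelihoods, but stays numerically stable when many loci are combined.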
A genuine issue, when facing genetic data, is to test whether the sampled population has undergone size changes or not. Thus, we derived a statistical test from the methodology presented above. It tests the null hypothesis that no size change occurred (i.e., N = Nanc) against alternatives such as a population decline or expansion (i.e., N ≠ Nanc). At level α, our test rejects the null hypothesis if and only if 1 lies outside the 1 − α confidence interval of the ratio Nratio = N/Nanc.
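The decision rule can be sketched in Python as follows. The sketch assumes that a profile log-likelihood for Nratio is available on a grid of values, and that the confidence interval is calibrated with the usual chi-square approximation with one degree of freedom; both the grid-based interval and the function names are illustrative assumptions rather than a description of Migraine.

```python
from statistics import NormalDist

def profile_ci(grid, profile_loglik, alpha=0.05):
    """Grid points whose profile likelihood ratio passes the LRT at level alpha."""
    # Chi-square(1) quantile at 1 - alpha, as the square of a normal quantile.
    chi2_quantile = NormalDist().inv_cdf(1.0 - alpha / 2.0) ** 2
    best = max(profile_loglik)
    inside = [x for x, ll in zip(grid, profile_loglik)
              if 2.0 * (best - ll) <= chi2_quantile]
    return min(inside), max(inside)

def detects_size_change(grid, profile_loglik, alpha=0.05):
    """Reject 'no size change' iff 1 lies outside the CI of Nratio."""
    low, high = profile_ci(grid, profile_loglik, alpha)
    return not (low <= 1.0 <= high)

# Made-up profile log-likelihoods on a coarse grid of Nratio values.
nratio_grid = [0.001, 0.01, 0.1, 1.0, 10.0]
contraction = detects_size_change(nratio_grid, [-3.0, 0.0, -1.0, -5.0, -9.0])
no_signal = detects_size_change(nratio_grid, [-4.0, -2.0, -1.0, 0.0, -1.0])
```

In the first example the interval is (0.01, 0.1), entirely below 1, so a contraction is detected; in the second the interval contains 1 and the null hypothesis is retained.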
All those developments are implemented in the Migraine software package. A detailed presentation of the simulation settings and validation procedures used to test the precision and robustness of the method is given in Section 5.
3. Results

3.1. Two contrasting examples

We begin with two contrasting simulated examples, presented in Figs. 2a and 2b. The first one, corresponding to our baseline simulation (θ = 0.4, D = 1.25 and θanc = 40.0, case [0]), is an ideal situation in which the inference algorithm performs well because of the large amount of information in the genetic data, resulting in a likelihood surface with clear peaks for all parameters around the maximum likelihood values. The bottleneck signal is highly significant and is clearly seen in the (θ, θanc) plot in Fig. 2a, as the maximum likelihood peak is above the 1:1 diagonal. The second example is a more difficult situation, where the population has undergone a much weaker contraction (θ = 0.4, D = 1.25 and θanc = 2.0, case [10]) that does not leave a clear signal in the genetic data. In such a situation, there is not much information on any of the three parameters, resulting in much flatter funnel- or cross-shaped two-dimensional likelihood surfaces. A bottleneck signal is visible on the cross-shaped (θ, θanc) plot in Fig. 2b, but is not significant.

[Figure 2, panels (a) case 0 and (b) case 10: two-dimensional profile likelihood ratio contours for each pair of 2Nµ, D and 2Nancµ, all on log scales.]

Figure 2: Examples of two-dimensional profile likelihood ratios for two data sets generated with (a) θ = 0.4, D = 1.25, θanc = 40.0 (case [0]) and (b) θ = 0.4, D = 1.25, θanc = 2.0 (case [10]). The likelihood surface is inferred from (a) 1,240 points in two iterative steps and (b) 3,720 points in three iterative steps, as described in A.3. The likelihood surface is shown only for parameter combinations that fell within the envelope of parameter points for which likelihoods were estimated. The cross denotes the maximum.
3.2. Implementation and efficiency of IS on time-inhomogeneous models

Simulation tests show that our implementation of de Iorio and Griffiths' IS algorithm for a model of a single population with past changes in population size and stepwise mutations is very efficient under most demographic situations tested here. Similar results are obtained for two different approximations of the likelihood (see section A.2 in the supplementary materials). First, computation times are reasonably short: for a single data set with one hundred gene copies and ten loci, analyses are done within a few hours to three days on a single processor, even for the longer analysis with four parameters under the GSM. Second, likelihood ratio test (LRT) P-value distributions generally indicate good CI coverage properties (see Section 5.2). Cumulative distributions of the LRT P-values for all scenarios, shown in section C in the supplementary materials, are most of the time close to the 1:1 diagonal, as shown in Fig. 3a for our baseline scenario.
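The coverage check described above compares the empirical distribution of LRT P-values, computed at the true parameter value across simulated data sets, with the uniform distribution. The Python sketch below computes the Kolmogorov-Smirnov distance to the 1:1 diagonal and the empirical coverage of the 1 − α confidence interval. It only returns the KS statistic, not the associated P-value, and the function names and example values are hypothetical.

```python
def ks_statistic_uniform(p_values):
    """Largest gap between the ECDF of p_values and the Uniform(0, 1) CDF."""
    xs = sorted(p_values)
    n = len(xs)
    # The ECDF jumps from i / n to (i + 1) / n at xs[i]; check both sides.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

def ci_coverage(p_values, alpha=0.05):
    """Fraction of data sets whose 1 - alpha CI contains the true value.

    The CI contains the true value when the LRT P-value at that value
    exceeds alpha.
    """
    return sum(p > alpha for p in p_values) / len(p_values)

# Made-up P-values from five simulated data sets.
lrt_p_values = [0.9, 0.1, 0.5, 0.3, 0.7]
distance = ks_statistic_uniform(lrt_p_values)
coverage = ci_coverage(lrt_p_values, alpha=0.2)
```

A small distance and a coverage close to 1 − α both indicate that the P-values behave as expected under a well-calibrated likelihood ratio test.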
[Figure 3: empirical cumulative distribution functions (ECDF) of LRT P-values for 2Nµ, D, 2Nancµ and Nratio, each panel annotated with relative bias, relative RMSE and KS test result. (a) case 0: 2Nµ = 0.4, D = 1.25, 2Nancµ = 40, Nratio = 0.01. (b) case 10: 2Nµ = 0.4, D = 1.25, 2Nancµ = 2, Nratio = 0.2.]
[...] 100.0), small scale samples show low BDRs, below 10%, and accurate BDRs are only obtained using large sampling scales. Parameter inference appears highly biased for all levels of gene flow considered in this study. As for BDRs, the best precision is obtained when gene flow is high and sampling scale is large. Nevertheless, for all other situations, relative biases and RRMSEs are high, suggesting that in most situations limited gene flow between geographically distinct demes will lead to erroneous inferences of past and present population sizes, and of the timing of the demographic change.
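The summary measures used in this comparison can be sketched in Python as below. The definitions are the usual ones (mean relative error, root mean square of the relative error, and proportion of data sets in which a bottleneck is detected) and are assumed here rather than taken from the text; the function names and example values are hypothetical.

```python
import math

def relative_bias(estimates, true_value):
    """Mean of (estimate - true) / true over simulated data sets."""
    return sum((e - true_value) / true_value for e in estimates) / len(estimates)

def relative_rmse(estimates, true_value):
    """Square root of the mean squared relative error."""
    squared = [((e - true_value) / true_value) ** 2 for e in estimates]
    return math.sqrt(sum(squared) / len(squared))

def detection_rate(detected_flags):
    """Proportion of data sets in which a size change was detected."""
    return sum(bool(flag) for flag in detected_flags) / len(detected_flags)

# Made-up estimates of theta for a true value of 0.4.
bias = relative_bias([0.2, 0.6], 0.4)
rrmse = relative_rmse([0.2, 0.6], 0.4)
bdr = detection_rate([True, False, True, True])
```

In this toy example the two estimates err by the same amount in opposite directions, so the relative bias is zero while the relative RMSE is 0.5, which shows why both measures are reported together.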
Such confounding effects of population structure and past changes in population size have already been observed. First, the effect of small scale IBD population structure on BDRs obtained with the Bottleneck and M-Ratio software has been tested by simulations in Leblois, Estoup and Streiff (2006). Our results are globally in agreement with this previous study, except that they found large FEDRs when using Bottleneck on IBD samples, and that considering large scale samples made FEDRs even larger. The finding that fine scale population structure induces false expansion signals was also previously stressed by Ptak and Przeworski (2002) in the context of sequence data analysis based on Tajima's D statistic. Our simulations, on the contrary, show a non-null but small FEDR in the presence of small scale IBD structure.
Second, the effect of island population structure on past population size inference was first highlighted by simulation in Nielsen and Beaumont (2009). More recently, Peter, Wegmann and Excoffier (2010), Chikhi et al. (2010) and Heller, Chikhi and Siegismund (2013) also showed that analyzing samples drawn from a single deme of an island model with low to intermediate migration rates (i.e. Nm < 5) leads to false signals of bottleneck. Such erroneous imputations can be understood by considering the genealogical processes in an island model and in a single population with varying size. In a subdivided population with relatively small deme sizes and small migration rates, the genealogy of a sample taken from a single deme will show (1) many short branches for genes that rapidly coalesce within the deme in which they were sampled (i.e. before any migration event), which corresponds to the "scattering phase" described in Wakeley (1999); and (2) a few much longer branches for genes that coalesce after any emigration or immigration event from the deme sampled, which is the "collecting phase" of Wakeley (1999). The result is a genealogy with an excess of short terminal branches, as expected after a recent contraction in population size. However, if only one individual is taken from each of several demes, and/or if deme size or migration rates are large, the genealogical process becomes closer to the one expected under a Wright-Fisher (WF) population. Similarly, when gene flow is very limited, the ancestry of a sample coming from a single deme will also be very similar to the one expected under the WF model. Thus, except for limit cases, structured and declining population scenarios may result in more or less similar genealogies, depending on deme sizes, migration rates and sampling scale. The expected influence of these three factors may strongly complicate the study of the effect of population structure on the inference of past population size. This can be noticed in the heterogeneity of the results of the different simulation studies available. All those comparisons based on different simulations of structured populations show that the effect of population structure is generally complex, and will be quite difficult to predict except in a few simple cases. Those results also show that verbal argumentation based on over-simplified past genealogical processes may not always give the right prediction. Nevertheless, the following main points arise from those simulation studies and can serve as guidelines for empirical studies: (1) using a large sample scale strongly limits the influence of population structure on the inference of past population size variations, as advocated by Chikhi et al. (2010), but allows correct inference only when a single individual (ideally, a single gene) is sampled per deme or when migration rates are relatively high, i.e. M > 10.0; (2) for all other demographic situations, detection of past population size changes and parameter inferences based on panmictic models may often be misleading. However, we did not consider here sampling a single individual per deme, which may more effectively decrease the bias due to population structure.
Such results finally imply that the models themselves should be improved. First, model choice procedures should be developed to evaluate whether observed patterns of genetic diversity can be better explained by a model of population size change or by a model of subdivided populations. For example, Peter, Wegmann and Excoffier (2010) used an Approximate Bayesian Computation (ABC) model choice approach to distinguish between structured populations and panmictic populations that have undergone past changes in size. However, they show by simulation that their model choice procedure has relatively limited power to assign simulated data sets to the correct evolutionary model, even with a relatively large number of loci (e.g. 60% to 85.5% with 10 to 200 loci, respectively). An alternative is to develop models accounting for both population structure and population size changes, which would probably be more realistic for most species/populations, but the only available method (Hey and Nielsen, 2007; Hey, 2010) has never been tested for scenarios with both structured populations and past changes in population sizes.
4.5. Conclusion

This work shows that our new inference method seems very competitive compared to alternative methods, such as MsVar. However, our simulation tests also showed some important limits, most importantly large computation times for strong disequilibrium scenarios and a strong influence of some form of unaccounted population structure. One first major improvement would thus be to speed up the analyses. Among the different possibilities, a relatively simple improvement would be to choose more efficiently the number of explored histories for each point of the parameter space. A more attractive improvement would be to design more efficient IS algorithms for time-inhomogeneous models. However, various unsuccessful attempts suggest that it may be a difficult task (not shown). A second major improvement would be to include population structure in the demographic model for simultaneous inference of migration rates and past population size change, or to develop model choice procedures.
Lastly, given the current revolution in genetic data production due to next generation sequencing technologies (NGS), it seems crucial to allow for the analysis of different types of independent markers, such as small DNA sequences without intra-locus recombination, or SNPs. Given the relatively large computation times of our method, all analyses will clearly only be tractable for a limited number of markers (e.g. < 10,000), but could nevertheless give very precise inferences. However, considering only independent markers is probably not the optimal approach, as NGS make it possible to apply a new class of methods based on the analysis of linkage disequilibrium for past demographic inferences. Such methods are based on the computation of the distribution of non-recombining haplotype block length (e.g. Meuwissen and Goddard, 2007; Albrechtsen et al., 2009; Gusev et al., 2012; Palamara et al., 2012; Theunert et al., 2012) or explicitly model the spatial dependence of markers using hidden Markov models (e.g. Dutheil et al., 2009; Mailund et al., 2012). They will probably play a major role in the future of population genetic demographic and historical inferences.
5. Methods

5.1. Simulation study

Table 7: Simulated demographic scenarios with a stepwise mutation model

Case    D (T)            θ (N)        θanc (Nanc)
[0]     1.25 (200)       0.4 (200)    40.0 (20,000)
[1]     0.025 (10)       0.4 (200)    40.0 (20,000)
[2]     0.0625 (25)      0.4 (200)    40.0 (20,000)
[3]     0.125 (50)       0.4 (200)    40.0 (20,000)
[4]     0.25 (100)       0.4 (200)    40.0 (20,000)
[5]     0.5 (200)        0.4 (200)    40.0 (20,000)
[6]     2.5 (1,000)      0.4 (200)    40.0 (20,000)
[7]     3.5 (1,400)      0.4 (200)    40.0 (20,000)
[8]     5 (2,000)        0.4 (200)    40.0 (20,000)
[9]     7.5 (3,000)      0.4 (200)    40.0 (20,000)
[10]    1.25 (200)       0.4 (200)    2.0 (1,000)
[11]    1.25 (200)       0.4 (200)    4.0 (2,000)
[12]    1.25 (200)       0.4 (200)    8.0 (4,000)
[13]    1.25 (200)       0.4 (200)    12.0 (6,000)
[14]    1.25 (200)       0.4 (200)    24.0 (16,000)
[15]    1.25 (200)       0.4 (200)    120.0 (60,000)
[16]    1.25 (200)       0.4 (200)    400.0 (200,000)
579
A first set of simulations aims at testing the power of the algorithm to detect bottlenecks and the accuracy of the parameter estimates when the duration (D) or the strength of the contraction (θanc) varies. The mutation process considered is a SMM over a range of 200 alleles. These experiments are presented in Table 7. We also reanalyzed the sixty simulated data sets from Girod et al. (2011) to compare the results obtained with MsVar and our own estimates. The latter simulated data sets are described in Table S2 and the comparison results are presented in section B in the supplementary materials.
Figure 6: Simulated data sets with population structure

Local population structure. The simulated IBD populations are composed of individuals set at the nodes of a regular lattice, whose size can vary. A past reduction in population size is thus modeled as a reduction of the habitat area keeping a constant density of individuals. Various levels of localized dispersal were simulated via truncated Pareto distributions with mean squared parent-offspring dispersal distance, say σ², varying in {1; 4; 10; 100}.

Parameters of the IBD populations
• At equilibrium: θ = 4.0 with a 32 × 31 lattice (hence N = 1984 genes)
• Including a habitat contraction: (D, θ, θanc) = (1.25, 0.4, 40.0) with lattices of sizes from 10 × 10 (N = 200) to 100 × 100 (Nanc = 20,000) backward in time

Simulated sampling schemes: 100 genes sampled
• on a 5 × 10 lattice in the center of the population [small sample scale], or
• regularly on the whole area (i.e., one individual every 4 nodes) [large sample scale].

Island population structure. We considered models with d = 10 demes of equal size Nd genes, varying in {20; 200; 2000}, and exchanging migrants at rate m between pairs of demes, varying in {0.000025; 0.00025; 0.0025; 0.025; 0.075; 0.25}. The model is fully characterized by the scaled parameters θ = 2dNdµ and M = 2Ndm. When past contractions occurred, deme sizes Nd decreased forward in time but migration rates m were kept constant in time. Values of M reported below correspond to scaled migration rates at sampling time t = 0.

Parameters of the island populations
• θ ∈ {4.0, 20.0} and M ∈ {0.01, 0.1, 1.0, 10.0, 30.0, 100.0} without population size changes
• θ = 0.4, M ∈ {0.01, 1.0, 100.0} and a contraction with parameters D = 1.25, θanc = 40.0

Simulated sampling schemes: samples of 100 genes picked at random
• from a single deme [small sample scale], or
• from three demes [large sample scale], or
• from all demes [very large sample scale].

A second set of simulations concerns robustness and accuracy related to mutation processes of microsatellites that are known to be highly complex (Ellegren, 2000, 2004; Sun et al., 2012). This second set of simulations is thus based on a generalized stepwise mutation model (GSM) with either p = 0.22 or p = 0.74, which are respectively the value commonly considered as a realistic average in the literature (Dib et al., 1996; Ellegren, 2000; Estoup et al., 2001; Ellegren, 2004) and the largest value ever reported (Fitzsimmons, 1998; Peery et al., 2012). We have also added data sets drawn with the K-allele model (KAM) to those simulations, which might be seen as a GSM with p = 1.0. A first set of analyses tests the robustness to mis-specification of the mutation process: we simulated under a GSM but inferred under a SMM. A second set of analyses tests the accuracy of the estimates when the inference algorithm is based on a GSM with unknown value of p.
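To give a feel for what the two GSM values of p imply, the following minimal sketch tabulates the step-size distribution under the geometric parameterization that is commonly used for the GSM, P(step = k) = (1 − p) p^(k−1) for k ≥ 1. This parameterization is an assumption (the section does not spell it out), the function names are hypothetical, and the bounded allele range is ignored.

```python
def gsm_step_pmf(p, k):
    """P(step size = k repeat units), assuming a geometric GSM: (1 - p) * p**(k - 1)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    if k < 1:
        raise ValueError("k must be a positive integer")
    return (1.0 - p) * p ** (k - 1)


def gsm_mean_step(p):
    """Mean step size 1 / (1 - p) of the geometric distribution above."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    return 1.0 / (1.0 - p)


# p = 0 puts all the mass on single steps (the SMM); larger p allows longer jumps.
for p in (0.0, 0.22, 0.74):
    print(f"p = {p:.2f}: P(single step) = {gsm_step_pmf(p, 1):.2f}, "
          f"mean step = {gsm_mean_step(p):.2f}")
```

Under this assumed parameterization, p = 0.22 leaves most mutations as single steps, whereas p = 0.74 makes multi-step changes the majority, which is why inferring under a SMM is a strong mis-specification in the latter case.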
The aim of the third set of simulations is to test robustness against a population structure that is ignored by the inference algorithm. All the data sets in this last series were simulated under a GSM with p = 0.22 and are presented in Table 6. A first group of data sets simulates local within-population structure according to an isolation-by-distance (IBD) model. It thus aims at testing the robustness of the inferences to the assumption of panmixia by considering non-random mating due to spatially localized parent-offspring dispersal. The second group of data sets simulates both within- and among-population structure at a larger spatial scale according to an island model.
For each scenario, we simulated 200 multilocus data sets. Each simulated data set is a sample of ng = 100 genes (or haploid individuals), genotyped at nℓ = 10 unlinked microsatellite loci, except for a few situations where we indicate that 25 or 50 instead of 10 loci are used. The mutation rate per gene per generation, say µ, is assumed to be constant for all loci, equal to 10⁻³. All simulated samples, except data sets from Girod et al. (2011) (see section B in the supplementary materials), have been produced with a new version of the IBDSim software (Leblois, Estoup and Rousset, 2009) that considers continuous changes of population sizes.
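As a quick consistency check on the simulation settings, the scaled parameters can be recomputed from the unscaled ones. The sketch below uses the island-model definitions given with Figure 6, θ = 2dNdµ and M = 2Ndm, with µ = 10⁻³; the function names are hypothetical, and treating a single panmictic population as the case d = 1 is an assumption made for illustration.

```python
MU = 1e-3  # mutation rate per gene per generation, as stated in the text


def island_theta(n_demes, deme_size, mu=MU):
    """Scaled mutation rate theta = 2 * d * N_d * mu (deme_size counted in genes)."""
    return 2.0 * n_demes * deme_size * mu


def island_M(deme_size, m):
    """Scaled migration rate M = 2 * N_d * m."""
    return 2.0 * deme_size * m


# Island model with d = 10 demes of N_d = 200 genes and m = 0.0025.
print(island_theta(10, 200), island_M(200, 0.0025))

# Single population (d = 1, an assumption): N = 200 and Nanc = 20,000 genes.
print(island_theta(1, 200), island_theta(1, 20_000))
```

With these inputs the island settings give θ ≈ 4.0 and M ≈ 1.0, and the single-population case recovers θ ≈ 0.4 and θanc ≈ 40.0, matching the values listed in Table 7.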
5.2. Validation

In all simulation experiments, the true (simulated) values of the parameters of interest are compared to the estimated values. The estimation bias and error, assessed by the relative mean bias and relative root mean square error (RRMSE), are reported, as well as the proportion of data sets for which a bottleneck or a false expansion signal is significantly detected (BDR and FEDR, respectively). Furthermore, the accuracy of the inference methodology is assessed by means of profile likelihood ratio tests (LRTs; Cox and Hinkley, 1974; Severini, 2000). The coverage properties of the confidence intervals computed from the smoothed likelihood surface are tested via the distributions of LRT p-values, which should be asymptotically uniform. The departure from uniformity is tested by Kolmogorov–Smirnov tests, notably to check the validity of the implementation of the inference method and to assess the different factors that can affect likelihood surface inference.
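The summary statistics used for validation can be sketched as follows. This is an illustration only: the definitions of relative bias and RRMSE are the standard ones and are assumed rather than taken from the text, the function names are hypothetical, and only the Kolmogorov–Smirnov distance to the uniform distribution is computed, not its p-value.

```python
import math


def relative_bias(estimates, true_value):
    """Mean of (estimate - truth) / truth over the simulated data sets."""
    return sum((e - true_value) / true_value for e in estimates) / len(estimates)


def rrmse(estimates, true_value):
    """Square root of the mean squared relative error."""
    squared = [((e - true_value) / true_value) ** 2 for e in estimates]
    return math.sqrt(sum(squared) / len(squared))


def ks_statistic_uniform(p_values):
    """Largest gap between the empirical CDF of the p-values and the U(0, 1) CDF."""
    xs = sorted(p_values)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


theta_hat = [0.3, 0.5]           # toy estimates of theta, true value 0.4
lrt_p = [0.125, 0.375, 0.625, 0.875]  # toy LRT p-values
print(relative_bias(theta_hat, 0.4), rrmse(theta_hat, 0.4), ks_statistic_uniform(lrt_p))
```

A small KS distance indicates that the LRT p-values are close to uniform, which is the behaviour expected when the confidence intervals have correct coverage.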
The supplementary materials are available at the XXX website. The Migraine software, which implements the methods described above, can be downloaded from the web site kimura.univ-montp2.fr/~rousset/Migraine.htm.
Acknowledgements

We are grateful to L. Chikhi, J.-M. Cornuet, A. Estoup, J.-M. Marin for their constructive discussions about this work. This study was supported by the Agence Nationale de la Recherche (EMILE 09-blan-0145-01 and [email protected] 2010-BLAN-1726-01 projects) and by the Institut National de Recherche en Agronomie (Project INRA Starting Group “IGGiPop”). Part of this work was carried out by using the resources of the Computational Biology Service Unit from the MNHN (CNRS Unité Mixte de Service 2700), the INRA MIGALE and GENOTOUL bioinformatics platforms and the computing grids of ISEM and CBGP labs.
References

Albrechtsen A, Sand Korneliussen T, Moltke I, van Overseem Hansen T, Nielsen FC, Nielsen R, 2009. Relatedness mapping and tracts of relatedness for genome-wide data in the presence of linkage disequilibrium. Genetic Epidemiology, 33:266–274.
Beaumont M, 1999. Detecting population expansion and decline using microsatellites. Genetics, 153:2013–2029.
Beerli P, Felsenstein J, 2001. Maximum likelihood estimation of a migration matrix and effective population sizes in n subpopulations by using a coalescent approach. Proc. Natl. Acad. Sci. U. S. A., 98:4563–4568.
Bhargava A, Fuentes F, 2010. Mutational dynamics of microsatellites. Molecular biotechnology, 44:250–266.
Bonebrake T, Christensen J, Boggs C, Ehrlich P, 2010. Population decline assessment, historical baselines, and conservation. Conservation Letters, 3:371–378.
Chikhi L, Sousa VC, Luisi P, Goossens B, Beaumont MA, 2010. The confounding effects of population structure, genetic diversity and the sampling scheme on the detection and quantification of population size changes. Genetics, 186:983–995.
Colautti RI, Manca M, Viljanen M, Ketelaars HAM, Bürgi H, Macisaac HJ, Heath DD, 2005. Invasion genetics of the Eurasian spiny waterflea: evidence for bottlenecks and gene flow using microsatellites. Mol Ecol, 14:1869–1879.
Comps B, Gömöry D, Letouzey J, Thiébaut B, Petit RJ, 2001. Diverging trends between heterozygosity and allelic richness during postglacial colonization in the European beech. Genetics, 157:389–397.
Cornuet JM, Beaumont MA, 2007. A note on the accuracy of PAC-likelihood inference with microsatellite data. Theor. Popul. Biol., 71:12–19.
Cornuet JM, Luikart G, 1996. Description and power analysis of two tests for detecting recent population bottlenecks from allele frequency data. Genetics, 144:2001–2014.
Cornuet JM, Santos F, Beaumont MA, Robert CP, Marin JM, Balding DJ, Guillemaud T, Estoup A, 2008. Inferring population history with DIY ABC: a user-friendly approach to approximate Bayesian computation. Bioinformatics, 24:2713–2719.
Cox DR, Hinkley DV, 1974. Theoretical statistics. London: Chapman & Hall.
Cressie NAC, 1993. Statistics for spatial data. New York: Wiley.
de Iorio M, Griffiths RC, 2004a. Importance sampling on coalescent histories. Advances in Applied Probabilities, 36:417–433.
de Iorio M, Griffiths RC, 2004b. Importance sampling on coalescent histories. II. Subdivided population models. Advances in Applied Probabilities, 36:434–454.
de Iorio M, Griffiths RC, Leblois R, Rousset F, 2005. Stepwise mutation likelihood computation by sequential importance sampling in subdivided population models. Theoretical Population Biology, 68:41–53.
Dib C, Fauré S, Fizames C, et al. (14 co-authors), 1996. A comprehensive genetic map of the human genome based on 5,264 microsatellites. Nature, 380:152–154.
Drummond AJ, Suchard MA, Xie D, Rambaut A, 2012. Bayesian phylogenetics with BEAUti and the BEAST 1.7. Molecular Biology and Evolution, 29:1969–1973.
Dutheil JY, Ganapathy G, Hobolth A, Mailund T, Uyenoyama MK, Schierup MH, 2009. Ancestral population genomics: the coalescent hidden Markov model approach. Genetics, 183:259–274.
Ellegren H, 2000. Microsatellite mutations in the germline: implications for evolutionary inference. Trends Genet., 16:551–558.
Ellegren H, 2004. Microsatellites: simple sequences with complex evolution. Nat Rev Genet, 5:435–445.
Emerson B, Paradis E, Thébaud C, 2001. Revealing the demographic histories of species using DNA sequences. Trends in Ecology and Evolution, 16:707–716.
Estoup A, Wilson IJ, Sullivan C, Cornuet JM, Moritz C, 2001. Inferring population history from microsatellite and enzyme data in serially introduced cane toads, Bufo marinus. Genetics, 159:1671–1687.
Faurby S, Pertoldi C, 2012. The consequences of the unlikely but critical assumption of stepwise mutation in the population genetic software, MSVAR. Evolutionary Ecology Research, 14:859–879.
Felsenstein J, 1992. Estimating effective population size from sample sequences - Inefficiency of pairwise and segregating sites as compared to phylogenetic estimates. Genetical Research, 59:139–147.
Fitzsimmons NN, 1998. Single paternity of clutches and sperm storage in the promiscuous green turtle (Chelonia mydas). Mol Ecol, 7:575–584.
Frankham R, Lees K, Montgomery M, England P, Lowe E, Briscoe D, 2006. Do population size bottlenecks reduce evolutionary potential? Animal Conservation, 2:255–260.
Garza JC, Williamson EG, 2001. Detection of reduction in population size using data from microsatellite loci. Molecular Ecology, 10:305–318.
Girod C, Vitalis R, Leblois R, Fréville H, 2011. Inferring population decline and expansion from microsatellite data: a simulation-based evaluation of the Msvar method. Genetics, 188:165–179.
Gonser R, Donnelly P, Nicholson G, Di Rienzo A, 2000. Microsatellite mutations and inferences about human demography. Genetics, 154:1793–1807.
Griffiths RC, Tavaré S, 1994. Ancestral inference in population genetics. Statistical Science, 9:307–319.
Guillot G, Leblois R, Coulon A, Frantz AC, 2009. Statistical methods in spatial genetics. Molecular Ecology, 18:4734–4756.
Gusev A, Palamara PF, Aponte G, Zhuang Z, Darvasi A, Gregersen P, Pe’er I, 2012. The architecture of long-range haplotypes shared within and across populations. Molecular biology and evolution, 29:473–486.
Heller R, Chikhi L, Siegismund HR, 2013. The confounding effect of population structure on Bayesian skyline plot inferences of demographic history. PloS one, 8:e62992.
Hey J, 2010. Isolation with migration models for more than two populations. Mol Biol Evol, 27:905–920.
Hey J, Nielsen R, 2004. Multilocus methods for estimating population sizes, migration rates and divergence time, with applications to the divergence of Drosophila pseudoobscura and D. persimilis. Genetics, 167:747–760.
Hey J, Nielsen R, 2007. Integration within the Felsenstein equation for improved Markov chain Monte Carlo methods in population genetics. Proc. Natl. Acad. Sci. U. S. A., 104:2785–2790.
Keller LF, Waller DM, 2002. Inbreeding effects in wild populations. Trends Ecol. Evol., 17:230–241.
Kuhner MK, 2006. LAMARC 2.0: maximum likelihood and Bayesian estimation of population parameters. Bioinformatics, 22:768–770.
Lande R, 1988. Genetics and demography in biological conservation. Science, 241:1455–1460.
Lawton-Rauh A, 2008. Demographic processes shaping genetic variation. Curr Opin Plant Biol, 11:103–109.
Leblois R, Estoup A, Rousset F, 2009. IBDSim: A computer program to simulate genotypic data under isolation by distance. Molecular Ecology Resources, 9:107–109.
Leblois R, Estoup A, Streiff R, 2006. Habitat contraction and reduction in population size: Does isolation by distance matter? Molecular Ecology, 15:3601–3615.
Mailund T, Halager AE, Westergaard M, et al. (11 co-authors), 2012. A new isolation with migration model along complete genomes infers very different divergence processes among closely related Great Ape species. PLoS genetics, 8:e1003125.
Marjoram P, Tavaré S, 2006. Modern computational approaches for analysing molecular genetic variation data. Nat Rev Genet, 7:759–770.
Meuwissen TH, Goddard ME, 2007. Multipoint identity-by-descent prediction using dense markers to map quantitative trait loci and estimate effective population size. Genetics, 176:2551–2560.
Nielsen R, Beaumont MA, 2009. Statistical inferences in phylogeography. Mol Ecol, 18:1034–1047.
Ohta T, Kimura M, 1973. A model of mutation appropriate to estimate the number of electrophoretically detectable alleles in a finite population. Genet. Res., 22:201–204.
Palamara PF, Lencz T, Darvasi A, Peter I, 2012. Length distributions of identity by descent reveal fine-scale demographic history. The American Journal of Human Genetics.
Peery MZ, Kirby R, Reid BN, Stoelting R, Doucet-Bëer E, Robinson S, Vàsquez-Carrillo C, Pauli JN, Palsbøll PJ, 2012. Reliability of genetic bottleneck tests for detecting recent population declines. Molecular Ecology, 21:3403–3418.
Peter B, Wegmann D, Excoffier L, 2010. Distinguishing between population bottleneck and population subdivision by a Bayesian model choice procedure. Molecular ecology, 19:4648–4660.
Pritchard JK, Seielstad MT, Perez-Lezaun A, Feldman MW, 1999. Population growth of human Y chromosome microsatellites. Mol. Biol. Evol., 16:1791–1798.
Ptak SE, Przeworski M, 2002. Evidence for population growth in humans is confounded by fine-scale population structure. Trends in Genetics, 18:559–563.
Reusch TBH, Wood TE, 2007. Molecular ecology of global change. Mol Ecol, 16:3973–3992.
Rousset F, Leblois R, 2007. Likelihood and approximate likelihood analyses of genetic structure in a linear habitat: performance and robustness to model mis-specification. Mol. Biol. Evol., 24:2730–2745.
Rousset F, Leblois R, 2012. Likelihood-based inferences under a coalescent model of isolation by distance: two-dimensional habitats and confidence intervals. Mol. Biol. Evol., 29:957–973.
Schneider S, Excoffier L, 1999. Estimation of past demographic parameters from the distribution of pairwise differences when the mutation rates vary among sites: Application to human mitochondrial DNA. Genetics, 152:1079–1089.
Schwartz M, Luikart G, Waples R, 2007. Genetic monitoring as a promising tool for conservation and management. TREE, 22:25–33.
Severini TA, 2000. Likelihood methods in statistics. Oxford Univ. Press.
Spencer CC, Neigel JE, Leberg PL, 2000. Experimental evaluation of the usefulness of microsatellite DNA for detecting demographic bottlenecks. Mol Ecol, 9:1517–1528.
Stephens M, Donnelly P, 2000. Inference in molecular population genetics (with discussion). J. R. Stat. Soc., 62:605–655.
Storz J, Beaumont M, 2002. Testing for genetic evidence of population expansion and contraction: An empirical analysis of microsatellite DNA variation using a hierarchical Bayesian model. Evolution, 56:154–166.
Sun J, Helgason A, Masson G, et al. (11 co-authors), 2012. A direct characterization of human mutation based on microsatellites. Nature Genetics, 44:1161–1165.
Theunert C, Tang K, Lachmann M, Hu S, Stoneking M, 2012. Inferring the history of population size change from genome-wide SNP data. Molecular Biology and Evolution, 29:3653–3667.
Wakeley J, 1999. Nonequilibrium migration in human evolution. Genetics, 153:1863–1871.
Williams B, Nichols J, Conroy M, 2002. Analysis and management of animal populations: modeling, estimation, and decision making. Academic Pr.
Wright S, 1951. The genetical structure of populations. Ann. Eugenics, 15:323–354.
Supplementary materials for the article: Maximum likelihood inference of population size contractions from microsatellite data, by Raphaël Leblois, Pierre Pudlo, Joseph Néron, François Bertaux, Champak Reddy Beeravolu, Renaud Vitalis, François Rousset

Preprint submitted as an article to Molecular Biology and Evolution, November 29, 2013

A. Details on the likelihood computations and settings of the inference method

A.1. Coalescent-based IS algorithms and disequilibrium models

In this section, we give a more detailed overview of the method used to compute likelihoods of genetic data at a given locus and technical details regarding the Monte Carlo algorithm. The likelihood at a given point of the parameter space is estimated using the importance sampling approach of Stephens and Donnelly (2000) and de Iorio and Griffiths (2004a). An ancestral history, i.e. a coalescence tree with mutations, is defined as the set of all ancestral configurations $H = \{H_k;\ k = 0, -1, \ldots, -m\}$, corresponding to all coalescent or mutation events that occurred from $H_0$, the current sample state (i.e. the sample allelic configuration, or allelic counts), to $H_{-m}$, the allelic state of the most recent common ancestor (MRCA) of the sample. The Markov nature of the backward coalescent process implies that $p(H_k) = \sum_{\{H_{k-1}\}} p(H_k \mid H_{k-1})\, p(H_{k-1})$, and expanding the recursion over possible ancestral histories of a current sample leads to $p(H_0) = E_p[p(H_0 \mid H_{-1}) \cdots p(H_{-m+1} \mid H_{-m})]$. However, forward transition probabilities $p(H_k \mid H_{k-1})$ cannot be used directly in a backward process, and backward transition probabilities $p(H_{k-1} \mid H_k)$ are unknown, except in some specific simple models such as parent independent mutations (PIM) in a single stable panmictic population. Importance sampling techniques based on an approximation $\hat{p}(H_{k-1} \mid H_k)$ of $p(H_{k-1} \mid H_k)$ are thus used to derive the probability of a sample over possible histories:

$$p(H_0) = E_{\hat{p}}\left[ \frac{p(H_0 \mid H_{-1})}{\hat{p}(H_{-1} \mid H_0)} \cdots \frac{p(H_{-m+1} \mid H_{-m})}{\hat{p}(H_{-m} \mid H_{-m+1})} \right]. \qquad (1)$$

The likelihood of the data is then estimated as the average value of the probability of a sample configuration $H_0$ given an ancestral history $H_i$, over $n_H$ independent simulations, backward in time, of possible ancestral histories: $p(H_0) \approx \frac{1}{n_H} \sum_{i=1}^{n_H} p(H_0 \mid H_i)$. The distribution of the possible ancestral histories is generated by an absorbing Markov chain with transition probabilities $\hat{p}(H_{k-1} \mid H_k)$, and the likelihood is estimated by averaging the product of sequential importance weights

$$\frac{p(H_{i,0} \mid H_{i,-1})}{\hat{p}(H_{i,-1} \mid H_{i,0})} \cdots \frac{p(H_{i,-m+1} \mid H_{i,-m})}{\hat{p}(H_{i,-m} \mid H_{i,-m+1})},$$

corresponding to the ratio of forward and backward transition probabilities obtained for each history $H_i$. Computation of the backward transition probabilities $\hat{p}$ relies on Stephens and Donnelly's approximations $\hat{\pi}$ of the unknown probability $\pi$ that an additional gene sampled from a population is of a given allelic type conditional on a previous sample configuration (see A.4 for an example of $\hat{\pi}$ computations). For efficient importance sampling distributions (i.e. approximate backward transition probabilities $\hat{p}$ close to the exact $p$), precise estimation of the likelihood can be obtained with very few histories explored. For example, under a parent independent mutation model (e.g. a K-allele model, KAM), the importance sampling scheme of Stephens and Donnelly (2000) for a single isolated population is optimal because the $\hat{\pi}$'s computed following Stephens and Donnelly (2000) are equal to the true $\pi$'s and all the ratios $p(H_k \mid H_{k-1}) / \hat{p}(H_{k-1} \mid H_k)$ are cancelled out. In such an ideal case, consideration of a single ancestral history is thus sufficient to get the exact likelihood (de Iorio and Griffiths, 2004a,b; Stephens and Donnelly, 2000). However, departures from parent independent mutation, from panmixia and from time-homogeneity decrease the efficiency of IS proposals, and precise estimation of likelihoods then requires exploring more ancestral histories.
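The averaging step in equation (1) can be illustrated with a minimal sketch. It assumes that each sampled history is already available as a list of (forward probability, proposal probability) pairs, one pair per coalescent or mutation event; the function names and the toy numbers are hypothetical, and the sampling of histories from the proposal chain is not shown.

```python
from math import prod


def history_weight(steps):
    """Product of the ratios p(H_k | H_{k-1}) / p_hat(H_{k-1} | H_k) along one history."""
    return prod(forward / proposal for forward, proposal in steps)


def is_likelihood(histories):
    """Importance sampling estimate: mean weight over the n_H sampled histories."""
    return sum(history_weight(h) for h in histories) / len(histories)


# Two toy histories of two events each, given as (forward, proposal) pairs.
histories = [
    [(0.2, 0.5), (0.1, 0.4)],
    [(0.3, 0.5), (0.05, 0.6)],
]
print(is_likelihood(histories))

# With an optimal proposal every history carries the same weight,
# so a single history already returns the estimate.
optimal = [[(0.2, 0.4), (0.3, 0.6)], [(0.1, 0.2), (0.25, 0.5)]]
print(is_likelihood(optimal[:1]), is_likelihood(optimal))
```

The second call mimics the ideal parent independent mutation case described above: when all weights coincide, the estimator has no Monte Carlo variance, whereas unequal weights (first call) require averaging over more histories.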
Under a time-homogeneous model of isolation by distance, Rousset and Leblois (2007, 2012) found that 30 replicates of the absorbing Markov chain, rebuilding 30 independent possible ancestral histories, are enough to get perfect LRT-Pvalue distributions. Here, for the first time, the IS algorithm of de Iorio and Griffiths (2004a) is applied to a time-inhomogeneous demographic model using $\hat{\pi}$ probabilities computed from a time-homogeneous demographic model, as described in the main text. Our work clearly shows that IS proposals computed as described in the main text become less and less efficient for demographic scenarios with increasing disequilibrium.
A.2. Some efficient modifications of the original IS algorithm

To speed up the computations, we can stop the simulation of the genealogies during the IS algorithm before reaching the MRCA (see Jasra, De Iorio and Chadeau-Hyam, 2011; Rousset and Leblois, 2007). Here, the algorithm is stopped when reaching demographic equilibrium (i.e., after time T when N(t) = Nanc) and we finalize the current IS estimates of the likelihood with the PAC-likelihood of the ancestral lineages (Li and Stephens, 2003; Cornuet and Beaumont, 2007; Rousset and Leblois, 2007, 2012). This scheme will be called PACanc. Using analytical formulas for the exact probability of the last pair of genes, computed as in Rousset (2004), also slightly decreases the computation time, in the same vein as in de Iorio et al. (2005) and Rousset and Leblois (2007, 2012). The IS scheme can thus be stopped when reaching an ancestral sample of size 2 and finalized with this exact formula, hereafter called 2ID. The combination of both PACanc and 2ID will be called PACanc2ID. A detailed comparison between the four schemes (strict IS, 2ID, PACanc and PACanc2ID) is presented in Table S1 for the baseline scenarios under a SMM and under a GSM (case[0], [A] to [F] under a SMM; case[G] to [L] under a GSM with p = 0.22; and case[K] to [M] under a GSM with p = 0.74). For the GSM, data sets are simulated under a GSM with 40 allelic states but analyzed with a GSM with 50 possible allelic states. This slight mis-specification of the mutation model has a relatively strong influence when p = 0.74: LRT-Pvalue distributions are not close to the 1:1 line regardless of the number of loci. Such a strong effect is also probably due to the consideration of large θanc values. Apart from this mutation model effect, our results show that performances are similar in all cases for the different algorithms. All simulations with a GSM, i.e. for the tests of the effect of mutational processes and population structure, are analyzed using the PACanc2ID, unless otherwise specified.
ˆ case / L [0] IS [J] 2ID [K] PACanc [L] PACanc2ID [A] IS [M] PACanc2ID [B] IS [N] PACanc2ID [C] IS [O] 2ID [P] PACanc [Q] PACanc2ID [D] IS [R] PACanc2ID [E] IS [S] PACanc2ID [F] IS [T] PACanc2ID [G] IS [H] IS [I] IS
n` 10 10 10 10 25 25 50 50 10 10 10 10 50 50 10 10 50 50 10 25 50
rel. bias NA NA NA NA NA NA NA NA 0.26 0.24 0.21 0.23 0.17 0.19 0.016 -0.022 0.045 0.0047 NA NA NA
p RRMSE NA NA NA NA NA NA NA NA 0.91 0.88 0.86 0.85 0.47 0.48 0.14 0.18 0.081 0.070 NA NA NA KS NA NA NA NA NA NA NA NA 0.16 0.0080 0.39 0.82 0.12 0.67 0.0.094 0.31 3.8 · 10−5 0.56 NA NA NA
rel. bias 0.035 0.038 0.061 0.057 0.0066 0.012 0.015 0.050 0.033 0.0047 0.052 0.021 0.059 0.070 0.137 0.072 0.34 0.24 -0.070 -0.027 -0.084
θ RRMSE 0.56 0.054 0.55 0.544 0.31 0.31 0.23 0.23 0.51 0.51 0.52 0.51 0.25 0.26 0.52 0.56 0.44 0.40 0.64 0.49 0.32 KS 0.056 0.060 0.051 0.032 0.35 0.48 0.62 0.41 0.12 0.377 0.36 0.096 0.44 0.063 0.11 0.0691