Maximum likelihood inference of population size contractions from microsatellite data

Raphaël Leblois (a,b,f,∗), Pierre Pudlo (a,d,f), Joseph Néron (b), François Bertaux (b,e), Champak Reddy Beeravolu (a), Renaud Vitalis (a,f), François Rousset (c,f)

(a) INRA, UMR 1062 CBGP (INRA-IRD-CIRAD-Montpellier Supagro), Montpellier, France
(b) Muséum National d’Histoire Naturelle, CNRS, UMR OSEB, Paris, France
(c) Université Montpellier 2, CNRS, UMR ISEM, Montpellier, France
(d) Université Montpellier 2, CNRS, UMR I3M, Montpellier, France
(e) INRIA Paris-Rocquencourt, BANG team, Le Chesnay, France
(f) Institut de Biologie Computationnelle, Montpellier, France

Abstract

Understanding the demographic history of populations and species is a central issue in evolutionary biology and molecular ecology. In the present work, we develop a maximum likelihood method for the inference of past changes in population size from microsatellite allelic data. Our method is based on importance sampling of gene genealogies, extended for new mutation models, notably the generalized stepwise mutation model (GSM). Using simulations, we test its performance to detect and characterize past reductions in population size. First, we test the estimation precision and confidence interval coverage properties under ideal conditions; then we compare the accuracy of the estimation with another available method (MsVar); finally, we test its robustness to misspecification of the mutational model and population structure. We show that our method is very competitive compared to alternative ones. Moreover, our implementation of a GSM allows more accurate analysis of microsatellite data, as we show that violations of a single step mutation assumption induce a very high bias towards false bottleneck detection. However, our simulation tests also showed some important limits, most importantly large computation times for strong disequilibrium scenarios and a strong influence of some forms of unaccounted population structure. This inference method is available in the latest implementation of the Migraine software package.

Keywords: demographic inference, maximum likelihood, coalescent, importance sampling, microsatellites, bottleneck, population structure, mutation processes

Non-standard abbreviations: IS (importance sampling), MCMC (Markov chain Monte Carlo), ML (maximum likelihood), RMSE (root mean square error), KS (Kolmogorov-Smirnov test), LRT (likelihood ratio test), CI (confidence or credibility intervals), KAM (K-allele model), SMM (stepwise mutation model), GSM (generalized stepwise mutation model), SNP (single nucleotide polymorphism), IBD (isolation by distance), BDR (bottleneck detection rate), FBDR (false bottleneck detection rate), FEDR (false expansion detection rate)

∗ Corresponding author. Email address: [email protected] (Raphaël Leblois)

Preprint submitted as an article to Molecular Biology and Evolution, December 16, 2013

1. Introduction

Understanding the demographic history of populations and species is a central issue in evolutionary biology and molecular ecology, e.g. for understanding the effects of environmental changes on the distribution of organisms. From a conservation perspective, a severe reduction in population size, often referred to as a “population bottleneck”, increases the rate of inbreeding, the loss of genetic variation and the fixation of deleterious alleles, and thereby greatly reduces adaptive potential and increases the risk of extinction (Lande, 1988; Frankham et al., 2006; Keller and Waller, 2002; Reusch and Wood, 2007). However, characterizing the demographic history of a species with direct demographic approaches requires the monitoring of census data, which can be extremely difficult and time consuming (Williams, Nichols and Conroy, 2002; Schwartz, Luikart and Waples, 2007; Bonebrake et al., 2010). Moreover, direct approaches cannot give information about past demography from present-time data. A powerful alternative relies on population genetic approaches, which allow inferences on the past demography from the observed present distribution of genetic polymorphism in natural populations (Schwartz, Luikart and Waples, 2007; Lawton-Rauh, 2008).

Until recently, most indirect methods were based on testing whether a given summary statistic (computed from genetic data) deviates from its expected value under an equilibrium demographic model (Cornuet and Luikart, 1996; Schneider and Excoffier, 1999; Garza and Williamson, 2001). Because of their simplicity, these methods have been widely used (see, e.g., Comps et al., 2001; Colautti et al., 2005, and the reviews of Spencer, Neigel and Leberg, 2000 and Peery et al., 2012). However, they estimate neither the severity of the bottleneck nor its age or duration.

Although much more mathematically difficult and computationally demanding, likelihood-based methods outperform these moment-based methods by considering all available information in the genetic data (see Felsenstein, 1992; Griffiths and Tavaré, 1994; Emerson, Paradis and Thébaud, 2001, and the review of Marjoram and Tavaré, 2006). Among others, the software package MsVar (Beaumont, 1999; Storz and Beaumont, 2002) has been increasingly used to infer past demographic changes. MsVar assumes a demographic model consisting of a single isolated population, which has undergone a change in effective population size at some time in the past. It is dedicated to the analysis of microsatellite loci that are assumed to follow a strict stepwise mutation model (SMM, Ohta and Kimura, 1973). In a recent study, Girod et al. (2011) evaluated the performance of MsVar by simulation. They showed that MsVar clearly outperforms moment-based methods in detecting past changes in population size, but appears only moderately robust to mis-specification of the mutational model: deviations from the SMM often induce “false” bottleneck detections on simulated samples from populations at equilibrium. Chikhi et al. (2010) also found a strong confounding effect of population structure on bottleneck detection using MsVar. Thus, departures from the mutational and demographic assumptions of the model appear to complicate the inference of past population size changes from genetic data.

The present work extends the importance sampling (IS) class of algorithms (Stephens and Donnelly, 2000; de Iorio and Griffiths, 2004a,b) to coalescent-based models of a single isolated population with past changes in population size. Moreover, in the spirit of de Iorio et al. (2005), we also provide explicit formulae for a generalized stepwise mutation model (GSM, Pritchard et al., 1999).

We have conducted three simulation studies to test the efficiency of our methodology on past contractions (i.e., bottlenecks) and its robustness against mis-specifications of the model. The first study aims at showing the ability of the algorithm to detect bottlenecks and to recover the parameters of the model (i.e., the severity of the size change and its age) on a wide range of bottleneck scenarios. In a second study, we compared the accuracy of our IS implementation with the MCMC approach implemented in MsVar. The third study tests the robustness of our method against mis-specification of the mutation model, and against the existence of a population structure not considered in the model. All analyses in these studies were performed using the latest implementation of the Migraine software package, available at the web page kimura.univ-montp2.fr/~rousset/Migraine.htm.

2. New approaches

Our goal is to obtain maximum likelihood (ML) estimates for single population models with a past variation in population size, described in Subsection 2.1. To this end, we describe the successive steps of the inference algorithm (Subsections 2.2, 2.3).

2.1. Demographic model

Figure 1: Representation of the demographic model used in the study. The Ns are population sizes, T is the time measured in generations since the present and µ is the mutation rate of the marker used. Those four parameters are the canonical parameters of the model. The θs and D are the inferred scaled parameters.

We consider a single isolated population with past size changes (Fig. 1). We denote by N(t) the population size, expressed as the number of genes, t generations away from the sampling time t = 0. Population size at sampling time is N ≡ N(0). Then, going backward in time, the population size changes according to a deterministic exponential function until reaching an ancestral population size Nanc at time t = T. Then, N(t) remains constant, equal to Nanc for all t > T. More precisely,

$$
N(t) =
\begin{cases}
N \left( \dfrac{N_{anc}}{N} \right)^{t/T}, & \text{if } 0 < t < T, \\
N_{anc}, & \text{if } t > T.
\end{cases}
\qquad (1)
$$
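As a minimal sketch of Eq. (1), the following Python function evaluates N(t) backward in time. The function name and the numerical values are illustrative assumptions and are not taken from Migraine.

```python
def population_size(t, N, N_anc, T):
    """Population size t generations before sampling, following Eq. (1).

    N is the size at sampling time, N_anc the ancestral size and T the
    time (in generations) at which the ancestral size is reached.
    """
    if t < 0:
        raise ValueError("t is measured backward in time and must be >= 0")
    if t >= T:
        return float(N_anc)
    # Exponential change between N (at t = 0) and N_anc (at t = T).
    return N * (N_anc / N) ** (t / T)


# Illustrative values only: a contraction from 20,000 to 200 genes.
sizes = [population_size(t, N=200, N_anc=20_000, T=100) for t in (0, 50, 100, 500)]
```

Halfway through the change (t = T/2) the size is the geometric mean of N and Nanc, which is what "exponential" means here.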

To ensure identifiability, the parameters of interest are scaled as θ ≡ 2Nµ, θanc ≡ 2Nancµ and D ≡ T/2N, where µ is the mutation rate per locus per generation. We are often interested in an extra composite parameter, Nratio = θ/θanc, which is useful to characterize the strength of the bottleneck. Finally, in a few situations we also consider an alternative parametrization of the model using θ, θanc and D′ ≡ µT, to compare these two possible parameterizations.
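The scaling can be checked numerically with a short sketch. The function and key names are hypothetical, and the canonical values below (including the mutation rate) are chosen only for illustration.

```python
def scaled_parameters(N, N_anc, T, mu):
    """Map canonical parameters (N, N_anc, T, mu) to the scaled ones."""
    theta = 2 * N * mu
    theta_anc = 2 * N_anc * mu
    return {
        "theta": theta,
        "theta_anc": theta_anc,
        "D": T / (2 * N),          # time scaled by the current population size
        "D_prime": mu * T,         # alternative time scaling, mu * T
        "N_ratio": theta / theta_anc,
    }


# Illustrative canonical values (assumed, not estimated from data).
params = scaled_parameters(N=200, N_anc=20_000, T=10, mu=1e-3)
```

Because µ cancels in θ/θanc, Nratio equals N/Nanc, so the strength of the size change is identifiable even though N and µ are not separately identifiable.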

2.2. Computation of coalescent-based likelihood with importance sampling

Because the precise genetic history of the sample is not observed, the coalescent-based likelihood at a given point of the parameter space is an integral over all possible histories, i.e. genealogies with mutations, leading to the present genetic data. Following Stephens and Donnelly (2000) and de Iorio and Griffiths (2004a), the Monte Carlo scheme computing this integral is based here on importance sampling. The set of possible past histories is explored via an importance distribution depending on the demographic scenario and on the parameter values currently considered. The best proposal distribution to sample from is the importance distribution leading to a zero-variance estimate of the likelihood. Here it amounts to the model-based distribution of gene history conditional on the present genetic data, which is specified by all backward transition rates between successive states of the histories. As computation of these backward transition rates is often too difficult, we substitute this conditional distribution with an importance distribution, and introduce a weight to correct the discrepancy. Like the best proposal distribution, the actual importance distribution is a process describing changes in the ancestral sample configuration backward in time using absorbing Markov chains, but it does not lead to a zero-variance estimate of the likelihood. A more efficient importance sampling proposal allows likelihoods to be estimated accurately from fewer histories for a given parameter value. Stephens and Donnelly (2000), de Iorio and Griffiths (2004a,b) and de Iorio et al. (2005) suggested efficient approximations that are easily computable. However, the efficiency of the importance distribution depends heavily on the demographic model and the current parameter value.
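The role of the weight can be illustrated on a toy problem that has nothing to do with genealogies: estimating a small Gaussian tail probability by sampling from a shifted proposal and reweighting each draw by the ratio of target to proposal densities. This is only a sketch of the general importance sampling principle, with a hypothetical function name; it is not the genealogical sampler described in this section.

```python
import math
import random


def is_tail_probability(threshold, n_draws, seed=0):
    """Estimate P(Z > threshold) for Z ~ N(0, 1) by importance sampling.

    Draws come from the proposal N(threshold, 1); each draw is weighted by
    the ratio (target density) / (proposal density).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        x = rng.gauss(threshold, 1.0)
        if x > threshold:
            # phi(x) / phi(x - threshold) = exp(threshold**2 / 2 - threshold * x)
            total += math.exp(0.5 * threshold**2 - threshold * x)
    return total / n_draws


estimate = is_tail_probability(4.0, 20_000)
exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))
```

Sampling directly from N(0, 1) would almost never hit the tail with this number of draws; a proposal closer to the region that matters gives a usable estimate from the same effort, which is the sense in which a better importance distribution needs fewer histories.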

The first main difference between our algorithm and those described in de Iorio and Griffiths (2004a,b) is the time inhomogeneity induced by the disequilibrium of our demographic model. Indeed, the demographic models considered in the above-cited literature and in Rousset and Leblois (2007, 2012) do not include any change in population size. To relax the assumption of time homogeneity in de Iorio and Griffiths (2004b), we modify their equations (see Tables 1 and 2 of de Iorio and Griffiths, 2004b), so that all quantities depending on the relative population sizes now vary over time because of the population size changes. Thus, we must keep track of time in the algorithm to assign the adequate value to those time-dependent quantities. To see how this is done, consider that the genealogy has been constructed until time Tk, and that, at this date, n ancestral lineages remain. Under the importance distribution, the occurrence rate of a mutation event is then nθ, and the occurrence rate of a coalescence event is n(n − 1)λ(t), where λ(t) = N/N(t) is the population size function introducing the disequilibrium. λ(t) corresponds to parameter 1/q in de Iorio and Griffiths (2004b). The total jump rate at time t ≥ Tk is then

$$
\Gamma(t) = n \big( (n-1)\lambda(t) + \theta \big)
$$

and the next event of the genealogy occurs at time T_{k+1}, whose distribution has density

$$
P(T_{k+1} \in dt) = \Gamma(t) \exp\left( - \int_{T_k}^{t} \Gamma(u)\, du \right) dt, \quad \text{if } t \ge T_k.
$$

Apart from these modifications, the outline of the IS scheme from de Iorio and Griffiths (2004b) is preserved (see section A.1 in the supplementary materials for more details).
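One way to draw the next event time from the density above is to invert the cumulative jump rate: the integral of Γ between Tk and T_{k+1} is a standard exponential variable. The sketch below does this by bisection for the exponential size change of Eq. (1). It assumes that t and D are expressed in the same scaled time unit as the rates nθ and n(n − 1)λ(t), so that λ(t) = (N/Nanc)^(t/D) for t < D; the function names are hypothetical and this is not the Migraine implementation.

```python
import math
import random


def cum_lambda(t, D, ratio):
    """Integral from 0 to t of lambda(u) = N / N(u), with ratio = N / N_anc."""
    if ratio == 1.0:
        return t  # constant size: lambda(u) = 1
    log_r = math.log(ratio)
    if t <= D:
        return D * (ratio ** (t / D) - 1.0) / log_r
    return D * (ratio - 1.0) / log_r + ratio * (t - D)


def cum_rate(t, n, theta, D, ratio):
    """Integral from 0 to t of Gamma(u) = n * ((n - 1) * lambda(u) + theta)."""
    return n * ((n - 1) * cum_lambda(t, D, ratio) + theta * t)


def next_event_time(t_k, n, theta, D, ratio, rng=random):
    """Draw T_{k+1} by inverting the cumulative rate (needs n >= 1, theta > 0)."""
    target = cum_rate(t_k, n, theta, D, ratio) + rng.expovariate(1.0)
    lo, hi = t_k, t_k + 1.0
    # The cumulative rate grows at least linearly, so doubling finds a bracket.
    while cum_rate(hi, n, theta, D, ratio) < target:
        hi = t_k + 2.0 * (hi - t_k)
    for _ in range(200):  # bisection on a monotone function
        mid = 0.5 * (lo + hi)
        if cum_rate(mid, n, theta, D, ratio) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With a constant population size (ratio = 1) the draw reduces to the usual exponential waiting time with rate n((n − 1) + θ), which gives a simple sanity check.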

We also develop specific algorithms to analyze data under the generalized stepwise mutation model (GSM), with an infinite or finite number of alleles. This more realistic mutation model allows for multistep mutations, with the number of steps involved in each mutation modeled by a geometric distribution with parameter p. The original algorithm of Stephens and Donnelly (2000) covers any finite mutation model but requires numerical matrix inversions to solve a system of linear equations (see, e.g., Eqs (18) and (19) in Stephens and Donnelly, 2000). Time inhomogeneity requires matrix inversions each time the genealogy is updated by the IS algorithm. To bypass this difficulty, de Iorio et al. (2005) successfully replaced the matrix inversions with Fourier analysis when considering an SMM with an infinite allele range. We extended this Fourier analysis to the case of a GSM with an infinite allele range. However, contrary to the SMM, the result of the Fourier analysis for the GSM is a very poor approximation for cases with a finite range of allelic states as soon as p is not very small (e.g. < 0.1). To consider a more realistic GSM with allele ranges of finite size, we propose to compute the relevant matrix inversions using a numerical decomposition in eigenvectors and eigenvalues of the mutation process matrix, P. Because the mutation model is not time-dependent, this last decomposition is performed only once for a given matrix P. See A.4 for details about the GSM implementation.
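The benefit of decomposing P once can be sketched with NumPy. The exact matrices inverted by the algorithm are given in Stephens and Donnelly (2000) and in section A.4, not in this section; the sketch therefore uses (I − cP)^(-1), with a scalar c that changes along the genealogy, as an assumed stand-in. The boundary treatment of the finite allele range (a mutation leaving the range is not performed) and all names are also assumptions made for illustration.

```python
import numpy as np


def gsm_matrix(n_alleles, p):
    """Symmetric GSM transition matrix on alleles 0 .. n_alleles - 1.

    A mutation moves k >= 1 repeat units up or down (probability 1/2 each),
    with P(k) = (1 - p) * p**(k - 1). Assumption: a mutation that would leave
    the allele range is not performed (its mass stays on the diagonal).
    """
    idx = np.arange(n_alleles)
    dist = np.abs(idx[:, None] - idx[None, :])
    P = np.zeros((n_alleles, n_alleles))
    off = dist > 0
    P[off] = 0.5 * (1.0 - p) * p ** (dist[off] - 1.0)
    P[idx, idx] = 1.0 - P.sum(axis=1)
    return P


class Resolvent:
    """Computes (I - c P)^{-1} for many values of c from one decomposition."""

    def __init__(self, P):
        # P is symmetric here, so eigh returns real eigenvalues and an
        # orthonormal eigenvector basis: P = V diag(w) V^T.
        self.w, self.V = np.linalg.eigh(P)

    def inverse(self, c):
        # (I - c P)^{-1} = V diag(1 / (1 - c w)) V^T, valid for 0 <= c < 1.
        return (self.V / (1.0 - c * self.w)) @ self.V.T


P = gsm_matrix(20, p=0.22)
resolvent = Resolvent(P)        # decomposition performed once
M = resolvent.inverse(0.3)      # reused for each new value of c
```

Setting p = 0 in this sketch recovers single-step mutations, i.e. an SMM on the same finite range.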

Finally, several approximations of the likelihood, using products of approximate conditional likelihoods (PAC, Cornuet and Beaumont, 2007) and analytical computation of the probability of the last pair of genes, have been successfully tested to speed up computation times (see section A.2 in the supplementary materials).

2.3. Inference method

Following Rousset and Leblois (2007, 2012), we first define a set of parameter points via a stratified random sample on the range of parameters provided by the user. Then, at each parameter point, the multilocus likelihood is the product of the likelihoods for each locus, which are estimated via the IS algorithm described above. The likelihoods inferred at the different parameter points are then smoothed by a Kriging scheme (Cressie, 1993). After a first analysis of the smoothed likelihood surface, the algorithm can be repeated a second time to increase the density of the grid in the neighborhood of a first maximum likelihood estimate. Finally, one- and two-dimensional profile likelihood ratios are computed, to obtain confidence intervals and graphical outputs (e.g. Fig. 2). Section A.3 in the supplementary materials explains how we tuned the parameters of the algorithm, namely the range of parameters, the number of parameter points and the number of genealogical histories explored by the IS algorithm.
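The first two steps can be sketched as follows. The stratification actually used by Migraine is not described in this section, so the sketch assumes a Latin-hypercube style scheme on a log scale; the Kriging step is omitted, and the function names and parameter ranges are hypothetical.

```python
import math
import random


def stratified_log_sample(bounds, n_points, seed=0):
    """Latin-hypercube style sample of parameter points on a log scale.

    bounds maps each parameter name to (lower, upper), both > 0. Each
    parameter range is cut into n_points strata on the log scale and every
    stratum is used exactly once.
    """
    rng = random.Random(seed)
    columns = {}
    for name, (lower, upper) in bounds.items():
        lo, hi = math.log(lower), math.log(upper)
        width = (hi - lo) / n_points
        strata = list(range(n_points))
        rng.shuffle(strata)
        columns[name] = [math.exp(lo + (s + rng.random()) * width) for s in strata]
    return [{name: columns[name][i] for name in bounds} for i in range(n_points)]


def multilocus_log_likelihood(per_locus_likelihoods):
    """Log of the product of per-locus likelihoods (loci treated as independent)."""
    return sum(math.log(lik) for lik in per_locus_likelihoods)


# Illustrative parameter ranges only.
points = stratified_log_sample(
    {"theta": (1e-3, 10.0), "D": (1e-2, 10.0), "theta_anc": (1.0, 500.0)},
    n_points=8,
)
```

Working with the sum of log-likelihoods rather than the raw product avoids numerical underflow when many loci are combined.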

A genuine issue, when facing genetic data, is to test whether the sampled population has undergone size changes or not. Thus, we derived a statistical test from the methodology presented above. It tests the null hypothesis that no size change occurred (i.e., N = Nanc) against alternatives such as a population decline or expansion (i.e., N ≠ Nanc). At level α, our test rejects the null hypothesis if and only if 1 lies outside the 1 − α confidence interval of the ratio Nratio = N/Nanc.
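The decision rule can be sketched from a one-dimensional profile log-likelihood of Nratio evaluated on a grid. The sketch assumes the usual asymptotic chi-square calibration with one degree of freedom at α = 0.05; how Migraine derives its intervals from the smoothed likelihood surface is not shown here, and the function names and the synthetic profile are illustrative.

```python
import math

CHI2_1DF_95 = 3.841458820694124  # 0.95 quantile of the chi-square distribution, 1 d.f.


def profile_ci(grid, profile_loglik, threshold=CHI2_1DF_95):
    """Grid-based CI: values whose profile log-likelihood is within
    threshold / 2 of the maximum."""
    l_max = max(profile_loglik)
    inside = [x for x, l in zip(grid, profile_loglik) if 2.0 * (l_max - l) <= threshold]
    return min(inside), max(inside)


def size_change_test(grid, profile_loglik):
    """Reject N = N_anc when 1 lies outside the CI of Nratio = N / N_anc."""
    lower, upper = profile_ci(grid, profile_loglik)
    if upper < 1.0:
        return "contraction"
    if lower > 1.0:
        return "expansion"
    return "no significant change"


# Synthetic profile, quadratic in log10(Nratio) and peaked at Nratio = 0.01.
grid = [10 ** (-4 + 0.05 * i) for i in range(101)]
profile = [-0.5 * ((math.log10(x) + 2.0) / 0.3) ** 2 for x in grid]
result = size_change_test(grid, profile)
```

A flat profile yields an interval covering the whole grid, so the null hypothesis of no size change is not rejected.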

All those developments are implemented in the Migraine software package. A detailed presentation of the simulation settings and validation procedures used to test the precision and robustness of the method is given in Section 5.

3. Results

3.1. Two contrasting examples

We begin with two contrasting simulated examples presented in Figs. 2a and 2b. The first one, corresponding to our baseline simulation (θ = 0.4, D = 1.25 and θanc = 40.0, case [0]), is an ideal situation in which the inference algorithm performs well due to the large amount of information in the genetic data, resulting in a likelihood surface with clear peaks for all parameters around the maximum likelihood values. The bottleneck signal is highly significant and is clearly seen in the (θ, θanc) plot in Fig. 2a, as the maximum likelihood peak is above the 1:1 diagonal. The second example is a more difficult situation, where the population has undergone a much weaker contraction (θ = 0.4, D = 1.25 and θanc = 2.0, case [10]) that does not leave a clear signal in the genetic data. In such a situation, there is not much information on any of the three parameters, resulting in much flatter funnel- or cross-shaped two-dimensional likelihood surfaces. A bottleneck signal is visible on the cross-shaped (θ, θanc) plot in Fig. 2b, but is not significant.

[Figure 2 panels: (a) case 0, (b) case 10. Each panel shows a profile likelihood ratio surface for a pair of parameters among 2Nµ, D and 2Nancµ, all on a log scale.]

Figure 2: Examples of two-dimensional profile likelihood ratios for two data sets generated with (a) θ = 0.4, D = 1.25, θanc = 40.0 (case [0]) and (b) θ = 0.4, D = 1.25, θanc = 2.0 (case [10]). The likelihood surface is inferred from (a) 1,240 points in two iterative steps; and (b) 3,720 points in three iterative steps as described in A.3. The likelihood surface is shown only for parameter combinations that fell within the envelope of parameter points for which likelihoods were estimated. The cross denotes the maximum.

3.2. Implementation and efficiency of IS on time-inhomogeneous models

Simulation tests show that our implementation of de Iorio and Griffiths’ IS algorithm for a model of a single population with past changes in population size and stepwise mutations is very efficient under most demographic situations tested here. Similar results are obtained for two different approximations of the likelihood (see section A.2 in the supplementary materials). First, computation times are reasonably short: for a single data set with one hundred gene copies and ten loci, analyses are done within a few hours to three days on a single processor, even for the longer analysis with four parameters under the GSM. Second, likelihood ratio test (LRT) P-value distributions generally indicate good CI coverage properties (see Section 5.2). Cumulative distributions of the LRT P-values for all scenarios, shown in section C in the supplementary materials, are most of the time close to the 1:1 diagonal, as shown in Fig. 3a for our baseline scenario.
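Closeness to the 1:1 diagonal can be quantified by the Kolmogorov-Smirnov distance between the empirical cumulative distribution of the P-values and the uniform distribution. The sketch below computes that distance only; the function name and the two toy samples are illustrative, and the actual validation procedure is the one described in Section 5.2.

```python
def ks_distance_to_uniform(p_values):
    """Largest vertical gap between the ECDF of p_values and the 1:1 diagonal."""
    xs = sorted(p_values)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs, start=1):
        # The ECDF jumps from (i - 1) / n to i / n at x; check both sides.
        gap = max(gap, i / n - x, x - (i - 1) / n)
    return gap


# Evenly spread P-values hug the diagonal; P-values piled up near 0 do not.
well_calibrated = [(i - 0.5) / 10 for i in range(1, 11)]
too_liberal = [0.01 * i for i in range(1, 11)]
```

A small distance is what good coverage looks like: if the LRT P-values computed at the true parameter value are uniform, a nominal 95% interval contains the true value in about 95% of the data sets.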

[Figure 3: empirical cumulative distribution functions (ECDF) of LRT P-values for (a) case 0 and (b) case 10, with one panel per parameter (2Nµ, D, 2Nancµ and Nratio). Each panel reports the relative bias, the relative RMSE and a KS test value.]

[…] 100.0), small scale samples show low BDRs below 10% and accurate

BDRs are only obtained using large sampling scales. Parameter inference appears highly biased for all levels of gene flow considered in this study. As for BDRs, the best precision is also obtained when gene flow is high and sampling scale is large. Nevertheless, for all other situations, relative biases and RRMSEs are high, suggesting that in most situations, limited gene flow between geographically distinct demes will lead to erroneous inferences of past and present population sizes, and of the timing of the demographic change.
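For reference, relative bias and relative RMSE (RRMSE) can be computed as below. These are the usual definitions, assumed here because the paper's own definitions are not part of this excerpt; the estimates are made up for illustration.

```python
import math


def relative_bias(estimates, true_value):
    """Mean of (estimate - true) / true over replicate data sets."""
    return sum((e - true_value) / true_value for e in estimates) / len(estimates)


def relative_rmse(estimates, true_value):
    """Square root of the mean squared relative error."""
    mse = sum(((e - true_value) / true_value) ** 2 for e in estimates) / len(estimates)
    return math.sqrt(mse)


theta_hat = [0.3, 0.4, 0.5, 0.6]  # made-up estimates for a true theta of 0.4
bias = relative_bias(theta_hat, 0.4)
rrmse = relative_rmse(theta_hat, 0.4)
```

The relative bias measures systematic over- or under-estimation, whereas the RRMSE also includes the spread of the estimates, so it is never smaller than the absolute value of the relative bias.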

Such confounding effects of population structure and past changes in population size have already been observed. First, the effect of small scale IBD population structure on BDRs obtained with the Bottleneck and M-Ratio software has been tested by simulation in Leblois, Estoup and Streiff (2006). Our results are globally in agreement with this previous study, except that they found large FEDRs when using Bottleneck on IBD samples, and that considering large scale samples made FEDRs even larger. Such results, showing that fine scale population structure induces false expansion signals, have also been previously stressed by Ptak and Przeworski (2002) in the context of sequence data analysis based on Tajima’s D statistic. Our simulations, on the contrary, show non-null but small FEDRs in the presence of small scale IBD structure.

Second, the effect of island population structure on past population size inference was first highlighted by simulation in Nielsen and Beaumont (2009). More recently, Peter, Wegmann and Excoffier (2010), Chikhi et al. (2010) and Heller, Chikhi and Siegismund (2013) also showed that analyzing samples drawn from a single deme of an island model with low to intermediate migration rates (i.e. Nm < 5) leads to false signals of bottleneck. Such erroneous inferences can be understood by considering the genealogical processes in an island model and in a single population with varying size. In a subdivided population with relatively small deme sizes and small migration rates, the genealogy of a sample taken from a single deme will show (1) many short branches for genes that rapidly coalesce within the deme in which they were sampled (i.e. before any migration event), which corresponds to the “scattering phase” described in Wakeley (1999); and (2) a few much longer branches for genes that coalesce after any emigration or immigration event from the deme sampled, which is the “collecting phase” of Wakeley (1999). The result is a genealogy with an excess of short terminal branches, as expected after a recent contraction in population size. However, if a single individual is taken from each of several demes, and/or if deme size or migration rates are large, the genealogical process becomes closer to the one expected under a Wright-Fisher (WF) population. Similarly, when gene flow is very limited, the ancestry of a sample coming from a single deme will also be very similar to the one expected under the WF model. Thus, except for limit cases, structured and declining population scenarios may result in more or less similar genealogies, depending on deme sizes, migration rates and sampling scale. The expected influence of these three factors may strongly complicate the study of the effect of population structure on the inference of past population size.

This can be noticed in the heterogeneity of the results of the different simulation studies available. All those comparisons based on different simulations of structured populations show that the effect of population structure is generally complex, and will be quite difficult to predict except in a few simple cases. Those results also show that verbal argumentation based on over-simplified past genealogical processes may not always give the right prediction. Nevertheless, three main points arise from those simulation studies and can serve as guidelines for empirical studies: (1) using a large sample scale strongly limits the influence of population structure on the inference of past population size variations, as advocated by Chikhi et al. (2010), but allows correct inference only when a single individual (ideally, a single gene) is sampled per deme or when migration rates are relatively high, i.e. M > 10.0; (2) for all other demographic situations, detection of past population size changes and parameter inferences based on panmictic models may often be misleading. However, we did not consider here sampling a single individual per deme, which may more effectively decrease the bias due to population structure.

Such results finally imply that models themselves should be improved. First, model choice procedures should be developed to evaluate whether observed patterns of genetic diversity can be better explained by a model of population size change or by a model of subdivided populations. For example, Peter, Wegmann and Excoffier (2010) used an Approximate Bayesian Computation (ABC) model choice approach to distinguish between structured populations and panmictic populations that have undergone past changes in size. However, they showed by simulation that their model choice procedure has relatively limited power to assign simulated data sets to the correct evolutionary model, even with a relatively large number of loci (e.g. 60% to 85.5% with 10 to 200 loci, respectively). An alternative is to develop models accounting for both population structure and population size changes, which would probably be more realistic for most species/populations, but the only available method (Hey and Nielsen, 2007; Hey, 2010) has never been tested for scenarios with both structured populations and past changes in population sizes.

4.5. Conclusion

This work shows that our new inference method seems very competitive compared to alternative methods, such as MsVar. However, our simulation tests also showed some important limits, most importantly large computation times for strong disequilibrium scenarios and a strong influence of some forms of unaccounted population structure. A first major improvement would thus be to speed up the analyses. Among the different possibilities, a relatively simple improvement would be to choose more efficiently the number of explored histories for each point of the parameter space. A more attractive improvement would be to design more efficient IS algorithms for time-inhomogeneous models. However, various unsuccessful attempts suggest that it may be a difficult task (not shown). A second major improvement would be to include population structure in the demographic model for simultaneous inference of migration rates and past population size change, or to develop model choice procedures.

Lastly, given the current revolution in genetic data production due to next generation sequencing technologies (NGS), it seems crucial to allow for the analysis of different types of independent markers, such as small DNA sequences without intra-locus recombination, or SNPs. Given the relatively large computation times of our method, analyses will clearly only be tractable for a limited number of markers (e.g. < 10,000), but could nevertheless give very precise inferences. However, considering only independent markers is probably not the optimal approach, as NGS makes it possible to apply a new class of methods based on the analysis of linkage disequilibrium for past demographic inferences. Such methods are based on the computation of the distribution of non-recombining haplotype block lengths (e.g. Meuwissen and Goddard, 2007; Albrechtsen et al., 2009; Gusev et al., 2012; Palamara et al., 2012; Theunert et al., 2012) or explicitly model the spatial dependence of markers using hidden Markov models (e.g. Dutheil et al., 2009; Mailund et al., 2012). They will probably play a major role in the future of population genetic demographic and historical inferences.

577

5. Methods

5.1. Simulation study

Table 7: Simulated demographic scenarios with a stepwise mutation model

Case   D (T)            θ (N)        θanc (Nanc)
[0]    1.25 (200)       0.4 (200)    40.0 (20,000)
[1]    0.025 (10)       0.4 (200)    40.0 (20,000)
[2]    0.0625 (25)      0.4 (200)    40.0 (20,000)
[3]    0.125 (50)       0.4 (200)    40.0 (20,000)
[4]    0.25 (100)       0.4 (200)    40.0 (20,000)
[5]    0.5 (200)        0.4 (200)    40.0 (20,000)
[6]    2.5 (1,000)      0.4 (200)    40.0 (20,000)
[7]    3.5 (1,400)      0.4 (200)    40.0 (20,000)
[8]    5 (2,000)        0.4 (200)    40.0 (20,000)
[9]    7.5 (3,000)      0.4 (200)    40.0 (20,000)
[10]   1.25 (200)       0.4 (200)    2.0 (1,000)
[11]   1.25 (200)       0.4 (200)    4.0 (2,000)
[12]   1.25 (200)       0.4 (200)    8.0 (4,000)
[13]   1.25 (200)       0.4 (200)    12.0 (6,000)
[14]   1.25 (200)       0.4 (200)    24.0 (16,000)
[15]   1.25 (200)       0.4 (200)    120.0 (60,000)
[16]   1.25 (200)       0.4 (200)    400.0 (200,000)

A first set of simulations aims at testing the power of the algorithm to detect bottlenecks and the accuracy of the parameter estimates when the duration (D) or the strength of the contraction (θanc) varies. The mutation process considered is a SMM over a range of 200 alleles. These experiments are presented in Table 7. We also reanalyzed the sixty simulated data sets from Girod et al. (2011) to compare the results obtained with MsVar and our own estimates. The latter simulated data sets are described in Table S2 and the comparison results are presented in section B in the supplementary materials.

Figure 6: Simulated data sets with population structure

Local population structure. The simulated IBD populations are composed of individuals set at the nodes of a regular lattice, whose size can vary. A past reduction in population size is thus modeled as a reduction of the habitat area keeping a constant density of individuals. Various levels of localized dispersal were simulated via truncated Pareto distributions with mean squared parent-offspring dispersal distance, say σ², varying in {1; 4; 10; 100}.

Parameters of the IBD populations:
• At equilibrium: θ = 4.0 with a 32 × 31 lattice (hence N = 1984 genes)
• Including a habitat contraction: (D, θ, θanc) = (1.25, 0.4, 40.0) with lattices of sizes from 10 × 10 (N = 200) to 100 × 100 (Nanc = 20,000) backward in time

Simulated sampling schemes: 100 genes sampled
• on a 5 × 10 lattice in the center of the population [small sample scale], or
• regularly on the whole area (i.e., one individual every 4 nodes) [large sample scale].

Island population structure. We considered models with d = 10 demes of equal size Nd genes, varying in {20; 200; 2000}, and exchanging migrants at rate m between pairs of demes, varying in {0.000025; 0.00025; 0.0025; 0.025; 0.075; 0.25}. The model is fully characterized by the scaled parameters θ = 2dNdµ and M = 2Ndm. When past contractions occurred, deme sizes Nd decreased forward in time but migration rates m were kept constant in time. Values of M reported below correspond to scaled migration rates at sampling time t = 0.

Parameters of the island populations:
• θ ∈ {4.0, 20.0} and M ∈ {0.01, 0.1, 1.0, 10.0, 30.0, 100.0} without population size changes
• θ = 0.4, M ∈ {0.01, 1.0, 100.0} and a contraction with parameters D = 1.25, θanc = 40.0

Simulated sampling schemes: samples of 100 genes picked at random
• from a single deme [small sample scale], or
• from three demes [large sample scale], or
• from all demes [very large sample scale].

A second set of simulations concerns robustness and accuracy with respect to the mutation processes of microsatellites, which are known to be highly complex (Ellegren, 2000, 2004; Sun et al., 2012). This second set of simulations is thus based on a generalized stepwise mutation model (GSM) with either p = 0.22 or p = 0.74, which are respectively the value commonly considered as a realistic average in the literature (Dib et al., 1996; Ellegren, 2000; Estoup et al., 2001; Ellegren, 2004) and the largest value ever reported (Fitzsimmons, 1998; Peery et al., 2012). We have also added data sets drawn with the K-allele model (KAM) to those simulations, which might be seen as a GSM with p = 1.0. A first set of analyses tests the robustness to mis-specification of the mutation process: we simulated under a GSM but inferred under a SMM. A second set of analyses tests the accuracy of the estimates when the inference algorithm is based on a GSM with unknown value of p.
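The role of the GSM parameter p can be illustrated with a short simulation of single mutation events. This is a minimal sketch under stated assumptions, not the IBDSim implementation: it assumes that the number of repeat units gained or lost is geometric, P(k) = (1 - p) p^(k-1), with equal probability of gain and loss, so that p = 0 reduces to the SMM. The function name `gsm_step` and the handling of range boundaries (redrawing a step that would leave the allowed range) are hypothetical choices made for illustration.

```python
import random


def gsm_step(allele, p, n_alleles, rng):
    """Return the allele reached after one mutation under a bounded GSM.

    Assumed model: step size k >= 1 with P(k) = (1 - p) * p**(k - 1),
    direction +/- with probability 1/2 each. Steps leaving the range
    1..n_alleles are redrawn (an illustrative boundary rule).
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    while True:
        size = 1
        while rng.random() < p:          # extend the step with probability p
            size += 1
        direction = 1 if rng.random() < 0.5 else -1
        new_allele = allele + direction * size
        if 1 <= new_allele <= n_alleles:
            return new_allele


if __name__ == "__main__":
    rng = random.Random(42)
    for p in (0.0, 0.22, 0.74):
        steps = [abs(gsm_step(100, p, 200, rng) - 100) for _ in range(10_000)]
        print(p, sum(steps) / len(steps))  # mean absolute step size
```

Under these assumptions the mean step size far from the boundaries is 1 / (1 - p), so larger values of p produce more frequent multi-step mutations, which is the feature a strict SMM analysis ignores.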

The aim of the third set of simulations is to test robustness against a population structure that is ignored by the inference algorithm. All the data sets in this last series were simulated under a GSM with p = 0.22 and are presented in Table 6. A first group of data sets simulates local within-population structure according to an isolation-by-distance (IBD) model. It thus aims at testing the robustness of the inferences to the assumption of panmixia by considering non-random mating due to spatially localized parent-offspring dispersal. The second group of data sets simulates both within- and among-population structure at a larger spatial scale according to an island model.

For each scenario, we simulated 200 multilocus data sets. Each simulated data set is a sample of ng = 100 genes (or haploid individuals), genotyped at nℓ = 10 unlinked microsatellite loci, except for a few situations where we indicate that 25 or 50 loci are used instead of 10. The mutation rate per gene per generation, say µ, is assumed to be constant for all loci, equal to 10⁻³. All simulated samples, except the data sets from Girod et al. (2011) (see section B in the supplementary materials), were produced with a new version of the IBDSim software (Leblois, Estoup and Rousset, 2009) that considers continuous changes of population sizes.
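The correspondence between scaled and unscaled values used in the simulated scenarios can be checked numerically. The sketch below assumes θ = 2Nµ and D = T/(2N) for a single population (definitions inferred here from the tabulated values rather than stated in this section), together with θ = 2dNdµ and M = 2Ndm for the island model as given above. The function names are illustrative only.

```python
MU = 1e-3  # mutation rate per gene per generation used in the simulations


def theta(n_genes, mu=MU):
    """Scaled population size, assuming theta = 2 * N * mu."""
    return 2 * n_genes * mu


def scaled_duration(t_generations, n_genes):
    """Scaled time since the size change, assuming D = T / (2 * N)."""
    return t_generations / (2 * n_genes)


def island_theta(n_demes, deme_size, mu=MU):
    """Island-model scaled size, theta = 2 * d * N_d * mu."""
    return 2 * n_demes * deme_size * mu


def island_migration(deme_size, m):
    """Island-model scaled migration rate, M = 2 * N_d * m."""
    return 2 * deme_size * m


if __name__ == "__main__":
    # Current and ancestral sizes of the baseline contraction scenario
    print(theta(200), theta(20_000))
    # Duration of the contraction for T = 10 generations and N = 200 genes
    print(scaled_duration(10, 200))
    # Island model with d = 10 demes of 200 genes and m = 0.0025
    print(island_theta(10, 200), island_migration(200, 0.0025))
```

With µ = 10⁻³, these relations link, for instance, N = 200 and Nanc = 20,000 to the θ and θanc values of the baseline contraction scenario.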


5.2. Validation

In all simulation experiments, the true (simulated) values of the parameters of interest are compared to the estimated values. The estimation bias and error, assessed by the relative mean bias and relative root mean square error (RRMSE), are reported, as well as the proportion of data sets for which a bottleneck or a false expansion signal is significantly detected (BDR and FEDR, respectively). Furthermore, the accuracy of the inference methodology is assessed by means of profile likelihood ratio tests (LRTs; Cox and Hinkley, 1974; Severini, 2000). The coverage properties of the confidence intervals computed from the smoothed likelihood surface are tested via the distributions of LRT p-values, which should be asymptotically uniform. The departure from uniformity is tested by Kolmogorov–Smirnov tests, notably to check the validity of the implementation of the inference method and to assess the different factors that can affect likelihood surface inference.
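The validation statistics described above can be sketched in a few lines. This is an illustrative sketch rather than the code used for the study: it assumes the usual definitions of relative bias and RRMSE (errors divided by the true value), a one-parameter LRT compared to a chi-square distribution with one degree of freedom, and the asymptotic Kolmogorov distribution (with a standard small-sample correction) for the test of uniformity. All function names are hypothetical.

```python
import math


def relative_bias(estimates, truth):
    """Mean of (estimate - truth) / truth over data sets."""
    return sum(e - truth for e in estimates) / (len(estimates) * truth)


def rrmse(estimates, truth):
    """Relative root mean square error of the estimates."""
    mse = sum((e - truth) ** 2 for e in estimates) / len(estimates)
    return math.sqrt(mse) / truth


def lrt_pvalue(loglik_max, loglik_true):
    """P-value of a 1-df likelihood ratio test at the true parameter value."""
    stat = max(0.0, 2.0 * (loglik_max - loglik_true))
    return math.erfc(math.sqrt(stat / 2.0))   # chi-square(1) survival function


def ks_uniform(pvalues):
    """KS statistic and asymptotic p-value against the uniform(0, 1) law."""
    xs = sorted(pvalues)
    n = len(xs)
    d = max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:                              # series is numerically 1 here
        return d, 1.0
    tail = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                     for k in range(1, 101))
    return d, min(1.0, max(0.0, tail))


if __name__ == "__main__":
    estimates = [0.35, 0.42, 0.47, 0.38]       # made-up estimates of theta = 0.4
    print(relative_bias(estimates, 0.4), rrmse(estimates, 0.4))
    print(ks_uniform([0.05, 0.2, 0.41, 0.58, 0.77, 0.93]))
```

A small KS p-value indicates that the LRT p-values depart from uniformity, i.e. that confidence intervals derived from the likelihood surface do not have their nominal coverage.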

The supplementary materials are available at the XXX website. The Migraine software, which implements the methods described above, can be downloaded from the web site kimura.univ-montp2.fr/~rousset/Migraine.htm.

Acknowledgements

We are grateful to L. Chikhi, J.-M. Cornuet, A. Estoup and J.-M. Marin for their constructive discussions about this work. This study was supported by the Agence Nationale de la Recherche (EMILE 09-blan-0145-01 and [email protected] 2010-BLAN-1726-01 projects) and by the Institut National de Recherche en Agronomie (Project INRA Starting Group “IGGiPop”). Part of this work was carried out by using the resources of the Computational Biology Service Unit from the MNHN (CNRS Unité Mixte de Service 2700), the INRA MIGALE and GENOTOUL bioinformatics platforms and the computing grids of ISEM and CBGP labs.

References

Albrechtsen A, Sand Korneliussen T, Moltke I, van Overseem Hansen T, Nielsen FC, Nielsen R, 2009. Relatedness mapping and tracts of relatedness for genome-wide data in the presence of linkage disequilibrium. Genetic Epidemiology, 33:266–274.

Beaumont M, 1999. Detecting population expansion and decline using microsatellites. Genetics, 153:2013–2029.

Beerli P, Felsenstein J, 2001. Maximum likelihood estimation of a migration matrix and effective population sizes in n subpopulations by using a coalescent approach. Proc. Natl. Acad. Sci. U. S. A., 98:4563–4568.

Bhargava A, Fuentes F, 2010. Mutational dynamics of microsatellites. Molecular biotechnology, 44:250–266.

Bonebrake T, Christensen J, Boggs C, Ehrlich P, 2010. Population decline assessment, historical baselines, and conservation. Conservation Letters, 3:371–378.

Chikhi L, Sousa VC, Luisi P, Goossens B, Beaumont MA, 2010. The confounding effects of population structure, genetic diversity and the sampling scheme on the detection and quantification of population size changes. Genetics, 186:983–995.

Colautti RI, Manca M, Viljanen M, Ketelaars HAM, Bürgi H, Macisaac HJ, Heath DD, 2005. Invasion genetics of the Eurasian spiny waterflea: evidence for bottlenecks and gene flow using microsatellites. Mol Ecol, 14:1869–1879.

Comps B, Gömöry D, Letouzey J, Thiébaut B, Petit RJ, 2001. Diverging trends between heterozygosity and allelic richness during postglacial colonization in the European beech. Genetics, 157:389–397.

Cornuet JM, Beaumont MA, 2007. A note on the accuracy of PAC-likelihood inference with microsatellite data. Theor. Popul. Biol., 71:12–19.

Cornuet JM, Luikart G, 1996. Description and power analysis of two tests for detecting recent population bottlenecks from allele frequency data. Genetics, 144:2001–2014.

Cornuet JM, Santos F, Beaumont MA, Robert CP, Marin JM, Balding DJ, Guillemaud T, Estoup A, 2008. Inferring population history with DIY ABC: a user-friendly approach to approximate Bayesian computation. Bioinformatics, 24:2713–2719.

Cox DR, Hinkley DV, 1974. Theoretical statistics. London: Chapman & Hall.

Cressie NAC, 1993. Statistics for spatial data. New York: Wiley.

de Iorio M, Griffiths RC, 2004a. Importance sampling on coalescent histories. Advances in Applied Probabilities, 36:417–433.

de Iorio M, Griffiths RC, 2004b. Importance sampling on coalescent histories. II. Subdivided population models. Advances in Applied Probabilities, 36:434–454.

de Iorio M, Griffiths RC, Leblois R, Rousset F, 2005. Stepwise mutation likelihood computation by sequential importance sampling in subdivided population models. Theoretical Population Biology, 68:41–53.

Dib C, Fauré S, Fizames C, et al. (14 co-authors), 1996. A comprehensive genetic map of the human genome based on 5,264 microsatellites. Nature, 380:152–154.

Drummond AJ, Suchard MA, Xie D, Rambaut A, 2012. Bayesian phylogenetics with BEAUti and the BEAST 1.7. Molecular Biology and Evolution, 29:1969–1973.

Dutheil JY, Ganapathy G, Hobolth A, Mailund T, Uyenoyama MK, Schierup MH, 2009. Ancestral population genomics: the coalescent hidden Markov model approach. Genetics, 183:259–274.

Ellegren H, 2000. Microsatellite mutations in the germline: implications for evolutionary inference. Trends Genet., 16:551–558.

Ellegren H, 2004. Microsatellites: simple sequences with complex evolution. Nat Rev Genet, 5:435–445.

Emerson B, Paradis E, Thébaud C, 2001. Revealing the demographic histories of species using DNA sequences. Trends in Ecology and Evolution, 16:707–716.

Estoup A, Wilson IJ, Sullivan C, Cornuet JM, Moritz C, 2001. Inferring population history from microsatellite and enzyme data in serially introduced cane toads, Bufo marinus. Genetics, 159:1671–1687.

Faurby S, Pertoldi C, 2012. The consequences of the unlikely but critical assumption of stepwise mutation in the population genetic software, MSVAR. Evolutionary Ecology Research, 14:859–879.

Felsenstein J, 1992. Estimating effective population size from sample sequences - Inefficiency of pairwise and segregating sites as compared to phylogenetic estimates. Genetical Research, 59:139–147.

Fitzsimmons NN, 1998. Single paternity of clutches and sperm storage in the promiscuous green turtle (Chelonia mydas). Mol Ecol, 7:575–584.

Frankham R, Lees K, Montgomery M, England P, Lowe E, Briscoe D, 2006. Do population size bottlenecks reduce evolutionary potential? Animal Conservation, 2:255–260.

Garza JC, Williamson EG, 2001. Detection of reduction in population size using data from microsatellite loci. Molecular Ecology, 10:305–318.

Girod C, Vitalis R, Leblois R, Fréville H, 2011. Inferring population decline and expansion from microsatellite data: a simulation-based evaluation of the Msvar method. Genetics, 188:165–179.

Gonser R, Donnelly P, Nicholson G, Di Rienzo A, 2000. Microsatellite mutations and inferences about human demography. Genetics, 154:1793–1807.

Griffiths RC, Tavaré S, 1994. Ancestral inference in population genetics. Statistical Science, 9:307–319.

Guillot G, Leblois R, Coulon A, Frantz AC, 2009. Statistical methods in spatial genetics. Molecular Ecology, 18:4734–4756.

Gusev A, Palamara PF, Aponte G, Zhuang Z, Darvasi A, Gregersen P, Pe'er I, 2012. The architecture of long-range haplotypes shared within and across populations. Molecular biology and evolution, 29:473–486.

Heller R, Chikhi L, Siegismund HR, 2013. The confounding effect of population structure on Bayesian skyline plot inferences of demographic history. PloS one, 8:e62992.

Hey J, 2010. Isolation with migration models for more than two populations. Mol Biol Evol, 27:905–920.

Hey J, Nielsen R, 2004. Multilocus methods for estimating population sizes, migration rates and divergence time, with applications to the divergence of Drosophila pseudoobscura and D. persimilis. Genetics, 167:747–760.

Hey J, Nielsen R, 2007. Integration within the Felsenstein equation for improved Markov chain Monte Carlo methods in population genetics. Proc. Natl. Acad. Sci. U. S. A., 104:2785–2790.

Keller LF, Waller DM, 2002. Inbreeding effects in wild populations. Trends Ecol. Evol., 17:230–241.

Kuhner MK, 2006. LAMARC 2.0: maximum likelihood and Bayesian estimation of population parameters. Bioinformatics, 22:768–770.

Lande R, 1988. Genetics and demography in biological conservation. Science, 241:1455–1460.

Lawton-Rauh A, 2008. Demographic processes shaping genetic variation. Curr Opin Plant Biol, 11:103–109.

Leblois R, Estoup A, Rousset F, 2009. IBDSim: A computer program to simulate genotypic data under isolation by distance. Molecular Ecology Resources, 9:107–109.

Leblois R, Estoup A, Streiff R, 2006. Habitat contraction and reduction in population size: Does isolation by distance matter? Molecular Ecology, 15:3601–3615.

Mailund T, Halager AE, Westergaard M, et al. (11 co-authors), 2012. A new isolation with migration model along complete genomes infers very different divergence processes among closely related Great Ape species. PLoS genetics, 8:e1003125.

Marjoram P, Tavaré S, 2006. Modern computational approaches for analysing molecular genetic variation data. Nat Rev Genet, 7:759–770.

Meuwissen TH, Goddard ME, 2007. Multipoint identity-by-descent prediction using dense markers to map quantitative trait loci and estimate effective population size. Genetics, 176:2551–2560.

Nielsen R, Beaumont MA, 2009. Statistical inferences in phylogeography. Mol Ecol, 18:1034–1047.

Ohta T, Kimura M, 1973. A model of mutation appropriate to estimate the number of electrophoretically detectable alleles in a finite population. Genet. Res., 22:201–204.

Palamara PF, Lencz T, Darvasi A, Pe'er I, 2012. Length distributions of identity by descent reveal fine-scale demographic history. The American Journal of Human Genetics.

Peery MZ, Kirby R, Reid BN, Stoelting R, Doucet-Bër E, Robinson S, Vàsquez-Carrillo C, Pauli JN, Palsbøll PJ, 2012. Reliability of genetic bottleneck tests for detecting recent population declines. Molecular Ecology, 21:3403–3418.

Peter B, Wegmann D, Excoffier L, 2010. Distinguishing between population bottleneck and population subdivision by a Bayesian model choice procedure. Molecular ecology, 19:4648–4660.

Pritchard JK, Seielstad MT, Perez-Lezaun A, Feldman MW, 1999. Population growth of human Y chromosome microsatellites. Mol. Biol. Evol., 16:1791–1798.

Ptak SE, Przeworski M, 2002. Evidence for population growth in humans is confounded by fine-scale population structure. Trends in Genetics, 18:559–563.

Reusch TBH, Wood TE, 2007. Molecular ecology of global change. Mol Ecol, 16:3973–3992.

Rousset F, Leblois R, 2007. Likelihood and approximate likelihood analyses of genetic structure in a linear habitat: performance and robustness to model mis-specification. Mol. Biol. Evol., 24:2730–2745.

Rousset F, Leblois R, 2012. Likelihood-based inferences under a coalescent model of isolation by distance: two-dimensional habitats and confidence intervals. Mol. Biol. Evol., 29:957–973.

Schneider S, Excoffier L, 1999. Estimation of past demographic parameters from the distribution of pairwise differences when the mutation rates vary among sites: Application to human mitochondrial DNA. Genetics, 152:1079–1089.

Schwartz M, Luikart G, Waples R, 2007. Genetic monitoring as a promising tool for conservation and management. TREE, 22:25–33.

Severini TA, 2000. Likelihood methods in statistics. Oxford Univ. Press.

Spencer CC, Neigel JE, Leberg PL, 2000. Experimental evaluation of the usefulness of microsatellite DNA for detecting demographic bottlenecks. Mol Ecol, 9:1517–1528.

Stephens M, Donnelly P, 2000. Inference in molecular population genetics (with discussion). J. R. Stat. Soc., 62:605–655.

Storz J, Beaumont M, 2002. Testing for genetic evidence of population expansion and contraction: An empirical analysis of microsatellite DNA variation using a hierarchical Bayesian model. Evolution, 56:154–166.

Sun J, Helgason A, Masson G, et al. (11 co-authors), 2012. A direct characterization of human mutation based on microsatellites. Nature Genetics, 44:1161–1165.

Theunert C, Tang K, Lachmann M, Hu S, Stoneking M, 2012. Inferring the history of population size change from genome-wide SNP data. Molecular Biology and Evolution, 29:3653–3667.

Wakeley J, 1999. Nonequilibrium migration in human evolution. Genetics, 153:1863–1871.

Williams B, Nichols J, Conroy M, 2002. Analysis and management of animal populations: modeling, estimation, and decision making. Academic Pr.

Wright S, 1951. The genetical structure of populations. Ann. Eugenics, 15:323–354.

Supplementary materials for the article: Maximum likelihood inference of population size contractions from microsatellite data, by Raphaël Leblois, Pierre Pudlo, Joseph Néron, François Bertaux, Champak Reddy Beeravolu, Renaud Vitalis, François Rousset

Preprint submitted as an article to Molecular Biology and Evolution, November 29, 2013

A. Details on the likelihood computations and settings of the inference method

A.1. Coalescent-based IS algorithms and disequilibrium models

In this section, we give a more detailed overview of the method used to compute likelihoods of genetic data at a given locus, and technical details regarding the Monte Carlo algorithm. The likelihood at a given point of the parameter space is estimated using the importance sampling approach of Stephens and Donnelly (2000) and de Iorio and Griffiths (2004a). An ancestral history, i.e. a coalescence tree with mutations, is defined as the set of all ancestral configurations $H = \{H_k;\ k = 0, -1, \ldots, -m\}$, corresponding to all coalescent or mutation events that occurred from $H_0$, the current sample state (i.e. the sample allelic configuration, or allelic counts), to $H_{-m}$, the allelic state of the most recent common ancestor (MRCA) of the sample. The Markov nature of the backward coalescent process implies that

\[ p(H_k) = \sum_{\{H_{k-1}\}} p(H_k \mid H_{k-1})\, p(H_{k-1}), \]

and expanding the recursion over possible ancestral histories of a current sample leads to

\[ p(H_0) = E_p\left[ p(H_0 \mid H_{-1}) \cdots p(H_{-m+1} \mid H_{-m}) \right]. \]

However, forward transition probabilities $p(H_k \mid H_{k-1})$ cannot directly be used in a backward process, and backward transition probabilities $p(H_{k-1} \mid H_k)$ are unknown, except in some specific simple models such as parent independent mutations (PIM) in a single stable panmictic population. Importance sampling techniques based on an approximation $\hat p(H_{k-1} \mid H_k)$ of $p(H_{k-1} \mid H_k)$ are thus used to derive the probability of a sample over possible histories:

\[ p(H_0) = E_{\hat p}\left[ \frac{p(H_0 \mid H_{-1})}{\hat p(H_{-1} \mid H_0)} \cdots \frac{p(H_{-m+1} \mid H_{-m})}{\hat p(H_{-m} \mid H_{-m+1})} \right]. \tag{1} \]

The likelihood of the data is then estimated as the average value of the probability of a sample configuration $H_0$ given an ancestral history $H_i$, over $n_H$ independent simulations, backward in time, of possible ancestral histories:

\[ p(H_0) \approx \frac{1}{n_H} \sum_{i=1}^{n_H} p(H_0 \mid H_i). \]

The distribution of the possible ancestral histories is generated by an absorbing Markov chain with transition probabilities $\hat p(H_{k-1} \mid H_k)$, and the likelihood is estimated by averaging the product of sequential importance weights

\[ \frac{p(H_{i,0} \mid H_{i,-1})}{\hat p(H_{i,-1} \mid H_{i,0})} \cdots \frac{p(H_{i,-m+1} \mid H_{i,-m})}{\hat p(H_{i,-m} \mid H_{i,-m+1})}, \]

corresponding to the ratio of forward and backward transition probabilities obtained for each history $H_i$. Computation of the backward transition probabilities $\hat p$ relies on Stephens and Donnelly's $\hat\pi$ approximations of the unknown probability $\pi$ that an additional gene sampled from a population is of a given allelic type, conditional on a previous sample configuration (see A.4 for an example of $\hat\pi$ computations).

For efficient importance sampling distributions (i.e. approximate backward transition probabilities $\hat p$ close to the exact $p$), precise estimation of the likelihood can be obtained with very few explored histories. For example, under a parent independent mutation model (e.g. a K-allele model, KAM), the importance sampling scheme of Stephens and Donnelly (2000) for a single isolated population is optimal, because the $\hat\pi$'s computed following Stephens and Donnelly (2000) are equal to the true $\pi$'s and all the ratios $p(H_k \mid H_{k-1}) / \hat p(H_{k-1} \mid H_k)$ cancel out. In such an ideal case, consideration of a single ancestral history is thus sufficient to get the exact likelihood (de Iorio and Griffiths, 2004a,b; Stephens and Donnelly, 2000). However, departures from parent independent mutation, from panmixia and from time-homogeneity decrease the efficiency of IS proposals, and precise estimation of likelihoods then requires exploring more ancestral histories.

Under a time-homogeneous model of isolation by distance, Rousset and Leblois (2007, 2012) found that 30 replicates of the absorbing Markov chain, rebuilding 30 independent possible ancestral histories, are enough to get perfect LRT-Pvalue distributions. Here, for the first time, the IS algorithm of de Iorio and Griffiths (2004a) is applied to a time-inhomogeneous demographic model using $\hat\pi$ probabilities computed from a time-homogeneous demographic model, as described in the main text. Our work clearly shows that IS proposals computed in this way become less and less efficient for demographic scenarios with increasing disequilibrium.
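The importance sampling identity of equation (1), and the fact that an exact backward proposal makes a single history sufficient, can be illustrated on a toy model. The sketch below is not the coalescent algorithm: it assumes a hypothetical two-state forward Markov chain with a fixed number of steps, in place of the absorbing chain over ancestral configurations, and it multiplies the weight by the probability of the oldest state explicitly. All names and numerical values are illustrative.

```python
import random

# Hypothetical toy model: 2 states, fixed number of forward steps.
P = [[0.9, 0.1],
     [0.2, 0.8]]      # forward transition probabilities p(H_k | H_{k-1})
PI0 = [0.5, 0.5]      # distribution of the oldest state H_{-m}
M_STEPS = 3           # number of transitions between H_{-m} and H_0


def marginals():
    """alpha[j][s] = probability of state s after j forward steps."""
    alpha = [PI0[:]]
    for _ in range(M_STEPS):
        prev = alpha[-1]
        alpha.append([sum(prev[a] * P[a][b] for a in range(2))
                      for b in range(2)])
    return alpha


ALPHA = marginals()   # ALPHA[M_STEPS][x] is the exact likelihood of H_0 = x


def uniform_proposal(j, child):
    """Crude backward proposal that ignores the child state."""
    return [0.5, 0.5]


def exact_proposal(j, child):
    """Exact backward transition probabilities, obtained by Bayes' rule."""
    return [ALPHA[j - 1][a] * P[a][child] / ALPHA[j][child] for a in range(2)]


def history_weight(h0, proposal, rng):
    """Simulate one history backward from h0; return its importance weight."""
    weight, child = 1.0, h0
    for j in range(M_STEPS, 0, -1):          # j = forward index of the child
        q = proposal(j, child)
        parent = 0 if rng.random() < q[0] else 1
        weight *= P[parent][child] / q[parent]   # forward / backward ratio
        child = parent
    return weight * PI0[child]               # probability of the oldest state


def is_likelihood(h0, proposal, n_hist, seed=1):
    """Average the importance weights over n_hist simulated histories."""
    rng = random.Random(seed)
    total = sum(history_weight(h0, proposal, rng) for _ in range(n_hist))
    return total / n_hist


if __name__ == "__main__":
    print("exact           :", ALPHA[M_STEPS][0])
    print("uniform, 20000  :", is_likelihood(0, uniform_proposal, 20_000))
    print("exact proposal,1:", is_likelihood(0, exact_proposal, 1))
```

With the exact backward proposal every history carries the same weight, equal to the likelihood, whereas the crude proposal needs many histories to reach a comparable precision. This mirrors, in a much simpler setting, the loss of efficiency described above when the proposal departs from the true backward process.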

A.2. Some efficient modifications of the original IS algorithm

To speed up the computations, we can stop the simulation of the genealogies during the IS algorithm before reaching the MRCA (see Jasra, De Iorio and Chadeau-Hyam, 2011; Rousset and Leblois, 2007). Here, the algorithm is stopped when reaching demographic equilibrium (i.e., after time T when N(t) = Nanc) and we finalize the current IS estimates of the likelihood with the PAC-likelihood of the ancestral lineages (Li and Stephens, 2003; Cornuet and Beaumont, 2007; Rousset and Leblois, 2007, 2012). This scheme will be called PACanc. Using analytical formulas for the exact probability of the last pair of genes, computed as in Rousset (2004), also slightly decreases the computation time, in the same vein as in de Iorio et al. (2005) and Rousset and Leblois (2007, 2012). The IS scheme can thus be stopped when reaching an ancestral sample of size 2 and finalized with this exact formula, hereafter called 2ID. The combination of both PACanc and 2ID will be called PACanc2ID.

A detailed comparison between the four schemes (strict IS, 2ID, PACanc and PACanc2ID) is presented in Table S1 for the baseline scenarios under a SMM and under a GSM (case[0], [A] to [F] under a SMM; case[G] to [L] under a GSM with p = 0.22; and case[K] to [M] under a GSM with p = 0.74). For the GSM, data sets are simulated under a GSM with 40 allelic states but analyzed with a GSM with 50 possible allelic states. This slight mis-specification of the mutation model has a relatively strong influence when p = 0.74: LRT-Pvalue distributions are not close to the 1:1 line regardless of the number of loci. Such a strong effect is also probably due to the consideration of large θanc values. Apart from this mutation model effect, our results show that performances are similar in all cases for the different algorithms. All simulations with a GSM, i.e. for the tests of the effect of mutational processes and population structure, are analyzed using the PACanc2ID scheme, unless otherwise specified.

ˆ case / L [0] IS [J] 2ID [K] PACanc [L] PACanc2ID [A] IS [M] PACanc2ID [B] IS [N] PACanc2ID [C] IS [O] 2ID [P] PACanc [Q] PACanc2ID [D] IS [R] PACanc2ID [E] IS [S] PACanc2ID [F] IS [T] PACanc2ID [G] IS [H] IS [I] IS

n` 10 10 10 10 25 25 50 50 10 10 10 10 50 50 10 10 50 50 10 25 50

rel. bias NA NA NA NA NA NA NA NA 0.26 0.24 0.21 0.23 0.17 0.19 0.016 -0.022 0.045 0.0047 NA NA NA

p RRMSE NA NA NA NA NA NA NA NA 0.91 0.88 0.86 0.85 0.47 0.48 0.14 0.18 0.081 0.070 NA NA NA KS NA NA NA NA NA NA NA NA 0.16 0.0080 0.39 0.82 0.12 0.67 0.0.094 0.31 3.8 · 10−5 0.56 NA NA NA

rel. bias 0.035 0.038 0.061 0.057 0.0066 0.012 0.015 0.050 0.033 0.0047 0.052 0.021 0.059 0.070 0.137 0.072 0.34 0.24 -0.070 -0.027 -0.084

θ RRMSE 0.56 0.054 0.55 0.544 0.31 0.31 0.23 0.23 0.51 0.51 0.52 0.51 0.25 0.26 0.52 0.56 0.44 0.40 0.64 0.49 0.32 KS 0.056 0.060 0.051 0.032 0.35 0.48 0.62 0.41 0.12 0.377 0.36 0.096 0.44 0.063 0.11 0.0691