Likelihood-based demographic inference using the co - Raphael Leblois

Intro Likelihood & coa MCMC

IS

Sim tests

Conclusions

To do. RL: harmonize the notation with FR's macros; replace IS by SIS throughout; stress that "sequential" means along the construction of the tree; better separate the derivation of the recursion, the exact π and the π̂ (pi hats); put keywords in blue; add the term p(H_τ) = stationary distribution of allelic states wherever necessary. FR: 1) cover the backward equation in an earlier lecture (plus other points about my presentation of the backward equation); 2) the differential operator φ_j is not made explicit ⇒ more complicated sentences.

Master 2 Biostatistics module: population genetics models

Likelihood-based demographic inference using the coalescent
Raphaël Leblois & François Rousset
Centre de Biologie pour la Gestion des Populations (CBGP, Montpellier)
Institut des Sciences de l'Evolution (ISEM, Montpellier)

January 2017

• Introduction
• Likelihoods under the coalescent
• Felsenstein et al.'s MCMC
- Metropolis-Hastings
- MsVar example
- Conclusions on MCMC
• Griffiths et al.'s IS
- Griffiths et al.'s recursion
- Old IS scheme
- New IS scheme
- A general method based on diffusion
- Approximations of π
• Simulation tests
- Precision
- Validation
- Robustness
- MCMC vs. IS
• Conclusions


Typical biological question:
• There is demographic evidence that orang-utan population sizes have collapsed → but what is the major cause of the decline, when did it start, and how strong is it?
• Can population genetics help?
- Can we infer the time of the event?
- Can we infer the strength of the population size decrease?

Methods based on coalescent simulations (reminder)

[Figure: genealogy of the population and genealogy of the sample (the coalescent tree), read forward and backward in time.]

Coalescence time of $k$ lineages: $P(T_k = t) \approx \frac{k(k-1)}{2N}\, e^{-t\,\frac{k(k-1)}{2N}}$

Number of mutations $m$ on a branch of length $t$: $P(m \mid t) = \frac{(\mu t)^m e^{-\mu t}}{m!}$
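The two distributions above are enough to simulate the number of mutations in a sample. The sketch below is only an illustration under stated assumptions: a constant population of $N$ gene copies, coalescence only, and function names (`simulate_coalescent_times`, `simulate_number_of_mutations`) that are hypothetical and not taken from any package.

```python
import random

def simulate_coalescent_times(n, N, rng):
    """Waiting times T_n, ..., T_2 (in generations) while k lineages remain."""
    times = []
    for k in range(n, 1, -1):
        rate = k * (k - 1) / (2 * N)          # coalescence rate with k lineages
        times.append(rng.expovariate(rate))
    return times

def poisson(lam, rng):
    """Poisson draw obtained by counting unit-rate arrivals in [0, lam]."""
    count, t = 0, rng.expovariate(1.0)
    while t < lam:
        count += 1
        t += rng.expovariate(1.0)
    return count

def simulate_number_of_mutations(n, N, mu, rng):
    """Total number of mutations on the tree: k branches of length T_k each."""
    total = 0
    for k, t in zip(range(n, 1, -1), simulate_coalescent_times(n, N, rng)):
        total += poisson(mu * k * t, rng)
    return total

rng = random.Random(1)
draws = [simulate_number_of_mutations(5, 1000, 5e-4, rng) for _ in range(20000)]
mean_mutations = sum(draws) / len(draws)
```

With these assumptions the expected number of mutations is $2N\mu \sum_{i=1}^{n-1} 1/i$, so `mean_mutations` should be close to 25/12 for the values used here.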

Two different ways to use coalescent theory
• Exploratory approaches & simulation tests
- The coalescent allows efficient simulation of genetic variability under various demo-genetic models (sample vs. population)
- Specify the model and parameter values → coalescent process → simulated data sets
• Inferential approach
- The coalescent allows inference of population-level evolutionary parameters (genetic, demographic, reproductive, ...); some of these methods use all the information contained in the genetic data (likelihood-based methods)
- A real data set → coalescent process → infer the model parameters


Likelihood-based inference under the coalescent
• Inferential approaches are based on the modeling of population genetic processes. Each population genetic model is characterized by a set of demographic and genetic parameters $P$
• The aim is to infer those parameters from a polymorphism data set (i.e. a genetic sample)
• The genetic sample is then considered as the realization ("output") of a stochastic process defined by the demo-genetic model

• First, compute or estimate the likelihood $L(P^*; D)$, i.e. the probability $P(D; P^*)$ of observing the data $D$ for some parameter values $P^*$
• Second, infer the likelihood surface over all parameter values, find the set of parameter values that maximizes it, and compute confidence intervals (maximum likelihood method); or compute posterior distributions and compare them with priors (Bayesian approach)


Likelihood computations under the coalescent
• Problem: most of the time, the likelihood $P(D; P^*)$ of a genetic sample cannot be computed because there is no explicit mathematical expression
• However, the probability $P(D; P^* \mid G_k)$ of observing the data $D$ given a specific genealogy $G_k$ can be computed for some parameter values $P^*$
• We then sum the genealogy-specific likelihoods over the whole genealogical space, each weighted by the probability of the genealogy given the parameters:
$$L(P^*; D) = \int_G P(D; P^* \mid G)\, P(G; P^*)\, dG$$

• The likelihood can be written as the sum of $P(D; P^* \mid G_k)$ over the genealogical space (all possible genealogies):
$$L(P^*; D) = \int_G \underbrace{P(D; P^* \mid G)}_{\text{Mutation}}\; \underbrace{P(G; P^*)}_{\text{Demography (coalescent)}}\, dG$$
• Genealogies are missing data: they are needed to compute the likelihood, but there is no interest in estimating them → very different from phylogenetic approaches

... It is usually impossible to sum over all possible genealogies ...
→ Monte Carlo simulations are used: a large number $K$ of genealogies are simulated according to $P(G; P^*)$, and the mean over those simulations estimates the expectation of $P(D; P^* \mid G)$:
$$L(P^*; D) = E_{P(G; P^*)}\left[P(D; P^* \mid G)\right] \approx \frac{1}{K} \sum_{k=1}^{K} P(D; P^* \mid G_k)$$

Many, many genealogies are necessary for a good estimate of the likelihood...

Plain Monte Carlo simulation is often not very efficient, because too many genealogies give extremely low probabilities of observing the data. More efficient algorithms are used to explore the genealogical space and focus on genealogies well supported by the data.
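As a rough illustration of the plain Monte Carlo estimator, the sketch below treats the smallest possible case: two gene copies, for which the "genealogy" reduces to a single coalescence time, and data consisting of the number $s$ of mutations separating them. The setting (population of $N$ gene copies, every mutation visible) and the function names are assumptions made for the example; in this case the exact likelihood is $\theta^s/(1+\theta)^{s+1}$ with $\theta = 2N\mu$, which allows a check.

```python
import math
import random

def poisson_pmf(s, lam):
    """P(s mutations) on a tree of total mutation intensity lam."""
    return lam ** s * math.exp(-lam) / math.factorial(s)

def naive_mc_likelihood(s, N, mu, K, rng):
    """(1/K) * sum_k P(D | G_k), with each G_k drawn from P(G; N)."""
    total = 0.0
    for _ in range(K):
        t = rng.expovariate(1.0 / N)           # coalescence time of 2 lineages
        total += poisson_pmf(s, 2 * mu * t)     # two branches of length t
    return total / K

def exact_likelihood(s, N, mu):
    """Closed form available only because the sample has two genes."""
    theta = 2 * N * mu
    return theta ** s / (1 + theta) ** (s + 1)

estimate = naive_mc_likelihood(3, 1000, 5e-4, 100000, random.Random(42))
exact = exact_likelihood(3, 1000, 5e-4)
```

For larger samples no such closed form exists and most simulated genealogies contribute almost nothing to the sum, which is the inefficiency described above.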

Likelihood computations under the coalescent
• Two main approaches have been developed, using more efficient algorithms that explore genealogies in proportion to their probability of explaining the data, $P(D; P \mid G)$
MCMC: Markov chain Monte Carlo on the genealogical and parameter spaces, based on Felsenstein's pruning algorithm (1973, 1981)
Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17(6): 368-376.
IS: Importance Sampling on genealogies, based on the work of Griffiths & Tavaré (1994)
Griffiths, R.C. and S. Tavaré (1994). Simulating probability distributions in the coalescent. Theor. Pop. Biol. 46: 131-159.

MCMC: Felsenstein's pruning algorithm
- Easier to implement; can easily accommodate various models
- Implemented in many software packages (LAMARC, Batwing, MsVar, MIGRATE, IM)
IS: Griffiths & Tavaré's coalescent recursion
- Extension to different models may be difficult
- Implemented in fewer software packages (Genetree, Migraine)


The approach of Felsenstein et al.
• Based (1) on the availability of approximate exponential distributions for the time intervals between events (coalescence, migration and recombination) and (2) on the separation of demographic and mutational processes:
1. The probability of a genealogy given the parameters of the demographic model, $P(G_k; P^*_{demo})$, can be computed from the distributions of times between events.
2. The probability of the data given a genealogy and the mutational parameters, $P(D; P^*_{mut} \mid G_k)$, can be computed from the mutation model parameters, the mutation rate, the tree topology and the branch lengths.

• From this, an efficient algorithm to explore the genealogical and parameter spaces should allow inference of the likelihood over the two spaces → MCMC


MCMC with Metropolis-Hastings sampler
• Full conditional distributions cannot be computed, so classical MCMC samplers (e.g. Gibbs) cannot be used
→ Markov chain Monte Carlo (MCMC) simulations using the Metropolis-Hastings (MH) algorithm
- to explore the genealogy space ($G$)
- and the parameter space ($P = P_{demo} + P_{mut}$)
All algorithms based on the Felsenstein et al. approach use similar MH/MCMC algorithms, with slight differences in the MCMC update steps.

Metropolis-Hastings sampling for the coalescent
For the Metropolis-Hastings algorithm, we need the ratio of the probability of the proposed state to that of the current state:
1. Computation of $P(G_k; P_{demo})$:
$$P(G_k; P_{demo}) = \prod_{i=0}^{MRCA-1} \gamma(t_{i+1})\, e^{-\int_{t_i}^{t_{i+1}} \gamma(t)\, dt}$$
- Example for a stable WF population (coalescence only, time homogeneous):
$$P(G_k; P_{demo}) = \prod_{i=0}^{MRCA-1} \frac{k_{i+1}(k_{i+1}-1)}{2}\, e^{-(t_{i+1}-t_i)\, \frac{k_{i+1}(k_{i+1}-1)}{2}}$$

2. Then compute the probability $P(D; P_{mut} \mid G_k)$ of the data $D$ given the genealogy $G_k$, by going from the MRCA to the leaves and considering the probability of occurrence of all mutations on each branch of length $t_b$ and their effects (i.e. transitions among genetic states $x \to y$):

$$P(D; P_{mut} \mid G_k) = \prod_{b=1}^{\text{nb branch}} \underbrace{P(y \mid x, m_b)}_{\text{effect of mutations}} \cdot \underbrace{P(m_b \mid t_b)}_{\text{number of mutations}} = \prod_{b=1}^{2(n-1)} \left((Mat_{mut})^{m_b}\right)_{x,y}\, \frac{(\mu t_b)^{m_b} e^{-\mu t_b}}{m_b!}$$
where $Mat_{mut}$ is the mutation matrix (transition probabilities between genetic states $x$ and $y$) and the last factor is the Poisson probability of the $m_b$ mutations.

3. These probabilities are plugged into the MH formula for the acceptance probabilities of candidate changes to the next state of the Markov chain.
Reminder: $P(D; P \mid G_k) = P(D; P_{mut} \mid G_k)\, P(G_k; P_{demo})$
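The two ingredients of the MH ratio can be sketched as follows. This is a minimal illustration, assuming a stable population with coalescence only (times in units where a pair of lineages coalesces at rate 1) and a symmetric stepwise mutation model with steps of ±1; the function names are hypothetical.

```python
import math

def log_prob_genealogy_constant_size(event_times, n):
    """log P(G; P_demo): increasing coalescence times t_1 < ... < t_{n-1}, t_0 = 0."""
    logp, previous, k = 0.0, 0.0, n
    for t in event_times:
        rate = k * (k - 1) / 2                 # k lineages during (previous, t)
        logp += math.log(rate) - rate * (t - previous)
        previous, k = t, k - 1
    return logp

def prob_smm_shift(x, y, m):
    """P(y | x, m): allele size x becomes y after m steps of +1 or -1."""
    d = abs(y - x)
    if d > m or (m - d) % 2:
        return 0.0
    return math.comb(m, (m + d) // 2) / 2 ** m

def prob_branch(x, y, m, t, mu):
    """One branch factor: P(y | x, m) times the Poisson probability of m mutations."""
    poisson = (mu * t) ** m * math.exp(-mu * t) / math.factorial(m)
    return prob_smm_shift(x, y, m) * poisson
```

Multiplying `prob_branch` over all branches and adding `log_prob_genealogy_constant_size` on the log scale gives the quantity that enters the MH acceptance ratio in this simplified setting.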

• For each update, the new state ($P'$ or $G'$) is accepted or rejected according to the Metropolis-Hastings ratio
• The MH ratio is chosen so that the chain converges to the correct stationary distribution $P(D; P)$, e.g.
$$r_{MH} = \frac{P(D; P' \mid G)\, \mathrm{Prior}(P')}{P(D; P \mid G)\, \mathrm{Prior}(P)} \cdot \frac{P(P' \to P)}{P(P \to P')}$$
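A toy version of such an update is sketched below. It is not the MsVar scheme: it assumes two gene copies separated by $s$ mutations, keeps $\theta$ fixed (so no prior term appears), and updates only the genealogy, which here is the single coalescence time $T$ in units where coalescence occurs at rate 1 and mutations accumulate at rate $\theta T$. The multiplicative proposal is asymmetric, so the Hastings term $P(T' \to T)/P(T \to T') = T'/T$ is needed. All names are illustrative.

```python
import math
import random

def log_target(t, s, theta):
    """Unnormalised log density of T given s: exp(-t) times Poisson(s; theta * t)."""
    return -t + s * math.log(theta * t) - theta * t

def mh_coalescence_time(s, theta, n_iter, rng, step=1.0, t0=1.0):
    t, chain = t0, []
    for _ in range(n_iter):
        t_new = t * math.exp(step * (rng.random() - 0.5))    # multiplicative proposal
        log_r = (log_target(t_new, s, theta) - log_target(t, s, theta)
                 + math.log(t_new / t))                       # Hastings correction
        if rng.random() < math.exp(min(0.0, log_r)):          # accept with prob min(1, r)
            t = t_new
        chain.append(t)
    return chain

chain = mh_coalescence_time(s=3, theta=1.0, n_iter=60000, rng=random.Random(7))
kept = chain[5000:]                                           # discard burn-in
posterior_mean = sum(kept) / len(kept)
```

In this toy case the target is a Gamma distribution with shape $s+1$ and rate $1+\theta$, so `posterior_mean` can be compared with $(s+1)/(1+\theta) = 2$.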


Coalescent-based MCMC example: MsVar
• One example of a coalescent-based MCMC algorithm: MsVar
Beaumont, M. (1999). Detecting population expansion and decline using microsatellites. Genetics.
• Biological context: past changes in population sizes (cf. orang-utans)
- Details of the demographic and mutation models
- A few results on the orang-utan data set

• Demographic model: a single isolated panmictic (WF) population with an exponential past change in population size

[Figure: population contraction or expansion]

- 3 demographic parameters ($N$, $T$, $N_{anc}$) + 1 mutation parameter ($\mu$)
- 3 scaled parameters (diffusion approximation): $\theta$, $D$, $\theta_{anc}$

• Mutation model: Stepwise Mutation Model (SMM)

$P = (N, T, N_{anc}, \mu)$, $P_{scaled} = (\theta, D, \theta_{anc})$
• Aim: infer those parameters ($P$ or $P_{scaled}$) from a single present-day genetic sample using coalescent-based MCMC algorithms

MH/MCMC of MsVar
1. Initialization step: build a genealogy that is compatible with the data
→ Starting from the sample, choose a set of events depending on the starting values of the parameters; the events are also chosen to be compatible with the data
2. MCMC steps: explore the parameter and genealogical spaces
→ Update the population size parameters ($\theta_{act}$, $D$, $\theta_{anc}$), or
→ update the genealogy (sequence and times $T_i$ of coalescence and mutation events)
Both updates are made using the Metropolis-Hastings algorithm.

MCMC updates in MsVar
$T_i$ = times of coalescence and mutation events; $r = \theta_{act}/\theta_{anc}$ = population size ratio; $t_f = D$ = time of the population size change
M. Beaumont: “This scheme was devised by trial and error to obtain good rates of convergence.”

Analyses of MsVar results
• First check that the chains mixed and converged properly
→ Visual check (very useful):
- traces of likelihood / parameters
- autocorrelation
→ Compute convergence criteria among chains (Gelman-Rubin, ...): not always useful
→ Run different chains and check concordance between results
Problem: convergence is often poor with such coalescent-based MCMC algorithms... but simulation tests show that posterior distributions are generally correct (at least the mode as a point estimate) despite the lack of clear convergence indices.

• Bayesian method → compare posteriors (solid lines) and priors (dashed lines)... and test different priors

• Bayesian method → compute a Bayes factor to check for a contraction or expansion signal:
$$BF = \frac{\text{Posterior prob. model 1}}{\text{Posterior prob. model 2}} \cdot \frac{\text{Prior prob. model 2}}{\text{Prior prob. model 1}}$$
• With equal priors for models 1 and 2, the Bayes factor for a contraction is thus
$$BF = \frac{\text{Posterior } P(N_{anc}/N_{act} > 1)}{\text{Posterior } P(N_{anc}/N_{act} < 1)} = \frac{\#\,\text{MCMC steps where } N_{anc}/N_{act} > 1}{\#\,\text{MCMC steps where } N_{anc}/N_{act} < 1}$$
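With equal prior weights, the Bayes factor for a contraction is a simple count over the retained MCMC steps. A minimal sketch, assuming the sampled values of $N_{anc}/N_{act}$ are available as a plain list (the function name and the input format are illustrative and do not describe MsVar's output files):

```python
def bayes_factor_contraction(ratios):
    """ratios: sampled N_anc / N_act values, one per retained MCMC step."""
    n_contraction = sum(1 for r in ratios if r > 1)
    n_expansion = sum(1 for r in ratios if r < 1)
    if n_expansion == 0:
        return float("inf")        # no step supports an expansion
    return n_contraction / n_expansion

example_bf = bayes_factor_contraction([2.0, 3.0, 0.5, 4.0])
```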

An application of MsVar: orang-utans and the deforestation of Borneo
Does the genome of orang-utans carry the signature of population bottlenecks? (Goossens et al. 2006, PLoS Biology)

Population sizes have collapsed: what is the cause? Can population genetics help?
(Delgado & Van Schaik, 2001, Evol. Anthropology)

• The data

• MsVar results
→ MsVar efficiently detects a past decrease in population size

Legend: FE = beginning of massive forest exploitation; F = first farmers; HG = first hunter-gatherers
→ ... and allows the beginning of the decrease to be dated: massive forest exploitation seems to be the most likely cause

Conclusions about MsVar / MCMC approaches
• Coalescent theory provides a powerful framework for statistical inference
→ It makes it possible to infer past history from a single present-day sample! (This was impossible with moment-based methods.)
• Gene genealogies are missing data (but important...)
→ MCMCs with coalescent simulations are “difficult” (to run)
• But how robust are they to model assumptions?
- Mutational processes (e.g. large mutation steps → long branches)
- Population structure (e.g. immigrants → long branches)


Likelihood computations under the coalescent
• More efficient algorithms that allow better exploration of the genealogies (i.e. in proportion to $P(D; P \mid G)$)
MCMC: Felsenstein's pruning algorithm
- Easier to implement; can accommodate various models
- Implemented in many software packages (LAMARC, Batwing, MsVar, MIGRATE, IM)
IS: Griffiths & Tavaré's coalescent recursion (cf. Ewens' recursion)
- Extension to different models may be difficult
- Implemented in fewer software packages (Genetree, Migraine)

The approach of Griffiths et al.
• The coalescent-based likelihood at a given point of the parameter space is an integral over all possible histories (genealogies with mutations) leading to the present genetic sample
• A Monte Carlo scheme is used to compute this integral
• Histories are built backward in time, event by event, starting from the present sample
• But computing exact backward transition probabilities is often too difficult → an IS scheme is used to compute the likelihoods by simulation


Recursions for sampling distributions
Ewens 1972: Wright-Fisher population, infinite-allele model. General recursion at stationarity [with $e_j = (0, \cdots, 0, 1_{j\text{th}}, 0, \cdots, 0)$]:
$$P_n(a) = \frac{\theta}{n-1+\theta}\, P_{n-1}(a - e_1) + \frac{n-1}{n-1+\theta} \sum_{a_{j+1}>0} \frac{j(a_j+1)}{n-1}\, P_{n-1}(a + e_j - e_{j+1})$$
where, given that a coalescence occurs and that the descendant sample has $(\cdots, a_j, a_{j+1}, \cdots)$, the ancestral one has $(\cdots, a_j+1, a_{j+1}-1, \cdots)$, and the probability that one of the $a_j+1$ alleles with $j$ gene copies is chosen to duplicate is $j(a_j+1)/(n-1)$.
Reminder: the sample configuration (vector $a$) is ordered by allele counts (the allele frequency spectrum): $a_j$ is the number of alleles with $j$ copies in the sample.
Griffiths and Tavaré: recursion for mutation models defined by a matrix $(p_{ij})$ of mutation rates from $i$ to $j$.
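The Ewens recursion is one of the rare cases that can be solved exactly, which makes it a convenient check. The sketch below evaluates the recursion directly and compares it with the closed-form Ewens sampling formula $P_n(a) = \frac{n!}{\theta(\theta+1)\cdots(\theta+n-1)} \prod_j \frac{(\theta/j)^{a_j}}{a_j!}$. It assumes that `a` is a tuple of length exactly $n$ and starts from $P_1 = 1$; the function names are illustrative.

```python
from functools import lru_cache
from math import factorial, isclose

@lru_cache(maxsize=None)
def ewens_recursion(a, theta):
    """P_n(a) from the recursion; a = (a_1, ..., a_n) with sum_j j * a_j = n."""
    n = sum(j * aj for j, aj in enumerate(a, start=1))
    if n == 1:
        return 1.0
    total = 0.0
    if a[0] > 0:                                   # last event: mutation (a - e_1)
        ancestral = (a[0] - 1,) + a[1:n - 1]
        total += theta / (n - 1 + theta) * ewens_recursion(ancestral, theta)
    for j in range(1, n):                          # last event: coalescence
        if a[j] > 0:                               # a_{j+1} > 0 (0-based index j)
            anc = list(a)
            anc[j - 1] += 1                        # a + e_j
            anc[j] -= 1                            #   - e_{j+1}
            choose = j * anc[j - 1] / (n - 1)      # j (a_j + 1) / (n - 1)
            total += ((n - 1) / (n - 1 + theta) * choose
                      * ewens_recursion(tuple(anc[:n - 1]), theta))
    return total

def ewens_formula(a, theta):
    """Closed-form Ewens sampling formula, used here only as a check."""
    n = sum(j * aj for j, aj in enumerate(a, start=1))
    rising = 1.0
    for k in range(n):
        rising *= theta + k
    prob = factorial(n) / rising
    for j, aj in enumerate(a, start=1):
        prob *= (theta / j) ** aj / factorial(aj)
    return prob

checks = [isclose(ewens_recursion(a, 1.5), ewens_formula(a, 1.5), rel_tol=1e-12)
          for a in [(1, 1, 0), (0, 0, 1), (2, 1, 0, 0), (1, 0, 0, 1, 0)]]
```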

The recursion of Griffiths et al.
• The coalescent-based likelihood at a given point of the parameter space is an integral over all possible histories (genealogies with mutations) $H = \{H_k;\ k = 0, \ldots, \tau\}$, corresponding to all coalescence or mutation events that occurred from $H_0$, the current sample state, to $H_\tau$, the allelic state of the most recent common ancestor (MRCA) of the sample.

• Then, for any given state $H_k$ of the history (cf. Ewens):
$$p(H_k) = \sum_{\{H_{k'}\}} p(H_k \mid H_{k'})\, p(H_{k'})$$
where $H_{k'}$ is the ancestral sample state (i.e. the state before the last event) and $p(H_k \mid H_{k'})$ are the forward transition probabilities (i.e. from the ancestral to the current state).

• Griffiths & Tavaré 1994: example for a single population
$$p(H_k = \eta) = \frac{1}{\frac{n(n-1)}{2N} + n\mu} \left[ n\mu \sum_i \sum_{j: n_j>0,\, j\neq i} \frac{n_i+1}{n}\, p_{ij}\, p(H_{k'} = \eta - e_j + e_i) + \frac{n(n-1)}{2N} \sum_{j: n_j>1} \frac{n_j-1}{n-1}\, p(H_{k'} = \eta - e_j) \right]$$
- Setting $\theta = 2N\mu$ and $\beta = n(n-1+\theta)$, we have
$$p(H_k = \eta) = \frac{1}{\beta}\left[ \theta \sum_i \sum_{j: n_j>0,\, j\neq i} (n_i+1)\, p_{ij}\, p(H_{k'} = \eta - e_j + e_i) + n \sum_{j: n_j>1} (n_j-1)\, p(H_{k'} = \eta - e_j) \right]$$

• Such recursions are too difficult to solve except for very simple models (WF + IAM, cf. Ewens)
→ Griffiths & Tavaré (1994) proposed a Monte Carlo approach, using sequential importance sampling on past histories, to solve the recursion.

Inference of the likelihood by simulation
• Griffiths & Tavaré 1994: the recursion can equivalently be written as
$$p(H_k) = w_{GT}(H_k) \left( \sum_{i,j: n_j>0,\, j\neq i} M_{ij}(H_k)\, p(H_k - e_j + e_i) + \sum_{j: n_j>1} C_j(H_k)\, p(H_k - e_j) \right)$$

• Griffiths & Tavaré 1994: backward absorbing Markov chain based on forward transition probabilities
→ Histories are built backward, event by event, using an absorbing Markov chain (absorbing state = MRCA) based on forward transition probabilities (“uniform sampling” among all possible events, based on $M_{ij}(H_k)$ and $C_j(H_k)$). $w_{GT}(H_k)$ is the weight associated with the IS proposal.

• Expanding the recursion $p(H_k) = \sum_{\{H_{k'}\}} p(H_k \mid H_{k'})\, p(H_{k'})$ over all possible ancestral histories of the current sample leads to
$$p(H_0) = E\left[p(H_0 \mid H_1) \cdots p(H_{\tau-1} \mid H_\tau)\, p(H_\tau)\right]$$
Then
$$L(P; D) = p(H_0) = \int_H W_{GT}(H)\, f_{GT}(H)\, dH \approx \frac{1}{L} \sum_{h=1}^{L} W_{GT}(H_h) = \frac{1}{L} \sum_{h=1}^{L} \prod_{k=0}^{\tau} w_{GT}((H_h)_k)$$
The IS scheme $f_{GT}(H)$ is not very efficient because it does not take into account that some backward transitions are more likely than others given the current state (example: SMM mutation).
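The Griffiths & Tavaré scheme can be sketched on the simplest mutation matrix: two alleles with $p_{12} = p_{21} = 1$. Each backward event is drawn in proportion to its coefficient in the recursion (written with $\theta$ and $\beta = n(n-1+\theta)$), and the weight $w_{GT}(H_k)$ is the sum of those coefficients divided by $\beta$. The two-allele model, the function names and the check against the beta-binomial probability (which holds for this symmetric model) are choices made for the example, not part of the original slides.

```python
import random
from math import comb

def gt_history_weight(counts, theta, rng):
    """Weight of one backward history for the symmetric 2-allele model."""
    counts = list(counts)
    n, weight = sum(counts), 1.0
    while n > 1:
        beta = n * (n - 1 + theta)
        events, coeffs = [], []
        for j in (0, 1):
            i = 1 - j
            if counts[j] > 0:                  # mutation i -> j: ancestor is eta - e_j + e_i
                events.append(("mut", j))
                coeffs.append(theta * (counts[i] + 1))
            if counts[j] > 1:                  # coalescence of two type-j genes: eta - e_j
                events.append(("coa", j))
                coeffs.append(n * (counts[j] - 1))
        kind, j = rng.choices(events, weights=coeffs)[0]
        weight *= sum(coeffs) / beta           # w_GT(H_k)
        counts[j] -= 1
        if kind == "mut":
            counts[1 - j] += 1
        else:
            n -= 1
    return weight * 0.5                        # stationary probability of the MRCA type

def rising(x, m):
    out = 1.0
    for k in range(m):
        out *= x + k
    return out

def exact_two_allele(counts, theta):
    """Beta-binomial sampling probability for the symmetric 2-allele model."""
    n1, n2 = counts
    return comb(n1 + n2, n1) * rising(theta, n1) * rising(theta, n2) / rising(2 * theta, n1 + n2)

rng = random.Random(3)
L = 20000
gt_estimate = sum(gt_history_weight((3, 2), 0.5, rng) for _ in range(L)) / L
exact_value = exact_two_allele((3, 2), 0.5)
```

The weights vary from one history to the next, so the estimate only converges as $L$ grows; this variability is what a better proposal aims to reduce.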


Towards a better IS scheme (Stephens & Donnelly 2000, de Iorio & Griffiths 2004)
→ A better Importance Sampling (IS) scheme should be used. Let $Q(H_{k'})$ be a given distribution; then
$$p(H_k) = \sum_{\{H_{k'}\}} \frac{p(H_k \mid H_{k'})}{Q(H_{k'})}\, Q(H_{k'})\, p(H_{k'})$$
and, for the whole history,
$$p(H_0) = E_Q\left[\frac{p(H_0 \mid H_1)}{Q(H_1)} \cdots \frac{p(H_{\tau-1} \mid H_\tau)}{Q(H_\tau)}\, p(H_\tau)\right]$$
where $E_Q$ is the expectation over the distribution of full histories induced by $Q$. This means that $Q$ may be viewed as a proposal distribution in a sequential IS algorithm, with matching weights $p(H_k \mid H_{k'})/Q(H_{k'})$.

• The problem is then to find the proposal distribution that minimizes the variance of the likelihood estimates $\frac{1}{L} \sum_{h=1}^{L} \prod_{k=0}^{\tau} w((H_h)_k)$, where $w$ denotes the sequential IS weights.

• The ideal proposal is the backward transition probability $p(H_{k'} \mid H_k)$, because the IS weights are then
$$\frac{p(H_k \mid H_{k'})}{Q(H_{k'})} = \frac{p(H_k \mid H_{k'})}{p(H_{k'} \mid H_k)} = \frac{p(H_k)}{p(H_{k'})}$$
and thus their product (times $p(H_\tau)$) is always the sample likelihood $p(H_0)$.
→ A single tree reconstruction allows exact likelihood computation (zero variance).
• However, backward transition probabilities $p(H_{k'} \mid H_k)$ are generally unknown.
Aim: find good approximations $\hat{p}(H_{k'} \mid H_k)$ of $p(H_{k'} \mid H_k)$
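The zero-variance property can be made explicit with a short telescoping argument. Writing $H_{k+1}$ for the state ancestral to $H_k$ and using Bayes' rule, $p(H_{k+1} \mid H_k) = p(H_k \mid H_{k+1})\, p(H_{k+1}) / p(H_k)$, the weight of any complete history sampled from the exact backward transitions is:

```latex
\begin{align*}
\left[\prod_{k=0}^{\tau-1} \frac{p(H_k \mid H_{k+1})}{p(H_{k+1} \mid H_k)}\right] p(H_\tau)
  &= \left[\prod_{k=0}^{\tau-1} \frac{p(H_k)}{p(H_{k+1})}\right] p(H_\tau) \\
  &= \frac{p(H_0)}{p(H_\tau)}\, p(H_\tau) \\
  &= p(H_0).
\end{align*}
```

Every history therefore returns exactly $p(H_0)$, whatever sequence of events was drawn.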

Towards a better IS scheme: summary
• The likelihood at a given point is an integral over all possible histories $H = \{H_k;\ k = 0, \ldots, \tau\}$.
• Markov coalescent process → $p(H_k) = \sum p(H_k \mid H_{k'})\, p(H_{k'})$ and $p(H_0) = E[p(H_0 \mid H_1) \cdots p(H_{\tau-1} \mid H_\tau)\, p(H_\tau)]$.
• However, forward transition probabilities $p(H_k \mid H_{k'})$ are not efficient in a backward process.
• Importance sampling techniques based on an approximation $\hat p(H_{k'} \mid H_k)$ of $p(H_{k'} \mid H_k)$ are used to build more likely histories:
$$p(H_0) = E_{\hat p}\left[\frac{p(H_0 \mid H_1)}{\hat p(H_1 \mid H_0)} \cdots \frac{p(H_{\tau-1} \mid H_\tau)}{\hat p(H_\tau \mid H_{\tau-1})}\, p(H_\tau)\right]$$

Linking optimal weights to addition of a gene to a sample
Represent the sample probability $p(\mathbf{n})$ as an integral over the joint distribution $f(x)$ of allele frequencies in the population:
$$p(\mathbf{n}) = \int_x p(\mathbf{n} \mid x)\, f(x)\, dx = E_f\left[\binom{n}{\mathbf{n}} \prod_i X_i^{n_i}\right]$$
where $\binom{n}{\mathbf{n}} = \frac{n!}{\prod_i n_i!}$ is the multinomial coefficient.

Then the joint probability that we have a sample $\mathbf n$ and that an additional gene copy is of type $j$ is
$$E_f\left[X_j \binom{n}{\mathbf n} \prod_i X_i^{n_i}\right] = \frac{n_j+1}{n+1}\, p(\mathbf n + e_j)$$

We write this joint probability as $p(\mathbf n)$ times $\pi(j \mid \mathbf n)$, where $\pi$ is thus the probability that an additional gene is of type $j$, given that we have already drawn the sample $\mathbf n$ from the population. Thus, if $H_k$ and $H_{k'}$ differ by the addition of one gene of type $j$, we can write the optimal IS weight as
$$\frac{p(H_{k'} = \mathbf n)}{p(H_k = \mathbf n + e_j)} = \frac{n_j+1}{n+1}\, \frac{1}{\pi(j \mid \mathbf n)}$$

Towards a better IS scheme: the π's
• Let $\pi(\cdot \mid H_k)$ be the conditional distribution of the allelic type of an $(n+1)$th gene, given $H_k$, the configuration (i.e. allelic types) of the first $n$ genes of the sample.
• Then the optimal IS distribution (exact backward transition probabilities) is, for a single population:
$$p(H_{k'} \mid H_k) = \begin{cases} \dfrac{1}{\beta}\, \theta\, n_j\, P_{ij}\, \dfrac{\pi(i \mid H_k - e_j)}{\pi(j \mid H_k - e_j)} & \text{for } H_{k'} = H_k - e_j + e_i \\[2ex] \dfrac{1}{\beta}\, \dfrac{n_j (n_j - 1)}{\pi(j \mid H_k - e_j)} & \text{for } H_{k'} = H_k - e_j \end{cases}$$

Towards a better IS scheme: the π̂'s
• Unfortunately, the π's are generally unknown
→ Stephens & Donnelly (2000) proposed a good approximation $\hat\pi$ of the π's for a single WF population.
→ de Iorio & Griffiths (2004) proposed a general method for approximating the π's under different mutational and demographic models.
• Approximate backward transition probabilities based on the $\hat\pi$'s are then used:
$$\hat p(H_{k'} \mid H_k) = \begin{cases} \dfrac{1}{\beta}\, \theta\, n_j\, P_{ij}\, \dfrac{\hat\pi(i \mid H_k - e_j)}{\hat\pi(j \mid H_k - e_j)} & \text{for } H_{k'} = H_k - e_j + e_i \\[2ex] \dfrac{1}{\beta}\, \dfrac{n_j (n_j - 1)}{\hat\pi(j \mid H_k - e_j)} & \text{for } H_{k'} = H_k - e_j \end{cases}$$
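The sketch below builds histories with these backward proposals in a case where π is known exactly, so that the zero-variance property can be observed. It assumes parent-independent mutation ($P_{ij} = p_j$ for all $i$, self-mutations $i = j$ included), for which $\pi(j \mid \mathbf n) = (n_j + \theta p_j)/(n + \theta)$ and the sample probability is Dirichlet-multinomial. The model choice and all function names are assumptions made for the illustration; with an approximate $\hat\pi$ the same loop would return varying weights whose average estimates the likelihood.

```python
import random
from math import factorial, isclose

def pi_pim(i, counts, theta, p):
    """Exact pi(i | counts) under parent-independent mutation."""
    return (counts[i] + theta * p[i]) / (sum(counts) + theta)

def sis_history_weight(counts, theta, p, rng):
    """IS weight of one backward history drawn from the pi-based proposal."""
    counts = list(counts)
    n, K, weight = sum(counts), len(counts), 1.0
    while n > 1:
        beta = n * (n - 1 + theta)
        events = []                                # (proposal prob, forward prob, j, i)
        for j in range(K):
            if counts[j] == 0:
                continue
            reduced = counts[:j] + [counts[j] - 1] + counts[j + 1:]   # H_k - e_j
            pi_j = pi_pim(j, reduced, theta, p)
            for i in range(K):                     # mutation i -> j, undone backward
                q = theta * counts[j] * p[j] * pi_pim(i, reduced, theta, p) / (pi_j * beta)
                fwd = theta / (n - 1 + theta) * (reduced[i] + 1) / n * p[j]
                events.append((q, fwd, j, i))
            if counts[j] > 1:                      # coalescence of two type-j genes
                q = counts[j] * (counts[j] - 1) / (pi_j * beta)
                fwd = (counts[j] - 1) / (n - 1 + theta)
                events.append((q, fwd, j, None))
        q_total = sum(e[0] for e in events)        # equals 1 when pi is exact
        q, fwd, j, i = rng.choices(events, weights=[e[0] for e in events])[0]
        weight *= fwd / (q / q_total)              # p(H_k | H_k') / proposal
        counts[j] -= 1
        if i is None:
            n -= 1
        else:
            counts[i] += 1
    return weight * p[counts.index(1)]             # stationary probability of the MRCA

def rising(x, m):
    out = 1.0
    for k in range(m):
        out *= x + k
    return out

def exact_pim(counts, theta, p):
    """Dirichlet-multinomial sample probability, used only as a check."""
    prob = factorial(sum(counts)) / rising(theta, sum(counts))
    for c, pj in zip(counts, p):
        prob *= rising(theta * pj, c) / factorial(c)
    return prob

rng = random.Random(11)
p = [0.5, 0.3, 0.2]
weights = [sis_history_weight([3, 1, 0], 1.2, p, rng) for _ in range(20)]
target = exact_pim([3, 1, 0], 1.2, p)
```

Every entry of `weights` equals `target` up to rounding: a single history is enough when the proposal is the exact backward transition probability.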


The backward equation for $f(X_t \mid X_0 = x)$
For a diffusion process, the probability density $f$ of allele frequencies satisfies the Kolmogorov backward equation, which describes the changes of $f$ over time in the form
$$\frac{d f(X_t \mid X_0 = x)}{dt} = \Phi(f(x)),$$
where $\Phi$ is a differential operator that here takes the form
$$\Phi = \frac{1}{2} \sum_{i \in E} \sum_{j \in E} x_i(\delta_{ij} - x_j) \frac{\partial^2}{\partial x_i \partial x_j} + \sum_{j \in E} \left(\sum_{i \in E} x_i r_{ij}\right) \frac{\partial}{\partial x_j} = \sum_{j \in E} \Phi_j \frac{\partial}{\partial x_j}$$
with
$$R = \{r_{ij}\} \equiv \frac{\theta}{2}(P - I),$$
where $P = \{p_{ij}\}$ is the mutation matrix and $I$ the identity matrix.


The backward equation for E[g(X_t) | X_0 = x]

In the same way as
\[
\frac{d f(X_t\mid X_0 = x)}{dt} = \Phi(f(x)),
\]
the following "generator equation" (Karlin and Taylor, 1981, p. 215) holds for any function $g(x)$ with bounded second derivatives:
\[
\lim_{t\to 0}\frac{E[g(X_t)\mid X_0 = x] - g(x)}{t} = \Phi(g(x)).
\]
We will apply this result with $g$ the sample probability given the population allele frequencies $x$.


π̂'s computation

To obtain a recursion on the sample probabilities $p(\mathbf{n})$, with $\mathbf{n} = H_0$, we write $p(\mathbf{n})$ in the form $E[g(x)]$:
\[
p(\mathbf{n}) = E\Big[\binom{n}{\mathbf{n}}\prod_i X_i^{n_i}\Big], \quad\text{where}\quad \binom{n}{\mathbf{n}} = \frac{n!}{\prod_i n_i!}.
\]
We therefore have
\[
\frac{d\,p(\mathbf{n})}{dt} = \Phi[p(\mathbf{n})].
\]
At stationary equilibrium, $d\,p(\mathbf{n})/dt$ is zero. Expanding the expression for $\Phi[p(\mathbf{n})]$ then recovers the recursion among the $p(\mathbf{n})$.


Explicit recursions in terms of π

Applying the previous arguments then leads to a relation between probabilities of samples that differ by one coalescence or mutation event:
\[
N\Big(n\Big(\frac{n-1}{N}+\mu\Big)\Big)\, p(\mathbf{n}) = N\sum_j n\,\frac{n_j-1}{N}\, p(\mathbf{n}-e_j) + N\mu\sum_j\sum_i P_{ij}\,(n_i+1-\delta_{ij})\, p(\mathbf{n}-e_j+e_i).
\]
Expressing all $p(\cdot)$ in terms of the $p(\mathbf{n}-e_j)$ for distinct $j$:
\[
N\sum_j\Big(\frac{n-1}{N}+\mu\Big)\,\pi(j\mid\mathbf{n}-e_j)\, n\, p(\mathbf{n}-e_j) = N\sum_j n\,\frac{n_j-1}{N}\, p(\mathbf{n}-e_j) + N\mu\sum_j\sum_i P_{ij}\, n\,\pi(i\mid\mathbf{n}-e_j)\, p(\mathbf{n}-e_j)
\]
... a huge system of linear equations, not easier to solve in this form.


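π̂'s computation

Note that $\Phi[p(\mathbf{n})]$ can be written as
\[
\sum_{j\in E}\Phi_j\,\frac{\partial}{\partial x_j}[p(\mathbf{n})].
\]
The approximation technique developed by de Iorio & Griffiths is to approximate the π's (derived previously from $p(\mathbf{n})$, solution of $\Phi[p(\mathbf{n})] = 0$) by π̂'s derived from the solutions of
\[
E\Big[\Phi_j\,\frac{\partial p(\mathbf{n})}{\partial x_j}\Big] = E\Big[\Phi_j\,\frac{\partial}{\partial x_j}\binom{n}{\mathbf{n}}\prod_i x_i^{n_i}\Big] = 0 \quad\text{for each } j\in E.
\]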


π̂'s computation

Reminder: $\pi(j\mid\mathbf{n})$ can be expressed in terms of $p(\mathbf{n})$ and $p(\mathbf{n}+e_j)$:
\[
\pi(j\mid\mathbf{n})\,p(\mathbf{n}) = \frac{n_j+1}{n+1}\,p(\mathbf{n}+e_j).
\]
If we assume that this relation also holds for the π̂ and $\hat p$, which will generally not be the case, we have
\[
\hat\pi(j\mid\mathbf{n})\,\hat p(\mathbf{n}) = \frac{n_j+1}{n+1}\,\hat p(\mathbf{n}+e_j).
\]


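π̂'s computation

Approximate the $p(\mathbf{n})$, solutions of $\Phi[p(\mathbf{n})] = 0$, by the $\hat p(\mathbf{n})$, solutions of
\[
E\Big[\Phi_j\,\frac{\partial}{\partial x_j}\binom{n}{\mathbf{n}}\prod_i x_i^{n_i}\Big] = 0 \quad\text{for each } j\in E,
\]
which gives, for a panmictic population, for each $j\in E$:
\[
n_j(n-1+\theta)\,\hat p(\mathbf{n}) = n(n_j-1)\,\hat p(\mathbf{n}-e_j) + \sum_{i\in E}\theta P_{ij}\,(n_i+1-\delta_{ij})\,\hat p(\mathbf{n}-e_j+e_i).
\]
Using $\hat\pi(j\mid\mathbf{n})\,\hat p(\mathbf{n}) = \frac{n_j+1}{n+1}\,\hat p(\mathbf{n}+e_j)$ and replacing $\mathbf{n}$ by $\mathbf{n}+e_j$ (so that the sample size $n$ becomes $n+1$), we obtain for each $j\in E$:
\[
(n+\theta)\,\hat\pi(j\mid\mathbf{n}) = n_j + \sum_{i\in E}\theta P_{ij}\,\hat\pi(i\mid\mathbf{n}).
\]
This is the linear system that allows the computation of the $\hat\pi(j\mid\mathbf{n})$ for a Wright-Fisher model.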
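The linear system for the π̂(j|n) can be solved numerically in a few lines. The sketch below is only an illustration, under the convention that n is the size of the sample already drawn, so that the coefficient on the left-hand side is n + θ and the solution sums to one. It uses a plain fixed-point iteration rather than a direct linear solve; the function name `solve_pi_hat`, the bounded stepwise mutation matrix and all numerical values are assumptions made for the example.

```python
def solve_pi_hat(counts, theta, P, tol=1e-12, max_iter=10_000):
    """Solve (n + theta) * pi[j] = n_j + theta * sum_i P[i][j] * pi[i]
    by fixed-point iteration (a contraction when n >= 1 and P is row-stochastic)."""
    n, k = sum(counts), len(counts)
    pi = [1.0 / k] * k                       # any probability vector works as a start
    for _ in range(max_iter):
        new = [
            (counts[j] + theta * sum(P[i][j] * pi[i] for i in range(k))) / (n + theta)
            for j in range(k)
        ]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    raise RuntimeError("fixed-point iteration did not converge")

# Illustrative stepwise mutation matrix on 4 allelic states with reflecting ends.
P_smm = [
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.5, 0.5],
]
pi_smm = solve_pi_hat([0, 3, 1, 0], theta=2.0, P=P_smm)
print([round(x, 4) for x in pi_smm])

# Parent-independent case: the solution reduces to (n_j + theta * P_j) / (n + theta).
P_pim = [[0.2, 0.3, 0.5] for _ in range(3)]
pi_pim = solve_pi_hat([3, 0, 2], theta=1.5, P=P_pim)
```

Alleles absent from the sample still receive a positive π̂ through the mutation term, which is what lets the proposal reach unobserved allelic states.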

New IS scheme with the π̂'s

[To do: make two slides summarizing the new IS scheme, reusing parts of slides 63, 64, 65, 66, 70 and 71.]


A much better IS scheme based on the π̂'s

• Drastic gain in efficiency with this new IS scheme (old IS: millions of trees)
→ exact backward transition probabilities for a WF model with parent-independent mutation (i.e. KAM)
→ only 30 histories necessary for a good estimation of the likelihood for more complex models (structured populations & KAM)

• but efficiency slightly decreases with non-parent-independent mutation models, e.g. the stepwise mutation model (200 histories for structured populations & SMM)

• and still limited efficiency for time-inhomogeneous demographic models, e.g. one population with past size change (cf. orang-utan example)
→ up to 20,000 histories necessary for strong disequilibrium scenarios (e.g. quick change in population size)


Implementations of IS: Genetree and Migraine

• Genetree (Bahlo & Griffiths 2000, old IS algorithm)
- 2 to 4 populations with migration (ISM)

• Migraine (Rousset & Leblois 2007-2014, new IS algorithms)
- One single stable population (KAM, SMM, GSM, ISM)
- One population with past size variation (KAM, SMM, GSM, ISM)
- 2 populations with migration (KAM, SMM, ISM)
- Isolation by distance in 1D and 2D (KAM)

Implementation of IS in Migraine

1. C++ core IS computations
• Stratified random sampling of parameter points
• Estimation of the likelihood at each point using IS

2. R code for "post-treatment"
• Likelihood surface interpolation by Kriging
• Inference of MLEs and CIs
• Plots of 1D and 2D likelihood profiles


Simulation tests

Can we trust the demographic / historical inferences made with those methods?

Aim: assess the validity and robustness of the method
• Bias, RMSE, coverage properties of confidence intervals
• Robustness to realistic but "uninteresting" mis-specifications

→ To this aim, we tested by simulation:
- The performance of Migraine to infer dispersal under IBD
- The performance of MsVar and Migraine to detect and measure past population size changes

A few interesting results...


Simulation tests (MsVar, Girod et al. 2011)

Strong correlations between some pairs of "natural" parameters, but this is expected given coalescent theory...

Simulation tests (MsVar, Girod et al. 2011)

There is no information in the genetic data to infer µ, N and T separately, because coalescent histories (H, genealogies with mutations) generated with the usual diffusion/coalescent approximations (large N, small µ) depend only on the scaled parameters Nµ and T/N.

Constant Nµ product → same unscaled history and same polymorphism

Two indistinguishable situations under the coalescent approximations!
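A tiny worked example makes the non-identifiability concrete. The helper below is hypothetical and the parameter values are invented for illustration: two demographic settings that differ tenfold in N, µ and T share the same scaled parameters Nµ and T/N, so under the coalescent approximations they cannot be told apart from genetic data.

```python
from math import isclose

def scaled_parameters(N, mu, T):
    """Return the scaled parameters (N * mu, T / N) named on the slide."""
    return (N * mu, T / N)

# Two illustrative settings: small population with fast mutation,
# large population with slow mutation and an older event.
setting_a = scaled_parameters(N=1_000, mu=1e-3, T=500)
setting_b = scaled_parameters(N=10_000, mu=1e-4, T=5_000)
print(setting_a, setting_b)
```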

Simulation tests (MsVar, Girod et al. 2011)

Much better results by rescaling parameters as in the coalescent approximations


Simulation tests (Migraine)

[Figure: relative bias and relative RMSE of the estimates of 2Nµ, D and 2Nancµ, plotted against D. Values of D on the x-axis: 0.025, 0.0625, 0.125, 0.25, 0.5, 1.25, 2.5, 3.5, 5, 7.5. BDR: 0.76, 0.98, 1, 1, 1, 1, 1, 0.98, 0.79, 0.5. FEDR: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.005.]

Good reliability of the estimates for population declines, provided they are neither too recent nor too weak...

Why does the method's performance depend so strongly on the time of the event and on its intensity?

Simulation tests (MsVar & Migraine)

• How are genealogies affected by demographic parameters?

→ "Predict" the quantity of information present in the data. The information in the data strongly depends on the number of mutations and coalescent events during the different demographic phases.



Simulation tests (Migraine)

Beyond biases, RMSE and bottleneck detection rates...
Testing CI coverage properties using LRT P-value distributions

[Figure: ECDFs of LRT P-values, one panel per parameter, both axes running from 0 to 1. Panel titles: 2Nmu = 0.4, D = 1.25, 2Nancmu = 400, Nratio = 0.001. Values printed on the panels: "Rel. bias, rel. RMSE" pairs 0.152, 0.601; 0.116, 0.453; 0.0496, 0.375; −0.00452, 0.14. KS values: 0.165, 0.203, 0.433, 0.857. DR: 1 (0).]

(usually) GOOD

Extremely recent and strong: 10 generations, D = 0.025, Nratio = 0.001 (θanc = 400.0)

(very rarely) BAD
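The idea behind this check is that, if confidence intervals have correct coverage, the LRT P-values computed at the true parameter values across simulated data sets should be uniformly distributed on [0, 1], so their ECDF should follow the diagonal. The sketch below measures the departure from the diagonal with the Kolmogorov-Smirnov distance. It is a hedged illustration only: the function name and the toy P-values are invented, and the "KS" values reported on the slide are presumably test P-values, whereas this code returns the distance statistic itself.

```python
def ks_distance_to_uniform(p_values):
    """Kolmogorov-Smirnov distance between the ECDF of p_values and Uniform(0, 1)."""
    xs = sorted(p_values)
    n = len(xs)
    d = 0.0
    for k, x in enumerate(xs, start=1):
        # The ECDF jumps from (k - 1)/n to k/n at x; compare both sides with the diagonal.
        d = max(d, k / n - x, x - (k - 1) / n)
    return d

# Toy P-values lying exactly on the diagonal (well-calibrated case).
well_calibrated = [(k - 0.5) / 10 for k in range(1, 11)]
# Toy P-values pushed towards 0 (tests rejecting too often, i.e. CIs too narrow).
too_small = [p ** 3 for p in well_calibrated]

d_good = ks_distance_to_uniform(well_calibrated)
d_bad = ks_distance_to_uniform(too_small)
print(d_good, d_bad)
```

A small distance is compatible with correct coverage, while a large one signals intervals that are too narrow or too wide.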


Simulation tests (Migraine)

Microsatellite markers show complex mutation processes

• Mutations do not fit the SMM: indels of more than one repeat often occur.

• Better mutation model = Generalized Stepwise Model (GSM): indels of X repeats at each mutation event, with X geometrically distributed. Commonly found value "in natura": pGSM ≈ 0.22.

• Problem: analyses under the SMM of data simulated under a GSM in a stable population often show false signs of a bottleneck (57% of false detections with pGSM = 0.22).
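To make the difference between the SMM and the GSM concrete, here is a minimal sketch of how one mutation step could be drawn. The parametrisation is an assumption: the step size X ≥ 1 is taken to follow P(X = k) = (1 − pGSM) pGSM^(k−1), so that pGSM = 0 gives back the SMM, and the direction (gain or loss of repeats) is taken to be symmetric. The function name `gsm_step` and the seed are illustrative.

```python
import random

def gsm_step(p_gsm, rng):
    """Signed change in repeat number for one mutation event under a GSM.

    Assumed parametrisation: size X >= 1 with P(X = k) = (1 - p_gsm) * p_gsm**(k - 1),
    and equal probability of gaining or losing repeats. p_gsm = 0 is the SMM.
    """
    size = 1
    while rng.random() < p_gsm:
        size += 1
    return size if rng.random() < 0.5 else -size

rng = random.Random(1)
gsm_steps = [gsm_step(0.22, rng) for _ in range(20_000)]
smm_steps = [gsm_step(0.0, rng) for _ in range(1_000)]

# Fraction of mutation events changing the allele by more than one repeat.
multi_repeat_fraction = sum(abs(s) > 1 for s in gsm_steps) / len(gsm_steps)
print(multi_repeat_fraction)
```

Under this parametrisation about 22% of mutation events involve more than one repeat, which a strict SMM analysis cannot account for.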


Simulation tests (MsVar vs. Migraine)

Some comparisons with MsVar
• Similar performances for "good" scenarios
• Better bottleneck detection rate for "non-optimal" scenarios
• Parameter inference and CIs are slightly more accurate

But comparison is not easy
• Frequentist vs. Bayesian approaches
• Very long computation times for MCMC (and sometimes for IS)


Conclusions from the simulation tests (MCMC & IS)

• Very efficient for bottleneck detection
• Accurate inferences for most demographic scenarios
• IS faster and sometimes more accurate than MCMC

But:
• Not robust to mis-specified mutational processes
• Not robust to immigration (structured populations)
• Inaccurate for extremely strong and recent population size changes

and... very long computation times for large data sets with many loci (i.e. "NGS" >> 100 - 1,000 loci)


Conclusions

• Coalescent theory and ML-based approaches provide a powerful framework for statistical inference in population genetics.
• They "extract" much more information from the data than moment-based methods.
• In these methods, gene genealogies are missing data.
• Coalescent theory may also help in understanding the limits of these methods (the reliability of a method also depends upon the quantity of information available in the data).
• Testing methods by simulation greatly helps to clearly understand real data analyses.


Written for subdivided populations, but not useful here?

\[
p(\mathbf{n}) = E\Big[\prod_d \binom{n_d}{(n_{di})}\prod_i X_{di}^{n_{di}}\Big]
\]


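Written for subdivided populations, but not useful here?

The backward diffusion equation holds with an operator which is a sum over different demes:
\[
\Phi = \frac{1}{2}\sum_{\text{demes } d}\;\sum_{\text{allele pairs } i,j}\frac{N_{\mathrm{tot}}}{N_d}\, x_{di}(\delta_{ij}-x_{dj})\,\frac{\partial^2}{\partial x_{di}\,\partial x_{dj}} + \sum_d\sum_i M_{di}\,\frac{\partial}{\partial x_{di}}
\]
where $x_{di}$ is the frequency of allele $i$ in deme $d$.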


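Written for subdivided populations, but not useful here?

Applying the previous arguments then leads to a relation between probabilities of samples that differ by one coalescence/mutation/migration event:
\[
N_{\mathrm{tot}}\Big(\sum_d n_d\Big(\frac{n_d-1}{N_d}+m_d+\mu\Big)\Big)\,p(\mathbf{n}) = N_{\mathrm{tot}}\sum_{d,j} n_d\,\frac{n_{dj}-1}{N_d}\,p(\mathbf{n}-e_{dj})
\]
\[
\qquad + N_{\mathrm{tot}}\,\mu\sum_{d,j}\sum_i P_{ij}\,(n_{di}+1-\delta_{ij})\,p(\mathbf{n}-e_{dj}+e_{di})
\]
\[
\qquad + N_{\mathrm{tot}}\sum_{d,j} n_d\sum_{d'\neq d} m_{dd'}\,\frac{n_{d'j}+1}{n_{d'}+1}\,p(\mathbf{n}-e_{dj}+e_{d'j}).
\]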


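Written for subdivided populations, but not useful here?

Expressing all $p(\cdot)$ in terms of the $p(\mathbf{n}-e_{dj})$ for distinct $d, j$:
\[
N_{\mathrm{tot}}\sum_{d,j}\Big(\frac{n_d-1}{N_d}+m_d+\mu\Big)\,\pi(j\mid d,\mathbf{n}-e_{dj})\,n_d\,p(\mathbf{n}-e_{dj}) = N_{\mathrm{tot}}\sum_{d,j} n_d\,\frac{n_{dj}-1}{N_d}\,p(\mathbf{n}-e_{dj})
\]
\[
\qquad + N_{\mathrm{tot}}\,\mu\sum_{d,j}\sum_i P_{ij}\,n_d\,\pi(i\mid d,\mathbf{n}-e_{dj})\,p(\mathbf{n}-e_{dj})
\]
\[
\qquad + N_{\mathrm{tot}}\sum_{d,j} n_d\sum_{d'\neq d} m_{dd'}\,\pi(j\mid d',\mathbf{n}-e_{dj})\,p(\mathbf{n}-e_{dj})
\]
Huge system of linear equations, not easier to solve in this form.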