Intro Likelihood & coa MCMC
IS
Sim tests
Conclusions
To do. RL: harmonize the notation with FR's macros; replace IS by SIS everywhere; stress that "sequential" means along the construction of the tree; separate more clearly the derivation of the recursion, the exact π and the approximate π̂; put keywords in blue; add the term p(H_τ) = stationary distribution of allelic states wherever needed. FR: 1) Cover the backward equation in an earlier lecture (plus other points about my presentation of the backward equation). 2) The differential operator φ_j is not written out explicitly ⇒ more complicated sentences.
Master 2 Biostatistics module: population genetics models
Likelihood-based demographic inference using the coalescent
Raphaël Leblois & François Rousset
Centre de Biologie pour la Gestion des Populations (CBGP, Montpellier), Institut des Sciences de l'Evolution (ISEM, Montpellier)
January 2017
• Introduction
• Likelihoods under the coalescent
• Felsenstein et al.'s MCMC
- Metropolis-Hastings
- MsVar example
- Conclusions on MCMC
• Griffiths et al.'s IS
- Griffiths et al.'s recursion
- Old IS scheme
- New IS scheme
- A general method based on diffusion
- Approximations of π
• Simulation tests
- Precision
- Validation
- Robustness
- MCMC vs. IS
• Conclusions
Typical biological question:
• There is demographic evidence that orang-utan population sizes have collapsed → but what is the major cause of the decline, when did it start and how strong is it?
• Can population genetics help?
- Can we infer the time of the event?
- Can we infer the strength of the population size decrease?
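Methods based on coalescence simulations (reminder)
[Figure: genealogy of the sample within the genealogy of the population (forward in time vs. backward in time), and the resulting coalescent tree]
Coalescence time while k lineages remain:
P(T_k = t) \approx \frac{k(k-1)}{2N} \, e^{-t \frac{k(k-1)}{2N}}
Number m of mutations on a branch of length t:
P(m | t) = \frac{(\mu t)^m e^{-\mu t}}{m!}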
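The two distributions above are enough to simulate the waiting times and mutation counts of a coalescent tree. The following is a minimal sketch, not code from the course: the function names, the parameter values and the seed are illustrative assumptions, and only interval lengths and mutation counts are drawn (no topology).

```python
import random

def poisson_draw(lam, rng):
    """Poisson(lam) draw: count unit-rate arrivals falling in [0, lam]."""
    count, arrival = 0, rng.expovariate(1.0)
    while arrival < lam:
        count += 1
        arrival += rng.expovariate(1.0)
    return count

def simulate_intervals(n, N, mu, rng):
    """Return [(k, T_k, m_k)] from k = n lineages down to k = 2."""
    intervals = []
    for k in range(n, 1, -1):
        rate = k * (k - 1) / (2 * N)           # coalescence rate with k lineages
        t_k = rng.expovariate(rate)            # waiting time T_k (generations)
        m_k = poisson_draw(mu * k * t_k, rng)  # mutations on the k branches of length T_k
        intervals.append((k, t_k, m_k))
    return intervals

rng = random.Random(2017)  # arbitrary seed
for k, t_k, m_k in simulate_intervals(n=5, N=1000, mu=1e-3, rng=rng):
    print(f"k={k}  T_k={t_k:9.1f} generations  mutations={m_k}")
```

The sum of k independent Poisson(µ T_k) counts is Poisson(k µ T_k), which is why a single draw per interval suffices.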
Two different ways to use the coalescent theory
• Exploratory approaches & simulation tests
- The coalescent allows efficient simulation of genetic variability under various demo-genetic models (sample vs. population):
specify the model and parameter values → coalescent process → simulated data sets
• Inferential approach
- The coalescent allows the inference of population-level evolutionary parameters (genetic, demographic, reproductive, ...); some of these methods use all the information contained in the genetic data (likelihood-based methods):
a real data set → coalescent process → infer the model parameters
Likelihood-based inference under the coalescent
• Inferential approaches are based on the modeling of population genetic processes. Each population genetic model is characterized by a set of demographic and genetic parameters P.
• The aim is to infer those parameters from a polymorphism data set (i.e. a genetic sample).
• The genetic sample is then considered as the realization ("output") of a stochastic process defined by the demo-genetic model.
• First, compute or estimate the likelihood L(P*; D), i.e. the probability P(D; P*) of observing the data D for some parameter values P*.
• Second, infer the likelihood surface over all parameter values, find the set of parameter values that maximizes it, and compute confidence intervals (maximum likelihood method), or compute posterior distributions and compare them with the priors (Bayesian approach).
Likelihood computations under the coalescent
• Problem: most of the time, the likelihood P(D; P*) of a genetic sample cannot be computed because there is no explicit mathematical expression.
• However, the probability P(D; P* | G_k) of observing the data D given a specific genealogy G_k can be computed for some parameter values P*.
• The likelihood is then the sum of all genealogy-specific likelihoods over the whole genealogical space, weighted by the probability of the genealogy given the parameters:
L(P^*; D) = \int_G P(D; P^* | G) \, P(G; P^*) \, dG
• In the integral L(P*; D) = \int_G P(D; P^* | G) \, P(G; P^*) \, dG, the factor P(D; P* | G) is the mutation term and P(G; P*) is the demography (coalescent) term.
• Genealogies are missing data: they are important for the computation of the likelihood, but there is no interest in estimating them → very different from phylogenetic approaches.
... It is usually impossible to sum over all possible genealogies ...
→ Monte Carlo simulations are used: a large number K of genealogies are simulated according to P(G; P*), and the mean over those simulations is taken as the expectation of P(D; P* | G):
L(P^*; D) = E_{P(G; P^*)}\left[ P(D; P^* | G) \right] \approx \frac{1}{K} \sum_{k=1}^{K} P(D; P^* | G_k)
A very large number of genealogies is necessary for a good estimate of the likelihood.
Plain Monte Carlo simulation is often inefficient because too many genealogies give extremely low probabilities of observing the data. More efficient algorithms are therefore used to explore the genealogical space and to focus on genealogies well supported by the data.
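The plain Monte Carlo estimator can be tried on a case small enough to have a closed form. The sketch below is an illustration under assumptions that are not in the slides: a sample of n = 2 genes, data D reduced to the number m of mutations separating them, and the rate k(k−1)/(2N) used earlier, so that T_2 is exponential with mean N. All function names and parameter values are hypothetical.

```python
import math
import random

def exact_likelihood(m, N, mu):
    """Closed form for n = 2: geometric distribution with theta = 2*N*mu."""
    theta = 2 * N * mu
    return (1 / (1 + theta)) * (theta / (1 + theta)) ** m

def mc_likelihood(m, N, mu, K, rng):
    """Average P(D | G_k) over K genealogies G_k drawn from P(G; P*)."""
    total = 0.0
    for _ in range(K):
        t = rng.expovariate(1 / N)   # T_2 has rate 2*(2-1)/(2N) = 1/N
        lam = 2 * mu * t             # two branches of length t
        total += lam ** m * math.exp(-lam) / math.factorial(m)
    return total / K

rng = random.Random(42)  # arbitrary seed
for K in (100, 10_000):
    print(K, mc_likelihood(3, 1000, 1e-3, K, rng), exact_likelihood(3, 1000, 1e-3))
```

With two genes almost every simulated genealogy contributes, so the estimator behaves well here; the inefficiency described above appears with larger samples, where most simulated genealogies are nearly incompatible with the data.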
Likelihood computations under the coalescent
• Two main approaches have been developed, using more efficient algorithms that explore genealogies in proportion to their probability of explaining the data, P(D; P | G):
MCMC: Markov chain Monte Carlo on the genealogical and parameter spaces, based on Felsenstein's pruning algorithm (1973, 1981).
Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17(6): 368-376.
IS: importance sampling on genealogies, based on the work of Griffiths & Tavaré (1994).
Griffiths, R.C. and S. Tavaré (1994). Simulating probability distributions in the coalescent. Theor. Pop. Biol. 46: 131-159.
• Comparison of the two approaches:
MCMC: Felsenstein's pruning algorithm
- Easier to implement, can easily consider various models
- Implemented in many software packages (LAMARC, Batwing, MsVar, MIGRATE, IM)
IS: Griffiths & Tavaré's coalescent recursion (cf. Ewens' recursion)
- Extension to different models may be difficult
- Implemented in fewer software packages (Genetree, Migraine)
The approach of Felsenstein et al.
• Based (1) on the availability of approximate exponential distributions for time intervals between events (coalescence, migration and recombination) and (2) on the separation of demographic and mutational processes:
1. The probability P(G_k; P*_demo) of a genealogy given the parameters of the demographic model can be computed from the distributions of time between events.
2. The probability P(D; P*_mut | G_k) of the data given a genealogy and mutational parameters can be computed from the mutation model parameters, the mutation rate, the tree topology and the branch lengths.
• From this, an efficient algorithm exploring the genealogical and parameter spaces should allow inference of the likelihood over the two spaces → MCMC.
MCMC with Metropolis-Hastings sampler
• Full conditional distributions cannot be computed, so classical MCMC samplers that rely on them (e.g. Gibbs) cannot be used
→ Markov chain Monte Carlo (MCMC) simulations using the Metropolis-Hastings (MH) algorithm
- to explore the genealogy space (G)
- and the parameter space (P = P_demo + P_mut)
All algorithms based on the approach of Felsenstein et al. use similar MH/MCMC algorithms, with slight differences in the MCMC update steps.
Metropolis-Hastings sampling for the coalescent
For the Metropolis-Hastings algorithm, we need to compute the ratio of the probability of the proposed update over that of the current state:
1. Computation of P(G_k; P_demo):
P(G_k; P_{demo}) = \prod_{i=0}^{MRCA-1} \gamma(t_{i+1}) \, e^{-\int_{t_i}^{t_{i+1}} \gamma(t) \, dt}
- Example for a stable WF population (coalescence only, time homogeneous):
P(G_k; P_{demo}) = \prod_{i=0}^{MRCA-1} \frac{k_{i+1}(k_{i+1}-1)}{2} \, e^{-(t_{i+1}-t_i) \frac{k_{i+1}(k_{i+1}-1)}{2}}
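For the stable WF example, the product is easiest to evaluate on the log scale. A minimal sketch, assuming time is already scaled so that each pair of lineages coalesces at rate 1, and taking the formula exactly as written above; the function name and the input format are illustrative, not MsVar's.

```python
import math

def log_prob_genealogy(coalescence_times, n):
    """log P(G; P_demo) for a coalescence-only genealogy of n genes.

    coalescence_times: increasing times t_1 < ... < t_{n-1}, with t_0 = 0.
    """
    log_p, t_prev, k = 0.0, 0.0, n
    for t in coalescence_times:
        rate = k * (k - 1) / 2                  # k_{i+1}(k_{i+1} - 1) / 2
        log_p += math.log(rate) - rate * (t - t_prev)
        t_prev, k = t, k - 1                    # one lineage fewer after the event
    return log_p

print(log_prob_genealogy([0.5, 1.5], n=3))
```

Working with logarithms avoids numerical underflow when the genealogy contains many events.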
2. Then compute the probability P(D; P_mut | G_k) of the data D given the genealogy G_k, by going from the MRCA to the leaves and considering the probability of occurrence of all mutations on each branch of length t_b and their effects (i.e. transitions among genetic states x → y):
P(D; P_mut | G_k) is a product over branches; each branch b contributes the effect of its mutations times the probability of their number:
P(D; P_{mut} | G_k) = \prod_{b=1}^{n_{branch}} \underbrace{P(y | x, m_b)}_{\text{effect of mutations}} \cdot \underbrace{P(m_b | t_b)}_{\text{number of mutations}}
= \prod_{b=1}^{2(n-1)} \left( (Mat_{mut})^{m_b} \right)_{x,y} \, \frac{(\mu t_b)^{m_b} e^{-\mu t_b}}{m_b!}
where Mat_mut is the mutation matrix (transition probabilities between genetic states x and y) and the last factor is the Poisson probability of the m_b mutations.
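The branch product can be written directly from this formula. The sketch below is only an illustration: the three-state stepwise matrix with reflecting boundaries, the branch list and the function names are assumptions made for the example, and each branch is described by its start state x, end state y, number of mutations m_b and length t_b.

```python
import math

def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    size = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(size)) for c in range(size)]
            for r in range(size)]

def mat_pow(mat, power):
    """mat raised to a non-negative integer power (identity for power 0)."""
    size = len(mat)
    result = [[float(r == c) for c in range(size)] for r in range(size)]
    for _ in range(power):
        result = mat_mul(result, mat)
    return result

def branch_prob(x, y, m_b, t_b, mu, mat_mut):
    """((Mat_mut)^m_b)_{x,y} times the Poisson probability of m_b mutations."""
    effect = mat_pow(mat_mut, m_b)[x][y]
    poisson = (mu * t_b) ** m_b * math.exp(-mu * t_b) / math.factorial(m_b)
    return effect * poisson

def data_prob(branches, mu, mat_mut):
    """P(D; P_mut | G): product of branch_prob over (x, y, m_b, t_b) tuples."""
    prob = 1.0
    for x, y, m_b, t_b in branches:
        prob *= branch_prob(x, y, m_b, t_b, mu, mat_mut)
    return prob

# Hypothetical 3-state stepwise mutation matrix (reflecting boundaries).
SMM = [[0.5, 0.5, 0.0], [0.5, 0.0, 0.5], [0.0, 0.5, 0.5]]
print(data_prob([(1, 2, 1, 0.8), (1, 1, 0, 0.8)], mu=0.5, mat_mut=SMM))
```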
3. These probabilities are plugged into the MH formula for the acceptance probabilities of candidate changes for the next state of the Markov chain. Reminder: P(D; P | G_k) = P(D; P_mut | G_k) P(G_k; P_demo)
• For each update, the new state (P′ or G′) is accepted or rejected according to the Metropolis-Hastings ratio.
• The MH ratio is chosen so that the chain converges towards the correct stationary distribution P(D; P), e.g. for a parameter update P → P′:
r_{MH} = \frac{P(D; P' | G) \, Prior(P')}{P(D; P | G) \, Prior(P)} \times \frac{P(P' \to P)}{P(P \to P')}
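The accept/reject rule can be sketched generically. This is not MsVar's update scheme: the target below is a standard normal density and the proposal a symmetric random walk, both chosen only to show the ratio at work on the log scale. `mh_step`, `propose` and the other names are hypothetical.

```python
import math
import random

def mh_step(state, log_target, propose, rng):
    """One MH update; propose returns (new_state, log q(new|old), log q(old|new))."""
    new_state, log_q_forward, log_q_reverse = propose(state, rng)
    log_r = (log_target(new_state) - log_target(state)
             + log_q_reverse - log_q_forward)
    if rng.random() < math.exp(min(0.0, log_r)):  # accept with probability min(1, r)
        return new_state
    return state

def log_std_normal(x):
    """Unnormalized log density of the toy target."""
    return -0.5 * x * x

def random_walk(x, rng):
    """Symmetric proposal: both proposal log densities cancel, so 0.0 is returned."""
    return x + rng.gauss(0.0, 1.0), 0.0, 0.0

def run_chain(n_steps, seed=0):
    rng = random.Random(seed)
    x, chain = 0.0, []
    for _ in range(n_steps):
        x = mh_step(x, log_std_normal, random_walk, rng)
        chain.append(x)
    return chain

chain = run_chain(5000)
print(sum(chain) / len(chain))
```

In the coalescent setting, `log_target` would combine log P(D; P_mut | G), log P(G; P_demo) and the log prior, and `propose` would modify either the parameters or the genealogy.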
Coalescent-based MCMC example: MsVar
• One example of a coalescent-based MCMC algorithm: MsVar
Beaumont, M. 1999. Detecting Population Expansion and Decline Using Microsatellites. Genetics.
• Biological context: past changes in population sizes (cf. orang-utans)
- Details of the demographic and mutation models
- A few results on the orang-utan data set
• Demographic model: a single isolated panmictic (WF) population with an exponential past change in population size.
Population contraction or expansion
3 demographic parameters: N, T, N_anc + 1 mutation parameter: µ
3 scaled parameters (diffusion approximation): θ, D, θ_anc
• Mutation model: Stepwise Mutation Model (SMM)
P = (N, T, N_anc, µ), P_scaled = (θ, D, θ_anc)
• Aim: infer those parameters (P or P_scaled) from a single present-day genetic sample using coalescent-based MCMC algorithms
MH/MCMC of MsVar
1. Initialization step: build a genealogy that is compatible with the data
→ Starting with the sample, choose a set of events depending on the starting values of the parameters; the events are also chosen to be compatible with the data.
2. MCMC steps: explore the parameter and genealogical spaces
→ Update the parameters for population sizes (θ_act, D, θ_anc), or
→ update the genealogy (sequence and times T_i of coalescence and mutation events).
Both updates are made using the Metropolis-Hastings algorithm.
MCMC updates in MsVar
[Figure: update scheme, with T_i = times of coalescence and mutation events, r = θ_act/θ_anc = population size ratio, t_f = D = time of the population size change]
M. Beaumont: "This scheme was devised by trial and error to obtain good rates of convergence."
Analyses of MsVar results
• First check that the chains mixed and converged properly
→ Visual check (very useful): traces of likelihood / parameters, autocorrelation
→ Compute convergence criteria among chains (GR, ...): not always useful
→ Run different chains and check the concordance between results
Problem: convergence is often quite poor with such coalescent-based MCMC algorithms, but simulation tests show that posterior distributions are generally correct (at least the mode as a point estimate) despite the lack of clear convergence indices.
• Bayesian method → compare posteriors (solid lines) and priors (dashed lines), and test different priors
• Bayesian method → compute a Bayes factor to check for a contraction or expansion signal:
BF = \frac{\text{Posterior prob. model 1}}{\text{Posterior prob. model 2}} \times \frac{\text{Prior prob. model 2}}{\text{Prior prob. model 1}}
• With equal priors for models 1 and 2, the Bayes factor for a contraction is thus
BF = \frac{\text{Posterior } P(N_{anc}/N_{act} > 1)}{\text{Posterior } P(N_{anc}/N_{act} < 1)} = \frac{\# \text{ MCMC steps where } N_{anc}/N_{act} > 1}{\# \text{ MCMC steps where } N_{anc}/N_{act} < 1}
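The last ratio is a simple count over the MCMC output. A minimal sketch, assuming the chain has already been reduced to a list of sampled N_anc/N_act values (how MsVar stores its output is not specified here, and the function name is hypothetical):

```python
def bayes_factor_contraction(ratios):
    """Bayes factor for a contraction from sampled N_anc / N_act values (equal priors)."""
    n_contraction = sum(1 for r in ratios if r > 1)
    n_expansion = sum(1 for r in ratios if r < 1)
    if n_expansion == 0:
        return float("inf")  # no sampled step supports an expansion
    return n_contraction / n_expansion

print(bayes_factor_contraction([2.0, 3.0, 0.5, 4.0]))  # 3 contraction steps vs 1 expansion step
```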
An application of MsVar: orang-utans and the deforestation of Borneo
Does the genome of orang-utans carry the signature of population bottlenecks? (Goossens et al. 2006, PLoS Biology)
Population sizes have collapsed : what is the cause ? Can population genetics help ?
(Delgado & Van Schaik, 2001 Evol. Anthropology)
• MsVar results
[Figure legend: FE = beginning of massive forest exploitation; F = first farmers; HG = first hunter-gatherers]
→ MsVar efficiently detects a past decrease in population size
→ and allows the beginning of the decrease to be dated: massive forest exploitation seems to be the most likely cause
Conclusions about MsVar / MCMC approaches
• Coalescent theory provides a powerful framework for statistical inference
→ It allows past history to be inferred from a single present-day sample (this was impossible with moment-based methods)
• Gene genealogies are missing data (but important ones)
→ MCMC with coalescent simulations is "difficult" (to run)
• But how robust is the inference to model assumptions?
- Mutational processes (e.g. large mutation steps → long branches)
- Population structure (e.g. immigrants → long branches)
The approach of Griffiths et al.
• The coalescent-based likelihood at a given point of the parameter space is an integral over all possible histories (genealogies with mutations) leading to the present genetic sample
• A Monte Carlo scheme is used to compute this integral
• Histories are built backward in time, event by event, starting from the present sample
• But computing exact backward transition probabilities is often too difficult → an IS scheme is used to compute the likelihoods by simulation
Recursions for sampling distributions
Ewens 1972: a Wright-Fisher, infinite-allele model. General recursion at stationarity [with e_j = (0, ⋯, 0, 1 in j-th position, 0, ⋯, 0)]:
P_n(a) = \frac{\theta}{n-1+\theta} P_{n-1}(a - e_1) + \frac{n-1}{n-1+\theta} \sum_{a_{j+1}>0} \frac{j(a_j+1)}{n-1} P_{n-1}(a + e_j - e_{j+1})
where, given that a coalescence occurs and that the descendant sample has (⋯, a_j, a_{j+1}, ⋯), the ancestral one has (⋯, a_j + 1, a_{j+1} − 1, ⋯), and the probability that one of the a_j + 1 alleles with j gene copies is chosen to duplicate is j(a_j + 1)/(n − 1).
Reminder: the sample is summarized by the vector a, ordered by allele counts (the allele frequency spectrum): a_j is the number of alleles having j copies in the sample.
Griffiths and Tavaré: recursion for mutation models defined by a matrix (p_ij) of mutation rates from i to j.
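Ewens' recursion is one of the few that can be evaluated directly, which makes it a convenient check. The sketch below implements the recursion as written above and compares it with the closed-form Ewens sampling formula. The function names and the encoding of a as a tuple (a[j-1] = number of alleles with j copies) are illustrative choices.

```python
from functools import lru_cache
from math import factorial

def sample_size(a):
    """n = sum_j j * a_j."""
    return sum((idx + 1) * count for idx, count in enumerate(a))

@lru_cache(maxsize=None)
def ewens_recursion(a, theta):
    """P_n(a) computed with the recursion, down to a single gene."""
    n = sample_size(a)
    if n == 1:
        return 1.0
    prob = 0.0
    if a[0] > 0:  # last event: mutation creating a singleton allele
        prob += theta / (n - 1 + theta) * ewens_recursion((a[0] - 1,) + a[1:], theta)
    for idx in range(len(a) - 1):  # last event: coalescence in an allele with j + 1 copies
        j = idx + 1
        if a[idx + 1] > 0:
            anc = list(a)
            anc[idx] += 1
            anc[idx + 1] -= 1
            weight = (n - 1) / (n - 1 + theta) * j * anc[idx] / (n - 1)
            prob += weight * ewens_recursion(tuple(anc), theta)
    return prob

def ewens_formula(a, theta):
    """Closed-form Ewens sampling formula, used here as a reference."""
    n = sample_size(a)
    rising = 1.0
    for k in range(n):
        rising *= theta + k
    prob = factorial(n) / rising
    for idx, count in enumerate(a):
        prob *= (theta / (idx + 1)) ** count / factorial(count)
    return prob

a = (1, 0, 1, 0)  # one singleton allele and one allele with three copies
print(ewens_recursion(a, 2.0), ewens_formula(a, 2.0))
```

Memoization (`lru_cache`) matters here: without it, the same ancestral configurations would be recomputed many times.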
The recursion of Griffiths et al.
• The coalescent-based likelihood at a given point of the parameter space is an integral over all possible histories (genealogies with mutations) H = {H_k; k = 0, ..., τ}, i.e. over all sequences of coalescence or mutation events that occurred from H_0, the current sample state, to H_τ, the allelic state of the most recent common ancestor (MRCA) of the sample.
• Then, for any given state H_k of the history (cf. Ewens):
p(H_k) = \sum_{\{H_{k'}\}} p(H_k | H_{k'}) \, p(H_{k'})
where H_{k'} is the ancestral sample state (i.e. the state before the last event) and p(H_k | H_{k'}) are the forward transition probabilities (i.e. from the ancestral to the current state).
• Griffiths & Tavaré 1994: example for a single population
p(H_k = \eta) = \frac{1}{\frac{n(n-1)}{2N} + n\mu} \left[ n\mu \sum_i \sum_{j: n_j>0, j \neq i} \frac{n_i+1}{n} \, p_{ij} \, p(H_{k'} = \eta - e_j + e_i) + \frac{n(n-1)}{2N} \sum_{j: n_j>1} \frac{n_j-1}{n-1} \, p(H_{k'} = \eta - e_j) \right]
- Setting θ = 2Nµ, with N the number of gene copies as in the coalescence rate n(n−1)/(2N) above (θ = 4Nµ when N counts diploid individuals), and β = n(n − 1 + θ), we have
p(H_k = \eta) = \frac{1}{\beta} \left[ \theta \sum_i \sum_{j: n_j>0, j \neq i} (n_i+1) \, p_{ij} \, p(H_{k'} = \eta - e_j + e_i) + n \sum_{j: n_j>1} (n_j-1) \, p(H_{k'} = \eta - e_j) \right]
• Such recursions are too difficult to solve except for very simple models (WF + IAM, cf. Ewens) → Griffiths & Tavaré (1994) proposed a Monte Carlo approach, using sequential importance sampling on past histories, to solve the recursion.
Inference of the likelihood by simulation
• Griffiths & Tavaré 1994: the recursion
p(H_k = \eta) = \frac{1}{\beta} \left[ \theta \sum_i \sum_{j: n_j>0, j \neq i} (n_i+1) \, p_{ij} \, p(H_{k'} = \eta - e_j + e_i) + n \sum_{j: n_j>1} (n_j-1) \, p(H_{k'} = \eta - e_j) \right]
can equivalently be written
p(H_k) = w_{GT}(H_k) \left( \sum_{i,j: n_j>0, j \neq i} M_{ij}(H_k) \, p(H_k - e_j + e_i) + \sum_{j: n_j>1} C_j(H_k) \, p(H_k - e_j) \right)
→ Histories are built backward, event by event, using an absorbing Markov chain (absorbing state = MRCA) based on forward transition probabilities ("uniform sampling" among all possible events, based on M_ij(H_k) and C_j(H_k)). w_GT(H_k) is the weight associated with the IS proposal.
• Expanding the recursion p(H_k) = \sum_{\{H_{k'}\}} p(H_k | H_{k'}) \, p(H_{k'}) over all possible ancestral histories of a current sample leads to
p(H_0) = E\left[ p(H_0 | H_1) \cdots p(H_{\tau-1} | H_\tau) \, p(H_\tau) \right]
Then
L(P; D) = p(H_0) = \int_H W_{GT}(H) \, f_{GT}(H) \, dH \approx \frac{1}{L} \sum_{h=1}^{L} W_{GT}(H_h) \approx \frac{1}{L} \sum_{h=1}^{L} \prod_{k=0}^{\tau} w_{GT}((H_h)_k).
This IS scheme f_GT(H) is not very efficient because it does not appropriately consider that some backward transitions are more likely than others given the current state (example: SMM mutation).
Towards a better IS scheme (Stephens & Donnelly 2000, de Iorio & Griffiths 2004)
→ A better importance sampling (IS) scheme should be used. Let Q(H_{k'}) be a given distribution; then
p(H_k) = \sum_{\{H_{k'}\}} \frac{p(H_k | H_{k'})}{Q(H_{k'})} \, Q(H_{k'}) \, p(H_{k'})
and, expanding over full histories,
p(H_0) = E_Q\left[ \frac{p(H_0 | H_1)}{Q(H_1)} \cdots \frac{p(H_{\tau-1} | H_\tau)}{Q(H_\tau)} \right]
where E_Q is the expectation over the distribution of full histories induced by Q (the factor p(H_τ), the stationary probability of the MRCA state, is left implicit here). This means that Q may be viewed as a proposal distribution in a sequential IS algorithm with matching weights p(H_k | H_{k'}) / Q(H_{k'}).
• The problem is then to find the proposal distribution Q that minimizes the variance of the likelihood estimates \frac{1}{L} \sum_{h=1}^{L} \prod_{k=0}^{\tau} w((H_h)_k), where w are the IS weights p(H_k | H_{k'}) / Q(H_{k'}) (w_GT for the Griffiths & Tavaré proposal).
• The ideal proposal is the backward transition probability p(H_{k'} | H_k), because the IS weights are then
\frac{p(H_k | H_{k'})}{Q(H_{k'})} = \frac{p(H_k | H_{k'})}{p(H_{k'} | H_k)} = \frac{p(H_k)}{p(H_{k'})}
and thus their product (together with the final factor p(H_τ)) is always the sample likelihood p(H_0)
→ a single tree reconstruction allows exact likelihood computation (null variance).
• However, backward transition probabilities p(H_{k'} | H_k) are generally unknown.
Aim: find good approximations \hat{p}(H_{k'} | H_k) of p(H_{k'} | H_k)
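The null-variance claim can be made explicit with a short telescoping argument. This is a sketch in the notation above, with H_{k+1} the ancestral state of H_k and the final factor p(H_τ) for the MRCA state written out:

```latex
\prod_{k=0}^{\tau-1} \frac{p(H_k \mid H_{k+1})}{p(H_{k+1} \mid H_k)} \; p(H_\tau)
  = \prod_{k=0}^{\tau-1} \frac{p(H_k)}{p(H_{k+1})} \; p(H_\tau)
  = \frac{p(H_0)}{p(H_\tau)} \; p(H_\tau)
  = p(H_0)
```

The first equality is Bayes' rule applied to each step, p(H_{k+1} | H_k) = p(H_k | H_{k+1}) p(H_{k+1}) / p(H_k). Every simulated history therefore returns the same value p(H_0), whatever path it follows.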
• The likelihood at a given point is an integral over all possible histories H = {H_k; k = 0, ..., τ}.
• Markov coalescent process → p(H_k) = \sum p(H_k | H_{k'}) \, p(H_{k'}) and p(H_0) = E\left[ p(H_0 | H_1) \cdots p(H_{\tau-1} | H_\tau) \, p(H_\tau) \right].
• However, forward transition probabilities p(H_k | H_{k'}) are not efficient in a backward process.
• Importance sampling techniques based on an approximation \hat{p}(H_{k'} | H_k) of p(H_{k'} | H_k) are used to build more likely histories (the factor p(H_τ) being left implicit):
p(H_0) = E_{\hat{p}}\left[ \frac{p(H_0 | H_1)}{\hat{p}(H_1 | H_0)} \cdots \frac{p(H_{\tau-1} | H_\tau)}{\hat{p}(H_\tau | H_{\tau-1})} \right].
Linking optimal weights to addition of a gene to a sample
Represent the sample probability p(n) as an integral over the joint distribution f(x) of allele frequencies in the population:
p(\mathbf{n}) = \int_x p(\mathbf{n} | x) \, f(x) \, dx = E_f\left[ \binom{n}{\mathbf{n}} \prod_i X_i^{n_i} \right]
where \binom{n}{\mathbf{n}} = \frac{n!}{\prod_i n_i!} is the multinomial coefficient.
Then the joint probability that we have a sample n and that an additional gene copy is of type j is
E_f\left[ X_j \binom{n}{\mathbf{n}} \prod_i X_i^{n_i} \right] = \frac{n_j+1}{n+1} \, p(\mathbf{n} + e_j).
We write this joint probability as p(n) times π(j | n), where π is thus the probability that an additional gene is of type j, given that we have already drawn the sample n from the population. Thus, if H_k and H_{k'} differ by the addition of one gene of type j, we can write the optimal IS weight as
\frac{p(H_{k'} = \mathbf{n})}{p(H_k = \mathbf{n} + e_j)} = \frac{n_j+1}{n+1} \, \frac{1}{\pi(j | \mathbf{n})}
Towards a better IS scheme: the π's
• Let π(⋅ | H_k) be the conditional distribution of the allelic type of an (n+1)-th gene, given H_k, the configuration (i.e. allelic types) of the first n genes of the sample.
• Then the optimal IS distribution (exact backward transition probabilities) is, for a single population:
p(H_{k'} | H_k) = \frac{1}{\beta} \, \theta \, n_j P_{ij} \, \frac{\pi(i | H_k - e_j)}{\pi(j | H_k - e_j)}  for H_{k'} = H_k - e_j + e_i
p(H_{k'} | H_k) = \frac{1}{\beta} \, \frac{n_j (n_j - 1)}{\pi(j | H_k - e_j)}  for H_{k'} = H_k - e_j
Towards a better IS scheme: the π̂'s
• Unfortunately, the π's are generally unknown
→ Stephens & Donnelly (2000) proposed a good approximation π̂ of the π's for a single WF population.
→ de Iorio & Griffiths (2004) proposed a general method for approximating the π's under different mutational and demographic models.
• Approximate backward transition probabilities based on the π̂'s are then used:
\hat{p}(H_{k'} | H_k) = \frac{1}{\beta} \, \theta \, n_j P_{ij} \, \frac{\hat{\pi}(i | H_k - e_j)}{\hat{\pi}(j | H_k - e_j)}  for H_{k'} = H_k - e_j + e_i
\hat{p}(H_{k'} | H_k) = \frac{1}{\beta} \, \frac{n_j (n_j - 1)}{\hat{\pi}(j | H_k - e_j)}  for H_{k'} = H_k - e_j
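The effect of the proposal can be checked on a model where π is known. The sketch below makes an assumption that is not in the slides: parent-independent mutation (P_ij = p_j for all i), for which π(j | n) = (n_j + θ p_j)/(n + θ) exactly and p(n) is a Dirichlet-multinomial probability. It uses the forward coefficients of the recursion with the (n_i + 1 − δ_ij) term, so that mutations with i = j are allowed. All function names are hypothetical; this is not the Genetree or Migraine implementation.

```python
import math
import random

def forward_moves(n, theta, p):
    """Events leading to sample n: (ancestral state, j, i or None, p(H_k | H_k'))."""
    tot = sum(n)
    beta = tot * (tot - 1 + theta)
    events = []
    for j, nj in enumerate(n):
        if nj == 0:
            continue
        if nj > 1:  # coalescence of two genes of type j
            events.append((n[:j] + (nj - 1,) + n[j + 1:], j, None, tot * (nj - 1) / beta))
        for i in range(len(n)):  # mutation i -> j, with P_ij = p[j] (PIM assumption)
            anc = list(n)
            anc[j] -= 1
            anc[i] += 1  # anc[i] now equals n_i + 1 - delta_ij
            events.append((tuple(anc), j, i, theta * anc[i] * p[j] / beta))
    return events

def pi_pim(i, n, theta, p):
    """Exact pi(i | n) under parent-independent mutation."""
    return (n[i] + theta * p[i]) / (sum(n) + theta)

def backward_weights(n, events, theta, p, pi_hat):
    """Backward transition probabilities built from pi_hat (unnormalized in general)."""
    tot = sum(n)
    beta = tot * (tot - 1 + theta)
    weights = []
    for _, j, i, _ in events:
        rest = n[:j] + (n[j] - 1,) + n[j + 1:]  # H_k - e_j
        pi_j = pi_hat(j, rest, theta, p)
        if i is None:
            weights.append(n[j] * (n[j] - 1) / (beta * pi_j))
        else:
            weights.append(theta * n[j] * p[j] * pi_hat(i, rest, theta, p) / (beta * pi_j))
    return weights

def sis_likelihood(n, theta, p, n_hist, rng, pi_hat=None):
    """Sequential IS estimate of p(n); pi_hat=None uses forward-proportional proposals."""
    total = 0.0
    for _ in range(n_hist):
        state, weight = tuple(n), 1.0
        while sum(state) > 1:
            events = forward_moves(state, theta, p)
            if pi_hat is None:
                q = [e[3] for e in events]
            else:
                q = backward_weights(state, events, theta, p, pi_hat)
            k = rng.choices(range(len(events)), weights=q)[0]
            weight *= events[k][3] * sum(q) / q[k]  # forward prob. / normalized proposal
            state = events[k][0]
        total += weight * p[state.index(1)]  # p(H_tau): stationary probability of the MRCA type
    return total / n_hist

def exact_likelihood(n, theta, p):
    """Dirichlet-multinomial probability of the unordered sample n (PIM only)."""
    tot = sum(n)
    log_p = math.lgamma(tot + 1) + math.lgamma(theta) - math.lgamma(theta + tot)
    for nj, pj in zip(n, p):
        log_p += math.lgamma(theta * pj + nj) - math.lgamma(theta * pj) - math.lgamma(nj + 1)
    return math.exp(log_p)

rng = random.Random(3)  # arbitrary seed
n, theta, p = (3, 2, 1), 1.5, (0.2, 0.3, 0.5)
print("exact              ", exact_likelihood(n, theta, p))
print("1 history, exact pi", sis_likelihood(n, theta, p, 1, rng, pi_hat=pi_pim))
print("2000 histories, GT ", sis_likelihood(n, theta, p, 2000, rng))
```

With the exact π, a single history reproduces the likelihood up to rounding error, whereas the forward-proportional proposal needs many histories to approach it. Passing an approximate `pi_hat` falls between the two.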
The backward equation for f(X_t | X_0 = x)
For a diffusion process, the probability density f of the allele frequencies satisfies Kolmogorov's backward equation, which describes the changes of f through time in the form
\frac{d f(X_t | X_0 = x)}{dt} = \Phi(f(x)),
where Φ is a differential operator that here takes the form
\Phi = \frac{1}{2} \sum_{i \in E} \sum_{j \in E} x_i (\delta_{ij} - x_j) \frac{\partial^2}{\partial x_i \partial x_j} + \sum_{j \in E} \left( \sum_{i \in E} x_i r_{ij} \right) \frac{\partial}{\partial x_j} = \sum_{j \in E} \Phi_j \frac{\partial}{\partial x_j}
with
R = \{r_{ij}\} \equiv \frac{\theta}{2} (P - I)
where P = {p_ij} is the mutation matrix and I the identity matrix.
The backward equation for E[g(X_t) | X_0 = x]
In the same way as
\frac{d f(X_t | X_0 = x)}{dt} = \Phi(f(x)),
the following "generator equation" (Karlin and Taylor, 1981, p. 215) holds for any function g(x) with bounded second derivatives:
\lim_{t \to 0} \frac{E[g(X_t) | X_0 = x] - g(x)}{t} = \Phi(g(x)).
We will apply this result with g the sample probability given population allele frequencies x.
π̂'s computation
To obtain a recursion on the sample probabilities p(n), with n = H_0, write p(n) in the form E[g(x)]:
p(\mathbf{n}) = E\left[ \binom{n}{\mathbf{n}} \prod_i X_i^{n_i} \right], where \binom{n}{\mathbf{n}} = \frac{n!}{\prod_i n_i!}.
We therefore have
\frac{d \, p(\mathbf{n})}{dt} = \Phi[p(\mathbf{n})].
At stationary equilibrium, d p(n)/dt is zero. Expanding the expression for Φ[p(n)] then recovers the recursion between the p(n).
Explicit recursions in terms of π
Applying the previous arguments then leads to a relation between probabilities of samples that differ by one coalescence or mutation event:
\[
N\Big(n\Big(\frac{n-1}{N} + \mu\Big)\Big)\,p(\mathbf n)
= N\sum_j n\,\frac{n_j - 1}{N}\,p(\mathbf n - e_j)
+ N\mu\sum_j\sum_i P_{ij}\,(n_i + 1 - \delta_{ij})\,p(\mathbf n - e_j + e_i).
\]
Expressing all p(.) in terms of the p(n − e_j) for distinct j:
\[
N\sum_j\Big(\frac{n-1}{N} + \mu\Big)\,\pi(j \mid \mathbf n - e_j)\,n\,p(\mathbf n - e_j)
= N\sum_j n\,\frac{n_j - 1}{N}\,p(\mathbf n - e_j)
+ N\mu\sum_j\sum_i P_{ij}\,n\,\pi(i \mid \mathbf n - e_j)\,p(\mathbf n - e_j).
\]
... a huge system of linear equations, not easier to solve in this form.
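The recursion can be checked numerically in a case where p(n) is known. The sketch below assumes parent-independent mutation (P_ij = ν_j), for which the stationary sample distribution is Dirichlet-multinomial, and writes the recursion in scaled form, n(n − 1 + θ) p(n) = Σ_j n(n_j − 1) p(n − e_j) + θ Σ_j Σ_i P_ij (n_i + 1 − δ_ij) p(n − e_j + e_i), with θ standing for the scaled mutation rate. The function names are illustrative.

```python
from math import factorial


def p_exact(counts, theta, nu):
    """Dirichlet-multinomial sample probability (stationary law assumed under PIM)."""
    if min(counts) < 0:
        return 0.0
    n = sum(counts)
    coef = factorial(n)                 # multinomial coefficient n! / prod(n_i!)
    num = 1.0
    for c, v in zip(counts, nu):
        coef //= factorial(c)
        for k in range(c):              # rising factorial of theta * nu_j
            num *= theta * v + k
    den = 1.0
    for k in range(n):                  # rising factorial of theta
        den *= theta + k
    return coef * num / den


def recursion_sides(counts, theta, nu):
    """Left- and right-hand sides of the scaled recursion for configuration `counts`."""
    n, K = sum(counts), len(counts)
    lhs = n * (n - 1 + theta) * p_exact(counts, theta, nu)
    rhs = 0.0
    for j in range(K):
        if counts[j] == 0:
            continue
        minus_j = list(counts)
        minus_j[j] -= 1
        # coalescence term
        rhs += n * (counts[j] - 1) * p_exact(minus_j, theta, nu)
        # mutation terms: prev[i] equals n_i + 1 - delta_ij
        for i in range(K):
            prev = list(minus_j)
            prev[i] += 1
            rhs += theta * nu[j] * prev[i] * p_exact(prev, theta, nu)
    return lhs, rhs


lhs, rhs = recursion_sides([2, 1], theta=1.5, nu=[0.3, 0.7])
```

Both sides agree up to floating-point error for any configuration, which makes this a convenient regression test before moving to models where p(n) has no closed form.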
π̂'s computation
Note that Φ[p(n)] can be written in the form
\[
\sum_{j\in E}\Phi_j\,\frac{\partial}{\partial x_j}\,[p(\mathbf n)].
\]
The approximation technique developed by de Iorio & Griffiths is to approximate the π's (derived previously from p(n), the solution of Φ[p(n)] = 0) by π̂'s derived from the solutions of
\[
E\Big[\Phi_j\,\frac{\partial p(\mathbf n)}{\partial x_j}\Big]
= E\Big[\Phi_j\,\frac{\partial}{\partial x_j}\binom{n}{\mathbf n}\prod_i x_i^{n_i}\Big] = 0,
\quad\text{for each } j \in E.
\]
π̂'s computation
The approximation technique developed by de Iorio & Griffiths is to approximate the π's (derived previously from p(n), the solution of Φ[p(n)] = 0) by π̂'s derived from the solutions of
\[
E\Big[\Phi_j\,\frac{\partial}{\partial x_j}\binom{n}{\mathbf n}\prod_i x_i^{n_i}\Big] = 0,
\quad\text{for each } j \in E,
\]
which gives, for a panmictic population and for each j ∈ E,
\[
n_j\,(n - 1 + \theta)\,\hat p(\mathbf n)
= n\,(n_j - 1)\,\hat p(\mathbf n - e_j)
+ \sum_{i\in E}\theta P_{ij}\,(n_i + 1 - \delta_{ij})\,\hat p(\mathbf n - e_j + e_i).
\]
π̂'s computation
Reminder: π(j|n) can be expressed in terms of p(n) and p(n + e_j):
\[
\pi(j \mid \mathbf n)\,p(\mathbf n) = \frac{n_j + 1}{n + 1}\,p(\mathbf n + e_j).
\]
If we assume that this relation also holds for the π̂ and p̂, which will generally not be the case, we have
\[
\hat\pi(j \mid \mathbf n)\,\hat p(\mathbf n) = \frac{n_j + 1}{n + 1}\,\hat p(\mathbf n + e_j).
\]
π̂'s computation
Approximate the p(n), solutions of Φ[p(n)] = 0, by the p̂(n), solutions of
\[
E\Big[\Phi_j\,\frac{\partial}{\partial x_j}\binom{n}{\mathbf n}\prod_i x_i^{n_i}\Big] = 0,
\quad\text{for each } j \in E,
\]
which gives, for a panmictic population and for each j ∈ E,
\[
n_j\,(n - 1 + \theta)\,\hat p(\mathbf n)
= n\,(n_j - 1)\,\hat p(\mathbf n - e_j)
+ \sum_{i\in E}\theta P_{ij}\,(n_i + 1 - \delta_{ij})\,\hat p(\mathbf n - e_j + e_i).
\]
Using \(\hat\pi(j \mid \mathbf n)\,\hat p(\mathbf n) = \frac{n_j + 1}{n + 1}\,\hat p(\mathbf n + e_j)\) and replacing n by n + e_j (so that the sample size n becomes n + 1), we obtain, for each j ∈ E:
\[
(n + \theta)\,\hat\pi(j \mid \mathbf n) = n_j + \sum_{i\in E}\theta P_{ij}\,\hat\pi(i \mid \mathbf n).
\]
This is the linear system that gives the π̂(j|n) for a Wright-Fisher model.
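As an illustration, the linear system for the π̂(j|n) can be solved by a simple fixed-point iteration. The sketch below assumes the normalisation (n + θ) π̂(j|n) = n_j + θ Σ_i P_ij π̂(i|n), under which the π̂(j|n) sum to one, and a row-stochastic mutation matrix P given as a list of lists. The function name and the stopping rule are illustrative choices, not taken from any published implementation.

```python
def pi_hat(counts, theta, P, tol=1e-13, max_iter=10_000):
    """Solve (n + theta) * pi[j] = n_j + theta * sum_i P[i][j] * pi[i] by iteration.

    The update is a contraction with factor theta / (n + theta) when n >= 1.
    """
    n, K = sum(counts), len(counts)
    pi = [1.0 / K] * K                      # any probability vector works as a start
    for _ in range(max_iter):
        new = [
            (counts[j] + theta * sum(P[i][j] * pi[i] for i in range(K))) / (n + theta)
            for j in range(K)
        ]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    raise RuntimeError("pi_hat iteration did not converge")


# Parent-independent mutation: every row of P equals nu.
pim = pi_hat([2, 1], theta=1.5, P=[[0.3, 0.7], [0.3, 0.7]])

# A small stepwise-like matrix on three allelic states (hypothetical example).
smm = pi_hat([2, 0, 1], theta=1.5,
             P=[[0.5, 0.5, 0.0], [0.5, 0.0, 0.5], [0.0, 0.5, 0.5]])
```

Under parent-independent mutation the iteration reproduces the closed form (n_j + θν_j)/(n + θ); for other mutation models it returns the de Iorio & Griffiths style approximation defined by the system above.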
New IS scheme with the π̂'s
[To do: add two summary slides on the new IS scheme, reusing parts of slides 63, 64, 65, 66, 70 and 71.]
A much better IS scheme based on the π̂'s
• Drastic gain in efficiency with this new IS scheme (old IS: millions of trees)
→ exact backward transition probabilities for a WF model with parent-independent mutation (i.e. KAM)
→ only 30 histories necessary for a good estimation of the likelihood for more complex models (structured populations & KAM)
• but efficiency decreases slightly with non-parent-independent mutation models, e.g. the stepwise mutation model (200 histories for structured populations & SMM)
• and efficiency is still limited for time-inhomogeneous demographic models, e.g. one population with past size change (cf. orang-utan example)
→ up to 20,000 histories necessary for strong disequilibrium scenarios (e.g. quick change in population size)
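To make the scheme concrete, here is a minimal sketch of a sequential IS likelihood estimator for a single stable population. It is not the Migraine implementation. It assumes the proposal above with β = n(n − 1 + θ), forward probabilities taken from the scaled recursion for unordered samples, π̂ obtained from (n + θ) π̂(j|n) = n_j + θ Σ_i P_ij π̂(i|n), and a stationary allele distribution supplied by the caller for the last remaining lineage. All names are illustrative.

```python
import random


def pi_hat(counts, theta, P, tol=1e-13, max_iter=10_000):
    """Fixed-point solution of (n + theta) pi[j] = n_j + theta sum_i P[i][j] pi[i]."""
    n, K = sum(counts), len(counts)
    pi = [1.0 / K] * K
    for _ in range(max_iter):
        new = [(counts[j] + theta * sum(P[i][j] * pi[i] for i in range(K))) / (n + theta)
               for j in range(K)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    raise RuntimeError("pi_hat iteration did not converge")


def sis_weight(counts, theta, P, stationary, rng):
    """Simulate one backward history and return its importance weight."""
    counts = list(counts)
    K = len(counts)
    weight = 1.0
    while sum(counts) > 1:
        n = sum(counts)
        beta = n * (n - 1 + theta)
        events = []                          # (proposal prob, forward prob, ancestral counts)
        for j in range(K):
            if counts[j] == 0:
                continue
            rest = counts.copy()
            rest[j] -= 1
            ph = pi_hat(rest, theta, P)
            if counts[j] >= 2:               # coalescence of two type-j lineages
                q = counts[j] * (counts[j] - 1) / (beta * ph[j])
                f = (counts[j] - 1) / (n - 1 + theta)
                events.append((q, f, rest))
            for i in range(K):               # mutation from i to j
                if P[i][j] == 0.0:
                    continue
                prev = rest.copy()
                prev[i] += 1
                q = theta * counts[j] * P[i][j] * ph[i] / (beta * ph[j])
                f = theta * P[i][j] * prev[i] / (n * (n - 1 + theta))
                events.append((q, f, prev))
        q, f, counts = rng.choices(events, weights=[e[0] for e in events])[0]
        weight *= f / q                      # forward probability over proposal probability
    return weight * stationary[counts.index(1)]


def sis_likelihood(counts, theta, P, stationary, n_histories=30, seed=1):
    """Average the importance weights of `n_histories` independent histories."""
    rng = random.Random(seed)
    weights = [sis_weight(counts, theta, P, stationary, rng) for _ in range(n_histories)]
    return sum(weights) / len(weights)


nu = [0.3, 0.7]
estimate = sis_likelihood([2, 1], theta=1.5, P=[nu, nu], stationary=nu)
```

Under parent-independent mutation the π̂ are exact, so every history carries the same weight and the estimator has zero variance; with other mutation models the weights vary between histories, which is why more histories are needed.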
Implementations of IS: Genetree and Migraine
• Genetree (Bahlo & Griffiths 2000, old IS algorithm)
- 2 to 4 populations with migration (ISM)
• Migraine (Rousset & Leblois 2007-2014, new IS algorithms)
- One single stable population (KAM, SMM, GSM, ISM)
- One population with past size variation (KAM, SMM, GSM, ISM)
- 2 populations with migration (KAM, SMM, ISM)
- Isolation by distance in 1D and 2D (KAM)
Implementation of IS in Migraine
1. C++ core: IS computations
• Stratified random sampling of parameter points
• Estimation of the likelihood at each point using IS
2. R code for "post-treatment"
• Likelihood surface interpolation by Kriging
• Inference of MLEs and CIs
• Plots of 1D and 2D likelihood profiles
Simulation tests
Can we trust the demographic / historical inferences made with those methods?
Aim: assess the validity and robustness of the method:
• Bias, RMSE, coverage properties of confidence intervals
• Robustness to realistic but "uninteresting" mis-specifications
→ to this aim, we tested by simulation:
- The performance of Migraine in inferring dispersal under IBD
- The performance of MsVar and Migraine in detecting and measuring past population size changes
A few interesting results...
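The summary statistics used in these tests can be sketched as follows. The definitions below are the usual ones (errors scaled by the true parameter value); the slides do not spell them out, so they are stated here as an assumption, and the function names are illustrative.

```python
from math import sqrt


def relative_bias(estimates, true_value):
    """Mean of (estimate - truth) / truth over simulated data sets."""
    return sum((e - true_value) / true_value for e in estimates) / len(estimates)


def relative_rmse(estimates, true_value):
    """Square root of the mean squared relative error."""
    return sqrt(sum(((e - true_value) / true_value) ** 2 for e in estimates) / len(estimates))


def coverage(intervals, true_value):
    """Fraction of confidence intervals (low, high) that contain the true value."""
    return sum(low <= true_value <= high for low, high in intervals) / len(intervals)


# Toy values standing in for estimates from four simulated data sets.
estimates = [0.36, 0.44, 0.40, 0.48]
intervals = [(0.30, 0.50), (0.41, 0.60), (0.20, 0.45), (0.35, 0.39)]
bias = relative_bias(estimates, true_value=0.4)
rmse = relative_rmse(estimates, true_value=0.4)
cover = coverage(intervals, true_value=0.4)
```

For a nominal 95% interval, a coverage well below 0.95 over many simulated data sets signals that the confidence intervals are too narrow or misplaced.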
Simulation tests (MsVar, Girod et al. 2011)
Strong correlations between some pairs of "natural" parameters, but this is expected given coalescent theory...
Simulation tests (MsVar, Girod et al. 2011)
There is no information in the genetic data to infer µ, N and T separately, because coalescent histories (H, genealogies with mutations) generated under the usual diffusion/coalescent approximations (large N, small µ) depend only on the scaled parameters Nµ and T/N.
Constant Nµ product → same unscaled history and same polymorphism
Two indistinguishable situations under the coalescent approximations!
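A small numerical illustration of this non-identifiability, with made-up parameter values: two demographic scenarios that differ tenfold in N, µ and T but share the same scaled parameters Nµ and T/N.

```python
def scaled_parameters(N, mu, T):
    """Return the scaled mutation rate N * mu and the scaled time T / N."""
    return N * mu, T / N


# Hypothetical values chosen only for illustration.
scenario_a = scaled_parameters(N=1_000, mu=1e-3, T=500)
scenario_b = scaled_parameters(N=10_000, mu=1e-4, T=5_000)
```

Both scenarios map to the same pair of scaled parameters (up to floating-point rounding), so under the coalescent approximations no amount of genetic data can tell them apart.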
Simulation tests (MsVar, Girod et al. 2011)
Much better results by rescaling parameters as in the coalescent approximations
Simulation tests (Migraine)
[Figure: relative bias and relative RMSE of the estimates of 2Nµ, D and 2N_anc µ, plotted against D, with BDR and FEDR reported for each scenario.]
D:    0.025  0.0625  0.125  0.25  0.5  1.25  2.5  3.5   5     7.5
BDR:  0.76   0.98    1      1     1    1     1    0.98  0.79  0.5
FEDR: 0      0       0      0     0    0     0    0     0     0.005
Good reliability of the estimates for population declines, provided they are neither too recent nor too weak...
Why does the method's performance depend so strongly on the time of the event and on its intensity?
Simulation tests (MsVar & Migraine)
• How are genealogies affected by demographic parameters?
→ "Predict" the quantity of information present in the data
The information in the data strongly depends on the number of mutations and coalescent events during the different demographic phases
Simulation tests (Migraine)
Beyond biases, RMSE and bottleneck detection rates...
Testing CI coverage properties using LRT P-value distributions
[Figure: ECDF of LRT P-values, one panel per parameter: 2Nmu = 0.4, D = 1.25, 2Nancmu = 400, Nratio = 0.001. Each panel reports a KS value and a (rel. bias, rel. RMSE) pair. Reported KS values: 0.165, 0.203, 0.433, 0.857. Reported (rel. bias, rel. RMSE) pairs: (0.152, 0.601), (0.116, 0.453), (0.0496, 0.375), (−0.00452, 0.14). DR: 1 (0).]
(usually) GOOD
Extremely recent and strong: 10 generations, D = 0.025, Nratio = 0.001 (θanc = 400.0)
(very rarely) BAD
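If confidence intervals have correct coverage, the LRT P-values computed at the true parameter values should be uniformly distributed on (0, 1), so their ECDF should follow the diagonal. A minimal sketch of the Kolmogorov-Smirnov distance that quantifies the departure from uniformity is given below. It computes only the distance statistic, not the test P-value (the "KS" values in the figure are presumably such P-values), and the function name is illustrative.

```python
def ks_distance_uniform(p_values):
    """Kolmogorov-Smirnov distance between the ECDF of p_values and Uniform(0, 1)."""
    xs = sorted(p_values)
    m = len(xs)
    # Largest gap with the ECDF above the diagonal, evaluated just after each jump.
    d_plus = max((k + 1) / m - x for k, x in enumerate(xs))
    # Largest gap with the ECDF below the diagonal, evaluated just before each jump.
    d_minus = max(x - k / m for k, x in enumerate(xs))
    return max(d_plus, d_minus)


well_calibrated = ks_distance_uniform([0.1, 0.3, 0.5, 0.7, 0.9])
too_small = ks_distance_uniform([0.01, 0.02, 0.03])
```

A small distance indicates P-values close to uniform, hence well-calibrated intervals; P-values piled up near zero give a distance close to one and indicate intervals that exclude the true value too often.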
Simulation tests (Migraine)
Microsatellite markers show complex mutation processes
• Mutations do not fit the SMM: indels of more than one repeat often occur
• Better mutation model = Generalized Stepwise Model (GSM): indels of X repeats at each mutation event, with X geometrically distributed; value commonly found "in natura": pGSM ≈ 0.22
• Problem: analyses under the SMM of data simulated under a GSM in a stable population often show false signs of a bottleneck (57% false detections with pGSM = 0.22)
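A minimal sketch of one GSM mutation event. It assumes the common parameterisation in which the number of repeats X gained or lost follows P(X = k) = (1 − pGSM) pGSM^(k−1) for k ≥ 1, so that pGSM = 0 reduces to the SMM, and it assumes gains and losses are equally likely. Bounds on allele size are ignored, and the function name is illustrative.

```python
import random


def gsm_step(allele, p_gsm, rng):
    """Apply one mutation: add or remove X repeats, with X geometric on 1, 2, ..."""
    x = 1
    while rng.random() < p_gsm:     # each extra repeat is added with probability p_gsm
        x += 1
    sign = rng.choice((-1, 1))      # gain or loss, assumed equally likely
    return allele + sign * x


rng = random.Random(42)
smm_changes = [abs(gsm_step(20, 0.0, rng) - 20) for _ in range(1_000)]
gsm_changes = [abs(gsm_step(20, 0.22, rng) - 20) for _ in range(20_000)]
mean_gsm_change = sum(gsm_changes) / len(gsm_changes)
```

With pGSM = 0.22 the mean step size is 1/(1 − 0.22) ≈ 1.28 repeats, so roughly one mutation in five changes allele length by more than one repeat, which an SMM analysis cannot account for.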
Simulation tests (MsVar vs. Migraine)
Some comparisons with MsVar
• Similar performance for "good" scenarios
• Better bottleneck detection rate for "non-optimal" scenarios
• Parameter inference and CIs are slightly more accurate
But the comparison is not easy
• Frequentist vs. Bayesian approaches
• Very long computation times for MCMC (and sometimes for IS)
Conclusions from the simulation tests (MCMC & IS)
• Very efficient for bottleneck detection
• Accurate inferences for most demographic scenarios
• IS faster and sometimes more accurate than MCMC
But:
• Not robust to mis-specification of the mutation process
• Not robust to immigration (structured populations)
• Inaccurate for extremely strong and recent population size changes
and... very long computation times for large data sets with many loci (i.e. "NGS" >> 100 - 1,000 loci)
Conclusions
• Coalescent theory and ML-based approaches provide a powerful framework for statistical inference in population genetics.
• They "extract" much more information from the data than moment-based methods.
• In these methods, gene genealogies are missing data.
• Coalescent theory may also help in understanding the limits of these methods (the reliability of a method also depends on the quantity of information available in the data).
• Testing methods by simulation greatly helps in understanding real data analyses.
(Written for subdivided populations; possibly not needed here.)
\[
p(\mathbf n) = E\Big[\prod_d \binom{n_d}{(n_{di})}\prod_i X_{di}^{\,n_{di}}\Big]
\]
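(Written for subdivided populations; possibly not needed here.)
The backward diffusion equation holds with an operator that is a sum over the different demes:
\[
\Phi = \frac{1}{2}\sum_{\text{demes } d}\;\sum_{\text{allele pairs } i,j}\frac{N_{tot}}{N_d}\,x_{di}\,(\delta_{ij} - x_{dj})\,\frac{\partial^2}{\partial x_{di}\,\partial x_{dj}}
+ \sum_d\sum_i M_{di}\,\frac{\partial}{\partial x_{di}},
\]
where x_di is the frequency of allele i in deme d.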
(Written for subdivided populations; possibly not needed here.)
Applying the previous arguments then leads to a relation between probabilities of samples that differ by one coalescence, mutation or migration event:
\[
N_{tot}\Big(\sum_d n_d\Big(\frac{n_d - 1}{N_d} + m_d + \mu\Big)\Big)\,p(\mathbf n)
= N_{tot}\sum_{d,j} n_d\,\frac{n_{dj} - 1}{N_d}\,p(\mathbf n - e_{dj})
+ N_{tot}\,\mu\sum_{d,j}\sum_i P_{ij}\,(n_{di} + 1 - \delta_{ij})\,p(\mathbf n - e_{dj} + e_{di})
+ N_{tot}\sum_{d,j} n_d\sum_{d'\neq d} m_{dd'}\,\frac{n_{d'j} + 1}{n_{d'} + 1}\,p(\mathbf n - e_{dj} + e_{d'j}).
\]
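(Written for subdivided populations; possibly not needed here.)
Expressing all p(.) in terms of the p(n − e_dj) for distinct d, j:
\[
N_{tot}\sum_{d,j}\Big(\frac{n_d - 1}{N_d} + m_d + \mu\Big)\,\pi(j \mid d, \mathbf n - e_{dj})\,n_d\,p(\mathbf n - e_{dj})
= N_{tot}\sum_{d,j} n_d\,\frac{n_{dj} - 1}{N_d}\,p(\mathbf n - e_{dj})
+ N_{tot}\,\mu\sum_{d,j}\sum_i P_{ij}\,n_d\,\pi(i \mid d, \mathbf n - e_{dj})\,p(\mathbf n - e_{dj})
+ N_{tot}\sum_{d,j} n_d\sum_{d'\neq d} m_{dd'}\,\pi(j \mid d', \mathbf n - e_{dj})\,p(\mathbf n - e_{dj}).
\]
A huge system of linear equations, not easier to solve in this form.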