Introduction
ARG
Methods
HMM
PSMC
Simuls
Master 2 BioStat module: Modelling in population genetics
Inferences of demographic history from whole genomes using approximations of the ancestral recombination graph
Raphaël Leblois
Centre de Biologie pour la Gestion des Populations (CBGP, UMR INRA)
January 2018
Conclusion
Context and typical biological question:
• Good genomic data:
  - Reference genome → "spatial" arrangement of polymorphisms (physical / genetic distance)
  - Or many long runs of sequence → ≥ 10 Mb
• "Not too recent" demographic history inference:
  - Detailed variations in ancestral population size
  - Inference of gene flow / admixture during the speciation process
Demographic inferences in population genetics/genomics
Before the genomic era, inferences were based on:
• A low number of independent loci (e.g. 5-100 loci)
• Short DNA sequences with low polymorphism levels (a few kb, with 1-10 haplotypes)
• Fast-mutating DNA sequences, e.g. microsatellites, with high polymorphism levels (up to 50 alleles)
Now, we increasingly have access to:
• Whole-genome assemblies for many species
• Or many long sequences (> a few Mb), with ever better genome assemblies for less studied organisms
→ Genome-wide sequence data contain rich information about evolutionary processes
Demographic inferences in population genomics: what's really new?
1 - More polymorphisms (mutation events)...
• More alleles in 10,000 SNPs than in 20 microsatellites
• Simple mutational process
But ascertainment bias is often substantial with SNPs, sequencing errors are frequent, and phasing haplotype data is problematic.
2 - With a good genome assembly, we now have access to the spatial arrangement of markers
→ Information about recombination events
→ Genome-wide sequence data contain rich information about evolutionary processes, which remains inaccessible with independent markers.
The coalescent with recombination: the ancestral recombination graph (ARG)
ARG = modelling recombination within or between genetic markers.
Recombination events must be considered within long DNA sequences or between spatially close SNPs.
• A more complex genealogical space to explore than for non-recombining loci
• Not a tree anymore, but a graph
→ Much longer computation times, and possibly intractable for long sequences and large sample sizes
Several methods developed since 2007
Several recent inference methods use the linkage disequilibrium information present in genomic data, e.g. Mailund et al. 2011; Palamara et al. 2012; Harris & Nielsen 2013; MacLeod et al. 2013; Sheehan et al. 2013, but analyses are often limited to n = 2 sequences.
2 main approaches:
• Distributions of homozygous sequence lengths (runs of homozygosity, RoH; identity-by-state, IBS, tract lengths)
• Approximation of the coalescent with recombination: the sequential coalescent with hidden Markov models (PSMC, CoalHMM)
Ideas under the coalescent HMM
Approximation of the sequential coalescent with recombination, based on the following ideas:
• Effects of population size variation on the coalescent
• ...
• The population size at time t is inversely proportional to the rate of coalescence at time t
• Coalescence times are unknown, but are correlated with the number of mutations between two leaves (and thus with allelic frequencies)
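The first idea can be made concrete with a small sketch. It assumes time is measured in coalescent units (so that two lineages coalesce at rate 1 when the relative size is 1) and that the relative population size λ(t) = N(t)/N0 is piecewise constant. The function names and the `breakpoints` / `lambdas` arguments are hypothetical, introduced only for illustration.

```python
import math

def cumulative_hazard(t, breakpoints, lambdas):
    """Integral of 1/lambda(v) for v in [0, t], with lambda piecewise constant.

    breakpoints: increasing epoch start times, beginning with 0.0
    lambdas:     relative population size N(t)/N0 in each epoch
    """
    total = 0.0
    for i, lam in enumerate(lambdas):
        start = breakpoints[i]
        end = breakpoints[i + 1] if i + 1 < len(breakpoints) else math.inf
        if t <= start:
            break
        total += (min(t, end) - start) / lam
    return total

def relative_size(t, breakpoints, lambdas):
    """Relative population size in the epoch that contains time t."""
    idx = 0
    for i, start in enumerate(breakpoints):
        if t >= start:
            idx = i
    return lambdas[idx]

def coalescence_density(t, breakpoints, lambdas):
    """Density of the pairwise coalescence time: rate 1/lambda(t) times survival."""
    rate = 1.0 / relative_size(t, breakpoints, lambdas)
    return rate * math.exp(-cumulative_hazard(t, breakpoints, lambdas))
```

With a single epoch of relative size 1 the density reduces to exp(-t). A smaller relative size in an epoch raises the coalescence rate there, which is the signal the HMM uses to infer N(t).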
• Genealogies of nearby loci are correlated with each other
• Use mutations to learn about the ancestral recombination graph
• Analysis of two long DNA sequences
The coalescent HMM: a generative model for two DNA sequences
1 Transitions between trees: what is the probability of a tree change from height s to height t between two successive sites (nucleotides)?
2 Actual observed sequence: given a tree height t, what is the probability of being heterozygous or homozygous at one site between the two sequences?
The Sequential Markov Coalescent (SMC)
Simplifications of the coalescent with recombination between two sequences (SMC/SMC'):
1 Consider a single recombination event between 2 sites
2 Consider that the coalescent along the sequence is Markovian (i.e. it only depends on what happened at the previous site):
$$P(t_l \mid t_{l-1}, t_{l-2}, \ldots, t_1) = P(t_l \mid t_{l-1})$$
Transitions between trees: from a tree of height s to a tree of height t
• Wiuf and Hein (1999): coalescent with recombination along two DNA sequences
• McVean and Cardin (2005): approximation of the coalescent with two possible types of recombination event, SMC = (a) + (b)
• Marjoram and Wall (2006): approximation with three possible types of recombination event, SMC' = (a) + (b) + (c)
Transition probabilities are "relatively" simple to compute:
• Recombination events follow an exponential distribution with rate ρ = 2Nr
• Probability of a recombination event between 0 and s: $1 - e^{-\rho s}$
• Probability of no recombination event between 0 and s: $e^{-\rho s}$
Given that a recombination event happened, its position is uniformly distributed between 0 and s → Pr(Rec. at u | Rec.) = 1/s
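The two ingredients so far (probability of a recombination event, uniform position of the event) are enough to sketch how tree heights could be sampled along a pair of sequences. This is a minimal sketch for a constant population size, not a reference implementation: the re-coalescence step `u + Exp(1)` is the usual SMC rule in coalescent time units and is an assumption here, and the function names are hypothetical.

```python
import math
import random

def next_tree_height(s, rho, rng):
    """Sample the height t of the next tree given the current height s."""
    # No recombination between the two sites: probability exp(-rho * s).
    if rng.random() < math.exp(-rho * s):
        return s
    # The recombination point u is uniform on (0, s).
    u = rng.uniform(0.0, s)
    # Assumed SMC rule: the detached lineage re-coalesces at rate 1 above u.
    return u + rng.expovariate(1.0)

def simulate_heights(n_sites, rho, seed=0):
    """Simulate tree heights along n_sites sites for two sequences."""
    rng = random.Random(seed)
    heights = [rng.expovariate(1.0)]  # first tree: pairwise TMRCA ~ Exp(1)
    for _ in range(n_sites - 1):
        heights.append(next_tree_height(heights[-1], rho, rng))
    return heights
```

For small ρ the simulated heights stay constant over long stretches and change only at recombination points, which is the correlation between nearby genealogies that the HMM exploits.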
Given that a recombination event happened, the conditional probability of a tree change from height s to height t (for a stable population size) is
$$q(t \mid s) = \begin{cases} \int_0^t \frac{1}{s}\, e^{-(t-u)}\, du, & \text{if } t \le s, \\ \int_0^s \frac{1}{s}\, e^{-(t-u)}\, du, & \text{if } t \ge s. \end{cases}$$
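Both integrals have a simple closed form, which makes it easy to check numerically that q(t|s) is a proper density in t. The sketch below evaluates the closed form and integrates it with a midpoint rule. The helper names and the value s = 0.8 are arbitrary choices for illustration.

```python
import math

def q_constant(t, s):
    """q(t|s) for constant population size, from the two-case integral."""
    if t <= s:
        # (1/s) * integral_0^t exp(-(t-u)) du
        return (1.0 - math.exp(-t)) / s
    # (1/s) * integral_0^s exp(-(t-u)) du
    return (math.exp(-(t - s)) - math.exp(-t)) / s

def integrate(f, a, b, n=20000):
    """Composite midpoint rule on [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

s = 0.8
total = integrate(lambda t: q_constant(t, s), 0.0, 40.0)  # close to 1
```

The two branches agree at t = s, and the total mass is 1 up to the numerical integration error and the negligible tail beyond the upper limit.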
1 Transitions between trees: what is the probability of a tree change from height s to height t?
Transition probabilities (with variable population size):
$$P(t \mid s) = \left(1 - e^{-\rho s}\right) q(t \mid s) + e^{-\rho s}\, \delta(t - s)$$
with
$$q(t \mid s) = \frac{1}{\lambda(t)} \int_0^{\min(s,t)} \frac{1}{s}\, e^{-\int_u^t \frac{dv}{\lambda(v)}}\, du,$$
where λ(t) = N(t)/N0 is the relative population size at time t and δ(·) is the Dirac delta function.
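The variable-size formula can be evaluated numerically once λ(t) and its cumulative hazard Λ(t) = ∫_0^t dv/λ(v) are available, since the inner integral equals Λ(t) − Λ(u). The sketch below uses a midpoint rule. The two-epoch history and all function names are hypothetical examples, not part of the original slides.

```python
import math

def q_variable(t, s, lam, cum_hazard, n=2000):
    """q(t|s) = 1/lam(t) * integral_0^min(s,t) (1/s) exp(-(L(t) - L(u))) du."""
    upper = min(s, t)
    h = upper / n
    total = 0.0
    for i in range(n):
        u = (i + 0.5) * h  # midpoint of the i-th sub-interval
        total += math.exp(-(cum_hazard(t) - cum_hazard(u)))
    return total * h / (s * lam(t))

# Hypothetical two-epoch history: relative size 0.2 for times older than 0.5.
def lam_two_epoch(t):
    return 1.0 if t < 0.5 else 0.2

def cum_two_epoch(t):
    return t if t < 0.5 else 0.5 + (t - 0.5) / 0.2

# Constant size (lambda = 1) recovers the stable-population formula.
q_const = q_variable(0.3, 0.8, lambda t: 1.0, lambda t: t)
q_two_epoch = q_variable(1.0, 0.8, lam_two_epoch, cum_two_epoch)
```

Setting λ(t) = 1 gives back the constant-size value (1 − e^{−t})/s for t ≤ s, which is a convenient sanity check before plugging in a more complex history.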
2 Emission probabilities: given a tree height t, is the site heterozygous (het) or homozygous (hom)?
Mutations are distributed along branches at rate θ = 2Nµ
→ CDF of the time to the most recent mutation between two sites:
$$H(t) = \int_0^t \theta e^{-\theta v}\, dv = 1 - e^{-\theta t}$$
Emission probabilities:
$$P(\text{Hom} \mid t) = e^{-\theta t}, \qquad P(\text{Het} \mid t) = 1 - e^{-\theta t}$$
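As a worked check of these emission probabilities, assume a constant population size, so that the pairwise TMRCA has density e^{-t} in coalescent units (an assumption that is not stated on this slide). Integrating out the unknown tree height then gives the marginal probability that a site is heterozygous:

```latex
\begin{aligned}
P(\text{Hom}) &= \int_0^\infty e^{-\theta t}\, e^{-t}\, dt = \frac{1}{1+\theta}, \\
P(\text{Het}) &= 1 - P(\text{Hom}) = \frac{\theta}{1+\theta} \approx \theta
\quad \text{for } \theta \ll 1 .
\end{aligned}
```

Under this assumption the expected per-site heterozygosity is close to θ, which is why the density of heterozygous sites carries information about the scaled mutation rate.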
The coalescent HMM: a generative model for two DNA sequences
Transition probabilities (with variable population size):
$$P(t \mid s) = \left(1 - e^{-\rho s}\right) q(t \mid s) + e^{-\rho s}\, \delta(t - s)$$
Emission probabilities:
$$P(\text{Hom} \mid t) = e^{-\theta t}, \qquad P(\text{Het} \mid t) = 1 - e^{-\theta t}$$
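Transitions and emissions together define a hidden Markov model whose likelihood can be computed with the forward algorithm once tree heights are discretised. The sketch below is a deliberately crude illustration under several assumptions: constant population size, a uniform time grid with one representative height per state, and rows renormalised to absorb the discretisation error. It is not the discretisation used by the PSMC software, and all names and default values are hypothetical.

```python
import math

def q_constant(t, s):
    """q(t|s) for constant population size (closed form of the two-case integral)."""
    if t <= s:
        return (1.0 - math.exp(-t)) / s
    return (math.exp(-(t - s)) - math.exp(-t)) / s

def build_hmm(rho, theta, n_states=20, t_max=8.0):
    """Discretise tree heights on a uniform grid; return (init, trans, emit_het)."""
    width = t_max / n_states
    times = [(k + 0.5) * width for k in range(n_states)]
    # Initial distribution: pairwise TMRCA density exp(-t), renormalised on the grid.
    init = [math.exp(-t) for t in times]
    norm = sum(init)
    init = [p / norm for p in init]
    trans = []
    for j, s in enumerate(times):
        stay = math.exp(-rho * s)  # no recombination: same height
        row = [q_constant(t, s) for t in times]
        norm = sum(row)
        row = [(1.0 - stay) * p / norm for p in row]
        row[j] += stay  # discrete stand-in for the Dirac delta term
        trans.append(row)
    emit_het = [1.0 - math.exp(-theta * t) for t in times]
    return init, trans, emit_het

def log_likelihood(obs, rho, theta):
    """Scaled forward algorithm; obs is a sequence of 0 (hom) and 1 (het)."""
    init, trans, emit_het = build_hmm(rho, theta)
    states = range(len(init))

    def emit(k, x):
        return emit_het[k] if x else 1.0 - emit_het[k]

    alpha = [init[k] * emit(k, obs[0]) for k in states]
    loglik = 0.0
    for x in obs[1:]:
        scale = sum(alpha)
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
        alpha = [sum(alpha[j] * trans[j][k] for j in states) * emit(k, x)
                 for k in states]
    return loglik + math.log(sum(alpha))

# Toy observation: a single heterozygous site inside a homozygous stretch.
obs = [0] * 50 + [1] + [0] * 50
example_ll = log_likelihood(obs, rho=0.0005, theta=0.001)
```

Maximising such a likelihood over θ, ρ and the parameters of λ(t) is the general principle behind coalescent HMM inference. A practical implementation would integrate the transition density over each time interval rather than use a single representative height.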
[Figure: posterior of the TMRCA from simulations]
The PSMC: Pairwise Sequentially Markovian Coalescent (Li & Durbin 2011)
The model:
• Coalescent with mutation and recombination
• Single panmictic isolated population (i.e. no population structure)
• Piecewise constant effective population size = "skyline" model
• Markov chain for the TMRCA based on the SMC / SMC'
• Estimation through a coalescent hidden Markov model
• Limited to two sequences (i.e. two haploid genomes, one diploid individual)
→ Not efficient for recent times
• Population size variation is modelled with a "skyline" = discrete changes between many phases of constant population size (ad hoc)
• Rich information about past population sizes compared with a few microsatellite loci!
[Figure: population size variability in great apes]
The "skyline" model (BEAST, Drummond et al.) is very attractive but has some drawbacks:
• Need to define the number of phases and their lengths
• Difficult to draw correct confidence intervals / credibility intervals around the maximum
Limits of the current PSMC:
• The ad hoc "skyline" model
• Need to fix some parameter values to get unscaled parameters → population sizes, times in generations
• Influence of sequencing errors → they break the spatial correlation along the genome and cut RoH
• No migration
The PSMC: tests using simulations (from Sheehan, Harris & Song 2013)
Simple simulation tests of the PSMC (done by S. Sheehan, PhD)
• Strong influence of the sample size for recent times (reminder: often limited to 2 sequences)
• Bias when the population size is low and stable? (→ CI?)
  Probably a problem with the "skyline" parametrization, e.g. the number and length of the different demographic phases
Conclusions
• Genomic data contain much more information than classical independent markers
• They often cannot be analyzed with previous methods, especially coalescent ones
• Methods that can analyze LD information are very promising
• But models are still too "simple" (but see CoalHMM with the IM model)
• These new methods need to be implemented in user-friendly software because they are difficult to run
• And they need to be clearly tested...