Modeling in population genetics - Raphael Leblois

Introduction

ARG

Methods

HMM

PSMC

Simuls

Master 2 BioStat module: Modeling in population genetics

Inferences of demographic history from whole genomes using approximations of the ancestral recombination graph

Raphaël Leblois, Centre de Biologie pour la Gestion des Populations (CBGP, UMR INRA)

January 2018

Conclusion

Context and typical biological question:
• Good genomic data:
  - Reference genome → "spatial" arrangement of polymorphisms (physical / genetic distance)
  - Or many long runs of sequence → ≥ 10 Mb
• Inference of "not too recent" demographic history:
  - Detailed variations of ancestral population size
  - Inference of gene flow / admixture during the speciation process

Demographic inferences in population genetics/genomics

Before the genomic era, inferences were based on:
• A low number of independent loci (e.g. 5-100 loci)
• Short DNA sequences with low polymorphism levels (a few kb, with 1-10 haplotypes)
• Fast-mutating DNA sequences, e.g. microsatellites, with high polymorphism levels (up to 50 alleles)

Now, we have increasing access to:
• Whole-genome assemblies for many species
• Or many long sequences (> a few Mb), with steadily improving genome assemblies for less-studied organisms

→ Genome-wide sequence data contain rich information about evolutionary processes

Demographic inferences in population genomics: what's really new?

1 - More polymorphisms (mutation events)
• More alleles in 10,000 SNPs than in 20 microsatellites
• Simple mutational process
But ascertainment bias is often substantial with SNPs, sequencing errors are frequent, and phasing haplotype data is problematic.

2 - With a good genome assembly, we now have access to the spatial arrangement of markers
→ Information about recombination events

→ Genome-wide sequence data contain rich information about evolutionary processes, which remains inaccessible with independent markers.

The coalescent with recombination: the ancestral recombination graph (ARG)

ARG = modeling recombination within or between genetic markers. Recombination events must be considered within long DNA sequences or between physically close SNPs.

The genealogical space to explore is more complex than for non-recombining loci: the genealogy is no longer a tree but a graph → much longer computation times, possibly intractable for long sequences and large sample sizes.

Several methods developed since 2007

Several recent inference methods use the linkage disequilibrium information present in genomic data, e.g. Mailund et al. 2011; Palamara et al. 2012; Harris & Nielsen 2013; MacLeod et al. 2013; Sheehan et al. 2013, but analyses are often limited to n = 2 sequences.

Two main approaches:
• Distributions of homozygous sequence lengths (runs of homozygosity, RoH; identity-by-state, IBS, tract lengths)
• Approximation of the coalescent with recombination: the sequential coalescent with hidden Markov models (PSMC, CoalHMM)
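As an illustration of the first approach, the sketch below extracts IBS tract lengths from a pairwise comparison of two aligned sequences. The 0/1 encoding (0 = identical site, 1 = difference) and the function name `ibs_tract_lengths` are assumptions made for this example; they are not taken from any of the cited methods.

```python
from collections import Counter


def ibs_tract_lengths(sites):
    """Return the lengths of maximal runs of identical sites (0)
    delimited by differences (1) or by the ends of the sequence."""
    lengths = []
    run = 0
    for s in sites:
        if s == 0:
            run += 1
        else:
            if run > 0:
                lengths.append(run)
            run = 0
    if run > 0:
        # Last tract, truncated by the end of the sequence.
        lengths.append(run)
    return lengths


example = [0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0]
tracts = ibs_tract_lengths(example)
print(tracts)            # [3, 2, 4]
print(Counter(tracts))   # empirical tract-length distribution
```

Tracts touching the ends of the sequence are truncated; a real analysis would probably treat them separately.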

Ideas behind the coalescent HMM

Approximation of the sequential coalescent with recombination, based on the following ideas:
• Effects of population size variation on the coalescent
• ...

• The population size at time t is inversely proportional to the rate of coalescence at time t
• Coalescence times are unknown, but are correlated with the number of mutations between two leaves (and thus with allelic frequencies)
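The first idea can be checked with a small simulation. This is a minimal sketch, assuming a constant relative population size λ = N/N0 and a time scale on which two lineages coalesce at rate 1/λ; the function names are illustrative.

```python
import random


def sample_pairwise_tmrca(relative_size, rng):
    """Draw one pairwise coalescence time when the coalescence
    rate is 1 / relative_size (constant population size)."""
    return rng.expovariate(1.0 / relative_size)


def mean_tmrca(relative_size, n_draws=20000, seed=1):
    """Monte Carlo estimate of the mean pairwise coalescence time."""
    rng = random.Random(seed)
    total = sum(sample_pairwise_tmrca(relative_size, rng) for _ in range(n_draws))
    return total / n_draws


# Doubling the population size halves the coalescence rate,
# so the mean coalescence time roughly doubles.
print(mean_tmrca(1.0), mean_tmrca(2.0))
```

Under these assumptions the mean coalescence time is λ, which is why a distribution of coalescence times along the genome carries information about past population sizes.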

• Genealogies of nearby loci are correlated with each other
• Use mutations to learn about the ancestral recombination graph

Ideas behind the coalescent HMM: analysis of two long DNA sequences

The coalescent HMM: a generative model for two DNA sequences

1. Transitions between trees: what is the probability of a tree change from height s to height t between two successive sites (nucleotides)?
2. Actual observed sequence: given a tree height t, what is the probability of being heterozygous or homozygous at one site between the two sequences?

The Sequential Markov Coalescent (SMC)

Simplifications of the coalescent with recombination between two sequences (SMC/SMC'):
1. Consider a single recombination event between two sites
2. Consider that the coalescent along the sequence is Markovian (i.e. the tree at a site only depends on the tree at the previous site):

$$P(t_l \mid t_{l-1}, t_{l-2}, \ldots, t_1) = P(t_l \mid t_{l-1})$$

Transitions between trees: from a tree of height s to a tree of height t

• Wiuf and Hein (1999): coalescent with recombination along two DNA sequences
• McVean and Cardin (2005): approximation of the coalescent with two possible recombination events, SMC = (a) + (b)
• Marjoram and Wall (2006): approximation with three possible recombination events, SMC' = (a) + (b) + (c)

Transition probabilities are "relatively" simple to compute. Recombination events follow an exponential distribution with rate ρ = 2Nr, so that:
• Probability of a recombination event between 0 and s: $1 - e^{-\rho s}$
• Probability of no recombination event between 0 and s: $e^{-\rho s}$
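The two probabilities above translate directly into code. This is a minimal sketch with illustrative function names. The last helper is an addition not found in the slides: if the tree height stayed at s, the number of site-to-site steps until the first recombination would be geometric.

```python
import math


def prob_recombination(rho, s):
    """Probability of a recombination event on a tree of height s."""
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for small x.
    return -math.expm1(-rho * s)


def prob_no_recombination(rho, s):
    """Probability of no recombination event on a tree of height s."""
    return math.exp(-rho * s)


def expected_steps_until_recombination(rho, s):
    """Mean of the geometric number of steps until the first recombination,
    assuming the tree height stays equal to s (illustrative helper)."""
    return 1.0 / prob_recombination(rho, s)


rho, s = 0.001, 1.0   # hypothetical per-site scaled rate and tree height
print(prob_recombination(rho, s), prob_no_recombination(rho, s))
print(expected_steps_until_recombination(rho, s))
```

Taller trees (larger s) recombine more often, so old segments of the genome tend to be shorter than recent ones.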

Given that a recombination event happened, its time u is uniformly distributed between 0 and s → Pr(Rec. at u | Rec.) = 1/s

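Given that a recombination event happened, the conditional probability of a tree change from height s to height t (for a stable population size) is

$$q(t \mid s) = \begin{cases} \displaystyle\int_0^{t} \frac{1}{s}\, e^{-(t-u)}\, du, & \text{if } t \leqslant s, \\[1ex] \displaystyle\int_0^{s} \frac{1}{s}\, e^{-(t-u)}\, du, & \text{if } t \geqslant s. \end{cases}$$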
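Both branches of q(t|s) share the integrand and differ only in the upper limit min(s, t), so the integral has the closed form (e^{-(t - min(s,t))} - e^{-t}) / s. The sketch below evaluates it and checks numerically that q(·|s) integrates to 1 over t; the function names and the trapezoidal check are illustrative choices.

```python
import math


def q_constant(t, s):
    """q(t|s) for a constant population size:
    (1/s) * integral from 0 to min(s, t) of exp(-(t - u)) du."""
    m = min(s, t)
    return (math.exp(-(t - m)) - math.exp(-t)) / s


def integrate_q(s, t_max=40.0, n_steps=40000):
    """Trapezoidal integral of q(t|s) over t in [0, t_max]."""
    h = t_max / n_steps
    total = 0.5 * (q_constant(0.0, s) + q_constant(t_max, s))
    total += sum(q_constant(k * h, s) for k in range(1, n_steps))
    return total * h


print(q_constant(0.5, 1.0))   # t <= s branch: (1 - exp(-0.5)) / 1
print(integrate_q(1.0))       # close to 1: q(.|s) is a probability density
```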

1. Transitions between trees: what is the probability of a tree change from height s to height t?
2. Emission probabilities: given a tree height t, is the site heterozygous (het) or homozygous (hom)?

Transition probabilities (with variable population size):

$$P(t \mid s) = (1 - e^{-\rho s})\, q(t \mid s) + e^{-\rho s}\, \delta(t - s)$$

with

$$q(t \mid s) = \frac{1}{\lambda(t)} \int_0^{\min(s,t)} \frac{1}{s}\, e^{-\int_u^t \frac{dv}{\lambda(v)}}\, du,$$

where $\lambda(t) = N(t)/N_0$ is the relative population size at time t and $\delta(\cdot)$ is the Dirac delta function.
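The variable-size q(t|s) can be evaluated numerically once λ(t) is specified. The sketch below assumes a piecewise-constant λ(t) and a midpoint rule for the outer integral; `make_skyline`, `q_variable`, and the epoch values are hypothetical names and numbers chosen for illustration. With λ ≡ 1 it reduces to the constant-size formula.

```python
import math
from bisect import bisect_right


def make_skyline(breaks, sizes):
    """Piecewise-constant lambda(t). `breaks` are increasing epoch start
    times beginning with 0.0; `sizes[i]` is the relative size in epoch i."""
    def lam(t):
        return sizes[bisect_right(breaks, t) - 1]

    def cumulative_rate(t):
        # Integral of 1 / lambda(v) for v from 0 to t.
        total = 0.0
        for i, start in enumerate(breaks):
            if t <= start:
                break
            end = breaks[i + 1] if i + 1 < len(breaks) else math.inf
            total += (min(t, end) - start) / sizes[i]
        return total

    return lam, cumulative_rate


def q_variable(t, s, lam, cumulative_rate, n_steps=2000):
    """Midpoint-rule evaluation of
    (1/lambda(t)) * int_0^min(s,t) (1/s) exp(-int_u^t dv/lambda(v)) du."""
    upper = min(s, t)
    h = upper / n_steps
    rate_to_t = cumulative_rate(t)
    integral = 0.0
    for k in range(n_steps):
        u = (k + 0.5) * h
        integral += math.exp(-(rate_to_t - cumulative_rate(u)))
    return integral * h / (s * lam(t))


# Constant size: matches the closed form (1 - exp(-t)) / s for t <= s.
lam, cum = make_skyline([0.0], [1.0])
print(q_variable(0.5, 1.0, lam, cum))

# Hypothetical two-epoch history: relative size 2 until t = 1, then 1.
lam2, cum2 = make_skyline([0.0, 1.0], [2.0, 1.0])
print(q_variable(0.5, 0.5, lam2, cum2))
```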

Emission probabilities: given a tree height t, is the site het or hom?

Mutations are distributed along branches at rate θ = 2Nµ → CDF of the time to the most recent mutation between two sites:

$$H(t) = \int_0^t \theta e^{-\theta v}\, dv = 1 - e^{-\theta t}$$

Emission probabilities:

$$P(\text{Hom} \mid t) = e^{-\theta t}, \qquad P(\text{Het} \mid t) = 1 - e^{-\theta t}$$

Summary of the coalescent HMM for two DNA sequences:

Transition probabilities (with variable population size): $P(t \mid s) = (1 - e^{-\rho s})\, q(t \mid s) + e^{-\rho s}\, \delta(t - s)$

Emission probabilities: $P(\text{Hom} \mid t) = e^{-\theta t}$, $P(\text{Het} \mid t) = 1 - e^{-\theta t}$
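Read as a generative model, the transition and emission probabilities above give a simple simulator for a pair of sequences. This is a sketch under stated assumptions: constant population size (λ ≡ 1), an initial tree height drawn from an Exp(1) distribution, and, after a recombination at a time u uniform on (0, s), a new height u + Exp(1), which is the sampling form of the constant-size q(t|s). The function name and parameter values are hypothetical.

```python
import math
import random


def simulate_pair(n_sites, rho, theta, seed=0):
    """Simulate tree heights and het/hom states (1/0) along two sequences."""
    rng = random.Random(seed)
    t = rng.expovariate(1.0)              # height of the first tree
    heights, sites = [], []
    for _ in range(n_sites):
        heights.append(t)
        # Emission: heterozygous with probability 1 - exp(-theta * t).
        p_het = 1.0 - math.exp(-theta * t)
        sites.append(1 if rng.random() < p_het else 0)
        # Transition: recombination with probability 1 - exp(-rho * t).
        if rng.random() < 1.0 - math.exp(-rho * t):
            u = rng.uniform(0.0, t)        # recombination time, uniform on (0, t)
            t = u + rng.expovariate(1.0)   # re-coalescence above u
    return heights, sites


heights, sites = simulate_pair(n_sites=10000, rho=0.001, theta=0.001, seed=42)
print(sum(sites), "heterozygous sites out of", len(sites))
```

Runs of identical sites produced this way are longer where the tree is shallow, which is the signal the HMM exploits.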

Figure: posterior of the TMRCA from simulations (coalescent HMM for two DNA sequences)
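A posterior of the TMRCA along the sequence can be obtained with the forward-backward algorithm once tree height is discretized. The sketch below assumes a constant population size, an ad hoc equal-width time grid (not the discretization used by PSMC), and rows renormalized after discretization; all names and parameter values are hypothetical.

```python
import math


def q_constant(t, s):
    """Constant-size q(t|s) = (1/s) * int_0^min(s,t) exp(-(t-u)) du."""
    m = min(s, t)
    return (math.exp(-(t - m)) - math.exp(-t)) / s


def build_hmm(n_states, rho, theta, t_max=8.0):
    """Discretize tree height into equal-width bins (illustrative grid)."""
    width = t_max / n_states
    grid = [(k + 0.5) * width for k in range(n_states)]
    prior = [math.exp(-t) for t in grid]          # Exp(1) prior on height
    z = sum(prior)
    init = [p / z for p in prior]
    trans = []
    for i, s in enumerate(grid):
        p_rec = 1.0 - math.exp(-rho * s)
        row = [q_constant(t, s) for t in grid]
        z = sum(row)
        row = [p_rec * x / z for x in row]        # recombination: redraw height
        row[i] += 1.0 - p_rec                     # no recombination: same height
        trans.append(row)
    # emit[j] = (P(hom | t_j), P(het | t_j)); observations are 0 = hom, 1 = het.
    emit = [(math.exp(-theta * t), 1.0 - math.exp(-theta * t)) for t in grid]
    return grid, init, trans, emit


def posterior_tmrca(obs, init, trans, emit):
    """Scaled forward-backward: per-site posterior over height bins
    and the log-likelihood of the observed sequence."""
    n, k = len(obs), len(init)
    fwd, scale = [], []
    for i in range(n):
        if i == 0:
            raw = [init[j] * emit[j][obs[0]] for j in range(k)]
        else:
            raw = [emit[j][obs[i]] * sum(fwd[-1][a] * trans[a][j] for a in range(k))
                   for j in range(k)]
        c = sum(raw)
        scale.append(c)
        fwd.append([x / c for x in raw])
    bwd = [[1.0] * k for _ in range(n)]
    for i in range(n - 2, -1, -1):
        bwd[i] = [sum(trans[j][b] * emit[b][obs[i + 1]] * bwd[i + 1][b]
                      for b in range(k)) / scale[i + 1] for j in range(k)]
    post = []
    for i in range(n):
        w = [fwd[i][j] * bwd[i][j] for j in range(k)]
        z = sum(w)
        post.append([x / z for x in w])
    return post, sum(math.log(c) for c in scale)


grid, init, trans, emit = build_hmm(n_states=10, rho=0.01, theta=0.01)
obs = [0] * 20 + [1] + [0] * 20
post, loglik = posterior_tmrca(obs, init, trans, emit)
posterior_mean = [sum(p * t for p, t in zip(row, grid)) for row in post]
print(loglik, posterior_mean[20])
```

In this sketch a heterozygous site shifts the posterior toward taller trees at that position, and the shift decays with distance along the sequence at a rate set by ρ.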

The PSMC: Pairwise Sequentially Markovian Coalescent (Li & Durbin 2011)

The model:
• Coalescent with mutation and recombination
• Single panmictic isolated population (i.e. no population structure)
• Piecewise-constant effective population size = "skyline" model

• Markov chain for the TMRCA based on the SMC / SMC'
• Estimation through a coalescent hidden Markov model
• Limited to two sequences (i.e. two haploid genomes, or one diploid individual) → not efficient for recent times

• Population size variation is modeled with a "skyline" = discrete changes between many phases of constant population size (ad hoc)
• Rich information about past population sizes compared to a few microsatellite loci!

Figure: population size variability in Great Apes (PSMC, Li & Durbin 2011)

The "skyline" model (BEAST, Drummond et al.) is very attractive but has some drawbacks:
• Need to define the number of phases and their lengths
• Difficult to draw correct confidence intervals / credibility intervals around the maximum

Limits of the current PSMC:
• The ad hoc "skyline" model
• Need to fix some parameter values (e.g. the mutation rate) to convert scaled parameters into population sizes and times in generations
• Influence of sequencing errors → they break the spatial correlation along the genome and cut RoH
• No migration
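The rescaling step mentioned in the second limit can be sketched as follows, using this document's conventions rather than those of the PSMC software. With θ = 2N0µ and P(Het|t) = 1 − e^{−θt}, two branches of height t accumulate 2µ·(t·N0) mutations per site, which implies that t is measured in units of N0 generations. That reading, the function name, and the numerical values are assumptions; the actual PSMC program uses its own scaling, which should be checked in its documentation.

```python
def rescale(theta, mu, scaled_times, relative_sizes):
    """Convert scaled HMM parameters to natural units, assuming
    theta = 2 * N0 * mu and time measured in units of N0 generations."""
    n0 = theta / (2.0 * mu)
    times_in_generations = [t * n0 for t in scaled_times]
    population_sizes = [lam * n0 for lam in relative_sizes]
    return n0, times_in_generations, population_sizes


# Hypothetical values: per-site theta, and a fixed per-site,
# per-generation mutation rate supplied by the user.
n0, times, sizes = rescale(theta=0.001, mu=1e-8,
                           scaled_times=[0.1, 1.0, 2.0],
                           relative_sizes=[0.5, 1.0, 2.0])
print(n0, times, sizes)
```

Any error in the assumed mutation rate propagates linearly to every population size and time, which is why this dependency is listed as a limit.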

The PSMC: tests using simulations (from Sheehan, Harris & Song 2013)

Simple simulation tests of the PSMC (done by S. Sheehan, PhD):
• Strong influence of the sample size for recent times (reminder: often limited to 2 sequences)
• Bias when the population size is low and stable? (→ CI?)
  → probably a problem with the "skyline" parametrization, e.g. the number and length of the different demographic phases

Conclusions
• Genomic data contain much more information than classical independent markers
• They often cannot be analyzed with previous methods, especially coalescent-based ones
• Methods that can analyze LD information are very promising
• But models are still too "simple" (but see CoalHMM with the IM model)
• These new methods need to be implemented in user-friendly software because they are difficult to run
• And they need to be thoroughly tested...