Molecular Ecology (2013) 22, 3766–3779

doi: 10.1111/mec.12360

Estimation of population allele frequencies from next-generation sequencing data: pool- versus individual-based genotyping

MATHIEU GAUTIER,*1 JULIEN FOUCAUD,*1 KARIM GHARBI,† TIMOTHÉE CÉZARD,† MAXIME GALAN,* ANNE LOISEAU,* MARIAN THOMSON,† PIERRE PUDLO,*‡ CAROLE KERDELHUÉ* and ARNAUD ESTOUP*

*INRA, UMR CBGP (INRA – IRD – Cirad – Montpellier SupAgro), Campus international de Baillarguet, CS 30016, F-34988, Montferrier-sur-Lez, France, †The GenePool, School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3JT, UK, ‡I3M, UMR CNRS 5149, Université Montpellier 2, F-34095 Montpellier, France

Abstract

Molecular markers produced by next-generation sequencing (NGS) technologies are revolutionizing genetic research. However, the costs of analysing large numbers of individual genomes remain prohibitive for most population genetics studies. Here, we present results based on mathematical derivations showing that, under many realistic experimental designs, NGS of DNA pools from diploid individuals allows allele frequencies at single nucleotide polymorphisms (SNPs) to be estimated with at least the same accuracy as individual-based analyses, for considerably lower library construction and sequencing efforts. These findings remain true when taking into account the possibility of substantially unequal contributions of each individual to the final pool of sequence reads. We propose the intuitive notion of effective pool size to account for unequal pooling and derive a Bayesian hierarchical model to estimate this parameter directly from the data. We provide a user-friendly application assessing the accuracy of allele frequency estimation from both pool- and individual-based NGS population data under various sampling, sequencing depth and experimental error designs. We illustrate our findings with theoretical examples and real data sets corresponding to SNP loci obtained using restriction site–associated DNA (RAD) sequencing in pool- and individual-based experiments carried out on the same population of the pine processionary moth (Thaumetopoea pityocampa). NGS of DNA pools might not be optimal for all types of studies but provides a cost-effective approach for estimating allele frequencies for very large numbers of SNPs. It thus allows comparison of genome-wide patterns of genetic variation for large numbers of individuals in multiple populations.

Keywords: coverage, estimation of allele frequencies, molecular markers, next-generation sequencing, pooled samples, population genetics

Received 14 March 2013; revision received 15 April 2013; accepted 16 April 2013

Correspondence: Mathieu Gautier, Fax: +33 (0)4 99 62 33 45; E-mail: [email protected]
1 These authors contributed equally to this work
© 2013 John Wiley & Sons Ltd

Introduction

One prospect of current biology is that molecular data will help us reveal the complex demographic and adaptive processes that act on the natural populations of both model and nonmodel species. Over the last three decades, we have seen a continuous turnover of molecular markers used in genetic research, including population genetics analysis. With the advent of next-generation sequencing (NGS) technologies, it is now possible to sequence whole genomes at dramatically lower costs than traditional Sanger sequencing (reviewed in Davey et al. 2012; Metzker 2009). However, the use of whole-genome sequencing to sample population-level diversity remains limited to a small number of species with well-characterized genomes and is currently prohibitively expensive for most laboratories involved in population genetics studies due to the large number of individuals needed for such studies.

A first significant step towards enabling population-level analysis of genome-wide variation relies on the development of NGS approaches based on the sequencing of reduced representation libraries (RRL), which allow a less uniform but higher sequencing depth along the genome than whole-genome technologies (Van Tassell et al. 2008; Luca et al. 2011). Deep sequencing of restriction site–associated DNA (RAD-seq), for example, borrows from such strategies (Baird et al. 2008). The technique is increasingly popular, especially in nonmodel species (Davey & Blaxter 2011; Davey et al. 2012), with successful applications across a wide range of organisms and research areas, including population genomics studies (e.g. Hohenlohe et al. 2010; Bruneaux et al. 2013; Hecht et al. 2013). While significantly cheaper than whole-genome resequencing, traditional RRL designs using barcoded individuals (i.e. individual-based experiments) still represent a nontrivial cost, especially when data on large numbers of individuals are needed, as in the majority of population genetics studies.

The most cost-effective approach to estimate allele frequencies at massive numbers of SNP loci for large numbers of individuals in multiple populations relies on massively parallel sequencing of pools of individual DNAs (hereafter referred to as pool-based experiments) using either reduced representation (see Perez-Enciso & Ferretti 2010; and references therein) or genomic libraries (for genomes of relatively short sizes; e.g. Boitard et al. 2012; Turner et al. 2011; Zhu et al. 2012). Sequencing large population DNA pools keeps the number of redundant sequence reads low and hence provides an economic alternative to the sequencing of individual genomes. However, special care is needed to control for experimental errors, especially the possibility of substantially unequal individual contributions to the pool of sequence reads.

Several recent studies have illustrated the potential of pool- versus individual-based experimental designs for identifying and quantifying SNP variants in small genomes or small genomic regions as well as in whole eukaryotic genomes (see references reviewed in the introduction section of Zhu et al. 2012). For instance, Zhu et al. (2012) experimentally tested the accuracy of resequencing pools of largely isogenic, individually sequenced Drosophila melanogaster strains. They showed that pool-based sequencing provided a faithful estimate of population allele frequency and was a reliable means of discovering novel SNPs with low false positive rates. The same study, nevertheless, indicated that a sufficient number of strains should be merged in the sequenced pool. Futschik & Schlötterer (2010) provided some theoretical formalization of the comparison of pool- versus individual-based experiments to estimate allele frequencies at a genome-wide scale. However, this latter study mainly focused on the case of haploid individuals and lacked operational tools to describe the interplay between the key parameters of pool- and individual-based experimental designs.

Here, we present mathematical derivations of the accuracy of the estimator of allele frequency from pool- or individual-based sequencing data sets for diploid species and its dependency on key parameters such as sampling size, sequencing depth (or coverage) and the possibility of unequal contributions of individual DNA samples to the final pool of sequence reads. We propose the intuitive notion of ‘effective pool size’ to capture this type of experimental error and derive a Bayesian hierarchical model to estimate this parameter directly from the data. We also provide a user-friendly application implementing the different mathematical derivations (see Data accessibility section), which allows NGS practitioners to assess and compare the accuracy of allele frequency estimation expected from various sampling, coverage and experimental error designs. We illustrate our findings with theoretical examples and the analyses of real data sets obtained using both pool- and individual-based RAD-seq experiments for the same population of the pine processionary moth (PPM) Thaumetopoea pityocampa.

Materials and methods

Key parameters

In the individual-based sequencing strategy, most of the population sampling error comes from the selection of individuals used for DNA sequencing. In the pool-based approach, this sampling error can be substantially reduced by including a large number of individuals in the pool. In most cases, however, the two approaches will yield different depths of coverage for each individual chromosome. The effects of this source of variation on allele frequency estimation depend on three main parameters: (i) the number of diploid individuals (and hence chromosomes) merged in the sequenced pool, (ii) the sequencing effort allocated to the pool (i.e. the sequencing depth or coverage) and (iii) the possibility of unequal contributions of each individual genome to the final set of sequence reads. The comparison of allele frequency estimation obtained from pool- and individual-based experimental designs hence depends on the sampling size and coverage chosen in both approaches. The mathematical derivations of the variance of the estimator of allele frequency from pool- or individual-based data sets, described below, therefore logically depend on these two key parameters. In a second step, we also incorporated the possibility of unequal contribution of individual genomes to the pool data set. All mathematical derivations are given below and detailed in Data S1 (Supporting information). We have implemented our mathematical derivations in a user-friendly application named PIFs (see Data accessibility section).
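The joint effect of sampling size and coverage on the variance of a pool-based estimate can be illustrated with a small simulation. The sketch below is not the derivation given in Data S1. It assumes a simplified two-stage binomial model: 2n chromosomes are sampled from the population, then a fixed number C of reads is drawn with replacement from the pool, with equal individual contributions. The function names, the parameter values and the closed-form variance p(1 − p)[1/(2n) + (1 − 1/(2n))/C], which follows from the law of total variance under that assumption, are illustrative only.

```python
import random
from statistics import pvariance


def simulate_pool_estimate(p, n_diploid, depth, rng):
    """Simulate one pool experiment and return the estimated allele frequency.

    Assumed model (illustrative): binomial sampling of 2n chromosomes,
    then binomial sampling of `depth` reads from the pool.
    """
    n_chrom = 2 * n_diploid
    # Stage 1: sample the chromosomes that make up the pool.
    pool_count = sum(rng.random() < p for _ in range(n_chrom))
    pool_freq = pool_count / n_chrom
    # Stage 2: each read copies a chromosome drawn uniformly from the pool.
    ref_reads = sum(rng.random() < pool_freq for _ in range(depth))
    return ref_reads / depth


def theoretical_variance(p, n_diploid, depth):
    """Variance of the estimate under the two-stage binomial assumption."""
    n_chrom = 2 * n_diploid
    return p * (1 - p) * (1 / n_chrom + (1 - 1 / n_chrom) / depth)


rng = random.Random(42)  # fixed seed so the run is reproducible
p, n_diploid, depth = 0.3, 30, 50  # hypothetical design
estimates = [simulate_pool_estimate(p, n_diploid, depth, rng) for _ in range(20000)]
simulated = pvariance(estimates)
expected = theoretical_variance(p, n_diploid, depth)
print(f"simulated variance: {simulated:.5f}, closed form: {expected:.5f}")
```

Under this simplified model, increasing the pool size shrinks only the first term and increasing the coverage shrinks only the second, which is why both parameters have to be considered together when comparing designs.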

Expectation and variance of the estimator of allele frequency for different sequencing experimental designs

Let p represent the population frequency of a reference allele at a SNP locus. We derive the expectation and variance of estimators of p obtained for three possible sequencing experimental designs:

1 Sequencing of $n_h$ haploid individuals (or chromosomes). We consider

$$\hat{p}_h = \frac{1}{k_h} \sum_{i=1}^{k_h} \frac{c_i^{(h)}}{r_i^{(h)}}$$

as an estimator of p where, for each individual i among the $k_h \le n_h$ ones yielding at least one read, $r_i^{(h)}$ represents the total number of observed sequence reads ($r_i^{(h)} > 0$) and $c_i^{(h)}$ (0 […]

[…] > 0.99 when only highly covered SNPs were considered. This trend was even more obvious when computing the median of the absolute differences between individual- and pool-based estimates of allele frequencies, which ranged from 0.067 for the lowest individual and pool coverage (1X–6X and 10X–50X, respectively) to 0.007 for the highest individual and pool coverage (>20X and >200X, respectively). A summary representation of this trend is presented in Fig. S1 (Supporting information). The re-analyses of the data applying various minor allele frequency (MAF) thresholds indicated that these results did not depend on the large fraction of allele frequency estimations close to zero or one (results not shown).

Our comparative analysis of individual-based and pool-based RAD data sets illustrates a particular feature of the RAD-seq technique. Figure 3 shows that the coverage of some particular SNPs is low or high independently of the experimental method used to cover these loci (individual- or pool-based strategies). This translates into a surprisingly high correlation value of 0.99 between individual- and pool-based coverage of individual SNPs. Such a result is consistent with the strong positive relationship between the length of the restriction fragment and its coverage recently demonstrated by Davey et al. (2013) at RAD markers. SNPs located in longer fragments are thus covered more deeply whatever the sequencing strategy, leading to the observed correlation. More generally, this illustrates the effect of the genomic context on coverage, a well-known feature of NGS (not only RAD) data (e.g. Davey et al. 2013).
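The two agreement statistics used in this comparison, the correlation coefficient r and the median absolute difference d between individual- and pool-based estimates, can be sketched as follows. This is a minimal illustration, not the authors' analysis script: the function names and the toy frequencies are hypothetical, d is scaled by 1000 as in Fig. 3, and the statistics are returned only when the number of SNPs exceeds a threshold (100 in Fig. 3, lowered here for the toy data).

```python
from statistics import median


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def agreement(ind_freqs, pool_freqs, min_snps=100):
    """Return (r, d, n) for one coverage class.

    d is the median absolute difference multiplied by 1000.
    r and d are None when n does not exceed min_snps.
    """
    n = len(ind_freqs)
    if n <= min_snps:
        return None, None, n
    r = pearson_r(ind_freqs, pool_freqs)
    d = 1000 * median(abs(a - b) for a, b in zip(ind_freqs, pool_freqs))
    return r, d, n


# Hypothetical allele frequency estimates for four SNPs in one coverage class.
ind = [0.10, 0.50, 0.90, 0.25]
pool = [0.12, 0.45, 0.95, 0.25]
r, d, n = agreement(ind, pool, min_snps=3)
print(f"r = {r:.4f}, d = {d:.2f}, n = {n}")
```

In practice the function would be applied separately to the SNPs falling in each combination of individual and pool coverage classes.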

Estimation of effective pool sizes and experimental errors on a real data set

As illustrated above, unequal contributions of individuals to the final set of sequence reads, which we summarized with the notion of effective pool size ne (and experimental error e), might lead to increased imprecision in allele frequency estimates and related statistics. To the best of our knowledge, evaluation of the magnitude of such an experimental error on real data has never been reported. We used a Bayesian hierarchical model to estimate ne for each of the ten pool replicates (identified using different barcodes) of the PPM population genotyped at RAD markers. We first evaluated the estimation procedure using simulated and hence controlled data sets mimicking the PPM data as described in the Materials and methods section. Table 1 shows that, even for a small number of SNPs, the model generally resulted in relatively good estimates

[Figure 3: 5 × 5 grid of scatter plots. x-axis: allelic frequency estimated from pool (0 to 1), one column per pool coverage class (10X–50X, 50X–100X, 100X–150X, 150X–200X, >200X). y-axis: allelic frequency estimated from individuals (0 to 1), one row per individual coverage class (1X–6X, 6X–10X, 10X–15X, 15X–20X, >20X). Each panel reports r, d and n.]

Fig. 3 Allele frequency estimates from pooling and individual experimental designs at RAD markers in a pine processionary moth population. Twenty individuals and one pool of 30 individuals from the same population were RAD sequenced on a single Illumina lane. Allele frequency estimations were drawn from 49 597 SNPs scored using both pooling and individual experimental designs (see main text for details and Data S3, Supporting information for allele count data). The x-axis and y-axis categorizations correspond to pooling and individual coverage values of SNPs, respectively. Correlation coefficients (r) and median of (absolute) difference (×1000) between individual and pool-based estimations of allele frequencies (d) were computed when the number of SNPs (n) was above 100.

(here posterior mean) of ne (and e) for the different simulated pools. In particular, the estimates provide a good picture of the effective pool sizes and experimental errors across the different pools with low to moderate relative mean square errors (RMSE). We found, however, that ne (respectively e) tended to be slightly overestimated (respectively underestimated) except when the true values were close to the boundaries (i.e. e = 0 and ne = nd). Increasing the number of SNPs decreased RMSE values slightly to moderately and narrowed the intervals containing the estimates (posterior means) across the 500 simulated data sets, leading to
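The link between the experimental error e and the effective pool size ne can be made concrete with a short sketch. The definition is not restated in this part of the text, so the relation used below, ne = nd / (1 + e²) with e read as the coefficient of variation of individual contributions, is an assumption: it was chosen because it reproduces the two pairs quoted for the simulated pools (e = 0 with ne = nd = 30, and e = 200% with ne = 6). The function names and the contribution weights are hypothetical.

```python
def effective_size_from_error(n_diploid, e):
    """Assumed relation: ne = nd / (1 + e^2), e being a coefficient of variation."""
    return n_diploid / (1 + e * e)


def coefficient_of_variation(weights):
    """Population standard deviation of the weights divided by their mean."""
    n = len(weights)
    mean = sum(weights) / n
    variance = sum((w - mean) ** 2 for w in weights) / n
    return variance ** 0.5 / mean


def effective_size_from_weights(weights):
    """Effective number of equally contributing individuals: (sum w)^2 / sum w^2."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


# Hypothetical pool of 30 individuals in which five contribute ten times more DNA.
weights = [10.0] * 5 + [1.0] * 25
e = coefficient_of_variation(weights)
ne_from_weights = effective_size_from_weights(weights)
ne_from_error = effective_size_from_error(len(weights), e)
print(f"e = {e:.3f}, ne = {ne_from_weights:.2f} (from weights), {ne_from_error:.2f} (from e)")
```

Both routes give the same value because (Σw)² / Σw² equals n / (1 + CV²) algebraically, so under this assumption a pool with strongly unequal contributions behaves like a smaller pool of equally contributing individuals.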

Each simulated data set consisted of allele read counts generated at nsnp bi-allelic SNPs (nsnp = 500, 1000 or 2000) for four replicates of a pool of nd = 30 diploid individuals with experimental error e varying from 0% (ne = nd = 30) to 200% (ne = 6). To further mimic the real pine processionary moth data set, we used the total SNP read counts observed for randomly chosen replicates (with duplicates filtered); see the main text and Table 2. The simulated population frequencies for the reference allele at each SNP were sampled from a uniform distribution with lower and upper bounds equal to 0.01 and 0.99, respectively. For a given SNP, read counts were then simulated in each replicate under the model incorporating unequal individual contributions to the pool reads (see the main text, Data S1 and S2, Supporting information). Finally, as for the real data set (see Table 2), in each simulated data set, those SNPs displaying