Received: 24 January 2018 | Revised: 14 June 2018 | Accepted: 25 June 2018

DOI: 10.1111/1755-0998.12926

RESOURCE ARTICLE

Supervised machine learning outperforms taxonomy‐based environmental DNA metabarcoding applied to biomonitoring

Tristan Cordier1 | Dominik Forster2 | Yoann Dufresne1,3 | Catarina I. M. Martins4 | Thorsten Stoeck2 | Jan Pawlowski1,5

1 Department of Genetics and Evolution, University of Geneva, Geneva, Switzerland
2 Ecology Group, University of Kaiserslautern, Kaiserslautern, Germany
3 Institut Pasteur – Hub of Bioinformatics and Biostatistics – C3BI, USR 3756 IP CNRS, Paris, France
4 Marine Harvest ASA, Bergen, Norway
5 ID-Gene ecodiagnostics, Ltd, Plan-les-Ouates, Switzerland

Correspondence
Tristan Cordier, Department of Genetics and Evolution, University of Geneva, 1211 Geneva, Switzerland. Email: [email protected]

Funding information
Swiss Network for International Studies, Grant/Award Number: 316030_150817; Swiss National Science Foundation, Grant/Award Number: 31003A_179125; Deutsche Forschungsgemeinschaft, Grant/Award Number: STO414/15-1; Carl Zeiss Foundation; NVIDIA GPU grant program

Abstract
Biodiversity monitoring is the standard for environmental impact assessment of anthropogenic activities. Several recent studies showed that high‐throughput amplicon sequencing of environmental DNA (eDNA metabarcoding) could overcome many limitations of the traditional morphotaxonomy‐based bioassessment. Recently, we demonstrated that supervised machine learning (SML) can be used to predict accurate biotic index values from eDNA metabarcoding data, regardless of the taxonomic affiliation of the sequences. However, it is unknown to what extent the accuracy of such models depends on the taxonomic resolution of molecular markers, or how SML compares with metabarcoding approaches targeting well‐established bioindicator species. In this study, we address these issues by training predictive models upon five different ribosomal bacterial and eukaryotic markers and measuring their performance in assessing the environmental impact of marine aquaculture on independent data sets. Our results show that all tested markers yield accurate predictive models and that they all outperform the assessment relying solely on taxonomically assigned sequences. Remarkably, we did not find any significant difference in the performance of the models built using universal eukaryotic or prokaryotic markers. Using any molecular marker with a taxonomic range broad enough to comprise different potential bioindicator taxa, the SML approach can overcome the limits of taxonomy‐based eDNA bioassessment.

KEYWORDS
biomonitoring, biotic indices, environmental DNA, predictive models, supervised machine learning

1 | INTRODUCTION

Biodiversity monitoring is widely used for the environmental impact assessment of anthropogenic activities. Traditionally, in marine ecosystems, such impacts are assessed through the inventory of benthic macro‐invertebrates, which involves the sorting and the morphotaxonomic identification of thousands of specimens for a single site (Borja, Ranasinghe, & Weisberg, 2009; Tavakoly Sany, Hashim, Rezayi, Salleh, & Safari, 2014). Identified taxa are assigned ecological weights that are used to calculate biotic indices (BIs), such as AMBI (Borja, Franco, & Pérez, 2000), ISI (Rygg, 2002), NSI or NQI1 (Rygg, 2006). These taxon‐specific ecological weights have been defined from empirical and experimental data (Borja et al., 2000). The computed BI value of a sample determines its ecological quality status (usually in five ordered categories from “very poor” to “very good”). However, such an approach demands taxonomic expertise, and results typically take months to become available. Faster and standardized alternatives are of crucial importance for environmental management.
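As a rough illustration of how such an index is derived, the sketch below computes AMBI as the abundance-weighted mean of the ecological group weights (0, 1.5, 3, 4.5 and 6 for groups I to V; Borja et al., 2000), where lower values indicate less disturbed communities, and maps the value to one of five ordered status classes. The function names, the example counts and, in particular, the class boundaries are assumptions made for illustration; they are not taken from this study.

```python
# Weights of the five AMBI ecological groups (Borja et al., 2000).
AMBI_WEIGHTS = {"I": 0.0, "II": 1.5, "III": 3.0, "IV": 4.5, "V": 6.0}

# ASSUMPTION: illustrative upper bounds of each status class, ordered
# from best to worst. Real monitoring schemes define their own cut-offs.
STATUS_BOUNDS = [
    (1.2, "very good"),
    (3.3, "good"),
    (4.3, "moderate"),
    (5.5, "poor"),
    (float("inf"), "very poor"),
]


def ambi(group_counts):
    """Abundance-weighted mean of the ecological group weights.

    group_counts maps an ecological group ("I".."V") to the number of
    specimens assigned to that group in one sample.
    """
    total = sum(group_counts.values())
    if total <= 0:
        raise ValueError("sample contains no assigned specimens")
    weighted = sum(AMBI_WEIGHTS[g] * n for g, n in group_counts.items())
    return weighted / total


def quality_status(value, bounds=STATUS_BOUNDS):
    """Return the first status class whose upper bound is not exceeded."""
    for upper, label in bounds:
        if value <= upper:
            return label
    raise ValueError("no status class matches the value")


if __name__ == "__main__":
    sample = {"I": 50, "III": 50}  # hypothetical specimen counts
    index = ambi(sample)
    print(index, quality_status(index))
```

The sketch also makes the dependence on abundances explicit: every specimen count enters the weighted mean directly, which is why read counts that do not track specimen abundance are problematic for taxonomy-based indices.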

Mol Ecol Resour. 2018;1–11.

wileyonlinelibrary.com/journal/men

© 2018 John Wiley & Sons Ltd

CORDIER ET AL.

High‐throughput amplicon sequencing of environmental DNA (eDNA metabarcoding) followed by taxonomic assignment of sequenced species offers a fast and cost‐effective way to describe biological communities (Taberlet, Coissac, Pompanon, Brochmann, & Willerslev, 2012). The potential of eDNA metabarcoding for biomonitoring was evaluated in both freshwater (Kermarrec et al., 2014; Visco et al., 2015; Zimmermann, Glöckner, Jahn, Enke, & Gemeinholzer, 2015) and marine ecosystems (Bik, Halanych, Sharma, & Thomas, 2012; Chariton et al., 2015; Pawlowski, Esling, Lejzerowicz, Cedhagen, & Wilding, 2014; Pawlowski et al., 2016). The ecological quality status inferred from eDNA metabarcoding data and morphotaxonomic inventories was congruent for both freshwater diatoms (Visco et al., 2015; Zimmermann et al., 2015) and marine invertebrates (Aylagas, Borja, Irigoien, & Rodríguez‐Ezpeleta, 2016; Lejzerowicz et al., 2015). However, these studies relied on reference sequence databases for taxonomic assignment, to retrieve taxon‐specific ecological weights and compute BI values. This prevents using the majority of sequences, which remain taxonomically unassigned or belong to taxa of unknown ecology (Chariton et al., 2015; Lanzén, Lekang, Jonassen, Thompson, & Troedsson, 2016; Lejzerowicz et al., 2015). In addition, BIs include relative abundances of taxa in their formulas, which appears to be an insurmountable problem because there is no direct relationship between specimen abundance (or biomass) and the number of sequence reads in metabarcoding data (Dowle, Pochon, Banks, Shearer, & Wood, 2016; Elbrecht & Leese, 2015; Vivien, Wyler, Lafont, & Pawlowski, 2015).

Recently, supervised machine learning (SML) has been proposed to overcome the issue of taxonomically and ecologically unassigned sequence data in the case of benthic monitoring in marine aquaculture (Cordier et al., 2017) and for the identification of hydrocarbon‐polluted sites from bacterial communities (Smith et al., 2015). The aim of SML is to extract knowledge from a training data set into a predictive model that can be used to make inferences on new, unlabelled samples. In a marine biomonitoring framework, a training data set would consist of samples for which morphotaxonomy‐derived BI values are known (references) and an associated molecular data set is available (features). Building such predictive models gives the opportunity to bypass the taxonomic assignment of operational taxonomic units (OTUs), because their ecological signal is inferred from the training data set, regardless of their taxonomic affiliation. In addition, most SML algorithms are able to capture nonlinear relationships and association rules (Angermueller, Pärnamaa, Parts, & Oliver, 2016; Crisci, Ghattas, & Perera, 2012), which makes them particularly suitable for biomonitoring purposes with eDNA metabarcoding data. Yet, the performance of predictive models is likely affected by the choice of marker for the generation of metabarcoding data sets. Markers with broad taxonomic targets are more likely to capture bioindicator taxa than those with a narrow taxonomic scope, even though some taxonomic groups are known to be good bioindicators (Pawlowski et al., 2014; Stoeck, Kochems, Forster, Lejzerowicz, & Pawlowski, 2018).

In this study, we compared the performance of predictive models built upon five different genetic markers for the benthic monitoring of organic enrichment associated with salmon farming activities in Norway. All the tested markers are located in the ribosomal small subunit (SSU) rRNA gene and include one bacterial, one specific foraminiferal and three universal eukaryotic markers. We also compared the performance of the predictive models with the taxonomy‐based metabarcoding approach. Finally, we investigated the performance of predictive models built upon taxonomic subgroups within each of the five markers, to assess their potential as bioindicators of impact related to organic enrichment.

2 | MATERIALS AND METHODS

2.1 | Sampling, DNA extraction, PCR amplification and sequencing

A thorough description of the sampling scheme, the reference morphotaxonomic data as well as field and DNA extraction protocols can be found in Cordier et al. (2017). Briefly, a total of 144 sediment samples were collected in June and October 2015 at 24 stations distributed in the vicinity of five salmon farms in Norway (Supporting Information Table S1). The PCR details of each of the five SSU markers, including amplification primers, PCR programs and library preparation, are available in Supporting Information Table S2, and the multiplexing details are available in Supporting Information Table S3. Negative PCR controls using highly pure water instead of template DNA were included in each PCR session. Of the 720 PCRs run to amplify the five markers on the 144 eDNA samples, 696 yielded PCR products, including a negative control for bacteria, which was sequenced in order to remove the sequences from this negative control throughout the bacterial data set. The PCR products were quantified by high‐resolution capillary electrophoresis (QIAxcel System, Qiagen) and pooled in equimolar concentration for each library. Each pool was purified using the High Pure PCR Product Purification Kit (Roche), quantified using a fluorometric method (QuBit HS dsDNA kit, Invitrogen) and used for library preparation. The raw data sets are publicly available at the Sequence Read Archive under BioProject PRJNA376130 for foraminifera 37F, PRJNA417767 for Bacteria V3V4, PRJEB23641 for eukaryotes V1V2 and V4, and PRJNA431416 for Eukaryotes V9.

2.2 | Bioinformatics

The preprocessing of each of the five data sets corresponding to the five markers is detailed in Supporting Information Table S4. Briefly, the paired‐end raw reads for each of the markers (except V9, which was sequenced single‐end) were quality filtered, demultiplexed and assembled into full‐length sequences with a custom pipeline written in C for the fast processing of Illumina multiplexed metabarcoding data (https://github.com/esling/illumina-pipeline). The V9 data set was preprocessed using the QIIME 1.9.1 toolkit (Caporaso et al., 2010). Each of the five preprocessed data sets was then filtered for potential chimaeras using UCHIME version 4.2.40 (Edgar, Haas, Clemente, Quince, & Knight, 2011) implemented in the

identify_chimera_seq.py function of QIIME. We used the default parameters of the function, but the --split_by_sampleid option was used in order to restrict the de novo search by sample (i.e., by PCR). The filtered data set was then clustered into operational taxonomic units (OTUs) using SWARM 2.1.8 (Mahé, Rognes, Quince, De Vargas, & Dunthorn, 2015) with the default resolution (d = 1) and the fastidious option. The representative sequences, that is, the most abundant individual sequence unit (ISU) of each OTU, were used as input of the assign_taxonomy.py function of QIIME with default parameters for taxonomic assignment (uclust method), using curated nucleotide databases (Table S4). The OTU‐to‐sample matrices for each marker were generated from the result of the clustering with the make_otu_table.py function of QIIME and imported into the R programming environment (R Development Core Team, 2016) for downstream statistical analysis.

2.3 | Statistics

picked to split the tree at each node, which usually give the best results (Liaw & Wiener, 2002). After predicting BI values independently for each sample, we trained a final model using the full data set (i.e., the five farms) to measure the importance of each OTU.

To compare the performance of each genetic marker, the relationships between the reference and predicted BI values were modelled using the lm function in R. These BI values were then converted into a discrete ecological quality status, after averaging per grab in the case of the predicted values. Their agreement was tested using the kappa2 function of the irr v0.84 package (Gamer, Lemon, Fellows, & Singh, 2012), with squared weights because the ecological status values are ordered from “very poor” to “very good.” Agreement between the two classifications was rated on a scale from “poor agreement” (i.e., kappa value ranging from 0.01 to 0.2) to “almost perfect agreement” (i.e., kappa value ranging from 0.8 to 1) (Landis & Koch, 1977). For each of the four tested BIs, the genetic marker yielding the best predictive model was the one associated with the highest R² value.
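The agreement and fit measures described above can be sketched as follows. This is a minimal Python illustration, not the R code used in the study: `weighted_kappa` is a hypothetical helper that computes Cohen's kappa with squared (quadratic) disagreement weights, the weighting scheme the text describes, and `r_squared` returns the coefficient of determination of a simple linear regression, which equals the squared Pearson correlation. Status classes are assumed to be coded as integers from 0 ("very poor") to 4 ("very good"), and the example values are invented.

```python
def weighted_kappa(reference, predicted, n_categories):
    """Cohen's kappa with squared disagreement weights.

    reference and predicted are equal-length sequences of integer
    category codes in range(n_categories).
    """
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(reference)
    k = n_categories
    # Contingency table: rows are reference classes, columns predictions.
    observed = [[0] * k for _ in range(k)]
    for r, p in zip(reference, predicted):
        observed[r][p] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2  # penalty grows with class distance
            weighted_observed += weight * observed[i][j]
            # Count expected by chance if the two ratings were independent.
            weighted_expected += weight * row_totals[i] * col_totals[j] / n
    if weighted_expected == 0:
        raise ValueError("kappa is undefined when only one class occurs")
    return 1.0 - weighted_observed / weighted_expected


def r_squared(x, y):
    """R² of the least-squares line y ~ x (squared Pearson correlation)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


if __name__ == "__main__":
    # Invented status codes for six samples (0 = very poor, 4 = very good).
    reference_status = [0, 1, 2, 3, 4, 4]
    predicted_status = [0, 1, 2, 4, 4, 3]
    print(weighted_kappa(reference_status, predicted_status, 5))
    # Invented reference and predicted BI values.
    print(r_squared([1.0, 2.5, 3.0, 4.5], [1.2, 2.4, 3.3, 4.1]))
```

With squared weights, a prediction that misses by two classes is penalised four times as heavily as one that misses by a single class, which suits ordered categories such as ecological quality status.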

In the case of the three eukaryotic markers, we compared our

Because uneven sequencing depth across samples introduces biases in the statistical analysis, samples with