Received: 24 January 2018 | Revised: 14 June 2018 | Accepted: 25 June 2018
DOI: 10.1111/1755-0998.12926
RESOURCE ARTICLE
Supervised machine learning outperforms taxonomy‐based environmental DNA metabarcoding applied to biomonitoring

Tristan Cordier1 | Dominik Forster2 | Yoann Dufresne1,3 | Catarina I. M. Martins4 | Thorsten Stoeck2 | Jan Pawlowski1,5

1 Department of Genetics and Evolution, University of Geneva, Geneva, Switzerland
2 Ecology Group, University of Kaiserslautern, Kaiserslautern, Germany
3 Institut Pasteur – Hub of Bioinformatics and Biostatistics – C3BI, USR 3756 IP CNRS, Paris, France
4 Marine Harvest ASA, Bergen, Norway
5 ID-Gene ecodiagnostics, Ltd, Plan-les-Ouates, Switzerland

Correspondence
Tristan Cordier, Department of Genetics and Evolution, University of Geneva, 1211 Geneva, Switzerland. Email: [email protected]

Funding information
Swiss Network for International Studies, Grant/Award Number: 316030_150817; Swiss National Science Foundation, Grant/Award Number: 31003A_179125; Deutsche Forschungsgemeinschaft, Grant/Award Number: STO414/15-1; Carl Zeiss Foundation; NVIDIA GPU grant program

Abstract
Biodiversity monitoring is the standard for environmental impact assessment of anthropogenic activities. Several recent studies showed that high‐throughput amplicon sequencing of environmental DNA (eDNA metabarcoding) could overcome many limitations of the traditional morphotaxonomy‐based bioassessment. Recently, we demonstrated that supervised machine learning (SML) can be used to predict accurate biotic indices values from eDNA metabarcoding data, regardless of the taxonomic affiliation of the sequences. However, it is unknown to what extent the accuracy of such models depends on the taxonomic resolution of molecular markers, or how SML compares with metabarcoding approaches targeting well‐established bioindicator species. In this study, we address these issues by training predictive models upon five different ribosomal bacterial and eukaryotic markers and measuring their performance to assess the environmental impact of marine aquaculture on independent data sets. Our results show that all tested markers yield accurate predictive models and that they all outperform the assessment relying solely on taxonomically assigned sequences. Remarkably, we did not find any significant difference in the performance of the models built using universal eukaryotic or prokaryotic markers. Using any molecular marker with a taxonomic range broad enough to comprise different potential bioindicator taxa, the SML approach can overcome the limits of taxonomy‐based eDNA bioassessment.

KEYWORDS
biomonitoring, biotic indices, environmental DNA, predictive models, supervised machine learning
1 | INTRODUCTION

Biodiversity monitoring is largely used for the environmental impact assessment of anthropogenic activities. Traditionally, in marine ecosystems, such impacts are assessed through the inventory of benthic macro‐invertebrates, which involves the sorting and the morphotaxonomic identification of thousands of specimens for a single site (Borja, Ranasinghe, & Weisberg, 2009; Tavakoly Sany, Hashim, Rezayi, Salleh, & Safari, 2014). Identified taxa are assigned ecological weights that are used to calculate biotic indices (BIs), such as AMBI (Borja, Franco, & Pérez, 2000), ISI (Rygg, 2002), NSI or NQI1 (Rygg, 2006). These taxon‐specific ecological weights have been defined from empirical and experimental data (Borja et al., 2000). The computed BI value of a sample determines its ecological quality status (usually in five ordered categories from “very poor” to “very good”). However, such an approach demands taxonomic expertise, and results typically take months to become available. Faster and standardized alternatives are of crucial importance for environmental management.
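As a rough illustration of how taxon‐specific ecological weights can be turned into a BI value and then into an ordered quality status, the following minimal Python sketch computes an abundance‐weighted mean weight. It is not the formula of AMBI, ISI, NSI or NQI1: the taxon names, the weights, the class boundaries, the intermediate class labels and the convention that low values indicate low impact are all hypothetical and chosen only for the example.

```python
import bisect

# Hypothetical class boundaries and labels (assumptions, not taken from any
# published index). Here a LOW index value means LOW impact.
BOUNDS = [1.0, 2.0, 3.0, 4.0]
LABELS = ["very good", "good", "moderate", "poor", "very poor"]


def biotic_index(abundances, weights):
    """Abundance-weighted mean ecological weight.

    Taxa without a known ecological weight cannot contribute and are
    ignored, which is the limitation of weight-based indices.
    """
    scored = {t: n for t, n in abundances.items() if t in weights and n > 0}
    total = sum(scored.values())
    if total == 0:
        raise ValueError("no taxon with a known ecological weight")
    return sum(weights[t] * n for t, n in scored.items()) / total


def quality_status(bi_value):
    """Map a continuous index value to one of five ordered categories."""
    return LABELS[bisect.bisect_right(BOUNDS, bi_value)]


if __name__ == "__main__":
    # Hypothetical sample: specimen counts per taxon.
    counts = {"taxon_a": 60, "taxon_b": 30, "taxon_c": 10, "taxon_x": 5}
    # Hypothetical ecological weights; taxon_x has none and is skipped.
    eco_weights = {"taxon_a": 0.0, "taxon_b": 3.0, "taxon_c": 6.0}
    bi = biotic_index(counts, eco_weights)
    print(bi, quality_status(bi))  # prints: 1.5 good
```

In this sketch the five specimens of `taxon_x` do not influence the result, because no ecological weight is available for that taxon.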
Mol Ecol Resour. 2018;1–11.
wileyonlinelibrary.com/journal/men
© 2018 John Wiley & Sons Ltd
High‐throughput amplicon sequencing of environmental DNA (eDNA metabarcoding) followed by taxonomic assignment of sequenced species offers a fast and cost‐effective way to describe biological communities (Taberlet, Coissac, Pompanon, Brochmann, & Willerslev, 2012). The potential of eDNA metabarcoding for biomonitoring was evaluated in both freshwater (Kermarrec et al., 2014; Visco et al., 2015; Zimmermann, Glöckner, Jahn, Enke, & Gemeinholzer, 2015) and marine ecosystems (Bik, Halanych, Sharma, & Thomas, 2012; Chariton et al., 2015; Pawlowski, Esling, Lejzerowicz, Cedhagen, & Wilding, 2014; Pawlowski et al., 2016). The ecological quality status inferred from eDNA metabarcoding data and morphotaxonomic inventories was congruent for both freshwater diatoms (Visco et al., 2015; Zimmermann et al., 2015) and marine invertebrates (Aylagas, Borja, Irigoien, & Rodríguez‐Ezpeleta, 2016; Lejzerowicz et al., 2015). However, these studies relied on reference sequence databases for taxonomic assignment, to retrieve taxon‐specific ecological weights and compute BI values. This prevents using the majority of sequences that remain taxonomically unassigned or belong to taxa of unknown ecology (Chariton et al., 2015; Lanzén, Lekang, Jonassen, Thompson, & Troedsson, 2016; Lejzerowicz et al., 2015). In addition, BIs include relative abundances of taxa in their formulas, which appears to be an insurmountable problem due to the lack of a direct relationship between specimen abundance (or biomass) and the number of sequence reads in metabarcoding data (Dowle, Pochon, Banks, Shearer, & Wood, 2016; Elbrecht & Leese, 2015; Vivien, Wyler, Lafont, & Pawlowski, 2015).

Recently, supervised machine learning (SML) has been proposed to overcome the issue of taxonomically and ecologically unassigned sequence data in the case of benthic monitoring in marine aquaculture (Cordier et al., 2017) and for the identification of hydrocarbon‐polluted sites from bacterial communities (Smith et al., 2015). The aim of SML is to extract knowledge from a training data set into a predictive model that can be used to make inference on new, unlabelled upcoming samples. In a marine biomonitoring framework, a training data set would consist of samples for which morphotaxonomic‐derived BI values are known (references) and an associated molecular data set is available (features). Building such predictive models gives the opportunity to bypass the taxonomic assignment of operational taxonomic units (OTUs), because their ecological signal is inferred from the training data set, regardless of their taxonomic affiliation. In addition, most SML algorithms are able to capture nonlinear relationships and association rules (Angermueller, Pärnamaa, Parts, & Oliver, 2016; Crisci, Ghattas, & Perera, 2012), which makes them particularly suitable for biomonitoring purposes with eDNA metabarcoding data. Yet, the performance of predictive models is likely affected by the choice of marker for the generation of metabarcoding data sets. Markers with broad taxonomic targets are more likely to capture bioindicator taxa than those with a narrow taxonomic scope, even though some taxonomic groups are known to be good bioindicators (Pawlowski et al., 2014; Stoeck, Kochems, Forster, Lejzerowicz, & Pawlowski, 2018).

In this study, we compared the performance of predictive models built upon five different genetic markers for the benthic monitoring of organic enrichment associated with salmon farming activities in Norway. All the tested markers are located in the ribosomal small subunit (SSU) rRNA gene and include one bacterial, one specific foraminiferal and three universal eukaryotic markers. We also compared the performance of the predictive models with the taxonomy‐based metabarcoding approach. Finally, we investigated the performance of predictive models built upon taxonomic subgroups within each of the five markers, to assess their potential as bioindicators of impact related to organic enrichment.

2 | MATERIALS AND METHODS

2.1 | Sampling, DNA extraction, PCR amplification and sequencing

The thorough description of the sampling scheme, the reference morphotaxonomic data as well as field and DNA extraction protocols can be found in Cordier et al. (2017). Briefly, a total of 144 sediment samples were collected in June and October 2015 at 24 stations distributed in the vicinity of five salmon farms in Norway (Supporting Information Table S1). The PCR details of each of the five SSU markers, including amplification primers, PCR programs and library preparation, are available in Supporting Information Table S2, and the multiplexing details are available in Supporting Information Table S3. Negative PCR controls using highly pure water instead of template DNA were included in each PCR session. Of a total of 720 PCRs performed to amplify the five markers on the 144 eDNA samples, 696 yielded PCR products, including a negative control for bacteria, which was sequenced in order to remove the sequences from this negative control throughout the bacterial data set. The PCR products were quantified by high‐resolution capillary electrophoresis (QIAxcel System, Qiagen) and pooled in equimolar concentration for each library. Each pool was purified using the High Pure PCR Product Purification Kit (Roche), quantified using a fluorometric method (QuBit HS dsDNA kit, Invitrogen) and used for library preparation. The raw data sets are publicly available at the Sequence Read Archive under BioProject PRJNA376130 for foraminifera 37F, PRJNA417767 for Bacteria V3V4, PRJEB23641 for eukaryotes V1V2 and V4, and PRJNA431416 for Eukaryotes V9.

2.2 | Bioinformatics

The preprocessing of each of the five data sets corresponding to the five markers is detailed in Supporting Information Table S4. Briefly, the paired‐end raw reads for each of the markers (except V9, which was sequenced in single‐end) were quality filtered, demultiplexed and assembled into full‐length sequences with a custom pipeline written in C for the fast processing of Illumina multiplexed metabarcoding data (https://github.com/esling/illumina-pipeline). The V9 data set was preprocessed using the QIIME 1.9.1 toolkit (Caporaso et al., 2010). Each of the five preprocessed data sets was then filtered for potential chimaeras using UCHIME version 4.2.40 (Edgar, Haas, Clemente, Quince, & Knight, 2011) implemented in the
identify_chimeric_seqs.py function of QIIME. We used the default parameters of the function, but the --split_by_sampleid option was used in order to restrict the de novo search by sample (i.e., by PCR). The filtered data set was then clustered into operational taxonomic units (OTUs) using SWARM 2.1.8 (Mahé, Rognes, Quince, De Vargas, & Dunthorn, 2015) with the default resolution (d = 1) and the fastidious option. The representative sequences, that is, the most abundant individual sequence unit (ISU) of each OTU, were used as input of the assign_taxonomy.py function of QIIME with default parameters for taxonomic assignment (uclust method), using curated nucleotide databases (Table S4). The OTU‐to‐sample matrices for each marker were generated from the result of the clustering with the make_otu_table.py function of QIIME and imported into the R programming environment (R Development Core Team, 2016) for downstream statistical analysis.

2.3 | Statistics

[…] picked to split the tree at each node, which usually gives the best results (Liaw & Wiener, 2002). After predicting BI values independently for each sample, we trained a final model using the full data set (i.e., the five farms) to measure the importance of each OTU.

To compare the performance of each genetic marker, the relationships between the reference and predicted BI values were modelled using the lm function in R. These BI values were then converted into a discrete ecological quality status, after averaging per grab in the case of the predicted values. Their agreement was tested using the kappa2 function of the irr v0.84 package (Gamer, Lemon, Fellows, & Singh, 2012), with squared weights because the ecological status values are ordered from “very poor” to “very good.” Agreement between the two classifications was rated on a scale from “poor agreement” (i.e., kappa value ranging from 0.01 to 0.2) to “almost perfect agreement” (i.e., kappa value ranging from 0.8 to 1) (Landis & Koch, 1977). For each of the four tested BIs, the genetic marker yielding the best predictive model was the one associated with the highest R² value.
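The two performance measures described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code used in the study, which relied on the lm function of R and the kappa2 function of the irr package. The function names are hypothetical, and so are the three intermediate status labels, since only the end points “very poor” and “very good” are given in the text. The sketch relies on two standard results: for a simple linear regression with an intercept, R² equals the squared Pearson correlation, and the squared-weight kappa uses disagreement weights (i - j)² / (k - 1)² between categories i and j out of k.

```python
# Ordered quality classes. The three middle labels are assumptions.
STATUS_ORDER = ["very poor", "poor", "moderate", "good", "very good"]


def r_squared(reference, predicted):
    """R² of a simple linear regression between two numeric vectors."""
    n = len(reference)
    if n != len(predicted) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(reference) / n
    mean_y = sum(predicted) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(reference, predicted))
    sxx = sum((x - mean_x) ** 2 for x in reference)
    syy = sum((y - mean_y) ** 2 for y in predicted)
    if sxx == 0 or syy == 0:
        raise ValueError("R² is undefined for a constant vector")
    return sxy ** 2 / (sxx * syy)


def quadratic_weighted_kappa(reference, predicted, categories=STATUS_ORDER):
    """Cohen's kappa with squared (quadratic) disagreement weights."""
    n = len(reference)
    if n != len(predicted) or n == 0:
        raise ValueError("need two non-empty vectors of equal length")
    index = {label: i for i, label in enumerate(categories)}
    k = len(categories)
    # Observed joint proportions of (reference class, predicted class).
    observed = [[0.0] * k for _ in range(k)]
    for ref, pred in zip(reference, predicted):
        observed[index[ref]][index[pred]] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected proportions under independence of the two ratings.
            weighted_expected += weight * row[i] * col[j]
    if weighted_expected == 0:
        raise ValueError("kappa is undefined when all samples share one class")
    return 1.0 - weighted_observed / weighted_expected
```

With squared weights, a prediction that misses the reference status by one class is penalised far less than one that misses it by four classes, which is why this weighting suits ordered categories.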
Because uneven sequencing depth across samples introduces biases
In the case of the three eukaryotic markers, we compared our
in the statistical analysis, samples with