Ecological Modelling 199 (2006) 164–175

available at www.sciencedirect.com

journal homepage: www.elsevier.com/locate/ecolmodel

The derivation of species response curves with Gaussian logistic regression is sensitive to sampling intensity and curve characteristics

Christophe Coudun ∗, Jean-Claude Gégout

LERFoB, UMR INRA-ENGREF 1092, Ecole Nationale du Génie Rural, des Eaux et des Forêts, 14 Rue Girardet, CS 4216, 54042 Nancy Cedex, France

Article info

Article history: Published on line 10 July 2006

Keywords: Artificial data; Ecological amplitude; Ecological optimum; EcoPlant database; Forest plants; Logistic regression; pH; Species response curve; Sampling

Abstract

We investigated quantitatively the sensitivity of plant species response curves to sampling characteristics (number of plots, occurrence and frequency of species), along a simulated pH gradient. We defined 54 theoretical unimodal response curves, issued from combinations of six values for optimum (opt = 3, 4, ..., 8), three values for tolerance (tol = 0.5, 1.0, and 1.5, sensu ter Braak and Looman [ter Braak, C.J.F., Looman, C.W.N., 1986. Weighted averaging, logistic regression and the Gaussian response model. Vegetatio 65, 3–11]), and three values for maximum probability of presence (pmax = 0.05, 0.20, and 0.50). For each of these 54 theoretical response curves, we built artificial binary data sets (presence/absence) to test the influence of species occurrence, frequency, or number of available plots. With real data extracted from EcoPlant, a phytoecological database for French forests [Gégout, J.-C., Coudun, Ch., Bailly, G., Jabiol, B., 2005. EcoPlant: a forest sites database linking floristic data with soil characteristics and climatic conditions. J. Veg. Sci. 16, 257–260], we compared the ecological response of 50 plant species to soil pH, based first on a small data set (100 randomly sampled plots), and then based on the whole data set available (3810 plots). Results on simulated data showed that the curve optimum, amplitude, or maximum probability of presence cannot be assessed reliably with logistic regression when a species is too rare, or when its theoretical optimum lies near an extreme of the gradient. Those theoretical results were illustrated by real data extracted from EcoPlant. We suggest a general minimum value of 50 occurrences for species to derive acceptable ecological response curves with logistic regression. © 2006 Elsevier B.V. All rights reserved.

1. Introduction

Attempts to link species presence/absence to environmental factors through the computation of response curves have received much attention from ecologists and statisticians in the last few decades (ter Braak and Looman, 1986; Odland et al., 1995; ter Braak, 1996). The most popular techniques currently used are generalised linear models (GLMs, McCullagh and Nelder, 1997) and generalised additive models (GAMs, Yee and Mitchell, 1991; Hastie and Tibshirani, 1997), and they have been heavily documented, both theoretically and empirically (Guisan et al., 2002; Lehmann et al., 2002a; Scott et al., 2002). Logistic regression is a particular case of generalised linear modelling, and one of the oldest and most popular

∗ Corresponding author at: Centre for Terrestrial Carbon Dynamics (CTCD), Forest Research, Alice Holt Lodge, Farnham, Surrey GU10 4LH, United Kingdom. Tel.: +44 1420 526289.
E-mail addresses: [email protected] (C. Coudun), [email protected] (J.-C. Gégout).
0304-3800/$ – see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.ecolmodel.2006.05.024


Table 1 – List of 21 recent species distribution modelling case studies based on GLM/GAM techniques

| Reference | Location | Species | Number of observations | Number of species | Minimum occurrence | Minimum frequency (%) |
|---|---|---|---|---|---|---|
| Araújo et al. (2002) | Great Britain | Passerine birds | 120000 | 78 | 5 | 0.0 |
| Araújo et al. (2004) | Europe | Plants | 2434 | 1200 | 25 | 1.0 |
| Austin (1998) | Australia | Trees | 9537 | 88 | 50 | 0.5 |
| Austin et al. (2000) | Australia | Trees and shrubs | 2530 | 135 | 5 | 0.2 |
| Bakkenes et al. (2002) | Europe | Plants | 4419 | 1397 | 20 | 0.5 |
| Bio et al. (1998) | The Netherlands | Plants | 2090 | 156 | 22 | 1.1 |
| Bio et al. (2002) | Belgium | Plants | 2587 | 18 | 20 | 0.8 |
| Bragazza and Gerdol (1996) | Italy | Plants | 251 | 21 | 25 | 10.0 |
| Brotons et al. (2004) | Spain | Birds | 1550 | 30 | 15 | 1.0 |
| Cawsey et al. (2002) | Australia | Trees and shrubs | 2307 | 147 | 5 | 0.2 |
| Coudun and Gégout (2005) | France | Plants | 1200 | 46 | 50 | 4.2 |
| Dirnböck et al. (2003) | Austria | Plants | 959 | 85 | 100 | 10.4 |
| Dirnböck and Dullinger (2004) | Austria | Plants | 1016 | 71 | 100 | 9.8 |
| Gégout et al. (2003) | France | Plants | 306 | 122 | 5 | 1.6 |
| Guisan et al. (1999) | United States | Trees and shrubs | 144 | 23 | 4 | 2.8 |
| Guisan and Theurillat (2000) | Switzerland | Plants | 205 | 62 | 15 | 7.3 |
| Heegaard et al. (2001) | Northern Ireland | Aquatic plants | 574 | 32 | 20 | 3.5 |
| Lehmann et al. (2002b) | New Zealand | Ferns | 19875 | 43 | 202 | 1.0 |
| McPherson et al. (2004) | South Africa | Birds | 4275 | 32 | 26 | 0.6 |
| Segurado and Araújo (2004) | Portugal | Amphibians/reptiles | 993 | 44 | 4 | 0.4 |
| Zaniewski et al. (2002) | New Zealand | Ferns | 19875 | 43 | 66 | 0.3 |

Information is given on location, nature of modelled species or species group, as well as characteristics of sampling (number of observations, number of species, species minimum occurrence and species minimum frequency). The criterion used for species selection is indicated in bold for each study.

technique used world-wide to link species presence to ecological gradients, and thus to characterise species-environment relationships quantitatively (ter Braak and Looman, 1986; ter Braak, 1996). It is a simple, flexible, parametric technique that may allow symmetric bell-shaped ecological response curves to be derived for species (ter Braak and Looman, 1986). When the interest lies in numerically summarising the information contained in ecological response curves, for example by extracting the ecological optimum and ecological amplitude, logistic regression modelling has been shown to be a robust and powerful technique (Hill et al., 1999, 2000; Roy et al., 2000; Gégout and Krizova, 2003; Coudun and Gégout, 2005). In a recent review, Diekmann (2003) also underlines the appropriateness of logistic regression for dealing with species indicator values in plant ecology.

The decision to select or reject species from a data set, in order to compute ecological response curves, is always a crucial step for ecologists who want to characterise the ecological behaviour of several species in a particular region. Unfortunately, such decisions are rarely explained or justified (McKenney et al., 2002). In most studies, selected species are those which are present in more than an arbitrary number or proportion of observations within the data set (Table 1). The minimum number of occurrences to select a species varies greatly among studies, ranging from 5 (Guisan et al., 1999; Austin et al., 2000; Araújo et al., 2002; Cawsey et al., 2002; Gégout et al., 2003; Segurado and Araújo, 2004) to 100 occurrences (Dirnböck et al., 2003; Dirnböck and Dullinger, 2004). The minimum frequency to select a species varies as well, being dependent on sample size (Manel et al., 2001; Table 1). When the objective is to study the ecological behaviour of many species in an area (e.g., Lawesson and Oksanen, 2002; Coudun, 2005; Coudun and Gégout, 2005), the threshold occurrence value to select species is very important because, in nature, there are few frequent species and many non-frequent species (Karl et al., 2002). The choice of a lower threshold value could thus increase the possible number of studied species, but the consequences for computed response curves remain unknown.

Moreover, the position of the species optimum along the ecological gradient seems to have an influence on the derivation of species response curves, and some technical artefacts have been revealed, such as the effect of sampling pattern on species distribution along gradients (Mohler, 1983). Indeed, the computation of ecological optima near the ends of the gradients is a difficult task (Rydgren et al., 2003), and Austin et al. (1990, 1994) also showed that the response of species tends to be asymmetric when their optimum lies near one end of the gradient.

Using artificial data, our objective was to evaluate the minimum occurrence required to select a species in order to derive an acceptable response curve with logistic regression. We also wanted to illustrate some technical artefacts linked to this technique that might explain the observed distribution of species optima and the proportion of non-reacting species when a real data set is considered. Our results apply only to instances where species response curves are supposed to be Gaussian, since we assumed that original species response curves were symmetric curves that could be fitted with a quadratic logistic regression equation. When the interest is in fitting skewed response curves, however, other techniques could be used, such as the Huisman–Olff–Fresco (HOF) model (Huisman et al., 1993), which has been shown to be among the best techniques to characterise species-environment relationships (Lawesson and Oksanen, 2002; Oksanen and Minchin, 2002).


2. Materials and methods

2.1. Logistic regression and numerical summaries of response curves

A response curve represents the probability of presence, p(x), of a species along an ecological gradient x. Traditionally, the conceptual shape of response curves is unimodal (bell-shaped, ter Braak, 1996), but many other shapes of response curves might also be observed (Huisman et al., 1993; Lawesson and Oksanen, 2002). Assuming a bell-shaped response curve with logistic regression means that the logit-transformed value of p(x) is a quadratic function of x (Eq. (1), from ter Braak and Looman, 1986):

$$
\log\left(\frac{p(x)}{1 - p(x)}\right) = b_0 + b_1 x + b_2 x^2 = a - \frac{(x - \mathrm{opt})^2}{2\,\mathrm{tol}^2} \qquad (1)
$$

In Eq. (1), b0, b1 and b2 are the regression parameters obtained from a statistical package, such as S-Plus (MathSoft, 1999), and opt, tol, and a are the optimum, the tolerance, and a parameter linked to pmax, the maximum value of p(x), respectively. The three parameters opt, tol, and pmax can be easily linked to the regression coefficients (Eq. (2), from ter Braak and Looman, 1986):

$$
\mathrm{opt} = \frac{-b_1}{2 b_2}, \qquad \mathrm{tol} = \frac{1}{\sqrt{-2 b_2}}, \qquad \text{and} \qquad p_{\max} = \frac{1}{1 + \exp\left(\dfrac{b_1^2}{4 b_2} - b_0\right)} \qquad (2)
$$
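The link between the two parameterisations in Eqs. (1) and (2) can be sketched in a few lines of Python. This is an illustration only, not the authors' S-Plus program; the function names are hypothetical. The inverse mapping is obtained by expanding the right-hand side of Eq. (1), which gives b2 = −1/(2 tol²), b1 = opt/tol² and b0 = a − opt²/(2 tol²), with a = logit(pmax).

```python
import math


def coefficients_to_summaries(b0, b1, b2):
    """Apply Eq. (2): quadratic logit coefficients -> (opt, tol, pmax).

    Only meaningful for a bell-shaped curve, i.e. when b2 is negative.
    """
    if b2 >= 0:
        raise ValueError("b2 must be negative for a bell-shaped curve")
    opt = -b1 / (2.0 * b2)
    tol = 1.0 / math.sqrt(-2.0 * b2)
    pmax = 1.0 / (1.0 + math.exp(b1 ** 2 / (4.0 * b2) - b0))
    return opt, tol, pmax


def summaries_to_coefficients(opt, tol, pmax):
    """Inverse mapping, from expanding a - (x - opt)^2 / (2 tol^2)."""
    a = math.log(pmax / (1.0 - pmax))  # a is the logit of pmax
    b2 = -1.0 / (2.0 * tol ** 2)
    b1 = opt / tol ** 2
    b0 = a - opt ** 2 / (2.0 * tol ** 2)
    return b0, b1, b2


def presence_probability(x, b0, b1, b2):
    """Probability of presence p(x) from Eq. (1)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x + b2 * x * x)))
```

For example, `summaries_to_coefficients(5.0, 1.0, 0.20)` followed by `coefficients_to_summaries(...)` should return the original triplet up to rounding error, and `presence_probability` evaluated at the optimum should equal pmax.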

When the response curve is not significantly unimodal, which is determined with a residual deviance test (e.g., at the 0.05 level for the p-value, McCullagh and Nelder, 1997), Eq. (1) is simplified with b2 equal to zero and the response curve is then sigmoidal, either increasing or decreasing. Further, when the response curve is not significantly sigmoidal, Eq. (1) is simplified with b1 and b2 equal to zero and the response curve is flat. Ecologically, this implies no apparent significant reaction of the species to the ecological gradient.

The derivation of the numerical summaries opt, tol, and pmax from Eq. (2) is valid only for unimodal bell-shaped response curves (b2 is significantly different from zero), and enables the comparison of the ecological behaviour of different species along gradients (Odland et al., 1995), or the comparison of the ecological behaviour of the same species in different regions (Gégout and Krizova, 2003; Coudun and Gégout, 2005). When the response curve is not significantly unimodal, being actually either sigmoidal or flat, the computation of tolerance (tol, sensu ter Braak and Looman, 1986) is not possible (Eq. (2)). In this study, we thus chose another measure of niche breadth, or ecological amplitude (amp, sensu Gégout and Pierrat, 1998), that is computed as the pH range containing 80% of the distribution of probability of presence. The advantage of this amplitude (amp) is that it is applicable to any response shape, and that it represents the pH range in which presence conditions are optimal.
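The hierarchy of shapes described above (unimodal, then sigmoidal, then flat) can be sketched as a sequence of nested deviance tests. The code below is a minimal pure-Python illustration, not the S-Plus procedure used in the study. It assumes a plain Newton-Raphson fit, a chi-square test with one degree of freedom on each drop in residual deviance, and the extra condition that b2 be negative for a curve to count as bell-shaped. All function names are hypothetical, and the sketch assumes the data contain both presences and absences.

```python
import math


def _solve(A, b):
    """Solve the small linear system A z = b by Gaussian elimination."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z


def _prob(beta, x):
    eta = sum(b * x ** j for j, b in enumerate(beta))
    eta = max(-30.0, min(30.0, eta))  # guard against overflow in exp
    return 1.0 / (1.0 + math.exp(-eta))


def fit_logistic(xs, ys, degree, iterations=20):
    """Newton-Raphson fit of logit(p) = b0 + ... + b_degree * x**degree.

    Returns (coefficients, residual deviance).
    """
    k = degree + 1
    beta = [0.0] * k
    for _ in range(iterations):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for x, y in zip(xs, ys):
            p = _prob(beta, x)
            powers = [x ** j for j in range(k)]
            for i in range(k):
                grad[i] += (y - p) * powers[i]
                for j in range(k):
                    hess[i][j] += p * (1.0 - p) * powers[i] * powers[j]
        beta = [b + s for b, s in zip(beta, _solve(hess, grad))]
    deviance = -2.0 * sum(
        math.log(_prob(beta, x)) if y else math.log(1.0 - _prob(beta, x))
        for x, y in zip(xs, ys)
    )
    return beta, deviance


def chi2_pvalue_1df(statistic):
    """Upper tail of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(statistic, 0.0) / 2.0))


def select_shape(xs, ys, alpha=0.05):
    """Return 'unimodal', 'sigmoidal' or 'flat' from nested deviance tests."""
    _, dev_flat = fit_logistic(xs, ys, 0)
    _, dev_sigmoid = fit_logistic(xs, ys, 1)
    beta_quad, dev_quad = fit_logistic(xs, ys, 2)
    if beta_quad[2] < 0 and chi2_pvalue_1df(dev_sigmoid - dev_quad) < alpha:
        return "unimodal"
    if chi2_pvalue_1df(dev_flat - dev_sigmoid) < alpha:
        return "sigmoidal"
    return "flat"
```

The inputs `xs` and `ys` are parallel lists of gradient values and 0/1 presence records. A production analysis would rather rely on an established GLM routine with convergence checks; the hand-written solver is used here only to keep the sketch self-contained.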

2.2. Theoretical response curves

We used a real pH gradient to explore the sensitivity of species response curves to sampling intensity and curve characteristics. In French forests, pH values range from 3 to 8 for most sites found in the forest database EcoPlant (Gégout, 2001; Gégout et al., 2005). We defined 54 theoretical response curves, representative of actual plant species pH response curves in France (Coudun and Gégout, 2005), as combinations of six values for the optimum (opt = 3, 4, ..., 8), three values for the tolerance (tol = 0.5, 1.0, 1.5), and three values for the maximum probability of the curve (pmax = 0.05, 0.20, 0.50, see Eq. (2) and Fig. 1). These 54 response curves could be easily regarded as a representation of the ecological behaviour of 54 virtual plant species along a pH gradient (Fig. 1), with acidophilous (opt = 3) to basophilous (opt = 8) species, narrow (tol = 0.5) to wide (tol = 1.5) amplitude species, and rare (pmax = 0.05) to frequent (pmax = 0.50) species. In addition, we defined three supplementary theoretical flat response curves (pmax = 0.05, 0.20, and 0.50), to assess the extent to which species reaction is predicted when the species is actually not expected to react (Fig. 1).
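The factorial design described above (6 optima × 3 tolerances × 3 maximum probabilities, plus 3 flat curves) can be enumerated as follows. This is a minimal sketch; the variable names are hypothetical and chosen for illustration.

```python
from itertools import product

OPTIMA = [3, 4, 5, 6, 7, 8]
TOLERANCES = [0.5, 1.0, 1.5]
PMAX_VALUES = [0.05, 0.20, 0.50]

# One dictionary per virtual species with a bell-shaped response curve.
theoretical_curves = [
    {"opt": opt, "tol": tol, "pmax": pmax}
    for opt, tol, pmax in product(OPTIMA, TOLERANCES, PMAX_VALUES)
]

# Flat curves have a constant probability of presence along the gradient.
flat_curves = [{"pmax": pmax} for pmax in PMAX_VALUES]
```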

2.3. Artificial data sets of varying size

For each of the 54 theoretical response curves, the triplet of numerical ecological summaries (opt, tol, pmax) was transformed into a triplet of regression coefficients (b0, b1, b2), and used to calculate p(x) along the pH gradient (Eq. (1)). We selected 50 pH values (regularly spaced with 0.1 pH unit intervals) between 3.1 and 8.0, called xi, and we created 10,000 presence/absence data (1/0) for each xi, based on the Bernoulli distribution associated with the corresponding value of p(xi). Each theoretical response curve was thus linked to one Table T, with 50 columns (one column for each pH value xi) and 10,000 rows, each row entry being a simulated observation (presence/absence), from which we extracted data sets of varying size. In order to get a binary data set of n simulated observations (e.g., 500 observations) subject to logistic regression, we randomly sampled n/50 (e.g., 10) simulated observations from each column of Table T. We then repeated the procedure a hundred times, to get 100 repetitions of n plots. The 11 different data set sizes we chose were: 50, 100, 150, 200, 300, 500, 750, 1000, 1500, 2000 and 5000, obtained by randomly sampling 1, 2, 3, 4, 6, 10, 15, 20, 30, 40 and 100 values in each column of Table T, respectively. For each data set size, we produced 100 different repetitions. For each of the three flat curves (pmax = 0.05, 0.20, or 0.50), 11 data sets of different size were also built, with 100 different data sets for each size.
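The construction of Table T and the extraction of balanced data sets can be sketched as below. This is an illustrative reading of the procedure, not the original program: the function names are hypothetical, and sampling within each column is assumed to be without replacement, which the text does not specify.

```python
import math
import random

# 50 regularly spaced pH values: 3.1, 3.2, ..., 8.0
PH_VALUES = [round(3.1 + 0.1 * i, 1) for i in range(50)]


def gaussian_logit_probability(x, opt, tol, pmax):
    """p(x) from Eq. (1), with a = logit(pmax)."""
    a = math.log(pmax / (1.0 - pmax))
    return 1.0 / (1.0 + math.exp(-(a - (x - opt) ** 2 / (2.0 * tol ** 2))))


def build_table(opt, tol, pmax, n_rows=10_000, rng=None):
    """Table T as a list of 50 columns, each holding n_rows Bernoulli draws."""
    rng = rng or random.Random()
    columns = []
    for x in PH_VALUES:
        p = gaussian_logit_probability(x, opt, tol, pmax)
        columns.append([1 if rng.random() < p else 0 for _ in range(n_rows)])
    return columns


def draw_data_set(columns, n_plots, rng=None):
    """Sample n_plots / 50 observations per pH column (assumed without replacement)."""
    rng = rng or random.Random()
    if n_plots % len(columns) != 0:
        raise ValueError("n_plots must be a multiple of the number of pH values")
    per_column = n_plots // len(columns)
    xs, ys = [], []
    for x, column in zip(PH_VALUES, columns):
        for y in rng.sample(column, per_column):
            xs.append(x)
            ys.append(y)
    return xs, ys
```

Calling `draw_data_set` 100 times on the same table gives the 100 repetitions for a given data set size; passing a seeded `random.Random` instance makes the draws reproducible.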

2.4. Computation of resulting curve characteristics

In total, 59,400 logistic regression models were built (generalised linear models with logit link function and binomial error distribution, see McCullagh and Nelder, 1997), since each of the 54 theoretical response curves led to 11 sample sizes (50–5000 observations), which then led each to 100 repetitions. A total of 3300 other logistic regression models, corresponding to the


Fig. 1 – Theoretical probability of presence of simulated species vs. a pH gradient. The 54 simulated response curves are derived from combinations of: (a) six values for optimum (opt = 3, 4, 5, 6, 7 or 8); (b) three values for tolerance (tol = 0.5, 1.0 or 1.5); (c) three values for maximum probability of the curve (pmax = 0.05, 0.20 or 0.50). (b) A flat response curve (no expected reaction for species).

three flat theoretical curves, were also built (three curves, 11 data set sizes, and 100 repetitions for each size).

For each simulated response curve that was not flat, we derived the following three synthetic numerical values: ecological optimum (opt), ecological amplitude (amp, sensu Gégout and Pierrat, 1998) and the maximal height of the curve (pmax). The quality of the computed response curves was assessed by the difference between computed and theoretical values for opt, amp, and pmax. We subjectively decided that a computed response curve was reasonably close to the theoretical curve when (i) the absolute difference between computed and theoretical optimum (|opt_computed − opt_theoretical|) was strictly less than 1 pH unit, (ii) the absolute relative difference between computed and theoretical amplitude (|amp_computed − amp_theoretical|/amp_theoretical) was less than 0.25, and (iii) the absolute relative difference between computed and theoretical maximum probability (|pmax_computed − pmax_theoretical|/pmax_theoretical) was less than 0.25.

We summarised the results by performing a logistic regression (generalised linear model with a logit link function and a binomial error distribution) on the 59,400 simulated models, using acceptable/not acceptable model (1/0) as the response variable, and the number of occurrences and the three characteristics of the theoretical curves (opt, amp, pmax) as predictor variables, using a procedure similar to that suggested by Rydgren et al. (2003). All computations were performed with the S-Plus statistical software package (MathSoft, 1999), and the written program is available on request from the authors.
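The three acceptability criteria, together with one possible reading of the amplitude measure, can be sketched as follows. The amplitude is computed here as the central interval (10th to 90th percentile) of the normalised curve on a discrete grid; this is an assumption, since the text only states that amp is the pH range containing 80% of the distribution of probability of presence. Function and key names are hypothetical.

```python
def amplitude(xs, probabilities, coverage=0.80):
    """Width of the central interval holding `coverage` of the normalised curve.

    Assumption: the 80% range is taken symmetrically in probability mass,
    leaving 10% in each tail; xs must be sorted in increasing order.
    """
    total = sum(probabilities)
    tail = (1.0 - coverage) / 2.0
    cumulative = 0.0
    low = high = None
    for x, p in zip(xs, probabilities):
        cumulative += p / total
        if low is None and cumulative >= tail:
            low = x
        if high is None and cumulative >= 1.0 - tail:
            high = x
    if high is None:  # rounding can leave the cumulative sum just below the bound
        high = xs[-1]
    return high - low


def is_acceptable(computed, theoretical):
    """Apply criteria (i) to (iii) to dictionaries with keys opt, amp, pmax."""
    opt_ok = abs(computed["opt"] - theoretical["opt"]) < 1.0
    amp_ok = abs(computed["amp"] - theoretical["amp"]) / theoretical["amp"] < 0.25
    pmax_ok = abs(computed["pmax"] - theoretical["pmax"]) / theoretical["pmax"] < 0.25
    return opt_ok and amp_ok and pmax_ok


# Model counts quoted in the text: curves x data set sizes x repetitions.
N_MODELS_WITH_REACTION = 54 * 11 * 100
N_MODELS_FLAT = 3 * 11 * 100
```

A computed curve with an optimum 0.6 pH unit away from the theoretical one, and with amplitude and pmax both within 10% of their theoretical values, would count as acceptable under these rules; an optimum exactly 1 pH unit away would not, because criterion (i) is strict.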

2.5. Application to real data

To illustrate the influence of sampling intensity and response curve characteristics on the quality of computed models, we used real data extracted from EcoPlant, a phytoecological database for forests from France (Gégout, 2001; Gégout et al., 2005) that contains complete site-specific floristic (species lists) and environmental information (e.g., climatic description, soil description, chemical analyses of soil samples). EcoPlant illustrates the fact that in nature, there are few frequent species and many non-frequent species. For the 50 most frequent vascular plant species in French forests (excluding trees), an ecological response curve along the pH gradient (from pH 3 to 8.5) was computed, and those 50 species were ordered by increasing computed pH optimum value. First, all 3810 observations available with a measurement of soil pH (H2O) and scattered over France were used, the minimum number of occurrences being 321 for Ajuga reptans. Secondly, we used only 100 observations randomly extracted from the larger data set.

3. Results

3.1. Prediction of species response to the ecological factor

Among the 59,400 computed models with an expected reaction of the species, 53,637 models (90.3%) revealed a statistically significant association with the underlying gradient, resulting in a unimodal or monotonic response curve. The other 5763 models (9.7%) predicted a flat response curve. Among the 3300 computed models with no expected reaction of the species, 2942 models (89.2%) revealed no reaction; the other 358 models (10.8%) predicted a false reaction of species. The rate of correct prediction of species reaction, when reaction should theoretically occur, increases with sample size (Fig. 2a), whereas the rate of correct prediction of the absence of reaction, when reaction is not expected, does not seem to be affected by sample size (Fig. 2b). The influence of sample size on the quality of derivation of response curves was obvious (Fig. 3), since 50 observations appear to be insufficient to reproduce the ecological behaviour of species in an acceptable way.

Fig. 2 – Prediction success with varying sample size of: (a) reaction rate when the species is supposed to react (based on 5400 simulated models for each sample size); (b) non-reaction rate when the species is not supposed to react (based on 300 simulated models for each sample size).

3.2. The number of occurrences is the criterion to select studied species

Values of opt, amp, and pmax were computed for each of the 53,637 models and compared to the three values of the theoretical response curve. Sample size and species frequency had a lower influence on the accuracy in determining the right optimum than species number of occurrences (Fig. 4). The accuracy in approximating the real optimum is better for larger numbers of occurrences (Fig. 4b), as is the accuracy in approximating the amplitude or the maximum probability of the response curve (Fig. 5). A minimum of approximately 50 occurrences seems to be necessary to model species response to the pH gradient in a satisfactory manner, in terms of optimum (Fig. 5a), amplitude (Fig. 5b), and maximum probability (Fig. 5c).
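Because the simulated plots are spread evenly over the 50 pH values, the expected number of occurrences of a virtual species in a data set of n plots is (n/50) Σ p(xi). The sketch below, with hypothetical function names, computes this quantity and the number of plots needed to expect a given number of occurrences. It is a back-of-the-envelope aid under the balanced design of Section 2.3, not part of the original analysis.

```python
import math

# Same 50 pH values as in the simulation design: 3.1, 3.2, ..., 8.0
PH_VALUES = [round(3.1 + 0.1 * i, 1) for i in range(50)]


def presence_probability(x, opt, tol, pmax):
    """p(x) from Eq. (1), with a = logit(pmax)."""
    a = math.log(pmax / (1.0 - pmax))
    return 1.0 / (1.0 + math.exp(-(a - (x - opt) ** 2 / (2.0 * tol ** 2))))


def expected_occurrences(n_plots, opt, tol, pmax):
    """Expected number of presences when n_plots are spread evenly over the grid."""
    per_column = n_plots / len(PH_VALUES)
    return per_column * sum(
        presence_probability(x, opt, tol, pmax) for x in PH_VALUES
    )


def plots_needed(min_occurrences, opt, tol, pmax):
    """Smallest number of plots whose expected occurrences reach the threshold."""
    mean_frequency = expected_occurrences(1, opt, tol, pmax)
    return math.ceil(min_occurrences / mean_frequency)
```

For the curve with opt = 4, tol = 1.5 and pmax = 0.20, `expected_occurrences(5000, 4, 1.5, 0.20)` can be compared with the mean of 595.8 occurrences reported in the caption of Fig. 3 for 5000 plots.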

Fig. 3 – Simulated response curves for the theoretical curve with opt = 4, tol = 1.5, and pmax = 0.20, with varying sample size (50, 100, 200, 500, 1000 and 5000 plots with a mean number of occurrences equal to 5.8, 11.1, 24.1, 59.0, 118.4 and 595.8, respectively). For all sample sizes, the total number of simulated models is 100 and the number of flat response curves is 59, 33, 4, 0, 0, 0 with increasing size, the number of decreasing sigmoidal response curves is 21, 44, 56, 24, 6, 0 and the number of bell-shaped unimodal response curves is 20, 23, 40, 76, 94, 100.

Fig. 4 – Difference, for the 53,637 models with a significant reaction, between computed and theoretical optimum vs. (a) sample size; (b) species occurrence; (c) species frequency. For each value of the abscissa, the points inside the 95% quantile interval are represented as black points, and the points outside are represented as grey crosses.

3.3. Quality of computed models and response curve characteristics

The percentage of acceptable models (absolute difference between computed and theoretical optimum strictly less than 1 pH unit, and absolute relative difference between computed and theoretical amplitude or maximum probability less than 0.25) depends on sample size and on the characteristics of the theoretical response curve (Fig. 6). For small data sets (Fig. 6a), the percentage of acceptable models was highly dependent on pmax: lower values of pmax led to a lower proportion of acceptable models, and no acceptable model was found for a data set size of 100 observations and pmax equal to 0.05, whatever the values of opt and amp. For large data sets (Fig. 6b), the percentage of acceptable models was very high for all types of response curves, except for species presenting low pmax values (i.e. 0.05). pmax thus had the strongest influence on the percentage of acceptable models. The position of the optimum also had a strong influence on the percentage of acceptable models, with greater values for intermediate values of the optimum. Indeed, when a species theoretical optimum lies near an extreme of the gradient, species occurrences are encountered in a limited portion of the gradient and are often not sufficient to result in a significant unimodal response curve, but rather often in an apparent sigmoidal curve (see Fig. 3). For example, among the 9900 computed models with a theoretical optimum equal to 4, 7519 models resulted in a unimodal curve, while 1542 and 839 models resulted in a decreasing sigmoidal and flat response curve, respectively. When a species theoretical optimum lies at the centre of the gradient, however, species occurrences are

encountered along a larger portion of the gradient and often result in a significant unimodal response curve. The influence of amplitude on the percentage of acceptable models was most pronounced for species with intermediate values for theoretical optimum. For example, among the 9900 computed models with a theoretical optimum equal to 5, 8303 models resulted in a correct unimodal curve, while 200 and 1397 models resulted in a decreasing sigmoidal and flat response curve, respectively. The larger amplitudes (tol = 1.5) coupled with the intermediate values of optimum led to species occurrences that were scattered regularly along the gradient, thus resulting in more flat response curves (Fig. 6). The easiest species to model, requiring smaller data sets (Table 2), were frequent species (pmax = 0.50) with intermediate positions for the optimum (opt = 4, 5, 6, or 7), and lower values for amplitude (tol = 0.5 or 1.0). The most difficult species to model, requiring larger data sets (Table 2), were non-frequent species (pmax = 0.05), with the optimum lying at either extreme of the gradient (opt = 3 or 8), and with wide amplitude (tol = 1.5). The number of occurrences was the most discriminative variable for the acceptability of models, with model success being an increasing function of this number (see the positive coefficient associated with this variable in Table 3). Model success was also an increasing function of pmax, but curvilinear with


Fig. 5 – Quality of simulated response curves for low species occurrences, demonstrated by (a) difference between theoretical and computed optimum; (b) relative difference between theoretical and computed amplitude; (c) relative difference between theoretical and computed maximum probability of the curve.

opt and amp (Table 3), which confirmed previous observations (see Fig. 6).

3.4. Application to real data

All 50 computed species response curves are acceptable representations of the species real behaviour towards soil pH in France (Fig. 7a). For the smaller data set (100 observations), the minimum number of occurrences is 4 for Galeopsis tetrahit. The computation of response curves for those same 50 species, but with only 100 observations randomly extracted from the larger data set, illustrates two biases caused by the logistic regression technique (Fig. 7b): (i) a tendency to predict an absence of reaction for many species with an intermediate pH optimum and a wide amplitude (e.g., G. tetrahit, Athyrium filix-femina, Milium effusum, Dryopteris filix-mas) and (ii) a tendency to predict

Table 2 – Required sample size (number of plots) to obtain 95% of acceptable models (see text for definition and Fig. 6)

| pmax | tol | opt (value not recoverable) | opt = 8 |
|---|---|---|---|
| 0.05 | 0.5 | 5000 | >5000 |
| 0.05 | 1.0 | 5000 | >5000 |
| 0.05 | 1.5 | 5000 | >5000 |
| 0.20 | 0.5 | 2000 | 5000 |
| 0.20 | 1.0 | 1000 | 5000 |
| 0.20 | 1.5 | 1500 | 5000 |
| 0.50 | 0.5 | 300 | 2000 |
| 0.50 | 1.0 | 300 | 1000 |
| 0.50 | 1.5 | 750 | 1000 |

Table 3 – Summary of the logistic regression model (logit link and binomial error distribution) linking the acceptable/not acceptable model (1/0), the number of occurrences, and the three characteristics of theoretical response curves (opt, amp, pmax)

| Predictor | Reduction in deviance | p() | Coefficient |
|---|---|---|---|
| Null | – | | −8.118409 |
| Occurrence | 33936.59 | | 0.0489191 |
| pmax | 975.12 | | 2.368441 |
| amp | 319.27 | | 0.3800122 |
| amp² | 239.63 | | −0.2749499 |
| opt | 0.15 | | 2.529315 |
| opt² | 1728.75 | | −0.2296737 |

Regression based on 59,400 simulations; null deviance is 81,791.51.
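To show how the coefficients of Table 3 combine, the sketch below evaluates the fitted logit for a given number of occurrences and set of curve characteristics. It rests on assumptions that should be checked against the original table: the coefficient on the "Null" row is read as the intercept, the predictors are taken as untransformed, and amp is expressed in pH units. Names are hypothetical.

```python
import math

# Coefficients as read from Table 3 (assumed mapping of values to predictors).
COEFFICIENTS = {
    "intercept": -8.118409,
    "occurrence": 0.0489191,
    "pmax": 2.368441,
    "amp": 0.3800122,
    "amp2": -0.2749499,
    "opt": 2.529315,
    "opt2": -0.2296737,
}


def probability_acceptable(occurrences, pmax, amp, opt, coef=COEFFICIENTS):
    """Predicted probability that a computed model is acceptable."""
    eta = (
        coef["intercept"]
        + coef["occurrence"] * occurrences
        + coef["pmax"] * pmax
        + coef["amp"] * amp
        + coef["amp2"] * amp ** 2
        + coef["opt"] * opt
        + coef["opt2"] * opt ** 2
    )
    return 1.0 / (1.0 + math.exp(-eta))


def most_favourable_optimum(coef=COEFFICIENTS):
    """Vertex of the quadratic term in opt, where model success peaks."""
    return -coef["opt"] / (2.0 * coef["opt2"])
```

Under this reading, the quadratic term in opt peaks close to pH 5.5, the centre of the simulated gradient, which agrees with the statement that species with intermediate optima are the easiest to model.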


Fig. 6 – Number of acceptable simulated models (see text for definition) for two sample sizes: (a) 100 plots; (b) 1000 plots.

an extreme optimum when the true optimum lies near an extreme of the gradient (e.g., Calluna vulgaris and Lonicera periclymenum at the lower end of the pH gradient, and Ligustrum vulgare, Cornus sanguinea, Carex flacca and Evonymus europaeus at the upper end of the pH gradient).

4. Discussion

Species number of occurrences is an important feature in order to derive acceptable ecological response curves with logistic regression, and we suggest a general minimum value of 50 occurrences to expect good results for species. Indeed, logistic regression modelling seems to be relatively weak for accurate determination of curve characteristics for rare species, and ter Braak and Looman (1986) stated that even weighted averaging could give an acceptable estimate of rare species' optimum. Important research work has been carried out on the shape of response curves to determine whether species presented symmetric or asymmetric responses (Austin and Nicholls, 1997; Oksanen, 1997; Karadzic et al., 2003; Rydgren et al., 2003), but our interest in this study was to check whether a known ecological response could be rebuilt, with different sample sizes and characteristics of response curves.

We think that a better understanding of species response curves, and of species ecological behaviour, is gained by increasing data set size. Indeed, knowing the minimum value for species occurrence is very important when dealing with rare species (Wiser et al., 1998; Pearce and Ferrier, 2000a, 2000b; Elith and Burgman, 2002; Edwards et al., 2004). Most forest plant species integrated in the database EcoPlant are found in fewer than 50 observations (Gégout et al., 2005) and lowering this threshold would allow the treatment of many more species (Stockwell and Peterson, 2002a). On the other hand, more occurrences do not always significantly increase the accuracy in determining the curve characteristics, and a maximum value between 500 and 1000 for species number of occurrences has been suggested (Virtanen et al., 1998; Stockwell and Peterson, 2002b).

The importance of frequency of occurrence depends on the position of the species niche (ecological optimum) along the gradient. When a species occurs over a short range at the limits of the gradient, a linear or quadratic regression equation will provide a poor fit. Both with simulated and real data, we illustrated that some species with a true ecological optimum at the centre of the gradient could lead to falsely flat response curves when the amplitude is too wide, resulting in fewer species with an apparent intermediate optimum


Fig. 7 – Ecological optimum and amplitude of the 50 most frequent plant species present in EcoPlant, phytoecological database for forests from France, calculated on the basis of: (a) 3810 observations; (b) 100 observations.


(Fig. 7). This is probably due to the use of the Gaussian formula for simulating the data, which can generate non-zero values unless the data are truncated to a certain number of significant digits. This, in turn, might influence the fitting of quadratic functions. Removal of zero values well beyond the species range before fitting might improve the success of the model (Austin et al., 1994; Austin and Meyers, 1996; Austin and Nicholls, 1997). Similarly, some species with a true optimum lying near an extreme of a gradient could lead to falsely sigmoidal response curves. A technical solution to this problem could be to keep the quadratic term in logistic regression equations each time the linear term is significant. A practical solution would be to increase sampling intensity near the ends of the gradient, confirming early observations from Mohler (1983). There is a need for more knowledge on the shape of species response curves, as well as on the relationships between species response curves and sampling characteristics, in order to develop (theoretical) vegetation science (Austin et al., this issue). Our results help clarify many of the problems for a common type of data analysis. The importance of artificial data sets has been stressed (Austin et al., this issue), but we also had access to a high-quality data set with regard to the number of samples (Gégout et al., 2005), which made it possible to examine the effect of important sampling characteristics (see Edwards et al., this issue). Even though the subjective sampling inherent in the EcoPlant data may be of modelling concern, data sets of that size sampled by more objective statistical methods are extremely rare, if they occur at all (Coudun, 2005). EcoPlant has been especially designed to investigate plant species response to both edaphic and climatic factors and has been used recently at regional (Coudun and Gégout, 2005) and national scales in France (Coudun et al., in press). One of the most challenging perspectives might then be to investigate the ecological response of some forest plant species across their whole geographic range within continental Europe.
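The fitting problem discussed above can be explored with a minimal sketch that simulates presence/absence data along a pH gradient and fits a Gaussian (quadratic) logistic regression. Several points are assumptions made for illustration and are not taken from the paper: the data are simulated from the Gaussian logit form of ter Braak and Looman (1986), plots are spread uniformly along the gradient, the model is fitted by plain Newton-Raphson, and all function names (`simulate`, `fit_gaussian_logit`, `curve_parameters`, `solve3`) are hypothetical. The optimum and tolerance are recovered from the fitted coefficients as opt = -b1/(2 b2) and tol = 1/sqrt(-2 b2).

```python
import math
import random


def solve3(a, b):
    """Solve the 3x3 linear system a x = b by Gauss-Jordan elimination with pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


def simulate(opt, tol, pmax, n_plots, ph_min, ph_max, seed):
    """Draw binary presences from a Gaussian logit curve (assumed simulation model)."""
    rng = random.Random(seed)
    a = math.log(pmax / (1.0 - pmax))  # logit of the maximum probability
    x, y = [], []
    for _ in range(n_plots):
        ph = rng.uniform(ph_min, ph_max)  # assumption: uniform sampling of the gradient
        p = 1.0 / (1.0 + math.exp(-(a - 0.5 * ((ph - opt) / tol) ** 2)))
        x.append(ph)
        y.append(1 if rng.random() < p else 0)
    return x, y


def fit_gaussian_logit(x, y, n_iter=25):
    """Fit logit(p) = b0 + b1*z + b2*z**2, with z = x - mean(x), by Newton-Raphson."""
    centre = sum(x) / len(x)  # centring the gradient keeps the 3x3 system well conditioned
    rows = [(1.0, xi - centre, (xi - centre) ** 2) for xi in x]
    beta = [0.0, 0.0, 0.0]
    for _ in range(n_iter):
        grad = [0.0] * 3
        hess = [[0.0] * 3 for _ in range(3)]
        for row, yi in zip(rows, y):
            eta = sum(b * r for b, r in zip(beta, row))
            p = 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, eta))))
            w = p * (1.0 - p)
            for j in range(3):
                grad[j] += (yi - p) * row[j]
                for k in range(3):
                    hess[j][k] += w * row[j] * row[k]
        step = solve3(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return centre, beta


def curve_parameters(centre, beta):
    """Convert coefficients to (opt, tol, pmax); return None if the fit has no maximum."""
    b0, b1, b2 = beta
    if b2 >= 0:
        return None  # sigmoidal or U-shaped fit: optimum and tolerance are undefined
    opt = centre - b1 / (2.0 * b2)
    tol = 1.0 / math.sqrt(-2.0 * b2)
    pmax = 1.0 / (1.0 + math.exp(-(b0 - b1 ** 2 / (4.0 * b2))))
    return opt, tol, pmax


x, y = simulate(opt=6.0, tol=1.0, pmax=0.5, n_plots=2000,
                ph_min=3.0, ph_max=9.0, seed=1)
centre, beta = fit_gaussian_logit(x, y)
print(curve_parameters(centre, beta))
```

Moving `opt` towards `ph_min` or `ph_max`, or lowering `n_plots` and `pmax`, gives a simple way to explore how truncated curves and sparse presences affect the recovered optimum and tolerance. The sketch does not test the significance of the quadratic term, which a full analysis would need.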

Acknowledgements

The authors wish to thank Gretchen Moisen, Patrick Osborne and Thomas Edwards for their editorial work, as well as three anonymous reviewers for their appropriate comments on earlier drafts of the manuscript. This study was financed through grants to Christophe Coudun by the French National Forest Office (ONF) and the Lorraine Regional Council (CR Lorraine). EcoPlant is a phytoecological database financed by the French Institute of Agricultural, Forest and Environmental Engineering (ENGREF), the French Ministry of Agriculture (DERF), the French National Forest Office (ONF), and the French Agency for Environment and Energy Management (ADEME).

references

Araújo, M.B., Williams, P.H., Fuller, R.J., 2002. Dynamics of extinction and the selection of nature reserves. Proc. R. Soc. Lond. 269, 1971–1980.

Araújo, M.B., Cabeza, M., Thuiller, W., Hannah, L., Williams, P.H., 2004. Would climate change drive species out of reserves? An assessment of existing reserve-selection methods. Glob. Change Biol. 10, 1618–1626.
Austin, M.P., Nicholls, A.O., Margules, C.R., 1990. Measurement of the realised qualitative niche: environmental niches of five Eucalyptus species. Ecol. Monogr. 60, 161–177.
Austin, M.P., Nicholls, A.O., Doherty, M.D., Meyers, J.A., 1994. Determining species response functions to an environmental gradient by means of a β-function. J. Veg. Sci. 5, 215–228.
Austin, M.P., Meyers, J.A., 1996. Current approaches to modelling the environmental niche of eucalypts: implication for management of forest biodiversity. For. Ecol. Manage. 85, 95–106.
Austin, M.P., Nicholls, A.O., 1997. To fix or not to fix the species limits, that is the ecological question: response to Jari Oksanen. J. Veg. Sci. 8, 743–748.
Austin, M.P., 1998. An ecological perspective on biodiversity investigations: examples from Australian Eucalypt forests. Ann. Miss. Bot. Gard. 85, 2–17.
Austin, M.P., Cawsey, E.M., Baker, B.L., Yaleloglou, M.M., Grice, D.J., Briggs, S.V., Barry, S., Doherty, M.D., Gallant, J., Lehmann, A., 2000. Predicted vegetation cover in the Central Lachlan region. Final report of the Natural Heritage Trust Project AA 1368.97. CSIRO Wildlife and Ecology, Canberra.
Austin, M.P., Belbin, L., Meyers, J.A., Doherty, M.D., Luoto, M., this issue. Evaluation of statistical models used for predicting plant species distributions: role of artificial data and theory. Ecol. Model.
Bakkenes, J., Alkemade, J.R.M., Ihle, F., Leemans, R., Latour, J., 2002. Assessing the effects of forecasted climate change on the diversity and distribution of European higher plants for 2050. Glob. Change Biol. 8, 390–407.
Bio, A.M.F., Alkemade, R., Barendregt, A., 1998. Determining alternative models for vegetation response analysis: a non parametric approach. J. Veg. Sci. 9, 5–16.
Bio, A.M.F., De Becker, P., De Bie, E., Huybrechts, W., Wassen, M.J., 2002. Prediction of plant species distribution in lowland river valleys in Belgium: modelling species response to site conditions. Biodiv. Conserv. 11, 2189–2216.
Bragazza, L., Gerdol, R., 1996. Response surfaces of plant species along water-table depth and pH gradients in a poor mire on the southern Alps (Italy). Ann. Bot. Fenn. 33, 11–20.
Brotons, L., Thuiller, W., Araújo, M.B., Hirzel, A.H., 2004. Presence-absence versus presence-only modelling methods for predicting bird habitat suitability. Ecography 27, 437–448.
Cawsey, E.M., Austin, M.P., Baker, B.L., 2002. Regional vegetation mapping in Australia: a case study in the practical use of statistical modelling. Biodiv. Conserv. 11, 2239–2274.
Coudun, Ch., 2005. Approche quantitative de la réponse écologique des espèces végétales forestières à l’échelle de la France. Ph.D. dissertation. French Institute for Forestry, Agricultural and Environmental Engineering, Nancy, France, 128 pp.
Coudun, Ch., Gégout, J.-C., 2005. Ecological behaviour of herbaceous forest species along the pH gradient: a comparison between oceanic and semi-continental regions in northern France. Glob. Ecol. Biogeogr. 14, 263–270.
Coudun, Ch., Gégout, J.-C., Piedallu, C., Rameau, J.-C., in press. Soil nutritional factors improve plant species distribution models: an illustration with Acer campestre (L.) in France. J. Biogeogr., doi:10.1111/j.1365-2699.2005.01443.x.
Diekmann, M., 2003. Species indicator values as an important tool in applied plant ecology: a review. Basic Appl. Ecol. 4, 493–506.

Dirnböck, T., Dullinger, S., Grabherr, G., 2003. A regional impact assessment of climate and land use change on alpine vegetation. J. Biogeogr. 30, 1–17.
Dirnböck, T., Dullinger, S., 2004. Habitat distribution models, spatial autocorrelation, functional traits and dispersal capacity of alpine plant species. J. Veg. Sci. 15, 77–84.
Edwards Jr., T.C., Cutler, D.R., Geiser, L., Alegria, J., McKenzie, D., 2004. Assessing rarity of species with low detectability: lichens in Pacific Northwest forests. Ecol. Appl. 14, 414–424.
Edwards Jr., T.C., Cutler, D.R., Zimmermann, N.E., Geiser, L., Moisen, G.G., this issue. Effects of underlying sample survey designs on the utility of classification tree models in ecology. Ecol. Model.
Elith, J., Burgman, M.A., 2002. Predictions and their validation: rare plants in the Central Highlands, Victoria, Australia. In: Scott, J.M., Heglund, P.J., Morrison, M.L., Haufler, J.B., Raphael, M.G., Wall, W.A., Samson, F.B. (Eds.), Predicting Species Occurrences: Issues of Accuracy and Scale. Island Press, Washington, DC, pp. 303–313.
Gégout, J.-C., Pierrat, J.-C., 1998. L’autécologie des espèces végétales: une approche par régression non paramétrique. Ecologie 29, 473–482.
Gégout, J.-C., 2001. Création d’une base de données phytoécologiques pour déterminer l’autécologie des espèces de la Flore Forestière de France. Rev. For. Fr. 53, 397–403.
Gégout, J.-C., Krizova, E., 2003. Comparison of indicator values of understory species in Western Carpathians (Slovakia) and Vosges Mountains (France). For. Ecol. Manage. 182, 1–11.
Gégout, J.-C., Hervé, J.-C., Houllier, F., Pierrat, J.-C., 2003. Prediction of forest soil nutrient status using vegetation. J. Veg. Sci. 14, 55–62.
Gégout, J.-C., Coudun, Ch., Bailly, G., Jabiol, B., 2005. EcoPlant: a forest sites database linking floristic data with soil characteristics and climatic conditions. J. Veg. Sci. 16, 257–260.
Guisan, A., Weiss, S.B., Weiss, A.D., 1999. GLM versus CCA spatial modeling of plant species distribution. Plant Ecol. 143, 107–122.
Guisan, A., Theurillat, J.-P., 2000. Assessing alpine plant vulnerability to climate change: a modeling perspective. Integr. Asses. 1, 307–320.
Guisan, A., Edwards Jr., T.C., Hastie, T.J., 2002. Generalized linear and generalized additive models in studies of species distributions: setting the scene. Ecol. Model. 157, 89–100.
Hastie, T.J., Tibshirani, R., 1997. Generalized Additive Models. Chapman and Hall, London.
Heegaard, E., Birks, H.J.B., Gibson, C.E., Smith, S.J., Wolfe-Murphy, S., 2001. Species-environmental relationships of aquatic macrophytes in Northern Ireland. Aquat. Bot. 70, 175–223.
Hill, M.O., Mountford, J.O., Roy, D.B., Bunce, R.G.H., 1999. Ellenberg’s indicator values for British plants. Technical Annex to Volume 2 of the ECOFACT Research Report Series. Center of Ecology and Hydrology (CEH), Natural Environment Research Council, pp. 1–46.
Hill, M.O., Roy, D.B., Mountford, J.O., Bunce, R.G.H., 2000. Extending Ellenberg’s indicator values to a new area: an algorithmic approach. J. Appl. Ecol. 37, 3–15.
Huisman, J., Olff, H., Fresco, L.F.M., 1993. A hierarchical set of models for species response analysis. J. Veg. Sci. 4, 37–46.
Karadzic, B., Marinkovic, S., Katarinovski, D., 2003. Use of the beta-function to estimate the skewness of species responses. J. Veg. Sci. 14, 799–805.
Karl, J.W., Svancara, L.K., Heglund, P.J., Wright, N.M., Scott, J.M., 2002. Species commonness and the accuracy of habitat-relationship models. In: Scott, J.M., Heglund, P.J., Morrison, M.L., Haufler, J.B., Raphael, M.G., Wall, W.A., Samson, F.B. (Eds.), Predicting Species Occurrences: Issues of Accuracy and Scale. Island Press, Washington, DC, pp. 573–580.

Lawesson, J.E., Oksanen, J., 2002. Niche characteristics of Danish woody species as derived from coenoclines. J. Veg. Sci. 13, 279–290.
Lehmann, A., Overton, J.McC., Austin, M.P., 2002a. Regression models for spatial prediction: their role for biodiversity and conservation. Biodiv. Conserv. 11, 2085–2092.
Lehmann, A., Leathwick, J.R., Overton, J.McC., 2002b. Assessing New Zealand fern diversity from spatial predictions of species assemblages. Biodiv. Conserv. 11, 2217–2238.
Manel, S., Williams, H.C., Ormerod, S.J., 2001. Evaluating presence-absence models in ecology: the need to account for prevalence. J. Appl. Ecol. 38, 921–931.
MathSoft Inc., 1999. S-Plus 2000, Programmer’s Guide. Seattle, Washington, USA, MathSoft Inc.
McCullagh, P., Nelder, J.A., 1997. Generalized Linear Models. Chapman and Hall, London, UK.
McKenney, D.W., Venier, L.A., Heerdegen, A., McCarthy, M.A., 2002. A Monte Carlo experiment for species mapping problems. In: Scott, J.M., Heglund, P.J., Morrison, M.L., Haufler, J.B., Raphael, M.G., Wall, W.A., Samson, F.B. (Eds.), Predicting Species Occurrences: Issues of Accuracy and Scale. Island Press, Washington, DC, pp. 377–381.
McPherson, J.M., Jetz, W., Rogers, D.J., 2004. The effects of species’ range sizes on the accuracy of distribution models: ecological phenomenon or statistical artefact? J. Appl. Ecol. 41, 811–823.
Mohler, C.L., 1983. Effect of sampling pattern on estimation of species distributions along gradients. Vegetatio 54, 97–102.
Odland, A., Birks, H.J.B., Line, J.M., 1995. Ecological optima and tolerances of Thelypteris limbosperma, Athyrium distentifolium, and Matteuccia struthiopteris along environmental gradients in Western Norway. Vegetatio 120, 115–129.
Oksanen, J., 1997. Why the beta-function cannot be used to estimate skewness of species responses. J. Veg. Sci. 8, 147–152.
Oksanen, J., Minchin, P.R., 2002. Continuum theory revisited: what shape are species responses along ecological gradients? Ecol. Model. 157, 119–129.
Pearce, J., Ferrier, S., 2000a. An evaluation of alternative algorithms for fitting species distribution models using logistic regression. Ecol. Model. 128, 127–147.
Pearce, J., Ferrier, S., 2000b. Evaluating the predictive performance of habitat models developed using logistic regression. Ecol. Model. 133, 225–245.
Roy, D.B., Hill, M.O., Rothery, P., Bunce, R.G.H., 2000. Ecological indicator values of British species: an application of Gaussian logistic regression. Ann. Bot. Fenn. 37, 219–226.
Rydgren, K., Økland, R.H., Økland, T., 2003. Species response curves along environmental gradients. A case study from SE Norwegian swamp forests. J. Veg. Sci. 14, 869–880.
Scott, J.M., Heglund, P.J., Samson, F., Haufler, J., Morrison, M., Raphael, M., Wall, B., 2002. Predicting Species Occurrences: Issues of Accuracy and Scale. Island Press, Covelo, California.
Segurado, P., Araújo, M.B., 2004. An evaluation of methods for modelling species distributions. J. Biogeogr. 31, 1–14.
Stockwell, D.R.B., Peterson, A.T., 2002a. Effects of sample size on accuracy of species distribution models. Ecol. Model. 148, 1–13.
Stockwell, D.R.B., Peterson, A.T., 2002b. Controlling bias in biodiversity data. In: Scott, J.M., Heglund, P.J., Morrison, M.L., Haufler, J.B., Raphael, M.G., Wall, W.A., Samson, F.B. (Eds.), Predicting Species Occurrences: Issues of Accuracy and Scale. Island Press, Washington, DC, pp. 537–546.
ter Braak, C.J.F., Looman, C.W.N., 1986. Weighted averaging, logistic regression and the Gaussian response model. Vegetatio 65, 3–11.
ter Braak, C.J.F., 1996. Unimodal Models to Relate Species to Environment. Agricultural Mathematics Group.

Virtanen, A., Kairisto, V., Uusipaikka, E., 1998. Regression-based reference limits: determination of sufficient sample size. Clin. Chem. 44, 2353–2358.
Wiser, S., Peet, R.K., White, P.S., 1998. Prediction of rare-plant occurrence: a southern Appalachian example. Ecol. Appl. 8, 909–920.

Yee, T.W., Mitchell, N.D., 1991. Generalized additive models in plant ecology. J. Veg. Sci. 2, 587–602.
Zaniewski, A.E., Lehmann, A., Overton, J.McC., 2002. Predicting species spatial distributions using presence-only data: a case study of native New Zealand ferns. Ecol. Model. 157, 261–280.