Multiple Temporal Cluster Detection - Nicolas Molinari

BIOMETRICS 57, 577-583 June 2001

Multiple Temporal Cluster Detection

Nicolas Molinari,* Christophe Bonaldi, and Jean-Pierre Daurès

Laboratoire de Biostatistique, Institut Universitaire de Recherche Clinique, 641 Avenue Gaston Giraud, 34093 Montpellier, France

* email: [email protected]

SUMMARY. This article proposes a simple method to detect single or multiple temporal clusters in a population of variable size. Through a transformation of the data set, the method, which is based on a regression model, allows consideration of a variable population size during the time of study. A model selection procedure and a resampling method are used to select the number of clusters. The results have applications in epidemiological studies of rare diseases.

KEY WORDS: Model selection; Nearest neighbor; Regression method; Resampling method.

1. Introduction

In epidemiological studies, when the etiology of a disease has not yet been well established, it is sometimes required to examine data for evidence of temporal clustering or of cyclical clustering, as in seasonal variations. Let $X_1, \ldots, X_N$ be independent and identically distributed (i.i.d.) random variables that denote the times of occurrence of $N$ events in an interval $(0, T)$. We wish to test the null hypothesis that the events are uniformly distributed against the alternative that they cluster within some subintervals of $(0, T)$.

Ederer, Myers, and Mantel (1964) developed a test for temporal clustering using a cell-occupancy approach. They divided the time period into disjoint subintervals. The test statistic is simply the number of cases occurring in a subinterval. Under the no-clustering hypothesis, the $N$ cases are randomly distributed among the subintervals. However, the resulting chi-square test (used to test the multinomial distribution) does not yield an efficient method.

Consider the following hypothetical example. Suppose we observe a rare disease during 1 year in a little town, say Clusterville. The number of known events is $N = 42$ for the whole year. The study starts the first day and the first event occurs on the 11th day. Every 10 days, we observe another event except from day 181 to day 241, when one event occurs every 5 days. The study stops at day 365. It is clear that [181, 241] is a time window with clustered events.

Tango (1984) proposes a test of temporal clustering based on the distribution of counts in disjoint equal time intervals. Whittemore and Keller (1986) showed that the distribution of Tango's index is asymptotically normal. Applying this procedure to Clusterville with six time intervals does not allow rejection of the null hypothesis of uniformity (p = 0.2). Note that the cluster interval [181, 241] does not match the interval used with Tango's index, which does not contain the event occurring at day 181.

Naus (1965) introduced a test known as the scan test. The test statistic, the maximum number of cases observed in an interval of length $t$, is found by scanning all intervals of length $t$ in the time period. Statistical significance of the scan test is assessed by using tables of p-values computed by Naus (1966) and Wallenstein (1980) for selected interval lengths, time lengths, and sample sizes. Weinstock (1981) proposed a generalization of the scan test that is adjusted for changes among the population at risk. Unfortunately, with the simulated example, the scan test does not provide a significant statistic to reject the uniformity hypothesis (with six subintervals, p = 0.38).

An efficient method for detecting temporal clustering is proposed by Kulldorff and Nagarwalla (1995). With the scan statistic with a variable window, the cluster time window size does not need to be chosen a priori. This test is the generalized likelihood ratio test for a uniform null distribution against an alternative of nonrandom clustering. Bootstrapped simulations are performed to carry out the significance test. For the example, we obtain p = 0.09 with 1000 simulations, and we fixed the minimal number of points at five. The test only considers clusters that contain five or more points (Nagarwalla, 1996). An extension of this method is presented by Kulldorff (1997). The scan statistic with a variable window is used for detecting disease clusters in heterogeneous populations. He introduced a spatial scan statistic for the detection of clusters not explained by the baseline process in heterogeneous populations.

Larsen, Holmes, and Heath (1973) developed a rank-order procedure. The time period is divided into disjoint subintervals that are numbered sequentially. The test statistic is the sum of absolute differences between the rank of the subinterval in which a case occurred and the median subinterval rank. This test is sensitive only to unimodal clustering and cannot distinguish between multiple clustering and randomness.
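The Clusterville example and the fixed-window scan statistic described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: the function names `clusterville_days` and `scan_statistic` are hypothetical, and the window is assumed to be a closed interval that starts at an observed event, which is enough to find the maximum count. The sketch computes only the test statistic, not the tabulated p-values.

```python
def clusterville_days():
    """Event days of the hypothetical Clusterville example (assumed layout)."""
    before = list(range(11, 182, 10))   # days 11, 21, ..., 181: one event every 10 days
    during = list(range(186, 242, 5))   # days 186, 191, ..., 241: one event every 5 days
    after = list(range(251, 366, 10))   # days 251, 261, ..., 361: one event every 10 days
    return before + during + after


def scan_statistic(days, window):
    """Maximum number of events in any closed interval [s, s + window].

    The maximum is attained by a window starting at an event, so only
    those starting points are examined (two-pointer sweep).
    """
    days = sorted(days)
    best = 0
    right = 0
    for left, start in enumerate(days):
        # Advance the right pointer past every event inside the window.
        while right < len(days) and days[right] <= start + window:
            right += 1
        best = max(best, right - left)
    return best


days = clusterville_days()
print(len(days))                 # number of events N
print(scan_statistic(days, 60))  # largest count in a 60-day window
```

With a 60-day window, the maximum is reached on [181, 241], the window identified in the text as containing the clustered events.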


Huntington and Naus (1975), Cressie (1977), and Hwang (1977) derived and then Naus (1982) accurately approximated the probability of at least one cluster. Barton and David (1956) found the distribution of the number of clusters of size two. McClure (1976) obtained asymptotic results for the distribution of the number of clusters of a given size. Glaz and Naus (1983) established the expectation, variance, and approximate distribution of the number of clusters of a given size.

With rare diseases, a long study period is necessary to examine data for evidence of temporal clustering. The problem is that, in this case, the population at risk changes over time. Owing to natural growth or to seasonal variation, the population at risk is not constant during the time of study.

In the next section, we present a new method for detecting clustering in data. Based on a simple transformation of the data, our method determines a time window with excess events and, for any position of the window, it scans continuously across the period of observation. Moreover, the method remains effective when the population at risk changes. Existence of one or more clusters is determined by using bootstrapped simulations and a classical model selection procedure.

The regression method is explored using simulations that allow for an examination of its properties and also on the classical Knox data set. Another data set consists of 62 spontaneous hemoptysis admissions (pulmonary disease) at Nice hospital from January 1 to December 31, 1995. Detecting periods of significant cluster occurrences provides valuable information on the disease. The purpose of this investigation is to adapt conditions of admission or treatment of predisposed patients during periods favorable to the disease. Another objective is to point out potential climatic factors, like temperature or hydrometry, that influence the disease occurrence.
Nevertheless, since Nice is situated in the south of France, a large number of tourists increase the population at risk each summer. An estimate of this population is used in our model for detecting clusters.

2. Method Presentation

The approach is first based on a transformation of the data set in order to produce values corresponding to the time (the distance) between two successive events. Under the no-clustering hypothesis, these values can be estimated by a constant, i.e., the mean distance. In contrast, when clusters are present, a piecewise constant model improves the fit. A classical criterion for selecting models allows determination of the presence of clusters. Statistical tests for cluster detection must have a correct nominal $\alpha$ level. Since the proposed method is not a conventional statistical test, we propose using bootstrapped samples to obtain a p-value and to compare its performance with those of existing statistical tests. At the end of this section, we propose a simple transformation of the data set that considers changes in the population at risk.

2.1 Data Transformation

Let $X_1, \ldots, X_N$ be defined as in the Introduction. Without loss of generality, set $T = 1$ throughout this section. Suppose that $X_1, \ldots, X_N$ are dropped at random in the unit interval $(0, 1)$. As indicated in Figure 1, denote the ordered distances of these points from the origin by $x_{(i)}$ ($i = 1, \ldots, N$) and set $y_i = x_{(i)} - x_{(i-1)}$ (with $x_{(0)} = 0$). Assuming that the $X_i$'s are i.i.d. uniform $U(0, 1)$, the random variables $X_{(1)}, \ldots, X_{(N)}$ are then distributed as $N$ order statistics from a uniform $U(0, 1)$ parent, i.e., $X_{(i)}$ follows a beta distribution $\beta(i, N - i + 1)$ and $Y_i = X_{(i)} - X_{(i-1)}$ has a beta distribution $\beta(1, N)$ (see David, 1980).

Figure 1. Random division of an interval. [The interval is divided at the ordered points $0 = x_{(0)}, x_{(1)}, x_{(2)}, \ldots, x_{(N-1)}, x_{(N)}$, with spacings $y_1, y_2, \ldots, y_N$.]

A slightly efficient method for detecting nonrandom clusters of points on a line is, e.g., to verify, using a Kolmogorov-Smirnov test, that the $y_i$'s have a beta distribution $\beta(1, N)$. This method is equivalent to testing the assumption of uniformity of the $X_i$'s. In the case where a cluster is present, the test does not provide the time window with excess events. In the next section, we present our method for detecting, according to the $Y_i$ values, the cluster's presence and also for determining its range.

2.2 Data Fitting

Let $(x_1, \ldots, x_N)$ be a sample of $X$ and $(y_1, \ldots, y_N)$ be the corresponding sample of $Y = (Y_1, \ldots, Y_N)$ defined as in the previous section. Consider the data set $(i, y_i)_{i = 1, \ldots, N}$. Under the no-clusters hypothesis, an appropriate regression on this data set is the constant function

$$f(i) = \bar{y} = \frac{1}{N} \sum_{j=1}^{N} y_j.$$
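As an illustration of the transformation and of the constant fit, the following Python sketch computes the spacings $y_i$ and their mean $\bar{y}$ for the Clusterville example. It is a sketch under stated assumptions rather than the authors' implementation: the helper names `spacings` and `constant_fit` are hypothetical, and the event days are rebuilt from the description in the Introduction with $T = 365$ rescaled to 1.

```python
def spacings(times, total_time):
    """Return y_i = x_(i) - x_(i-1), with x_(0) = 0, after rescaling times to (0, 1)."""
    xs = sorted(t / total_time for t in times)
    ys = []
    previous = 0.0
    for x in xs:
        ys.append(x - previous)
        previous = x
    return ys


def constant_fit(ys):
    """Constant regression f(i) = y_bar, the mean spacing."""
    return sum(ys) / len(ys)


# Clusterville event days, as described in the Introduction (assumed layout).
days = (list(range(11, 182, 10))     # one event every 10 days up to day 181
        + list(range(186, 242, 5))   # one event every 5 days up to day 241
        + list(range(251, 366, 10))) # one event every 10 days afterwards

ys = spacings(days, total_time=365)
y_bar = constant_fit(ys)

# Spacings below the constant fit correspond to events that are closer
# together than average, here the events in the window [181, 241].
below = [i + 1 for i, y in enumerate(ys) if y < y_bar]
print(y_bar, below)
```

Because the spacings telescope, $\bar{y}$ equals $x_{(N)}/N$. For comparison, the mean of a $\beta(1, N)$ variable is $1/(N + 1)$, which is the expected spacing under the uniformity assumption.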

Figure 2 presents the regression function and the data points corresponding to the Clusterville example. Assume that events $x_{(k)}, \ldots, x_{(k+l)}$ are clustered, i.e., 1 ac=,

c

-

k+l

1

N

Yi