Gaussian Parsimonious Clustering Models Scale Invariant and Stable by Projection

Christophe Biernacki (1), Alexandre Lourme (2)
(1) Université Lille I, Modal Team, INRIA Lille - Nord Europe (France).
(2) Institut de Mathématiques de Bordeaux & Université Bordeaux IV (France).

Gaussian mixtures are commonly used for clustering continuous data. In order to reduce the variability of the general heteroscedastic model, Celeux and Govaert (1995) define geometrical parsimonious models based on a spectral decomposition of the covariance matrices. These models have been highly influential in recent years, but some of them suffer from several drawbacks. We therefore present a new family of parsimonious Gaussian mixtures based on a variance-correlation decomposition of the covariance matrices. The parsimony of these models bears on parameters with a statistical interpretation (standard deviation, correlation, coefficient of variation) rather than a geometric one (volume, orientation, shape). These new mixtures enjoy several stability properties.

New Parsimonious Models

Experiments

Introduction

In order to infer some structure in a sample x = {x_i; i = 1, ..., n} ⊂ R^d, Gaussian model-based clustering assumes that the data arise from a mixture of K normal components (see McLachlan and Peel, 2000, chap. 3):

$$f(\bullet; \psi) = \sum_{k=1}^{K} \pi_k \, \phi_d(\bullet; \mu_k, \Sigma_k).$$

Here µ_k ∈ R^d is the center of the Gaussian component k, Σ_k ∈ R^{d×d} denotes its covariance matrix and π_k its weight (π_k > 0 and π_1 + ... + π_K = 1).

Spectral decomposition of the covariance matrices

Each matrix Σ_k is symmetric positive definite, so it can be written

$$\Sigma_k = \lambda_k \, S_k \, \Lambda_k \, S_k'$$

where: (i) λ_k = |Σ_k|^{1/d} (volume of class k), (ii) S_k is an orthogonal matrix whose columns are eigenvectors of Σ_k (orientation of class k) and (iii) Λ_k is a diagonal matrix with determinant 1 and diagonal coefficients in decreasing order (shape of class k).

Parsimonious models

In order to reduce the variability of the general heteroscedastic model, Celeux and Govaert (1995), inspired by Banfield and Raftery (1993), define several parsimonious models by assuming each of the parameters λ_k, S_k and Λ_k to be either homogeneous in k or free. These models have been highly influential in recent years (see Biernacki et al. 2006; Bouveyron et al. 2007; Maugis et al. 2009), but they suffer from several drawbacks.

Several weaknesses of the geometrical models

• Some representations of the constraints are widespread but not adequate.
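The spectral decomposition described above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `spectral_decomposition` and the 2 × 2 covariance matrix are illustrative choices, not taken from the poster.

```python
import numpy as np

def spectral_decomposition(sigma):
    """Return (volume, orientation, shape) with sigma = volume * S @ shape @ S.T."""
    d = sigma.shape[0]
    eigenvalues, eigenvectors = np.linalg.eigh(sigma)  # eigh returns ascending order
    order = np.argsort(eigenvalues)[::-1]              # switch to decreasing order
    eigenvalues = eigenvalues[order]
    S = eigenvectors[:, order]                         # orientation (orthogonal matrix)
    volume = np.prod(eigenvalues) ** (1.0 / d)         # |sigma|^(1/d)
    shape = np.diag(eigenvalues / volume)              # diagonal, determinant 1
    return volume, S, shape

# Hypothetical covariance matrix, used only for illustration.
sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])
lam, S, Lam = spectral_decomposition(sigma)
reconstructed = lam * S @ Lam @ S.T
```

The eigenvectors are only defined up to sign, so `S` is one valid orientation matrix among several; the volume and the shape are uniquely determined.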

Variance-correlation decomposition of the covariance matrices

Each matrix Σ_k can also be written

$$\Sigma_k = T_k \, R_k \, T_k$$

where: (i) T_k is the diagonal matrix of conditional standard deviations (T_k(i, j) = √Σ_k(i, j) if i = j and 0 otherwise) and (ii) R_k is the associated matrix of conditional correlations (R_k = T_k^{-1} Σ_k T_k^{-1}).

Parsimonious hypotheses on statistical parameters

The matrices T_k are considered either free, isotropically transformed (∀ k: T_k = a_k T_1, with a_k > 0) or homogeneous (T_k = T); the matrices R_k are either free or homogeneous (R_k = R); the vectors V_k = T_k^{-1} µ_k are either free or equal (V_k = V). The so-called RTV family consists of eleven Gaussian mixture models obtained by combining these constraints on T_k, R_k and V_k: 3 × 2 × 2 = 12 combinations, minus the model in which all parameters are homogeneous, since it merges all the components.

Remark. V_k = V means that the conditional coefficients of variation are homogeneous.

Properties of the new models

• Every RTV model is faithfully representable in low dimension.
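As an illustration of the variance-correlation decomposition and of the count of eleven models, here is a minimal sketch assuming NumPy. The helper name `rtv_decomposition`, the numerical values of `mu` and `sigma`, and the string labels for the constraints are hypothetical and serve only the example.

```python
import itertools
import numpy as np

def rtv_decomposition(mu, sigma):
    """Return (R, T, V) with sigma = T @ R @ T and V = T^{-1} mu."""
    std = np.sqrt(np.diag(sigma))      # conditional standard deviations
    T = np.diag(std)
    T_inv = np.diag(1.0 / std)
    R = T_inv @ sigma @ T_inv          # conditional correlation matrix
    V = T_inv @ mu                     # component means in standard-deviation units
    return R, T, V

# Hypothetical component parameters, for illustration only.
mu = np.array([70.0, 3.5])
sigma = np.array([[36.0, 3.0],
                  [3.0, 1.0]])
R, T, V = rtv_decomposition(mu, sigma)

# Enumerate the constraint combinations: 2 choices for R, 3 for T, 2 for V.
R_options = ["R_k", "R"]
T_options = ["T_k", "a_k T_1", "T"]
V_options = ["V_k", "V"]
family = [model
          for model in itertools.product(R_options, T_options, V_options)
          if model != ("R", "T", "V")]  # fully homogeneous model is excluded
```

With these labels, `family` lists the eleven admissible combinations, from the fully free model `("R_k", "T_k", "V_k")` to the most constrained ones.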

Clustering Old Faithful geyser eruptions (from Venables and Ripley 2002)

The variables Waiting and Duration describing the n = 272 eruptions of the geyser are (i) both measured in min (original units), (ii) measured in sec × min, or (iii) both divided by their standard deviation. We set K = 2, and the models are sorted within each family (geometrical and RTV) according to their ICL values.

The ellipsoids in the right-hand figure correspond to the model [R, T_k, V_k] in R^3. The constraints are preserved in each canonical plane: correlations are homogeneous (solid segments with equal slopes), standard deviations are free (non-superimposable dashed rectangles) and coefficients of variation are free (non-superimposable arrows).
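Stability by projection can be illustrated numerically: projecting a Gaussian component onto a canonical plane keeps the corresponding sub-matrix of its covariance matrix, so a correlation matrix shared by all components stays shared in every plane. The sketch below assumes NumPy; the matrices `R`, `T_1` and `T_2` are hypothetical values chosen for the example.

```python
import numpy as np

def correlation(sigma):
    """Correlation matrix associated with a covariance matrix."""
    std = np.sqrt(np.diag(sigma))
    return sigma / np.outer(std, std)

# Two hypothetical components in R^3 sharing R, with free standard deviations.
R = np.array([[1.0, 0.3, -0.2],
              [0.3, 1.0, 0.5],
              [-0.2, 0.5, 1.0]])
T_1 = np.diag([1.0, 2.0, 3.0])
T_2 = np.diag([2.0, 0.5, 4.0])
sigma_1 = T_1 @ R @ T_1
sigma_2 = T_2 @ R @ T_2

# Projection onto a canonical plane = extraction of a 2 x 2 sub-matrix.
homogeneous_in_plane = {}
for plane in [(0, 1), (0, 2), (1, 2)]:
    idx = np.ix_(plane, plane)
    homogeneous_in_plane[plane] = np.allclose(correlation(sigma_1[idx]),
                                              correlation(sigma_2[idx]))
```

In this example every entry of `homogeneous_in_plane` is true, while the projected standard deviations remain different between the two components.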

1. Situation (i), original units: according to the best model (ICL = 1158.8), Waiting and Duration are homogeneously correlated among short and long eruptions (the RTV family provides meaningful interpretations).

[Figure: two panels, horizontal axis: Waiting (hour).]

• The model structure is broken by projection onto the canonical planes. Right figure: two Gaussian components in R^3 with identical shapes, homogeneous volumes and free orientations ([λ, S_k, Λ]). The components projected onto a canonical plane no longer have the same volume.
• Changing the measurement units breaks the model structure. Let D ∈ R^{d×d} be a diagonal positive definite matrix. For example, the model [λ_k, S_k, Λ] (homogeneous shapes, other parameters free) cannot, in general, have generated both samples x and Dx = {Dx_i; i = 1, ..., n}.
• The model selected by AIC (Akaike 1974), BIC (Schwarz 1978), ICL (Biernacki et al. 2000) or many other likelihood-based criteria depends on the units of the data. When a model is selected with ICL, the two clusters of Old Faithful eruptions (see Experiments) may have heterogeneous or homogeneous shapes, depending on the units of Waiting and Duration.
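The first two weaknesses listed above can be reproduced on small hypothetical examples. The sketch below assumes NumPy; the rotation angles, the shape matrices and the rescaling matrix `D` are arbitrary illustrative values, and the helper names are not from the poster.

```python
import numpy as np

def volume(sigma):
    """Volume |sigma|^(1/d) of a covariance matrix."""
    return np.linalg.det(sigma) ** (1.0 / sigma.shape[0])

def shape(sigma):
    """Diagonal of the shape matrix: decreasing eigenvalues divided by the volume."""
    eigenvalues = np.sort(np.linalg.eigvalsh(sigma))[::-1]
    return eigenvalues / volume(sigma)

def rotation_2d(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# (a) Change of units: same shape, different orientations, in dimension 2.
Lambda = np.diag([2.0, 0.5])                          # determinant 1
sigma_1 = rotation_2d(0.3) @ Lambda @ rotation_2d(0.3).T
sigma_2 = rotation_2d(1.2) @ Lambda @ rotation_2d(1.2).T
D = np.diag([60.0, 1.0])                              # rescale the first variable
same_shape_before = np.allclose(shape(sigma_1), shape(sigma_2))
same_shape_after = np.allclose(shape(D @ sigma_1 @ D), shape(D @ sigma_2 @ D))

# (b) Projection: same volume and shape, different orientations, in dimension 3.
Lambda_3 = np.diag([4.0, 1.0, 0.25])                  # determinant 1
c, s = np.cos(np.pi / 3), np.sin(np.pi / 3)
rot = np.array([[c, 0.0, -s], [0.0, 1.0, 0.0], [s, 0.0, c]])
gamma_1 = Lambda_3
gamma_2 = rot @ Lambda_3 @ rot.T
plane = np.ix_([0, 1], [0, 1])                        # first canonical plane
same_volume_3d = np.isclose(volume(gamma_1), volume(gamma_2))
same_volume_plane = np.isclose(volume(gamma_1[plane]), volume(gamma_2[plane]))
```

In this example the shapes coincide before rescaling and differ afterwards, and the volumes coincide in R^3 but differ once the components are projected onto the first canonical plane.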

Left: the model [R, T_k, V_k] inferred on the Old Faithful data (K = 2), Waiting being measured in hours and Duration in minutes. Right: the correlations remain homogeneous (and the other parameters free) when Waiting is converted from hours into quarters of an hour.

family       rank  (i) min × min               (ii) sec × min              (iii) no unit × no unit
                   model              ICL      model              ICL      model              ICL
geometrical  1     [λ_k, S, Λ_k]      1160.3   [λ_k, S, Λ_k]      2272.4   [λ_k, S_k, Λ]      414.6
geometrical  2     [λ_k, S_k, Λ_k]    1161.4   [λ_k, S_k, Λ_k]    2275.0   [λ_k, S_k, Λ_k]    415.6
geometrical  3     [λ_k, S, Λ]        1161.7   [λ_k, S, Λ_k]      2275.1   [λ_k, S, Λ]        415.9
geometrical  4     [λ_k, S_k, Λ]      1162.9   [λ_k, S, Λ]        2275.4   [λ, S_k, Λ]        417.0
RTV          1     [R, T_k, V_k]      1158.8   [R, T_k, V_k]      2272.5   [R, T_k, V_k]      413.0
RTV          2     [R_k, T_k, V_k]    1161.4   [R_k, T_k, V_k]    2275.0   [R_k, T_k, V_k]    415.6
RTV          3     [R, a_k T_1, V_k]  1161.7   [R, a_k T_1, V_k]  2275.4   [R, a_k T_1, V_k]  415.9
RTV          4     [R, T, V_k]        1163.4   [R, T, V_k]        2277.0   [R, T, V_k]        417.6

[1] Akaike, H.: A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723 (1974)
[2] Banfield, J.D., and Raftery, A.E.: Model-Based Gaussian and Non-Gaussian Clustering. Biometrics, 49, 803–821 (1993)
[3] Biernacki, C., Celeux, G., and Govaert, G.: Assessing a mixture for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 719–725 (2000)

Left: the data, n = 336 seabirds described by d = 5 morphological variables (Culmen depth, etc.). Right: the target partition according to the bird subspecies: borealis, diomedea or edwardsii.

[Figure: two panels, horizontal axis: Culmen depth; legend: borealis, diomedea, edwardsii.]

• Using BIC for selecting K.

[Figure: two panels, horizontal axes: Waiting (hour) and Waiting (quarter of hour).]

• The selected model does not depend on the measurement units of the data. Let D ∈ R^{d×d} be a diagonal positive definite matrix and Dx = {Dx_i; i = 1, ..., n} the converted data. The maximized log-likelihood of any RTV model m on Dx equals the one on x minus n log |D|, so with criteria written so that smaller is better, as in the tables of this poster, the BIC criterion (Schwarz 1978) satisfies:

$$\mathrm{BIC}(m; Dx) = \mathrm{BIC}(m; x) + n \log |D|. \qquad (1)$$

Changing the measurement units therefore does not modify the ranking of the RTV models by BIC, since the shift n log |D| is the same for every model m. This property, like relation (1), holds for many other likelihood-based criteria such as AIC (Akaike 1974) or ICL (Biernacki et al. 2000).
• Dividing each variable by its standard deviation changes neither the RTV model selected by AIC, BIC or ICL, nor the associated partition of the data. So there is no need to normalize the data before clustering them.
• For any pair of RTV models m_0 and m_1, the maximized-likelihood ratio

$$L(\hat{\psi}_{m_0}; x) \, / \, L(\hat{\psi}_{m_1}; x)$$

does not depend on the measurement units of x. So selecting between two nested RTV models with the Likelihood Ratio Test (outside the clustering context) leads to the same decision whatever the measurement units.
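The invariance properties above rest on the way the maximized likelihood reacts to a change of units. The sketch below checks it in the simplest case, a single Gaussian component (K = 1) fitted by maximum likelihood, assuming NumPy; the simulated data, the seed and the matrix `D` are hypothetical. The maximized log-likelihood of Dx equals that of x minus n log |D|, a shift that does not depend on the model; its sign in a criterion depends on whether the criterion is written as a penalized log-likelihood or as its negative.

```python
import numpy as np

def gaussian_loglik(x, mu, sigma):
    """Sum over the rows x_i of x of log N(x_i; mu, sigma)."""
    d = x.shape[1]
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(sigma), diff)
    return float(np.sum(-0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)))

def max_loglik(x):
    """Maximized log-likelihood of one unconstrained Gaussian component."""
    mu_hat = x.mean(axis=0)
    sigma_hat = np.cov(x, rowvar=False, bias=True)   # maximum likelihood estimate
    return gaussian_loglik(x, mu_hat, sigma_hat)

# Hypothetical sample of n = 50 points in dimension 2.
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0], [0.5, 1.0]])
x = rng.normal(size=(50, 2)) @ mixing + np.array([70.0, 3.5])

D = np.diag([60.0, 1.0])          # e.g. first variable converted to another unit
x_converted = x @ D               # D is diagonal, so each row x_i becomes D x_i

observed_shift = max_loglik(x) - max_loglik(x_converted)
expected_shift = x.shape[0] * np.log(np.linalg.det(D))   # n log|D|
```

Here `observed_shift` and `expected_shift` agree up to rounding error, whatever the simulated sample.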


The RTV family makes it possible to recover three clusters of Shearwaters.

BIC values
K                  1        2        3        4        5
RTV models         4472.0   4356.6   4335.2   4347.8   4370.1
geometric models   4472.0   4362.5   4344.2   4341.7   4355.8
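The selection of K from the values above amounts to taking, within each family, the number of components with the smallest BIC. A minimal sketch in plain Python, assuming the figures are read as one BIC value per family and per K (the dictionary layout is an illustrative choice):

```python
# BIC values copied from the table above, indexed by family and by K.
bic = {
    "RTV": {1: 4472.0, 2: 4356.6, 3: 4335.2, 4: 4347.8, 5: 4370.1},
    "geometric": {1: 4472.0, 2: 4362.5, 3: 4344.2, 4: 4341.7, 5: 4355.8},
}

# Smaller is better: keep the K with the lowest BIC in each family.
selected_k = {family: min(values, key=values.get) for family, values in bic.items()}
best_bic = {family: bic[family][k] for family, k in selected_k.items()}
```

With these values the RTV family selects K = 3, the number of subspecies, whereas the geometric family selects K = 4.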

• Comparing the inferred partition (K = 3) to the bird subspecies. The 3-class partition obtained with the RTV family yields a particularly low misclassification error rate.
• Interpretation of the inferred model.

family      best model     BIC      error rate
RTV         [R, T_k, V]    4335.2   2.68%
geometric   [λ_k, S, Λ]    4344.2   2.98%


The overall best model [R, T_k, V] (BIC = 4335.2) allows the following interpretation: the three Shearwater subspecies derive stochastically from a common reference population (see Biernacki and Lourme 2012).

[Figure legend: borealis (estimated), diomedea (estimated), edwardsii (estimated), reference population.]

References


Clustering Cory’s Shearwaters (from Thibault et al. 1997)


• Every RTV structure is scale invariant.
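Scale invariance can be read directly on the parameters: if the data are converted by a diagonal positive definite matrix D, each mean becomes Dµ_k and each covariance matrix becomes DΣ_kD, so R_k and V_k are unchanged and T_k becomes DT_k. Constraints such as R_k = R, T_k = a_k T_1 or V_k = V therefore survive the conversion. The sketch below assumes NumPy; the helper `rtv` and all numerical values are hypothetical.

```python
import numpy as np

def rtv(mu, sigma):
    """Variance-correlation parameters (R, T, V) of one Gaussian component."""
    std = np.sqrt(np.diag(sigma))
    return sigma / np.outer(std, std), np.diag(std), mu / std

# Two hypothetical components following [R, a_k T_1, V_k]: shared R, T_2 = a_2 T_1.
R = np.array([[1.0, 0.4],
              [0.4, 1.0]])
T_1 = np.diag([5.0, 0.3])
a_2 = 2.0
T_2 = a_2 * T_1
mu_1, mu_2 = np.array([55.0, 2.0]), np.array([80.0, 4.3])
sigma_1, sigma_2 = T_1 @ R @ T_1, T_2 @ R @ T_2

D = np.diag([60.0, 1.0])                       # change of units on the first variable
R1_old, T1_old, V1_old = rtv(mu_1, sigma_1)
R1_new, T1_new, V1_new = rtv(D @ mu_1, D @ sigma_1 @ D)
R2_new, T2_new, V2_new = rtv(D @ mu_2, D @ sigma_2 @ D)

correlations_still_shared = np.allclose(R1_new, R2_new)
still_isotropic = np.allclose(T2_new, a_2 * T1_new)
v_unchanged = np.allclose(V1_new, V1_old)
```

All three checks hold in this example, and the same computation applies to any diagonal D with positive entries.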


2. From (i) to (iii), the best RTV model remains [R, T_k, V_k] (the selection of an RTV hypothesis is robust to any rescaling of the data).
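The robustness stated in items 1 and 2 can be cross-checked on the reported ICL values. Going from units (i) to units (ii) converts one variable from minutes to seconds, so |D| = 60 and n log |D| = 272 × log 60 ≈ 1113.7. The sketch below, in plain Python, assumes the ICL values are read from the poster's table as listed in the dictionaries; the tolerance of 0.11 accounts for values rounded to one decimal.

```python
import math

n = 272
shift = n * math.log(60)          # n log|D| for one variable multiplied by 60

# ICL of the RTV models in units (i) min x min and (ii) sec x min.
icl_i = {"[R, T_k, V_k]": 1158.8, "[R_k, T_k, V_k]": 1161.4,
         "[R, a_k T_1, V_k]": 1161.7, "[R, T, V_k]": 1163.4}
icl_ii = {"[R, T_k, V_k]": 2272.5, "[R_k, T_k, V_k]": 2275.0,
          "[R, a_k T_1, V_k]": 2275.4, "[R, T, V_k]": 2277.0}

gaps = {model: icl_ii[model] - icl_i[model] for model in icl_i}
rtv_gaps_constant = all(abs(gap - shift) < 0.11 for gap in gaps.values())

# By contrast, the geometrical model [lambda_k, S, Lambda_k] moves by another amount.
geometric_gap = 2272.4 - 1160.3
```

Every RTV gap matches n log 60 up to rounding, so the RTV ranking is unchanged, while the geometrical model [λ_k, S, Λ_k] shifts by about 1112.1 instead.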


Left: two normal components with the same orientation, free shapes and free volumes ([λ_k, S, Λ_k]) inferred on the Old Faithful data (K = 2) and represented in an orthonormal basis. Right: the major axes of the ellipses no longer have the same direction when the scale of the x-axis is changed.


Geometrical Parsimonious Models

[4] Biernacki, C., Celeux, G., Govaert, G., and Langrognet, F.: Model-based cluster and discriminant analysis with the mixmod software. Computational Statistics and Data Analysis, 51(2), 587–600 (2006)
[5] Biernacki, C., and Lourme, A.: Gaussian Parsimonious Clustering Models Scale Invariant and Stable by Projection. Tech. report, INRIA Research Report number 7932 (2012)
[6] Bouveyron, C., Girard, S., and Schmid, C.: High-dimensional data clustering. Computational Statistics and Data Analysis, 52(1), 502–519 (2007)
[7] Celeux, G., and Govaert, G.: Gaussian parsimonious clustering models. Pattern Recognition, 28(5), 781–793 (1995)


[8] Maugis, C., Celeux, G., and Martin-Magniette, M.L.: Variable selection for Clustering with Gaussian Mixture Models. Biometrics, 65, 701–709 (2009)
[9] McLachlan, G., and Peel, D.: Finite Mixture Models. Wiley (2000)
[10] Schwarz, G.: Estimating the dimension of a model. The Annals of Statistics, 6, 461–464 (1978)
[11] Thibault, J.C., Bretagnolle, V., and Rabouam, C.: Cory’s shearwater Calonectris diomedea. Birds of Western Palearctic Update, 1, 75–98 (1997)
[12] Venables, W.N., and Ripley, B.D.: Modern Applied Statistics with S. Fourth Edition. Springer, New York (2002)
