Clustering and Principal Component Methods
1 Clustering Methods
2 Principal Component Methods as a Preprocessing Step
3 Graphical Complementarity
Clustering Methods
Preprocessing
Graphical Complementarity
References
Unsupervised classification
• Data set: a table individuals × variables (or a distance matrix)
• Objective: to produce homogeneous groups of individuals (or groups of variables)
• Two kinds of clustering, defining two structures on the individuals: a hierarchy or a partition
Hierarchical Clustering
Principle: sequentially agglomerate (clusters of) individuals using
• a distance between individuals: city block, Euclidean
• an agglomerative criterion: single linkage, complete linkage, average linkage, Ward's criterion
[Figure: single linkage vs complete linkage, illustrated with the Euclidean and city-block distances]
Representation with a dendrogram
⇒ The Euclidean distance is the one used in principal component methods
⇒ Ward's criterion is based on the multidimensional variance (inertia), which is the core of principal component methods
Ascending Hierarchical Clustering
AHC algorithm:
• Compute the Euclidean distance matrix (I × I)
• Consider each individual as a cluster
• Merge the two clusters A and B that are closest with respect to Ward's criterion:

  ∆_Ward(A, B) = (I_A I_B) / (I_A + I_B) × d²(µ_A, µ_B)

  with d the Euclidean distance, µ_A the barycentre and I_A the cardinality of cluster A
• Repeat until the number of clusters is equal to one
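The algorithm above can be sketched directly from Ward's criterion. This is a minimal, naive O(I³) illustration (the toy data, group sizes and the function name `ahc_ward` are made up for the example; in practice scipy.cluster.hierarchy.linkage with method="ward" does this efficiently):

```python
import numpy as np

def ahc_ward(X):
    """Naive AHC: repeatedly merge the two clusters with the smallest Delta_Ward."""
    clusters = [[i] for i in range(len(X))]   # start: 1 cluster = 1 individual
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                A, B = X[clusters[a]], X[clusters[b]]
                nA, nB = len(A), len(B)
                # Delta_Ward(A, B) = I_A I_B / (I_A + I_B) * d^2(mu_A, mu_B)
                delta = nA * nB / (nA + nB) * ((A.mean(0) - B.mean(0)) ** 2).sum()
                if best is None or delta < best[0]:
                    best = (delta, a, b)
        _, a, b = best
        merges.append((list(clusters[a]), list(clusters[b])))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

# Two well-separated groups of individuals: the very last merge joins them
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (5, 2)), rng.normal(5, 0.3, (5, 2))])
last_A, last_B = ahc_ward(X)[-1]
```

The sequence of merges is exactly what the dendrogram represents: the height of each merge is the corresponding ∆_Ward.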
Ward's criterion
• Individuals can be represented by a cloud of points in R^K
• Total inertia = multidimensional variance
With Q groups of individuals, the inertia can be decomposed as:

  Σ_{k=1..K} Σ_{q=1..Q} Σ_{i=1..I_q} (x_iqk − x̄_k)² = Σ_{k=1..K} Σ_{q=1..Q} I_q (x̄_qk − x̄_k)² + Σ_{k=1..K} Σ_{q=1..Q} Σ_{i=1..I_q} (x_iqk − x̄_qk)²

  Total inertia = Between inertia + Within inertia
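This decomposition is easy to check numerically. The following sketch (with invented data and an arbitrary partition) verifies that the total inertia equals the between inertia plus the within inertia:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 4))            # I = 10 individuals, K = 4 variables
labels = np.array([0] * 4 + [1] * 6)    # an arbitrary partition into Q = 2 groups

grand_mean = X.mean(axis=0)
total = ((X - grand_mean) ** 2).sum()   # total inertia

between = within = 0.0
for q in np.unique(labels):
    Xq = X[labels == q]
    mu_q = Xq.mean(axis=0)              # barycentre of group q
    between += len(Xq) * ((mu_q - grand_mean) ** 2).sum()
    within += ((Xq - mu_q) ** 2).sum()

print(total, between + within)          # the two numbers coincide
```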
Ward's criterion
Step 1: 1 cluster = 1 individual ⇒ Within = 0, Between = Total
Step I: only 1 cluster ⇒ Within = Total, Between = 0
⇒ At each step, Ward's criterion minimizes the increase of the within inertia
K-means algorithm
1. Choose Q points at random (the initial barycentres)
2. Assign each point to the closest barycentre
3. Compute the new barycentres
4. Iterate steps 2 and 3 until convergence
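A minimal sketch of these four steps (the toy data, the seed and the empty-cluster guard are choices made for the example, not part of the algorithm description above):

```python
import numpy as np

def kmeans(X, Q, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. choose Q points at random as the initial barycentres
    centres = X[rng.choice(len(X), size=Q, replace=False)]
    for _ in range(n_iter):
        # 2. assign each point to the closest barycentre
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # 3. compute the new barycentres (keep the old one if a cluster empties)
        new = np.array([X[labels == q].mean(axis=0) if np.any(labels == q)
                        else centres[q] for q in range(Q)])
        # 4. iterate steps 2 and 3 until convergence
        if np.allclose(new, centres):
            break
        centres = new
    return labels, centres

# Two well-separated groups of individuals are recovered
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.3, (5, 2)), rng.normal(5, 0.3, (5, 2))])
labels, centres = kmeans(X, Q=2)
```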
PCA as a preprocessing
With continuous variables:
⇒ AHC and k-means on the raw data
⇒ AHC or k-means on the principal components
PCA transforms the raw variables x.1, ..., x.K into orthogonal principal components F.1, ..., F.K with decreasing variances λ1 ≥ λ2 ≥ ... ≥ λK
[Diagram: PCA separates the data into structure (first components) and noise (last components)]
⇒ Keeping the first components makes the clustering more robust
⇒ But how many components should be kept to denoise?
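The denoising effect can be sketched as follows: one informative direction plus pure-noise variables, with PCA (computed here through the SVD) concentrating the group structure in F.1 (the data and dimensions are invented for the illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
# one latent group structure (values 0 and 4) plus 9 pure-noise variables
signal = np.r_[np.zeros(10), np.full(10, 4.0)]
X = np.c_[signal + 0.1 * rng.normal(size=20), rng.normal(size=(20, 9))]

# PCA through the SVD of the centred data
Xc = X - X.mean(axis=0)
U, sv, Vt = np.linalg.svd(Xc, full_matrices=False)
F = U * sv                        # principal components F.1, ..., F.K
lambdas = sv ** 2 / len(X)        # decreasing variances lambda_1 >= ... >= lambda_K

# cluster on F[:, :1] (the structure) rather than on all of F (structure + noise)
F_denoised = F[:, :1]
```

Here F.1 is strongly aligned with the latent groups, so AHC or k-means on F_denoised sees the structure without the nine noise variables.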
MCA as a preprocessing
Clustering on categorical variables: which distance should be used?
• with two categories: Jaccard index, Dice's coefficient, simple matching, etc. These indices are well-suited for presence/absence data
• with more than 2 categories: use for example the χ²-distance
Using the χ²-distance ⇔ computing distances from all the principal components obtained from MCA
In practice, MCA is used as a preprocessing step in order to
• transform the categorical variables into continuous ones
• delete the last dimensions to make the clustering more robust
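The equivalence stated above can be checked numerically: build the indicator matrix of a small categorical data set, run the correspondence analysis that underlies MCA, and compare χ²-distances with Euclidean distances over all principal components (the toy data are invented for the illustration):

```python
import numpy as np

# Toy categorical data: 6 individuals, one 3-category and one 2-category variable
var1 = np.array([0, 1, 2, 0, 1, 2])
var2 = np.array([0, 1, 0, 1, 0, 1])
Z = np.hstack([np.eye(3)[var1], np.eye(2)[var2]])   # indicator (disjunctive) matrix

# Correspondence analysis of Z, i.e. MCA
P = Z / Z.sum()
r, c = P.sum(axis=1), P.sum(axis=0)                 # row and column masses
S = np.diag(1 / np.sqrt(r)) @ (P - np.outer(r, c)) @ np.diag(1 / np.sqrt(c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)
F = np.diag(1 / np.sqrt(r)) @ U * sv                # row principal coordinates

def chi2_dist(i, j):
    """chi2-distance between the row profiles of individuals i and j."""
    return np.sqrt((((P[i] / r[i] - P[j] / r[j]) ** 2) / c).sum())

# Euclidean distance over ALL the components reproduces the chi2-distance
print(chi2_dist(0, 1), np.linalg.norm(F[0] - F[1]))
```

Dropping the last columns of F (the smallest eigenvalues) is then exactly the denoising step described above.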
MFA as a preprocessing
[Diagram: two groups of variables X1 and X2 measured on the same individuals i and i']
MFA balances the influence of the groups when computing distances between individuals:

  d²(i, i') = Σ_{j=1..J} (1/λ1^j) Σ_{k=1..K_j} (x_ik − x_i'k)²

  with λ1^j the first eigenvalue of the PCA of group j
AHC or k-means on the first principal components (F.1, ..., F.Q) obtained from MFA allows one to
• take the group structure into account in the clustering
• make the clustering more robust by deleting the last dimensions
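A numpy sketch of this weighting (two invented groups of variables; `first_eigenvalue` is a helper name chosen here): dividing each centred group by √λ1^j gives every group a first eigenvalue of 1, so no group can dominate the first global dimension:

```python
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal(size=(8, 3))           # group 1: 3 variables
X2 = 10 * rng.normal(size=(8, 5))      # group 2: 5 variables, much larger scale

def first_eigenvalue(X):
    """Largest eigenvalue of the PCA of X (variance along the first axis)."""
    Xc = X - X.mean(axis=0)
    return np.linalg.svd(Xc, compute_uv=False)[0] ** 2 / len(X)

# MFA weighting: each group is divided by the square root of its first eigenvalue
X = np.hstack([(X1 - X1.mean(0)) / np.sqrt(first_eigenvalue(X1)),
               (X2 - X2.mean(0)) / np.sqrt(first_eigenvalue(X2))])

# squared distance between individuals 0 and 1 under the MFA metric
d2 = ((X[0] - X[1]) ** 2).sum()
```

A standard PCA of the weighted table X then yields the MFA components F.1, ..., F.Q on which AHC or k-means can be run.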
Back to the wine data!
AHC on the first 5 principal components from MFA
[Dendrogram: hierarchical clustering of the wines — S Renaudie, S Trotignon, S Buisse Domaine, S Michaud, S Buisse Cristal, V Aub Silex, V Font Domaine, V Font Brûlés, V Aub Marigny, V Font Coteaux — height scale from 0.0 to 2.0]
Individuals are sorted according to their coordinate on F.1