Analyzing categorization data
Marine Cadoret, Sébastien Lê, Jérôme Pagès AGROCAMPUS OUEST, France
International Federation of Classification Societies March 15th 2009, Dresden
Introduction
Categorization consists of grouping objects according to their resemblances. After this task, subjects can also be asked to describe their groups in words (“qualified” categorization).
Data
98 consumers carried out a “qualified” categorization on 12 luxury perfumes:
Angel
Shalimar
Lolita Lempicka
L'instant
Cinéma
Aromatics Elixir
Chanel n°5
Coco Mademoiselle
J'adore (ET)
J'adore (EP)
Pure Poison
Pleasures
« gourmand, vanilla, wooded »
« spicy, aldehyde »
« white flower, vanilla, orange »
« oriental, showy, wooded, Patchouli oil »
« flower, floral, green »
Data table (1)
Co-occurrence matrix (columns are in the same order as the rows):
Shalimar            98  42  30  21   9  10  13  11   9   6   6   7
Aromatics Elixir    42  98  51  27   6   8  13  12  12  11  12   7
Chanel n°5          30  51  98  15   8   9  10  21  11  14  12  14
Angel               21  27  15  98  36  18  14  10  10  11  11  12
Lolita Lempicka      9   6   8  36  98  42  22  18  21  18  18  18
Cinéma              10   8   9  18  42  98  26  28  30  22  23  24
L'instant           13  13  10  14  22  26  98  25  20  23  28  22
Pure Poison         11  12  21  10  18  28  25  98  33  30  29  28
Coco Mademoiselle    9  12  11  10  21  30  20  33  98  28  28  38
Pleasures            6  11  14  11  18  22  23  30  28  98  38  48
J'adore (EP)         6  12  12  11  18  23  28  29  28  38  98  56
J'adore (ET)         7   7  14  12  18  24  22  28  38  48  56  98
Data are usually gathered in a co-occurrence (or dissimilarity) matrix and analyzed by non-metric MDS
Data table (2)
product            judge 12  judge 13  judge 14  judge 15  judge 16
Angel              1         4         1         5         2
Aromatics Elixir   3         3         5         2         1
Chanel n°5         4         3         4         1         3
Cinéma             2         5         6         4         2
Coco Mademoiselle  1         5         2         4         3
J'adore (EP)       1         6         2         3         3
J'adore (ET)       2         6         2         3         3
L'instant          1         4         6         2         4
Lolita Lempicka    1         5         1         5         2
Pleasures          3         4         6         3         4
Pure Poison        1         1         2         4         4
Shalimar           2         2         3         2         1
Each consumer can also be considered as a categorical variable
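The link between the two data tables can be sketched in a few lines: counting, for every pair of products, the judges who put both in the same group gives a co-occurrence matrix like the one in Data table (1). The Python sketch below uses only the five judges listed in Data table (2), so its counts go up to 5 rather than 98. The function name `cooccurrence` is illustrative and not taken from any package.

```python
from itertools import combinations

PRODUCTS = ["Angel", "Aromatics Elixir", "Chanel n°5", "Cinéma",
            "Coco Mademoiselle", "J'adore (EP)", "J'adore (ET)", "L'instant",
            "Lolita Lempicka", "Pleasures", "Pure Poison", "Shalimar"]

# Group labels given by judges 12 to 16, one label per product (same order).
PARTITIONS = {
    "judge 12": [1, 3, 4, 2, 1, 1, 2, 1, 1, 3, 1, 2],
    "judge 13": [4, 3, 3, 5, 5, 6, 6, 4, 5, 4, 1, 2],
    "judge 14": [1, 5, 4, 6, 2, 2, 2, 6, 1, 6, 2, 3],
    "judge 15": [5, 2, 1, 4, 4, 3, 3, 2, 5, 3, 4, 2],
    "judge 16": [2, 1, 3, 2, 3, 3, 3, 4, 2, 4, 4, 1],
}


def cooccurrence(partitions, n_products):
    """Count, for each pair of products, the judges who grouped them together."""
    counts = [[0] * n_products for _ in range(n_products)]
    for labels in partitions.values():
        for i in range(n_products):
            counts[i][i] += 1  # a product is always grouped with itself
        for i, j in combinations(range(n_products), 2):
            if labels[i] == labels[j]:
                counts[i][j] += 1
                counts[j][i] += 1
    return counts


matrix = cooccurrence(PARTITIONS, len(PRODUCTS))
angel = PRODUCTS.index("Angel")
lolita = PRODUCTS.index("Lolita Lempicka")
shalimar = PRODUCTS.index("Shalimar")
print(matrix[angel][lolita])    # judges grouping Angel with Lolita Lempicka
print(matrix[angel][shalimar])  # judges grouping Angel with Shalimar
```

Among these five judges, four group Angel with Lolita Lempicka and none groups Angel with Shalimar, which goes in the same direction as the full matrix (36 versus 21 out of 98).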
Data table (2)
product            judge 12                 judge 13                 judge 14                     judge 15      judge 16
Angel              floral soft              fruity strong            vanilla spicy island spirit  edible sweet  food spice
Aromatics Elixir   strong masculine         heady grandmother        harsh strong                 old           household wax
Chanel n°5         Gr 4                     heady grandmother        toilet                       soap          familiar classic
Cinéma             floral artificial grass  fruity medium            sweet                        soft          food spice
Coco Mademoiselle  floral soft              fruity medium            softness floral              soft          familiar classic
J'adore (EP)       floral soft              sweet weak               softness floral              floral        familiar classic
J'adore (ET)       floral artificial grass  sweet weak               softness floral              floral        familiar classic
L'instant          floral soft              fruity strong            sweet                        old           floral
Lolita Lempicka    floral soft              fruity medium            vanilla spicy island spirit  edible sweet  food spice
Pleasures          strong masculine         fruity strong            sweet                        floral        floral
Pure Poison        floral soft              tangy deodorant strong   softness floral              soft          floral
Shalimar           floral artificial grass  lavender eau de cologne  musty aggressive             old           household wax
Let’s run MCA on this data table!
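What running MCA on such a table involves can be illustrated with a minimal, standard-library Python sketch. It relies on a known property of MCA: the non-trivial eigenvalues are those of the matrix W with W[i][k] = (1/J) × sum over the J judges who group products i and k together of 1 / (size of that group), once the trivial eigenvalue 1 carried by the constant vector is removed. The function name `first_mca_eigenvalue`, the use of power iteration and the number of iterations are choices made for this sketch; in practice a dedicated routine such as the `MCA` function of FactoMineR would be used. Only the five judges of Data table (2) are included, so the result is not comparable to the 17.8% reported for the 98 consumers.

```python
import random

# Products x judges table of group labels (judges 12 to 16 of Data table (2)).
TABLE = [
    [1, 4, 1, 5, 2],  # Angel
    [3, 3, 5, 2, 1],  # Aromatics Elixir
    [4, 3, 4, 1, 3],  # Chanel n°5
    [2, 5, 6, 4, 2],  # Cinéma
    [1, 5, 2, 4, 3],  # Coco Mademoiselle
    [1, 6, 2, 3, 3],  # J'adore (EP)
    [2, 6, 2, 3, 3],  # J'adore (ET)
    [1, 4, 6, 2, 4],  # L'instant
    [1, 5, 1, 5, 2],  # Lolita Lempicka
    [3, 4, 6, 3, 4],  # Pleasures
    [1, 1, 2, 4, 4],  # Pure Poison
    [2, 2, 3, 2, 1],  # Shalimar
]


def first_mca_eigenvalue(table, n_iter=300, seed=0):
    """Approximate the first non-trivial MCA eigenvalue by power iteration."""
    n, n_judges = len(table), len(table[0])
    # Start from -1/n everywhere: this removes the trivial eigenvalue 1.
    w = [[-1.0 / n] * n for _ in range(n)]
    for j in range(n_judges):
        groups = {}
        for i in range(n):
            groups.setdefault(table[i][j], []).append(i)
        for members in groups.values():
            share = 1.0 / (len(members) * n_judges)
            for i in members:
                for k in members:
                    w[i][k] += share
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(n_iter):
        wv = [sum(a * b for a, b in zip(row, v)) for row in w]
        norm = sum(x * x for x in wv) ** 0.5
        if norm < 1e-12:  # no structure at all (e.g. a single group per judge)
            return 0.0
        v = [x / norm for x in wv]
    wv = [sum(a * b for a, b in zip(row, v)) for row in w]
    return sum(a * b for a, b in zip(v, wv))  # Rayleigh quotient


print(first_mca_eigenvalue(TABLE))
```

The eigenvalue lies between 0 and 1 and reaches 1 when all judges give exactly the same partition.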
Representation of the perfumes
[Figure: MCA factor map of the 12 perfumes. Horizontal axis: Dim 1 (17.8%); vertical axis: Dim 2 (13.64%).]
Co-occurrence matrix
[Table: the co-occurrence matrix of Data table (1), shown again.]
Representation of the words
[Figure: MCA factor map showing the perfumes together with the words used to describe the groups. Horizontal axis: Dim 1 (17.8%); vertical axis: Dim 2 (13.64%). Words shown include: strong, fruity, honey, marked, sweet, cold, tobacco, spicy, warm, vanilla, chocolate, toilet, passion, light, soft, cleanliness, chemical, solvent, discreet, medicine, oriental, nauseating, deodorant.]
Confidence ellipses around products
[Figure: superimposed representation of the products and their descriptions; points are labelled P1 to P4 and F1 to F4.]
P4 is at the barycentre of the words used to describe the groups
[Figure sequence: the panelists' words are resampled and product P4 is positioned again for each resample; the resampled positions of P4 are then summarized by a confidence ellipse around P4.]
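The resampling idea behind these ellipses can be sketched as follows, assuming that the words describing the groups containing a product are resampled with replacement and that the product is placed at the barycentre of each resample. The slides do not spell out the exact scheme, so this is an illustration and not the authors' procedure. The coordinates in `words_p4` are made up, and the function names are hypothetical.

```python
import random


def barycentre(points):
    """Mean position of a list of 2-D points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def resampled_positions(word_points, n_resamples=500, seed=0):
    """Bootstrap the words and return one product position per resample."""
    rng = random.Random(seed)
    positions = []
    for _ in range(n_resamples):
        sample = rng.choices(word_points, k=len(word_points))  # with replacement
        positions.append(barycentre(sample))
    return positions


def covariance(points):
    """Sample covariance (sxx, sxy, syy) of a 2-D cloud."""
    cx, cy = barycentre(points)
    n = len(points)
    sxx = sum((x - cx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - cy) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - cx) * (y - cy) for x, y in points) / (n - 1)
    return sxx, sxy, syy


# Hypothetical factor-map coordinates of the words used for product P4.
words_p4 = [(0.9, 0.2), (1.1, -0.1), (0.7, 0.4), (1.3, 0.1), (1.0, 0.3), (0.8, -0.2)]

cloud = resampled_positions(words_p4)
print(barycentre(words_p4))  # position of P4 itself
print(barycentre(cloud))     # centre of the resampled positions
print(covariance(cloud))     # spread that an ellipse would summarize
```

An ellipse at a chosen confidence level would then be drawn from the centre and covariance of `cloud`; the drawing step is left out here.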
Confidence ellipses for perfumes data
[Figure: MCA factor map of the 12 perfumes with a confidence ellipse around each product. Horizontal axis: Dim 1 (17.8%); vertical axis: Dim 2 (13.64%).]
Confidence ellipses for random data
[Figure: MCA factor map with confidence ellipses obtained from random data, using the same 12 product labels. Horizontal axis: Dim 1 (11.03%); vertical axis: Dim 2 (10.14%).]
Explanation
Number of columns >> number of rows
Automatic production of common dimensions
Looking for an indicator of consensus between subjects
Significance of the results
H0: absence of consensus
Indicator: first eigenvalue
Evolution of the indicator under H0 (rows: number of subjects; columns: number of products):
Subjects   10 products   20 products   50 products   100 products
10         0.69566       0.471051      0.306195      0.236413
20         0.624637      0.389234      0.228844      0.16401
50         0.557816      0.319656      0.167827      0.109948
100        0.524184      0.286282      0.14006       0.086396
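How such a table can be produced is sketched below by simulation. The slides do not say how the data were generated under H0, so the sketch assumes that each subject assigns every product uniformly at random to one of `n_groups` groups (4 by default, an arbitrary choice). Its values are therefore not expected to match the table, only to show the same trend. All names are illustrative, and the eigenvalue is approximated by power iteration on the matrix of group-size-weighted co-occurrences.

```python
import random


def first_eigenvalue(table, n_iter=300, seed=0):
    """Approximate first non-trivial MCA eigenvalue of a products x subjects table."""
    n, n_subjects = len(table), len(table[0])
    w = [[-1.0 / n] * n for _ in range(n)]  # -1/n removes the trivial eigenvalue
    for j in range(n_subjects):
        groups = {}
        for i in range(n):
            groups.setdefault(table[i][j], []).append(i)
        for members in groups.values():
            share = 1.0 / (len(members) * n_subjects)
            for i in members:
                for k in members:
                    w[i][k] += share
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(n_iter):
        wv = [sum(a * b for a, b in zip(row, v)) for row in w]
        norm = sum(x * x for x in wv) ** 0.5
        if norm < 1e-12:
            return 0.0
        v = [x / norm for x in wv]
    wv = [sum(a * b for a, b in zip(row, v)) for row in w]
    return sum(a * b for a, b in zip(v, wv))


def mean_first_eigenvalue(n_products, n_subjects, n_groups=4, n_sim=20, seed=0):
    """Average first eigenvalue over random tables (assumed H0 model)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sim):
        table = [[rng.randrange(n_groups) for _ in range(n_subjects)]
                 for _ in range(n_products)]
        total += first_eigenvalue(table)
    return total / n_sim


few_subjects = mean_first_eigenvalue(10, 10)
many_subjects = mean_first_eigenvalue(10, 100)
print(few_subjects, many_subjects)
```

Even without any consensus, the first eigenvalue is far from zero and depends on the size of the table, which is why the observed value has to be compared with a reference distribution.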
Bar plots of the eigenvalues (10 products)
[Figure: bar plots of the eigenvalues of dimensions 1 to 9, with panels labelled 10, 20, 50, 100 and 1000 subjects; vertical axes run from 0.0 to 0.7.]
Significance of the indicator for a given data table
Calculate the p-value associated with the first eigenvalue of the MCA:
1. Repeat a large number of times:
   a. Permute the rows independently within each column
   b. Calculate the first eigenvalue of the permuted table
2. Obtain the distribution of the first eigenvalue under H0
3. Locate the observed eigenvalue in this distribution to get the p-value
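The steps above can be sketched in Python as follows. The eigenvalue is approximated by power iteration on the matrix of group-size-weighted co-occurrences, whose non-trivial eigenvalues are those of the MCA. The function names, the 199 permutations and the (count + 1) / (permutations + 1) form of the p-value are choices made for this sketch, and the toy table with perfect consensus is hypothetical.

```python
import random


def first_eigenvalue(table, n_iter=200, seed=0):
    """Approximate first non-trivial MCA eigenvalue of a products x subjects table."""
    n, n_subjects = len(table), len(table[0])
    w = [[-1.0 / n] * n for _ in range(n)]  # -1/n removes the trivial eigenvalue
    for j in range(n_subjects):
        groups = {}
        for i in range(n):
            groups.setdefault(table[i][j], []).append(i)
        for members in groups.values():
            share = 1.0 / (len(members) * n_subjects)
            for i in members:
                for k in members:
                    w[i][k] += share
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(n_iter):
        wv = [sum(a * b for a, b in zip(row, v)) for row in w]
        norm = sum(x * x for x in wv) ** 0.5
        if norm < 1e-12:
            return 0.0
        v = [x / norm for x in wv]
    wv = [sum(a * b for a, b in zip(row, v)) for row in w]
    return sum(a * b for a, b in zip(v, wv))


def permutation_p_value(table, n_perm=199, seed=0):
    """Return (observed first eigenvalue, permutation p-value)."""
    rng = random.Random(seed)
    n, n_subjects = len(table), len(table[0])
    observed = first_eigenvalue(table)
    count = 0
    for _ in range(n_perm):
        columns = []
        for j in range(n_subjects):
            column = [row[j] for row in table]
            rng.shuffle(column)  # independent row permutation within the column
            columns.append(column)
        permuted = [[columns[j][i] for j in range(n_subjects)] for i in range(n)]
        if first_eigenvalue(permuted) >= observed:
            count += 1
    return observed, (count + 1) / (n_perm + 1)


# Hypothetical table: 8 products, 6 subjects who all form the same two groups.
consensus = [[1] * 6 for _ in range(4)] + [[2] * 6 for _ in range(4)]
observed, p_value = permutation_p_value(consensus)
print(observed, p_value)
```

Permuting within each column keeps every subject's group sizes unchanged and only breaks the agreement between subjects, which is what H0 (absence of consensus) describes.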
Significance of the first dimension for perfumes data
[Figure: histogram of the first eigenvalue. Horizontal axis: first eigenvalue (0.35 to 0.70); vertical axis: frequency (0 to 150).]
Confidence ellipses
[Figure: two MCA factor maps of the 12 perfumes with confidence ellipses. Perfume data: Dim 1 (17.8%), Dim 2 (13.64%). Random data: Dim 1 (11.03%), Dim 2 (10.14%).]
Second empirical indicator
Ellipses overlapping
Total inertia = Between inertia + Within inertia
Calculate the inertia ratio:
$$0 \le \frac{\text{between inertia}}{\text{total inertia}} \le 1$$
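The decomposition and the ratio can be checked numerically with a short Python sketch. It assumes that the points are the resampled positions of each product, which the slides suggest but do not state. Every point has weight 1, which does not affect the ratio. The function names and the toy coordinates are hypothetical.

```python
def centre(points):
    """Mean point of a list of points of equal dimension."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inertia_decomposition(clouds):
    """Return (total, between, within) inertia for a dict product -> points."""
    all_points = [p for points in clouds.values() for p in points]
    g = centre(all_points)
    total = sum(sq_dist(p, g) for p in all_points)
    between = 0.0
    within = 0.0
    for points in clouds.values():
        c = centre(points)
        between += len(points) * sq_dist(c, g)  # product centre versus overall centre
        within += sum(sq_dist(p, c) for p in points)  # spread around the product centre
    return total, between, within


# Toy example: two products with two resampled positions each.
clouds = {
    "A": [(0.0, 0.0), (0.0, 2.0)],
    "B": [(4.0, 0.0), (4.0, 2.0)],
}
total, between, within = inertia_decomposition(clouds)
ratio = between / total
print(total, between, within, ratio)  # 20.0 16.0 4.0 0.8
```

A ratio close to 1 means that the clouds are tight compared with the distances between products, that is, that the ellipses overlap little.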
Significance of the inertia ratio for perfumes data
[Figure: histogram of the inertia ratio. Horizontal axis: ratio (0.982 to 0.996); vertical axis: frequency (0 to 300).]
Conclusion
MCA: suitable method for categorization data
Sensory data: few rows and many columns
By construction, many relationships
External validation
SensoMineR: a package for sensory data analysis. Journal of Sensory Studies (2008)
FactoMineR: an R package for multivariate analysis. Journal of Statistical Software (2008)
http://www.agrocampus-rennes.fr/math/
The applied mathematics department of Agrocampus organizes The R User Conference 2009, July 8-10, in Rennes, France
http://www.agrocampus-rennes.fr/math/useR-2009/