Missing values and elements of validity in categorization
Marine Cadoret, Sébastien Lê, Jérôme Pagès
Applied Mathematics Department, Agrocampus Ouest, Rennes, France
ASMDA09, Vilnius, July 1st, 2009
Overview
Context
Analyzing complete categorization data
Analyzing incomplete categorization data
Validation of categorization results
Conclusion
Context
Categorization
What is categorization?
- Categorization (or sorting task) consists in grouping objects according to their resemblances.
- Following this task, subjects can also be asked to describe the groups in words (qualified categorization).
- Each subject provides one partition.
Data
98 consumers carried out a qualified categorization on 12 luxury perfumes: Angel, Aromatics Elixir, Chanel n°5, Cinéma, Coco Mademoiselle, J'adore (EP), J'adore (ET), L'instant, Lolita Lempicka, Pleasures, Pure Poison, Shalimar.
Example of categorization
- « gourmand, vanilla, wooded »
- « spicy, aldehyde »
- « white flower, vanilla, orange »
- « oriental, showy, wooded, Patchouli oil »
- « flower, floral, green »
Analyzing complete categorization data
Analyzing categorization data: complete data
Product | Subject 1 | Subject 2 | Subject 3 | Subject 4 | Subject 5
Angel | 1 | 4 | 1 | 5 | 2
Aromatics Elixir | 3 | 3 | 5 | 2 | 1
Chanel n°5 | 4 | 3 | 4 | 1 | 3
Cinéma | 2 | 5 | 6 | 4 | 2
Coco Mademoiselle | 1 | 5 | 2 | 4 | 3
J'adore (EP) | 1 | 6 | 2 | 3 | 3
J'adore (ET) | 2 | 6 | 2 | 3 | 3
L'instant | 1 | 4 | 6 | 2 | 4
Lolita Lempicka | 1 | 5 | 1 | 5 | 2
Pleasures | 3 | 4 | 6 | 3 | 4
Pure Poison | 1 | 1 | 2 | 4 | 4
Shalimar | 2 | 2 | 3 | 2 | 1
- One subject = one qualitative variable
Analyzing categorization data: complete data
Product | Subject 1 | Subject 2 | Subject 3 | Subject 4 | Subject 5
Angel | floral soft | fruity strong | vanilla spicy | to eat sweet | food spicy
Aromatics Elixir | strong man | heady grandmother | hard strong | old | domestic wax
Chanel n°5 | Gr4 | heady grandmother | toilet | soap | classical
Cinéma | floral grass | fruity medium | sweet | soft | food spicy
Coco Mademoiselle | floral soft | fruity medium | softness floral | soft | classical
J'adore (EP) | floral soft | sweet light | softness floral | floral | classical
J'adore (ET) | floral grass | sweet light | softness floral | floral | classical
L'instant | floral soft | fruity strong | sweet | old | floral
Lolita Lempicka | floral soft | fruity medium | vanilla spicy | to eat sweet | food spicy
Pleasures | strong man | fruity strong | sweet | floral | floral
Pure Poison | floral soft | slightly deodorant | softness floral | soft | floral
Shalimar | floral grass | strong lavender | fuggy aggressive | old | domestic wax
- Qualified categorization: label = word(s) associated with the group
- Appropriate data table for Multiple Correspondence Analysis (MCA)
Multiple Correspondence Analysis
[Figure: disjunctive table. Rows: products 1, ..., i, ..., I. Columns: the groups formed by subjects 1, ..., j, ..., J, numbered k = 1, ..., K. Entries: $x_{ik}$ (for example 010000 for product 1, 001000 for product i). Column totals: $I_1, \dots, I_k, \dots, I_K$.]
- Data taken into account via the disjunctive data table
- Distance between 2 products i and l:
  $d^2(i,l) = \frac{1}{J} \sum_{k} \frac{I}{I_k} (x_{ik} - x_{lk})^2$
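As a small illustration of the MCA distance between two products, $d^2(i,l) = \frac{1}{J}\sum_k \frac{I}{I_k}(x_{ik}-x_{lk})^2$, the sketch below builds the disjunctive table from the five-subject example and evaluates the distance for one pair. The function names and the constant `GROUPS` are illustrative choices, not part of the talk or of any library.

```python
import numpy as np

def indicator_table(groups):
    """Disjunctive table of an (I products x J subjects) array of group labels."""
    groups = np.asarray(groups)
    blocks = []
    for column in groups.T:  # one qualitative variable per subject
        labels = np.unique(column)
        blocks.append((column[:, None] == labels[None, :]).astype(float))
    return np.hstack(blocks)  # shape (I, K), K = total number of groups

def mca_squared_distance(groups, i, l):
    """d^2(i, l) = (1/J) * sum_k (I / I_k) * (x_ik - x_lk)^2."""
    groups = np.asarray(groups)
    n_products, n_subjects = groups.shape
    x = indicator_table(groups)
    group_sizes = x.sum(axis=0)  # I_k, number of products in group k
    weighted = n_products / group_sizes * (x[i] - x[l]) ** 2
    return float(weighted.sum() / n_subjects)

# Rows: the 12 perfumes in alphabetical order; columns: subjects 1 to 5.
GROUPS = [
    [1, 4, 1, 5, 2],  # Angel
    [3, 3, 5, 2, 1],  # Aromatics Elixir
    [4, 3, 4, 1, 3],  # Chanel n°5
    [2, 5, 6, 4, 2],  # Cinéma
    [1, 5, 2, 4, 3],  # Coco Mademoiselle
    [1, 6, 2, 3, 3],  # J'adore (EP)
    [2, 6, 2, 3, 3],  # J'adore (ET)
    [1, 4, 6, 2, 4],  # L'instant
    [1, 5, 1, 5, 2],  # Lolita Lempicka
    [3, 4, 6, 3, 4],  # Pleasures
    [1, 1, 2, 4, 4],  # Pure Poison
    [2, 2, 3, 2, 1],  # Shalimar
]

angel, lolita = 0, 8
print(mca_squared_distance(GROUPS, angel, lolita))
```

Angel and Lolita Lempicka are separated by subject 2 only, so their distance comes from just two groups, each weighted by the inverse of its size: rare groups count more than common ones.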
Representation of the perfumes
[Figure: MCA factor map of the 12 perfumes. Axes: Dim 1 (17.8%), Dim 2 (13.64%).]
Representation of the words
[Figure: MCA factor map of the perfumes and of the words used to describe the groups. Axes: Dim 1 (17.8%), Dim 2 (13.64%).]
Analyzing incomplete categorization data
Problems
If too many products are to be tested: selection of a subset of products for each subject
- How to analyze incomplete data?
- How to choose the best way to select the products tested by each subject?
Analyzing categorization data: incomplete data
Product | Subject 1 | Subject 2 | Subject 3 | Subject 4 | Subject 5
Angel | S1.miss | 4 | 1 | 5 | S5.miss
Aromatics Elixir | 3 | 3 | S3.miss | 2 | S5.miss
Chanel n°5 | 4 | 3 | 4 | 1 | 3
Cinéma | 2 | S2.miss | 6 | 4 | 2
Coco Mademoiselle | S1.miss | 5 | 2 | 4 | 3
J'adore (EP) | 1 | 6 | 2 | 3 | 3
J'adore (ET) | 2 | 6 | 2 | S4.miss | 3
L'instant | 1 | S2.miss | 6 | 2 | 4
Lolita Lempicka | 1 | 5 | 1 | 5 | 2
Pleasures | 3 | 4 | 6 | S4.miss | 4
Pure Poison | 1 | 1 | S3.miss | 4 | 4
Shalimar | 2 | 2 | 3 | 2 | 1
For each subject:
- Non-presented products = additional group
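The coding "non-presented products = additional group" can be sketched as follows: each missing cell receives a subject-specific label (`S1.miss`, `S2.miss`, ...) before the disjunctive table is built, so every subject gains one extra column. The use of `None` for a non-presented product, the label format `Sj.g` for ordinary groups and the function names are assumptions made for this illustration.

```python
import numpy as np

def recode_missing(groups):
    """Give every cell a subject-specific label; None becomes 'Sj.miss'."""
    recoded = []
    for row in groups:
        recoded.append([f"S{j + 1}.miss" if g is None else f"S{j + 1}.{g}"
                        for j, g in enumerate(row)])
    return np.array(recoded)

def indicator_table(labels):
    """Disjunctive table: one block of 0/1 columns per subject."""
    blocks = []
    for column in labels.T:
        categories = np.unique(column)
        blocks.append((column[:, None] == categories[None, :]).astype(int))
    return np.hstack(blocks)

M = None  # non-presented product
# Rows: the 12 perfumes in alphabetical order; columns: subjects 1 to 5.
INCOMPLETE = [
    [M, 4, 1, 5, M],  # Angel
    [3, 3, M, 2, M],  # Aromatics Elixir
    [4, 3, 4, 1, 3],  # Chanel n°5
    [2, M, 6, 4, 2],  # Cinéma
    [M, 5, 2, 4, 3],  # Coco Mademoiselle
    [1, 6, 2, 3, 3],  # J'adore (EP)
    [2, 6, 2, M, 3],  # J'adore (ET)
    [1, M, 6, 2, 4],  # L'instant
    [1, 5, 1, 5, 2],  # Lolita Lempicka
    [3, 4, 6, M, 4],  # Pleasures
    [1, 1, M, 4, 4],  # Pure Poison
    [2, 2, 3, 2, 1],  # Shalimar
]

labels = recode_missing(INCOMPLETE)
table = indicator_table(labels)
print(table.shape)
```

With this coding every row of the disjunctive table still contains exactly one 1 per subject, which is what lets the usual MCA run unchanged on incomplete data.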
Selection of the subset of products
Each subject tests p products among I, according to at least 2 possibilities:
- randomly:
  - the product i is tested $r_i$ times
  - the pair of products (i, l) is tested $\lambda_{il}$ times
- from a BIBD:
  - each product is tested the same number of times: r
  - each pair of products is tested the same number of times: λ
Selection of the subset of products (randomly) & MCA
[Figure: disjunctive table reordered according to the products i and l]
Distance between the 2 products i and l:
$d^2(i,l) = d_0^2(i,l) + \frac{1}{J} \frac{I}{I-p} \big( (r_i - \lambda_{il}) + (r_l - \lambda_{il}) \big)$
Selection of the subset of products (with a BIBD) & MCA
[Figure: disjunctive table reordered according to the products i and l. The columns of the missing groups each have total I - p; for these columns the J subjects split into blocks of sizes r - λ, r - λ and J - 2r + λ.]
i and l: 2(r - λ)
Distance between the 2 products i and l:
$d^2(i,l) = d_0^2(i,l) + \frac{1}{J} \frac{I}{I-p} \, 2(r - \lambda)$
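To make the role of r and λ concrete, the sketch below counts how often each product and each pair of products is tested in a small balanced design, then evaluates the term that the missing groups add to the squared distance, read here as $\frac{1}{J}\frac{I}{I-p}\big((r_i-\lambda_{il})+(r_l-\lambda_{il})\big)$. The design used (7 products, 7 subjects, 3 products per subject) is a classical textbook BIBD chosen only for illustration; it is not the design of the perfume study, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def design_counts(blocks):
    """Return how often each product (r_i) and each pair (lambda_il) is tested."""
    product_counts = Counter(p for block in blocks for p in block)
    pair_counts = Counter(pair for block in blocks
                          for pair in combinations(sorted(block), 2))
    return product_counts, pair_counts

def missing_group_term(n_products, n_subjects, p, r_i, r_l, lam_il):
    """Extra squared distance due to the subjects who tested exactly one of i and l."""
    one_missing = (r_i - lam_il) + (r_l - lam_il)
    return (n_products / (n_products - p)) * one_missing / n_subjects

# Illustrative BIBD: each list is the set of products presented to one subject.
BLOCKS = [[0, 1, 2], [0, 3, 4], [0, 5, 6], [1, 3, 5],
          [1, 4, 6], [2, 3, 6], [2, 4, 5]]

product_counts, pair_counts = design_counts(BLOCKS)
r = product_counts[0]
lam = pair_counts[(0, 1)]
extra = missing_group_term(n_products=7, n_subjects=len(BLOCKS), p=3,
                           r_i=r, r_l=r, lam_il=lam)
print(r, lam, extra)
```

Because r and λ are constant in a BIBD, this extra term is identical for every pair of products, so the missing groups shift all distances by the same amount; with a random selection the term varies from pair to pair.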
Evaluation of the approach
Analyzing incomplete data:
- How much better is the BIBD compared to the random selection?
- Are the results obtained coherent with those from complete data?
BIBD vs. random selection: simulation process (1)
[Diagram: from the complete data, two incomplete data sets are derived, one by random selection and one from a BIBD. An MCA is run on each of the three tables, and each MCA on incomplete data is compared with the MCA on complete data through an RV coefficient.]
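The RV coefficient used to compare two configurations of the same products can be sketched as below, using its usual definition on column-centred coordinate matrices, $RV(X,Y) = \mathrm{tr}(XX'YY') / \sqrt{\mathrm{tr}((XX')^2)\,\mathrm{tr}((YY')^2)}$. The function name and the random coordinates are illustrative assumptions; in the simulation the inputs would be the product coordinates from the two MCAs.

```python
import numpy as np

def rv_coefficient(x, y):
    """RV coefficient between two configurations of the same rows (products)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x = x - x.mean(axis=0)  # centre each dimension
    y = y - y.mean(axis=0)
    wx = x @ x.T            # scalar products between products, configuration 1
    wy = y @ y.T            # scalar products between products, configuration 2
    numerator = np.trace(wx @ wy)
    denominator = np.sqrt(np.trace(wx @ wx) * np.trace(wy @ wy))
    return float(numerator / denominator)

# Made-up coordinates of 12 products on 2 dimensions.
rng = np.random.default_rng(0)
complete = rng.normal(size=(12, 2))
theta = 0.7
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta), np.cos(theta)]])
rotated = complete @ rotation
unrelated = rng.normal(size=(12, 2))

print(rv_coefficient(complete, rotated), rv_coefficient(complete, unrelated))
```

The coefficient lies between 0 and 1 and is unchanged by rotation of either configuration, which is why it suits the comparison of factor maps whose axes need not coincide.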
BIBD vs. random selection: simulation process (2)
Simulations of different situations:
Perfumes per subject | 30 subjects, random | 30 subjects, BIBD | 60 subjects, random | 60 subjects, BIBD | 98 subjects, random | 98 subjects, BIBD
6 perfumes (50% miss.) | 0.471 | 0.579 | 0.621 | 0.758 | 0.721 | 0.855
8 perfumes (33% miss.) | 0.652 | 0.746 | 0.77 | 0.869 | 0.88 | 0.94
10 perfumes (17% miss.) | 0.76 | 0.781 | 0.92 | 0.942 | 0.97 | 0.98
Table: Average RV (complete, incomplete) coefficients from 100 simulations.
Simulated incomplete data
[Figure]
Real incomplete data
[Figure]
Coherence with complete data (1)
- 2 datasets:
  - Complete: 98 subjects - 12 products
  - Incomplete: 42 subjects - 8 different products among 12
- Multiple Factor Analysis on the 2 datasets
[Figure: the two tables side by side, sharing the same 12 rows (products), with 98 columns for the complete data and 42 columns for the incomplete data]
Coherence with complete data (2)
- Multiple Factor Analysis on the 2 datasets
[Figure]
Validation of categorization results
Elements of validity
Categorization data:
- Number of columns > number of rows
- PCA: automatic production of correlations
- Categorization data (MCA): automatic production of common dimensions
- Validity of the observed common dimensions (consensus)?
- Looking for an indicator of consensus between subjects
Consensus between subjects
- Indicator: first eigenvalue
  $\lambda_1 = \frac{1}{J} \sum_{j} \eta^2(F_1, j)$
- $H_0$: absence of consensus (independent and random categorizations)
- Simulated complete data ($K_j = 5 \ \forall j$, $I_k = I/5 \ \forall k$): evolution of the indicator under $H_0$
Number of products | 10 subjects | 20 subjects | 50 subjects | 100 subjects
10 | 0.6957 | 0.6246 | 0.5578 | 0.5242
20 | 0.471 | 0.3892 | 0.3196 | 0.2863
50 | 0.3062 | 0.2288 | 0.1678 | 0.14
100 | 0.2364 | 0.164 | 0.1099 | 0.0864
- Simulated incomplete data: simulation with one group out of the 5 coming from a BIBD → same evolution of the first eigenvalue
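The behaviour of the first eigenvalue without consensus can be explored with a simulation along the following lines. It is a minimal sketch, assuming that each subject independently forms 5 groups of equal size at random; the function names are hypothetical, and the MCA eigenvalue is obtained here from the singular values of the standardized disjunctive table rather than from a dedicated MCA routine.

```python
import numpy as np

def first_mca_eigenvalue(groups):
    """First MCA eigenvalue of an (I products x J subjects) array of labels."""
    groups = np.asarray(groups)
    n_products, n_subjects = groups.shape
    blocks = []
    for column in groups.T:
        labels = np.unique(column)
        blocks.append((column[:, None] == labels[None, :]).astype(float))
    x = np.hstack(blocks)
    group_sizes = x.sum(axis=0)
    # Standardized residuals of the disjunctive table: their squared singular
    # values are the MCA eigenvalues.
    residuals = (x - group_sizes / n_products) / np.sqrt(n_subjects * group_sizes)
    return float(np.linalg.svd(residuals, compute_uv=False)[0] ** 2)

def simulate_h0(n_products, n_subjects, n_groups=5, n_simulations=100, seed=0):
    """Average first eigenvalue when subjects categorize independently at random.

    n_products must be a multiple of n_groups so that all groups have equal size.
    """
    rng = np.random.default_rng(seed)
    balanced = np.repeat(np.arange(n_groups), n_products // n_groups)
    values = []
    for _ in range(n_simulations):
        groups = np.column_stack([rng.permutation(balanced)
                                  for _ in range(n_subjects)])
        values.append(first_mca_eigenvalue(groups))
    return float(np.mean(values))

print(simulate_h0(n_products=10, n_subjects=10, n_simulations=20))
```

With perfect consensus (all subjects giving the same partition) the function returns 1, the maximum; running `simulate_h0` over a grid of numbers of products and subjects gives values that can be set against those reported on the slide.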
Significance of the indicator for a given data table
Associate a p-value to the first eigenvalue of the MCA:
- Complete data:
  - Repeat a great number of times:
    1. To be under $H_0$, permute the rows independently within each column
    2. Calculate the first eigenvalue of the permuted table
  - This gives the distribution of the eigenvalues under $H_0$
  - Locate the observed eigenvalue in this distribution to get the p-value
- Incomplete data: for each subject, permutations only among the presented products (in order to preserve the BIBD)
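The permutation procedure can be sketched as follows. Several points are assumptions made for the illustration: non-presented products are coded `-1` and treated as an additional group, the p-value uses the common "(1 + count) / (1 + number of permutations)" convention, and all function names are hypothetical.

```python
import numpy as np

MISSING = -1  # hypothetical code for a non-presented product

def first_mca_eigenvalue(groups):
    """First MCA eigenvalue; MISSING is treated as one more group per subject."""
    n_products, n_subjects = groups.shape
    blocks = []
    for column in groups.T:
        labels = np.unique(column)
        blocks.append((column[:, None] == labels[None, :]).astype(float))
    x = np.hstack(blocks)
    sizes = x.sum(axis=0)
    residuals = (x - sizes / n_products) / np.sqrt(n_subjects * sizes)
    return float(np.linalg.svd(residuals, compute_uv=False)[0] ** 2)

def permute_within_subjects(groups, rng):
    """Shuffle each subject's labels among the products presented to that subject."""
    permuted = groups.copy()
    for j in range(groups.shape[1]):
        presented = groups[:, j] != MISSING
        permuted[presented, j] = rng.permutation(groups[presented, j])
    return permuted

def permutation_p_value(groups, n_permutations=200, seed=0):
    """Observed first eigenvalue, its simulated H0 distribution, and the p-value."""
    groups = np.asarray(groups)
    rng = np.random.default_rng(seed)
    observed = first_mca_eigenvalue(groups)
    null = np.array([first_mca_eigenvalue(permute_within_subjects(groups, rng))
                     for _ in range(n_permutations)])
    p_value = (1 + np.sum(null >= observed)) / (1 + n_permutations)
    return observed, null, float(p_value)

# Made-up example: 6 subjects giving exactly the same partition of 10 products.
consensus = np.tile(np.repeat(np.arange(5), 2)[:, None], (1, 6))
observed, null, p_value = permutation_p_value(consensus, n_permutations=99, seed=1)
print(observed, p_value)
```

Leaving the missing cells in place during the shuffle keeps the presentation design untouched, so the permuted tables have the same r and λ as the observed one.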
Significance of the first eigenvalue for perfumes data
[Figure: two histograms, one per case. x-axis: first eigenvalue; y-axis: frequency. Complete case: 98 subjects - 12 perfumes. Incomplete case: 42 subjects - 8 perfumes.]
Confidence ellipses around products: what we want
[Figure: points P1 to P4 and F1 to F4 on the factor map]
Confidence ellipses around products: how to obtain them
- Superimposed representation of the products and the categories (= word(s))
- P4 is at the barycentre of the words
- Panelists' words (resampled) give product P4 (resampled)
[Figure sequence: the resampled positions of product P4 accumulate around P4]
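The resampling idea shown on these slides can be sketched as below, under assumptions that are not stated in the talk: panelists are drawn with replacement, a product is placed at the plain barycentre of the coordinates of the words given to it, and the ellipse is summarized by the centre and covariance matrix of the resampled positions. The array `word_coords` holds made-up coordinates, and the function names are hypothetical.

```python
import numpy as np

def resampled_positions(word_coords, n_resamples=500, seed=0):
    """Resampled positions of one product.

    word_coords: (J, 2) array, coordinates on the factor map of the word(s)
    that each of the J panelists associated with the product.
    """
    word_coords = np.asarray(word_coords, dtype=float)
    rng = np.random.default_rng(seed)
    n_panelists = word_coords.shape[0]
    # Each row of `draws` is one resampled panel (drawn with replacement).
    draws = rng.integers(0, n_panelists, size=(n_resamples, n_panelists))
    # The product is placed at the barycentre of the resampled panel's words.
    return word_coords[draws].mean(axis=1)

def ellipse_summary(points):
    """Centre and covariance matrix describing the cloud of resampled positions."""
    centre = points.mean(axis=0)
    covariance = np.cov(points, rowvar=False)
    return centre, covariance

# Made-up word coordinates for one product and 98 panelists.
rng = np.random.default_rng(1)
word_coords = rng.normal(loc=[1.0, -0.5], scale=0.8, size=(98, 2))
points = resampled_positions(word_coords)
centre, covariance = ellipse_summary(points)
print(centre)
```

The more the panelists agree on where a product lies, the tighter the cloud of resampled positions and the smaller the resulting ellipse.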
Confidence ellipses around products
[Figure: confidence ellipses for the mean points, one panel per case. Complete case (98 subjects - 12 perfumes): Dim 1 (17.8%), Dim 2 (13.64%). Incomplete case (42 subjects - 8 perfumes): Dim 1 (14.48%), Dim 2 (11.5%).]
Ellipses overlapping
- Can the observed non-overlapping be obtained by pure chance?
- Ellipses overlapping between two products A and B:
  [Figure: resampled points A1, A2, A3 around A and B1, B2, B3 around B, illustrating Total inertia = Between inertia + Within inertia]
- Overlapping indicator:
  $R_o^2 = \frac{\text{between inertia}}{\text{total inertia}} \quad (0 \le R_o^2 \le 1)$
- Same permutation procedure as for the first eigenvalue
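The overlapping indicator, between inertia divided by total inertia, can be computed for two clouds of resampled points as in the sketch below. It assumes unit weights for all points; the function name and the small example clouds are illustrative only.

```python
import numpy as np

def inertia_ratio(cloud_a, cloud_b):
    """Return (ratio, between, within, total) for two clouds of points."""
    a = np.asarray(cloud_a, dtype=float)
    b = np.asarray(cloud_b, dtype=float)
    everything = np.vstack([a, b])
    centre = everything.mean(axis=0)
    centre_a, centre_b = a.mean(axis=0), b.mean(axis=0)
    total = np.sum((everything - centre) ** 2)
    between = (len(a) * np.sum((centre_a - centre) ** 2)
               + len(b) * np.sum((centre_b - centre) ** 2))
    within = np.sum((a - centre_a) ** 2) + np.sum((b - centre_b) ** 2)
    return float(between / total), float(between), float(within), float(total)

# Made-up resampled positions of two products A and B.
cloud_a = [[0.0, 0.0], [0.0, 2.0]]
cloud_b = [[4.0, 0.0], [4.0, 2.0]]
ratio, between, within, total = inertia_ratio(cloud_a, cloud_b)
print(ratio)
```

A ratio close to 1 means the two clouds are well separated (little overlap), while a ratio close to 0 means their centres nearly coincide.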
Significance of the inertia ratio for perfumes data
[Figure: two histograms, one per case. x-axis: ratio; y-axis: frequency. Complete case: 98 subjects - 12 perfumes. Incomplete case: 42 subjects - 8 perfumes.]
Conclusion
- MCA: suitable method to analyze categorization data
- Incomplete categorization data:
  - BIBD: best way to select the products
  - BIBD + MCA provide useful results
- Categorization data: few rows and many columns
- Necessary to validate the results:
  - First eigenvalue
  - Inertia ratio
- Available in SensoMineR
The Applied Mathematics Department of Agrocampus organizes the R User Conference 2009, July 8-10, in Rennes, France: http://www.agrocampus-rennes.fr/math/useR-2009/