Missing values and elements of validity in categorization

Marine Cadoret, Sébastien Lê, Jérôme Pagès
Applied Mathematics Department, Agrocampus Ouest, Rennes, France

ASMDA09, Vilnius, July 1st, 2009

Overview

- Context
- Analyzing complete categorization data
- Analyzing incomplete categorization data
- Validation of categorization results
- Conclusion

Context

Categorization: what is categorization?

- Categorization (or sorting task) consists in grouping objects according to their resemblances.
- After this task, subjects can also be asked to describe their groups with words (verbalization task): this is qualified categorization.
- Each subject provides one partition of the objects.

Data

98 consumers carried out a qualified categorization on 12 luxury perfumes:

- Angel
- Aromatics Elixir
- Chanel n°5
- Cinéma
- Coco Mademoiselle
- J'adore (EP)
- J'adore (ET)
- L'instant
- Lolita Lempicka
- Pleasures
- Pure Poison
- Shalimar

Example of categorization

One subject, five groups, each described by words:

- « gourmand, vanilla, wooded »
- « spicy, aldehyde »
- « white flower, vanilla, orange »
- « oriental, showy, wooded, Patchouli oil »
- « flower, floral, green »


Analyzing complete categorization data

Analyzing categorization data: complete data

                    Subject 1  Subject 2  Subject 3  Subject 4  Subject 5
Angel               1          4          1          5          2
Aromatics Elixir    3          3          5          2          1
Chanel n°5          4          3          4          1          3
Cinéma              2          5          6          4          2
Coco Mademoiselle   1          5          2          4          3
J'adore (EP)        1          6          2          3          3
J'adore (ET)        2          6          2          3          3
L'instant           1          4          6          2          4
Lolita Lempicka     1          5          1          5          2
Pleasures           3          4          6          3          4
Pure Poison         1          1          2          4          4
Shalimar            2          2          3          2          1

- One subject = one qualitative variable

Analyzing categorization data: complete data (qualified categorization)

                    Subject 1      Subject 2            Subject 3          Subject 4      Subject 5
Angel               floral soft    fruity strong        vanilla spicy      to eat sweet   food spicy
Aromatics Elixir    strong man     heady grandmother    hard strong        old            domestic wax
Chanel n°5          Gr4            heady grandmother    toilet             soap           classical
Cinéma              floral grass   fruity medium        sweet              soft           food spicy
Coco Mademoiselle   floral soft    fruity medium        softness floral    soft           classical
J'adore (EP)        floral soft    sweet light          softness floral    floral         classical
J'adore (ET)        floral grass   sweet light          softness floral    floral         classical
L'instant           floral soft    fruity strong        sweet              old            floral
Lolita Lempicka     floral soft    fruity medium        vanilla spicy      to eat sweet   food spicy
Pleasures           strong man     fruity strong        sweet              floral         floral
Pure Poison         floral soft    slightly deodorant   softness floral    soft           floral
Shalimar            floral grass   strong lavender      fuggy aggressive   old            domestic wax

- Qualified categorization: label = word(s) associated with the group
- Appropriate data table for Multiple Correspondence Analysis (MCA)

Multiple Correspondence Analysis

- Data taken into account via the disjunctive (indicator) data table:
  - rows: products i = 1, ..., I
  - columns: groups k = 1, ..., K of all subjects j = 1, ..., J (subject j provides K_j groups)
  - x_ik = 1 if product i belongs to group k, 0 otherwise (a row looks like 0 1 0 0 0 0 within one subject)
  - margins: each row sums to J, column k sums to I_k (number of products in group k)
- Distance between 2 products i and l:

  \[ d^2(i,l) = \frac{1}{J} \sum_{k=1}^{K} \frac{I}{I_k} \, (x_{ik} - x_{lk})^2 \]
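
The MCA distance d^2(i,l) = (1/J) sum_k (I/I_k) (x_ik - x_lk)^2 can be checked by hand on the five subjects of the complete-data table. The following is a minimal sketch, not the SensoMineR implementation; the helper names `indicator_matrix` and `mca_sq_distance` are hypothetical.

```python
PERFUMES = ["Angel", "Aromatics Elixir", "Chanel n°5", "Cinéma",
            "Coco Mademoiselle", "J'adore (EP)", "J'adore (ET)", "L'instant",
            "Lolita Lempicka", "Pleasures", "Pure Poison", "Shalimar"]

# PARTITIONS[j][i] = group given by subject j to perfume i (values from the slide)
PARTITIONS = [
    [1, 3, 4, 2, 1, 1, 2, 1, 1, 3, 1, 2],
    [4, 3, 3, 5, 5, 6, 6, 4, 5, 4, 1, 2],
    [1, 5, 4, 6, 2, 2, 2, 6, 1, 6, 2, 3],
    [5, 2, 1, 4, 4, 3, 3, 2, 5, 3, 4, 2],
    [2, 1, 3, 2, 3, 3, 3, 4, 2, 4, 4, 1],
]


def indicator_matrix(partitions):
    """Return (X, col_sums) with X[i][k] = 1 if product i belongs to group k."""
    columns = [(j, g) for j, part in enumerate(partitions)
               for g in sorted(set(part))]
    n_products = len(partitions[0])
    X = [[1 if partitions[j][i] == g else 0 for (j, g) in columns]
         for i in range(n_products)]
    col_sums = [sum(row[k] for row in X) for k in range(len(columns))]
    return X, col_sums


def mca_sq_distance(X, col_sums, n_subjects, i, l):
    """d^2(i,l) = (1/J) * sum_k (I / I_k) * (x_ik - x_lk)^2."""
    n_products = len(X)
    return sum(n_products / col_sums[k] * (X[i][k] - X[l][k]) ** 2
               for k in range(len(col_sums))) / n_subjects


X, col_sums = indicator_matrix(PARTITIONS)
J = len(PARTITIONS)
d_angel_lolita = mca_sq_distance(X, col_sums, J,
                                 PERFUMES.index("Angel"),
                                 PERFUMES.index("Lolita Lempicka"))
d_jadore = mca_sq_distance(X, col_sums, J,
                           PERFUMES.index("J'adore (EP)"),
                           PERFUMES.index("J'adore (ET)"))
print(d_angel_lolita, d_jadore)
```

Two products separated by a single subject are closer when the groups that separate them are large: small groups (small I_k) weigh more in the distance.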

Representation of the perfumes

[Figure: MCA factor map of the 12 perfumes; Dim 1 (17.8%), Dim 2 (13.64%)]

Representation of the words

[Figure: MCA factor map of the perfumes together with the words used to describe the groups (e.g. vanilla, fruity, spicy, soft, discrete, oriental); Dim 1 (17.8%), Dim 2 (13.64%)]


Analyzing incomplete categorization data

Problems

If too many products are to be tested: selection of a subset of products for each subject

- How to analyze incomplete data?
- How to choose the best way to select the products tested by each subject?

Analyzing categorization data: incomplete data

                    Subject 1  Subject 2  Subject 3  Subject 4  Subject 5
Angel               S1.miss    4          1          5          S5.miss
Aromatics Elixir    3          3          S3.miss    2          S5.miss
Chanel n°5          4          3          4          1          3
Cinéma              2          S2.miss    6          4          2
Coco Mademoiselle   S1.miss    5          2          4          3
J'adore (EP)        1          6          2          3          3
J'adore (ET)        2          6          2          S4.miss    3
L'instant           1          S2.miss    6          2          4
Lolita Lempicka     1          5          1          5          2
Pleasures           3          4          6          S4.miss    4
Pure Poison         1          1          S3.miss    4          4
Shalimar            2          2          3          2          1

For each subject:

- Non-presented products = additional group
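
Coding non-presented products as an additional group amounts to a simple recoding step before building the disjunctive table. A minimal sketch, assuming missing products are stored as `None` (an assumption of this example) and using a hypothetical helper `recode_missing`:

```python
def recode_missing(partition, subject_id):
    """Replace None (product not presented) by the subject's own extra group."""
    miss_label = f"S{subject_id}.miss"
    return [miss_label if group is None else group for group in partition]


# Subject 1 of the incomplete table: Angel and Coco Mademoiselle not presented
subject_1 = [None, 3, 4, 2, None, 1, 2, 1, 1, 3, 1, 2]
recoded = recode_missing(subject_1, 1)

# The additional group contains the I - p non-presented products
n_missing = recoded.count("S1.miss")
print(recoded, n_missing)
```

Each subject gets a distinct label, so the non-presented products of two different subjects never fall into the same column of the disjunctive table.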

Selection of the subset of products

Each subject tests p products among I, according to at least 2 possibilities:

- randomly:
  - the product i is tested r_i times
  - the pair of products (i, l) is tested λ_il times
- from a BIBD:
  - each product is tested the same number of times: r
  - each pair of products is tested the same number of times: λ
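
The counts r_i and λ_il can be tabulated for any design. A minimal sketch: the BIBD used here (7 products, 7 subjects, p = 3, r = 3, λ = 1) is a textbook example chosen for illustration, not the design used in the study, and the helper names are hypothetical.

```python
import random
from itertools import combinations


def design_counts(design, n_products):
    """design[j] = set of products tested by subject j. Returns (r, lam)."""
    r = [0] * n_products
    lam = {pair: 0 for pair in combinations(range(n_products), 2)}
    for tested in design:
        for i in tested:
            r[i] += 1                      # product i tested once more
        for pair in combinations(sorted(tested), 2):
            lam[pair] += 1                 # pair (i, l) tested together
    return r, lam


def random_design(n_subjects, n_products, p, seed=0):
    """Each subject receives p products drawn at random without replacement."""
    rng = random.Random(seed)
    return [set(rng.sample(range(n_products), p)) for _ in range(n_subjects)]


# Illustrative BIBD (not from the slides): 7 products, 7 subjects, p = 3
BIBD = [{0, 1, 2}, {0, 3, 4}, {0, 5, 6}, {1, 3, 5},
        {1, 4, 6}, {2, 3, 6}, {2, 4, 5}]

r_bibd, lam_bibd = design_counts(BIBD, 7)
r_rand, lam_rand = design_counts(random_design(7, 7, 3), 7)

print("BIBD   r:", r_bibd, "lambda values:", sorted(set(lam_bibd.values())))
print("random r:", r_rand, "lambda values:", sorted(set(lam_rand.values())))
```

With the BIBD, r and λ are constant; with random selection, r_i and λ_il vary from one product or pair to another, and some pairs may never be tested together.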

Selection of the subset of products (randomly) & MCA

- Disjunctive table reordered according to the products i and l
- Distance between the 2 products i and l:

  \[ d^2(i,l) = d'^2(i,l) + \frac{1}{J} \, \frac{I}{I-p} \, \big( (r_i - \lambda_{il}) + (r_l - \lambda_{il}) \big) \]

  where d'^2(i,l) is the part of the distance computed on the columns of the groups actually formed; the second term comes from the columns of non-presented products, each of which sums to I - p.

Selection of the subset of products (with a BIBD) & MCA

Disjunctive table reordered according to the products i and l:

                groups 1, ..., K'    non-presented    non-presented    non-presented    row sum
                                     (r - λ cols)     (r - λ cols)     (J - 2r + λ cols)
product i       1 0 1 1 0 ... 0 1 1 0    1 1 1 ... 1 1    0 0 0 ... 0 0    1 1 1 ... 1 1    J
product l       0 0 1 1 1 ... 1 0 1 0    0 0 0 ... 0 0    1 1 1 ... 1 1    1 1 1 ... 1 1    J
column sum      I_1 ... I_k' ... I_K'    I - p            I - p            I - p

- Number of non-presented columns in which i and l differ: 2(r − λ)
- Distance between the 2 products i and l:

  \[ d^2(i,l) = d'^2(i,l) + \frac{1}{J} \, \frac{I}{I-p} \, 2 (r - \lambda) \]
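
With a BIBD, the columns of non-presented products add the same constant, (1/J) (I/(I-p)) 2(r-λ), to the squared distance of every pair of products. This can be verified numerically. A minimal sketch on an illustrative BIBD (7 products, 7 subjects, p = 3, r = 3, λ = 1, not the design of the study), with arbitrary toy groupings and hypothetical helper names:

```python
from itertools import combinations

# Illustrative BIBD: BLOCKS[j] = products presented to subject j
BLOCKS = [(0, 1, 2), (0, 3, 4), (0, 5, 6), (1, 3, 5),
          (1, 4, 6), (2, 3, 6), (2, 4, 5)]
I, J, p, r, lam = 7, 7, 3, 3, 1


def toy_partition(block):
    """Arbitrary grouping of the presented products; the others get 'miss'."""
    labels = {block[0]: "g1", block[1]: "g2", block[2]: "g2"}
    return [labels.get(i, "miss") for i in range(I)]


PARTITIONS = [toy_partition(block) for block in BLOCKS]


def sq_distance(partitions, i, l, with_missing):
    """MCA squared distance, with or without the non-presented columns."""
    total = 0.0
    for part in partitions:
        for label in sorted(set(part)):
            if label == "miss" and not with_missing:
                continue
            size = part.count(label)                       # column sum I_k
            diff = (part[i] == label) - (part[l] == label)  # x_ik - x_lk
            total += len(part) / size * diff ** 2
    return total / len(partitions)


extra_term = I / (I - p) * 2 * (r - lam) / J
gaps = [sq_distance(PARTITIONS, i, l, True) - sq_distance(PARTITIONS, i, l, False)
        for i, l in combinations(range(I), 2)]
print(extra_term, min(gaps), max(gaps))
```

Because the added term is identical for all pairs, the design itself does not favour any pair of products; with random selection the term depends on r_i, r_l and λ_il and differs between pairs.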

Evaluation of the approach

Analyzing incomplete data:

- How much better is the BIBD compared to the random selection?
- Are the obtained results coherent with complete data?

BIBD vs. random selection: simulation process (1)

- From the complete data, two incomplete data sets are derived: one by random selection, one from a BIBD
- MCA is run on the complete data and on each incomplete data set
- Each MCA of incomplete data is compared with the MCA of complete data through the RV coefficient

BIBD vs. random selection: simulation process (2)

Simulations of different situations:

                           30 subjects       60 subjects       98 subjects
                           random   BIBD     random   BIBD     random   BIBD
6 perfumes (50% miss.)     0.471    0.579    0.621    0.758    0.721    0.855
8 perfumes (33% miss.)     0.652    0.746    0.77     0.869    0.88     0.94
10 perfumes (17% miss.)    0.76     0.781    0.92     0.942    0.97     0.98

Table: Average RV (complete, incomplete) coefficients from 100 simulations.
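
The RV coefficient compares two configurations of the same products: RV = tr(W_X W_Y) / sqrt(tr(W_X^2) tr(W_Y^2)), with W = XX' computed on column-centred coordinates. A minimal pure-Python sketch with made-up coordinates (the function names are hypothetical; this is not the code used for the simulations):

```python
from math import sqrt


def center(X):
    """Subtract the column means from each row."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]


def cross_product(X):
    """W = X X' (products x products)."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in X] for r1 in X]


def trace_of_product(A, B):
    """tr(AB) for two square matrices of the same size."""
    n = len(A)
    return sum(A[i][j] * B[j][i] for i in range(n) for j in range(n))


def rv_coefficient(X, Y):
    Wx, Wy = cross_product(center(X)), cross_product(center(Y))
    return trace_of_product(Wx, Wy) / sqrt(
        trace_of_product(Wx, Wx) * trace_of_product(Wy, Wy))


# Made-up 2-D coordinates of 4 products
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 3.0]]
# Same configuration, rotated by 90 degrees and scaled by 2
Y = [[-2.0 * y, 2.0 * x] for x, y in X]
# A different configuration
Z = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [1.0, 0.0]]

rv_same = rv_coefficient(X, Y)
rv_other = rv_coefficient(X, Z)
print(rv_same, rv_other)
```

RV lies between 0 and 1 and is unchanged by rotation, translation and global scaling, which is why it suits the comparison of factor maps obtained from different data sets.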

Simulated incomplete data

[Figure: simulated incomplete data]

Real incomplete data

[Figure: real incomplete data]

Coherence with complete data (1)

- 2 datasets:
  - Complete: 98 subjects - 12 products
  - Incomplete: 42 subjects - 8 different products among 12
- Multiple Factor Analysis on the 2 datasets

[Diagram: one table with 12 rows (products) and two groups of columns: subjects 1 to 98 (complete) and subjects 1 to 42 (incomplete)]

Coherence with complete data (2)

- Multiple Factor Analysis on the 2 datasets

[Figure: results of the Multiple Factor Analysis]


Validation of categorization results

Elements of validity

Categorization data:

- Number of columns much larger than number of rows
- PCA: automatic production of correlations
- Categorization data (MCA): automatic production of common dimensions
- Validity of the observed common dimensions (consensus)?
- Looking for an indicator of consensus between subjects

Validation of categorization results

Consensus between subjects I Indicator: rst eigenvalue I

H

0:

⇒ λ1 =

1

P

J

j

η 2 (F1 , j )

absence of consensus (independently and randomly

categorizations)

ASMDA09

Vilnius

Missing values in categorization

27 / 36

Validation of categorization results

Consensus between subjects I Indicator: rst eigenvalue I

H

0:

⇒ λ1 =

1

P

J

j

η 2 (F1 , j )

absence of consensus (independently and randomly

categorizations)

I Simulated complete data ( the indicator under

ASMDA09

Vilnius

H

K

j

= 5∀j , Ik = 5I ∀k ):

evolution of

0

Missing values in categorization

27 / 36

Validation of categorization results

Consensus between subjects

I Indicator: rst eigenvalue I

H

0:

⇒ λ1 =

1

P j

J

η 2 (F1 , j )

absence of consensus (independently and randomly

categorizations)

I Simulated complete data ( the indicator under

H

K

j

= 5∀j , Ik = 5I ∀k ):

evolution of

0 Number of subjects 10

20

50

100

10 Number of products

20 50 100

ASMDA09

Vilnius

Missing values in categorization

27 / 36

Validation of categorization results

Consensus between subjects

I Indicator: rst eigenvalue I

H

0:

⇒ λ1 =

1

P j

J

η 2 (F1 , j )

absence of consensus (independently and randomly

categorizations)

I Simulated complete data ( the indicator under

H

K

j

= 5∀j , Ik = 5I ∀k ):

evolution of

0 Number of subjects 10

20

50

100

0.6957

0.6246

0.5578

0.5242

20

0.471

0.3892

0.3196

0.2863

50

0.3062

0.2288

0.1678

0.14

100

0.2364

0.164

0.1099

0.0864

10 Number of products

ASMDA09

Vilnius

Missing values in categorization

27 / 36

Validation of categorization results

Consensus between subjects

I Indicator: rst eigenvalue I

H

0:

⇒ λ1 =

1

P j

J

η 2 (F1 , j )

absence of consensus (independently and randomly

categorizations)

I Simulated complete data ( the indicator under

H

K

j

= 5∀j , Ik = 5I ∀k ):

evolution of

0 Number of subjects 10

20

50

100

0.6957

0.6246

0.5578

0.5242

20

0.471

0.3892

0.3196

0.2863

50

0.3062

0.2288

0.1678

0.14

100

0.2364

0.164

0.1099

0.0864

10 Number of products

I Simulated incomplete data: simulation with one group on the 5 from a BIBD

ASMDA09

Vilnius



same evolution of the rst eigenvalue

Missing values in categorization

27 / 36
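
The relation λ1 = (1/J) Σ_j η²(F1, j) can be checked numerically. A minimal sketch on the five subjects of the complete-data table: the first dimension is approximated by power iteration on the operator "average over subjects of the group means", which is an implementation choice of this example (helper names are hypothetical), not the SensoMineR code.

```python
from math import sqrt

PARTITIONS = [
    [1, 3, 4, 2, 1, 1, 2, 1, 1, 3, 1, 2],
    [4, 3, 3, 5, 5, 6, 6, 4, 5, 4, 1, 2],
    [1, 5, 4, 6, 2, 2, 2, 6, 1, 6, 2, 3],
    [5, 2, 1, 4, 4, 3, 3, 2, 5, 3, 4, 2],
    [2, 1, 3, 2, 3, 3, 3, 4, 2, 4, 4, 1],
]


def apply_S(partitions, v):
    """Average, over subjects, of 'replace each value by its group mean'."""
    out = [0.0] * len(v)
    for part in partitions:
        sums, counts = {}, {}
        for g, x in zip(part, v):
            sums[g] = sums.get(g, 0.0) + x
            counts[g] = counts.get(g, 0) + 1
        for i, g in enumerate(part):
            out[i] += sums[g] / counts[g]
    return [x / len(partitions) for x in out]


def normalize(v):
    """Centre v and scale it to unit norm (assumes v is not constant)."""
    mean = sum(v) / len(v)
    centred = [x - mean for x in v]
    norm = sqrt(sum(x * x for x in centred))
    return [x / norm for x in centred]


def first_dimension(partitions, n_iter=500):
    """Power iteration; returns (Rayleigh quotient, product coordinates)."""
    u = normalize([float(i) for i in range(len(partitions[0]))])
    for _ in range(n_iter):
        u = normalize(apply_S(partitions, u))
    lam = sum(a * b for a, b in zip(u, apply_S(partitions, u)))
    return lam, u


def eta_squared(values, partition):
    """Correlation ratio: between-group variance / total variance."""
    n = len(values)
    grand = sum(values) / n
    total = sum((x - grand) ** 2 for x in values)
    groups = {}
    for g, x in zip(partition, values):
        groups.setdefault(g, []).append(x)
    between = sum(len(xs) * (sum(xs) / len(xs) - grand) ** 2
                  for xs in groups.values())
    return between / total


lam1, F1 = first_dimension(PARTITIONS)
mean_eta2 = sum(eta_squared(F1, part) for part in PARTITIONS) / len(PARTITIONS)
print(lam1, mean_eta2)
```

The indicator equals 1 when every subject's partition perfectly separates the products along F1, and gets smaller as the partitions disagree.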

Significance of the indicator for a given data table

Associate a p-value to the first eigenvalue of the MCA:

- Complete data:
  - Repeat a great number of times:
    1. To be under H0, permute the rows independently within each column
    2. Calculate the first eigenvalue of the permuted table
  - Distribution of the eigenvalues under H0
  - Locate the observed eigenvalue in this distribution to get the p-value
- Incomplete data: for each subject, permutations only among the presented products (in order to preserve the BIBD)
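
The permutation procedure can be sketched as follows. This is an illustration under stated assumptions, not the SensoMineR implementation: the first eigenvalue is approximated by a few steps of power iteration, non-presented products carry the label `"miss"`, the toy data (6 subjects in full agreement on 9 products) are invented, and all function names are hypothetical.

```python
import random
from math import sqrt

MISS = "miss"  # label of non-presented products (assumption of this sketch)


def apply_S(partitions, v):
    """Average, over subjects, of 'replace each value by its group mean'."""
    out = [0.0] * len(v)
    for part in partitions:
        sums, counts = {}, {}
        for g, x in zip(part, v):
            sums[g] = sums.get(g, 0.0) + x
            counts[g] = counts.get(g, 0) + 1
        for i, g in enumerate(part):
            out[i] += sums[g] / counts[g]
    return [x / len(partitions) for x in out]


def first_eigenvalue(partitions, n_iter=50):
    """Approximate first MCA eigenvalue by power iteration."""
    n = len(partitions[0])
    u = [i - (n - 1) / 2 for i in range(n)]       # centred starting vector
    lam = 0.0
    for _ in range(n_iter):
        norm = sqrt(sum(x * x for x in u))
        if norm == 0.0:
            return 0.0
        u = [x / norm for x in u]
        w = apply_S(partitions, u)
        lam = sum(a * b for a, b in zip(u, w))    # Rayleigh quotient
        mean = sum(w) / n
        u = [x - mean for x in w]                 # keep the vector centred
    return lam


def permute_within_subject(partition, rng):
    """Shuffle the groups among presented products; MISS positions are kept."""
    presented = [i for i, g in enumerate(partition) if g != MISS]
    labels = [partition[i] for i in presented]
    rng.shuffle(labels)
    permuted = list(partition)
    for i, g in zip(presented, labels):
        permuted[i] = g
    return permuted


def permutation_p_value(partitions, n_perm=99, seed=0):
    rng = random.Random(seed)
    observed = first_eigenvalue(partitions)
    null = [first_eigenvalue([permute_within_subject(p, rng) for p in partitions])
            for _ in range(n_perm)]
    n_extreme = sum(1 for value in null if value >= observed - 1e-12)
    return observed, (1 + n_extreme) / (n_perm + 1)


# Toy data: 6 subjects giving the same partition of 9 products
toy = [[0, 0, 0, 1, 1, 1, 2, 2, 2] for _ in range(6)]
observed, p_value = permutation_p_value(toy)
print(observed, p_value)
```

Shuffling only the presented products leaves each subject's design untouched, so r and λ of the BIBD are the same in every permuted table.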

Significance of the first eigenvalue for perfumes data

[Figure: histograms (x-axis: first eigenvalue; y-axis: frequency). Left panel: complete case, 98 subjects - 12 perfumes. Right panel: incomplete case, 42 subjects - 8 perfumes.]

Confidence ellipses around products: what we want

[Figure: products P1 to P4 on the factor map, with points F1 to F4]

Confidence ellipses around products: how to obtain them

- Superimposed representation of the products and the categories (= word(s))
- P4 is at the barycentre of the words
- Panelists' words are resampled, which gives a resampled position of product P4

[Figures: successive resampled positions of product P4 around P4]
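
The slides do not detail the resampling scheme. A minimal sketch, assuming a plain bootstrap of the panelists (drawn with replacement) and a product placed at the barycentre of the words it received; the coordinates and function names below are hypothetical:

```python
import random


def barycentre(points):
    """Mean point of a list of (x, y) coordinates."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def resampled_barycentres(word_coords, n_boot=500, seed=0):
    """word_coords: one (x, y) per panelist, the coordinates of the word(s)
    this panelist associated with the product. Panelists are drawn with
    replacement and the product position is recomputed each time."""
    rng = random.Random(seed)
    n = len(word_coords)
    return [barycentre([rng.choice(word_coords) for _ in range(n)])
            for _ in range(n_boot)]


# Hypothetical coordinates of the words given to product P4 by 6 panelists
words_p4 = [(0.9, 0.2), (1.1, 0.4), (0.7, 0.1),
            (1.3, 0.5), (1.0, 0.3), (0.8, 0.0)]
p4 = barycentre(words_p4)
cloud_p4 = resampled_barycentres(words_p4)
print(p4, len(cloud_p4))
```

A confidence ellipse can then be drawn around the cloud of resampled positions of each product.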

Confidence ellipses around products

[Figure: confidence ellipses for the mean points. Left panel: complete case, 98 subjects - 12 perfumes, Dim 1 (17.8%), Dim 2 (13.64%). Right panel: incomplete case, 42 subjects - 8 perfumes, Dim 1 (14.48%), Dim 2 (11.5%).]

Ellipses overlapping

- Can the observed non-overlapping be obtained by pure chance?
- Ellipses overlapping between two products A and B, with resampled points A1, A2, A3 and B1, B2, B3:

  Total inertia = Between inertia + Within inertia

- Overlapping indicator:

  \[ R_o^2 = \frac{\text{between inertia}}{\text{total inertia}}, \qquad 0 \le R_o^2 \le 1 \]

- Same permutation procedure as for the first eigenvalue
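
The inertia ratio for two products can be computed directly from their clouds of resampled points. A minimal sketch with invented coordinates and hypothetical function names:

```python
def barycentre(points):
    """Mean point of a list of (x, y) coordinates."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def sq_dist(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def inertia_ratio(cloud_a, cloud_b):
    """Return (ratio, between, within, total) for two clouds of points."""
    clouds = (cloud_a, cloud_b)
    all_points = list(cloud_a) + list(cloud_b)
    g = barycentre(all_points)
    total = sum(sq_dist(pt, g) for pt in all_points)
    between = sum(len(c) * sq_dist(barycentre(c), g) for c in clouds)
    within = sum(sq_dist(pt, barycentre(c)) for c in clouds for pt in c)
    return between / total, between, within, total


# Invented resampled positions of two products A and B
A = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
B = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

ratio_far, between, within, total = inertia_ratio(A, B)   # separated clouds
ratio_same = inertia_ratio(A, A)[0]                        # fully overlapping
print(ratio_far, ratio_same)
```

A ratio close to 1 means the two clouds are well separated; a ratio close to 0 means they overlap.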

Significance of the inertia ratio for perfumes data

[Figure: histograms (x-axis: inertia ratio; y-axis: frequency). Left panel: complete case, 98 subjects - 12 perfumes. Right panel: incomplete case, 42 subjects - 8 perfumes.]


Conclusion

- MCA: suitable method to analyze categorization data
- Incomplete categorization data:
  - BIBD: best way to select the products
  - BIBD + MCA provide useful results
- Categorization data: few rows and many columns
- Necessary to validate the results:
  - First eigenvalue
  - Inertia ratio
- Available in SensoMineR

The Applied Mathematics Department of Agrocampus organizes the R User Conference 2009, July 8-10, in Rennes, France: http://www.agrocampus-rennes.fr/math/useR-2009/