Handling missing values with a special focus on the use ... - FactoMineR

Introduction

SI for continuous var.

SI for categorical var.

SI for mixed var.

Multiple imputation

Handling missing values with a special focus on the use of principal components methods
François Husson & Julie Josse
Applied mathematics department, Agrocampus Ouest, Rennes, France


Research activities

[Figure: schematic data table (individuals ind 1, ..., ind i, ..., ind I; variables V1, ..., Vj, ..., VJ). Variables factor map (PCA) with Typicity, Attack.intensity, Acid, Sweet, Color.intensity, Bitter, Odor.intensity and Pulp, and individuals factor map (PCA), on Dim 1 (71.34%) and Dim 2 (17.16%). Factor map of categories (tea shop, unpackaged, p_upscale, Earl Grey, tea bag, chain store, ...) on Dim 1 (9.885%) and Dim 2 (8.103%).]


Research activities

Groups of continuous/qualitative variables

Multiple Factor Analysis

[Figure: correlation circle for the two groups of variables (CGH, expr) and individual factor maps with individuals labelled A, GBM, O and OA, on Dim 1 (20.99%) and Dim 2 (13.51%).]


Research activities
• Exploratory multivariate data analysis (principal components methods to visualize data)
• Missing values
• Fields of application: bio-sciences; sensory analysis
• Books (Exploratory multivariate analysis with R, R for Statistics and 3 books in French)
• R packages (FactoMineR - missMDA - SensoMineR)
• A MOOC on exploratory multivariate data analysis


Outline

1 Introduction
2 Single imputation for continuous variables
3 Single imputation for categorical variables
4 Single imputation for mixed variables
5 Multiple imputation


Missing values

“The best thing to do with missing values is not to have any” (Gertrude Mary Cox)

Missing values are ubiquitous:
• no answer in a questionnaire
• data that are lost or destroyed
• machines that fail
• plants damaged
• ...

Still an issue in the big data era


A real dataset

       O3    T9   T12   T15  Ne9  Ne12  Ne15      Vx9     Vx12     Vx15  O3v
0601   NA  15.6  18.5  18.4    4     4     8       NA  -1.7101  -0.6946   84
0602   82    17  18.4  17.7    5     5     7       NA       NA       NA   87
0603   92    NA  17.6  19.5    2     5     4   2.9544   1.8794   0.5209   82
0604  114  16.2    NA    NA    1     1     0       NA       NA       NA   92
0605   94  17.4  20.5    NA    8     8     7     -0.5       NA  -4.3301  114
0606   80  17.7    NA  18.3   NA    NA    NA  -5.6382       -5       -6   94
0607   NA  16.8  15.6  14.9    7     8     8  -4.3301  -1.8794  -3.7588   80
0610   79  14.9  17.5  18.9    5     5     4        0  -1.0419  -1.3892   NA
0611  101    NA  19.6  21.4    2     4     4   -0.766       NA  -2.2981   79
0612   NA  18.3  21.9  22.9    5     6     8   1.2856  -2.2981  -3.9392  101
0613  101  17.3  19.3  20.2   NA    NA    NA     -1.5     -1.5  -0.8682   NA
. . .
0919   NA  14.8  16.3  15.9    7     7     7  -4.3301  -6.0622  -5.1962   42
0920   71  15.5    18  17.4    7     7     6  -3.9392  -3.0642        0   NA
0921   96    NA    NA    NA    3     3     3       NA       NA       NA   71
0922   98    NA    NA    NA    2     2     2        4        5   4.3301   96
0923   92  14.7  17.6  18.2    1     4     6   5.1962   5.1423      3.5   98
0924   NA  13.3  17.7  17.7   NA    NA    NA  -0.9397   -0.766     -0.5   92
0925   84  13.3  17.7  17.8    3     5     6        0       -1  -1.2856   NA
0927   NA  16.2  20.8  22.1    6     5     5  -0.6946       -2  -1.3681   71
0928   99  16.9    23  22.6   NA     4     7      1.5   0.8682   0.8682   NA
0929   NA  16.9  19.8  22.1    6     5     3       -4  -3.7588       -4   99
0930   70  15.7  18.6  20.7   NA    NA    NA        0  -1.0419       -4   NA


Some references
• Schafer (1997) (Joseph L. Schafer)
• Little & Rubin (1987, 2002) (Roderick Little, Donald Rubin)
• Suggested reading: chap. 25 of Gelman & Hill (2006) (Andrew Gelman, Jennifer L. Hill)


The missing values problem

A very simple way: deletion (the default in the lm function in R)

Dealing with missing values depends on:
• the pattern of missing values
• the mechanism leading to missing values
  • MCAR: the probability of being missing does not depend on any values
  • MAR: the probability may depend on the values of other variables
  • MNAR: the probability depends on the value itself
  (Ex: Income - Age)

⇒ Visualization of missing data
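To make the three mechanisms concrete, here is a minimal base R simulation built on the Income - Age example. The data-generating model, the sample size and the missingness probabilities are all illustrative assumptions, not taken from the slides; the sketch only shows how the mean of the observed incomes behaves under each mechanism.

```r
set.seed(1)
n <- 1000
age <- round(runif(n, 20, 70))
# Hypothetical relationship: income increases with age, plus noise
income <- 1000 + 40 * age + rnorm(n, sd = 300)

# MCAR: every income has the same probability (0.3) of being missing
mcar <- runif(n) < 0.3
# MAR: the probability of a missing income depends on age (another variable)
mar <- runif(n) < plogis((age - 45) / 8)
# MNAR: the probability of a missing income depends on the income itself
mnar <- runif(n) < plogis((income - mean(income)) / 300)

# Mean income computed on the observed values only (i.e. after deletion)
obs_mean <- function(miss) mean(income[!miss])

round(c(full = mean(income),
        MCAR = obs_mean(mcar),
        MAR  = obs_mean(mar),
        MNAR = obs_mean(mnar)))
```

Under MCAR the observed mean stays close to the full-data mean, whereas under these MAR and MNAR settings deletion biases it downwards, because high incomes are more likely to be missing.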


Count missing values

> library(VIM)
> res [...]
> res[rev(order(res[,2])),]

Variables sorted by number of missings:
 Variable      Count
     Ne12 0.37500000
       T9 0.33035714
      T15 0.33035714
      Ne9 0.30357143
      T12 0.29464286
     Ne15 0.28571429
     Vx15 0.18750000
      Vx9 0.16071429
    maxO3 0.14285714
   maxO3v 0.10714286
     Vx12 0.08928571

          Combinations Count    Percent
 0:0:0:0:0:0:0:0:0:0:0    13 11.6071429
 0:1:1:1:0:0:0:0:0:0:0     7  6.2500000
 0:0:0:0:0:1:0:0:0:0:0     5  4.4642857
 0:1:0:0:0:0:0:0:0:0:0     4  3.5714286
 0:1:0:0:1:1:1:0:0:0:0     3  2.6785714
 0:0:1:0:0:0:0:0:0:0:0     3  2.6785714
 0:0:0:1:0:0:0:0:0:0:0     3  2.6785714
 0:0:0:0:1:1:1:0:0:0:0     3  2.6785714
 0:0:0:0:0:1:0:0:0:0:1     3  2.6785714
 0:1:1:1:1:0:0:0:0:0:0     2  1.7857143
 0:0:0:0:1:0:0:0:0:1:0     2  1.7857143
 0:0:0:0:0:0:1:1:0:0:0     2  1.7857143
 0:0:0:0:0:0:1:0:0:0:0     2  1.7857143
 .....................     .        ...
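The same two summaries (proportion of missing values per variable, and counts of each missingness combination) can be sketched in base R without VIM. The small `don` data frame below is an assumption for illustration: it reuses only the first six rows and three columns of the dataset shown earlier, with the ozone column named `maxO3` as in the console output.

```r
# First six rows of three variables from the dataset shown earlier
don <- data.frame(
  maxO3 = c(NA, 82, 92, 114, 94, 80),
  T9    = c(15.6, 17, NA, 16.2, 17.4, 17.7),
  T12   = c(18.5, 18.4, 17.6, NA, 20.5, NA)
)

# Proportion of missing values per variable, largest first
prop_missing <- sort(colMeans(is.na(don)), decreasing = TRUE)

# One pattern string per row: 1 = missing, 0 = observed (same coding as above)
patterns <- apply(is.na(don) * 1, 1, paste, collapse = ":")

# Number of rows sharing each combination, most frequent first
pattern_counts <- sort(table(patterns), decreasing = TRUE)

print(prop_missing)
print(pattern_counts)
```

On the full data set, `aggr` from VIM reports the same quantities together with the plots shown on the next slides.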


Pattern visualization

> library(VIM)
> aggr(don,only.miss=TRUE,sortVar=TRUE)

[Figure: aggr plot. Proportion of missings per variable (Ne12, T9, T15, Ne9, T12, Ne15, Vx15, Vx9, maxO3, maxO3v, Vx12) and combinations of missing values.]


Visualization

> library(VIM)
> matrixplot(don,sortby=2)
> marginplot(don[,c("T9","maxO3")])

[Figure: matrix plot of the data set (rows by Index; columns maxO3, T9, T12, T15, Ne9, Ne12, Ne15, Vx9, Vx12, Vx15, maxO3v) and margin plot of maxO3 against T9.]


Visualization with Multiple Correspondence Analysis

⇒ Create the missingness matrix

> mis.ind [...]
> library(FactoMineR)
> resMCA [...]
> plot(resMCA,invis="ind",title="MCA graph of the categories")
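As a hedged illustration of what a missingness matrix can look like, the base R sketch below recodes each cell as "m" (missing) or "o" (observed). That coding is an assumption inferred from category labels such as `LL_m` and `LL_o` on a later MCA graph, and the toy `don` data frame reuses a few values from the dataset shown earlier.

```r
# Toy data: four rows of three variables from the dataset shown earlier
don <- data.frame(
  maxO3 = c(NA, 82, 92, 114),
  T9    = c(15.6, 17, NA, 16.2),
  T12   = c(18.5, 18.4, 17.6, NA)
)

# Start from a matrix filled with "o" (observed) ...
mis.ind <- matrix("o", nrow = nrow(don), ncol = ncol(don),
                  dimnames = list(rownames(don), colnames(don)))
# ... and flag the missing cells with "m"
mis.ind[is.na(don)] <- "m"

# MCA expects categorical variables, so store the columns as factors
mis.ind <- as.data.frame(mis.ind, stringsAsFactors = TRUE)
print(mis.ind)

# With FactoMineR installed, an MCA of this table could then be run, e.g.
#   resMCA <- FactoMineR::MCA(mis.ind, graph = FALSE)
# (not executed here, so that the sketch stays in base R).
```

Categories that lie close together on the resulting MCA graph correspond to variables that tend to be missing (or observed) on the same rows.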


Recommended approaches

⇒ Modify the method, the estimation process to deal with missing values

⇒ Imputation (multiple imputation) to get a completed data set on which you can perform any statistical method


Expectation - Maximization (Dempster et al., 1977)

Needs a modification of the estimation process (not always easy!)

Rationale: to get ML estimates on the observed values, maximize $L_{obs}$ through the maximization of $L_{comp}$ of $X = (X_{obs}, X_{miss})$. Augment the data to simplify the problem.

E step (conditional expectation):
$Q(\theta, \theta^{\ell}) = \int \ln(f(X|\theta)) \, f(X_{miss}|X_{obs}, \theta^{\ell}) \, dX_{miss}$

M step (maximization):
$\theta^{\ell+1} = \arg\max_{\theta} Q(\theta, \theta^{\ell})$

Result: when $\theta^{\ell+1}$ maximizes $Q(\theta, \theta^{\ell})$, then $L(X_{obs}, \theta^{\ell+1}) \geq L(X_{obs}, \theta^{\ell})$
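The E and M steps can be sketched in base R for one specific model, a multivariate normal with mean µ and covariance Σ. This is a hand-written illustration under that assumption, not the implementation of any package; the function name `em_mvnorm` and its arguments are hypothetical.

```r
em_mvnorm <- function(X, n_iter = 100, tol = 1e-8) {
  X <- as.matrix(X)
  n <- nrow(X); p <- ncol(X)
  # Starting values: available-case means and variances
  mu <- colMeans(X, na.rm = TRUE)
  Sigma <- diag(apply(X, 2, var, na.rm = TRUE), p)
  for (it in seq_len(n_iter)) {
    S1 <- numeric(p)         # running sum of E[x_i | observed part]
    S2 <- matrix(0, p, p)    # running sum of E[x_i x_i' | observed part]
    for (i in seq_len(n)) {
      xi <- X[i, ]
      m <- is.na(xi); o <- !m
      C <- matrix(0, p, p)   # conditional covariance of the missing part
      if (any(m)) {
        if (any(o)) {
          # E step: regression of the missing coordinates on the observed ones
          B <- Sigma[m, o, drop = FALSE] %*% solve(Sigma[o, o, drop = FALSE])
          xi[m] <- mu[m] + B %*% (xi[o] - mu[o])
          C[m, m] <- Sigma[m, m, drop = FALSE] - B %*% Sigma[o, m, drop = FALSE]
        } else {
          xi[m] <- mu[m]
          C[m, m] <- Sigma[m, m]
        }
      }
      S1 <- S1 + xi
      S2 <- S2 + tcrossprod(xi) + C
    }
    # M step: update the parameters from the expected sufficient statistics
    mu_new <- S1 / n
    Sigma_new <- S2 / n - tcrossprod(mu_new)
    converged <- max(abs(mu_new - mu), abs(Sigma_new - Sigma)) < tol
    mu <- mu_new; Sigma <- Sigma_new
    if (converged) break
  }
  list(mu = mu, Sigma = Sigma)
}

# Usage on simulated data with 30% of x2 missing
set.seed(123)
z <- rnorm(200)
X <- cbind(x1 = z + rnorm(200, sd = 0.5), x2 = z + rnorm(200, sd = 0.5))
X[sample(200, 60), "x2"] <- NA
fit <- em_mvnorm(X)
print(fit$mu)
print(fit$Sigma)
```

On complete data the loop reduces to the usual ML estimates (sample mean and covariance with divisor n), which is a convenient sanity check.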


Maximum likelihood approach

Hypothesis: $x_{i.} \sim \mathcal{N}(\mu, \Sigma)$

⇒ Point estimates with EM:

> library(norm)
> pre [...]
> res.cm [...]

[Figure: plot with x-axis "nb dim" (0, 1, 2, 3).]


Imputation with PCA in practice

⇒ Step 2: Imputation of the missing values

> res.comp [...]
> res.comp$completeObs[1:3,]
     maxO3    T9   T12   T15 Ne9 Ne12 Ne15   Vx9  Vx12  Vx15 maxO3v
0601    87 15.60 18.50 20.47   4 4.00 8.00  0.69 -1.71 -0.69     84
0602    82 18.51 20.88 21.81   5 5.00 7.00 -4.33 -4.00 -3.00     87
0603    92 15.30 17.60 19.50   2 3.98 3.81  2.95  1.97  0.52     82
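To give an idea of how a PCA-based imputation can produce such a completed table, here is a minimal base R sketch of a plain (unregularized) iterative PCA: fill the missing cells with column means, fit a rank-`ncp` PCA, replace the missing cells by their fitted values, and repeat. This is an assumption about the general principle, not the algorithm actually run on the slide; `impute_pca` and its arguments are hypothetical names.

```r
impute_pca <- function(X, ncp = 2, n_iter = 500, tol = 1e-6) {
  X <- as.matrix(X)
  miss <- is.na(X)
  # Initialization: replace each missing cell by its column mean
  Xhat <- X
  col_means <- colMeans(X, na.rm = TRUE)
  Xhat[miss] <- col_means[col(X)[miss]]
  for (it in seq_len(n_iter)) {
    # PCA of the current completed table: center, then truncated SVD
    mu <- colMeans(Xhat)
    Xc <- sweep(Xhat, 2, mu)
    s <- svd(Xc, nu = ncp, nv = ncp)
    fitted <- s$u %*% diag(s$d[seq_len(ncp)], ncp) %*% t(s$v)
    fitted <- sweep(fitted, 2, mu, "+")
    # Update only the missing cells; observed cells are never modified
    delta <- sum((Xhat[miss] - fitted[miss])^2)
    Xhat[miss] <- fitted[miss]
    if (delta < tol) break
  }
  Xhat
}

# Usage on simulated, strongly correlated variables with three missing cells
set.seed(42)
scores <- rnorm(30)
X <- cbind(a = scores, b = 2 * scores, c = -scores) + rnorm(90, sd = 0.1)
X_miss <- X
X_miss[c(2, 15), "b"] <- NA
X_miss[7, "c"] <- NA
X_imp <- impute_pca(X_miss, ncp = 1)
print(round(cbind(true = X[c(2, 15), "b"], imputed = X_imp[c(2, 15), "b"]), 2))
```

The number of dimensions `ncp` has to be chosen beforehand; the sketch leaves the variables unscaled and applies no regularization, both of which matter on real data.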


Cherry on the cake: PCA on incomplete data!

⇒ visualization of the incomplete data: a crucial step

[Figure: Individuals factor map (PCA), individuals grouped as East, North, West and South, and Variables factor map (PCA) with T9, T12, T15, maxO3, maxO3v, Ne9, Ne12, Ne15, Vx9, Vx12 and Vx15; axis labels Dim 1 (57.47%), Dim 2 (21.34%), Dim 1 (55.85%), Dim 2 (21.73%).]

> imp [...]

> dim(na.omit(don)) ## Delete species with missing values
[1] 72 6 ## only 72 remaining species!
> library(VIM)
> aggr(don,numbers=TRUE,sortVar=TRUE)

[Figure: aggr plot for LMA, Amass, Nmass, LL, Pmass and Rmass: proportion of missings per variable, and combinations of missing values with their proportions.]


An ecological data set

[Figure: MCA graph of the categories (LMA_m, LMA_o, Amass_m, Amass_o, Nmass_m, Nmass_o, LL_m, LL_o, Pmass_m, Pmass_o, Rmass_m, Rmass_o) on Dim 1 (33.67%) and Dim 2 (21.07%).]

> mis.ind [...]

[Figure: two factor maps with axis Dim 1 (91.18%).]

> library(missMDA)
> nb [...]
> res.mice [...]
> library(missMDA)
> res.MIPCA [...]
> res.MIPCA$resMI


Multiple imputation in practice

⇒ Step 2: visualization

> library(Amelia)
> res.amelia [...]

[Figure: "Observed and Imputed values of T12" (relative density of the observed values and of the mean imputations; T12, fraction missing: 0.295) and "Observed versus Imputed Values of maxO3" (imputed values against observed values, coded by fraction missing: 0-.2, .2-.4, .4-.6, .6-.8, .8-1).]

[Figure: Individuals factor map (PCA), individuals grouped as East, North, West and South, and variables factor map with T12, maxO3v, Ne9, Ne12 and Ne15; axis labels Dim 1 (55.85%), Dim 2 (21.34%), Dim 2 (21.73%).]

> imp [...]

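$\hat{\beta} = \frac{1}{M} \sum_{m=1}^{M} \hat{\beta}_m$

$\widehat{Var}(\hat{\beta}) = \frac{1}{M} \sum_{m=1}^{M} \widehat{Var}(\hat{\beta}_m) + \left(1 + \frac{1}{M}\right) \frac{1}{M-1} \sum_{m=1}^{M} \left(\hat{\beta}_m - \hat{\beta}\right)^2$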

> library(mice)
> imp.mice [...]
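The pooling of M estimates, averaging them and adding the within-imputation and between-imputation variances, can be written in a few lines of base R. The function name `pool_rubin` and the toy numbers are assumptions for illustration; in practice the estimates and their variances would come from fitting the same model on each of the M imputed data sets.

```r
pool_rubin <- function(estimates, variances) {
  M <- length(estimates)
  stopifnot(M > 1, length(variances) == M)
  beta_bar <- mean(estimates)     # pooled point estimate
  within   <- mean(variances)     # average within-imputation variance
  between  <- var(estimates)      # 1/(M-1) * sum((beta_m - beta_bar)^2)
  total    <- within + (1 + 1 / M) * between
  list(estimate = beta_bar, within = within,
       between = between, total = total)
}

# Toy example: M = 3 imputed data sets (made-up numbers)
res <- pool_rubin(estimates = c(1, 2, 3), variances = c(0.5, 0.5, 0.5))
print(res)
```

The between-imputation term is what single imputation ignores: it carries the extra uncertainty due to the missing values.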