Introduction
SI for continuous var.
SI for categorical var.
SI for mixed var.
Multiple imputation
Handling missing values with a special focus on the use of principal components methods François Husson & Julie Josse Applied mathematics department, Agrocampus Ouest, Rennes, France
Research activities
[Figure: three factor maps. Variables factor map (PCA), Dim 1 (71.34%) vs Dim 2 (17.16%), with variables such as Typicity, Attack.intensity, Acid, Sweet, Bitter, Color.intensity, Odor.intensity, Pulp. Individuals factor map (PCA), Dim 1 (71.34%). MCA map of categories such as tea shop, unpackaged, green, Earl Grey, tea bag, chain store, Dim 1 (9.885%) vs Dim 2 (8.103%).]
Research activities
Groups of continuous/qualitative variables
Multiple Factor Analysis
[Figure: Multiple Factor Analysis output. Correlation circle for the CGH and expr groups of variables, and individual factor maps with groups labelled A, GBM, O, OA. Axes: Dim 1 (20.99%) and Dim 2 (13.51%).]
Research activities
• Exploratory multivariate data analysis (principal components methods to visualize data)
• Missing values
• Fields of application: bio-sciences, sensory analysis
• Books (Exploratory multivariate analysis with R, R for Statistics, and 3 books in French)
• R packages (FactoMineR, missMDA, SensoMineR)
• A MOOC on exploratory multivariate data analysis
Outline
1 Introduction
2 Single imputation for continuous variables
3 Single imputation for categorical variables
4 Single imputation for mixed variables
5 Multiple imputation
Missing values
“The best thing to do with missing values is not to have any”
Gertrude Mary Cox
Missing values are ubiquitous:
• no answer in a questionnaire
• data that are lost or destroyed
• machines that fail
• plants damaged
• ...
Still an issue in the big data era
A real dataset
        O3    T9   T12   T15  Ne9  Ne12  Ne15      Vx9     Vx12     Vx15  O3v
0601    NA  15.6  18.5  18.4    4     4     8       NA  -1.7101  -0.6946   84
0602    82    17  18.4  17.7    5     5     7       NA       NA       NA   87
0603    92    NA  17.6  19.5    2     5     4   2.9544   1.8794   0.5209   82
0604   114  16.2    NA    NA    1     1     0       NA       NA       NA   92
0605    94  17.4  20.5    NA    8     8     7     -0.5       NA  -4.3301  114
0606    80  17.7    NA  18.3   NA    NA    NA  -5.6382       -5       -6   94
0607    NA  16.8  15.6  14.9    7     8     8  -4.3301  -1.8794  -3.7588   80
0610    79  14.9  17.5  18.9    5     5     4        0  -1.0419  -1.3892   NA
0611   101    NA  19.6  21.4    2     4     4   -0.766       NA  -2.2981   79
0612    NA  18.3  21.9  22.9    5     6     8   1.2856  -2.2981  -3.9392  101
0613   101  17.3  19.3  20.2   NA    NA    NA     -1.5     -1.5  -0.8682   NA
. . .
0919    NA  14.8  16.3  15.9    7     7     7  -4.3301  -6.0622  -5.1962   42
0920    71  15.5    18  17.4    7     7     6  -3.9392  -3.0642        0   NA
0921    96    NA    NA    NA    3     3     3       NA       NA       NA   71
0922    98    NA    NA    NA    2     2     2        4        5   4.3301   96
0923    92  14.7  17.6  18.2    1     4     6   5.1962   5.1423      3.5   98
0924    NA  13.3  17.7  17.7   NA    NA    NA  -0.9397   -0.766     -0.5   92
0925    84  13.3  17.7  17.8    3     5     6        0       -1  -1.2856   NA
0927    NA  16.2  20.8  22.1    6     5     5  -0.6946       -2  -1.3681   71
0928    99  16.9    23  22.6   NA     4     7      1.5   0.8682   0.8682   NA
0929    NA  16.9  19.8  22.1    6     5     3       -4  -3.7588       -4   99
0930    70  15.7  18.6  20.7   NA    NA    NA        0  -1.0419       -4   NA
Some references
• Schafer (1997) [Joseph L. Schafer]
• Little & Rubin (1987, 2002) [Roderick Little, Donald Rubin]
Suggested reading: chap. 25 of Gelman & Hill (2006) [Andrew Gelman, Jennifer L. Hill]
The missing values problem
A very simple way: deletion (the default in R's lm function)
Dealing with missing values depends on:
• the pattern of missing values
• the mechanism leading to missing values
  • MCAR: the probability of being missing does not depend on any values
  • MAR: the probability of being missing may depend on the values of other variables
  • MNAR: the probability of being missing depends on the value itself
  (Ex: Income - Age)
⇒ Visualization of missing data
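The three mechanisms can be illustrated with a small simulation. This is only a sketch built on the Income - Age example: the simulated values, the variable names and the logistic missingness probabilities are assumptions made for illustration, not taken from the slides.

```r
# Simulated Income - Age data (hypothetical values, for illustration only)
set.seed(42)
n      <- 1000
age    <- round(runif(n, 20, 65))
income <- 1000 + 40 * age + rnorm(n, sd = 300)

# MCAR: every income has the same 30% chance of being missing
miss_mcar <- runif(n) < 0.3
# MAR: the chance of a missing income grows with age (another, observed variable)
miss_mar  <- runif(n) < plogis((age - 42) / 8)
# MNAR: the chance of a missing income grows with the income itself
miss_mnar <- runif(n) < plogis((income - mean(income)) / 400)

# Mean of the incomes that remain observed under each mechanism
obs_means <- sapply(
  list(MCAR = miss_mcar, MAR = miss_mar, MNAR = miss_mnar),
  function(miss) mean(income[!miss])
)
full_mean <- mean(income)
print(round(c(full = full_mean, obs_means)))
```

Under MCAR the mean of the observed incomes stays close to the full-data mean, whereas under MAR and MNAR simple deletion shifts it. Under MAR the shift can be corrected using age; under MNAR it cannot be corrected from the observed data alone.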
Count missing values
> library(VIM)
> res <- ...
> res[rev(order(res[,2])),]

 Variables sorted by number of missings:
 Variable      Count
     Ne12 0.37500000
       T9 0.33035714
      T15 0.33035714
      Ne9 0.30357143
      T12 0.29464286
     Ne15 0.28571429
     Vx15 0.18750000
      Vx9 0.16071429
    maxO3 0.14285714
   maxO3v 0.10714286
     Vx12 0.08928571

          Combinations Count    Percent
 0:0:0:0:0:0:0:0:0:0:0    13 11.6071429
 0:1:1:1:0:0:0:0:0:0:0     7  6.2500000
 0:0:0:0:0:1:0:0:0:0:0     5  4.4642857
 0:1:0:0:0:0:0:0:0:0:0     4  3.5714286
 0:1:0:0:1:1:1:0:0:0:0     3  2.6785714
 0:0:1:0:0:0:0:0:0:0:0     3  2.6785714
 0:0:0:1:0:0:0:0:0:0:0     3  2.6785714
 0:0:0:0:1:1:1:0:0:0:0     3  2.6785714
 0:0:0:0:0:1:0:0:0:0:1     3  2.6785714
 0:1:1:1:1:0:0:0:0:0:0     2  1.7857143
 0:0:0:0:1:0:0:0:0:1:0     2  1.7857143
 0:0:0:0:0:0:1:1:0:0:0     2  1.7857143
 0:0:0:0:0:0:1:0:0:0:0     2  1.7857143
 .....................     .        ...
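The same two summaries (proportion of missing values per variable, and counts of missingness patterns) can be sketched in base R. The toy data frame `don_toy` is hypothetical and only stands in for the `don` data set of the slides; this is not how VIM computes its output, just an equivalent calculation on a small example.

```r
# Toy data (hypothetical values) standing in for the `don` data set
don_toy <- data.frame(
  maxO3 = c(87, 82, 92, 114, NA, 80),
  T9    = c(15.6, 17, NA, 16.2, NA, 17.7),
  Ne9   = c(4, NA, 2, NA, NA, 7)
)

# Proportion of missing values per variable, largest first
prop_missing <- sort(colMeans(is.na(don_toy)), decreasing = TRUE)

# One 0/1 pattern string per row (1 = missing), in the column order of don_toy
patterns <- apply(is.na(don_toy) * 1, 1, paste, collapse = ":")

# Number of rows sharing each pattern, most frequent first
pattern_counts <- sort(table(patterns), decreasing = TRUE)

print(prop_missing)
print(pattern_counts)
```

A pattern such as `0:0:1` reads like the Combinations column above: the first two variables are observed and the third is missing.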
Pattern visualization
> library(VIM)
> aggr(don,only.miss=TRUE,sortVar=TRUE)
[Figure: aggr output. Bar plot of the proportion of missings per variable (Ne12, T9, T15, Ne9, T12, Ne15, Vx15, Vx9, maxO3, maxO3v, Vx12) and plot of the combinations of missing values across the same variables.]
Visualization
> library(VIM)
> matrixplot(don,sortby=2)
> marginplot(don[,c("T9","maxO3")])
[Figure: matrix plot of the data set (observations by Index, one column per variable) and margin plot of maxO3 against T9.]
Visualization with Multiple Correspondence Analysis
⇒ Create the missingness matrix
> mis.ind <- ...
> library(FactoMineR)
> resMCA <- ...
> plot(resMCA,invis="ind",title="MCA graph of the categories")
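One way such a missingness matrix could be built is sketched below in base R. The coding "m" (missing) / "o" (observed) and the toy data frame `don_toy` are assumptions made for illustration; the construction actually used on the slide did not survive.

```r
# Toy data (hypothetical values) standing in for the `don` data set
don_toy <- data.frame(
  maxO3 = c(87, 82, NA, 114, 94),
  T9    = c(15.6, NA, 18.5, NA, 17.4),
  Ne9   = c(4, 5, NA, 1, 8)
)

# Start from a matrix filled with "o" (observed), same shape and names as the data
mis.ind <- matrix("o", nrow = nrow(don_toy), ncol = ncol(don_toy),
                  dimnames = dimnames(don_toy))

# Flag the missing cells with "m"
mis.ind[is.na(don_toy)] <- "m"

# Categorical version: one two-level factor per variable
mis.df <- as.data.frame(mis.ind, stringsAsFactors = TRUE)
print(mis.df)
```

Each variable becomes a categorical variable with at most two categories, so a table like `mis.df` can be analysed with Multiple Correspondence Analysis: categories that lie close together on the map correspond to variables that tend to be missing (or observed) on the same rows.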
Recommended approaches
⇒ Modify the method, the estimation process to deal with missing values
⇒ Imputation (multiple imputation) to get a completed data set on which you can perform any statistical method
Expectation - Maximization (Dempster et al., 1977)
Requires modifying the estimation process (not always easy!)
Rationale: to get ML estimates from the observed values, maximize $L_{obs}$ through the maximization of $L_{comp}$ of $X = (X_{obs}, X_{miss})$. Augment the data to simplify the problem.
E step (conditional expectation):
$$Q(\theta, \theta^{\ell}) = \int \ln(f(X|\theta)) \, f(X_{miss}|X_{obs}, \theta^{\ell}) \, dX_{miss}$$
M step (maximization):
$$\theta^{\ell+1} = \arg\max_{\theta} Q(\theta, \theta^{\ell})$$
Result: when $\theta^{\ell+1}$ maximizes $Q(\theta, \theta^{\ell})$, then $L(X_{obs}, \theta^{\ell+1}) \geq L(X_{obs}, \theta^{\ell})$
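The E and M steps can be sketched in base R for a simple case: a bivariate normal sample in which `x1` is complete and `x2` has missing values. The simulated data, the variable names and the stopping rule are assumptions for illustration; this is not the implementation of any particular package.

```r
set.seed(123)
n  <- 300
x1 <- rnorm(n, mean = 10, sd = 2)
x2 <- 5 + 0.8 * x1 + rnorm(n, sd = 1)
x2[sample(n, 90)] <- NA               # 30% of x2 removed completely at random
obs <- !is.na(x2)

# Observed-data log-likelihood: x1 for every row, x2 given x1 where x2 is observed
loglik_obs <- function(mu, S) {
  beta <- S[1, 2] / S[1, 1]
  cvar <- S[2, 2] - S[1, 2]^2 / S[1, 1]
  sum(dnorm(x1, mu[1], sqrt(S[1, 1]), log = TRUE)) +
    sum(dnorm(x2[obs], mu[2] + beta * (x1[obs] - mu[1]), sqrt(cvar), log = TRUE))
}

# Starting values computed from the available cases
mu <- c(mean(x1), mean(x2, na.rm = TRUE))
S  <- diag(c(var(x1), var(x2, na.rm = TRUE)))
ll <- loglik_obs(mu, S)

for (iter in 1:100) {
  # E step: conditional expectations of x2 and x2^2 given x1, at the current parameters
  beta  <- S[1, 2] / S[1, 1]
  cvar  <- S[2, 2] - S[1, 2]^2 / S[1, 1]
  e_x2  <- ifelse(obs, x2, mu[2] + beta * (x1 - mu[1]))
  e_x22 <- e_x2^2 + ifelse(obs, 0, cvar)

  # M step: maximum likelihood estimates from the expected sufficient statistics
  mu  <- c(mean(x1), mean(e_x2))
  s11 <- mean(x1^2) - mu[1]^2
  s12 <- mean(x1 * e_x2) - mu[1] * mu[2]
  s22 <- mean(e_x22) - mu[2]^2
  S   <- matrix(c(s11, s12, s12, s22), 2, 2)

  ll <- c(ll, loglik_obs(mu, S))
  if (diff(tail(ll, 2)) < 1e-10) break   # stop once the log-likelihood stalls
}

print(mu)
print(S)
```

The vector `ll` stores the observed-data log-likelihood after each iteration, so the monotonicity stated in the Result line can be checked with `diff(ll)`. Note that the E step adds the conditional variance `cvar` to the expected squares: plugging in the conditional means alone would underestimate the variance of `x2`.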
Maximum likelihood approach
Hypothesis: $x_{i.} \sim \mathcal{N}(\mu, \Sigma)$
⇒ Point estimates with EM:
> library(norm)
> pre <- ...
> res.cm <- ...
[Figure: plot against the number of dimensions (x-axis: nb dim, from 0 to 3).]
Imputation with PCA in practice
⇒ Step 2: Imputation of the missing values
> res.comp <- ...
> res.comp$completeObs[1:3,]
     maxO3    T9   T12   T15 Ne9 Ne12 Ne15   Vx9  Vx12  Vx15 maxO3v
0601    87 15.60 18.50 20.47   4 4.00 8.00  0.69 -1.71 -0.69     84
0602    82 18.51 20.88 21.81   5 5.00 7.00 -4.33 -4.00 -3.00     87
0603    92 15.30 17.60 19.50   2 3.98 3.81  2.95  1.97  0.52     82
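The idea behind imputing with PCA can be sketched in base R as an iterative scheme: fill the missing cells, fit a low-rank PCA approximation, replace the filled cells with the fitted values, and repeat. The function `impute_pca_sketch` and the simulated data are hypothetical; this simplified, unscaled and non-regularized version is an assumption made for illustration and is not the algorithm implemented in missMDA.

```r
# Simplified iterative PCA imputation (illustrative sketch, not the missMDA code)
impute_pca_sketch <- function(X, ncp = 2, maxiter = 500, tol = 1e-8) {
  X    <- as.matrix(X)
  miss <- is.na(X)
  Xhat <- X
  # Initialisation: replace each missing cell by its column mean
  Xhat[miss] <- colMeans(X, na.rm = TRUE)[col(X)[miss]]
  for (iter in seq_len(maxiter)) {
    m   <- colMeans(Xhat)
    Xc  <- sweep(Xhat, 2, m)                       # centre the completed data
    s   <- svd(Xc, nu = ncp, nv = ncp)
    fit <- s$u %*% diag(s$d[seq_len(ncp)], ncp) %*% t(s$v)
    fit <- sweep(fit, 2, m, "+")                   # rank-ncp reconstruction
    new <- Xhat
    new[miss] <- fit[miss]                         # only missing cells are updated
    change <- sum((new - Xhat)^2)
    Xhat <- new
    if (change < tol) break
  }
  Xhat
}

# Simulated data with a two-dimensional structure plus a little noise
set.seed(1)
n <- 100; p <- 6
X_full <- matrix(rnorm(n * 2), n, 2) %*% matrix(rnorm(2 * p), 2, p) +
  matrix(rnorm(n * p, sd = 0.1), n, p)
X_miss <- X_full
X_miss[sample(n * p, 60)] <- NA                    # 10% of the cells removed

X_imp  <- impute_pca_sketch(X_miss, ncp = 2)
X_mean <- X_miss
X_mean[is.na(X_miss)] <- colMeans(X_miss, na.rm = TRUE)[col(X_miss)[is.na(X_miss)]]

# Mean squared imputation error on the removed cells
err_pca  <- mean((X_imp[is.na(X_miss)]  - X_full[is.na(X_miss)])^2)
err_mean <- mean((X_mean[is.na(X_miss)] - X_full[is.na(X_miss)])^2)
print(c(pca = err_pca, mean = err_mean))
```

Because the fitted values use the relationships between variables, the imputation error should be well below that of column-mean imputation when the data have a strong low-dimensional structure, as in this simulation. With noisier data or many missing values, the non-regularized version sketched here can overfit.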
Cherry on the cake: PCA on incomplete data!
⇒ visualization of the incomplete data: a crucial step
[Figure: Individuals factor map (PCA), with individuals labelled by the categories East, North, West, South, and Variables factor map (PCA) showing T9, T12, T15, maxO3, maxO3v, Ne9, Ne12, Ne15, Vx9, Vx12, Vx15. Axis labels: Dim 1 (57.47%), Dim 2 (21.34%), Dim 1 (55.85%), Dim 2 (21.73%).]
> imp <- ...
An ecological data set
> dim(na.omit(don)) ## Delete species with missing values
[1] 72 6 ## only 72 remaining species!
> library(VIM)
> aggr(don,numbers=TRUE,sortVar=TRUE)
[Figure: aggr output for the variables LMA, Amass, Nmass, LL, Pmass, Rmass: proportion of missings per variable and combinations of missing values, annotated with their proportions (from 0.0004 to 0.2326).]
An ecological data set
[Figure: MCA graph of the categories, Dim 1 (33.67%) vs Dim 2 (21.07%), showing the categories LMA_m, LMA_o, Nmass_m, Nmass_o, Pmass_m, Pmass_o, Amass_m, Amass_o, Rmass_m, Rmass_o, LL_m, LL_o.]
> mis.ind <- ...
[Figure: two plots sharing the axis label Dim 1 (91.18%).]
> library(missMDA)
> nb <- ...
> res.mice <- ...
> res.MIPCA <- ...
> res.MIPCA$resMI
Multiple imputation in practice
⇒ Step 2: visualization
> library(Amelia)
> res.amelia <- ...
[Figure: two diagnostic plots. "Observed and Imputed values of T12": relative density of the mean imputations and of the observed values (fraction missing: 0.295). "Observed versus Imputed Values of maxO3": imputed values against observed values, with legend classes 0-.2, .2-.4, .4-.6, .6-.8, .8-1.]
[Figure: Individuals factor map (PCA), with individuals labelled by the categories East, North, West, South, and a variables map showing T12, maxO3v, Ne9, Ne12, Ne15. Axis labels: Dim 1 (55.85%), Dim 2 (21.34%), Dim 2 (21.73%).]
> imp <- ...
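$$\hat{\beta} = \frac{1}{M} \sum_{m=1}^{M} \hat{\beta}_m$$
$$T = \frac{1}{M} \sum_{m} \widehat{Var}\left(\hat{\beta}_m\right) + \left(1 + \frac{1}{M}\right) \frac{1}{M-1} \sum_{m} \left(\hat{\beta}_m - \hat{\beta}\right)^2$$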
> library(mice)
> imp.mice <- ...
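The pooling formulas of this slide can be sketched in a few lines of base R. The function name `pool_rubin` and the five estimates and variances below are hypothetical, chosen only to make the arithmetic easy to follow; in practice they would come from fitting the same model on each of the M imputed data sets.

```r
# Pool M estimates and their variances (sketch of the formulas on this slide)
pool_rubin <- function(est, var_within) {
  M        <- length(est)
  beta_hat <- mean(est)                    # pooled estimate
  within   <- mean(var_within)             # average within-imputation variance
  between  <- var(est)                     # 1/(M-1) * sum((est - beta_hat)^2)
  total    <- within + (1 + 1 / M) * between
  list(estimate = beta_hat, within = within, between = between, total = total)
}

# Hypothetical results from M = 5 imputed data sets
est        <- c(1.0, 1.2, 0.8, 1.1, 0.9)
var_within <- c(0.04, 0.05, 0.04, 0.05, 0.04)

pooled <- pool_rubin(est, var_within)
print(pooled)
```

The between-imputation term is what single imputation ignores: it measures how much the estimate changes from one imputed data set to another, and it is inflated by the factor (1 + 1/M) because M is finite.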