A method for model selection
March 29, 2011
Aurélie Boisbunon*, Stéphane Canu†, Dominique Fourdrinier*, William E. Strawderman‡, Martin T. Wells††
*Université de Rouen, †INSA de Rouen, ‡Rutgers University, ††Cornell University
Outline
1 Framework and Notations: Model Selection · Variable selection · Our idea
2 Choices and assumptions: Spherically symmetric distributions · Canonical form · Number of comparisons
3 Loss estimation: How does it work? · Application · Results
4 Conclusion
1 Framework and Notations
Problem
Data: y ∈ R^n, X = (x1, . . . , xp) with xj ∈ R^n. Both n and p can be large (high-dimensional problem).
Aim: find a regression function r linking X with y:
y = r(X) + ε, where ε ∈ R^n.
Model Selection
Several models are possible:
- Which one is the "best" model?
- How do we choose it?
- Can we find an automatic way to choose for a new dataset?
Variable selection
- Special case of model selection: the linear model family.
  Linear model: y = Xβ + ε, β ∈ R^p.
- Search for the "best" submodel, i.e. the "best" subset I* ⊂ {1, . . . , p} of variables.
- Assumption: βj ≠ 0 for j ∈ I*, and βj = 0 for j ∉ I*.
Literature
Criteria for variable selection:
- empirical criteria
  - Cross Validation (CV) → huge cost
- analytical criteria
  - Akaike Information Criterion (AIC) / Mallows' Cp → over-estimation
  - Bayes Information Criterion or Schwarz Criterion (BIC) → under-estimation
  - ...
Our idea
Let k = #I be the number of selected variables. Our criterion for estimating k̂* is based on:
- a broad distributional assumption
- loss estimation
2 Choices and assumptions
Spherically symmetric distributions
General assumption on error distribution: ε ∼ N (0, σ 2 In )
Estimated density of ozone level 20 15 Density
I
Figure: Daily ozone level (ppm) in California, 2008
10 5
I
I
Real life: most of the time non Gaussian In all cases: E[ε] = 0.
0 ï0.05
0
0.05 0.1 Ozone level (ppm) Repartition Estimated density Gaussian density
0.15
http://www.stat.ucla.edu/~nchristo/ statistics_c173_c273/ca_ozone.txt
10 / 45
Spherically symmetric distributions
Assumption 1 (distribution of ε): ε follows a spherically symmetric distribution around 0. We write ε ∼ s.s.(0).
Properties:
- Generalization of N(0, σ²I_n) (εi i.i.d.):
  - symmetry about 0
  - ε ∼ s.s.(0) ⇒ Y ∼ s.s.(Xβ)
- Decomposition ε = RU with:
  - R = ||ε||
  - U ∼ U_{S1} (uniform on the unit sphere)
- Density:
  - f(ε) = g(||ε||)
  - f not necessarily defined
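The decomposition ε = RU gives a direct way to sample from (and reason about) spherically symmetric laws. A minimal NumPy sketch; the function name and the particular radius laws are our illustrative choices, not from the slides:

```python
import numpy as np

def sample_spherical(n, radius, rng):
    """Draw eps = R * U, with U uniform on the unit sphere of R^n
    and R = ||eps|| drawn from an arbitrary positive law."""
    g = rng.standard_normal(n)
    u = g / np.linalg.norm(g)      # U ~ uniform on the unit sphere
    return radius() * u

rng = np.random.default_rng(0)
n = 300
# R^2 ~ chi^2_n recovers the Gaussian N(0, I_n):
eps_gauss = sample_spherical(n, lambda: np.sqrt(rng.chisquare(n)), rng)
# Any other radius law gives a non-Gaussian spherical error,
# e.g. a heavier-tailed one:
eps_heavy = sample_spherical(n, lambda: np.sqrt(n) / rng.gamma(4.0, 0.25), rng)
```

Whatever the radius law, E[ε] = 0 and the direction of ε is isotropic, which is exactly the symmetry the method exploits.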
Spherically symmetric distributions
Examples:
- Gaussian distribution N(0, σ²I_n)
- Student t(ν), ν > 1
- hyperbolic secant distribution
- Kotz distribution
- logistic distribution
- (centered) Gaussian mixtures
- etc.
Spherically symmetric distributions
Why this assumption?
- adds dependency, but no correlation
- a larger assumption on the error distribution than N(0, σ²I_n) ⇒ robustness
- still an important constraint (symmetry) ⇒ allows easy computation
Canonical form
Projection of Y ∈ R^n by the orthogonal matrix G = (G1; G2):
Z ∈ R^p and U ∈ R^(n−p).
Canonical form
Y = Xβ + ε  →(G)→  (Z; U) = GXβ + Gε = (G1Xβ; G2Xβ) + η = (θ; 0) + η
where G = (G1; G2) is such that:
- G is (n, n), G orthogonal
- G1 is (p, n), with G1^T ∈ span(X)
- G2 is (n − p, n), with G2 ⊥ X
Example: X = QR gives G = Q^T.
Canonical form
Usual form: Y ∼ s.s.(Xβ); estimation of β.
Canonical form: (Z; U) ∼ s.s.((θ; 0)); estimation of θ = G1Xβ = Aβ.
Example: the least-squares estimate
ϕ0 = Aβ̂_LS = A(X^T X)^(−1) X^T Y
   = A[(G1^T A)^T G1^T A]^(−1) (G1^T A)^T Y
   = A[A^T G1 G1^T A]^(−1) A^T G1 Y
   = A A^(−1) (A^T)^(−1) A^T Z
   = Z
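The passage to the canonical form, and the fact that least squares reduces to ϕ0 = Z there, can be checked numerically. A sketch using the full-QR construction G = Q^T from the example above (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 5
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

# Full QR of X: G = Q^T is orthogonal and its first p rows span col(X).
Q, _ = np.linalg.qr(X, mode="complete")
G = Q.T
Z = G[:p] @ y          # Z in R^p: carries the signal theta = G1 X beta
U = G[p:] @ y          # U in R^(n-p): pure noise, usable for variance estimation

# In canonical coordinates the least-squares estimate of theta is Z itself:
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
theta_ls = G[:p] @ X @ beta_ls
print(np.allclose(theta_ls, Z))   # True
```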
Canonical form
Why use this form?
- Equivalent problems
- "Pure" form
- Existing work: estimation of the mean for s.s. distributions ⇒ link with linear models
- The additional data U allows good variance estimation ⇒ robustness
BUT: additional constraint (Assumption 2).

The lasso estimate in canonical form:
ϕj^lasso = (Zj − λ sign(Zj)) 1{|Zj|>λ}

Problem to solve: find the optimal value for the hyperparameter λ.
3 Loss estimation
The idea
- Basic idea: evaluate the quality of ϕ by a loss function L(ϕ, θ).
- Problem: θ is unknown → L(ϕ, θ) = ?
- Common solution: the empirical risk ||y − Xβ̂||²/n (mean), based on the data.
- Our proposition: a trade-off between the two.
Proposed criterion: loss estimation for the parameter θ, based only on the data.
Recipe
Step 1a: define an estimator ϕ* of θ.
Step 1b: define a loss function L(ϕ*, θ).
Step 2: define an estimator δ of the loss L(ϕ*, θ).
Step 3: find the minimum of δ: λ̂ = arg min_{λ∈R+} δ(λ).
Application
Step 1a: define an estimator of θ:
ϕlasso = Z + g(Z), where gj(Z) = −λ sign(Zj) 1{|Zj|>λ} − Zj 1{|Zj|≤λ}.
Step 1b: define a loss function for ϕlasso:
L(ϕ*, θ) = ||ϕlasso − θ||²
- common loss in regression
- easy computations
- X orthogonal ⇒ ||ϕ* − θ||² = ||β̂* − β||²
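Step 1a's estimator ϕlasso = Z + g(Z) is componentwise soft thresholding of Z at level λ. A one-function sketch (the function name is ours):

```python
import numpy as np

def phi_lasso(Z, lam):
    """Soft thresholding: (Z_j - lam*sign(Z_j)) 1{|Z_j| > lam}, i.e. Z + g(Z)."""
    return np.sign(Z) * np.maximum(np.abs(Z) - lam, 0.0)

Z = np.array([3.0, -0.5, 1.2, -2.0])
print(phi_lasso(Z, 1.0))   # components 2.0, 0.0, 0.2, -1.0
```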
Step 2: define a loss estimator for ϕlasso
Definition (unbiased estimator): Eθ[δ0] = Eθ[L(ϕlasso, θ)] = Rθ(ϕlasso).

Eθ ||ϕlasso − θ||² = Eθ ||Z + g(Z) − θ||²
                   = Eθ[ ||Z − θ||² + ||g(Z)||² + 2 (Z − θ)^T g(Z) ]
where
- Eθ ||Z − θ||² = Eθ[ (p/(n−p)) ||U||² ]
- Eθ[(Z − θ)^T g(Z)] = (1/(n−p)) Eθ[ ||U||² div g(Z) ] = (1/(n−p)) Eθ[ ||U||² (k − p) ]  (Stein's equality)
- k = #Î_lasso
Eθ ||ϕlasso − θ||² = Eθ[ (p/(n−p)) ||U||² + ||g(Z)||² + 2 ((k−p)/(n−p)) ||U||² ]
                   = Eθ[ ((2k−p)/(n−p)) ||U||² + Σ_{j=1}^p Zj² ∧ λ² ]
where gj(Z) = −λ sign(Zj) 1{|Zj|>λ} − Zj 1{|Zj|≤λ}.

Unbiased estimator of L(ϕlasso, θ):
δ0(λ) = ((2k−p)/(n−p)) ||U||² + Σ_{j=1}^p Zj² ∧ λ²
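The estimator δ0 depends on the data only through Z, U and the selected-set size k, so recipe Steps 2 and 3 take a few lines. Grid search over λ is our choice of minimizer, not prescribed by the slides:

```python
import numpy as np

def delta0(Z, U, lam):
    """Unbiased estimator of the loss ||phi_lasso - theta||^2:
    (2k - p)/(n - p) * ||U||^2 + sum_j min(Z_j^2, lam^2)."""
    p, n_minus_p = Z.size, U.size
    k = int(np.sum(np.abs(Z) > lam))   # number of selected variables
    return (2 * k - p) / n_minus_p * (U @ U) + np.sum(np.minimum(Z**2, lam**2))

# Step 3: minimize over a grid of hyperparameter values.
rng = np.random.default_rng(0)
Z, U = rng.standard_normal(250), rng.standard_normal(50)
grid = np.linspace(0.0, np.abs(Z).max(), 200)
lam_hat = grid[np.argmin([delta0(Z, U, lam) for lam in grid])]
```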
- Sometimes unbiased estimators are not satisfying (like β̂_LS).
- Correction with a small bias.
- Need for a criterion allowing comparison between δ0 and a corrective estimator δ:
  - Loss: L(δ) = (δ − L(ϕlasso, θ))²
  - Risk: R(δ) = Eθ[(δ − L(ϕlasso, θ))²]
  - Aim: R(δ) ≤ R(δ0)
Corrective estimator: δγ(λ) = δ0(λ) − ||U||⁴ γ(Z), where γ : R^p → R is a correction function.
- γ1(Z) = a1 / ||Z||² (James–Stein type) → constant w.r.t. λ
- γ2(Z) = a2 / ((k+1) Z²_(k) + Σ_{j=k+1}^p Z²_(j)) (proposed by W. E. Strawderman) → varies with λ

Corrective estimator:
δγ(λ) = δ0(λ) − ||U||⁴ a2 / ((k+1) Z²_(k) + Σ_{j=k+1}^p Z²_(j))
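A self-contained sketch of the corrective estimator with γ2. We read Z_(j) as the coordinates ordered by decreasing squared value, and the k = 0 fallback is our convention; a2 = −8 is the value used in the experiments:

```python
import numpy as np

def delta0(Z, U, lam):
    """Unbiased loss estimator (2k - p)/(n - p) ||U||^2 + sum_j min(Z_j^2, lam^2)."""
    p, n_minus_p = Z.size, U.size
    k = int(np.sum(np.abs(Z) > lam))
    return (2 * k - p) / n_minus_p * (U @ U) + np.sum(np.minimum(Z**2, lam**2))

def delta_gamma(Z, U, lam, a2=-8.0):
    """Corrective estimator: delta0 - ||U||^4 * a2 / ((k+1) Z_(k)^2 + sum_{j>k} Z_(j)^2)."""
    z2 = np.sort(Z**2)[::-1]               # ordered: Z_(1)^2 >= Z_(2)^2 >= ...
    k = int(np.sum(np.abs(Z) > lam))
    if k == 0:
        denom = z2.sum()                   # nothing selected (our convention)
    else:
        denom = (k + 1) * z2[k - 1] + z2[k:].sum()
    return delta0(Z, U, lam) - (U @ U) ** 2 * a2 / denom
```

Since a2 < 0, the correction term −||U||⁴ a2/γ-denominator is positive, so δγ sits above δ0 for every λ.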
Summary
- Estimator of θ: ϕlasso = Z + g(Z)
- Loss function for ϕlasso: L(ϕlasso, θ) = ||ϕlasso − θ||²
- Loss estimators:
  - Unbiased est.: δ0(λ) = ((2k−p)/(n−p)) ||U||² + Σ_{j=1}^p Zj² ∧ λ²
  - Corrective est.: δγ(λ) = δ0(λ) − ||U||⁴ a2 / ((k+1) Z²_(k) + Σ_{j=k+1}^p Z²_(j))
- Problem to solve: find the optimal value of the hyperparameter λ.
Results
Experimental protocol:
- n = 300, p = 250
- X = Q̃_{1,...,p}, where Q̃ is drawn from a QR decomposition of p Gaussian random vectors
- βj ∼ N(0, 2) for j = 1, . . . , k, with k ∈ {10, 20, . . . , 240} (very sparse to non-sparse problems)
- error laws: ε ∼ N(0, σ²I_n) (usual assumption), ε ∼ t(ν) with ν = 4 (our assumption), ε ∼ U[−σ,σ] (assumption violated); σ ∈ {0.5, 1, 3}
- r = 50 replicates of ε
- a2 = −8
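The protocol above can be sketched as follows (one replicate shown; we read βj ∼ N(0, 2) as variance 2, and the scaling of the Student errors by σ is left out):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k, sigma = 300, 250, 40, 1.0

# Orthonormal design: Q factor of the QR decomposition of a Gaussian matrix.
X, _ = np.linalg.qr(rng.standard_normal((n, p)))

beta = np.zeros(p)
beta[:k] = rng.normal(0.0, np.sqrt(2.0), k)    # k nonzero coefficients

# The three error laws of the protocol:
eps = {
    "gaussian": sigma * rng.standard_normal(n),   # usual assumption
    "student":  rng.standard_t(4, n),             # our (spherical) assumption
    "uniform":  rng.uniform(-sigma, sigma, n),    # assumption violated
}
y = {law: X @ beta + e for law, e in eps.items()}
```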
Results
Comparisons:
- AIC(λ) = ||y − Xβ̂lasso(λ)||² / σ̂² + 2k
- BIC(λ) = ||y − Xβ̂lasso(λ)||² / σ̂² + log(n) k
where σ̂² = ||y − Xβ̂_LS||²
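The two reference criteria can be coded directly. We read the slide's formulas as the residual sum of squares divided by σ̂² plus the penalty; that reading of the broken layout is an assumption:

```python
import numpy as np

def aic_bic(y, X, beta_hat, sigma2_hat):
    """AIC(lambda) = RSS/sigma2_hat + 2k and BIC(lambda) = RSS/sigma2_hat + log(n)*k,
    with k the number of nonzero coefficients of the lasso fit."""
    n = y.size
    rss = np.sum((y - X @ beta_hat) ** 2)
    k = int(np.count_nonzero(beta_hat))
    return rss / sigma2_hat + 2 * k, rss / sigma2_hat + np.log(n) * k
```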
Visualization: criteria
Figure: the criteria δ0 and δ_{a2}, the loss L(θ, ε), AIC and BIC (loss on a logarithmic scale) as functions of the number of selected variables, with the minimum of each criterion marked; ε Gaussian, σ = 1, real k = 40.
Visualization: empirical probabilities
Figure: empirical probabilities of the gap between the estimated and the real number of variables, for δ0, δ_{a2}, AIC and BIC; ε Gaussian, σ = 1, k = 40.
Measures of quality
Contingency table:

                     estimated
                 β̂j = 0   β̂j ≠ 0
real   βj = 0      TN        FP
       βj ≠ 0      FN        TP

- F-score = 2TP / (2TP + FP + FN), with 0 ≤ F-score ≤ 1
- Prediction error = ||y_new − Xβ̂0lasso||²
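The F-score from the contingency table compares the estimated support of β with the true one; a small sketch:

```python
import numpy as np

def f_score(beta_true, beta_hat):
    """F-score = 2TP / (2TP + FP + FN) on the supports of beta_true and beta_hat."""
    real, est = beta_true != 0, beta_hat != 0
    tp = np.sum(real & est)    # true positives: correctly selected
    fp = np.sum(~real & est)   # false positives: wrongly selected
    fn = np.sum(real & ~est)   # false negatives: missed variables
    return 2 * tp / (2 * tp + fp + fn)

print(f_score(np.array([1.0, 0.0, 2.0, 0.0]),
              np.array([0.9, 0.0, 0.0, 0.3])))   # -> 0.5
```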
Figure: β ∼ N(0, 2), ε ∼ N(0, σ²I), σ = 1, n = 300. Panels: number of estimated non-zero coefficients (with 1st/3rd quartiles), F-score and prediction error against the number of real non-zero coefficients, for δ0, δ_{a2}, AIC and BIC.
Figure: β ∼ N(0, 2), ε ∼ t(ν), σ = 1, n = 300. Panels: number of estimated non-zero coefficients, F-score and prediction error against the number of real non-zero coefficients.
Figure: β ∼ N(0, 2), ε ∼ U[−σ,σ]^n, σ = 1, n = 300. Panels: number of estimated non-zero coefficients, F-score and prediction error against the number of real non-zero coefficients.
Figure: β ∼ N(0, 2), ε ∼ N(0, σ²I), σ = 1, n = 1000. Panels: number of estimated non-zero coefficients, F-score and prediction error against the number of real non-zero coefficients.
Figure: β ∼ N(0, 2), ε ∼ t(ν), σ = 1, n = 1000. Panels: number of estimated non-zero coefficients, F-score and prediction error against the number of real non-zero coefficients.
Figure: β ∼ N(0, 2), ε ∼ U[−σ,σ]^n, σ = 1, n = 1000. Panels: number of estimated non-zero coefficients, F-score and prediction error against the number of real non-zero coefficients.
4 Conclusion
Conclusion and future work
- Robustness:
  - spherically symmetric distributions
  - canonical form
- Adaptation to other loss functions and estimators:
  - improvement on lasso estimation
  - adaptation to classification
- Case where X is non-orthogonal
- Generalization to p ≥ n
Thanks for your attention. Questions?