A method for model selection - Aurélie Boisbunon

A method for model selection
March 29, 2011
Aurélie Boisbunon*, Stéphane Canu†, Dominique Fourdrinier*, William E. Strawderman‡, Martin T. Wells††
*Université de Rouen, †INSA de Rouen, ‡Rutgers University, ††Cornell University

Outline

1 Framework and Notations
  - Model Selection
  - Variable selection
  - Our idea
2 Choices and assumptions
  - Spherically symmetric distributions
  - Canonical form
  - Number of comparisons
3 Loss estimation
  - How does it work?
  - Application
  - Results
4 Conclusion


Problem

Data
y ∈ R^n, X = (x1, ..., xp), xj ∈ R^n. Both n and p can be large (high-dimensional problem).

Aim
Find a regression function r linking X with y:

    y = r(X) + ε, where ε ∈ R^n.

Figure: diagram mapping the observations x1, x2, ..., xn through r to the responses y1, y2, ..., yn.


Model Selection

Several models are possible:
- Which one is the "best" model?
- How do we choose it?
- Can we find an automatic way to choose for a new dataset?

Variable selection

- Special case of model selection: the linear model family.

Linear model
    y = Xβ + ε,  β ∈ R^p

- Search for the "best" submodel, i.e. the "best" subset I* ⊂ {1, ..., p} of variables.
- Assumption: β_j ≠ 0 for j ∈ I*, and β_j = 0 for j ∉ I*.

Literature

Criteria for variable selection:
- Empirical criteria:
  - Cross Validation (CV) → huge cost
- Analytical criteria:
  - Akaike Information Criterion (AIC) / Mallows' Cp → over-estimation
  - Bayes Information Criterion or Schwarz Criterion (BIC) → under-estimation
  - ...


Our idea

Let k = #I be the number of selected variables.

Our criterion for estimating the optimal k* is based on:
- a broad distributional assumption;
- loss estimation.



Spherically symmetric distributions

General assumption on the error distribution: ε ∼ N(0, σ²I_n).

- In real life: most of the time non-Gaussian.
- In all cases: E[ε] = 0.

Figure: daily ozone level (ppm) in California, 2008: empirical distribution and estimated density vs. a Gaussian density. Data: http://www.stat.ucla.edu/~nchristo/statistics_c173_c273/ca_ozone.txt

Spherically symmetric distributions

Assumption 1: Distribution of ε
ε follows a spherically symmetric distribution around 0; we write ε ∼ s.s.(0). (The ε_i are no longer assumed i.i.d.)

Properties:
- Generalization of N(0, σ²I_n):
  - symmetry about 0;
  - ε ∼ s.s.(0) ⇒ Y ∼ s.s.(Xβ).
- Representation ε = RU with:
  - R = ||ε||;
  - U uniform on the unit sphere.
- Density:
  - f(ε) = g(||ε||);
  - f not necessarily defined.


Spherically symmetric distributions

Examples:
- Gaussian distribution N(0, σ²I_n),
- Student t(ν), ν > 1,
- hyperbolic secant distribution,
- Kotz distribution,
- logistic distribution,
- (centered) Gaussian mixtures,
- etc.
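A minimal NumPy sketch (not from the slides) of the ε = RU representation above: a direction U drawn uniformly on the unit sphere and an independent radius R. Choosing R² ∼ χ²_n recovers the Gaussian member of the family; rescaling that radius by an independent inverse-chi factor gives the Student member. Names and sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def spherical_error(n, radius_sampler, rng):
    """Draw eps = R * U: U uniform on the unit sphere of R^n,
    R = ||eps|| an independent nonnegative radius."""
    u = rng.standard_normal(n)
    u /= np.linalg.norm(u)          # U ~ uniform on the unit sphere
    return radius_sampler() * u

n = 300
# Gaussian case: R^2 ~ chi^2_n recovers eps ~ N(0, I_n).
eps_gauss = spherical_error(n, lambda: np.sqrt(rng.chisquare(n)), rng)
# Student case t(nu), nu = 4: scale the Gaussian radius by sqrt(nu / chi^2_nu).
nu = 4
eps_student = spherical_error(
    n, lambda: np.sqrt(rng.chisquare(n) * nu / rng.chisquare(nu)), rng)
```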

Spherically symmetric distributions

Why this assumption?
- It allows dependency, but no correlation.
- It is a broader assumption on the error distribution than N(0, σ²I_n) ⇒ robustness.
- It is still an important constraint (symmetry) ⇒ allows easy computation.

Canonical form

Projection of Y ∈ R^n by the orthogonal matrix G = (G1; G2):

    Z = G1 Y ∈ R^p,  U = G2 Y ∈ R^{n−p}.

Canonical form

Applying G to the linear model Y = Xβ + ε:

    (Z; U) = G(Xβ + ε) = (G1 Xβ; G2 Xβ) + η = (θ; 0) + η

where G = (G1; G2) is such that:
- G is n × n and orthogonal;
- G1 is p × n, with G1^T ∈ span(X);
- G2 is (n − p) × n, with G2 ⊥ X.

Example: from a QR decomposition X = QR, take G = Q^T.
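The QR example lends itself to a short numerical sketch, with sizes borrowed from the experiments later in the talk and an illustrative data-generating choice: G = Q^T from a full QR decomposition of X gives Z = G1 y and U = G2 y, and the identity ϕ0 = Aβ̂^LS = Z from the next slide can be checked directly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 300, 250                      # sizes borrowed from the experiments
X = rng.standard_normal((n, p))
beta = rng.normal(0, np.sqrt(2), p)
y = X @ beta + rng.standard_normal(n)

# Full QR of X: Q is n x n orthogonal, so G = Q^T has its first p rows
# spanning col(X) (G1) and its last n - p rows orthogonal to X (G2).
Q, _ = np.linalg.qr(X, mode="complete")
G = Q.T
Z = G[:p] @ y                        # Z, distributed around theta = G1 X beta
U = G[p:] @ y                        # U, distributed around 0

# Sanity check from the slides: phi_0 = A beta_LS equals Z.
A = G[:p] @ X
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.allclose(A @ beta_ls, Z)
```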

Canonical form

Usual form: Y ∼ s.s.(Xβ); estimation of β.
Canonical form: (Z; U) ∼ s.s.((θ; 0)); estimation of θ = G1 Xβ = Aβ.

Example: the least-squares estimate (substituting X = G1^T A):

    ϕ0 = Aβ̂^LS = A(X^T X)^{-1} X^T Y
       = A[(G1^T A)^T G1^T A]^{-1} (G1^T A)^T Y
       = A[A^T G1 G1^T A]^{-1} A^T G1 Y
       = A A^{-1} (A^T)^{-1} A^T Z = Z

Canonical form

Why use this form?
- Equivalent problems.
- "Pure" form.
- Existing work on estimation of the mean for s.s. distributions ⇒ link with linear models.
- The additional data U allows good variance estimation ⇒ robustness.

BUT: one additional constraint.

Assumption 2
p < n.

The lasso estimate in canonical form (componentwise soft thresholding):

    ϕ_j^lasso = (Z_j − λ sign(Z_j)) 1{|Z_j| > λ}

Problem to solve
Find the optimal value of the hyperparameter λ.
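For concreteness, the soft-thresholding formula above fits in a one-line function; this is a sketch with illustrative inputs.

```python
import numpy as np

def phi_lasso(Z, lam):
    """Componentwise soft thresholding:
    phi_j = (Z_j - lam * sign(Z_j)) * 1{|Z_j| > lam}."""
    return np.sign(Z) * np.maximum(np.abs(Z) - lam, 0.0)

# k = #I(lambda), the number of selected variables, decreases as lam grows.
Z = np.array([3.0, -0.5, 1.2, -2.4])
print(phi_lasso(Z, 1.0))             # -> [ 2. , -0. ,  0.2, -1.4]
k = np.count_nonzero(phi_lasso(Z, 1.0))
```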


The idea

- Basic idea: evaluate the quality of ϕ by a loss function L(ϕ, θ).
- Problem: θ is unknown → L(ϕ, θ) = ?
- Common solution: the empirical risk ||y − Xβ̂||²/n, based on the data.
- Our proposition: a trade-off between the two.

Proposed criterion
Loss estimation for the parameter θ, based only on the data.

Recipe

Step 1a: Define an estimator ϕ* of θ.
Step 1b: Define a loss function L(ϕ*, θ).
Step 2: Define an estimator δ of the loss L(ϕ*, θ).
Step 3: Find the minimum of δ: λ̂ = arg min_{λ ∈ R+} δ(λ).

Application

Step 1a: Define an estimator of θ

    ϕ^lasso = Z + g(Z), where g_j(Z) = −λ sgn(Z_j) 1{|Z_j| > λ} − Z_j 1{|Z_j| ≤ λ}

Step 1b: Define a loss function for ϕ^lasso

    L(ϕ*, θ) = ||ϕ^lasso − θ||²

- a common loss in regression;
- easy computations;
- X orthogonal ⇒ ||ϕ* − θ||² = ||β̂* − β||².

Step 2: Define a loss estimator of ϕ^lasso

Definition: unbiased estimator

    Eθ[δ0] = Eθ[L(ϕ^lasso, θ)] = Rθ(ϕ^lasso)

Developing the loss:

    Eθ[||ϕ^lasso − θ||²] = Eθ[||Z + g(Z) − θ||²]
                         = Eθ[||Z − θ||² + ||g(Z)||² + 2(Z − θ)^T g(Z)]

where

    Eθ[||Z − θ||²] = Eθ[(p/(n − p)) ||U||²]
    Eθ[(Z − θ)^T g(Z)] = (1/(n − p)) Eθ[||U||² div g(Z)]    (Stein's equality)
                       = (1/(n − p)) Eθ[||U||² (k − p)]

with k = #Î^lasso the number of selected variables.
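The divergence used in Stein's equality can be checked numerically: g_j has slope 0 where |Z_j| > λ and slope −1 elsewhere, so div g(Z) = k − p almost everywhere. A small finite-difference sketch (illustrative values; valid away from the kinks |Z_j| = λ):

```python
import numpy as np

def g(Z, lam):
    """g_j(Z) = -lam*sign(Z_j) 1{|Z_j| > lam} - Z_j 1{|Z_j| <= lam}."""
    return np.where(np.abs(Z) > lam, -lam * np.sign(Z), -Z)

def div_g(Z, lam, h=1e-6):
    """Numerical divergence sum_j dg_j/dZ_j (forward differences)."""
    p = Z.size
    return sum(
        (g(Z + h * np.eye(p)[j], lam)[j] - g(Z, lam)[j]) / h for j in range(p)
    )

rng = np.random.default_rng(4)
Z, lam = rng.standard_normal(10), 0.8
k = int(np.count_nonzero(np.abs(Z) > lam))
print(div_g(Z, lam), k - Z.size)     # both equal k - p (up to O(h))
```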

Putting the pieces together:

    Eθ[||ϕ^lasso − θ||²] = Eθ[||Z + g(Z) − θ||²]
                         = Eθ[(p/(n − p)) ||U||² + ||g(Z)||² + 2 ((k − p)/(n − p)) ||U||²]
                         = Eθ[((2k − p)/(n − p)) ||U||² + Σ_{j=1}^p Z_j² ∧ λ²]

where g_j(Z) = −λ sgn(Z_j) 1{|Z_j| > λ} − Z_j 1{|Z_j| ≤ λ}.

Unbiased estimator of L(ϕ^lasso, θ)

    δ0(λ) = ((2k − p)/(n − p)) ||U||² + Σ_{j=1}^p Z_j² ∧ λ²
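A minimal sketch of δ0 and of Step 3 of the recipe, assuming Gaussian errors as a stand-in for the spherical family and illustrative sizes from the protocol. Between consecutive breakpoints |Z_j| the count k is constant and the Σ min(Z_j², λ²) term is increasing in λ, so scanning the breakpoints is enough for the minimization.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k_true = 300, 250, 40
theta = np.concatenate([rng.normal(0, np.sqrt(2), k_true), np.zeros(p - k_true)])
Z = theta + rng.standard_normal(p)   # canonical data around theta
U = rng.standard_normal(n - p)       # residual part around 0

def delta0(lam, Z, U):
    """delta_0(lam) = (2k - p)/(n - p) ||U||^2 + sum_j min(Z_j^2, lam^2)."""
    p = Z.size
    k = int(np.count_nonzero(np.abs(Z) > lam))
    return (2 * k - p) / U.size * (U @ U) + np.sum(np.minimum(Z**2, lam**2))

# Step 3 of the recipe: scan the breakpoints |Z_(1)|, ..., |Z_(p)|.
grid = np.sort(np.abs(Z))
lam_hat = grid[np.argmin([delta0(lam, Z, U) for lam in grid])]
true_loss = np.sum((np.sign(Z) * np.maximum(np.abs(Z) - lam_hat, 0) - theta) ** 2)
print(lam_hat, delta0(lam_hat, Z, U), true_loss)
```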

- Unbiased estimators are sometimes not satisfactory (as for β̂^LS).
- Correction with a small bias.
- Need a criterion allowing comparison between δ0 and a corrective estimator δ:
  - loss: L(δ) = (δ − L(ϕ^lasso, θ))²;
  - risk: R(δ) = Eθ[(δ − L(ϕ^lasso, θ))²];
  - aim: R(δ) ≤ R(δ0).

Corrective estimator

    δγ(λ) = δ0(λ) − ||U||⁴ γ(Z), where γ : R^p → R is a correction function.

Two candidates:
- γ1(Z) = a1 / ||Z||² (James-Stein type) → constant w.r.t. λ
- γ2(Z) = a2 / ((k + 1) Z_(k)² + Σ_{j=k+1}^p Z_(j)²) (proposed by W. E. Strawderman) → varies with λ

Retained corrective estimator:

    δγ(λ) = δ0(λ) − ||U||⁴ a2 / ((k + 1) Z_(k)² + Σ_{j=k+1}^p Z_(j)²)
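A sketch of the corrected estimator with the γ2 correction, reusing delta0, Z, and U from the δ0 sketch above; the handling of the boundary case k = 0 is my assumption, not from the slides.

```python
import numpy as np

# delta0, Z, U as defined in the delta0 sketch above.
def delta_gamma(lam, Z, U, a2=-8.0):
    """delta_gamma(lam) = delta0(lam) - ||U||^4 * gamma_2(Z), with
    gamma_2(Z) = a2 / ((k+1) Z_(k)^2 + sum_{j=k+1}^p Z_(j)^2),
    where Z_(1)^2 >= ... >= Z_(p)^2 are the ordered squared components."""
    z2 = np.sort(Z**2)[::-1]
    k = int(np.count_nonzero(np.abs(Z) > lam))
    # Boundary case k = 0 reduced to ||Z||^2: an assumption, not from the slides.
    denom = (k + 1) * z2[k - 1] + z2[k:].sum() if k > 0 else z2.sum()
    return delta0(lam, Z, U) - (U @ U) ** 2 * a2 / denom
```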


Summary

- Estimator of θ: ϕ^lasso = Z + g(Z).
- Loss function for ϕ^lasso: L(ϕ^lasso, θ) = ||ϕ^lasso − θ||².
- Loss estimators:
  - unbiased est.: δ0(λ) = ((2k − p)/(n − p)) ||U||² + Σ_{j=1}^p Z_j² ∧ λ²
  - corrective est.: δγ(λ) = δ0(λ) − ||U||⁴ a2 / ((k + 1) Z_(k)² + Σ_{j=k+1}^p Z_(j)²)
- Problem to solve: find the optimal value of the hyperparameter λ.

Results

Experimental protocol
- n = 300, p = 250.
- X = Q̃_{1,...,p}, where Q̃ comes from a QR decomposition of p Gaussian random vectors.
- β_j ∼ N(0, 2) for j = 1, ..., k, with k ∈ {10, 20, ..., 240} (very sparse to non-sparse problems).
- Error distributions, with σ ∈ {0.5, 1, 3}:
  - ε ∼ N(0, σ²I_n) (usual assumption);
  - ε ∼ t(ν), ν = 4 (our assumption);
  - ε ∼ U[−σ, σ] (assumption violated).
- r = 50 replicates of ε.
- a2 = −8.
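One cell of this protocol is easy to reproduce. A sketch, where the scaling of the Student errors by σ is my reading of the protocol, not stated explicitly:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, k, sigma = 300, 250, 40, 1.0       # one cell of the protocol

# Design: the Q factor of a QR decomposition of a Gaussian n x p matrix.
X, _ = np.linalg.qr(rng.standard_normal((n, p)))

beta = np.zeros(p)
beta[:k] = rng.normal(0, np.sqrt(2), k)  # beta_j ~ N(0, 2) for j <= k

errors = {
    "gaussian": sigma * rng.standard_normal(n),  # usual assumption
    "student":  sigma * rng.standard_t(4, n),    # our assumption, t(4)
    "uniform":  rng.uniform(-sigma, sigma, n),   # assumption violated
}
ys = {name: X @ beta + eps for name, eps in errors.items()}
```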

Results

Comparisons:

    AIC(λ) = ||y − Xβ̂^lasso(λ)||² / σ̂² + 2k
    BIC(λ) = ||y − Xβ̂^lasso(λ)||² / σ̂² + log(n) k

where σ̂² = ||y − Xβ̂^LS||² / (n − p).
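A sketch of both criteria for the orthonormal design of the protocol, where the lasso solution is exact soft thresholding of Z = X^T y; the (n − p) normalization of σ̂² is reconstructed, not explicit in the slide.

```python
import numpy as np

def aic_bic(lam, y, X):
    """AIC/BIC along the lasso path for an orthonormal design (X^T X = I),
    where the lasso solution is soft thresholding of Z = X^T y."""
    n, p = X.shape
    Z = X.T @ y
    beta_hat = np.sign(Z) * np.maximum(np.abs(Z) - lam, 0.0)
    k = int(np.count_nonzero(beta_hat))
    rss = np.sum((y - X @ beta_hat) ** 2)
    beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2_hat = np.sum((y - X @ beta_ls) ** 2) / (n - p)  # assumed normalization
    return rss / sigma2_hat + 2 * k, rss / sigma2_hat + np.log(n) * k
```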

Visualization: criteria

Figure: the loss L(ϕ, θ), the unbiased estimator δ0, the corrective estimator δ_{a2}, AIC, and BIC (log scale) as functions of the number of selected variables, with the minimum of each criterion marked; ε Gaussian, σ = 1, true number of non-zero coefficients k = 40.

Visualization: empirical probabilities

Figure: empirical probabilities for δ0, δ_{a2}, AIC, and BIC; ε Gaussian, σ = 1, k = 40.

Measures of quality

Contingency table:

                    estimated β̂_j = 0   estimated β̂_j ≠ 0
    real β_j = 0          TN                  FP
    real β_j ≠ 0          FN                  TP

- F-score = 2TP / (2TP + FP + FN), with 0 ≤ F-score ≤ 1.
- Prediction error = ||y^new − Xβ̂_0^lasso||².
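Both quality measures fit in a few lines; a sketch, with illustrative names:

```python
import numpy as np

def f_score(beta_true, beta_hat):
    """F-score = 2 TP / (2 TP + FP + FN), computed on the supports."""
    real, est = beta_true != 0, beta_hat != 0
    tp = np.sum(real & est)
    fp = np.sum(~real & est)
    fn = np.sum(real & ~est)
    return 2 * tp / (2 * tp + fp + fn)

def prediction_error(y_new, X, beta_hat):
    """Squared prediction error ||y_new - X beta_hat||^2 on fresh data."""
    return float(np.sum((y_new - X @ beta_hat) ** 2))
```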

Figure: number of estimated non-zero coefficients, F-score, and prediction error vs. the true number of non-zero coefficients, with 1st/3rd quartiles, for δ0, δ_{a2}, AIC, and BIC; β ∼ N(0, 2), ε ∼ N(0, σ²I), σ = 1, n = 300.

Figure: same panels (number of estimated non-zero coefficients, F-score, prediction error); β Gaussian, ε ∼ t(ν), σ = 1, n = 300.

Figure: same panels; β Gaussian, ε ∼ U[−σ, σ]^n, σ = 1, n = 300.

Figure: same panels; β Gaussian, ε ∼ N(0, σ²I), σ = 1, n = 1000.

Figure: same panels; β Gaussian, ε ∼ t(ν), σ = 1, n = 1000.

Figure: same panels; β Gaussian, ε ∼ U[−σ, σ]^n, σ = 1, n = 1000.


Conclusion and future work

- Robustness:
  - spherically symmetric distributions;
  - canonical form.
- Adaptation to other loss functions and estimators:
  - improvement on lasso estimation;
  - adaptation to classification.
- Case where X is non-orthogonal.
- Generalization to p ≥ n.

Thanks for your attention. Questions?