A global procedure for variable selection

Aurélie Boisbunon, Stéphane Canu, Dominique Fourdrinier
Université de Rouen and INSA de Rouen, Avenue de l'Université - BP 12, 76801 Saint-Étienne-du-Rouvray Cedex

Keywords: Model selection, variable selection, linear regression, loss estimation, lasso, MCP.

Abstract

The variable selection problem consists in finding a balance between model complexity and data fitting. To do so, we compare several strategies for exploring the space of variables and we propose a new assessment of the model generalization performance. Because an exhaustive search over all variable subsets is computationally intractable, many popular search approaches rely on a single sequence of variables (Forward/Backward, Stagewise, regularization path, etc.). But, due to a lack of exploration, these methods often fail at finding the right variables. We investigate their respective ability to recover the true subset. To compare the generalization error of the models, we also propose a new improved Stein estimator of the risk, designed for distributions with spherical symmetry. It turns out to be more robust than criteria based on the Gaussian hypothesis (AIC, Cp, BIC) and more stable than cross-validation.

1 Introduction

In the field of statistical learning theory, the problem of model selection consists in comparing several possible submodels and determining which one is the "best". As discussed in [3], there is now an admitted general rule saying that it is not possible to perform both a good selection, allowing us to understand the mechanism behind the data, and a good prediction at the same time. This general rule might be true, but it might also highlight the inability of current model selection procedures to reach both objectives. Let us start with a simple example, taken from [17]. We consider the linear Gaussian regression
$$y = X\beta + \varepsilon, \qquad (1)$$
where $(y, X)$ is the observed data composed of $n \in \{20, 60\}$ observations and $p$ explanatory variables, β is the unknown regression coefficient and the noise ε is drawn from $\mathcal{N}(0, \sigma^2 I_n)$ with $\sigma \in \{1, 3\}$. The coefficient vector β is set to $(3, 1.5, 0, 0, 2, 0, 0, 0)^t$, and the variables in X are correlated with $\rho(x_i, x_j) = 0.5^{|i-j|}$. We run an exhaustive exploration and estimate β by least-squares on each subset. The prediction error $\|X\beta - X\hat\beta\|_2^2$ is computed to select the best submodel and compared to the classical criteria Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), and Leave-One-Out Cross-Validation (LOOCV). This simple example illustrates the inefficiency of classical model selection procedures even in simple settings where the variance is low, while the prediction error, which they all try to estimate, behaves very nicely, even when dealing with large variance.

Following the idea of multilevel inference in [10] and the discussion on good penalties/estimators in [6], we propose to compare several procedures of model selection: sparse regularization methods providing a regularization path and an estimate of β at the same time (lasso or MCP for instance), and the definition of a way to explore the submodels (Forward/Backward, regularization path, widen regularization path, etc.) for which we estimate β by least-squares (or ridge regression) on each subset.


We separate both approaches and derive corresponding criteria for model evaluation based on loss estimation, in the same spirit as what is done in [14] and [16]. These criteria are built under the assumption that the noise ε is drawn from any spherically symmetric distribution. This brings distributional robustness as well as a way to handle some dependence between the components of ε.

Table 1: Frequency (%) of selection with exhaustive exploration.

Criterion  | n = 20, σ = 3 | n = 20, σ = 1 | n = 60, σ = 1
Pred. err  | 78.50         | 100.00        | 100.00
AIC/Cp     | 7.00          | 41.50         | 38.50
BIC        | 8.50          | 60.00         | 76.00
LOOCV      | 0.00          | 41.00         | 43.00
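To make the setup concrete, the following sketch (ours, not the authors' code) generates one dataset of this example with numpy. The dimension p = 8 and the true support {1, 2, 5} are implied by the given β; drawing the rows of X from a Gaussian with covariance 0.5^|i-j| is our assumption, the text only stating the correlation structure.

```python
import numpy as np

def make_example(n=20, sigma=3.0, rho=0.5, seed=0):
    """One dataset from the introductory example: y = X beta + eps with
    beta = (3, 1.5, 0, 0, 2, 0, 0, 0), corr(x_i, x_j) = rho**|i-j|, Gaussian noise."""
    rng = np.random.default_rng(seed)
    beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
    p = beta.size
    # Covariance matrix with entries rho^{|i-j|} (Gaussian design assumed here).
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y, beta

X, y, beta = make_example(n=20, sigma=3.0)
```

The exhaustive exploration of Table 1 then fits least-squares on every subset of variables and selects, for each criterion, the subset minimizing it.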

2 Context and notations

2.1 Framework and distribution assumption

In this work, we focus on the linear regression model stated in (1). In particular, we consider the case where X is a fixed design matrix and the number of variables is lower than the number of observations, that is p < n. The objective is to estimate the regression coefficient β ∈ ℝ^p. Before going into further details on this estimation, let us say a few words on the noise distribution. The literature on the linear regression model often bases its theoretical background on the assumption that ε is composed of n independent observations of some known distribution P, generally a Gaussian distribution. On the contrary, some authors like [18] consider instead the worst-case analysis where ε is one observation of an n-dimensional random vector drawn from an unknown distribution P_n. This latter assumption is fully general and sometimes leads to loose bounds on the prediction error [see 9]. Our work relies on a tradeoff between these two assumptions. We indeed consider the n-variate ε to be drawn from an unknown distribution P_n, but we restrict the space of P_n to the family of spherically symmetric distributions, which we write ε ∼ SS_n(0). The distribution P_n is thus characterized by a density, when it exists, of the form $t \mapsto f(\|t\|_2^2)$, for a given function f mapping from ℝ^n to ℝ_+. According to the properties of spherical symmetry, Y also follows a spherically symmetric distribution with mean µ = Xβ, that is to say Y ∼ SS_n(Xβ). See [13] for a review of this family of distributions. Figure 1 shows some examples of such densities. Note that the Gaussian case is the only spherical law verifying independence between the components of Y and ε.
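As a small illustration (our addition, not part of the paper), any spherically symmetric vector around 0 can be written as R·U, with U uniform on the unit sphere and R ≥ 0 an independent radius. The sketch below draws two members of the family shown in Figure 1, the Gaussian and a Student-t (a Gaussian scale mixture), with numpy.

```python
import numpy as np

rng = np.random.default_rng(0)
size, n = 10_000, 2                       # n = 2, as in Figure 1

def sphere_uniform(n, size, rng):
    """Uniform draws on the unit sphere of R^n: normalize Gaussian vectors."""
    g = rng.standard_normal((size, n))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

# Gaussian member of the family: radius R with R^2 ~ chi^2(n).
r = np.sqrt(rng.chisquare(n, size=size))
eps_gauss = r[:, None] * sphere_uniform(n, size, rng)

# Student-t member (nu degrees of freedom): a Gaussian scale mixture,
# hence still spherically symmetric around 0, with dependent components.
nu = 5
s = np.sqrt(nu / rng.chisquare(nu, size=size))
eps_student = s[:, None] * rng.standard_normal((size, n))
```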


Figure 1: Examples of spherically symmetric densities for a two-dimensional random variable: Gaussian (top left), Student (top right), Kotz (bottom left), mixture of centered Gaussians (bottom right).

2.2 Two multistep approaches for two models

From our experience, we see two main approaches to solve model (1): either use a regularization method providing several sparse estimations of β depending on some hyperparameter, or define a way to explore


the possible submodels and use a good estimator of β on each submodel, such as least-squares, ridge regression or a more robust estimator.

Regularization methods consist in minimizing the sum of squared residuals penalized by some function ρ of the coefficient β. They are formalized as follows:
$$\min_{\beta} \ \frac{1}{2}\|y - X\beta\|_2^2 + \lambda\rho(\beta), \qquad \lambda \ge 0. \qquad (2)$$

It is well known that taking ρ to be non-differentiable at zero leads to sparse estimates of β. The number of non-zero coefficients is determined by the value of λ. Hence, choosing several values of λ wisely provides a regularization path adding one variable at a time. Among the most popular regularization methods we can cite the lasso [17] and its efficient LARS algorithm [5], the Smoothly Clipped Absolute Deviation (SCAD) of [6], the Minimax Concave Penalty (MCP) of [20], and the Adaptive Lasso [21]. We will only focus on two of them in the sequel, namely the lasso and MCP: the first one because it has been the subject of many works giving conditions for the recovery of the right subset [see 19], and the second one because it is almost unbiased, a nice property when the final objective is good prediction or good estimation. Their respective penalty functions are of the following form:
$$\rho_{\mathrm{lasso}}(\beta) = \sum_{j=1}^{p} |\beta_j|, \qquad (3)$$
$$\rho_{\mathrm{MCP}}(\beta) = \sum_{j=1}^{p} \int_0^{|\beta_j|} \left(1 - \frac{t}{a\lambda}\right)_+ dt, \qquad (4)$$

where $(t)_+ = \max(t, 0)$, $1_{\{t \in E\}} = 1$ if $t \in E$ and 0 otherwise, and $a > 1$ is a hyperparameter tuning the bias of the MCP estimate of β.

The second approach to model selection we mentioned does not solve model (1), but instead a reduction of (1) to a subset I of variables,
$$y = X_I \beta_I + \varepsilon. \qquad (5)$$
We will refer to (5) as the reduced linear model. In this approach, the first step is to define how we want to explore the graph of submodels. This exploration step is driven by the size of the dataset at hand. If the number p of variables is small (p < 30), then exhaustive exploration ensures the recovery of the true subset, as soon as the true model belongs to the class of linear models on a subspace of {x_1, ..., x_p}. If p is moderate to large, there exist several options. The earliest works are surely Forward Selection/Backward Elimination and other stepwise regression algorithms. They are however known to be unstable and it is not clear whether there exist conditions under which they can actually recover the true subset of variables [see 11 for a good review]. More recent works are regularization path algorithms, provided by the optimization of (2). One interesting aspect of such algorithms is that they can generally be computed in O(p), compared to p(p + 1)/2 subsets for stepwise algorithms and $2^p$ different subsets for exhaustive exploration. Moreover, their ability to recover the true subset has been proved under some conditions. The last approach we consider in this work is what we call the "widen path". In a similar way to what is done in Monte-Carlo Tree Search algorithms [see 4], it consists in the following strategy: first run a regularization method to obtain the sequence of variables of the path; then define a decreasing probability measure on the order of the sequence, giving more weight to the first variables; finally, randomly generate r paths of 1 to p − 1 variables (a sketch of one possible implementation is given at the end of this section).

Note that the two approaches we described rely on the principle of multilevel inference defined in [10]. Following the authors' idea, we see model selection as a three-step procedure. The first step is the exploration of the submodels or subsets, the second step is the estimation of the parameters on each subset, and the last one, which we have not discussed yet, is the evaluation of the estimations in order to select the best one. Regularization methods with sparse estimates perform the first two steps at the same time, which explains their popularity. In order to perform a full model evaluation, there should also be a pre-processing step, for instance filtering of the raw data, and a post-processing step, like estimating the parameter on the entire set of observations if this latter was split. However, we will not treat them in this work.

Let us now discuss the final step of model evaluation. To the best of our knowledge, the difference between models (1) and (5) is seldom taken into account in the literature on model evaluation. This difference is however very important when using loss estimation, and especially with Stein's identity, because of the need for weakly differentiable estimators of β. We propose new criteria for either (1) or (5).


Moreover, our work stands as a trade-off between the classical Gaussian i.i.d. assumption, as in [1], [14], [15] and their derivatives, and the worst-case analysis, such as the Vapnik-Chervonenkis bound [18]. Indeed, we try to estimate the prediction error without specifying the form of the distribution, only restricting it to the family of spherically symmetric distributions. In this way we broaden the distribution assumption, while the restriction to spherical symmetry would allow us to obtain tighter bounds on probabilities than in worst-case analysis. The next section gives some insight into how our criteria for model evaluation are obtained in both models (1) and (5).
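The "widen path" exploration announced above can be sketched as follows. This is our reading of the strategy: the text does not specify the decreasing probability measure, so the geometrically decaying weights below are an assumption, and scikit-learn's lars_path is used here only to obtain the lasso ordering of the variables.

```python
import numpy as np
from sklearn.linear_model import lars_path

def widen_path(X, y, r=50, decay=0.7, seed=0):
    """Generate r random candidate subsets ('widen path' sketch).

    1) Run the lasso/LARS path and record the order in which variables appear.
    2) Put a decreasing probability on that order (earlier variables weigh more).
    3) Sample r subsets of sizes 1 .. p-1 according to these weights."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    _, active, _ = lars_path(X, y, method="lasso")
    # Variables never activated are appended at the end of the ordering.
    order = list(active) + [j for j in range(p) if j not in active]
    w = decay ** np.arange(p)                 # geometric weights: our assumption
    probs = np.empty(p)
    probs[np.array(order)] = w / w.sum()
    subsets = set()
    for _ in range(r):
        k = rng.integers(1, p)                # subset size in {1, ..., p-1}
        subset = rng.choice(p, size=k, replace=False, p=probs)
        subsets.add(tuple(sorted(subset)))
    return [list(s) for s in subsets]
```

Each sampled subset is then fitted by least-squares and evaluated with one of the criteria derived in the next section.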

3 Estimators of loss under the spherical assumption in two different settings

In this section, we derive criteria for model evaluation based on loss estimation theory. The objective is to evaluate the quality of an estimator β̂ by some loss function L(β, β̂). In the context of regression, we choose the quadratic predictive loss function $L(\beta, \hat\beta) = \|X\hat\beta - X\beta\|_2^2$, allowing easy computation. As defined in [6], a good estimator of β should have a low bias in order to comply with an oracle property for L(β, β̂). Indeed, assuming the true submodel I* belongs to the set $\mathcal{I}$ of submodels defined by the exploration step, β̂ should have minimum prediction error for submodel I*:
$$\hat\beta(I^*) = \arg\min_{\hat I \in \mathcal{I}} \|X\hat\beta(\hat I) - X\beta\|_2^2 \qquad (6)$$
or
$$\hat\beta_{I^*} = \arg\min_{\hat I \in \mathcal{I}} \|X_{\hat I}\hat\beta_{\hat I} - X_I\beta_I\|_2^2, \qquad (7)$$

depending on the model considered. With this definition, the "best" subset would be the one minimizing the predictive loss. However, as β is unknown, so is L(β, β̂), and we need to estimate this loss based on data. Following what is done in [16] and its extension to the spherical case in [8], we can build unbiased estimators. We say that an estimator δ₀ of the loss L(β, β̂) is unbiased if it verifies the risk equality $E_\mu[\delta_0] = E_\mu[L(\beta, \hat\beta)]$, where $E_\mu$ is the expectation under the spherical assumption. For the construction of unbiased estimators, we need the following identity.

Lemma 1 (Quadratic risk identity). Let $Y \in \mathbb{R}^n$ follow any spherically symmetric distribution around $\mu \in \mathbb{R}^n$, $Y \sim SS_n(\mu)$. Given $\hat\mu \in \mathbb{R}^n$ an estimator of µ, if µ̂ is weakly differentiable with respect to Y, then we have the following equality:
$$E_\mu\big[\|\hat\mu - \mu\|_2^2\big] = E_\mu\big[\|\hat\mu - Y\|_2^2\big] + E_\mu\!\left[\frac{2\,\mathrm{div}_Y \hat\mu - n}{n - d}\,\|Y - \hat\mu_{LS}\|_2^2\right],$$

where d is the rank of the matrix used to compute the least-squares estimator (LSE) $\hat\mu_{LS}$, provided these expectations exist.

Proof. The proof consists in developing the expression of the risk:
$$R(\mu, \hat\mu) = E_\mu\big[\|\hat\mu - Y + Y - \mu\|_2^2\big] = E_\mu\big[\|\hat\mu - Y\|_2^2 + \|Y - \mu\|_2^2\big] + E_\mu\big[2(Y - \mu)^t(\hat\mu - Y)\big]. \qquad (8)$$

From this development, it is straightforward to show the equality in risk given in Lemma 1 under the Gaussian assumption, thanks to Stein's identity for the last expectation [see 16]. The spherical case is a bit more complicated and we therefore use the trick of transforming model (1) or (5) into its canonical form. In order to be more general, we will denote the matrix of explanatory variables by M, being either X_I or X, and its dimension by d. We however keep the notation β for the parameter, but we restrict it to the space ℝ^d.


Let $G = \begin{pmatrix} G_1 \\ G_2 \end{pmatrix}$ be an orthogonal matrix such that $G_1^t$ spans the same column space as M and $G_2 M = 0$. The canonical form of the linear model is obtained by applying G to Y:
$$GY = \begin{pmatrix} Z \\ U \end{pmatrix} = \begin{pmatrix} G_1 \mu \\ 0 \end{pmatrix} + G\varepsilon = \begin{pmatrix} \theta \\ 0 \end{pmatrix} + G\varepsilon,$$
where $Z = G_1 Y$, $U = G_2 Y$, and $\theta = G_1 \mu$. When $\varepsilon \sim SS_n(0)$, so does $G\varepsilon$. Hence $(Z, U) \sim SS_n((\theta, 0))$. Note that Z and U are dependent, except for the Gaussian distribution. Also, it can easily be proved that the canonical form of the least-squares estimator is Z. Indeed, replacing Y by its canonical equivalent and noting that there exists a matrix $A \in \mathbb{R}^{d \times d}$ such that $M = G_1^t A$ (since $G_1^t$ and M span the same column space), we have
$$\hat\theta_{LS} = A(M^t M)^{-1} M^t Y = A(A^t G_1 G_1^t A)^{-1} A^t G_1 \,(G_1^t \;\; G_2^t)\begin{pmatrix} Z \\ U \end{pmatrix} = A(A^t A)^{-1} A^t Z = Z.$$

With the equivalence between the canonical form and the linear model, we have
$$E_\mu\big[\|Y - \mu\|^2\big] = E_{(\theta,0)}\big[\|G^t(Z, U)^t - G^t(\theta, 0)^t\|^2\big] = E_{(\theta,0)}\big[\|(Z, U) - (\theta, 0)\|^2\big] = E_{(\theta,0)}\big[\|Z - \theta\|^2 + \|U\|^2\big].$$
Now, from the spherical symmetry property, we know that $E_{(\theta,0)}[\|Z - \theta\|^2] = d\, E_{(\theta,0)}[\|U\|^2]/(n - d)$. Furthermore,
$$\|Y - M\hat\beta_{LS}\|^2 = \|Y - G_1^t Z\|^2 = \left\| G^t \begin{pmatrix} Z \\ U \end{pmatrix} - G^t \begin{pmatrix} Z \\ 0 \end{pmatrix} \right\|^2 = \|U\|^2.$$

Hence $n\|Y - M\hat\beta_{LS}\|^2/(n - d)$ is an unbiased estimator of $\|Y - M\beta\|^2$ even in the spherical case. Finally, [8] give and prove a theorem extending Stein's identity to the spherical case.

Theorem 1 (Stein-type identity). Given $(Z, U) \in \mathbb{R}^n$ a random vector following a spherically symmetric distribution around $(\theta, 0)$, and $g : \mathbb{R}^d \mapsto \mathbb{R}^d$ a weakly differentiable function, we have
$$E_\theta\big[(Z - \theta)^t g(Z)\big] = E_\theta\big[\|U\|^2 \,\mathrm{div}_Z g(Z)/(n - d)\big], \qquad (9)$$
provided both expectations exist.

Remarking that $(Y - M\beta)^t(M\hat\beta - Y) = (Z - \theta)^t(\hat\theta - Z) - \|U\|^2$, we can develop the right-most term of (8):
$$E_\mu\big[(Y - M\beta)^t(M\hat\beta - Y)\big] = E_\theta\big[(Z - \theta)^t(\hat\theta - Z) - \|U\|^2\big]$$
$$= E_\theta\Big[\big(\mathrm{div}_Z(\hat\theta - Z)/(n - d) - 1\big)\|U\|^2\Big]$$
$$= E_\mu\!\left[\left(\frac{\mathrm{div}_{G_1 Y}(G_1 M\hat\beta) - d}{n - d} - 1\right)\|Y - M\hat\beta_{LS}\|^2\right]$$
$$= E_\mu\Big[\big(\mathrm{div}_Y(M\hat\beta) - n\big)\|Y - M\hat\beta_{LS}\|^2/(n - d)\Big].$$
Substituting this expression and the estimation of the variance into (8), we obtain the desired result.

This identity is stated in a general setting and is true for both models (1) and (5). In the next sections, we will specify it for each model. In the normal case, [12] proved the inadmissibility of the unbiased estimator of loss of the Maximum Likelihood Estimator and of the James-Stein estimator, and proposed a way to improve on it. Following his work, we define a possible way of improvement.
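As a numerical sanity check (ours, not from the paper), Lemma 1 can be verified by Monte Carlo for a linear estimator such as ridge regression, whose divergence is the trace of its hat matrix, under spherically symmetric Student-t noise. Both sides of the identity should agree up to simulation error.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, nu, a, reps = 30, 5, 5.0, 2.0, 100_000

X = rng.standard_normal((n, p))                         # fixed design, d = rank(X) = p
mu = X @ rng.standard_normal(p)                         # true mean X beta

P_ls = X @ np.linalg.solve(X.T @ X, X.T)                # least-squares projector
H = X @ np.linalg.solve(X.T @ X + a * np.eye(p), X.T)   # ridge hat matrix (mu_hat = H y)
div = np.trace(H)                                       # divergence of this linear estimator

# Spherically symmetric noise: multivariate Student-t, i.e. a Gaussian scale mixture.
scale = np.sqrt(nu / rng.chisquare(nu, size=(reps, 1)))
Y = mu + scale * rng.standard_normal((reps, n))

Mu_hat, Mu_ls = Y @ H.T, Y @ P_ls.T
lhs = np.mean(np.sum((Mu_hat - mu) ** 2, axis=1))       # E ||mu_hat - mu||^2
rhs = np.mean(np.sum((Mu_hat - Y) ** 2, axis=1)
              + (2 * div - n) / (n - p) * np.sum((Y - Mu_ls) ** 2, axis=1))
print(lhs, rhs)                                         # should match up to Monte Carlo error
```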


Definition 1 (Condition of improvement). Let $Y \sim P_n(\mu)$ and y be an observation of Y. Given δ₀ and δ two estimators of the loss $L(\mu, \hat\mu)$, if $P \subset SS$, then δ is said to improve on δ₀ in terms of quadratic risk if it verifies, provided both expectations exist,
$$\forall \beta \in \mathbb{R}^p, \quad E_\mu\big[(\delta - L(\mu, \hat\mu))^2\big] \le E_\mu\big[(\delta_0 - L(\mu, \hat\mu))^2\big],$$
where $E_\mu$ is the expectation with respect to Y, and if there exists at least one value of µ for which the inequality is strict.

This definition gives us guidelines for constructing improved estimators of loss. We will propose one for each model in the following sections.

3.1 Full linear model

In this section, we consider the setting of model (1) and estimate β by a sparse regularization method, such as the lasso or MCP. We first specify an unbiased estimator of the loss L(β, β̂) in this context.

Lemma 2. Let $Y \sim P_n(X\beta)$ and y be an observation of Y. If P belongs to the family of spherically symmetric distributions ($P \subset SS$), then the following estimator is an unbiased estimator of $\|X\hat\beta - X\beta\|_2^2$:
$$\delta_0(y) = \|X\hat\beta - y\|_2^2 + (2\widehat{df} - n)\,\frac{\|y - X\hat\beta_{LS}\|_2^2}{n - p}, \qquad (10)$$
where β̂ is the chosen estimator of β, $\hat\beta_{LS}$ is the full least-squares estimator of β, and $\widehat{df}$ is the number of degrees of freedom estimated by the divergence of $X\hat\beta$ with respect to y, that is $\widehat{df} = \mathrm{div}_y(X\hat\beta) = \mathrm{trace}\big[(\partial (X\hat\beta)_i/\partial y_j)_{i,j=1}^n\big]$.

Proof. The proof is straightforward when we apply Lemma 1 with $\mu = X\beta$, $\hat\mu = X\hat\beta$ and $d = p$ (the rank of X).

Note that this unbiased estimator is equivalent to Mallows' Cp and Akaike's AIC under the Gaussian assumption, but here our distributional framework is much wider. We now propose the following improved estimator:
$$\delta_\gamma(y) = \delta_0(y) - c_1\,\frac{\|y - X\hat\beta_{LS}\|^4}{\gamma_1(y)}, \qquad (11)$$
where $c_1 = 2(p - 2 - 2k(k+1)/p)/[(n - p + 4)(n - p + 6)]$, $\gamma_1(y) = k \max_{j \le p}\{z_j^2 : |z_j| \le \lambda\} + \sum_{j \le p} z_j^2\, 1_{\{|z_j| \le \lambda\}}$, and $z_j = (q^j)^t y$, $q^j$ being the j-th column of the matrix Q obtained by a QR decomposition of X. The additional term in $\delta_\gamma$ can be interpreted as a data-driven penalty. The correction function $\gamma_1$ takes into account the information given by the variables considered as non-relevant. The improvement over δ₀ has been proven in [2].
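To illustrate, here is a minimal sketch (ours) computing δ0 of (10) and δγ of (11) for a lasso fit. The divergence d̂f is taken as the number of nonzero lasso coefficients, a standard estimate of the lasso's degrees of freedom; reading k in c1 and γ1 as the number of selected variables, and the guard when γ1 = 0, are assumptions of this sketch.

```python
import numpy as np
from sklearn.linear_model import Lasso

def delta0_full(X, y, beta_hat, df_hat):
    """Unbiased loss estimate (10) for the full model (1)."""
    n, p = X.shape
    beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]        # full least-squares fit
    rss_ls = np.sum((y - X @ beta_ls) ** 2)
    return np.sum((X @ beta_hat - y) ** 2) + (2 * df_hat - n) * rss_ls / (n - p)

def delta_gamma_full(X, y, beta_hat, df_hat, lam):
    """Improved estimate (11); k is read as the number of selected variables and
    z_j are the coordinates of y in the Q basis of a QR decomposition of X."""
    n, p = X.shape
    k = int(np.sum(beta_hat != 0))
    Q, _ = np.linalg.qr(X)
    z = Q.T @ y
    small = np.abs(z) <= lam                              # coordinates deemed non-relevant
    gamma1 = k * np.max(z[small] ** 2, initial=0.0) + np.sum(z[small] ** 2)
    if gamma1 == 0.0:                                     # guard added in this sketch
        return delta0_full(X, y, beta_hat, df_hat)
    c1 = 2 * (p - 2 - 2 * k * (k + 1) / p) / ((n - p + 4) * (n - p + 6))
    beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
    rss_ls = np.sum((y - X @ beta_ls) ** 2)
    return delta0_full(X, y, beta_hat, df_hat) - c1 * rss_ls ** 2 / gamma1

# Example: a lasso fit at one value of lambda on synthetic data.
rng = np.random.default_rng(0)
n, p = 60, 8
X = rng.standard_normal((n, p))
y = X @ np.array([3, 1.5, 0, 0, 2, 0, 0, 0.0]) + rng.standard_normal(n)
lam = 5.0
beta_hat = Lasso(alpha=lam / n, fit_intercept=False).fit(X, y).coef_  # sklearn alpha = lambda/n
df_hat = np.sum(beta_hat != 0)
print(delta0_full(X, y, beta_hat, df_hat), delta_gamma_full(X, y, beta_hat, df_hat, lam))
```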

3.2 Reduced linear model

We now consider the reduced model (5). We estimate β by the reduced least-squares estimator $\hat\beta_{\hat I} = (X_{\hat I}^t X_{\hat I})^{-1} X_{\hat I}^t y$, where $\hat I$ is defined by the chosen exploration method (exhaustive, Forward Selection/Backward Elimination, regularization path, widen regularization path, etc.). Lemma 3 gives an unbiased estimator of loss for such a setting.

Lemma 3. Let $Y \sim P_n(X_I\beta_I)$ and y be an observation of Y. If P belongs to the family of spherically symmetric distributions ($P \subset SS$), then the following estimator is an unbiased estimator of $\|X_{\hat I}\hat\beta_{\hat I} - X_I\beta_I\|_2^2$:
$$\delta_0(y) = \frac{k}{n - k}\,\|X_{\hat I}\hat\beta_{\hat I} - y\|_2^2, \qquad (12)$$
where $\hat\beta_{\hat I}$ is the least-squares estimator of β on the subset $\hat I$ and $k = \#\hat I$ is the size of the subset $\hat I$.


Table 2: Procedures of model exploration and estimation.

Acronym     | Exploration            | Estimation
lasso-lasso | Lasso path             | Lasso
mcp-mcp     | MCP path               | MCP
lasso-ls    | Lasso path             | Least-squares
mcp-ls      | MCP path               | Least-squares
wide-ls     | Widen path             | Least-squares
back-ls     | Backward elimination   | Least-squares
all-ls      | Exhaustive exploration | Least-squares

Table 3: Model evaluation criteria.

Criterion       | Expression
δ0 (reg. meth.) | $\|X\hat\beta - y\|_2^2 + (2\widehat{df} - n)\hat\sigma^2$
δ0 (LS)         | $\frac{k}{n-k}\|X_{\hat I}\hat\beta_{\hat I} - y\|_2^2$
δ*_FW           | $\frac{k}{n-k+2}\|X_{\hat I}\hat\beta_{\hat I} - y\|^2 - \frac{2(k-4)(n-k)^2}{(n-k+4)(n-k+6)}\,\frac{\hat\sigma_{\hat I}^4}{\|X_{\hat I}\hat\beta_{\hat I}\|_2^2}$
δγ              | $\|X\hat\beta - y\|_2^2 + (2\widehat{df} - n)\hat\sigma^2 - \frac{2(p-2-2k(k+1)/p)(n-p)^2}{(n-p+4)(n-p+6)}\,\frac{\hat\sigma^4}{k\max_{j\le p}\{z_j^2 : \hat\beta_j = 0\} + \sum_{j\le p} z_j^2\, 1_{\{\hat\beta_j = 0\}}}$
AIC             | $\frac{1}{n}\|X\hat\beta - y\|_2^2 + \frac{2}{n}\widehat{df}\,\hat\sigma^2$
BIC             | $\frac{1}{n}\big(\|X\hat\beta - y\|_2^2 + (\log n)\,\widehat{df}\,\hat\sigma^2\big)$
LOOCV           | $\frac{1}{n}\sum_{i=1}^n \|x_i\hat\beta^{(i)} - y_i\|_2^2$

Proof. Here, since we estimate β by least-squares on the subset $\hat I$, β̂ equals the full LSE (which is also taken on subset $\hat I$). Moreover, $\mathrm{div}_Y \hat\mu_{LS} = \mathrm{trace}\big(X_{\hat I}(X_{\hat I}^t X_{\hat I})^{-1} X_{\hat I}^t\big) = k$. We thus obtain the result by applying Lemma 1 with $\mu = X_I\beta_I$, $\hat\mu = X_{\hat I}\hat\beta_{\hat I}$ and $d = k$ (the rank of $X_{\hat I}$).

As an improved estimator of loss, we consider the one defined in [7]:
$$\delta^*_{FW}(y) = \alpha\,\delta_0(y) - c_2\,\frac{\|y - X_{\hat I}\hat\beta_{\hat I}\|^4}{\gamma_2(X_{\hat I}\hat\beta_{\hat I})}, \qquad (13)$$
where $\alpha = (n - k)/(n - k + 2)$, $c_2 = 2(k - 4)/[(n - k + 4)(n - k + 6)]$, and $\gamma_2(X_{\hat I}\hat\beta_{\hat I}) = \|X_{\hat I}\hat\beta_{\hat I}\|^2$. The proof of its improvement over the unbiased estimator δ₀ is given in [7].
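For concreteness, a minimal sketch (ours) of δ0 of (12) and δ*_FW of (13), applied to candidate subsets with least-squares restricted to each subset, could read as follows (with 0-based indices, the true subset {1, 2, 5} of the introductory example is [0, 1, 4]).

```python
import numpy as np

def reduced_ls(X, y, subset):
    """Fitted values of the least-squares estimate restricted to `subset`."""
    XI = X[:, subset]
    beta_I = np.linalg.lstsq(XI, y, rcond=None)[0]
    return XI @ beta_I

def delta0_reduced(X, y, subset):
    """Unbiased loss estimate (12): k/(n-k) times the residual sum of squares."""
    n, k = len(y), len(subset)
    fit = reduced_ls(X, y, subset)
    return k / (n - k) * np.sum((fit - y) ** 2)

def delta_fw(X, y, subset):
    """Improved loss estimate (13) of Fourdrinier and Wells."""
    n, k = len(y), len(subset)
    fit = reduced_ls(X, y, subset)
    rss = np.sum((y - fit) ** 2)
    alpha = (n - k) / (n - k + 2)
    c2 = 2 * (k - 4) / ((n - k + 4) * (n - k + 6))
    return alpha * delta0_reduced(X, y, subset) - c2 * rss ** 2 / np.sum(fit ** 2)

# Compare two candidate subsets on synthetic data from the introductory example.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 8))
y = X @ np.array([3, 1.5, 0, 0, 2, 0, 0, 0.0]) + rng.standard_normal(60)
for subset in ([0, 1, 4], [0, 1, 2, 4]):
    print(subset, delta0_reduced(X, y, subset), delta_fw(X, y, subset))
```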

4 Simulation study

In this section, we test the procedures described in Section 2 and summarized in Table 2 on the example given in the introduction. The objective is to determine whether there is one or several best procedures. We generated 200 examples and iterated the process 10 times to obtain means and standard deviations of the frequency of selection of the true submodel (here I = {1, 2, 5}). The variance σ² of the noise ε is taken in {1, 9}, and the number n of observations in {20, 60}. Finally, we compared our model evaluation criteria δ₀, δ*_FW and δγ to AIC, BIC and Leave-One-Out Cross-Validation (LOOCV). Their respective formulas are recalled in Table 3. Table 4 displays the percentage of selection of the true subset I = {1, 2, 5} by each criterion when the noise follows a Gaussian distribution.
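The back-ls procedure of Table 2 can be sketched as follows (our reading): starting from the full set of variables, repeatedly drop the one whose removal increases the residual sum of squares least, so that the exploration yields a single nested sequence of subsets, then evaluate every visited subset with a criterion such as δ0 of (12) and keep the minimizer.

```python
import numpy as np

def rss(X, y, subset):
    """Residual sum of squares of the least-squares fit restricted to `subset`."""
    XI = X[:, subset]
    beta_I = np.linalg.lstsq(XI, y, rcond=None)[0]
    return np.sum((y - XI @ beta_I) ** 2)

def backward_elimination(X, y, criterion):
    """Drop, at each step, the variable whose removal increases the RSS least;
    return the visited subset that minimizes `criterion`."""
    p = X.shape[1]
    subset = list(range(p))
    best_subset, best_score = list(subset), criterion(X, y, subset)
    while len(subset) > 1:
        candidates = [[j for j in subset if j != drop] for drop in subset]
        subset = min(candidates, key=lambda s: rss(X, y, s))
        score = criterion(X, y, subset)
        if score < best_score:
            best_subset, best_score = list(subset), score
    return best_subset

# delta0 of Lemma 3 as the evaluation criterion: k/(n-k) * RSS.
delta0 = lambda X, y, s: len(s) / (len(y) - len(s)) * rss(X, y, s)

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 8))
y = X @ np.array([3, 1.5, 0, 0, 2, 0, 0, 0.0]) + rng.standard_normal(60)
print(backward_elimination(X, y, delta0))   # 0-based; ideally [0, 1, 4], i.e. {1, 2, 5}
```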


Table 4: Frequency (%) of selection of the true submodel under the Gaussian distribution (standard deviation).

σ = 3, n = 20
            | δ0         | δ*_FW      | δγ         | True loss  | AIC        | BIC        | LOOCV
lasso-lasso | 1.2 (0.8)  | 1.8 (0.9)  | 1.6 (1.0)  | 2.4 (0.7)  | 1.2 (0.8)  | 1.6 (0.9)  | 0.0 (0.2)
mcp-mcp     | 2.3 (1.3)  | 3.0 (1.0)  | 2.8 (1.4)  | 4.2 (1.1)  | 2.3 (1.3)  | 2.9 (1.3)  | 0.0 (0.5)
lasso-ls    | 0.0 (0.1)  | 0.0 (0.2)  | 0.0 (0.2)  | 5.4 (1.5)  | 0.0 (0.6)  | 1.1 (0.7)  | 0.0 (0.7)
mcp-ls      | 0.0 (0.1)  | 0.0 (0.3)  | 0.0 (0.3)  | 7.7 (1.5)  | 1.1 (0.7)  | 2.0 (0.5)  | 0.0 (1.1)
wide-ls     | 0.0 (0.2)  | 0.0 (0.3)  | 0.0 (0.8)  | 10.3 (3.1) | 3.1 (1.2)  | 3.3 (1.1)  | 3.2 (1.2)
back-ls     | 0.0 (0.3)  | 0.0 (0.3)  | 0.0 (0.3)  | 3.7 (2.0)  | 1.4 (1.1)  | 1.4 (1.1)  | 1.3 (0.8)
all-ls      | 0.0 (0.2)  | 0.0 (0.2)  | 0.0 (0.4)  | 76.9 (2.8) | 5.7 (0.9)  | 3.8 (1.1)  | 5.5 (1.6)

σ = 1, n = 20
            | δ0         | δ*_FW      | δγ         | True loss   | AIC        | BIC        | LOOCV
lasso-lasso | 0.0 (0.0)  | 0.0 (0.3)  | 0.0 (0.0)  | 0.0 (0.6)   | 0.0 (0.0)  | 0.0 (0.2)  | 0.0 (0.0)
mcp-mcp     | 0.0 (0.2)  | 0.0 (0.4)  | 0.0 (0.2)  | 1.6 (0.7)   | 0.0 (0.2)  | 0.0 (0.3)  | 0.0 (0.0)
lasso-ls    | 1.7 (0.9)  | 1.7 (0.9)  | 1.8 (0.9)  | 2.9 (1.2)   | 0.0 (0.4)  | 1.2 (0.6)  | 0.0 (0.3)
mcp-ls      | 4.7 (1.2)  | 4.6 (1.3)  | 4.8 (1.2)  | 7.3 (1.6)   | 2.1 (1.1)  | 3.8 (1.0)  | 0.0 (0.6)
wide-ls     | 9.0 (1.4)  | 9.0 (1.4)  | 9.0 (1.3)  | 9.6 (1.3)   | 5.7 (1.1)  | 7.0 (0.9)  | 5.7 (0.8)
back-ls     | 77.0 (1.7) | 77.1 (1.9) | 77.6 (1.9) | 83.9 (2.4)  | 72.3 (2.4) | 72.5 (2.3) | 32.7 (3.7)
all-ls      | 79.4 (3.0) | 79.1 (3.4) | 82.5 (3.6) | 100.0 (0.0) | 43.2 (3.5) | 60.2 (3.5) | 34.2 (4.5)

σ = 1, n = 60
            | δ0         | δ*_FW      | δγ         | True loss   | AIC        | BIC        | LOOCV
lasso-lasso | 4.6 (0.7)  | 9.9 (1.4)  | 5.0 (0.9)  | 5.5 (1.2)   | 4.6 (0.7)  | 6.8 (1.2)  | 0.0 (0.0)
mcp-mcp     | 8.2 (2.1)  | 15.7 (1.9) | 9.1 (2.2)  | 9.9 (1.5)   | 8.2 (2.1)  | 12.0 (1.8) | 0.0 (0.0)
lasso-ls    | 24.8 (2.5) | 24.8 (2.5) | 24.9 (2.8) | 25.7 (2.9)  | 15.3 (2.7) | 22.0 (3.0) | 0.0 (0.0)
mcp-ls      | 32.0 (2.9) | 32.0 (2.9) | 32.2 (2.7) | 33.6 (3.3)  | 19.4 (3.3) | 29.1 (3.7) | 0.0 (0.3)
wide-ls     | 49.4 (3.1) | 49.4 (3.1) | 49.3 (3.2) | 49.8 (3.0)  | 27.9 (2.8) | 43.0 (2.9) | 27.1 (2.9)
back-ls     | 89.2 (2.0) | 89.2 (2.0) | 89.2 (2.0) | 89.2 (2.0)  | 77.3 (2.5) | 79.3 (2.2) | 50.6 (2.8)
all-ls      | 94.8 (1.0) | 94.9 (1.0) | 95.8 (1.1) | 100.0 (0.0) | 43.1 (2.8) | 79.4 (2.3) | 40.0 (2.9)

The first part of the table corresponds to the most difficult problem, with high variance and a small number of observations. In this setting, the results are bad whatever the procedure we use. Setting apart the procedure using exhaustive exploration, which is not scalable, the best results are obtained for the procedure wide path/LS/BIC, with a low 3.3% recovery of the true subset.

In the second problem, with smaller variance, we can see that none of the procedures using a regularization method alone (lasso-lasso and mcp-mcp) can recover the true subset, despite the correction of the bias in MCP. Results are slightly better when combining a regularization path with a least-squares estimator. Here, we see the slight advantage of MCP's path over lasso's, doubling the performance. However, MCP's path still misses the true subset around 90% of the time. The best procedure for this problem combines an exploration by Backward Elimination, least-squares estimation, and an evaluation by one of our estimators of loss, recovering the true subset around 80% of the time. It performs best when evaluated by our loss estimator δγ, and the results are almost as good as with an exhaustive exploration.

For the last and easiest problem, with small variance and a larger number of observations, the conclusion is basically the same as earlier. However, it is worth noting that the regularization paths perform better than before, with even better results for the widen path. But again, the best recovery is obtained by backward elimination, least-squares estimation, and evaluation by loss estimators.

Table 5 displays the same measure under a Student distribution with 5 degrees of freedom. The main difference occurs for the difficult problem. Indeed, the results are much better with backward elimination, reaching 20% recovery against 3% in the Gaussian case.

5 Discussion

The variable selection problem has been analyzed and decomposed into three steps: model exploration (the way candidate models are proposed), parameter estimation (the way the parameters of a given model are estimated) and evaluation (the way a candidate model's statistical risk is estimated). For each of these three steps, different strategies have been proposed and tested. For exploration, we compared classical backward elimination to the popular regularization path strategies (generated by lasso or MCP) and to an improvement based on a collection of randomly generated paths. For estimation, we compared the regularization path estimator with its least-squares counterpart (LS).


Table 5: Frequency (%) of selection of the true submodel under the Student distribution (standard deviation).

σ = 3, n = 20
            | δ0         | δ*_FW      | δγ         | True loss  | AIC        | BIC        | LOOCV
lasso-lasso | 0.0 (0.4)  | 0.0 (0.5)  | 0.0 (0.6)  | 1.7 (0.7)  | 0.0 (0.4)  | 1.0 (0.6)  | 0.0 (0.0)
mcp-mcp     | 1.1 (0.7)  | 1.5 (0.6)  | 1.5 (0.7)  | 2.4 (1.1)  | 1.1 (0.7)  | 1.6 (0.7)  | 0.0 (0.1)
lasso-ls    | 0.0 (0.5)  | 0.0 (0.4)  | 0.0 (0.4)  | 3.7 (1.4)  | 0.0 (0.4)  | 0.0 (0.5)  | 0.0 (0.0)
mcp-ls      | 1.1 (0.7)  | 1.1 (0.8)  | 1.4 (0.8)  | 5.4 (1.0)  | 1.3 (0.8)  | 1.7 (0.7)  | 0.0 (0.7)
wide-ls     | 2.2 (0.8)  | 2.4 (1.0)  | 2.5 (0.9)  | 8.0 (1.3)  | 3.0 (1.1)  | 3.5 (0.8)  | 3.3 (1.0)
back-ls     | 14.0 (2.3) | 14.4 (2.2) | 14.5 (2.3) | 23.9 (3.3) | 19.7 (2.9) | 19.9 (2.8) | 9.4 (1.6)
all-ls      | 12.3 (2.6) | 12.7 (2.2) | 13.2 (2.1) | 72.9 (2.3) | 16.1 (1.2) | 20.3 (2.5) | 13.2 (1.6)

σ = 1, n = 20
            | δ0         | δ*_FW      | δγ         | True loss  | AIC        | BIC        | LOOCV
lasso-lasso | 0.0 (0.3)  | 0.0 (0.4)  | 0.0 (0.4)  | 0.0 (0.6)  | 0.0 (0.3)  | 0.0 (0.4)  | 0.0 (0.0)
mcp-mcp     | 0.0 (0.4)  | 0.0 (0.7)  | 0.0 (0.5)  | 1.7 (0.7)  | 0.0 (0.4)  | 0.0 (0.7)  | 0.0 (0.0)
lasso-ls    | 0.0 (0.4)  | 0.0 (0.4)  | 0.0 (0.6)  | 2.1 (0.8)  | 0.0 (0.4)  | 0.0 (0.5)  | 0.0 (0.2)
mcp-ls      | 3.7 (1.1)  | 3.8 (1.0)  | 3.9 (1.2)  | 6.3 (1.5)  | 1.7 (0.9)  | 3.0 (1.1)  | 0.0 (0.8)
wide-ls     | 7.0 (1.8)  | 7.1 (1.9)  | 7.1 (2.0)  | 9.1 (2.1)  | 5.8 (1.6)  | 6.9 (1.6)  | 5.8 (1.6)
back-ls     | 60.4 (2.5) | 60.7 (2.5) | 60.8 (2.3) | 69.5 (2.1) | 59.3 (2.4) | 59.8 (2.3) | 25.6 (1.8)
all-ls      | 63.5 (3.9) | 63.2 (3.9) | 67.1 (3.4) | 95.8 (1.7) | 36.2 (2.9) | 49.7 (3.2) | 29.7 (2.9)

σ = 1, n = 60
            | δ0         | δ*_FW      | δγ         | True loss  | AIC        | BIC        | LOOCV
lasso-lasso | 3.2 (1.0)  | 7.8 (1.7)  | 3.6 (1.0)  | 4.5 (1.7)  | 3.2 (1.0)  | 5.3 (1.8)  | 0.0 (0.0)
mcp-mcp     | 7.5 (1.8)  | 15.1 (2.7) | 8.1 (2.2)  | 8.7 (2.2)  | 7.5 (1.8)  | 11.6 (2.8) | 0.0 (0.0)
lasso-ls    | 21.1 (3.2) | 21.1 (3.2) | 22.3 (3.0) | 26.0 (2.6) | 14.9 (1.3) | 21.6 (2.6) | 0.0 (0.0)
mcp-ls      | 32.3 (4.2) | 32.3 (4.2) | 33.9 (4.6) | 38.1 (4.0) | 20.6 (2.4) | 31.5 (3.9) | 0.0 (0.6)
wide-ls     | 41.4 (4.1) | 41.4 (4.1) | 44.0 (4.1) | 50.5 (4.3) | 27.2 (3.6) | 42.0 (3.8) | 26.2 (3.5)
back-ls     | 71.6 (3.3) | 71.8 (3.2) | 75.5 (3.4) | 83.2 (3.3) | 71.8 (4.5) | 73.7 (4.3) | 48.2 (2.8)
all-ls      | 68.2 (2.1) | 68.2 (2.1) | 69.2 (1.8) | 98.6 (0.5) | 41.5 (2.5) | 73.8 (2.9) | 39.0 (3.1)

The empirical evidence reported here leads us to some remarks. Regarding exploration: backward elimination outperforms the regularization paths. Regarding estimation: LS as a post-selection estimator brings better results than the regularization methods, even MCP, which was designed to correct the lasso's bias; this fact is verified in all our experiments. Regarding evaluation: with a low noise level (σ = 1), δγ is the best estimator. Surprisingly, it improves on δ0 on average but not significantly in the most favorable cases. BIC remains a good criterion, especially in the Student case. Cross-validation performs poorly.

References

[1] H. Akaike. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, volume 1, pages 267–281. Akademiai Kiado, 1973.

[2] Anonymous. Suppressed for anonymity, 2012.

[3] L. Breiman. Heuristics of instability and stabilization in model selection. The Annals of Statistics, 24(6):2350–2383, 1996.

[4] G. Chaslot, J.-T. Saito, B. Bouzy, J. Uiterwijk, and H. J. van den Herik. Monte-Carlo strategies for computer Go. In Proceedings of the 18th BeNeLux Conference on Artificial Intelligence, Namur, Belgium, pages 83–91, 2006.

[5] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–451, 2004.

[6] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.

[7] D. Fourdrinier and M. T. Wells. Comparaisons de procédures de sélection d'un modèle de régression: une approche décisionnelle. Comptes rendus de l'Académie des sciences, Série 1, Mathématique, 319(8):865–870, 1994.


[8] D. Fourdrinier and M. T. Wells. Estimation of a loss function for spherically symmetric distributions in the general linear model. The Annals of Statistics, 23(2):571–592, 1995.

[9] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning. Springer, 2008.

[10] I. Guyon, A. Saffari, G. Dror, and G. Cawley. Model selection: Beyond the Bayesian/frequentist divide. The Journal of Machine Learning Research, 11:61–87, 2010.

[11] R. R. Hocking. A Biometrics invited paper. The analysis and selection of variables in linear regression. Biometrics, 32(1):1–49, 1976.

[12] I. Johnstone. On inadmissibility of some unbiased estimates of loss. Statistical Decision Theory and Related Topics, 4(1):361–379, 1988.

[13] D. Kelker. Distribution theory of spherical distributions and a location-scale parameter generalization. Sankhyā: The Indian Journal of Statistics, Series A, 32(4):419–430, 1970.

[14] C. L. Mallows. Some comments on Cp. Technometrics, 15(4):661–675, 1973.

[15] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.

[16] C. M. Stein. Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, 9(6):1135–1151, 1981.

[17] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267–288, 1996.

[18] V. N. Vapnik and A. Y. Chervonenkis. On uniform convergence of the frequencies of events to their probabilities. Teoriya Veroyatnostei i ee Primeneniya, 16(2):264–279, 1971.

[19] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.

[20] C.-H. Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.

[21] H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.
