A trade-off between Gaussian and worst case analysis for model selection

Aurélie Boisbunon, Stéphane Canu, Dominique Fourdrinier
Université de Rouen and INSA de Rouen
Avenue de l'Université - BP 12
76801 Saint-Étienne-du-Rouvray Cedex

Abstract

This work proposes a new complexity-penalized procedure for model selection. It covers all model selection phases, from model exploration to model evaluation, and is designed to have good statistical and computational properties. It relies on the use of spherically symmetric distributions, which offer an interesting trade-off between the Gaussian assumption and the robust alternative based on a worst-case analysis. Our selection procedure consists in an almost unbiased estimator of the model parameter obtained with the Minimax Concave Penalty at a reasonable computational cost. Model exploration is then performed through an efficient regularization path algorithm. Finally, for model evaluation, a new penalty is introduced based on an improved loss estimator derived within the spherically symmetric framework. Simulation results illustrate the performance of this procedure in terms of both selection and prediction error.

Keywords: Model selection, linear model, loss estimation, spherically symmetric distributions, regularization path, MCP, dependence.

1 Introduction

In the field of statistical learning theory, the problem of model selection consists in comparing several possible submodels and determining which one is the "best". Following the idea exposed in Guyon et al. (2010), this definition of model selection implies at least three levels of inference: (a) the definition of a way to explore the possible submodels, (b) the estimation of the parameters for each submodel, and (c) the evaluation of the quality of these estimations, that is to say what is meant by the "best" model. Each of these levels should be treated carefully and as part of a whole.

The way level (a) is performed depends on the size of the dataset. For small datasets, exhaustive exploration gives us the assurance of not missing important information. However, as soon as we deal with larger scales, exhaustivity is no longer tractable. Hence comes the need to wisely reduce the number of submodels to explore. An interesting way to explore fewer models is to use regularization paths. Regarding the estimation of the model parameters and their evaluation, levels (b) and (c), one must keep in mind what the objective is: selection alone, a good estimation in order to get a hint of the mechanism behind the data, or else a good prediction. The choice of the estimator should thus be consistent with the evaluation criterion reflecting the objective; in other words, it should satisfy some oracle property. In this work, we are interested in good prediction, and consequently we propose the following model selection procedure: explore the submodels thanks to a regularization path, estimate the model parameters by the Minimax Concave Penalty (Zhang, 2010), which has a low bias compared to other sparse regularized methods, and evaluate with estimators of the prediction loss. This procedure also reflects our objective of solving model selection in reasonable computational time, in a similar spirit to what is proposed in Agarwal et al. (2011).

Moreover, for the evaluation level, most of the criteria in the literature rely on the assumption of independence between the components of the model noise. Indeed, the noise is often assumed to be Gaussian, or to be drawn from another specified distribution. It is the case for Cp (Mallows, 1973), Akaike's Information Criterion (Akaike, 1973) and the Bayesian Information Criterion (Schwarz, 1978), but also for later works such as the Risk Inflation Criterion of Foster and George (1994), Burnham and Anderson (2002), and the AICc of Hurvich and Tsai (1989), to name just a few. On the contrary, some authors, like Vapnik and Chervonenkis (1971), do not make any distributional assumption at all and propose probability bounds on the prediction error. However, in addition to the difficult computation of such bounds, they can be quite loose in practice (see Friedman et al., 2008). Our work stands as an intermediate between full specification of the distribution and no distributional assumption. This is performed by assuming the noise to be multivariate spherically symmetric; no further assumption is needed on the form of the distribution. In this way, we relax the independence property and gain some distributional robustness, while the spherical assumption remains strong enough to tighten the bounds on probabilities. Note that the family of spherical laws includes the well-known Student and Kotz distributions, as well as the Gaussian distribution.

Section 2 defines the assumptions and notations used in the paper and specifies our choices. Then, Section 3 presents some notions of loss estimation theory and its derivation as an evaluation criterion for model selection. Finally, a simulation study is presented in Section 4, where we compare our criteria to the classical AIC/Cp, BIC and leave-one-out cross-validation.

2 Context and notations

2.1 Framework

In this article, we restrict our study to the linear regression framework. Given the observations y = (y_1, ..., y_n) and X = (x_1, ..., x_p), the latter being the fixed design matrix composed of p explanatory variables x_j ∈ R^n, with p < n, we aim at estimating the regression coefficient β ∈ R^p such that

    Y = Xβ + ε,    (1)

where ε ∈ R^n is a noise vector. Frequently in the literature, the components of ε are assumed to be n independent and identically distributed realizations of a (usually Gaussian) random variable. In our work, we assume instead that ε is an n-variate random vector drawn from an unknown distribution P with mean the n-dimensional zero vector, with no assumption of independence between ε_i and ε_j, i ≠ j. We restrict our study to the case where P belongs to the family of spherically symmetric distributions, and we write ε ~ SS_n(0). In other words, the density of P, when it exists, is of the form t ↦ f(‖t‖₂²) for some function f from R_+ to R_+. According to the properties of spherical symmetry, Y also follows a spherically symmetric distribution with mean µ = Xβ, that is to say Y ~ SS_n(Xβ). Figure 1 shows some examples of such densities, namely the multivariate Gaussian distribution, the multivariate Student distribution, and the Kotz distribution. The Gaussian case is the only spherical law satisfying the independence property. See Kelker (1970) for a review of this family of distributions.

Figure 1: Examples of spherically symmetric densities (Gaussian, Student and Kotz).
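To make the family in Figure 1 concrete, spherically symmetric noise can be simulated by pairing a direction drawn uniformly on the sphere with an independent radius. The sketch below is ours (not taken from the paper) and assumes NumPy only; radial_noise and its radius_sampler argument are hypothetical helpers illustrating the general construction, and the Student case uses the classical Gaussian-over-chi representation.

    import numpy as np

    rng = np.random.default_rng(0)

    def gaussian_noise(n):
        # Gaussian: the only spherically symmetric law with independent components.
        return rng.standard_normal(n)

    def student_noise(n, nu=5):
        # Multivariate Student with identity scale: a Gaussian vector divided by an
        # independent chi factor; spherically symmetric but with dependent coordinates.
        return rng.standard_normal(n) / np.sqrt(rng.chisquare(nu) / nu)

    def radial_noise(n, radius_sampler):
        # Hypothetical generic construction of a member of SS_n(0): a uniform
        # direction on the sphere scaled by a radius drawn from any law on R_+.
        g = rng.standard_normal(n)
        return radius_sampler() * g / np.linalg.norm(g)

    eps = student_noise(40)   # one noise vector of the kind used in Section 4
    print(round(float(np.mean(eps)), 3), round(float(np.std(eps)), 3))

Only the radius law changes from one member of the family to another, which is precisely why the independence of the coordinates is lost outside the Gaussian case.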

2.2 Multi-level inference

In model (1), when the number p of variables in X is large, we further assume that only a subset I of these variables is relevant to explain Y. Hence, the problem is to simultaneously find those relevant variables x_j and estimate their corresponding coefficients. This problem is complicated to solve and can be decomposed into subproblems, as presented in Guyon et al. (2010): (a) the exploration of the submodels, (b) the estimation of the parameters (here β) for each submodel, and (c) the evaluation of the quality of these estimations. The literature on model selection methods is abundant, but most of these methods only solve one or two of these subproblems. We believe we can gain performance by treating these three subproblems as parts of a whole and by making wise choices on how to perform them according to the objective (selection, estimation, prediction). We now say a few words about some possible choices for each level of inference. Note however that this paper is not a review of model selection methods.

As said earlier, we assume that p can be large. This fact immediately discards the use of exhaustive exploration for level (a). Early works considered ranking with statistical significance tests for each variable (Student t-tests) or for a group of variables (Fisher F-tests). This leads to methods such as Forward Selection or Backward Elimination. They were very appealing at first since they need p(p + 1)/2 comparisons, compared to the 2^p possible subsets of {1, ..., p}. However, they require independence between the components of the noise ε, and it is not clear whether there exist conditions under which they can actually recover the "right" subset of variables (see Hocking, 1976, for a good review). Another approach is the use of regularization paths. A regularization path can generally be computed in O(p), and there have been many studies on the conditions of recovery of the right subset for some of these methods, like the Least Angle Regression algorithm (Efron et al., 2004), which computes the lasso's regularization path (Tibshirani, 1996). Among these studies we can cite Wainwright (2009). More recent algorithms try to compute regularization paths even faster than O(p) for very large p (see Morioka and Satoh, 2011). Regularization path algorithms are associated to an estimator of β, and regularization methods often solve levels (a) and (b) at the same time. Such methods generally consist in the minimization of the sum of squared residuals regularized by some penalty ρ on the coefficient β, measuring the complexity of the solution. They are formalized as follows:

    min_β (1/2)‖y − Xβ‖₂² + λρ(β),   λ ≥ 0.    (2)
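As an illustration of how levels (a) and (b) can be solved jointly by a path algorithm, the lasso instance of problem (2) can be traced with LARS at roughly the cost of a single least-squares fit. The sketch below assumes scikit-learn is available and is only illustrative: the procedure studied in this paper follows the MCP path rather than the lasso path.

    import numpy as np
    from sklearn.linear_model import lars_path

    rng = np.random.default_rng(1)
    n, p = 40, 5
    X = rng.standard_normal((n, p))
    beta = np.array([2.0, 0.0, 0.0, 4.0, 0.0])
    y = X @ beta + rng.standard_normal(n)

    # Full lasso path: `alphas` are the breakpoints of lambda and `coefs` holds one
    # coefficient vector per breakpoint, so the candidate submodels are read off
    # directly from the supports along the path.
    alphas, active, coefs = lars_path(X, y, method="lasso")
    for lam, b in zip(alphas, coefs.T):
        support = np.flatnonzero(np.abs(b) > 1e-12).tolist()
        print(f"lambda = {lam:8.4f}   support = {support}")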

The most popular method is surely by far the Least Absolute Shrinkage and Selection Operator, or lasso, proposed by Tibshirani (1996), and also known as soft thresholding (Donoho and Johnstone, 1994) in the signal processing community. We can also cite other methods such as the Smoothly Clipped Absolute Deviation (SCAD) by Fan and Li (2001) and the Minimax Concave Penalty (MCP) by Zhang (2010), among others. The penalties of these methods are of the following form:

    ρ_lasso(β) = Σ_{j=1}^p |β_j|,    (3)

    ρ_SCAD(β) = Σ_{j=1}^p ∫_0^{|β_j|} [ 1{t ≤ λ} + ((aλ − t)_+ / ((a − 1)λ)) 1{t > λ} ] dt,   a > 2,    (4)

    ρ_MCP(β) = Σ_{j=1}^p ∫_0^{|β_j|} (1 − t/(aλ))_+ dt
             = Σ_{j=1}^p [ (|β_j| − β_j²/(2aλ)) 1{|β_j| ≤ aλ} + (aλ/2) 1{|β_j| > aλ} ],   a > 1,    (5)

where 1{t ∈ E} = 1 if t ∈ E and 0 otherwise, and t_+ = max(t, 0). Figure 2 shows the shape of these penalty functions.
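For reference, the penalties (3)-(5) can be evaluated directly. The sketch below is ours (NumPy only); it approximates the SCAD integral numerically, uses the closed form of the MCP integral, and takes a = 2.5 for SCAD and a = 2 for MCP as illustrative values.

    import numpy as np

    def rho_lasso(beta):
        # Equation (3): sum of absolute values.
        return float(np.sum(np.abs(beta)))

    def rho_scad(beta, lam, a=2.5):
        # Equation (4), a > 2: simple quadrature of the SCAD derivative.
        def deriv(t):
            return np.where(t <= lam, 1.0,
                            np.maximum(a * lam - t, 0.0) / ((a - 1) * lam))
        total = 0.0
        for b in np.atleast_1d(beta):
            ab = abs(float(b))
            if ab > 0.0:
                t = np.linspace(0.0, ab, 2001)
                total += float(np.mean(deriv(t)) * ab)
        return total

    def rho_mcp(beta, lam, a=2.0):
        # Equation (5), a > 1: closed form of the integral of (1 - t/(a*lam))_+.
        b = np.abs(np.atleast_1d(beta))
        return float(np.sum(np.where(b <= a * lam,
                                     b - b**2 / (2 * a * lam), a * lam / 2)))

    beta = np.array([0.5, -2.0, 0.0, 4.0])
    print(rho_lasso(beta), rho_scad(beta, lam=1.0), rho_mcp(beta, lam=1.0))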

Figure 2: Penalty functions for the lasso (left panel), SCAD (middle panel) and MCP (right panel), with a = 2.5 for SCAD and a = 2 for MCP.

Now, as mentioned in the introduction, the objective of the present work is good prediction, that is to say we wish to minimize the error ‖Xβ̂ − Xβ‖².

Hence, as defined in Fan and Li (2001), a good estimator of β should have a low bias in order to comply with an oracle property. Indeed, assuming the true submodel I* belongs to the set I of submodels defined by level (a), β̂ should have minimum prediction error for submodel I*:

    β̂_{I*} = arg min_{I ∈ I} ‖Xβ̂_I − Xβ‖².    (6)

Among the three estimators presented above, MCP is the one with the lowest bias, as can be seen in Figure 3 for the case where X is orthogonal. This is the reason why we choose to estimate β with MCP. Note however that there might be other good choices, such as the Elastic-Net (Zou and Hastie, 2005) or the Adaptive Lasso (Zou, 2006), and further investigation is needed to compare their bias.
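The bias behaviour illustrated in Figure 3 follows from the standard closed-form thresholding rules of the three penalized problems under an orthonormal design; the formulas are classical (Fan and Li, 2001; Zhang, 2010), while the code itself is our sketch, with a defaults matching the figure. For |z| larger than aλ, both SCAD and MCP return the least-squares value unchanged, whereas the lasso keeps a constant bias λ.

    import numpy as np

    def lasso_threshold(z, lam):
        # Soft thresholding: bias of exactly lam for every large coefficient.
        return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

    def scad_threshold(z, lam, a=2.5):
        az = np.abs(z)
        soft = np.sign(z) * np.maximum(az - lam, 0.0)
        mid = ((a - 1) * z - np.sign(z) * a * lam) / (a - 2)
        return np.where(az <= 2 * lam, soft, np.where(az <= a * lam, mid, z))

    def mcp_threshold(z, lam, a=2.0):
        az = np.abs(z)
        soft = np.sign(z) * np.maximum(az - lam, 0.0)
        return np.where(az <= a * lam, soft * a / (a - 1), z)

    z = np.linspace(-3.0, 3.0, 13)   # least-squares coefficients
    lam = 1.0
    print(lasso_threshold(z, lam))   # biased by lam for |z| > lam
    print(mcp_threshold(z, lam))     # equal to z as soon as |z| > a * lam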


Figure 3: Estimators of the lasso (left panel), SCAD with a = 2.5 (middle panel) and MCP with a = 2 (right panel) with respect to the least-squares estimator in the orthogonal design case (XᵗX = I_p).

Finally, the evaluation level (c) consists in finding the minimizer in the right-hand side of (6). This can be performed by estimating either the prediction error ‖Xβ̂ − Xβ‖², its risk, or a bound on its risk, and then minimizing the corresponding estimator. Well-known criteria estimating the prediction error or its risk are Cp (Mallows, 1973), AIC (Akaike, 1973), BIC (Schwarz, 1978), and their respective improvements. Their performance has been established under the Gaussian assumption with i.i.d. samples, or at least under some known distribution, but little is known in theory outside this context. On the contrary, the Vapnik-Chervonenkis bound (Vapnik and Chervonenkis, 1971) is independent of the noise distribution and considers the worst-case scenario. This bound is hard to compute in practice and can be quite loose, as observed in Friedman et al. (2008). Our work stands as a trade-off between those two approaches: we try to estimate the prediction error without specifying the form of the distribution, only restricting it to the family of spherically symmetric distributions. This framework allows us to prove that some of the classical estimators, such as Cp and AIC, can easily be extended to the spherical case. In the following section, we present some notions of loss estimation theory for spherically symmetric distributions and derive estimators of the prediction loss. For the sake of generality, we consider any regularization method satisfying (2).

3 Loss estimation for model evaluation

The original context in which loss estimation was developed is not so far from model selection. It has been used in the estimation of the mean µ ∈ R^n of a multivariate random variable Y. In particular, Stein (1955) proved, through the analogous problem of risk estimation, the inadmissibility of the classical Maximum Likelihood Estimator (MLE) for a multivariate Gaussian distribution when the dimension of Y is greater than or equal to 3. He then proved, in the same way, the improvement of the James-Stein estimator over the MLE, in James and Stein (1961) and Stein (1981). Later works generalized this approach to spherically symmetric distributions; see for instance Brandwein and Strawderman (1990) and Fourdrinier and Wells (1995).

In the context of linear regression, the mean vector is µ = Xβ. We evaluate the quality of an estimator β̂ of the parameter β by the quadratic loss L(β, β̂) = ‖Xβ̂ − Xβ‖², which is also known as the prediction error. A good estimator β̂ should thus have a low quadratic loss. As β is unknown, so is L(β, β̂); hence the need to estimate the latter as well, based on the observed data. Following Johnstone's definition in Johnstone (1988), δ0 is an unbiased estimator of the loss if it verifies

    Eµ[δ0] = Eµ[L(β, β̂)],    (7)

where Eµ is the expectation under the distribution of Y, and assuming both expectations exist. This allows us to define an unbiased estimator of the loss in a similar way as what is done in Mallows (1973), except that we generalize it to spherically symmetric distributions.

Lemma 1. Let Y ~ P_n(Xβ) and y be an observation of Y. If P belongs to the family of spherically symmetric distributions (P ⊂ SS), then the following estimator is an unbiased estimator of ‖Xβ̂ − Xβ‖²:

    δ0(y) = ‖Xβ̂ − y‖² + (2 div_y(Xβ̂) − n) ‖y − Xβ̂_LS‖²/(n − p),    (8)

where β̂ is the chosen estimator of β, β̂_LS is the least-squares estimator of β, and div_y x = trace[(∂x_i/∂y_j)_{i,j=1}^n] is the divergence of x with respect to y.

Proof. The proof is detailed in Appendix A. It consists in developing the expectation Eµ[‖Xβ̂ − Xβ‖²] and using a Stein-type identity for spherically symmetric distributions.

Note that the estimator δ0 is similar to AIC and Cp up to a constant, when we take σ̂² = ‖y − Xβ̂_LS‖²/(n − p) as an estimator of the variance and div_y(Xβ̂) as an estimator of the number of degrees of freedom. The main difference is that here we derived δ0 for any spherically symmetric distribution.
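In practice, δ0 in (8) is straightforward to compute once an estimate of div_y(Xβ̂) is available. The sketch below is ours; it plugs in the number of selected variables for df̂ = div_y(Xβ̂), which is a common choice for sparse rules such as the lasso but an assumption here rather than a statement of the paper.

    import numpy as np

    def delta0(y, X, beta_hat, df_hat):
        # Unbiased loss estimator of equation (8); df_hat stands for div_y(X beta_hat).
        n, p = X.shape
        beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
        sigma2_hat = float(np.sum((y - X @ beta_ls) ** 2)) / (n - p)
        return float(np.sum((X @ beta_hat - y) ** 2)) + (2.0 * df_hat - n) * sigma2_hat

    # Usage sketch with a placeholder sparse estimate (not the MCP fit of the paper):
    rng = np.random.default_rng(2)
    X = rng.standard_normal((40, 5))
    beta = np.array([2.0, 0.0, 0.0, 4.0, 0.0])
    y = X @ beta + rng.standard_normal(40)
    beta_hat = np.array([1.9, 0.0, 0.0, 4.1, 0.0])
    print(delta0(y, X, beta_hat, df_hat=np.count_nonzero(beta_hat)))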

In the normal case, Johnstone (1988) proved the inadmissibility of the unbiased estimator of loss of the MLE and of the James-Stein estimator, and proposed a way to improve on it. Following his work, we define what we mean by improvement.

Definition 1. Let Y ~ P_n(Xβ) and y be an observation of Y. Given δ0 and δ two estimators of the loss ‖Xβ̂ − Xβ‖², if P ⊂ SS, then δ is said to improve on δ0 in terms of quadratic risk if it verifies, provided both expectations exist,

    Eµ[(δ(Y) − ‖Xβ̂ − Xβ‖²)²] ≤ Eµ[(δ0(Y) − ‖Xβ̂ − Xβ‖²)²]  for all β ∈ R^p,    (9)

where Eµ is the expectation with respect to Y, and if there exists at least one value of β for which the inequality is strict.

In our context of model selection, this definition of improvement suggests that if an estimator of the loss is closer to the true loss, then we can hope that its minimum will be closer to the true loss minimum, and hence that the selection will be closer to the oracle. In Fourdrinier and Wells (1995), the authors propose an improvement of the form

    δγ(y) = δ0(y) − ‖y − Xβ̂_LS‖⁴ γ(Xβ̂_LS),    (10)

γ(·) being a correction function such that γ(t) ≤ c1/‖t‖², with c1 a constant. This form of improvement can be seen as a data-driven penalty. For model evaluation, γ(Xβ̂_LS) should not be constant over all estimators β̂ of β, as it would lead to the same model selection as δ0. We propose the following correction function:

    γ(Xβ̂_LS) = c2 [ k max_{j≤p}{z_j² : |z_j| ≤ λ} + Σ_{j≤p} z_j² 1{|z_j| ≤ λ} ]⁻¹,    (11)

where z_j = (q_j)ᵗ Xβ̂_LS, q_j is the j-th column of the matrix Q computed by the QR factorization of X, k is the number of variables selected by the regularization method, λ is its corresponding hyperparameter in equation (2), and c2 is a constant. The correction γ in (11) is a function of both the number of selected variables and the information contained in the non-selected variables. This reflects our belief that the non-selected variables can help evaluate how good the selection is. In order to improve on δ0, c2 should be chosen such that (9) holds, and thus should verify the following lemma.

Lemma 2. Let Y ~ P_n(Xβ) and y be an observation of Y. If P ⊂ SS, then the estimator δγ in (10) with the correction function (11) improves on the unbiased estimator δ0 if c2 verifies

    0 ≤ |c2| ≤ (4/(n − p + 4)) { (p − 2)/(n − p + 6)
        − (2/(n − p)) Eµ[(n − 2df̂) ‖Y − Xβ̂_LS‖⁶/d(Z)] / Eµ[‖Y − Xβ̂_LS‖⁸/d²(Z)]
        − (2/(n − p + 6)) Eµ[k(k + 1) Z_(k+1)² ‖Y − Xβ̂_LS‖⁸/d³(Z)] / Eµ[‖Y − Xβ̂_LS‖⁸/d²(Z)] },    (12)

where Z_(k+1)² = max_{j≤p}{Z_j² : |Z_j| ≤ λ}, d(Z) is the denominator of (11), and df̂ is the estimated number of degrees of freedom.

Proof. The proof is given in Appendix B.

Now, the optimal solution for (9) is obtained at half the upper bound in (12). Approximating the expectations in (12) (see Appendix C for more details), we arrive at the constant

    c2 = 2(p − 2 − 2k(k + 1)/p) / ((n − p + 4)(n − p + 6)).    (13)

The model selection criterion we propose is thus

    δγ(y) = ‖Xβ̂ − y‖² + (2 div_y(Xβ̂) − n) ‖y − Xβ̂_LS‖²/(n − p)
            − [2(p − 2 − 2k(k + 1)/p) / ((n − p + 4)(n − p + 6))] · ‖y − Xβ̂_LS‖⁴ / [ k max_{j≤p}{z_j² : |z_j| ≤ λ} + Σ_{j≤p} z_j² 1{|z_j| ≤ λ} ].    (14)
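A possible implementation of the criterion (14), under our reading of (11)-(13), is sketched below (NumPy only). The handling of the degenerate case where no |z_j| falls below λ is our own choice and is not discussed in the paper.

    import numpy as np

    def delta_gamma(y, X, beta_hat, df_hat, lam):
        # Corrected loss estimator of equation (14).
        n, p = X.shape
        Q, _ = np.linalg.qr(X)                              # X = QR, Q is n x p
        beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
        z = Q.T @ (X @ beta_ls)                             # z_j = q_j^t X beta_LS
        k = int(np.count_nonzero(beta_hat))                 # number of selected variables

        rss = float(np.sum((y - X @ beta_ls) ** 2))
        d0 = float(np.sum((X @ beta_hat - y) ** 2)) + (2.0 * df_hat - n) * rss / (n - p)

        below = np.abs(z) <= lam
        if not np.any(below):                               # degenerate case: no correction
            return d0
        denom = k * float(np.max(z[below] ** 2)) + float(np.sum(z[below] ** 2))
        c2 = 2.0 * (p - 2.0 - 2.0 * k * (k + 1.0) / p) / ((n - p + 4.0) * (n - p + 6.0))
        return d0 - c2 * rss**2 / denom

The criterion is then minimized over the submodels visited along the regularization path.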

4 Simulation study

In this section, we study the behaviour of our criteria in terms of both selection and prediction. The following example is inspired by Fourdrinier and Wells (1994). We observe the vector y of size n = 40 and X containing p = 5 explanatory variables. We consider two situations: the first one where X is an orthogonal matrix, a classical assumption in wavelets for instance, and the second one where X is a general matrix. The regression coefficient β is set to (2, 0, 0, 4, 0)ᵗ, where xᵗ denotes the transpose of x. The error vector ε is drawn from three different distributions: the Gaussian distribution, which is the usual assumption for most criteria, the Kotz distribution, and the Student distribution with ν = 5 degrees of freedom, the latter two corresponding to our assumption of spherical symmetry. We replicate the error vector 5000 times, leading to 5000 different regularization paths. The estimator of loss δγ is the one described in (14) with the choices of the corrective function in (11) and of the constant in (13). We compare its mean prediction error, as well as its selection, to the classical Akaike Information Criterion (AIC) and Schwarz's Bayesian Information Criterion (BIC) under a Gaussian assumption, and to the distribution-free leave-one-out cross-validation (LOOCV). We also show the selection performed by the true loss L(β, β̂) = ‖Xβ̂ − Xβ‖². The results for the unbiased estimator δ0 and for Cp are not displayed here since they are the same as those of AIC. Table 1 shows the mean of the prediction error over the 5000 replicates. Tables 2 to 7 present, for each method, the empirical probabilities (or frequencies) of selecting each subset over the 5000 replicates. We repeated the experiment ten times to estimate the means and standard deviations shown in the tables. Only the most frequently selected subsets are displayed.
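A sketch of this data-generating setup is given below (ours; the general design matrix is drawn with i.i.d. Gaussian entries as an assumption, since the paper does not specify it, and the Kotz variant is omitted).

    import numpy as np
    from scipy.stats import ortho_group

    rng = np.random.default_rng(3)
    n, p = 40, 5
    beta = np.array([2.0, 0.0, 0.0, 4.0, 0.0])

    # Orthogonal design: the first p columns of a random n x n orthogonal matrix.
    X_orth = ortho_group.rvs(n, random_state=3)[:, :p]
    # General design: i.i.d. Gaussian entries (our assumption, not stated in the paper).
    X_gen = rng.standard_normal((n, p))

    def student_noise(size, nu=5):
        return rng.standard_normal(size) / np.sqrt(rng.chisquare(nu) / nu)

    y = X_orth @ beta + student_noise(n)   # one of the 5000 replicates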

Distr.                 δγ         AIC        BIC        LOOCV
X orthog.  Gaussian    4.6951     4.6109     5.0646     3.8979
X orthog.  Kotz        5.0382     4.9777     5.4508     4.2460
X orthog.  Student     20.1077    20.9984    17.4365    48.1016
X gen.     Gaussian    4.0109     3.7927     3.8565     4.0920
X gen.     Kotz        4.1354     3.9595     4.0110     5.3717
X gen.     Student     21.7597    24.9795    22.2883    25.5217

Table 1: Prediction errors.

We can see from Table 1 that all criteria have similar prediction errors. The advantage goes to our rule δγ for the general design case with Student errors, while AIC and LOOCV obtain the best results in almost all the other cases.

This can be explained by the fact that they select more variables, as we can see in the tables displaying the frequencies of selection. It is also worth noting that when the noise is Student, the prediction error is about four times worse than for the other distributions. This might be caused by a larger variance. Looking now at the frequencies of selection of each subset, we notice two different behaviours. The first one occurs in the orthogonal design case, where our rule δγ outperforms all the other criteria in recovering the true subset, and is the closest to the true loss. In the general design case however, BIC yields better results than our rule δγ, but both are pretty close compared to AIC and LOOCV. In that case also, something strange happens with the true loss: it is not so often minimized at the true sparsity. These results are very surprising from an oracle point of view, and unfortunately we are not able to explain them as of now. We can only assume that MCP's bias might still be too large. Combining the information displayed in the tables, we can see that our criterion δγ manages to recover the true sparsity quite often, even more often than BIC when the matrix X is orthogonal, at the cost of a very small difference in prediction error. We can thus have the best of both worlds: a good sparsity recovery for a low prediction loss. This conclusion does not depend on the noise distribution: the improvement of our criterion over the others is not significantly different for the other spherical noises. This confirms that some of the classical criteria can handle a relaxation of the noise distribution to the spherical case, as we proved for AIC, and maybe the same could be done for BIC.

Subset         δγ             AIC            BIC            LOOCV           true loss
∅              0.00 (0.13)    0.00 (0.11)    4.28 (0.21)    1.60 (1.26)     0.00 (0.02)
{4}            30.31 (0.41)   24.26 (0.41)   38.37 (0.64)   18.10 (18.49)   14.15 (0.36)
{1,4}          36.55 (0.59)   32.33 (0.54)   32.31 (0.44)   24.59 (15.29)   54.35 (0.52)
{1,2,4}        1.28 (0.13)    6.66 (0.47)    4.53 (0.31)    8.32 (5.62)     7.61 (0.33)
{1,3,4}        1.36 (0.15)    6.68 (0.16)    4.60 (0.18)    7.95 (4.82)     7.53 (0.37)
{1,4,5}        1.82 (0.14)    6.50 (0.46)    4.50 (0.27)    8.16 (4.50)     7.68 (0.28)
{1,2,3,4}      5.44 (0.25)    4.20 (0.20)    2.03 (0.21)    3.16 (1.31)     2.00 (0.21)
{1,2,4,5}      2.94 (0.20)    4.26 (0.23)    2.08 (0.16)    3.50 (1.41)     1.95 (0.15)
{1,3,4,5}      2.75 (0.24)    4.12 (0.27)    1.97 (0.18)    3.30 (1.61)     1.92 (0.14)
{1,2,3,4,5}    10.24 (0.43)   5.27 (0.26)    1.48 (0.11)    9.43 (4.01)     1.04 (0.08)

Table 2: Frequency (%) of selection for X orthogonal with Gaussian noise.

Improvement over the unbiased estimator could be even larger if we considered a more general form of correction, like δ* = α(δ0 − ‖y − Xβ̂‖⁴ γ) for instance, as was done in Fourdrinier and Wells (1994). But the most striking result is the low empirical probabilities of all the methods. The real loss L(µ̂, µ) manages to select the right subset only around half of the time in the orthogonal design case, bounding from above the probability of selection with our criterion, and behaves poorly in the general design case. On the contrary, the results in Fourdrinier and Wells (1994) showed that systematic exploration with subset selection leads to selecting {1, 4} with an empirical probability of 83% for their corrective estimator (which is slightly different from the one in (10)), that is to say more than 30% above the probability obtained with MCP. This important difference is a consequence of the regularization path, which sometimes introduces irrelevant variables first and fails to select the right subset. It also explains the selection by all criteria of subsets of size 3 or bigger containing the two relevant variables.

Subset         δγ             AIC            BIC            LOOCV           true loss
∅              1.15 (0.19)    0.00 (0.11)    5.19 (0.37)    3.45 (2.86)     0.00 (0.02)
{4}            31.06 (0.79)   25.09 (0.59)   38.90 (0.66)   23.66 (14.34)   15.56 (0.33)
{1,4}          35.16 (0.76)   31.11 (0.81)   30.76 (0.85)   27.63 (16.94)   52.47 (0.59)
{1,2,4}        1.30 (0.20)    6.39 (0.35)    4.32 (0.30)    5.82 (2.87)     7.42 (0.35)
{1,3,4}        1.23 (0.14)    6.50 (0.36)    4.46 (0.24)    6.25 (2.92)     7.70 (0.36)
{1,4,5}        1.74 (0.17)    6.30 (0.36)    4.43 (0.31)    4.98 (2.41)     7.56 (0.23)
{1,2,3,4}      5.49 (0.45)    4.29 (0.32)    2.06 (0.21)    2.36 (0.81)     2.01 (0.16)
{1,2,4,5}      2.77 (0.28)    4.15 (0.24)    1.97 (0.22)    2.66 (1.57)     2.05 (0.22)
{1,3,4,5}      2.74 (0.16)    4.19 (0.30)    2.06 (0.18)    2.69 (1.19)     2.06 (0.12)
{1,2,3,4,5}    10.19 (0.33)   5.32 (0.20)    1.51 (0.16)    10.96 (6.91)    1.14 (0.17)

Table 3: Frequency (%) of selection for X orthogonal with Kotz noise.

Subset         δγ             AIC            BIC            LOOCV           true loss
∅              12.83 (0.32)   12.95 (0.29)   24.72 (0.38)   9.16 (3.87)     11.92 (0.45)
{4}            20.12 (0.38)   16.44 (0.35)   23.16 (0.64)   26.17 (16.09)   15.58 (0.60)
{1,4}          34.21 (0.32)   29.64 (0.37)   31.39 (0.36)   19.99 (14.08)   46.91 (0.37)
{3,4}          2.23 (0.27)    1.50 (0.27)    0.00 (0.17)    2.26 (0.88)     1.15 (0.13)
{1,3,4}        0.00 (0.09)    5.25 (0.36)    3.11 (0.26)    3.51 (3.21)     4.85 (0.14)
{1,4,5}        1.32 (0.09)    5.06 (0.32)    3.14 (0.18)    3.88 (3.43)     4.80 (0.22)
{1,2,3,4}      4.26 (0.25)    3.38 (0.23)    1.41 (0.11)    3.07 (1.80)     1.42 (0.11)
{1,2,4,5}      2.25 (0.11)    3.39 (0.16)    1.40 (0.14)    2.94 (1.66)     1.47 (0.15)
{1,3,4,5}      2.34 (0.11)    3.46 (0.16)    1.41 (0.14)    3.29 (1.75)     1.47 (0.11)
{1,2,3,4,5}    8.18 (0.41)    4.43 (0.19)    1.04 (0.12)    10.66 (4.91)    0.00 (0.14)

Table 4: Frequency (%) of selection for X orthogonal with Student noise.

Subset         δγ             AIC            BIC            LOOCV           true loss
{1,4}          59.27 (0.61)   42.34 (0.59)   63.12 (0.65)   29.09 (28.57)   23.65 (0.67)
{1,2,4}        9.12 (0.38)    12.20 (0.52)   10.18 (0.32)   11.57 (7.55)    13.06 (0.45)
{1,3,4}        11.08 (0.42)   15.29 (0.30)   12.76 (0.32)   15.07 (8.84)    16.79 (0.35)
{1,4,5}        5.67 (0.38)    6.09 (0.43)    4.34 (0.26)    7.40 (3.92)     7.87 (0.33)
{1,2,3,4}      1.87 (0.18)    8.53 (0.30)    4.16 (0.18)    9.03 (6.26)     12.16 (0.51)
{1,2,4,5}      3.23 (0.28)    4.30 (0.33)    1.74 (0.17)    5.18 (3.69)     6.67 (0.33)
{1,3,4,5}      4.31 (0.27)    5.59 (0.19)    2.27 (0.15)    6.49 (4.55)     8.56 (0.40)
{1,2,3,4,5}    5.45 (0.32)    5.66 (0.23)    1.44 (0.19)    16.16 (9.72)    11.23 (0.19)

Table 5: Frequency (%) of selection for X general with Gaussian noise.

Subset         δγ             AIC            BIC            LOOCV           true loss
{1,4}          58.82 (0.87)   42.34 (1.01)   62.87 (0.66)   31.71 (28.47)   23.25 (0.61)
{1,2,4}        9.38 (0.47)    12.23 (0.43)   10.28 (0.36)   13.51 (6.78)    13.06 (0.30)
{1,3,4}        11.06 (0.39)   15.24 (0.46)   12.75 (0.28)   16.50 (8.65)    16.76 (0.52)
{1,4,5}        5.62 (0.25)    6.03 (0.28)    4.28 (0.20)    8.31 (4.17)     7.92 (0.40)
{1,2,3,4}      1.85 (0.17)    8.43 (0.37)    4.12 (0.25)    7.72 (4.19)     12.55 (0.57)
{1,2,4,5}      3.34 (0.31)    4.33 (0.39)    1.80 (0.12)    4.99 (2.56)     6.59 (0.32)
{1,3,4,5}      4.45 (0.18)    5.66 (0.30)    2.34 (0.15)    6.06 (3.53)     8.57 (0.30)
{1,2,3,4,5}    5.48 (0.23)    5.73 (0.18)    1.56 (0.14)    11.19 (7.10)    11.31 (0.10)

Table 6: Frequency (%) of selection for X general with Kotz noise.

Subset         δγ             AIC            BIC            LOOCV           true loss
{1,4}          56.32 (0.48)   42.43 (0.41)   62.39 (0.63)   25.97 (18.03)   23.04 (0.44)
{1,2,4}        9.29 (0.57)    11.94 (0.36)   9.99 (0.57)    11.19 (6.20)    13.28 (0.28)
{1,3,4}        11.61 (0.23)   15.14 (0.28)   12.73 (0.33)   12.68 (6.62)    17.02 (0.53)
{1,4,5}        5.62 (0.26)    6.03 (0.24)    4.08 (0.15)    6.84 (3.41)     7.91 (0.32)
{1,2,3,4}      2.16 (0.16)    8.09 (0.30)    3.78 (0.30)    8.45 (4.25)     12.28 (0.54)
{1,2,4,5}      3.33 (0.22)    4.19 (0.31)    1.68 (0.19)    4.98 (2.14)     6.45 (0.26)
{1,3,4,5}      4.04 (0.32)    5.25 (0.40)    2.14 (0.25)    6.45 (2.90)     8.58 (0.31)
{1,2,3,4,5}    6.34 (0.41)    5.81 (0.42)    1.34 (0.19)    22.38 (9.83)    10.83 (0.34)

Table 7: Frequency (%) of selection for X general with Student noise.

5 Discussion

In this work we proposed a complete selection procedure exploring the submodels through a regularization path, estimating the parameter by the Minimax Concave Penalty, and evaluating each solution by means of loss estimation. The novelty of this procedure is the relaxation of the usual i.i.d. assumption, with no need to specify the form of the noise distribution. The only assumption we need is that the distribution of the noise is spherically symmetric, which greatly generalizes the Gaussian distribution. The small simulation study we ran shows that our procedure yields a better recovery of the true sparsity while only very slightly increasing the prediction error. Even better results might be obtained by improving on the loss estimators we proposed. Also, the spherical assumption is only exploited for the evaluation of the solutions; we could still gain performance by adapting regularization methods to this context, leading to a possibly better regularization path and a better estimation. But above all, our goal in this work was to try to clarify the way model selection is performed, as we hope this will help reformulate it as a better-posed problem, which we believe is not the case up to now.

References

Agarwal, A., Duchi, J. C., Bartlett, P. L., and Levrard, C. (2011). Oracle inequalities for computationally budgeted model selection. Journal of Machine Learning Research - Proceedings Track, 19:69-86.

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, volume 1, pages 267-281. Akademiai Kiado.

Brandwein, A. and Strawderman, W. (1990). Stein estimation: The spherically symmetric case. Statistical Science, 5(3):356-369.

Burnham, K. and Anderson, D. (2002). Model Selection and Multimodel Inference: a Practical Information-Theoretic Approach. Springer Verlag.

Donoho, D. and Johnstone, J. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32(2):407-451.

Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348-1360.

Foster, D. and George, E. (1994). The risk inflation criterion for multiple regression. The Annals of Statistics, pages 1947-1975.

Fourdrinier, D. and Strawderman, W. (1996). A paradox concerning shrinkage estimators: should a known scale parameter be replaced by an estimated value in the shrinkage factor? Journal of Multivariate Analysis, 59(2):109-140.

Fourdrinier, D. and Wells, M. (1994). Comparaisons de procédures de sélection d'un modèle de régression : une approche décisionnelle. Comptes rendus de l'Académie des sciences. Série 1, Mathématique, 319(8):865-870.

Fourdrinier, D. and Wells, M. (1995). Estimation of a loss function for spherically symmetric distributions in the general linear model. The Annals of Statistics, 23(2):571-592.

Friedman, J., Hastie, T., and Tibshirani, R. (2008). The Elements of Statistical Learning.

Guyon, I., Saffari, A., Dror, G., and Cawley, G. (2010). Model selection: Beyond the Bayesian/frequentist divide. The Journal of Machine Learning Research, 11:61-87.

Hocking, R. (1976). A Biometrics invited paper. The analysis and selection of variables in linear regression. Biometrics, 32(1):1-49.

Hurvich, C. and Tsai, C. (1989). Regression and time series model selection in small samples. Biometrika, 76(2):297-307.

James, W. and Stein, C. (1961). Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, page 361. University of California Press.

Johnstone, I. (1988). On inadmissibility of some unbiased estimates of loss. Statistical Decision Theory and Related Topics, 4(1):361-379.

Kelker, D. (1970). Distribution theory of spherical distributions and a location-scale parameter generalization. Sankhyā: The Indian Journal of Statistics, Series A, 32(4):419-430.

Mallows, C. (1973). Some comments on Cp. Technometrics, pages 661-675.

Morioka, N. and Satoh, S. (2011). Generalized lasso based approximation of sparse coding for visual recognition. In Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., and Weinberger, K., editors, Advances in Neural Information Processing Systems 24, pages 181-189.

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2):461-464.

Stein, C. (1955). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 197-206.

Stein, C. (1981). Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, 9(6):1135-1151.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267-288.

Vapnik, V. and Chervonenkis, A. (1971). On uniform convergence of the frequencies of events to their probabilities. Teoriya Veroyatnostei i ee Primeneniya, 16(2):264-279.

Wainwright, M. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55(5):2183-2202.

Zhang, C. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894-942.

Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418-1429.

Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301-320.

A Unbiased estimators of loss for spherically symmetric distributions

This appendix shows how we obtained the unbiased estimator of loss δ0 in (8). The idea is to expand the quadratic risk of β̂, namely R(β̂) = Eµ[‖Xβ̂ − Xβ‖²], and to express it as a function of X, β̂ and Y only:

    R(β̂) = Eµ[‖Xβ̂ − Xβ‖²]
          = Eµ[‖Xβ̂ − Y + Y − Xβ‖²]
          = Eµ[‖Xβ̂ − Y‖² + ‖Y − Xβ‖² + 2(Y − Xβ)ᵗ(Xβ̂ − Y)].    (15)

For the multivariate normal case, a classical unbiased estimate of the variance σ² = Eµ[‖Y − Xβ‖²/n] is ‖y − Xβ̂_LS‖²/(n − p). For the right-most term in (15), we recall Stein's identity.

Theorem 1 (Stein's identity). Given Y ∈ R^n a random vector following a multivariate Gaussian distribution N_n(µ, σ²I_n), and g : R^n → R^n a weakly differentiable function, we have

    Eµ[(Y − µ)ᵗ g(Y)] = σ² Eµ[div_Y g(Y)],    (16)

provided both expectations exist.

Given the independence between ‖y − Xβ̂_LS‖²/(n − p) and Y, substituting σ² by the expectation of its unbiased estimator and applying Stein's identity with g(Y) = Xβ̂ − Y yields the following expression of the risk:

    R(β̂) = Eµ[‖Xβ̂ − Y‖² + n‖Y − Xβ̂_LS‖²/(n − p) + 2 div_Y g(Y) ‖Y − Xβ̂_LS‖²/(n − p)]
          = Eµ[‖Xβ̂ − Y‖² + ((n + 2 div_Y(Xβ̂ − Y))/(n − p)) ‖Y − Xβ̂_LS‖²].    (17)

Hence the result is obtained by taking what is inside the brackets as an unbiased estimator of loss, with div_Y(Xβ̂) = df̂.

Now, for the spherical case, we use the trick of transforming the linear model into its canonical form. Let G = (G1ᵗ, G2ᵗ)ᵗ be an orthogonal matrix such that G1ᵗ spans the same column space as X and G2X = 0. The canonical form of (1) is obtained by applying G to Y:

    GY = (G1Xβ, 0)ᵗ + Gε.

Letting Z = G1Y, U = G2Y and θ = G1Xβ, we define the canonical form of (1) by

    (Z, U)ᵗ = (θ, 0)ᵗ + Gε.    (18)

When ε ~ SS_n(0), so does Gε. Hence (Z, U) ~ SS_n((θ, 0)). Note that Z and U are dependent, except for the Gaussian distribution. Also, it can easily be proved that the canonical form of the least-squares estimator is Z. Indeed, replacing Y by its canonical equivalent and noting that there exists a matrix A ∈ R^{p×p} such that X = G1ᵗA (since G1ᵗ and X span the same column space), we have

    θ̂_LS = A(XᵗX)⁻¹XᵗY = A(AᵗG1G1ᵗA)⁻¹AᵗG1(G1ᵗ G2ᵗ)(Z, U)ᵗ = A(AᵗA)⁻¹AᵗZ = Z,

since G1ᵗG1 = I_p and G1ᵗG2 = 0.
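The canonical form is easy to realise numerically: a complete QR factorisation of X provides G1 and G2, and one can check that θ̂_LS = Z and that the residual sum of squares equals ‖U‖². The sketch below is ours and assumes NumPy only.

    import numpy as np

    rng = np.random.default_rng(4)
    n, p = 40, 5
    X = rng.standard_normal((n, p))
    y = X @ np.array([2.0, 0.0, 0.0, 4.0, 0.0]) + rng.standard_normal(n)

    # Complete QR factorisation of X: the first p columns of Q span the column
    # space of X (rows of G1), the remaining n - p columns are orthogonal to it (G2).
    Q, _ = np.linalg.qr(X, mode="complete")
    G1, G2 = Q[:, :p].T, Q[:, p:].T

    Z, U = G1 @ y, G2 @ y
    beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

    print(np.allclose(G1 @ (X @ beta_ls), Z))                            # theta_LS = Z
    print(np.isclose(np.sum((y - X @ beta_ls) ** 2), np.sum(U ** 2)))    # RSS = ||U||^2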

In order to show that δ0 is an unbiased estimator of the quadratic loss under spherical symmetry, we need to show that E[‖Y − Xβ̂_LS‖²/(n − p)] is still equal to E[‖Y − µ‖²/n], and we need an extension of Stein's identity to the spherical case.

To do so, we use the canonical form. Indeed, we have

    Eµ[‖Y − µ‖²] = E_(θ,0)[‖Gᵗ(Z, U)ᵗ − Gᵗ(θ, 0)ᵗ‖²] = E_(θ,0)[‖(Z, U) − (θ, 0)‖²] = E_(θ,0)[‖Z − θ‖² + ‖U‖²].

Now, from the spherical symmetry property, we know that E_(θ,0)[‖Z − θ‖²] = p E_(θ,0)[‖U‖²]/(n − p). Furthermore,

    ‖Y − Xβ̂_LS‖² = ‖Y − G1ᵗZ‖² = ‖Gᵗ(Z, U)ᵗ − Gᵗ(Z, 0)ᵗ‖² = ‖U‖².

Hence n‖Y − Xβ̂_LS‖²/(n − p) is an unbiased estimator of ‖Y − Xβ‖², even in the spherical case. Finally, Fourdrinier and Wells (1995) give a theorem extending Stein's identity to the spherical case.

Theorem 2 (Stein-type identity). Given (Z, U) ∈ R^n a random vector following a spherically symmetric distribution around (θ, 0), and g : R^p → R^p a weakly differentiable function, we have

    Eθ[(Z − θ)ᵗ g(Z)] = Eθ[‖U‖² div_Z g(Z)/(n − p)],    (19)

provided both expectations exist.

Remarking that (Y − Xβ)ᵗ(Xβ̂ − Y) = (Z − θ)ᵗ(θ̂ − Z) − ‖U‖², we can develop the right-most term of (15):

    Eµ[(Y − Xβ)ᵗ(Xβ̂ − Y)] = Eθ[(Z − θ)ᵗ(θ̂ − Z) − ‖U‖²]
                           = Eθ[‖U‖² div_Z(θ̂ − Z)/(n − p) − ‖U‖²]
                           = Eθ[(div_Z(θ̂ − Z)/(n − p) − 1)‖U‖²]
                           = Eµ[((div_{G1Y}(G1Xβ̂ − G1Y))/(n − p) − 1)‖Y − Xβ̂_LS‖²]
                           = Eµ[((div_Y(Xβ̂) − p)/(n − p) − 1)‖Y − Xβ̂_LS‖²].
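The Stein-type identity (19) can be checked by Monte Carlo for a non-trivial weakly differentiable g; the sketch below (ours) uses soft thresholding, whose divergence is the number of surviving coordinates, with Student noise of the kind used in Section 4.

    import numpy as np

    rng = np.random.default_rng(5)
    n, p, nu, lam, reps = 40, 5, 5, 1.0, 100_000
    theta = np.array([2.0, 0.0, 0.0, 4.0, 0.0])

    # (Z, U) spherically symmetric around (theta, 0): multivariate Student noise.
    eps = rng.standard_normal((reps, n)) / np.sqrt(rng.chisquare(nu, size=(reps, 1)) / nu)
    Z, U = theta + eps[:, :p], eps[:, p:]

    g = np.sign(Z) * np.maximum(np.abs(Z) - lam, 0.0)   # soft thresholding of Z
    div_g = np.count_nonzero(g, axis=1)                 # its divergence (a.s.)

    lhs = np.mean(np.sum((Z - theta) * g, axis=1))
    rhs = np.mean(np.sum(U ** 2, axis=1) * div_g / (n - p))
    print(lhs, rhs)   # the two Monte Carlo averages should roughly agree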

B Proof of the bound on c2

In this appendix, we prove how we came to the bound on c2 given in (12). This proof is entirely done in the canonical form of the linear model, given in equation (18). In order to simplify the notation, we will assume that the vector Z is reordered such that |Z_(1)| ≥ ... ≥ |Z_(p)|. The loss estimator δγ improves on the unbiased estimator δ0 if, for p ≥ 5, we have the following inequality:

    dθ = Eθ[(δγ − ‖θ̂ − θ‖²)² − (δ0 − ‖θ̂ − θ‖²)²] ≤ 0.    (20)

Developing the expression of dθ in (20), we obtain

    dθ = Eθ[(δ0 − ‖U‖⁴γ(Z) − ‖θ̂ − θ‖²)² − (δ0 − ‖θ̂ − θ‖²)²]
       = Eθ[‖U‖⁸γ²(Z) − 2(δ0 − ‖θ̂ − θ‖²)‖U‖⁴γ(Z)]
       = Eθ[‖U‖⁸γ²(Z) − 2δ0‖U‖⁴γ(Z) + 2‖Z + g(Z) − θ‖²‖U‖⁴γ(Z)]
       = Eθ[‖U‖⁸γ²(Z) − 2(‖U‖²(p + 2 div g(Z))/(n − p) + ‖g(Z)‖²)‖U‖⁴γ(Z)
            + 2‖Z − θ‖²‖U‖⁴γ(Z) + 2‖g(Z)‖²‖U‖⁴γ(Z) + 4(Z − θ)ᵗg(Z)‖U‖⁴γ(Z)],

where g(Z) = θ̂ − Z. Fourdrinier and Wells (1995) give the two following lemmas extending Stein's identity.

Lemma 3. For every twice weakly differentiable function γ : Θ → R,

    Eθ[‖Z − θ‖²‖U‖⁴γ(Z)] = Eθ[ (p/(n − p + 4))‖U‖⁶γ(Z) + ‖U‖⁸Δγ(Z)/((n − p + 4)(n − p + 6)) ].

The following corollary can be derived from this lemma.

Lemma 4. For every twice weakly differentiable function g : Θ → Θ,

    Eθ[(Z − θ)ᵗg(Z)‖U‖⁴γ(Z)] = Eθ[ (1/(n − p + 4)) ‖U‖⁶ (γ(Z) div_Z g(Z) + g(Z)·∇γ(Z)) ].

Applying these lemmas to dθ, we obtain

    dθ = Eθ[ ED1 ‖U‖⁸ + ED2 ‖U‖⁶ ],

with

    ED1 = 2Δγ(Z)/((n − p + 4)(n − p + 6)) + γ²(Z),    (21)
    ED2 = (−2(p + 2 div g(Z))/(n − p + 4)) ( 4γ(Z)/(n − p) + g(Z)·∇γ(Z) ).    (22)

Now, for the correction function in (11), and assuming Z has been reordered, we have

    ∇γ(Z) = −2c2 (0, ..., 0, (k + 1)Z_(k+1), Z_(k+2), ..., Z_(p))ᵗ / [ (k + 1)Z_(k+1)² + Σ_{i=k+2}^p Z_(i)² ]²,

    Δγ(Z) = −2c2(p − 2) / [ (k + 1)Z_(k+1)² + Σ_{i=k+2}^p Z_(i)² ]² + 4c2 k(k + 1)Z_(k+1)² / [ (k + 1)Z_(k+1)² + Σ_{i=k+2}^p Z_(i)² ]³.

We thus obtain

    ED1 = −(4c2 / ((n − p + 4)(n − p + 6))) · (1/d²(Z)) · ( p − 2 − 2k(k + 1)Z_(k+1)²/d(Z) ) + c2²/d²(Z),

    ED2 = (8c2(n − 2df̂) / ((n − p + 4)(n − p))) · (1/d(Z)),

where d(Z) = (k + 1)Z_(k+1)² + Σ_{i=k+2}^p Z_(i)². Now, solving inequality (20), which reduces to a second-order inequality in c2, we can actually improve on δ0 if c2 lies between 0 and

    (4/(n − p + 4)) [ (p − 2)/(n − p + 6)
        − (2/(n − p + 6)) Eθ[k(k + 1)Z_(k+1)²‖U‖⁸/d³(Z)] / Eθ[‖U‖⁸/d²(Z)]
        − (2/(n − p)) Eθ[(n + 2df̂)‖U‖⁶/d(Z)] / Eθ[‖U‖⁸/d²(Z)] ],

and the minimum of the risk difference is reached at half this value.

sec:Approximation of the optimal value for c2

h i   2 In this appendix, we seek bound for the terms Eθ k(k + 1)Z(k+1) ||U ||8 /d3 (Z) /Eθ ||U ||8 /d2 (Z) b )||U ||6 /d(Z)]/Eθ [||U ||8 /d2 (Z)]. The first one is easy. Indeed, and Eθ [(n − 2df Pp 2 2 2 by definition, we have that d(Z) = (k + 1)Z(k+1) + i=k+2 Z(i) ≤ pZ(k+1) and 2 hence 1/d(Z) ≥ 1/pZ(k+1) . Combining these two elements with the facts that 2 (k + 1)Z(k+1) /d(Z) ≤ 1, we obtain the following inequality: h i 2 Eθ (k + 1)Z(k+1) /d(Z) × ||U ||8 /d2 (Z) k+1 ≤ ≤1 p Eθ [||U ||8 /d2 (Z)]

(23)

The second term is very challenging. We use again the fact that (k + 1)λ2 ≤ d(Z) ≤ pλ2 , leading to: (k + 1)λ2

Eθ [||U ||6 /d(Z)] Eθ [||U ||6 /d(Z)] Eθ [||U ||6 /d(Z)] ≤ ≤ pλ2 8 8 2 Eθ [||U || /d(Z)] Eθ [||U || /d (Z)] Eθ [||U ||8 /d(Z)]

Moreover, by definition, there exists a constant c1 such that γ(Z) ≤ c1 /||Z||2 . Hence 6 2 Eθ [||U ||6 /d(Z)] (k + 1) 2 Eθ [||U ||6 /||Z||2 ] 2 Eθ [||U || /||Z|| ] λ ≤ ≤ c pλ 1 c1 Eθ [||U ||8 /||Z||2 ] Eθ [||U ||8 /d2 (Z)] Eθ [||U ||8 /||Z||2 ]

Fourdrinier and Strawderman (1996) give the following proposition for bounds on such expectations. Proposition 1. Let q be an integer such that (p − n)/2 < q. If p ≥ 6 then, for

17

any θ ∈ Rp , we have    n n − p Γ Γ +q   R2q n + 2q − 2 2 2     × ER     n + 2q − 4 n−p n p−2 2 2 R + ||θ|| Γ Γ +q p − 4 2 2   ||U ||2q ≤ Eθ ||Z||2    n n − p Γ Γ +q   R2q n + 2q − 2 2 2 .   ≤  × ER     (n + 2q − 2)(p − 6) n−p n p−2 2 2 ||θ|| R + Γ Γ +q (p − 2)2 2 2 Taking q = 3 and q = 4 respectively, and using the fact that Γ(n) = (n − 1)Γ(n − 1), we obtain the following inequality:    ER   (k + 1)λ2 n + 4 c1 n−p+6

 R6   n+2 2 2 R + ||θ|| p−4

  ER  

  R8   (n + 6)(p − 6) 2 2 R + ||θ|| (p − 2)2   ER  



n+4 Eθ [||U ||6 /||Z||2 ] ≤ c1 pλ2 8 2 Eθ [||U || /d(Z)||Z|| ] n−p+6



 R6   (n + 4)(p − 6) 2 2 ||θ|| R + (p − 2)2   .

 ER  

 R8   n+4 R2 + ||θ||2 p−4

From simulations, we have negligible terms for     ER  

 R6   n+2 ||θ||2 R2 + p−4



 ER   

and

 R   (n + 4)(p − 6) 2 R2 + ||θ|| 2 (p − 2)   .

8

 ER  

 6

 R   (n + 6)(p − 6) 2 2 R + ||θ|| 2 (p − 2)

 ER  

Hence the proposition for c2 in (13).

18

 R8   n + 4 2 2 R + ||θ|| p−4