The Annals of Statistics 2010, Vol. 38, No. 2, 704–752 DOI: 10.1214/08-AOS629 © Institute of Mathematical Statistics, 2010

GOODNESS-OF-FIT TESTS FOR HIGH-DIMENSIONAL GAUSSIAN LINEAR MODELS

By Nicolas Verzelen and Fanny Villers

Université Paris-Sud and INRIA Saclay, and INRA, Laboratoire de biométrie

Let (Y, (X_i)_{1≤i≤p}) be a real zero-mean Gaussian vector and let V be a subset of {1, . . . , p}. Suppose we are given n i.i.d. replications of this vector. We propose a new procedure for testing that Y is independent of (X_i)_{i∈{1,...,p}\V} conditionally on (X_i)_{i∈V}, against the general alternative that it is not. This procedure does not depend on any prior information on the covariance of X or the variance of Y, and applies in a high-dimensional setting. It straightforwardly extends to testing the neighborhood of a Gaussian graphical model. The procedure is based on a model of Gaussian regression with random Gaussian covariates. We give nonasymptotic properties of the test and we prove that it is rate optimal [up to a possible log(n) factor] over various classes of alternatives under some additional assumptions. Moreover, it allows us to derive nonasymptotic minimax rates of testing in this random design setting. Finally, we carry out a simulation study in order to evaluate the performance of our procedure.

[Received December 2007; revised May 2008.
AMS 2000 subject classifications. Primary 62J05; secondary 62G10, 62H20.
Key words and phrases. Gaussian graphical models, linear regression, multiple testing, ellipsoid, adaptive testing, minimax hypothesis testing, minimax separation rate, goodness-of-fit.]

1. Introduction. We consider the following regression model:

(1.1)  $Y = \sum_{i=1}^{p} \theta_i X_i + \epsilon,$

where θ is an unknown vector of R^p. In the sequel, we note I := {1, . . . , p}. The vector X := (X_i)_{1≤i≤p} follows a real zero-mean Gaussian distribution with nonsingular covariance matrix Σ, and ε is a real zero-mean Gaussian random variable independent of X. Straightforwardly, the variance of ε corresponds to the conditional variance of Y given X, var(Y|X). The variable selection problem for this model in a high-dimensional setting has recently attracted a lot of attention. A large number of papers are now devoted to the design of new algorithms and estimators which are computationally feasible and are proven to converge (see, for instance, the works of Meinshausen and Bühlmann [19], Candès and Tao [5], Zhao and Yu [29], Zou and Hastie [30], Bühlmann and Kalisch [4] or Zhao and Huang [28]). A common drawback of the previously mentioned estimation procedures is that they require restrictive conditions on the covariance matrix Σ in order to behave well. Our issue is the natural testing counterpart of this variable selection problem; we aim at defining a


computationally feasible testing procedure that achieves an optimal rate for any covariance matrix Σ.

1.1. Presentation of the main results. We are given n i.i.d. replications of the vector (Y, X). Let us respectively note **Y** and **X**_i the vectors of the n observations of Y and X_i, for any i ∈ I. Let V be a subset of I; then **X**_V refers to the set {**X**_i, i ∈ V} and θ_V stands for the sequence (θ_i)_{i∈V}. We first propose a collection of testing procedures T_α of the null hypothesis "θ_{I\V} = 0" against the general alternative "θ_{I\V} ≠ 0." These procedures are based on the ideas of Baraud et al. [3], adapted to a random design. Their definitions are very flexible, as they require no prior knowledge of the covariance of X, the variance of ε or the variance of Y. Note that the property "θ_{I\V} = 0" is equivalent to "Y is independent of X_{I\V} conditionally on X_V." Hence, the procedure also permits testing conditional independences and applies to testing the graph of a Gaussian graphical model (see below). Contrary to most approaches in this setting (e.g., Drton and Perlman [8]), we are able to consider the difficult case of tests in a high-dimensional setting: the number of covariates p is possibly much larger than the number of observations n. Such situations arise in many statistical applications, such as genomics or biomedical imaging. To our knowledge, the only testing procedures (e.g., [21]) that could handle high-dimensional alternatives lack theoretical justifications. In this paper, we exhibit some tests T_α that are both computationally amenable and optimal in the minimax sense.

From a theoretical perspective, we are able to control the family-wise error rate (FWER) of our testing procedures T_α. Moreover, we derive a general nonasymptotic upper bound for their power. Contrary to the various rates of convergence obtained in the estimation setting (e.g., [5] or [19]), our upper bound holds for any covariance matrix Σ. Then we derive from it nonasymptotic minimax rates of testing in the Gaussian random design framework. While the minimax rates have been known for a long time in the fixed design Gaussian regression framework (e.g., [2]), they were unknown in our setting. For instance, if at most k components of θ are nonzero and if k is much smaller than p, we prove that the minimax rate of testing is of order k log(p)/n when the covariates X_i are independent. If the covariates are dependent, we derive faster minimax rates. To our knowledge, these are the first results for testing or estimation issues that illustrate minimax rates for dependent covariates. Afterward, we show analogous results when k is large or when the vector θ belongs to some ellipsoid or some collection of ellipsoids. For any of these alternatives, we exhibit some procedure T_α that achieves the optimal rate [up to a possible log(n) factor]. Finally, we illustrate the performance of the procedure on simulated examples.

1.2. Application to Gaussian graphical models (GGM). Our work was originally motivated by the following question: let (Z_j)_{j∈J} be a random vector which


follows a zero-mean Gaussian distribution whose covariance matrix is nonsingular. We observe n i.i.d. replications of this vector Z, and we are given a graph G = (Γ, E), where Γ = {1, . . . , |J|} and E is a set of edges in Γ × Γ. How can we test that Z is an undirected Gaussian graphical model (GGM) with respect to the graph G? The random vector Z is a GGM with respect to the graph G = (Γ, E) if, for any couple (i, j) which is not contained in the edge set E, Z_i and Z_j are independent given the remaining variables. See Lauritzen [17] for definitions and main properties of GGM. Interest in these models has grown as they allow the description of dependence structures in high-dimensional data. As such, they are widely used in spatial statistics [7, 20] or probabilistic expert systems [6]. More recently, they have been applied to the analysis of microarray data. The challenge is to infer the network regulating the expression of the genes using only a small sample of data (see, for instance, Schäfer and Strimmer [21], Kishino and Waddell [15] or Wille et al. [26]). This issue has motivated the search for new estimation procedures to handle GGM in a high-dimensional setting. It is beyond the scope of this paper to give an exhaustive review of these. Many of these graph estimation methods are based on multiple testing procedures (see, for instance, Schäfer and Strimmer [21] or Wille and Bühlmann [25]). Other methods are based on the variable selection procedures for high-dimensional data previously mentioned. For instance, Meinshausen and Bühlmann [19] proposed a computationally feasible model selection algorithm using Lasso penalization. Huang et al. [11] and Yuan and Lin [27] extend this method to infer the inverse covariance matrix directly, by minimizing the log-likelihood penalized by the ℓ1 norm.
While the issue of graph and covariance estimation has been extensively studied, few theoretical results have been proved for the problem of hypothesis testing of GGM in a high-dimensional setting. We believe that this issue is significant for two reasons: first, when considering a gene regulation network, biologists often have prior knowledge of the graph and may want to test whether the microarray data match their model. Second, when applying an estimation method in a high-dimensional setting, it can be useful to test the estimated graph, as some of these methods are too conservative. Admittedly, some of the previously mentioned estimation methods are based on multiple testing. However, as they are constructed for an estimation purpose, most of them do not take into account prior knowledge about the graph. This is, for instance, the case for the approaches of Drton and Perlman [8] and Schäfer and Strimmer [21]. Some of the other existing procedures cannot be applied in a high-dimensional setting (|J| ≥ n). Finally, most of them lack nonasymptotic theoretical justification. In a subsequent paper [23], we define a test of graphs based on the present work. It benefits from the ability to handle high-dimensional GGM and has minimax properties. Moreover, we show numerical evidence of its efficiency (see [23] for more details). In this article, we shall only present the idea underlying our approach.


For any j ∈ J, we note N(j) the set of neighbors of j in the graph G. Testing that Z is a GGM with respect to G is equivalent to testing, for any j ∈ J, that the random variable Z_j conditionally on (Z_l)_{l∈N(j)} is independent of (Z_l)_{l∈J\(N(j)∪{j})}. As Z follows a Gaussian distribution, the distribution of Z_j conditionally on the other variables decomposes as follows:

$Z_j = \sum_{k \in J\setminus\{j\}} \theta_k Z_k + \epsilon_j,$

where ε_j is normal and independent of (Z_k)_{k∈J\{j}}. Then, the statement of conditional independence is equivalent to θ_{J\(N(j)∪{j})} = 0. This approach based on conditional regression is also used for estimation by Meinshausen and Bühlmann [19].

1.3. Organization of the paper. In Section 2, we present the approach of our procedure and connect it with the fixed design framework. Moreover, we define the notion of minimax rates of testing in this setting and gather the main notation. We define the testing procedures T_α in Section 3, and we nonasymptotically characterize the set of vectors θ over which the test T_α is powerful. In Sections 4 and 5, we apply our procedure to define tests and study their optimality for two different classes of alternatives. More precisely, in Section 4 we test θ = 0 against the class of θ whose components all equal 0 except at most k of them (k is supposed small). We define a test which, under mild conditions, achieves the minimax rate of testing. When the covariates are independent, it is interesting to note that the minimax rates exhibit the same ranges in our statistical model (1.1) and in the fixed design regression model (2.1). In Section 5, we define two procedures that achieve the simultaneous minimax rates of testing over large classes of ellipsoids [sometimes at the price of a log(p) factor]. Moreover, we show that the problem of adaptation over classes of ellipsoids is impossible without a loss in efficiency. This was previously pointed out in [22] in the fixed design regression framework. The simulation studies are presented in Section 6. Finally, Sections 7, 8 and the Appendix contain the proofs.

2. Description of the approach.

2.1. Connection with tests in fixed design regression. Our work is directly inspired by the testing procedure of Baraud et al. [3] in the fixed design regression framework. Contrary to model (1.1), the problem of hypothesis testing in fixed design regression has been extensively studied.
This is why we will use the results in this framework as a benchmark for the theoretical bounds in our model (1.1). Let us define this second regression model:

(2.1)  $Y_i = f_i + \sigma\epsilon_i, \qquad i \in \{1, \ldots, N\},$

where f is an unknown vector of R^N, σ is some unknown positive number and the ε_i's form a sequence of i.i.d. standard Gaussian random variables. The problem at hand


is testing that f belongs to a linear subspace of R^N against the alternative that it does not. We refer to [3] for a short review of nonparametric tests in this framework. Moreover, we are interested in the performance of the procedures from a minimax perspective. To our knowledge, there have been no results in model (1.1). However, there are numerous papers on this issue in the fixed design regression model. First, we refer to the seminal work of Ingster [12–14], who gives asymptotic minimax rates over nonparametric alternatives. Our work is closely related to the results of Baraud [2], where he gives nonasymptotic minimax rates of testing over ellipsoids or sparse signals. Throughout the paper, we highlight the link between the minimax rates in fixed and in random design.

2.2. Principle of our testing procedure. Let us briefly describe the idea underlying our testing procedure. A formal definition will follow in Section 3.1. Let m be a subset of I \ V. We respectively define S_V and S_{V∪m} as the linear subspaces of R^p such that θ_{I\V} = 0, respectively θ_{I\(V∪m)} = 0. We note d and D_m for the cardinalities of V and m, and N_m refers to N_m = n − d − D_m. If N_m > 0, we define the Fisher statistic φ_m by

(2.2)  $\phi_m(\mathbf{Y},\mathbf{X}) := \frac{N_m}{D_m}\,\frac{\|\Pi_{V\cup m}\mathbf{Y} - \Pi_V\mathbf{Y}\|_n^2}{\|\mathbf{Y} - \Pi_{V\cup m}\mathbf{Y}\|_n^2},$

where Π_V refers to the orthogonal projection onto the space generated by the vectors (**X**_i)_{i∈V} and ‖·‖_n is the canonical norm in R^n. We define the test statistic φ_{m,α}(**Y**, **X**) as

(2.3)  $\phi_{m,\alpha}(\mathbf{Y},\mathbf{X}) = \phi_m(\mathbf{Y},\mathbf{X}) - \bar{F}_{D_m,N_m}^{-1}(\alpha),$

where F̄_{D_m,N_m}(u) denotes the probability that a Fisher variable with D_m and N_m degrees of freedom is larger than u. Let us consider a finite collection M of nonempty subsets of I \ V such that for each m ∈ M, N_m > 0. Our testing procedure consists of doing a Fisher test for each m ∈ M. We define {α_m, m ∈ M}, a suitable collection of numbers in ]0, 1[ (which possibly depends on **X**). For each m ∈ M, we do the Fisher test φ_m of level α_m of

H0: θ ∈ S_V   against the alternative   H1,m: θ ∈ S_{V∪m} \ S_V,

and we decide to reject the null hypothesis if one of those Fisher tests does. The main advantage of our procedure is that it is very flexible in the choice of the models m ∈ M and in the choice of the weights {α_m}. Consequently, if we choose a suitable collection M, the test is powerful over a large class of alternatives, as shown in Sections 3.3, 4 and 5. Finally, let us mention that our procedure easily extends to the case where the expectation of the random vector (Y, X) is unknown. Let **X̄** and **Ȳ** denote the projections of **X** and **Y** onto the unit vector **1**. Then one only has to apply the procedure to (**Y** − **Ȳ**, **X** − **X̄**) and to replace d by d + 1. The properties of the test remain unchanged, and one can adapt all the proofs at the price of more technicalities.
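For concreteness, the Fisher statistic (2.2) and the single-model test (2.3) can be sketched in a few lines of Python. This is our own illustrative sketch, not code from the paper; it assumes numpy and scipy are available:

```python
import numpy as np
from scipy.stats import f as fisher_dist

def fisher_statistic(Y, X, V, m):
    """phi_m of (2.2): compares the projections of Y onto
    span(X_V) and span(X_{V union m})."""
    n = len(Y)
    d, Dm = len(V), len(m)
    Nm = n - d - Dm
    assert Nm > 0, "need n > d + D_m"

    def proj(cols):
        # orthogonal projection of Y onto the span of the listed columns
        if not cols:
            return np.zeros_like(Y)
        A = X[:, cols]
        return A @ np.linalg.lstsq(A, Y, rcond=None)[0]

    pV, pVm = proj(V), proj(V + m)
    num = np.sum((pVm - pV) ** 2) / Dm
    den = np.sum((Y - pVm) ** 2) / Nm
    return num / den, Dm, Nm

def fisher_test(Y, X, V, m, alpha):
    """Single Fisher test (2.3): reject when phi_{m,alpha} > 0, i.e. when
    phi_m exceeds the upper alpha-quantile of the Fisher(D_m, N_m) law."""
    phi, Dm, Nm = fisher_statistic(Y, X, V, m)
    return phi - fisher_dist.ppf(1 - alpha, Dm, Nm) > 0
```

Under H0 ("θ_{I\V} = 0"), φ_m follows a Fisher distribution with D_m and N_m degrees of freedom, which is exactly what the quantile comparison in `fisher_test` exploits.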


2.3. Minimax rates of testing. In order to examine the quality of our tests, we will compare their performance with the minimax rates of testing. That is why we now define precisely what we mean by the (α, δ)-minimax rate of testing over a set Θ. We endow R^p with the norm

(2.4)  $\|\theta\|^2 := \theta^t\Sigma\theta = \mathrm{var}\Big(\sum_{i=1}^{p}\theta_i X_i\Big).$

As ε and X are independent, we derive from the definition of ‖·‖ that var(Y) = ‖θ‖² + var(Y|X). Let us remark that var(Y|X) does not depend on X. If θ varies, either the quantity var(Y) or var(Y|X) has to vary. In the sequel, we suppose that var(Y) is fixed. We briefly justify this choice in Section 4.2. Consequently, if ‖θ‖² increases, then var(Y|X) has to decrease so that the sum remains constant.

Let α be a number in ]0, 1[ and let δ be a number in ]0, 1 − α[ (typically small). For a given vector θ, matrix Σ and var(Y), we denote by P_θ the joint distribution of (Y, X). For the sake of simplicity, we do not emphasize the dependence of P_θ on var(Y) or Σ. Let ψ_α be a test of level α of the hypothesis "θ = 0" against the hypothesis "θ ∈ Θ \ {0}." In our framework, it is natural to measure the performance of ψ_α using the quantity ρ(ψ_α, Θ, δ, var(Y), Σ) defined by

$\rho(\psi_\alpha, \Theta, \delta, \mathrm{var}(Y), \Sigma) := \inf\Big\{\rho > 0:\ \inf_{\theta\in\Theta,\, r_{s/n}(\theta)\ge\rho^2} P_\theta(\psi_\alpha = 1) \ge 1-\delta\Big\},$

where the quantity

(2.5)  $r_{s/n}(\theta) := \frac{\|\theta\|^2}{\mathrm{var}(Y) - \|\theta\|^2}$

appears naturally, as it corresponds to the ratio ‖θ‖²/var(Y|X), which is the quantity of information brought by X (i.e., the signal) over the conditional variance of Y (i.e., the noise). We aim at describing the quantity

(2.6)  $\rho(\Theta, \alpha, \delta, \mathrm{var}(Y), \Sigma) := \inf_{\psi_\alpha} \rho(\psi_\alpha, \Theta, \delta, \mathrm{var}(Y), \Sigma),$

where the infimum is taken over all the level-α tests ψ_α. We call this quantity the (α, δ)-minimax rate of testing over Θ. A dual notion of this ρ function is the function β_Σ. For any Θ ⊂ R^p and α ∈ ]0, 1[, we denote by β_Σ(Θ) the quantity

$\beta_\Sigma(\Theta) := \inf_{\psi_\alpha}\sup_{\theta\in\Theta} P_\theta[\psi_\alpha = 0],$

where the infimum is taken over all level-α tests ψ_α and where we recall that Σ refers to the covariance matrix of X.
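The decomposition var(Y) = ‖θ‖² + var(Y|X) underlying (2.4) and (2.5) is easy to check numerically. The following is a quick illustrative sketch of ours (the matrix Σ, the vector θ and the noise variance are arbitrary choices, not values from the paper), assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 3, 200_000
# an arbitrary nonsingular covariance matrix and parameter vector
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.0]])
theta = np.array([0.5, -1.0, 0.25])
sigma2 = 2.0                      # var(Y | X) = var(epsilon)

norm2 = theta @ Sigma @ theta     # ||theta||^2 = theta^t Sigma theta, as in (2.4)
snr = norm2 / sigma2              # r_{s/n}(theta) of (2.5), since var(Y) = norm2 + sigma2

# Monte Carlo check of var(Y) = ||theta||^2 + var(Y|X)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Y = X @ theta + np.sqrt(sigma2) * rng.standard_normal(n)
print(norm2, snr, np.var(Y))
```

With these choices ‖θ‖² = 0.9125, so the empirical var(Y) comes out close to 2.9125, up to Monte Carlo error.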


2.4. Notation. We recall the main notation that we shall use throughout the paper. In the sequel, n stands for the number of independent observations, and p is the number of covariates. Moreover, **X**_V stands for the collection (**X**_i)_{i∈V} of the covariates that correspond to the null hypothesis, and d is the cardinality of the set V. The models m are subsets of I \ V, and we note D_m their cardinality. T_α stands for our testing procedure of level α. The statistic φ_m and the test φ_{m,α} are respectively defined in (2.2) and (2.3). Finally, the norm ‖·‖ is introduced in (2.4). For x, y ∈ R, we set x ∧ y := inf{x, y} and x ∨ y := sup{x, y}.

For any u ∈ R, F̄_{D,N}(u) denotes the probability that a Fisher variable with D and N degrees of freedom is larger than u. In the sequel, L, L1, L2, . . . denote constants that may vary from line to line. The notation L(·) specifies the dependency on some quantities. For the sake of simplicity, we only give the orders of magnitude in the results and we refer to the proofs for explicit constants.

3. The testing procedure.

3.1. Description of the procedure. Let us first fix some level α ∈ ]0, 1[. Throughout this paper, we suppose that n ≥ d + 2. Let us consider a finite collection M of nonempty subsets of I \ V such that for all m ∈ M, 1 ≤ D_m ≤ n − d − 1. We introduce the following test of level α. We reject H0: "θ ∈ S_V" when the statistic

(3.1)  $T_\alpha := \sup_{m\in\mathcal{M}}\big\{\phi_m(\mathbf{Y},\mathbf{X}) - \bar{F}_{D_m,N_m}^{-1}(\alpha_m(\mathbf{X}))\big\}$

is positive, where the collection of weights {α_m(**X**), m ∈ M} is chosen according to one of the two following procedures:

P1: The α_m's do not depend on **X** and satisfy the equality

(3.2)  $\sum_{m\in\mathcal{M}} \alpha_m = \alpha.$

P2: For all m ∈ M, α_m(**X**) = q_{**X**,α}, the α-quantile of the distribution of the random variable

(3.3)  $\inf_{m\in\mathcal{M}} \bar{F}_{D_m,N_m}\left(\frac{\|\Pi_{V\cup m}(\boldsymbol{\epsilon}) - \Pi_V(\boldsymbol{\epsilon})\|_n^2/D_m}{\|\boldsymbol{\epsilon} - \Pi_{V\cup m}(\boldsymbol{\epsilon})\|_n^2/N_m}\right)$

conditionally on **X**. Note that it is easy to compute the quantity q_{**X**,α}. Let **Z** be a standard Gaussian random vector of size n independent of **X**. As ε is independent of X, the distribution of (3.3) conditionally on **X** is the same as the distribution of

$\inf_{m\in\mathcal{M}} \bar{F}_{D_m,N_m}\left(\frac{\|\Pi_{V\cup m}(\mathbf{Z}) - \Pi_V(\mathbf{Z})\|_n^2/D_m}{\|\mathbf{Z} - \Pi_{V\cup m}(\mathbf{Z})\|_n^2/N_m}\right)$

conditionally on **X**. Hence, we can easily work out its quantile using a Monte Carlo method. Clearly, the computational complexity of the procedure is linear with respect to the size of the collection of models M, even when using procedure P2. Consequently, when we apply our procedure to high-dimensional data, as in Section 6 or in [23], we favor collections M whose size is linear with respect to the number of covariates p.

3.2. Comparison of procedures P1 and P2. We respectively write T¹_α and T²_α for the tests (3.1) associated with procedures P1 and P2. First, we are able to control the behavior of the test under the null hypothesis.

PROPOSITION 3.1. The test T¹_α corresponds to a Bonferroni procedure and therefore satisfies

$P_\theta(T_\alpha > 0) \le \sum_{m\in\mathcal{M}} \alpha_m \le \alpha,$

whereas the test T²_α has the property of being exactly of size α:

$P_\theta(T_\alpha > 0) = \alpha.$

The proof is given in the Appendix. Moreover, the test T²_α is more powerful than the corresponding test T¹_α defined with weights α_m = α/|M|.

PROPOSITION 3.2. For any parameter θ that does not belong to S_V, the procedure T¹_α with weights α_m = α/|M| and the procedure T²_α satisfy

(3.4)  $P_\theta\big(T^2_\alpha(\mathbf{X},\mathbf{Y}) > 0 \mid \mathbf{X}\big) \ge P_\theta\big(T^1_\alpha(\mathbf{X},\mathbf{Y}) > 0 \mid \mathbf{X}\big) \qquad \mathbf{X}\text{ a.s.}$
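Both calibrations can be made concrete in code. The following is a hypothetical sketch of ours (not the authors' implementation) of the test (3.1) under P1 with Bonferroni weights and under P2 with the Monte Carlo quantile of (3.3), assuming numpy and scipy:

```python
import numpy as np
from scipy.stats import f as fisher_dist

def _phi(Y, X, V, m):
    """Fisher statistic phi_m of (2.2) for the nested models V and V union m."""
    n = len(Y)
    d, Dm = len(V), len(m)
    Nm = n - d - Dm

    def proj(cols):
        if not cols:
            return np.zeros_like(Y)
        A = X[:, cols]
        return A @ np.linalg.lstsq(A, Y, rcond=None)[0]

    pV, pVm = proj(V), proj(V + m)
    return (np.sum((pVm - pV) ** 2) / Dm) / (np.sum((Y - pVm) ** 2) / Nm), Dm, Nm

def reject_P1(Y, X, V, models, alpha):
    """Procedure P1 with Bonferroni weights alpha_m = alpha/|M|: reject
    when some phi_m exceeds its Fisher quantile, i.e. (3.1) is positive."""
    a_m = alpha / len(models)
    return any(phi > fisher_dist.ppf(1 - a_m, Dm, Nm)
               for phi, Dm, Nm in (_phi(Y, X, V, m) for m in models))

def reject_P2(Y, X, V, models, alpha, n_mc=500, seed=0):
    """Procedure P2: estimate q_{X,alpha}, the alpha-quantile conditionally
    on X of the statistic (3.3), by simulating pure-noise responses Z."""
    rng = np.random.default_rng(seed)

    def inf_pvalue(y):
        # inf over m of F-bar(phi_m), the quantity inside (3.3)
        return min(fisher_dist.sf(phi, Dm, Nm)
                   for phi, Dm, Nm in (_phi(y, X, V, m) for m in models))

    sims = [inf_pvalue(rng.standard_normal(len(Y))) for _ in range(n_mc)]
    q = np.quantile(sims, alpha)
    # reject when the observed inf p-value falls below the calibrated quantile
    return inf_pvalue(Y) < q
```

Note that rejecting when the infimum of the per-model p-values is below q_{X,α} is equivalent to the statistic (3.1) being positive with the common weight α_m = q_{X,α}, which is why P2 is exactly of size α (up to Monte Carlo error here).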

Again, the proof is given in the Appendix. On the one hand, the choice of procedure P1 allows one to avoid the computation of the quantile q_{**X**,α} and possibly permits one to give a Bayesian flavor to the choice of the weights. On the other hand, procedure P2 is more powerful than the corresponding test with procedure P1. We will illustrate these considerations in Section 6. In Sections 3.3, 4 and 5 we study the power and rates of testing of T_α with procedure P1.

3.3. Power of the test. We aim at describing a set of vectors θ in R^p over which the test defined in Section 3 with procedure P1 is powerful. Since procedure P2 is more powerful than procedure P1 with α_m = α/|M|, the test with procedure P2 will also be powerful on this set of θ. Let α and δ be two numbers in ]0, 1[, and let {α_m, m ∈ M} be weights such that Σ_{m∈M} α_m ≤ α. We define hypothesis (HM) as follows:

(HM)  for all m ∈ M,  $\alpha_m \ge \exp(-N_m/10)$  and  $\delta \ge 2\exp(-N_m/21)$.

For typical choices of the collections M and {α_m, m ∈ M}, these conditions are fulfilled, as discussed in Sections 4 and 5. Let us now turn to the main result.


THEOREM 3.3. Let T_α be the test procedure defined by (3.1). We assume that n > d + 2 and that assumption (HM) holds. Then, P_θ(T_α > 0) ≥ 1 − δ for all θ belonging to the set

$\mathcal{F}_{\mathcal{M}}(\delta) := \left\{\theta \in \mathbb{R}^p,\ \exists m \in \mathcal{M}:\ \frac{\mathrm{var}(Y|X_V) - \mathrm{var}(Y|X_{V\cup m})}{\mathrm{var}(Y|X_{V\cup m})} \ge \Delta(m)\right\},$

where

(3.5)  $\Delta(m) := \left[L_1\sqrt{D_m\log\Big(\frac{2}{\alpha_m\delta}\Big)}\Big(1+\frac{D_m}{N_m}\Big) + L_2\log\Big(\frac{2}{\alpha_m\delta}\Big)\Big(1+2\log\Big(\frac{2}{\alpha_m\delta}\Big)\frac{D_m}{N_m}\Big)\right]\Big/(n-d).$

This result is similar to Theorem 1 in [3] in the fixed design regression framework, and the same comment also holds: under procedure P1, the test T_α has a power comparable to the best of the tests among the family {φ_{m,α}, m ∈ M}. Indeed, let us assume, for instance, that V = ∅ and that the α_m are chosen to be equal to α/|M|. The test T_α defined by (3.1) is equivalent to doing several tests of θ = 0 against θ ∈ S_m at level α_m for m ∈ M, and it rejects the null hypothesis if one of those tests does. From Theorem 3.3, we know that under hypothesis (HM) this test has a power greater than 1 − δ over the set of vectors θ belonging to $\bigcup_{m\in\mathcal{M}} \mathcal{F}_m(\delta, \alpha_m)$, where $\mathcal{F}_m(\delta, \alpha_m)$ is the set of vectors θ ∈ R^p such that

(3.6)  $\frac{\mathrm{var}(Y) - \mathrm{var}(Y|X_m)}{\mathrm{var}(Y|X_m)} \ge \frac{L(D_m, N_m)}{n}\left[\sqrt{D_m\log\Big(\frac{2}{\alpha_m\delta}\Big)} + \log\Big(\frac{2}{\alpha_m\delta}\Big)\right].$

The quantity L(D_m, N_m) behaves like a constant if the ratio D_m/N_m is bounded. Let us compare this result with the set of θ over which the Fisher test φ_{m,α} at level α has a power greater than 1 − δ. Applying Theorem 3.3, we know that it contains F_m(δ, α). Moreover, the following proposition shows that it is not much larger than F_m(δ, α):

PROPOSITION 3.4. Let δ ∈ ]0, 1 − α[. If

$\frac{\mathrm{var}(Y) - \mathrm{var}(Y|X_m)}{\mathrm{var}(Y|X_m)} \le L(\alpha,\delta)\,\frac{\sqrt{D_m}}{n},$

then P_θ(φ_{m,α} > 0) ≤ 1 − δ.

The proof is postponed to Section 8 and is based on a lower bound on the minimax rate of testing. F_m(δ, α) and F_m(δ, α_m), defined in (3.6), differ in that log(1/α) is replaced by log(1/α_m). For the main applications that we will study in Sections 4–6, the ratio log(1/α_m)/log(1/α) is of order log(n), log log(n), or k log(ep/k), where


k is a "small" integer. Thus, for each δ ∈ ]0, 1 − α[, the test based on T_α has a power greater than 1 − δ over a class of vectors which is close to $\bigcup_{m\in\mathcal{M}} \mathcal{F}_m(\delta, \alpha)$. It follows that for each θ ≠ 0 the power of this test under P_θ is comparable to the best of the tests among the family {φ_{m,α}, m ∈ M}.

In the next two sections, we use this theorem to establish rates of testing against different types of alternatives. First, we give an upper bound for the rate of testing θ = 0 against a class of θ for which many components equal 0. In Section 5, we study the rates of testing and simultaneous rates of testing θ = 0 against classes of ellipsoids. For the sake of simplicity, we will only consider the case V = ∅. Nevertheless, the procedure T_α defined in (3.1) applies in the same way when one considers a more complex null hypothesis, and the rates of testing are unchanged except that we have to replace n by n − d and var(Y) by var(Y|X_V).

4. Detecting nonzero coordinates. Let us fix an integer k between 1 and p. In this section, we are interested in testing θ = 0 against the class of θ with at most k nonzero components. This typically corresponds to the situation encountered when considering tests of neighborhoods for large sparse graphs. As the graph is assumed to be sparse, only a small number of neighbors are missing under the alternative hypothesis. For each pair of integers (k, p) with k ≤ p, let M(k, p) be the class of all subsets of I = {1, . . . , p} of cardinality k. The set Θ[k, p] stands for the subset of vectors θ ∈ R^p such that at most k coordinates of θ are nonzero. First, we define a test T_α of the form (3.1) with procedure P1, and we derive an upper bound for the rate of testing of T_α against the alternative θ ∈ Θ[k, p]. Then we show that this procedure is rate optimal when all the covariates are independent. Finally, we study the optimality of the test when k = 1 for some examples of covariance matrices Σ.

4.1. Rate of testing of T_α.
PROPOSITION 4.1. We consider the collection of models M = M(k, p). We use the test T_α under procedure P1, and we take the weights α_m all equal to α/|M|. Let us suppose that n satisfies

(4.1)  $n \ge L\left[\log\Big(\frac{2}{\alpha\delta}\Big) + k\log\Big(\frac{ep}{k}\Big)\right].$

Let us set the quantity

(4.2)  $\rho^2_{k,p,n} := L(\alpha,\delta)\,\frac{k\log(ep/k)}{n}.$

For any θ in Θ[k, p] such that $\frac{\|\theta\|^2}{\mathrm{var}(Y)-\|\theta\|^2} \ge \rho^2_{k,p,n}$, we have $P_\theta(T_\alpha > 0) \ge 1 - \delta$.
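As an illustration only (our sketch, not the authors' code), the test of Proposition 4.1 amounts to a Bonferroni-corrected scan of all size-k submodels, assuming numpy and scipy:

```python
import itertools
import numpy as np
from scipy.stats import f as fisher_dist

def sparse_alternative_test(Y, X, k, alpha):
    """Test theta = 0 against Theta[k, p] (at most k nonzero coordinates):
    procedure P1 on the collection M(k, p) of all size-k subsets of
    {1, ..., p}, with Bonferroni weights alpha_m = alpha / |M(k, p)|."""
    n, p = X.shape
    models = list(itertools.combinations(range(p), k))
    a_m = alpha / len(models)
    Nm = n - k                       # d = 0 since V is empty
    thresh = fisher_dist.ppf(1 - a_m, k, Nm)
    for m in models:
        A = X[:, list(m)]
        fitted = A @ np.linalg.lstsq(A, Y, rcond=None)[0]
        # phi_m of (2.2) with Pi_V Y = 0 (empty V)
        phi = (np.sum(fitted ** 2) / k) / (np.sum((Y - fitted) ** 2) / Nm)
        if phi > thresh:             # statistic (3.1) is positive
            return True
    return False
```

Note that |M(k, p)| = C(p, k) grows quickly with k, which is why Section 3.1 favors collections whose size is linear in p; the exhaustive scan above is only practical for small k and p.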


We recall that the norm ‖·‖ is defined in (2.4) and satisfies ‖θ‖² = var(Y) − var(Y|X). This proposition easily follows from Theorem 3.3 and its proof is given in Section 7. Note that the upper bound does not directly depend on the covariance matrix of the vector X. Moreover, hypothesis (4.1) corresponds to the minimal assumption needed for consistency and type-oracle inequalities in the estimation setting, as pointed out by Wainwright ([24], Theorem 2) and Giraud ([10], Section 3.1). Hence, we conjecture that hypothesis (4.1) is minimal for Proposition 4.1 to hold. We will further discuss the bound (4.2) after deriving lower bounds for the minimax rate of testing.

4.2. Minimax lower bounds for independent covariates. In the statistical framework considered here, the problem of giving minimax rates of testing under no prior knowledge of the covariance of X and of var(Y) is open. This is why we shall only derive lower bounds when var(Y) and the covariance matrix of X are known. In this section, we give nonasymptotic lower bounds for the (α, δ)-minimax rate of testing over the set Θ[k, p] when the covariance matrix of X is the identity matrix (except in Proposition 4.2). As these bounds coincide with the upper bound obtained in Section 4.1, this will show that our test T_α is rate optimal. We first give a lower bound for the (α, δ)-minimax rate of detection of all p nonzero coordinates, for any covariance matrix Σ.

PROPOSITION 4.2. Let us suppose that var(Y) is known. Let us set ρ²_{p,n} such that

(4.3)  $\rho^2_{p,n} := L(\alpha,\delta)\,\frac{\sqrt{p}}{n}.$

Then for all ρ < ρ_{p,n},

$\beta_\Sigma\left(\left\{\theta \in \Theta[p,p],\ \frac{\|\theta\|^2}{\mathrm{var}(Y) - \|\theta\|^2} = \rho^2\right\}\right) \ge \delta,$

where we recall that Σ is the covariance matrix of X.

If n ≥ (1 + γ)p for some γ > 0, Theorem 3.3 shows that the test φ_{I,α} defined in (2.3) has power greater than 1 − δ over the vectors θ that satisfy

$\frac{\|\theta\|^2}{\mathrm{var}(Y) - \|\theta\|^2} \ge L(\gamma,\alpha,\delta)\,\frac{\sqrt{p}}{n}.$

Hence, √p/n is the minimax rate of testing Θ[p, p], at least when the number of observations is larger than the number of covariates. This is coherent with the minimax rate obtained in the fixed design framework (e.g., [2]). When p becomes larger, we do not think that the lower bound given in Proposition 4.2 remains sharp. Note that this minimax rate of testing holds for any covariance matrix Σ, contrary to Theorem 4.3. We now turn to the lower bound for the (α, δ)-minimax rate of testing against θ ∈ Θ[k, p].

THEOREM 4.3. Let us set ρ²_{k,p,n} such that

(4.4)  $\rho^2_{k,p,n} := L(\alpha,\delta)\,\frac{k}{n}\log\left(1 + \frac{p}{k^2} + \frac{\sqrt{p}}{k^2}\right).$

We suppose that the covariance of X is the identity matrix I. Then, for all ρ < ρ_{k,p,n},

$\beta_I\left(\left\{\theta \in \Theta[k,p],\ \frac{\|\theta\|^2}{\mathrm{var}(Y) - \|\theta\|^2} = \rho^2\right\}\right) > \delta,$

where the quantity var(Y) is known. If α + δ ≤ 53%, then one has

$\rho^2_{k,p,n} \ge \frac{k}{2n}\log\left(1 + \frac{p}{k^2} \vee \frac{\sqrt{p}}{k^2}\right).$

This result implies the following lower bound for the minimax rate of testing:

$\rho^2(\Theta[k,p], \alpha, \delta, \mathrm{var}(Y), I) \ge \rho^2_{k,p,n}.$

The proof is given in Section 8. At the price of more technicalities, it is possible to prove that the lower bound still holds if the variables (X_i) are independent with known, possibly different, variances. Theorem 4.3 approximately recovers the lower bounds for the minimax rates of testing in the signal detection framework obtained by Baraud [2]. The main difference lies in the fact that we suppose var(Y) known, which in the signal detection framework translates into knowing the quantity ‖f‖² + σ². We are now in position to compare the results of Proposition 4.1 and Theorem 4.3. We distinguish between the values of k:

• When k ≤ p^γ for some γ < 1/2, if n is large enough to satisfy the assumption of Proposition 4.1, the quantities defined in (4.2) and (4.4) are both of the order k log(p)/n, times a constant (which depends on γ, α and δ). This shows that the lower bound given in Theorem 4.3 is sharp. Additionally, in this case, the procedure T_α defined in Proposition 4.1 approximately achieves the minimax rate of testing. We recall that our procedure T_α does not depend on the knowledge of var(Y) and corr(X). In applications, a small k typically corresponds to testing a Gaussian graphical model with respect to a graph G when the number of nodes is large and the graph is supposed to be sparse. When n does not satisfy the assumption of Proposition 4.1, we believe that our lower bound is no longer sharp.

• When √p ≤ k ≤ p, the lower bound and the upper bound no longer coincide. Nevertheless, if n ≥ (1 + γ)p for some γ > 0, Theorem 3.3 shows that the test φ_{I,α} defined in (2.3) has power greater than 1 − δ over the vectors θ that satisfy

(4.5)  $\frac{\|\theta\|^2}{\mathrm{var}(Y) - \|\theta\|^2} \ge L(\gamma,\alpha,\delta)\,\frac{\sqrt{p}}{n}.$


This upper bound and the lower bound do not depend on k. Here again, the lower bound obtained in Theorem 4.3 is sharp, and the test φ_{I,α} defined previously is rate optimal. The fact that the rate of testing stabilizes around √p/n for k > √p also appears in signal detection, and there is a discussion of this phenomenon in [2].

• When k < √p and k is close to √p, the lower bound and the upper bound given by Proposition 4.1 differ by at most a log(p) factor. For instance, if k is of order √p/log(p), the lower bound in Theorem 4.3 is of order √p log log(p)/log(p), and the upper bound is of order √p. We do not know whether either of these bounds is sharp, nor whether the minimax rates of testing coincide when var(Y) is fixed and when it is not.

All in all, the minimax rates of testing exhibit the same range of rates in our framework as in signal detection [2] when the covariates are independent. Moreover, this implies that the minimax rate of testing is slower when the (X_i)_{i∈I} are independent than for any other form of dependence. Indeed, the upper bounds obtained in Proposition 4.1 and in (4.5) do not depend on the covariance of X. Then a natural question arises: is the test statistic T_α rate optimal for other correlations of X? We will partially answer this question when testing against the alternative θ ∈ Θ[1, p].

4.3. Minimax rates for dependent covariates. In this section, we look for the minimax rate of testing θ = 0 against θ ∈ Θ[1, p] when the covariates X_i are no longer independent. We know that this rate lies between 1/n, which is the minimax rate of testing when we know which coordinate is nonzero, and log(p)/n, the minimax rate of testing for independent covariates.

PROPOSITION 4.4. Let us suppose that there exists a positive number c such that for any i ≠ j,

$|\mathrm{corr}(X_i, X_j)| \le c,$

and that α + δ ≤ 53%. We define ρ²_{1,p,n,c} as

(4.6)  $\rho^2_{1,p,n,c} := \frac{L}{n}\left[\log(p) \wedge \frac{1}{c}\right].$

Then for any ρ < ρ_{1,p,n,c},

$\beta_\Sigma\left(\left\{\theta \in \Theta[1,p],\ \frac{\|\theta\|^2}{\mathrm{var}(Y) - \|\theta\|^2} = \rho^2\right\}\right) \ge \delta,$

where Σ refers to the covariance matrix of X.


REMARK. If the correlation between the covariates is smaller than 1/log(p), then the minimax rate of testing is of the same order as in the independent case. If the correlation between the covariates is larger, we show in the following proposition that, under an additional assumption, the rate is faster.

PROPOSITION 4.5. Let us suppose that the correlation between X_i and X_j is exactly c > 0 for any i ≠ j. Moreover, we assume that n satisfies the following condition:

(4.7) n ≥ L [1 + log(p/(αδ))].

Let us introduce the random variable X_{p+1} := (1/p) Σ_{i=1}^p X_i/√var(X_i). If α < 60% and δ < 60%, the test Tα defined by

Tα = [sup_{1≤i≤p} φ_{{i},α/(2p)}] ∨ φ_{{p+1},α/2}

satisfies P_0(Tα > 0) ≤ α and P_θ(Tα > 0) ≥ 1 − δ for any θ in Θ[1, p] such that

‖θ‖²/(var(Y) − ‖θ‖²) ≥ (L(α, δ)/n) [log p ∧ (1/c)].

Consequently, when the correlation between X_i and X_j is a positive constant c, the minimax rate of testing is of order [log(p) ∧ (1/c)]/n. When the correlation coefficient c is small, the minimax rate of testing coincides with the independent case, and when c is larger those rates differ. Therefore, the test Tα defined in Proposition 4.1 is not rate optimal when the correlation is known and large. Indeed, when the correlation between the covariates is large, the test statistics φ_{{m},α_m} defining Tα are highly correlated. The choice of the weights α_m in procedure P1 corresponds to a Bonferroni procedure, which is precisely known to behave badly when the tests are positively correlated. This example illustrates the limits of procedure P1. However, it is not very realistic to suppose that the covariates have a constant correlation, for instance, when one considers a GGM. Indeed, we expect the correlation between two covariates to be large if they are neighbors in the graph and smaller if they are far apart (w.r.t. the graph distance). This is why we derive lower bounds on the rate of testing for other kinds of correlation matrices often used to model stationary processes.

PROPOSITION 4.6. Let X_1, ..., X_p form a stationary process on the one-dimensional torus. More precisely, the correlation between X_i and X_j is a function of |i − j|_p, where |·|_p refers to the toroidal distance defined by

|i − j|_p := (|i − j|) ∧ (p − |i − j|).
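The toroidal distance and the two stationary correlation structures considered below are easy to instantiate; the following sketch (our own illustration, with hypothetical helper names) builds the correlation matrices Σ1(w) and Σ2(t) of Proposition 4.6:

```python
import math

def toroidal_distance(i, j, p):
    """Toroidal distance |i - j|_p = min(|i - j|, p - |i - j|)."""
    d = abs(i - j)
    return min(d, p - d)

def corr_matrix_exponential(p, w):
    """Sigma_1(w): corr(X_i, X_j) = exp(-w |i - j|_p), for w > 0."""
    return [[math.exp(-w * toroidal_distance(i, j, p)) for j in range(p)]
            for i in range(p)]

def corr_matrix_polynomial(p, t):
    """Sigma_2(t): corr(X_i, X_j) = (1 + |i - j|_p)^(-t), for t > 0."""
    return [[(1.0 + toroidal_distance(i, j, p)) ** (-t) for j in range(p)]
            for i in range(p)]

print(toroidal_distance(0, 7, 10))  # distance on a torus of size 10
```

Both matrices are symmetric with unit diagonal by construction, as required of correlation matrices of a stationary process on the torus.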


Σ1(w) and Σ2(t), respectively, refer to the correlation matrices of X such that

corr(X_i, X_j) = exp(−w|i − j|_p), where w > 0,
corr(X_i, X_j) = (1 + |i − j|_p)^{−t}, where t > 0.

Let us set ρ²_{1,p,n,Σ1}(w) and ρ²_{1,p,n,Σ2}(t) such that

ρ²_{1,p,n,Σ1}(w) := (1/n) log(1 + L(α, δ) p (1 − e^{−w})/(1 + e^{−w})),

ρ²_{1,p,n,Σ2}(t) :=
  (1/n) log(1 + L(α, δ) p(t − 1)/(t + 1)),      if t > 1,
  (1/n) log(1 + L(α, δ) p/(1 + 2 log(p − 1))),   if t = 1,
  (1/n) log(1 + L(α, δ) p^t 2^{−t} (1 − t)),     if 0 < t < 1.

Then for any ρ² < ρ²_{1,p,n,Σ1}(w),

β_{Σ1(w)}({θ ∈ Θ[1, p], ‖θ‖²/(var(Y) − ‖θ‖²) = ρ²}) ≥ δ,

and for any ρ² < ρ²_{1,p,n,Σ2}(t),

β_{Σ2(t)}({θ ∈ Θ[1, p], ‖θ‖²/(var(Y) − ‖θ‖²) = ρ²}) ≥ δ.

If the range w is larger than 1/p^γ, or if the exponent t is larger than γ for some γ < 1, these lower bounds are of order log(p)/n. As a consequence, for any of these correlation models the minimax rate of testing is of the same order as the minimax rate of testing for independent covariates. This means that our test Tα defined in Proposition 4.1 is rate optimal for these correlation matrices. However, if w is smaller than 1/p, or if t is smaller than 1/log(p), we recover the parametric rate 1/n, which is achieved by the test φ_{{p+1},α}. This comes from the fact that the correlation corr(X_1, X_i) does not converge to zero for such choices of w or t. We omit the details since the arguments are similar to those in the proof of Proposition 4.5.

To conclude, when k ≤ p^γ (for γ ≤ 1/2), the test Tα defined in Proposition 4.1 is approximately (α, δ)-minimax against the alternative θ ∈ Θ[k, p] when neither var(Y) nor the covariance matrix of X is fixed. Indeed, the rate of testing of Tα coincides (up to a constant) with the supremum of the minimax rates of testing on Θ[k, p] over all possible covariance matrices Σ:

ρ(Θ[k, p], α, δ) := sup_{var(Y)>0, Σ>0} ρ(Θ[k, p], α, δ, var(Y), Σ),


where the supremum is taken over all positive var(Y) and every positive definite matrix Σ. When k ≥ √p and when n ≥ (1 + γ)p (for γ > 0), the test defined in (4.5) has the same behavior. However, our procedure does not adapt to Σ; for some correlation matrices (as shown, for instance, in Proposition 4.5), Tα with procedure P1 is not rate optimal. Nevertheless, we believe, and will illustrate in Section 6, that procedure P2 slightly improves the power of the test when the covariates are correlated.

5. Rates of testing on “ellipsoids” and adaptation. In this section, we define tests Tα of the form (3.1) in order to test simultaneously θ = 0 against θ belonging to some classes of ellipsoids. We study their rates and show that they are optimal, sometimes at the price of a log p factor. For any nonincreasing sequence (a_i)_{1≤i≤p+1} such that a_1 = 1 and a_{p+1} = 0, and any R > 0, we define the ellipsoid Ea(R) by

(5.1) Ea(R) := {θ ∈ R^p, Σ_{i=1}^p [var(Y|X_{m_{i−1}}) − var(Y|X_{m_i})]/a_i² ≤ R² var(Y|X)},

where m_i refers to the set {1, ..., i} and m_0 = ∅. Let us explain why we call this set an ellipsoid. Assume, for instance, that the (X_i) are independent identically distributed with variance one. In this case, the difference var(Y|X_{m_{i−1}}) − var(Y|X_{m_i}) equals |θ_i|², and the definition of Ea(R) translates into

Ea(R) = {θ ∈ R^p, Σ_{i=1}^p |θ_i|²/a_i² ≤ R² var(Y|X)}.

The main difference between this definition and the classical definition of an ellipsoid in the fixed design regression framework (as, for instance, in [2]) is the presence of the term var(Y|X). We added this quantity in order to be able to derive lower bounds on the minimax rate. If the X_i are not i.i.d. with unit variance, it is always possible to create a sequence X'_i of i.i.d. standard Gaussian variables by orthogonalizing the X_i with the Gram–Schmidt process. If we call θ' the vector in R^p such that Xθ = X'θ', it is straightforward to show that var(Y|X_{m_{i−1}}) − var(Y|X_{m_i}) = |θ'_i|². We can then express Ea(R) using the coordinates of θ' as previously:

Ea(R) = {θ ∈ R^p, Σ_{i=1}^p |θ'_i|²/a_i² ≤ R² var(Y|X)}.
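The identity var(Y|X_{m_{i−1}}) − var(Y|X_{m_i}) = |θ'_i|² can be checked numerically on a toy example. The sketch below (our own illustration, not part of the paper's procedure; it assumes numpy and uses our helper names) computes Gaussian conditional variances via Schur complements, and obtains θ' from the Cholesky factor of Σ, which plays the role of the Gram–Schmidt orthogonalization:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)          # covariance of X (positive definite)
theta = rng.standard_normal(p)
sigma2 = 1.0                              # var(Y | X)
var_Y = theta @ Sigma @ theta + sigma2    # var(Y) = theta' Sigma theta + var(Y|X)
cov_YX = Sigma @ theta                    # cov(Y, X_j)

def cond_var(S):
    """var(Y | X_S) via the Schur complement of the joint Gaussian covariance."""
    S = list(S)
    if not S:
        return var_Y
    C = Sigma[np.ix_(S, S)]
    b = cov_YX[S]
    return var_Y - b @ np.linalg.solve(C, b)

# With X = L Z (L lower-triangular Cholesky, Z standard i.i.d.),
# X theta = Z (L^T theta), so theta' = L^T theta.
L = np.linalg.cholesky(Sigma)
theta_prime = L.T @ theta

drops = [cond_var(range(i)) - cond_var(range(i + 1)) for i in range(p)]
# Each successive drop of conditional variance equals |theta'_i|^2.
print(np.allclose(drops, theta_prime ** 2))
```

The Cholesky trick works because span(X_1, ..., X_i) = span(Z_1, ..., Z_i) for each i, exactly the property the Gram–Schmidt construction in the text relies on.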

The main advantage of Definition 5.1 is that it does not directly depend on the covariance of X. In the sequel, we also consider the special case of ellipsoids with polynomial decay,

(5.2) Es(R) := {θ ∈ R^p, Σ_{i=1}^p [var(Y|X_{m_{i−1}}) − var(Y|X_{m_i})]/(i^{−2s} var(Y|X)) ≤ R²},


where s > 0 and R > 0. First, we define two test procedures of the form (3.1) and evaluate their power on the ellipsoids Ea(R) and on the ellipsoids Es(R), respectively. Then we give some lower bounds for the (α, δ)-simultaneous minimax rates of testing. Extensions to more general l_p balls with 0 < p < 2 are possible, at the price of more technicalities, by adapting the results of Section 4 in Baraud [2]. These alternatives correspond to the situation where we are given an order of relevance on the covariates that are not in the null hypothesis. This order could either be provided by prior knowledge of the model or by a model selection algorithm such as LARS (least angle regression) introduced by Efron et al. [9]. We apply this last method to build a collection of models for our testing procedure (3.1) in [23].

5.1. Simultaneous rates of testing of Tα over classes of ellipsoids. First, we define a procedure of the form (3.1) in order to test θ = 0 against θ belonging to any of the ellipsoids Ea(R). For any x > 0, [x] denotes the integer part of x. We choose the class of models M and the weights α_m as follows:

• If n < 2p, we take the set M to be {m_k, 1 ≤ k ≤ [n/2]}, and all the weights α_m are equal to α/|M|.
• If n ≥ 2p, we take the set M to be {m_k, 1 ≤ k ≤ p}; α_{m_p} equals α/2 and, for any k between 1 and p − 1, α_{m_k} is chosen to be α/(2(p − 1)).

As previously, we bound the power of the test Tα from a nonasymptotic point of view.

PROPOSITION 5.1.

Let us assume that

(5.3) n ≥ L [1 + log(1/(αδ))].

For any ellipsoid Ea(R), the test Tα defined by (3.1) with procedure P1 and with the class of models given just above satisfies P_0(Tα ≤ 0) ≥ 1 − α, and P_θ(Tα > 0) ≥ 1 − δ for all θ ∈ Ea(R) such that

(5.4) ‖θ‖²/(var(Y) − ‖θ‖²) ≥ L(α, δ) inf_{1≤i≤[n/2]} [a²_{i+1} R² + √i log(n)/n] if n < 2p,

or

(5.5) ‖θ‖²/(var(Y) − ‖θ‖²) ≥ L(α, δ) {inf_{1≤i≤p−1} [a²_{i+1} R² + √i log(p)/n] ∧ √p/n} if n ≥ 2p.

All in all, for large values of n, the rate of testing is of order sup_{1≤i≤p} [a_i² R² ∧ √i log(p)/n].

We show in the next subsection that the minimax rate of testing for an ellipsoid is of order

sup_{1≤i≤p} [a_i² R² ∧ √i/n].

Moreover, we prove in Proposition 5.6 that a loss of a √(log log p) factor is unavoidable if one considers the simultaneous minimax rates of testing over a family of nested ellipsoids. Nevertheless, we do not know whether the log(p) term is optimal for testing simultaneously against all the ellipsoids Ea(R), for all sequences (a_i) and all R > 0. When n is smaller than 2p, we obtain comparable results, except that we are unable to consider alternatives of large dimension in the infimum (5.5). We now turn to defining a procedure of the form (3.1) in order to test simultaneously θ = 0 against θ belonging to any of the Es(R). For this, we introduce the following collection of models M and weights α_m:

• If n < 2p, we take the set M to be {m_k, k ∈ {2^j, j ≥ 0} ∩ {1, ..., [n/2]}}, and all the weights α_m are chosen to be α/|M|.
• If n ≥ 2p, we take the set M to be {m_k, k ∈ ({2^j, j ≥ 0} ∩ {1, ..., p}) ∪ {p}}; α_{m_p} equals α/2 and, for any other k in the collection between 1 and p − 1, α_{m_k} is chosen to be α/(2(|M| − 1)).

PROPOSITION 5.2.

Let us assume that

(5.6) n ≥ L [1 + log(1/(αδ))]

and that R² ≥ √(log log n)/n. For any s > 0, the test procedure Tα defined by (3.1) with procedure P1 and with the class of models given just above satisfies P_0(Tα ≤ 0) ≥ 1 − α, and P_θ(Tα > 0) ≥ 1 − δ for any θ ∈ Es(R) such that

(5.7) ‖θ‖²/(var(Y) − ‖θ‖²) ≥ L(α, δ) [R^{2/(1+4s)} (√(log log n)/n)^{4s/(1+4s)} + R² (n/2)^{−2s} + log log n/n] if n < 2p,

or

(5.8) ‖θ‖²/(var(Y) − ‖θ‖²) ≥ L(α, δ) {[R^{2/(1+4s)} (√(log log p)/n)^{4s/(1+4s)} + log log p/n] ∧ √p/n} if n ≥ 2p.


Again, we retrieve results similar to those of Corollary 2 in [3] in the fixed design regression framework. For s > 1/4 and n < 2p, the rate of testing is of order (√(log log n)/n)^{4s/(1+4s)}. We show in the next subsection that the logarithmic factor is due to the adaptive property of the test. If s ≤ 1/4, the rate is of order n^{−2s}. When n ≥ 2p, the rate is of order (√(log log p)/n)^{4s/(1+4s)} ∧ (√p/n), and we mention at the end of the next subsection that it is optimal. Here again, it is possible to define these tests with procedure P2 in order to improve the power of the test (see Section 6 for numerical results).

5.2. Minimax lower bounds. We first establish the (α, δ)-minimax rate of testing over an ellipsoid when the variance of Y and the covariance matrix of X are known.

PROPOSITION 5.3. Let us set the sequence (a_i)_{1≤i≤p+1} and the positive number R. We introduce

(5.9) ρ²_{a,n}(R) := sup_{1≤i≤p} [ρ²_{i,n} ∧ a_i² R²],

where ρ²_{i,n} is defined by (4.3); then for any nonsingular covariance matrix Σ we have

β_Σ({θ ∈ Ea(R), ‖θ‖²/(var(Y) − ‖θ‖²) ≥ ρ²_{a,n}(R)}) ≥ δ,

where the quantity var(Y) is fixed. If α + δ ≤ 47%, then

ρ²_{a,n}(R) ≥ sup_{1≤i≤p} [√i/n ∧ a_i² R²].

This lower bound is once more analogous to the one in the fixed design regression framework. Contrary to the lower bounds obtained in the previous section, it does not depend on the covariance of the covariates. We now look for an upper bound on the minimax rate of testing over a given ellipsoid. First, we need to define the quantity D* as

D* := inf{1 ≤ i ≤ p, a_i² R² ≤ √i/n},

with the convention that inf ∅ = p.

PROPOSITION 5.4. Let us assume that n ≥ L log[1 + log(1/(αδ))], that R² > 1/n and that D* ≤ n/2. The test φ_{m_{D*},α} defined by (2.3) satisfies

P_0[φ_{m_{D*},α} = 1] ≤ α and P_θ[φ_{m_{D*},α} = 0] ≤ δ

for all θ ∈ Ea(R) such that

‖θ‖²/(var(Y) − ‖θ‖²) ≥ L(α, δ) sup_{1≤i≤p} [a_i² R² ∧ √i/n].
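The threshold dimension D* of Proposition 5.4 is a simple first-crossing index and can be computed directly; a short sketch (ours, with 1-based indexing as in the text):

```python
import math

def D_star(a, R, n):
    """D* = inf{1 <= i <= p : a_i^2 R^2 <= sqrt(i)/n}, with inf(emptyset) = p.

    `a` is the sequence (a_1, ..., a_p).
    """
    p = len(a)
    for i in range(1, p + 1):
        if a[i - 1] ** 2 * R ** 2 <= math.sqrt(i) / n:
            return i
    return p

# Example with polynomial decay a_i = i^{-s}, as for the ellipsoids Es(R).
a = [i ** -0.5 for i in range(1, 101)]
print(D_star(a, R=1.0, n=50))
```

For this example the condition reads 1/i ≤ √i/50, i.e. i ≥ 50^{2/3} ≈ 13.6, so the first crossing is at i = 14.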

If n ≥ 2D*, the rates of testing on an ellipsoid are analogous to the rates on an ellipsoid in the fixed design regression framework (see, for instance, [2]). If D* is large and n is small, the bounds in Propositions 5.3 and 5.4 do not coincide. In this case, we do not know whether this comes from the fact that the test in Proposition 5.4 does not depend on the knowledge of var(Y), or whether one of the bounds in Propositions 5.3 and 5.4 is not sharp. We are now interested in computing lower bounds for the rates of testing simultaneously over a family of ellipsoids, in order to compare them with the rates obtained in Section 5.1. First, we need a lower bound on the minimax simultaneous rate of testing over nested linear spaces. We recall that for any D ∈ {1, ..., p}, S_{m_D} stands for the linear space of vectors θ such that only their D first coordinates are possibly nonzero. For D ≥ 2, let us set

ρ̄²_{D,n} := L(α, δ) √(D log log(D + 1))/n.

PROPOSITION 5.5. The following lower bound holds:

(5.10) β_I(∪_{1≤D≤p} {θ ∈ S_{m_D}, ‖θ‖²/(var(Y) − ‖θ‖²) = r_D²}) ≥ δ

if, for all D between 1 and p, r_D ≤ ρ̄_{D,n}. Using this proposition, it is possible to get a lower bound on the simultaneous rate of testing over a family of nested ellipsoids.

PROPOSITION 5.6.

We fix a sequence (a_i)_{1≤i≤p+1}. For each R > 0, let us set

(5.11) ρ̄²_{a,R,n} := sup_{1≤D≤p} [ρ̄²_{D,n} ∧ (R² a_D²)],

where ρ̄_{D,n} is given by (5.10). Then, for any nonsingular covariance matrix Σ of the vector X,

β_Σ(∪_{R>0} {θ ∈ Ea(R), ‖θ‖²/(var(Y) − ‖θ‖²) = ρ̄²_{a,R,n}}) ≥ δ.


This proposition shows that adaptation is impossible in this setting: no test can be simultaneously minimax over a class of nested ellipsoids (indexed by R > 0). This is also the case in fixed design, as proved by [22] for Besov bodies. The loss of a term of order √(log log p)/n is unavoidable. As a special case of Proposition 5.6, it is possible to compute a lower bound for the simultaneous minimax rate over Es(R), where R ranges over the positive numbers. After some calculation, we find a lower bound of order

(√(log log p)/n)^{4s/(1+4s)} ∧ (√(p log log p)/n).

This shows that the power of the test Tα obtained in (5.8) for n ≥ 2p is optimal when R² ≥ √(log log n)/n. However, when n < 2p and s ≤ 1/4, we do not know whether the rate n^{−2s} is optimal or not. To conclude, when n ≥ 2p, the test Tα defined in Proposition 5.2 achieves the simultaneous minimax rate over the classes of ellipsoids Es(R). On the other hand, the test Tα defined in Proposition 5.1 is not rate optimal simultaneously over all the ellipsoids Ea(R) and suffers a loss of a √(log p) factor even when n ≥ 2p.

6. Simulation studies. The purpose of this simulation study is threefold. First, we illustrate the theoretical results established in the previous sections. Second, we show that our procedure is easy to implement for different choices of collections M and is computationally feasible even when p is large. Our third purpose is to compare the efficiency of procedures P1 and P2. Indeed, for a given collection M, we know from Section 3.2 that the test (3.1) based on procedure P2 is more powerful than the corresponding test based on P1. However, the computation of the quantity q_{X,α} is possibly time consuming, and we therefore want to know whether the benefit in power is worth the computational burden. To our knowledge, when the number of covariates p is larger than the number of observations n, there is no other test with which we can compare our procedure.

6.1.
Simulation experiments. We consider the regression model (1.1) with I = {1, . . . , p} and test the null hypothesis “θ = 0” which is equivalent to “Y is independent of X” at level α = 5%. Let (Xi )1≤i≤p be a collection of p Gaussian variables with unit variance. The random variable is defined as fol p lows: Y = i=1 θi Xi + ε where ε is a zero mean Gaussian variable with variance 1 − θ2 independent of X. We consider two simulation experiments described below: 1. First simulation experiment: The correlation between Xi and Xj is a constant c for any i = j . Moreover, in this experiment the parameter θ is chosen such that only one of its components is possibly nonzero. This corresponds to the situation considered in Section 4. First, the number of covariates p is fixed

725

TESTS FOR HIGH-DIMENSIONAL MODELS

equal to 30, and the number of observations n is taken equal to 10 and 15. We choose for c three different values 0, 0.1 and 0.8, allowing thus the comparison of the procedures for independent, weakly and highly correlated covariates. We estimate the size of the test by taking θ1 = 0 and the power by taking for θ1 the values 0.8 and 0.9. Theses choices of θ lead to a small and a large signal/noise ratio rs/n defined in (2.5) and equal in this experiment to θ12 /(1 − θ12 ). Second, we examine the behavior of the tests when p increases and when the covariates are highly correlated: p equals 100 and 500, n equals 10 and 15, θ1 is set to 0 and 0.8, and c is chosen to be 0.8. 2. Second simulation experiment: The covariates (Xi )1≤i≤p are independent. The number of covariates p equals 500 and the number of observations n equals 50 and 100. We set for any i ∈ {1, . . . , p}, θi = Ri −s . We estimate the size of the test by taking R = 0 and the power by taking for (R, s) the value (0.2, 0.5) which corresponds to a slow decrease of the (θi )1≤i≤p . It was pointed out in the beginning of Section 5 that |θi |2 equals var(Y |Xmi−1 ) − var(Y |Xmi ). Thus |θi |2 represents the benefit in term of conditional variance brought by the variable Xi . We use our testing procedure defined in (3.1) with different collections M and different choices for the weights {αm , m ∈ M}. The collections M: we define three classes. Let us set Jn,p = p ∧ [ n2 ] where [x] denotes the integer part of x, and let us define 



M1 := {i}, 1 ≤ i ≤ p , 



M2 := mk = {1, 2, . . . , k}, 1 ≤ k ≤ Jn,p , 



M3 := mk = {1, 2, . . . , k}, k ∈ {2j , j ≥ 0} ∩ {1, . . . , Jn,p } .

We evaluate the performance of our testing procedure with M = M1 in the first simulation experiment and M = M2 and M3 in the second simulation experiment. The cardinality of these three collections is smaller than p, and the computational complexity of the testing procedures is at most linear in p. The collections {αm , m ∈ M}: We consider procedures P1 and P2 defined in Section 3. When we are using the procedure P1 , the αm s equal α/|M| where |M| denotes the cardinality of the collection M. The quantity qX,α that occurs in the procedure P2 is computed by simulation. We use 1000 simulations for the estimation of qX,α . In the sequel we note TMi ,Pj , the test (3.1), with collection Mi and procedure Pj . In the first experiment, when p is large, we also consider two other tests: 1. The first test is φ{1},α (2.3) of the hypothesis θ1 = 0 against the alternative θ1 = 0. This test corresponds to the single test when we know which coordinate is nonzero.

726

N. VERZELEN AND F. VILLERS

p

2. The second test is φ{p+1},α where Xp+1 := p1 i=1 Xi . Adapting the proof of Proposition 4.5, we know that this test is approximately minimax on [1, p] if the correlation between the covariates is constant and large. Contrary to our procedures, these two tests are based on the knowledge of var(X) (and eventually θ ). We only use them as a benchmark to evaluate the performance of our procedure. We aim at showing that our test with procedure P2 is as powerful as φ{p+1},α and is close to the test φ{1},α . We estimate the size and the power of the testing procedures with 1000 simulations. For each simulation, we simulate the gaussian vector (X1 , . . . , Xp ) and then simulate the variable Y as described in the two simulation experiments. 6.2. Results of the simulation. The results of the first simulation experiment for c = 0 are given in Table 1. As expected, the power of the tests increases with the number of observations n and with the signal/noise ratio rs/n . If the signal/noise ratio is large enough, we obtain powerful tests even if the number of covariates p is larger than the number of observations. In Table 2 we present results of the first simulation experiment for θ1 = 0.8 when c varies. Let us first compare the results for independent, weakly and highly correlated covariates when using procedure P1 . The size and the power of the test for weakly correlated covariates are similar to the size and the power obtained in the independent case. Hence, we recover the remark following Proposition 4.4: when the correlation coefficient between the covariates is small, the minimax rate is of the same order as in the independent case. The test for highly correlated covariates is more powerful than the test for independent covariates, recovering thus the remark TABLE 1 First simulation study, independent case: p = 30, c = 0. Percentages of rejection and value of the signal/noise ratio rs/n n

TM1 ,P1

TM1 ,P2

Null hypothesis is true, θ1 = 0 10 15

0.043 0.044

0.045 0.049

Null hypothesis is false, θ1 = 0.8, rs/n = 1.78 10 15

0.48 0.81

0.48 0.81

Null hypothesis is false, θ1 = 0.9, rs/n = 4.26 10 15

0.86 0.99

0.86 0.99

727

TESTS FOR HIGH-DIMENSIONAL MODELS TABLE 2 First simulation study, independent and dependent case: p = 30, c = 0, 0.1, 0.8. Frequencies of rejection c=0

c = 0.1

n

TM1 ,P1

TM1 ,P2

10 15

0.043 0.044

0.045 0.049

10 15

0.48 0.81

0.48 0.81

TM1 ,P1

c = 0.8

TM1 ,P2

TM1 ,P1

TM1 ,P2

0.018 0.019

0.045 0.052

0.64 0.89

0.77 0.94

Null hypothesis is true, θ1 = 0 0.042 0.058

0.04 0.06

Null hypothesis is false, θ1 = 0.8 0.49 0.81

0.49 0.82

following Theorem 4.3: the worst case from a minimax rate perspective is the case where the covariates are independent. Let us now compare procedures P1 and P2 . In the case of independent or weakly correlated covariates, they give similar results. For highly correlated covariates, the power of TM1 ,P2 is much larger than the one of TM1 ,P1 . In Table 3 we present results of the multiple testing procedures and of the two tests, φ{1},α and φ{p+1},α , when c = 0.8 and the number of covariates p is large. For p = 500 and n = 15, one test takes less than one second with procedure P1 and less than 30 s with procedure P2 . As expected, procedure P1 is too conservative when p increases. For p = 100, the power of the test based on procedure P1 is smaller than the power of the test φ{p+1},α , and this difference increases when p is larger. The test based on procedure P2 is as powerful as φ{p+1},α , and its power is close to the one of φ{1},α . We recall that this last test is based on the knowledge of the nonzero component of θ contrary to ours. Moreover, the test φ{p+1},α was shown in Proposition 4.5 to be optimal for this particular correlation TABLE 3 First simulation study, dependent case: c = 0.8. Frequencies of rejection p = 100 n

TM1 ,P1

TM1 ,P2

p = 500

φ{1},α

φ{p+1},α

TM1 ,P1

TM1 ,P2

φ{1},α

φ{p+1},α

0.044 0.040

0.040 0.042

0.040 0.034

0.76 0.94

0.91 0.99

0.77 0.94

Null hypothesis is true, θ1 = 0 10 15

0.01 0.016

0.056 0.053

0.051 0.047

10 15

0.60 0.85

0.77 0.92

0.91 0.99

0.045 0.053

0.009 0.011

Null hypothesis is false, θ1 = 0.8 0.79 0.92

0.52 0.77

728

N. VERZELEN AND F. VILLERS TABLE 4 Second simulation study. Frequencies of rejection

n

TM2 ,P2

TM2 ,P1

TM3 ,P1

TM3 ,P2

0.036 0.042

0.059 0.059

Null hypothesis is true, R = 0 50 100

0.013 0.009

0.052 0.059

Null hypothesis is false, R = 0.2, s = 0.5 50 100

0.17 0.42

0.33 0.66

0.31 0.62

0.38 0.69

setting. Hence, procedure P2 seems to achieve the optimal rate in this situation. Thus, we advise to use in practice procedure P2 if the number of covariates p is large because procedure P1 becomes too conservative, especially if the covariates are correlated. The results of the second simulation experiment are given in Table 4. As expected, procedure P2 improves the power of the test and the test TM3 ,P2 has the greatest power. In this setting, one should prefer the collection M3 to M2 . This was previously pointed out in Section 5 from a theoretical point of view. Although TM3 ,P1 is conservative, it is a good compromise for practical issues: it is very easy and fast to implement, and its performances are good. 7. Proofs of Theorem 3.3, Propositions 4.1, 4.5, 5.1, 5.2 and 5.4. P ROOF OF T HEOREM 3.3. In a nutshell, we shall prove that conditionally to the design X the distribution of the test Tα is the same as the test introduced by Baraud et al. [3]. Hence, we may apply their non asymptotic upper bound for the power. Distribution of φm (Y, X). First, we derive the distribution of the test statistic φm (Y, X) under Pθ . The distribution of Y conditionally to the set of variables (XV ∪m ) is of the form (7.1)

Y=



i∈V ∪m

θiV ∪m Xi +  V ∪m ,

where the vector θ V ∪m is constant and  V ∪m is a zero mean Gaussian variable independent of XV ∪m whose variance is var(Y |XV ∪m ). As a consequence, Y − V ∪m Y2n is exactly  (V ∪m)⊥  V ∪m 2n where (V ∪m)⊥ denotes the orthogonal projection along the space generated by (Xi )i∈V ∪m . Using the same decomposition of Y one simplifies the numerator of φm (Y, X):  V ∪m Y − V Y2n

  2  V ∪m V ∪m    , = θi (Xi − V Xi ) + V ⊥ ∩(V ∪m)   i∈V ∪m

n

729

TESTS FOR HIGH-DIMENSIONAL MODELS

where V ⊥ ∩(V ∪m) is the orthogonal projection onto the intersection between the space generated by (Xi )i∈V ∪m and the orthogonal of the space generated by (Xi )i∈V . For any i ∈ m, let us consider the conditional distribution of Xi with respect to XV , Xi =

(7.2)

 V ,i

j ∈V

θjV ,i

θj Xj + iV ,

iV

are constants and is a zero-mean normal Gaussian random variable where whose variance is var(Xi |XV ) and which is independent of XV . This enables us to express Xi − V Xi = V ⊥ ∩(V ∪m)  Vi Therefore, we decompose φm (Y, X) in (7.3)

for all i ∈ m.



Nm  V ⊥ ∩(V ∪m) ( i∈m θiV ∪m  Vi +  V ∪m )2n . φm (Y, X) = Dm  (V ∪m)⊥  V ∪m 2n

(1) (2) (1) and Zm where Zm refers to the numerator Let us define the random variable Zm (2) of (7.3) divided by Nm and Zm to the denominator divided by Dm . We now prove (1) (2) that Zm and Zm are independent. The variables ( Vj )j ∈m are σ (XV ∪m )-measurable as linear combinations of elements in XV ∪m . Moreover,  V ∪m follows a zero mean normal distribution with covariance matrix var(Y |XV ∪m )In and is independent of XV ∪m . As a consequence, (1) (2) conditionally to XV ∪m , Zm and Zm are independent by Cochran’s theorem as they correspond to projections onto two sets orthogonal from each other. (1) As  Vj is a linear combination of the columns of XV ∪m , Zm follows a noncentral χ 2 distribution conditionally to XV ∪m :

  V ∪m V 2 (1)

j ∈m θj (V ∪m)∩V ⊥  j n 2 , Dm . Zm |XV ∪m ∼ var(Y |XV ∪m )χ 



var(Y |XV ∪m )

V ∪m V 2 j ∈m θj (V ∪m)∩V ⊥  j n

2 (X We denote by am this noncentrality parameV ∪m ) := var(Y |XV ∪m ) ter. Power of Tα conditionally to XV ∪m . Conditionally to XV ∪m our test statistic φm (Y, X) is the same as that proposed by Baraud et al. [3] with n − d data and σ 2 = var(Y |XV ∪m ). Arguing as in their proof of Theorem 1, there exists some ¯ m (δ) such that the procedure accepts the hypothesis with probability quantity  2 (X ¯ m (δ): not larger than δ/2 if am V ∪m ) > 





2 (U ) D log ¯ m (δ) := 2.5 1 + Km  m

(7.4)





+ 2.5[km Km (U ) ∨ 5] log

4 αm δ 4 αm δ







1+

Dm Nm

1+

2Dm , Nm





730

N. VERZELEN AND F. VILLERS

where Um := log(1/αm ), U := log(2/δ), km := 2 exp(4Um /Nm ) and

Km (u) := 1 + 2

u u + 2km . Nm Nm

Consequently, we have (7.5)

2 ¯ m (δ)} ≤ δ/2. (XV ∪m ) ≥  Pθ (Tα ≤ 0|XV ∪m )1{am

Let us derive the distribution of the noncentral parameter am (XV ∪m ). First, we simplify the projection term as  Vj is a linear combination of elements of XV ∪m : (V ∪m)∩V ⊥  Vj = V ∪m  Vj − V  Vj = V ⊥  Vj . 2 as Let us define κm



2 κm

:=



var(

V ∪m  V ) j ∈m θj j

var(Y |XV ∪m )

.

As the variable j ∈m θjV ∪m  Vj is independent of XV , and as almost surely the dimension of the vector space generated by XV is d, we get 



V ∪m V 2 j ∈m θj V ⊥  j n

var(Y |XV ∪m )

2 2 ∼ κm χ (n − d).

Hence, applying for instance Lemma 1 in [16], we get Pθ

 2 am (XV ∪m ) 2 κm





≥ (n − d) − 2 (n − d)U ≤ δ/2.

Let us gather (7.5) with this last bound. If 2 κm ≥ m (δ) :=

(7.6)

¯ m (δ)  √ , (n − d)(1 − 2 U/(n − d))

then it holds that Pθ (Tα ≤ 0)



2 2 ¯ m (δ) + Pθ [am ¯ m (δ)] (XV ∪m ) >  (XV ∪m ) ≤  ≤ Pθ Tα ≤ 0, am 2 ¯ m (δ)|XV ∪m ]} ≤ Eθ {Pθ [Tα ≤ 0, am (XV ∪m ) > 

+ Pθ

 2 am (XV ∪m ) 2 κm





≥ (n − d) − 2 (n − d)U

≤ δ. 2 . Let us now compute the quantity κ 2 in order to simplify Computation of κm m condition (7.6). Let us first express var(Y |XV ) in terms of var(Y |Xm∪V ) using the

731

TESTS FOR HIGH-DIMENSIONAL MODELS

decomposition (7.1) of Y . var(Y |XV ) = var

  j ∈V ∪m

 

= var

(7.7)

j ∈V ∪m

 

= var

j ∈V ∪m

θjV ∪m Xj +  V ∪m XV θjV ∪m Xj





XV + var( V ∪m |XV ) 

θjV ∪m Xj |XV + var(Y |XV ∪m )

as  V ∪m is independent of XV ∪m . Now using the definition of jV in (7.2), it turns out that var

 

j ∈V ∪m

θjV ∪m Xj



XV

= var

j ∈m

= var

(7.8)

 

j ∈m

= var



j ∈m

as the

(jV )j ∈m

θjV ∪m Xj



XV

θjV ∪m jV XV θjV ∪m jV





are independent of XV . Gathering formulae (7.7) and (7.8), we get

var(Y |XV ) − var(Y |XV ∪m ) . var(Y |XV ∪m ) Under assumption (HM ), Um ≤ Nm /10 for all m ∈ M and U ≤ Nm /21. Hence, the terms U/Nm , Um /Nm , km and Km (U ) behave like constants and it follows from (7.6) that  (m) ≤ (m) which completes the proof. 

(7.9)

2 = κm

P ROOF OF P ROPOSITION 4.1. We first recall the classical upper bound for the binomial coefficient (see, for instance, (2.9) in [18]),     ep k log |M(k, p)| = log ≤ k log . p k As a consequence, log(1/αm ) ≤ log(1/α) + k log( ep k ). Assumption (4.1) with L = 21 therefore implies hypothesis (HM ). Hence, we are in position to apply the second result of Theorem 3.3. Moreover, the assumption on n implies that n ≥ 21k and Dm /Nm is thus smaller than 1/20 for any model m in M(k, p). Formula (3.5) in Theorem 3.3 then translates into       √

ep 2 2 + k log (m) ≤ 1 + 0.05 L1 k log k αδ 

+ 1.1L2







ep 2 k log + log k αδ



n,

732

N. VERZELEN AND F. VILLERS

and it follows that Proposition 4.1 holds.  P ROOF OF P ROPOSITION 4.5. We fix the constant L in hypothesis (4.7) to be 21 log(4e) ∨ C2 log(4) where the universal constant C2 is defined later in the proof. This choice of constants allows the procedure [sup1≤i≤p φ{i},α/(2p) ] to satisfy hypothesis (HM ). An argument similar to the proof of Proposition 4.1 allows to show easily that there exists a universal constant C such that if we set ρ12

(7.10)





4p C(log(p) + log(4/(αδ))) C = log := , n n αδ

≥ ρ12 implies that Pθ (Tα > 0) ≥ 1 − δ. Here, the factor 4 in the then var(Yθ )−θ2 logarithm comes from the fact that some weights αm equal α/(2p). 2 Let ρ 2 and λ2 be two positive numbers such that var(Yλ )−λ2 = ρ 2 and let θ ∈ 2

[1, p] such that θ2 = λ2 . As corr(Xi , Xj ) = c for any i = j , it follows that 1−c 2 2 2 var(Xp+1 ) = c + 1−c p and cov(Y, Xp+1 ) = θ [c + p ] : var(Y ) − var(Y |Xp+1 ) (c + (1 − c)/p)λ2 = . var(Y |Xp+1 ) var(Y ) − (c + (1 − c)/p)λ2

We now apply Theorem 3.3 to φ_{{p+1}, α/2} under hypothesis (H_M). There exists a universal constant C₂ such that P_θ(φ_{{p+1}, α/2} > 0) ≥ 1 − δ if

(c + (1 − c)/p)λ² / [var(Y) − (c + (1 − c)/p)λ²] ≥ (C₂/n) log(4/(αδ)).

This last condition is implied by

cλ² / (var(Y) − cλ²) ≥ (C₂/n) log(4/(αδ)),

which is equivalent to

(7.11)    λ²/var(Y) ≥ [C₂ / (cn + cC₂ log(4/(αδ)))] log(4/(αδ)).

Let us assume that c ≥ log(4/(αδ)) / log(4p/(αδ)). As n ≥ 2C₂ log(4p/(αδ)) (hypothesis (4.7) and definition of L), nc ≥ 2C₂ log(4/(αδ)). As a consequence, condition (7.11) is implied by

(7.12)    ρ² ≥ (2C₂/(nc)) log(4/(αδ)).

Combining (7.10) and (7.12) allows us to conclude that P_θ(T_α > 0) ≥ 1 − δ if

ρ² ≥ (L/n) [ log(4p/(αδ)) ∧ (1/c) log(4/(αδ)) ]. □
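The binomial-coefficient bound recalled at the start of the proof of Proposition 4.1 can be verified numerically. This is an illustrative sketch, not part of the paper; the parameter values below are arbitrary:

```python
import math

# Numerical check (illustrative) of the classical bound
# log C(p, k) <= k * log(e*p/k); see also (2.9) in Massart [18].
for p in [10, 50, 200, 1000]:
    for k in range(1, p + 1):
        log_binom = math.lgamma(p + 1) - math.lgamma(k + 1) - math.lgamma(p - k + 1)
        assert log_binom <= k * math.log(math.e * p / k) + 1e-9, (p, k)
print("binomial bound verified")
```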




PROOF OF PROPOSITION 5.1. We fix the constant L to 42 log(80) in hypothesis (5.3). It follows that (5.3) implies

(7.13)    n ≥ 42 [log(40/α) ∨ log(2/δ)].

First, we check that the test T_α satisfies condition (H_M). As the dimension of each model is smaller than n/2, for any model m in M, N_m is larger than n/2. Moreover, for any model m in M, α_m is larger than α/(2|M|) and |M| is smaller than n/2. As a consequence, the first condition of (H_M) is implied by the inequality

(7.14)    n ≥ 20 log(n/α).

Hypothesis (7.13) implies that n/2 ≥ 20 log(40/α). Moreover, for any n > 0 it holds that n/2 ≥ 20 log(n/40). Combining these two lower bounds yields (7.14). The second condition of (H_M) holds if n ≥ 42 log(2/δ), which is a consequence of hypothesis (7.13).
We first consider the case n < 2p and apply Theorem 3.3 under hypothesis (H_M) to T_α. P_θ(T_α > 0) ≥ 1 − δ for all θ ∈ R^p such that there exists i ∈ {1, ..., [n/2]} with

(7.15)    [var(Y) − var(Y|X_{m_i})] / var(Y|X_{m_i}) ≥ C [ √(i log(2[n/2]/(αδ))) + log(2[n/2]/(αδ)) ] / n,

where C is a universal constant. Let θ be an element of E_a(R) that satisfies



θ² ≥ (1 + C) [var(Y|X_{m_i}) − var(Y|X)] + (1 + C) var(Y|X) [ √(i log(n/(αδ))) + log(n/(αδ)) ] / n

for some 1 ≤ i ≤ [n/2]. By hypothesis (5.3), it holds that

[ √(i log(n/(αδ))) + log(n/(αδ)) ] / n ≤ 1

for any i between 1 and [n/2]. It is then straightforward to check that θ satisfies (7.15). As θ belongs to the set E_a(R) and the a_j's are nonincreasing,

var(Y|X_{m_i}) − var(Y|X) = a²_{i+1} var(Y|X) Σ_{j=i+1}^{p} [var(Y|X_{m_{j−1}}) − var(Y|X_{m_j})] / (a²_{i+1} var(Y|X))
    ≤ a²_{i+1} var(Y|X) Σ_{j=i+1}^{p} [var(Y|X_{m_{j−1}}) − var(Y|X_{m_j})] / (a_j² var(Y|X)) ≤ a²_{i+1} var(Y|X) R².

Hence, if θ belongs to E_a(R) and satisfies

θ² ≥ (1 + C) var(Y|X) [ a²_{i+1} R² + √(i log(n/(αδ)))/n + (1/n) log(n/(αδ)) ],

then P_θ(T_α ≤ 0) ≤ δ. Gathering this condition for any i between 1 and [n/2] allows us to conclude that, if θ satisfies

θ² / (var(Y) − θ²) ≥ (1 + C) inf_{1≤i≤[n/2]} [ a²_{i+1} R² + √(i log(n/(αδ)))/n + (1/n) log(n/(αδ)) ],

then P_θ(T_α ≤ 0) ≤ δ.
Let us now turn to the case n ≥ 2p. Let us consider T_α as the supremum of p − 1 tests of level α/(2(p − 1)) and one test of level α/2. By considering the p − 1 first tests, we obtain, as in the previous case, that P_θ(T_α ≤ 0) ≤ δ if

θ² / (var(Y) − θ²) ≥ (1 + C) inf_{1≤i≤p−1} [ a²_{i+1} R² + √(i log(p/(αδ)))/n + (1/n) log(p/(αδ)) ].

On the other hand, using the last test statistic φ_{I, α/2}, P_θ(T_α ≤ 0) ≤ δ if

θ² / (var(Y) − θ²) ≥ C [ √(p log(2/(αδ))) + log(2/(αδ)) ] / n.

Gathering these two conditions, we prove (5.5). □

PROOF OF PROPOSITION 5.2. The approach behind this proof is similar to the one for Proposition 5.1. We fix the constant L in assumption (5.6) as in the previous proof. Hence, the collection of models M and the weights α_m satisfy hypothesis (H_M) as in the previous proof. Let us give a sharper upper bound on |M|:

(7.16)

|M| ≤ 1 + log(n/2 ∧ p)/ log(2) ≤ log(n ∧ 2p)/ log(2).

We deduce from (7.16) that there exists a constant L(α, δ), depending only on α and δ, such that for all m ∈ M,

log(1/(α_m δ)) ≤ L(α, δ) log log(n ∧ p).

First, let us consider the case n < 2p. We apply Theorem 3.3 under assumption (H_M). As in the proof of Proposition 5.1, we obtain that P_θ(T_α > 0) ≥ 1 − δ if

θ² / (var(Y) − θ²) ≥ L(α, δ) inf_{i∈{2^j, j≥0}∩{1,...,[n/2]}} [ R²(i+1)^{−2s} + √(i log log n)/n + (log log n)/n ].

It is worth noting that R² i^{−2s} ≤ √(i log log n)/n if and only if

i ≥ i* = ( R² n / √(log log n) )^{2/(1+4s)}.

Under the assumption on R, i* is larger than one. Let us distinguish between two cases. If there exists i′ in {2^j, j ≥ 0} ∩ {1, ..., [n/2]} such that i* ≤ i′, one can take i′ ≤ 2i* and then

(7.17)    inf_{i∈{2^j, j≥0}∩{1,...,[n/2]}} [ R² i^{−2s} + √(i log log n)/n ] ≤ 2 √(i′ log log n)/n ≤ 2√2 R^{2/(1+4s)} ( √(log log n)/n )^{4s/(1+4s)}.

Else, we take i′ ∈ {2^j, j ≥ 0} ∩ {1, ..., [n/2]} such that n/4 ≤ i′ ≤ n/2. Since i′ ≤ (i* ∧ n/2), we obtain that

(7.18)    inf_{i∈{2^j, j≥0}∩{1,...,[n/2]}} [ R² i^{−2s} + √(i log log n)/n ] ≤ 2R² i′^{−2s} ≤ 2R² (n/4)^{−2s}.

Gathering inequalities (7.17) and (7.18), we prove (5.7).
We now turn to the case n ≥ 2p. As in the proof of Proposition 5.1, we divide the proof into two parts: first we give an upper bound on the power of the |M| − 1 first tests which define T_α, and then we give an upper bound for the last test φ_{I, α/2}. Combining these two inequalities allows us to prove (5.8). □

PROOF OF PROPOSITION 5.4. We fix the constant L in the assumption as in the two previous proofs. We first note that the assumption on R² implies that D* ≥ 2. As N_m is larger than n/2, the test φ_{m_{D*}} clearly satisfies condition (H_M). As a consequence, we may apply Theorem 3.3. Hence, P_θ(T*_α ≤ 0) ≤ δ for any θ such that

(7.19)    [var(Y) − var(Y|X_{m_{D*}})] / var(Y|X_{m_{D*}}) ≥ L(α, δ) √(D*)/n.

Now, we use the same sketch as in the proof of Proposition 5.1. For any θ ∈ E_a(R), condition (7.19) is equivalent to

(7.20)    θ² ≥ [var(Y|X_{m_{D*}}) − var(Y|X)] (1 + L(α, δ) √(D*)/n) + var(Y|X) L(α, δ) √(D*)/n.

Moreover, as θ belongs to E_a(R),

var(Y|X_{m_{D*}}) − var(Y|X) ≤ a²_{D*+1} R² var(Y|X) ≤ a²_{D*} var(Y|X) R².

As √(D*)/n is smaller than one, condition (7.20) is implied by

θ² / (var(Y) − θ²) ≥ 2 (1 + L(α, δ)) [ a²_{D*} R² + √(D*)/n ].

As a²_{D*} R² is smaller than √(D*)/n, which is smaller than sup_{1≤i≤p} [ √i/n ∧ a²_i R² ], it turns out that P_θ(T*_α = 0) ≤ δ for any θ belonging to E_a(R) such that

θ² / (var(Y) − θ²) ≥ 2 (1 + L(α, δ)) sup_{1≤i≤p} [ √i/n ∧ a²_i R² ]. □
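The bias–deviation tradeoff analyzed in the proof of Proposition 5.2 above can be checked numerically: R² i^(−2s) falls below √(i log log n)/n exactly at i* = (R²n/√(log log n))^(2/(1+4s)). This is an illustrative sketch with arbitrary parameter values (n, s, R² below are assumptions of the sketch):

```python
import math

# Illustrative check (arbitrary parameters) of the tradeoff in the proof of
# Proposition 5.2: R^2 * i^(-2s) <= sqrt(i * loglog n)/n  iff  i >= i_star,
# with i_star = (R^2 * n / sqrt(loglog n))^(2/(1+4s)).
n, s, R2 = 10_000, 0.5, 3.0
lll = math.log(math.log(n))
i_star = (R2 * n / math.sqrt(lll)) ** (2.0 / (1.0 + 4.0 * s))
for i in range(1, n // 2):
    bias = R2 * i ** (-2 * s)
    dev = math.sqrt(i * lll) / n
    assert (bias <= dev + 1e-12) == (i >= i_star - 1e-9), i
print("tradeoff point i* =", round(i_star, 2))
```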

8. Proofs of Theorem 4.3, Propositions 3.4, 4.2, 4.4, 4.6, 5.3, 5.5 and 5.6. Throughout this section, we shall use the notation η := 2(1 − α − δ) and L(η) := log(1 + 2η²)/2.

PROOF OF THEOREM 4.3. This proof follows the general method for obtaining lower bounds described in Section 7.1 of Baraud [2]. We first remind the reader of the main arguments of the approach applied to our model. Let ρ be some positive number and μ_ρ be some probability measure on

Θ[k, p, ρ] := { θ ∈ Θ[k, p], θ²/(var(Y) − θ²) = ρ² }.

We define P_{μ_ρ} = ∫ P_θ dμ_ρ(θ) and Φ_α the set of level-α tests of the hypothesis "θ = 0." Then

(8.1)    β_I(Θ[k, p, ρ]) ≥ inf_{φ_α ∈ Φ_α} P_{μ_ρ}[φ_α = 0] ≥ 1 − α − sup_{A, P₀(A)≤α} |P_{μ_ρ}(A) − P₀(A)| ≥ 1 − α − (1/2) ‖P_{μ_ρ} − P₀‖_TV,

where ‖P_{μ_ρ} − P₀‖_TV denotes the total variation norm between the probabilities P_{μ_ρ} and P₀. If we suppose that P_{μ_ρ} is absolutely continuous with respect to P₀, we can upper bound the norm in total variation between these two probabilities as follows. We define

L_{μ_ρ}(Y, X) := (dP_{μ_ρ}/dP₀)(Y, X).

Then we get the upper bound

‖P_{μ_ρ} − P₀‖_TV = ∫ |L_{μ_ρ}(Y, X) − 1| dP₀(Y, X) ≤ ( E₀[L²_{μ_ρ}(Y, X)] − 1 )^{1/2}.

Thus, we deduce from (8.1) that

β_I(Θ[k, p, ρ]) ≥ 1 − α − (1/2) ( E₀[L²_{μ_ρ}(Y, X)] − 1 )^{1/2}.

If we find a number ρ* = ρ*(η) such that

(8.2)    log E₀[L²_{μ_{ρ*}}(Y, X)] ≤ L(η),

then for any ρ ≤ ρ*,

β_I(Θ[k, p, ρ]) ≥ 1 − α − η/2 = δ.

To apply this method, we first have to define a suitable prior μ_ρ on Θ[k, p, ρ]. Let m̂ be some random variable uniformly distributed over M(k, p) and, for each m ∈ M(k, p), let ε^m = (ε_j^m)_{j∈m} be a sequence of independent Rademacher random variables. We assume that for all m ∈ M(k, p), ε^m and m̂ are independent. Let ρ be given and μ_ρ be the distribution of the random variable θ = Σ_{j∈m̂} ε_j^{m̂} λ e_j, where

λ² := var(Y) ρ² / (k (1 + ρ²))

and where (e_j)_{j∈I} is the orthonormal family of vectors of R^p defined by (e_j)_i = 1 if i = j and (e_j)_i = 0 otherwise.
Straightforwardly, μ_ρ is supported by Θ[k, p, ρ]. For any m in M(k, p) and any vector (ζ_j^m)_{j∈m} with values in {−1, 1}, let μ_{m,ζ^m,ρ} be the Dirac measure on Σ_{j∈m} λ ζ_j^m e_j. For any m in M(k, p), μ_{m,ρ} denotes the distribution of the random variable Σ_{j∈m} λ ζ_j^m e_j, where (ζ_j^m) is a sequence of independent Rademacher random variables. These definitions easily imply

L_{μ_ρ}(Y, X) = C(p, k)^{−1} Σ_{m∈M(k,p)} L_{μ_{m,ρ}}(Y, X) = (2^k C(p, k))^{−1} Σ_{m∈M(k,p)} Σ_{ζ^m∈{−1,1}^k} L_{μ_{m,ζ^m,ρ}}(Y, X).

We aim at bounding the quantity E₀(L²_{μ_ρ}) and obtaining an inequality of the form (8.2). First, we work out L_{μ_{m,ζ^m},ρ}:

(8.3)    L_{μ_{m,ζ^m},ρ}(Y, X) = (1 − λ²k/var(Y))^{−n/2} exp[ − λ²k ‖Y‖²_n / (2 var(Y)(var(Y) − λ²k)) + λ Σ_{j∈m} ζ_j^m ⟨Y, X_j⟩_n / (var(Y) − λ²k) − λ² Σ_{j,j′∈m} ζ_j^m ζ_{j′}^m ⟨X_j, X_{j′}⟩_n / (2(var(Y) − λ²k)) ],

where ⟨·,·⟩_n refers to the canonical inner product in R^n. Let us fix m₁ and m₂ in M(k, p) and two vectors ζ¹ and ζ², respectively associated to m₁ and m₂. We aim at computing the quantity E₀(L_{μ_{m₁,ζ¹,ρ}}(Y, X) L_{μ_{m₂,ζ²,ρ}}(Y, X)). First, we decompose the set m₁ ∪ m₂ into four (possibly empty) sets: m₁ \ m₂, m₂ \ m₁, m₃ and m₄, where m₃ and m₄ are defined by

m₃ := {j ∈ m₁ ∩ m₂ | ζ_j¹ = ζ_j²},    m₄ := {j ∈ m₁ ∩ m₂ | ζ_j¹ = −ζ_j²}.

For the sake of simplicity, we reorder the elements of m₁ ∪ m₂ from 1 to |m₁ ∪ m₂| such that the first elements belong to m₁ \ m₂, then to m₂ \ m₁ and so on. Moreover, we define the vector ζ ∈ R^{|m₁∪m₂|} such that ζ_j = ζ_j¹ if j ∈ m₁ and ζ_j = ζ_j² if j ∈ m₂ \ m₁. Using this notation, we compute the expectation of L_{m₁,ζ¹,ρ}(Y, X) L_{m₂,ζ²,ρ}(Y, X):

(8.4)    E₀(L_{μ_{m₁,ζ¹,ρ}}(Y, X) L_{μ_{m₂,ζ²,ρ}}(Y, X)) = [ 1 / (var(Y)(1 − λ²k/var(Y))²) ]^{n/2} |A|^{−n/2},

where |·| refers to the determinant and A is a symmetric square matrix of size |m₁ ∪ m₂| + 1 such that

A[1, 1] := (var(Y) + λ²k) / (var(Y)(var(Y) − λ²k)), and for j > 1,

A[1, j] := λ ζ_{j−1} / (var(Y) − λ²k) if (j − 1) ∈ m₁ Δ m₂;    2λ ζ_{j−1} / (var(Y) − λ²k) if (j − 1) ∈ m₃;    0 if (j − 1) ∈ m₄,

where m₁ Δ m₂ refers to (m₁ ∪ m₂) \ (m₁ ∩ m₂). For any i > 1 and j > 1, A satisfies

A[i, j] := λ² ζ_{i−1} ζ_{j−1} / (var(Y) − λ²k) + δ_{i,j}, if (i − 1, j − 1) ∈ (m₁ \ m₂) × m₁;
           λ² ζ_{i−1} ζ_{j−1} / (var(Y) − λ²k) + δ_{i,j}, if (i − 1, j − 1) ∈ (m₂ \ m₁) × ((m₂ \ m₁) ∪ m₃);
           −λ² ζ_{i−1} ζ_{j−1} / (var(Y) − λ²k), if (i − 1, j − 1) ∈ (m₂ \ m₁) × m₄;
           2λ² ζ_{i−1} ζ_{j−1} / (var(Y) − λ²k) + δ_{i,j}, if (i − 1, j − 1) ∈ (m₃ × m₃) ∪ (m₄ × m₄);
           0, else,

where δ_{i,j} is the indicator function of i = j. After some linear transformations on the rows of the matrix A, it is possible to express its determinant as

|A| = [ (var(Y) + λ²k) / (var(Y)(var(Y) − λ²k)) ] | I_{|m₁∪m₂|} + C |,

where I_{|m₁∪m₂|} is the identity matrix of size |m₁ ∪ m₂| and C is a symmetric matrix of size |m₁ ∪ m₂| such that, for any (i, j), C[i, j] = ζ_i ζ_j D[i, j], with D the block symmetric matrix defined by

D := ⎡  λ⁴k/(var²(Y)−λ⁴k²)        −λ²var(Y)/(var²(Y)−λ⁴k²)   −λ²/(var(Y)+λ²k)     λ²/(var(Y)−λ²k)  ⎤
     ⎢ −λ²var(Y)/(var²(Y)−λ⁴k²)    λ⁴k/(var²(Y)−λ⁴k²)        −λ²/(var(Y)+λ²k)    −λ²/(var(Y)−λ²k)  ⎥
     ⎢ −λ²/(var(Y)+λ²k)           −λ²/(var(Y)+λ²k)           −2λ²/(var(Y)+λ²k)     0                ⎥
     ⎣  λ²/(var(Y)−λ²k)           −λ²/(var(Y)−λ²k)             0                   2λ²/(var(Y)−λ²k) ⎦

Each block corresponds to one of the four previously defined subsets of m₁ ∪ m₂ (i.e., m₁ \ m₂, m₂ \ m₁, m₃ and m₄). The matrix D is of rank at most four. By computing its nonzero eigenvalues, it is then straightforward to derive the determinant of A:

|A| = [var(Y) − λ²(2|m₃| − |m₁ ∩ m₂|)]² / (var(Y)(var(Y) − λ²k)²).

Gathering this equality with (8.4) yields

(8.5)    E₀(L_{μ_{m₁,ζ¹,ρ}}(Y, X) L_{μ_{m₂,ζ²,ρ}}(Y, X)) = [ 1 − λ²(2|m₃| − |m₁ ∩ m₂|)/var(Y) ]^{−n}.


Then, we take the expectation with respect to ζ¹, ζ², m₁ and m₂. When m₁ and m₂ are fixed, the expression (8.5) depends on ζ¹ and ζ² only through the cardinality of m₃. As ζ¹ and ζ² correspond to independent Rademacher variables, the random variable 2|m₃| − |m₁ ∩ m₂| follows the distribution of Z, a sum of |m₁ ∩ m₂| independent Rademacher variables, and

(8.6)    E₀(L_{μ_{m₁,ρ}}(Y, X) L_{μ_{m₂,ρ}}(Y, X)) = E[ (1 − λ²Z/var(Y))^{−n} ].

When Z is nonpositive, this expression is smaller than one. Alternatively, when Z is nonnegative,

(1 − λ²Z/var(Y))^{−n} = exp[ n log( 1/(1 − λ²Z/var(Y)) ) ] ≤ exp[ n λ²Z/var(Y) / (1 − λ²Z/var(Y)) ] ≤ exp[ n λ²Z/var(Y) / (1 − λ²k/var(Y)) ],

as log(1 + x) ≤ x and as Z is smaller than k. We define an event A such that {Z > 0} ⊂ A ⊂ {Z ≥ 0} and P(A) = 1/2. This is always possible as the random variable Z is symmetric. As a consequence, on the event A^c, the quantity inside the expectation in (8.6) is smaller than or equal to one. All in all, we bound (8.6) by

(8.7)    E₀(L_{μ_{m₁,ρ}}(Y, X) L_{μ_{m₂,ρ}}(Y, X)) ≤ 1/2 + E₀[ 1_A exp( n λ²Z/var(Y) / (1 − λ²k/var(Y)) ) ],

where 1_A is the indicator function of the event A. We now apply Hölder's inequality with a parameter v ∈ ]0, 1], which will be fixed later:



(8.8)    E₀[ 1_A exp( n λ²Z/var(Y) / (1 − λ²k/var(Y)) ) ] ≤ P(A)^{1−v} ( E₀ exp[ (n/v) λ²Z/var(Y) / (1 − λ²k/var(Y)) ] )^v = (1/2)^{1−v} cosh( nλ² / (v(var(Y) − λ²k)) )^{|m₁∩m₂| v}.

Gathering inequalities (8.7) and (8.8) yields

E₀[L²_{μ_ρ}(Y, X)] ≤ 1/2 + (1/2)^{1−v} C(p, k)^{−2} Σ_{m₁,m₂∈M(k,p)} cosh( nλ² / (v(var(Y) − λ²k)) )^{|m₁∩m₂| v}.


Following the approach of Baraud [2], Section 7.2, we note that if m₁ and m₂ are taken uniformly and independently in M(k, p), then |m₁ ∩ m₂| follows a hypergeometric distribution with parameters p, k and k/p. Thus we derive that

(8.9)    E₀[L²_{μ_ρ}(Y, X)] ≤ 1/2 + (1/2)^{1−v} E[ cosh( nλ² / (v(var(Y) − λ²k)) )^{vT} ],

where T is a random variable distributed according to a hypergeometric distribution with parameters p, k and k/p. We know from Aldous (see [1], page 173) that T has the same distribution as the random variable E(W|B_p), where W is a binomial random variable with parameters k and k/p, and B_p is some suitable σ-algebra. By a convexity argument, we then upper bound (8.9):

E₀[L²_{μ_ρ}(Y, X)] ≤ 1/2 + (1/2)^{1−v} E[ cosh( nλ² / (v(var(Y) − λ²k)) )^{vW} ]
    = 1/2 + (1/2)^{1−v} [ 1 + (k/p)( cosh( nλ² / (v(var(Y) − λ²k)) )^v − 1 ) ]^k
    = 1/2 + (1/2)^{1−v} exp( k log[ 1 + (k/p)( cosh( nλ² / (v(var(Y) − λ²k)) )^v − 1 ) ] ).

To get the upper bound on the total variation distance appearing in (8.1), we aim at constraining this last expression to be smaller than 1 + η². This is equivalent to the following inequality:

(8.10)    2^v exp( k log[ 1 + (k/p)( cosh( nλ² / (v(var(Y) − λ²k)) )^v − 1 ) ] ) ≤ 1 + 2η².

We now choose v = (L(η)/log 2) ∧ 1. If v is strictly smaller than one, then (8.10) is equivalent to

(8.11)    k log[ 1 + (k/p)( cosh( nλ² / (v(var(Y) − λ²k)) )^v − 1 ) ] ≤ log(1 + 2η²)/2.

It is straightforward to show that this last inequality also implies (8.10) if v equals one. We now suppose that

(8.12)    nλ² / (v(var(Y) − λ²k)) ≤ log[ (1 + u)^{1/v} + ( (1 + u)^{2/v} − 1 )^{1/2} ],

where u = pL(η)/k². Using the classical equality cosh[log(1 + x + √(2x + x²))] = 1 + x with x = (1 + u)^{1/v} − 1, we deduce that inequality (8.12) implies (8.11)


because

k log[ 1 + (k/p)( cosh( nλ² / (v(var(Y) − λ²k)) )^v − 1 ) ] ≤ k log(1 + (k/p)u) ≤ k²u/p = L(η).

For any β ≥ 1 and any x > 0, it holds that (1 + x)^β ≥ 1 + βx. As 1/v ≥ 1, condition (8.12) is implied by

λ²k / (var(Y) − λ²k) ≤ (kv/n) log( 1 + u/v + √(2u/v) ).

One then combines the previous inequality with the definitions of u and v to obtain the upper bound

λ²k / (var(Y) − λ²k) ≤ (k/n) [ (L(η)/log 2) ∧ 1 ] log( 1 + p(log 2 ∨ L(η))/k² + √( 2p(log 2 ∨ L(η))/k² ) ).

For any x positive and any u between 0 and 1, log(1 + ux) ≥ u log(1 + x). As a consequence, the previous inequality is implied by

λ²k / (var(Y) − λ²k) ≤ (k/n) [ (L(η)/log 2) ∧ 1 ] [ (L(η) ∨ log 2) ∧ 1 ] log( 1 + p/k² + √(2p/k²) ) = (k/n) [ L(η) ∧ 1 ] log( 1 + p/k² + √(2p/k²) ).

To summarize, if we take ρ² smaller than (4.4), then β_I(Θ[k, p, ρ]) ≥ δ. Moreover, the lower bound is strict if ρ² is strictly smaller than (4.4). To prove the second part of the theorem, one has to observe that α + δ ≤ 53% implies that L(η) ≥ 1/2. □
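Three elementary facts used in the proof above and in the proof of Proposition 4.2 below can be verified numerically. This is an illustrative sketch, not part of the paper:

```python
import math

# (i) cosh(log(1 + x + sqrt(2x + x^2))) = 1 + x for x > 0.
for x in [1e-6, 0.1, 1.0, 7.5, 100.0]:
    y = math.log(1 + x + math.sqrt(2 * x + x * x))
    assert abs(math.cosh(y) - (1 + x)) < 1e-9 * (1 + x)

# (ii) cosh(x) <= exp(x^2 / 2) for all real x (used for Proposition 4.2).
for x in [i / 10 for i in range(-50, 51)]:
    assert math.cosh(x) <= math.exp(x * x / 2) + 1e-12

# (iii) with eta = 2(1 - alpha - delta), alpha + delta <= 53% implies
#       L(eta) = log(1 + 2*eta^2)/2 >= 1/2.
for s in [0.0, 0.1, 0.25, 0.4, 0.53]:
    eta = 2 * (1 - s)
    assert math.log(1 + 2 * eta * eta) / 2 >= 0.5 - 1e-12, s
print("elementary bounds verified")
```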

To resume, if we take ρ 2 smaller than (4.4), then βI ( [k, p, ρ]) ≥ δ. Moreover, the lower bound is strict if ρ 2 is strictly smaller than (4.4). To prove the second part of the theorem, one has to observe that α + δ ≤ 53% implies that L(η) ≥ 12 .  P ROOF OF P ROPOSITION 4.2. Let us first assume that the covariance matrix of X is the identity. We argue as in the proof of Theorem 4.3 taking k = p. The sketch of the proof remains unchanged except that we slightly modify the last part. Inequality (8.11) becomes 



nλ2 p pv log cosh vp(var(Y ) − λ2 p)



≤ L(η),

743

TESTS FOR HIGH-DIMENSIONAL MODELS

(η) 2 where we recall that v = L log 2 ∧ 1. For all x ∈ R, cosh(x) ≤ exp(x /2). Consequently, the previous inequality is implied by √  p λ2 p ≤ , 2v L (η) 2 var(Y ) − λ p n

and the result follows easily. If we no longer assume that the covariance matrix  is the identity, we orthogonalize the sequence Xi thanks to Gram–Schmidt process. Applying the previous argument to this new sequence of covariates completes the proof.  P ROOF OF P ROPOSITION 3.4. the condition: 



Let us define the constant L(α, δ) involved in

*







+

L(α, δ) := log 1 + 8(1 − α − δ)2 1 ∧ log 1 + 8(1 − α − δ)2 /(2 log 2) . √

m Let us apply Proposition 4.2. For any ρ ≤ L(α, δ) D n and any ς > 0 there ex2  ists some θ ∈ Sm such that var(Yθ)−θ = ρ 2 and Pθ (φm,α ≤ 0) ≥ δ −ς . In the proof 2 of Theorem 3.3, we have shown in (7.3) and following equalities that the distribu2 = var(Y )−var(Y |Xm ) . Let tion of the test statistic φm only depends on the quantity κm var(Y |Xm ) 2 = ρ 2 . The distribution of φ under P  is the θ  be an element of Sm such that κm m θ same as its distribution under Pθ , and therefore

P_{θ′}(φ_{m,α} ≤ 0) ≥ δ − ς.

Letting ς go to 0 completes the proof. □

PROOF OF PROPOSITION 4.4. This lower bound for dependent Gaussian covariates is proved through the same approach as Theorem 4.3. We define the measure μ_ρ as in that proof. Under hypothesis (H₀), Y is independent of X. We note Σ the covariance matrix of X, and E_{0,Σ} stands for the expectation under the distribution of (Y, X) under (H₀), in order to emphasize the dependence on Σ. First, one has to upper bound the quantity E_{0,Σ}[L²_{μ_ρ}(Y, X)]. For the sake of simplicity, we make the hypothesis that every covariate X_j has variance 1. If this is not the case, we only have to rescale these variables. The quantity corr(i, j) refers to the correlation between X_i and X_j. As we only consider the case k = 1, the set of models m in M(1, p) is in correspondence with the set {1, ..., p}:

E_{0,Σ}(L_{μ_{i,ζ¹,ρ}}(Y, X) L_{μ_{j,ζ²,ρ}}(Y, X)) = [ var(Y) / (var(Y) − corr(i, j) λ² ζ¹ζ²) ]^n.

When i and j are fixed, we upper bound the expectation of this quantity with respect to ζ¹ and ζ² by

(8.13)    E_{0,Σ}(L_{μ_{i,ρ}}(Y, X) L_{μ_{j,ρ}}(Y, X)) ≤ 1/2 + (1/2) [ var(Y) / (var(Y) − |corr(i, j)| λ²) ]^n.


If i ≠ j, |corr(i, j)| is smaller than c, and if i = j, corr(i, j) is exactly one. As a consequence, taking the expectation of (8.13) with respect to i and j yields the upper bound

(8.14)    E_{0,Σ}(L²_{μ_ρ}(Y, X)) ≤ 1/2 + (1/2) [ (1/p)( var(Y)/(var(Y) − λ²) )^n + ((p − 1)/p)( var(Y)/(var(Y) − cλ²) )^n ].

Recall that we want to constrain the quantity (8.14) to be smaller than 1 + η². In particular, this holds if the two following inequalities hold:

(8.15)    (1/p) ( var(Y)/(var(Y) − λ²) )^n ≤ 1/p + η²,

(8.16)    ((p − 1)/p) ( var(Y)/(var(Y) − cλ²) )^n ≤ (p − 1)/p + η².

One then uses the inequality log(1/(1 − x)) ≤ x/(1 − x), which holds for any positive x smaller than one. Condition (8.15) holds if

(8.17)    λ²/(var(Y) − λ²) ≤ (1/n) log(1 + pη²),

whereas condition (8.16) is implied by

cλ²/(var(Y) − cλ²) ≤ (1/n) log( 1 + pη²/(p − 1) ).

As c is smaller than one and p/(p − 1) is larger than 1, this last inequality holds if

(8.18)    λ²/(var(Y) − λ²) ≤ (1/(nc)) log(1 + η²).

Gathering conditions (8.17) and (8.18) allows us to complete the proof and to obtain the desired lower bound (4.6). □

PROOF OF PROPOSITION 4.6. The sketch of the proof and the notation are analogous to those of Proposition 4.4. The upper bound (8.13) still holds:

E_{0,Σ}(L_{μ_{i,ρ}}(Y, X) L_{μ_{j,ρ}}(Y, X)) ≤ 1/2 + (1/2) [ var(Y) / (var(Y) − |corr(i, j)| λ²) ]^n.

Using the stationarity of the covariance function, we derive from (8.13) the following upper bound:

E_{0,Σ}(L²_{μ_ρ}(Y, X)) ≤ 1/2 + (1/(2p)) Σ_{i=0}^{p−1} [ var(Y) / (var(Y) − λ² |corr(0, i)|) ]^n,

where corr(0, i) equals corr(X₁, X_{i+1}). As previously, we want to constrain this quantity to be smaller than 1 + η². In particular, this is implied if, for any i between 0 and p − 1,

[ var(Y) / (var(Y) − λ² |corr(i, 0)|) ]^n ≤ 1 + 2pη² |corr(i, 0)| / Σ_{i=0}^{p−1} |corr(i, 0)|.

Using the inequality log(1 + u) ≤ u, it is straightforward to show that this previous inequality holds if

λ² |corr(i, 0)| / (var(Y) − λ² |corr(i, 0)|) ≤ (1/n) log( 1 + 2pη² |corr(0, i)| / Σ_{i=0}^{p−1} |corr(i, 0)| ).

As |corr(i, 0)| is smaller than one for any i between 0 and p − 1, it follows that E_{0,Σ}(L²_{μ_ρ}(Y, X)) is smaller than 1 + η² if

ρ² ≤ min_{0≤i≤p−1} ( 1/(n |corr(i, 0)|) ) log( 1 + 2pη² |corr(0, i)| / Σ_{i=0}^{p−1} |corr(i, 0)| ).

We now apply the convexity inequality log(1 + ux) ≥ u log(1 + x), which holds for any positive x and any u between 0 and 1, to obtain the condition

(8.19)    ρ² ≤ (1/n) log( 1 + 2pη² / Σ_{i=0}^{p−1} |corr(i, 0)| ).

It turns out we only have to upper bound the sum of |corr(i, 0)| for the following different types of correlation:

1. For corr(i, j) = exp(−w|i − j|_p), the sum is clearly bounded by 1 + 2e^{−w}/(1 − e^{−w}) and condition (8.19) simplifies as

ρ² ≤ (1/n) log( 1 + 2pη² (1 − e^{−w})/(1 + e^{−w}) );

2. if corr(i, j) = (1 + |i − j|_p)^{−t} for t strictly larger than one, then Σ_{i=0}^{p−1} |corr(i, 0)| ≤ 1 + 2/(t − 1) and condition (8.19) simplifies as

ρ² ≤ (1/n) log( 1 + 2p(t − 1)η²/(t + 1) );

3. if corr(i, j) = (1 + |i − j|_p)^{−1}, then Σ_{i=0}^{p−1} |corr(i, 0)| ≤ 1 + 2 log(p − 1) and

ρ² ≤ (1/n) log( 1 + 2pη²/(1 + 2 log(p − 1)) );

4. if corr(i, j) = (1 + |i − j|_p)^{−t} for 0 < t < 1, then

Σ_{i=0}^{p−1} |corr(i, 0)| ≤ 1 + (2/(1 − t)) [ (p/2)^{1−t} − 1 ] ≤ (2/(1 − t)) (p/2)^{1−t},

and condition (8.19) simplifies as

ρ² ≤ (1/n) log( 1 + p^t 2^{1−t} (1 − t) η² ). □
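Two ingredients of the proofs of Propositions 4.4 and 4.6 can be checked numerically: the bound log(1/(1−x)) ≤ x/(1−x), and the four bounds on Σ_i |corr(i, 0)|. This is an illustrative sketch; it assumes |i − j|_p denotes the cyclic distance min(|i−j|, p−|i−j|), and the values of p, w and t are arbitrary:

```python
import math

# log(1/(1-x)) <= x/(1-x) for x in (0, 1).
for x in [i / 100 for i in range(1, 100)]:
    assert math.log(1 / (1 - x)) <= x / (1 - x) + 1e-12

# The four correlation-sum bounds (cyclic-distance assumption).
p = 100
dist = [min(i, p - i) for i in range(p)]
s_exp = sum(math.exp(-0.5 * d) for d in dist)
assert s_exp <= 1 + 2 * math.exp(-0.5) / (1 - math.exp(-0.5)) + 1e-9
assert sum((1 + d) ** -2.0 for d in dist) <= 1 + 2 / (2.0 - 1)
assert sum(1 / (1 + d) for d in dist) <= 1 + 2 * math.log(p - 1)
t = 0.5
assert sum((1 + d) ** -t for d in dist) <= (2 / (1 - t)) * (p / 2) ** (1 - t)
print("correlation bounds verified")
```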



PROOF OF PROPOSITION 5.3. For each dimension D between 1 and p, we define r²_D = ρ²_{D,n} ∧ a²_D R². Let us fix some D ∈ {1, ..., p}. Since r²_D ≤ a²_D R² and since the a_j's are nonincreasing,

Σ_{j=1}^{D} [var(Y|X_{m_{j−1}}) − var(Y|X_{m_j})] / a_j² ≤ var(Y|X) R²

for all θ ∈ S_{m_D} such that θ²/(var(Y) − θ²) = r²_D. Indeed, θ² = Σ_{j=1}^{D} [var(Y|X_{m_{j−1}}) − var(Y|X_{m_j})] and var(Y) − θ² = var(Y|X). As a consequence,

{ θ ∈ S_{m_D}, θ²/(var(Y) − θ²) = r²_D } ⊂ { θ ∈ E_a(R), θ²/(var(Y) − θ²) ≥ r²_D }.

Since r_D ≤ ρ_{D,n}, we deduce from Proposition 4.2 that

β_I( { θ ∈ S_{m_D}, θ²/(var(Y) − θ²) = r²_D } ) ≥ δ.

The first result of Proposition 5.3 follows by gathering these lower bounds for all D between 1 and p.
Moreover, ρ²_{i,n} is defined in Proposition 4.2 as ρ²_{i,n} = √2 [ √(L(η)) ∧ L(η)/√(log 2) ] √i/n. If α + δ ≤ 47%, it is straightforward to show that ρ²_{i,n} ≥ √i/n. □

PROOF OF PROPOSITION 5.5. We first need the following lemma.

LEMMA 8.1. We consider (I_j)_{j∈J} a partition of I. For each j ∈ J, let p(j) = |I_j|. For any j ∈ J, we define Θ_j as the set of θ ∈ R^p whose support is included in I_j. For any sequence of positive weights (k_j) such that

Σ_{j∈J} k_j = 1,

it holds that

β_I( ∪_{j∈J} { θ ∈ Θ_j, θ²/(var(Y) − θ²) = r_j² } ) ≥ δ,

if for all j ∈ J, r_j ≤ ρ_{p(j),n}(η/√(k_j)), where the function ρ_{p(j),n} is defined by (4.3).
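The weighting scheme used with Lemma 8.1 below, and the mixture second-moment identity established in its proof, can be checked numerically. This is an illustrative sketch with arbitrary values of J and of the second moments:

```python
# The weights k_j = 1/((j+1) * R(p)) with R(p) = sum_{k=0}^{J} 1/(k+1) sum to
# one, and for a mixture L = sum_j k_j L_j whose cross-products satisfy
# E[L_i L_j] = 1 for i != j, one gets E[L^2] = 1 + sum_j k_j^2 (E[L_j^2] - 1).
J = 12
R = sum(1.0 / (k + 1) for k in range(J + 1))
w = [1.0 / ((j + 1) * R) for j in range(J + 1)]
assert abs(sum(w) - 1.0) < 1e-12

second = [1.0 + 0.3 * j for j in range(J + 1)]          # arbitrary E[L_j^2] >= 1
full = sum(w[i] * w[j] * (second[i] if i == j else 1.0)
           for i in range(J + 1) for j in range(J + 1))
ident = 1.0 + sum(w[j] ** 2 * (second[j] - 1.0) for j in range(J + 1))
assert abs(full - ident) < 1e-12
print("mixture identity verified")
```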


For all j ≥ 0 such that 2^{j+1} − 1 ∈ I [i.e., for all j ≤ J where J = log(p + 1)/log(2) − 1], let S̄_j be the linear span of the e_k's for k ∈ {2^j, ..., 2^{j+1} − 1}. Then dim(S̄_j) = 2^j and S̄_j ⊂ S_{m_D} for D = D(j) = 2^{j+1} − 1. It is straightforward to show that

∪_{j=0}^{J} S̄_j[r_{D(j)}] ⊂ ∪_{j=0}^{J} S_{m_{D(j)}}[r_{D(j)}] ⊂ ∪_{D=1}^{p} S_{m_D}[r_D],

where S̄_j[r_{D(j)}] := { θ ∈ S̄_j, θ²/(var(Y) − θ²) = r²_{D(j)} } and S_{m_D}[r_D] := { θ ∈ S_{m_D}, θ²/(var(Y) − θ²) = r²_D }.
We choose J = {0, ..., J}. For any j ∈ J, we define I_j = {2^j, 2^j + 1, ..., 2^{j+1} − 1}. Applying Lemma 8.1 with k_j := [(j + 1)R(p)]^{−1}, where R(p) := Σ_{k=0}^{J} 1/(k + 1), we get

β_I( ∪_{D=1}^{p} { θ ∈ S_{m_D}, θ²/(var(Y) − θ²) = r²_D } ) ≥ δ,

if for all those D = D(j),

r²_D ≤ √(log(1 + 2η²/k_j)) [ 1 ∧ √(log(1 + 2η²/k_j))/√(2 log 2) ] √D/n.

For D = D(j), this last quantity is lower bounded by

(8.20)    √(log(1 + 2η²/k_j)) [ 1 ∧ √(log(1 + 2η²/k_j))/√(2 log 2) ] √D/n ≥ √(log(1 + 2η²(j + 1)R(p))) [ 1 ∧ √(log(1 + 2η²))/√(2 log 2) ] 2^{j/2}/n.

It remains to check that (8.20) is larger than ρ̄²_{D(j),n}. Using j + 1 = log(D + 1)/log(2) ≥ log(D + 1), we get 2^{j/2} ≥ √(D/2). Thanks to the convexity inequality log(1 + ux) ≥ u log(1 + x), which holds for any x > 0 and any u ∈ ]0, 1], we obtain

√(log(1 + 2η²(j + 1)R(p))) 2^{j/2} ≥ √(D/2) [ η√(2R(p)) ∧ 1 ] √(log[1 + log(D + 1)])
    ≥ √(D/2) (η√2 ∧ 1) √(log log(D + 1))
    ≥ (1/√2) [ 1 ∧ √(log(1 + 2η²)) ] √(log log(D + 1)) √D,

as R(p) is larger than one for any p ≥ 1. All in all, we get the lower bound

√(log(1 + 2η²(j + 1)R(p))) [ 1 ∧ √(log(1 + 2η²))/√(2 log 2) ] 2^{j/2}/n ≥ (1/√(2 log 2)) [ 1 ∧ log(1 + 2η²) ] √(log log(D + 1)) √D/n = ρ̄²_{D,n}.

Thus, if for all 1 ≤ D ≤ p, r²_D is smaller than ρ̄²_{D,n}, it holds that

β_I( ∪_{D=1}^{p} { θ ∈ S_{m_D}, θ²/(var(Y) − θ²) = r²_D } ) ≥ δ. □

PROOF OF LEMMA 8.1. Using a similar approach to the proof of Theorem 4.3, we know that for each r_j ≤ ρ_{p(j),n}(η/√(k_j)) there exists some measure μ_j over

Θ_j[r_j] := { θ ∈ Θ_j, θ²/(var(Y) − θ²) = r_j² }

such that

(8.21)    E₀[L²_{μ_j}(Y, X)] ≤ 1 + η²/k_j.

We now define a probability measure μ = Σ_{j∈J} k_j μ_j over ∪_{j∈J} Θ_j[r_j]. L_{μ_j} refers to the density of P_{μ_j} with respect to P₀. Thus

L_μ(Y, X) = (dP_μ/dP₀)(Y, X) = Σ_{j∈J} k_j L_{μ_j}(Y, X)

and

E₀[L²_μ(Y, X)] = Σ_{j,j′∈J} k_j k_{j′} E₀[L_{μ_j}(Y, X) L_{μ_{j′}}(Y, X)].

Using expression (8.5), it is straightforward to show that if j ≠ j′, then E₀[L_{μ_j}(Y, X) L_{μ_{j′}}(Y, X)] = 1. This follows from the fact that the sets Θ_j and Θ_{j′} are orthogonal with respect to the inner product (2.4). Thus

E₀[L²_μ(Y, X)] = 1 + Σ_{j∈J} k_j² ( E₀[L²_{μ_j}(Y, X)] − 1 ) ≤ 1 + η²,

thanks to (8.21). Using argument (8.2) as in the proof of Theorem 4.3 completes the proof. □

PROOF OF PROPOSITION 5.6. First of all, we only have to consider the case where the covariance matrix of X is the identity. If this is not the case, one only


has to apply the Gram–Schmidt process to X, thus obtaining a vector X′ and a new basis of R^p which is orthonormal. We refer to the beginning of Section 5 for more details.
As for the previous bounds for ellipsoids, we adapt the approach of Section 6 in Baraud [2]. We use the same notation as in the proof of Proposition 5.3. Let D*(R) ∈ {1, ..., p} be an integer which achieves the supremum of ρ̄²_D ∧ (R² a²_D) = r̄²_D. As in the proof of Proposition 5.3, for any R > 0,

{ θ ∈ S_{m_{D*(R)}}, θ²/(var(Y) − θ²) = r²_{D*(R)} } ⊂ { θ ∈ E_a(R), θ²/(var(Y) − θ²) ≥ r²_{D*(R)} }.

When R varies, D*(R) describes {1, ..., p}. Thus we obtain

∪_{1≤D≤p} { θ ∈ S_{m_D}, θ²/(var(Y) − θ²) = r²_D } = ∪_{R>0} { θ ∈ S_{m_{D*(R)}}, θ²/(var(Y) − θ²) = r²_{D*(R)} } ⊂ ∪_{R>0} { θ ∈ E_a(R), θ²/(var(Y) − θ²) ≥ r²_{D*(R)} },

and the result follows from Proposition 5.5. □

APPENDIX

PROOF OF PROPOSITION 3.1. The test associated with procedure P1 corresponds to a Bonferroni procedure. Hence, we prove that its size is less than α by the following argument: let θ be an element of S_V (defined in Section 2.2). Then

P_θ(T_α > 0) ≤ Σ_{m∈M} P_θ( φ_m(Y, X) − F̄^{−1}_{D_m,N_m}(α_m) > 0 ),

where φ_m(Y, X) is defined in (2.2). The test rejects if, for some model m, φ_m(Y, X) is larger than F̄^{−1}_{D_m,N_m}(α_m). As θ belongs to S_V, Π_{V∪m}Y − Π_V Y = Π_{V∪m}ε − Π_V ε and Y − Π_{V∪m}Y = ε − Π_{V∪m}ε, where Π denotes orthogonal projection. Then the quantity φ_m(Y, X) is equal to

φ_m(Y, X) = (N_m/D_m) ‖Π_{V∪m}ε − Π_V ε‖²_n / ‖ε − Π_{V∪m}ε‖²_n.
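As a quick illustrative simulation (not part of the paper), a ratio of independent chi-square variables scaled as above follows a Fisher F(D, N) distribution, whose mean is N/(N − 2); the Monte Carlo parameters below are arbitrary:

```python
import random

# Monte Carlo sanity check: (N/D) * ||chi_D||^2 / ||chi_N||^2 built from
# independent standard Gaussians is F(D, N)-distributed, with mean N/(N-2).
random.seed(0)
D, N, reps = 2, 10, 20000
acc = 0.0
for _ in range(reps):
    num = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(D))
    den = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(N))
    acc += (N / D) * num / den
assert abs(acc / reps - N / (N - 2)) < 0.1
print("empirical mean:", round(acc / reps, 3))
```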


Because ε is independent of X, the distribution of φ_m(Y, X) conditionally on X is a Fisher distribution with D_m and N_m degrees of freedom. As a consequence, φ_{m,α_m}(Y, X) is a Fisher test with D_m and N_m degrees of freedom. It follows that

P_θ(T_α > 0) ≤ Σ_{m∈M} α_m ≤ α.

The test associated with procedure P2 has the property of being of size exactly α. More precisely, for any θ ∈ S_V, we have that

P_θ(T_α > 0 | X) = α,    X a.s.

The result follows from the fact that q_{X,α} satisfies

P( sup_{m∈M} [ (N_m/D_m) ‖Π_{V∪m}(ε) − Π_V(ε)‖²_n / ‖ε − Π_{V∪m}(ε)‖²_n − F̄^{−1}_{D_m,N_m}(q_{X,α}) ] > 0 | X ) = α,

and that for any θ ∈ S_V, Π_{V∪m}Y − Π_V Y = Π_{V∪m}ε − Π_V ε and Y − Π_{V∪m}Y = ε − Π_{V∪m}ε. □

PROOF OF PROPOSITION 3.2. We come back to the definitions of T¹_α and T²_α:

T¹_α(X, Y) = sup_{m∈M} { φ_m(Y, X) − F̄^{−1}_{D_m,N_m}(α/|M|) },

T²_α(X, Y) = sup_{m∈M} { φ_m(Y, X) − F̄^{−1}_{D_m,N_m}(q_{X,α}) }.

Conditionally on X, the size of T¹_α is smaller than α, whereas the size of T²_α is exactly α. As a consequence, q_{X,α} ≥ α/|M|, as the statistics T¹_α and T²_α differ only through these quantities. Thus T²_α(X, Y) ≥ T¹_α(X, Y), (X, Y) almost surely, and the result (3.4) follows. □

Acknowledgments. We gratefully thank Sylvie Huet and Pascal Massart for many fruitful discussions. We also thank the two referees for their helpful comments.

REFERENCES

[1] ALDOUS, D. J. (1985). Exchangeability and related topics. In École d'été de probabilités de Saint-Flour XIII. Lecture Notes in Math. 1117. Springer, Berlin. MR0883646
[2] BARAUD, Y. (2002). Non-asymptotic rates of testing in signal detection. Bernoulli 8 577–606. MR1935648
[3] BARAUD, Y., HUET, S. and LAURENT, B. (2003). Adaptive tests of linear hypotheses by model selection. Ann. Statist. 31 225–251. MR1962505
[4] BÜHLMANN, P., KALISCH, M. and MAATHUIS, M. H. (2009). Variable selection for high-dimensional models: Partially faithful distributions and the PC-simple algorithm. Biometrika. To appear.
[5] CANDÈS, E. and TAO, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Statist. 35 2313–2351. MR2382644

[6] COWELL, R. G., DAWID, A. P., LAURITZEN, S. L. and SPIEGELHALTER, D. J. (1999). Probabilistic Networks and Expert Systems. Springer, New York. MR1697175
[7] CRESSIE, N. (1993). Statistics for Spatial Data, revised ed. Wiley, New York. MR1239641
[8] DRTON, M. and PERLMAN, M. (2007). Multiple testing and error control in Gaussian graphical model selection. Statist. Sci. 22 430–449. MR2416818
[9] EFRON, B., HASTIE, T., JOHNSTONE, I. and TIBSHIRANI, R. (2004). Least angle regression. Ann. Statist. 32 407–499. MR2060166
[10] GIRAUD, C. (2008). Estimation of Gaussian graphs by model selection. Electron. J. Stat. 2 542–563. MR2417393
[11] HUANG, J., LIU, N., POURAHMADI, M. and LIU, L. (2006). Covariance matrix selection and estimation via penalised normal likelihood. Biometrika 93 85–98. MR2277742
[12] INGSTER, Y. I. (1993). Asymptotically minimax hypothesis testing for nonparametric alternatives I. Math. Methods Statist. 2 85–114. MR1257978
[13] INGSTER, Y. I. (1993). Asymptotically minimax hypothesis testing for nonparametric alternatives II. Math. Methods Statist. 3 171–189. MR1257983
[14] INGSTER, Y. I. (1993). Asymptotically minimax hypothesis testing for nonparametric alternatives III. Math. Methods Statist. 4 249–268. MR1259685
[15] KISHINO, H. and WADDELL, P. (2000). Correspondence analysis of genes and tissue types and finding genetic links from microarray data. Genome Informatics 11 83–95.
[16] LAURENT, B. and MASSART, P. (2000). Adaptive estimation of a quadratic functional by model selection. Ann. Statist. 28 1302–1338. MR1805785
[17] LAURITZEN, S. L. (1996). Graphical Models. Oxford Univ. Press, New York. MR1419991
[18] MASSART, P. (2007). Concentration inequalities and model selection. In École d'été de probabilités de Saint-Flour XXXIII. Lecture Notes in Math. 1896. Springer, Berlin. MR2319879
[19] MEINSHAUSEN, N. and BÜHLMANN, P. (2006). High-dimensional graphs and variable selection with the Lasso. Ann. Statist. 34 1436–1462. MR2278363
[20] RUE, H. and HELD, L. (2005). Gaussian Markov Random Fields: Theory and Applications. Chapman and Hall/CRC, London. MR2130347
[21] SCHÄFER, J. and STRIMMER, K. (2005). An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics 21 754–764.
[22] SPOKOINY, V. G. (1996). Adaptive hypothesis testing using wavelets. Ann. Statist. 24 2477–2498. MR1425962
[23] VERZELEN, N. and VILLERS, F. (2009). Tests for Gaussian graphical models. Comput. Statist. Data Anal. 53 1894–1905.
[24] WAINWRIGHT, M. J. (2007). Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting. Technical Report 725, Dept. Statistics, Univ. California, Berkeley.
[25] WILLE, A. and BÜHLMANN, P. (2006). Low-order conditional independence graphs for inferring genetic networks. Stat. Appl. Genet. Mol. Biol. 5 Art. 1 (electronic). MR2221304
[26] WILLE, A., ZIMMERMANN, P., VRANOVA, E., FÜRHOLZ, A., LAULE, O., BLEULER, S., HENNIG, L., PRELIC, A., VON ROHR, P., THIELE, L., ZITZLER, E., GRUISSEM, W. and BÜHLMANN, P. (2004). Sparse graphical Gaussian modelling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biology 5 11.
[27] YUAN, M. and LIN, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika 94 19–35. MR2367824
[28] ZHANG, C.-H. and HUANG, J. (2008). The sparsity and bias of the LASSO selection in high-dimensional linear regression. Ann. Statist. 36 1567–1594. MR2435448
[29] ZHAO, P. and YU, B. (2006). On model selection consistency of Lasso. J. Mach. Learn. Res. 7 2541–2563. MR2274449

[30] ZOU, H. and HASTIE, T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 67 301–320. MR2137327

LABORATOIRE DE MATHÉMATIQUES D'ORSAY
UNIVERSITÉ PARIS-SUD
CNRS
ORSAY CEDEX, F-91405
FRANCE
E-MAIL: [email protected]

LABORATOIRE DE BIOMÉTRIE
INRA
78352 JOUY-EN-JOSAS CEDEX
FRANCE
E-MAIL: [email protected]