# Method of Contraction-Expansion (MOCE) for Simultaneous Inference in Linear Models

Simultaneous inference after model selection is of critical importance to address scientific hypotheses involving a set of parameters. In this paper, we consider high-dimensional linear regression model in which a regularization procedure such as LASSO is applied to yield a sparse model. To establish a simultaneous post-model selection inference, we propose a method of contraction and expansion (MOCE) along the line of debiasing estimation that enables us to balance the bias-and-variance trade-off so that the super-sparsity assumption may be relaxed. We establish key theoretical results for the proposed MOCE procedure from which the expanded model can be selected with theoretical guarantees and simultaneous confidence regions can be constructed by the joint asymptotic normal distribution. In comparison with existing methods, our proposed method exhibits stable and reliable coverage at a nominal significance level with substantially less computational burden, and thus it is trustworthy for its application in solving real-world problems.

## Authors

• 77 publications
• 7 publications
• 4 publications
• 4 publications
• ### Exact post-selection inference, with application to the lasso

We develop a general approach to valid inference after model selection. ...
11/25/2013 ∙ by Jason D. Lee, et al. ∙ 0

• ### A Model Free Perspective for Linear Regression: Uniform-in-model Bounds for Post Selection Inference

For the last two decades, high-dimensional data and methods have prolife...
02/15/2018 ∙ by Arun Kumar Kuchibhotla, et al. ∙ 0

• ### Simultaneous Confidence Tubes for Comparison of Several Multivariate Linear Regression Models

Much of the research on multiple comparison and simultaneous inference i...
05/20/2020 ∙ by Jianan Peng, et al. ∙ 0

• ### Robust adaptive variable selection in ultra-high dimensional regression models based on the density power divergence loss

We consider the problem of simultaneous model selection and the estimati...
04/11/2020 ∙ by Abhik Ghosh, et al. ∙ 0

• ### Permutation-based simultaneous confidence bounds for the false discovery proportion

When multiple hypotheses are tested, interest is often in ensuring that ...
08/16/2018 ∙ by Jesse Hemerik, et al. ∙ 0

• ### Simultaneous Inference Under the Vacuous Orientation Assumption

I propose a novel approach to simultaneous inference that alleviates the...
05/15/2019 ∙ by Ruobin Gong, et al. ∙ 0

• ### A Java library to perform S-expansions of Lie algebras

The contraction method is a procedure that allows to establish non-trivi...
03/11/2017 ∙ by C. Inostroza, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

We consider the linear model with a response vector

and an design matrix ,

 y=Xβ∗+ϵ, (1.1)

where denotes a -dimensional vector of unknown true regression coefficients, and is an -dimensional vector of random errors with mean zero and variance , where is the identity matrix. All columns in are normalized to have mean zero and -norm 1. The sample covariance matrix of predictors and its corresponding population covariance matrix are denoted by and , respectively. Let be the support of with cardinality . In this paper, assuming as , we focus on simultaneous statistical inferences on a certain parameter subset of when .

Arguably, in the setting of , a simultaneous inference for the entire set of parameters, i.e. , is generally not tractable due to the issue of model identification. A key assumption widely adopted in the current literature to facilitate statistical inference is the sparsity of , namely , in addition to regularity conditions on the design matrix; see for example [RSSB12094, vandegeer2014, zhangcun2014], among others. The sparsity assumption of the true signals necessitates variable selection, which has been extensively studied in the past two decades or so. Being one of the most celebrated variable selection methods, Least Absolute Shrinkage and Selection Operator (LASSO)[tibshirani94] has gained great popularity in both theory and applications. Specifically, a LASSO estimator is obtained by minimizing the following penalized objective function:

 ^βλ=argminβ∈Rp(12n∥y−Xβ∥22+λ∥β∥1), (1.2)

where is the -norm of a vector and is the tuning parameter. Based on this LASSO estimator, , given in (1.2), statistical inferences for parameters in in the aspects of hypothesis test and confidence region construction have recently received a surge of attention in the literature because statistical inference has been always playing a central role in the statistical theory and providing one of the most effective ways for the transition of data to knowledge.

Some progresses in post-model selection inferences have been reported in the literature. The method LASSO+mLS proposed in [liu2013] first performs LASSO model selection and then draws statistical inferences based on the selected model. This approach requires model selection consistency and some incoherence conditions on the design matrix [yu2006, meinshausen2009, Buhlmann2011]. Inference procedures built upon those conditions have been noted as being impractical and exhibited poor performances due to the lack of uniform validity of inferential precedures over sequences of models; see for example, [leeb2008sparse, chernozhukov2014valid].

To overcome the reliance on the oracle asymptotic distribution in inference, many solutions have been proposed in recent years. Among those, three methods are so far known for a valid post-model selection inference. (i) The first kind is sample splitting method [wasserman2009, noicolai2009, meinshausen2010stability] and resampling method [minnier2012perturbation]. A key drawback of the sample splitting method is its requirement of the beta-min assumption, while the resampling approach entails a strong restrictive exchangeability condition on the design matrix. (ii) The second kind is group inference proposed in [RSSB12094]. Unfortunately, this approach fails to show desirable power to detect individual signals, and thus it is not useful in practical studies. (iii) The third kind is low-dimensional projection (LDP) [zhangcun2014, vandegeer2014, java2014]. Such inferential method is rooted in a seminal idea of debiasing, resulting from the use of penalized objective function that causes estimation shrinkage. This method will be adopted in this paper for a new paradigm of post-model selection inference. Following the debiasing approach proposed by [zhangcun2014], [cai2017] investigates both adaptivity and minimax rate of the debiasing estimation, which provides useful insights on the rate of model contraction and expansion considered in this paper. Specifically, an LDP estimator, , takes a debiasing step under an operation of this form: , where is a sparse estimate of precision matrix . When matrix is properly constructed with a well-controlled behavior, the bias term, , would become asymptotically negligible. In this case, statistical inference can be conducted using the debiased estimator . It is known that obtaining a desirable is not a trivial task due to the singularity of sample covariance . For examples, [vandegeer2014] proposes to use node-wise LASSO to get , while [java2014] adopts a convex optimization algorithm to obtain . It is worth noting that these existing approaches are computationally burdensome, and require extra regularity conditions to ensure the estimated sparse to be feasible and stable. In the setting of the LDP estimator, [zhangcheng2017] proposes a bootstrap-based simultaneous inference for a group, say , of parameters in via the distribution of quantity , where the bootstrap resampling, unfortunately, demands much more computational power than a regular LDP estimator based on the node-wise LASSO estimation .

Overcoming the excessive computational cost on acquiring motivates us to consider a ridge type of approximation to the precision matrix , in a similar spirit to the approach proposed by Ledoit2004365 for estimation of a high-dimensional covariance matrix. Note that the LASSO estimator satisfies the following Karush-Kuhn-Tucker (KKT) condition:

 −1nXTϵ+S(^βλ−β∗)+λκ=0, (1.3)

where is the subdifferential of whose th component is if , if , and if . Let be a diagonal matrix with all positive element , . We propose to add a term , and then multiply on the both sides of (1.3), leading to an equivalent expression of (1.3),

 −1n^Σ−1τXTϵ+{(^βλ+λ^Σ−1τκ)−β∗}−^Σ−1ττ(^βλ−β∗)=0, (1.4)

where is the debiasing estimator, and is a ridge-type sample covariance matrix. It is easy to see that on the basis of (1.4), establishing a valid inference on becomes straightforward if is nonsingular and bias term may be asymptotically negligible under a properly tuned matrix . The associated technical treatments are of theoretical interest but methodologically challenging. To address such challenges, in this paper, we propose a new approach, termed as Method of Contraction and Expansion (MOCE).

Our solution based on the proposed MOCE offers a practically feasible way to perform a valid simultaneous post-model selection inference in which the ridge type matrix is properly tuned to establish desirable theoretical guarantees. As seen later in the paper, the ridge matrix

plays a key role in determining the length of confidence interval, which can vary according to signal strengths. That is, MOCE is able to provide a wider confidence interval which is deemed for a strong signal to achieve a proper coverage, while a shorter one for a null signal. This is because a null signal is known with zero coefficient (i.e., no need for estimation once being identified), whereas a non-null signal is only known with non-zero coefficient, which needs to be further estimated in order to construct its confidence interval, and thus incurs extra variability in inference. Specifically, MOCE takes on an expanded model

that is enlarged from an initially selected model, in the hope that the bigger model may include most of “weak” signals which will be handled together with strong signals in inference. In this way, weak signals that have non-zero coefficients are separated from null signals that have zero coefficients. Technically, we attempt to build an expanded model big enough so that it is able to cover both strong signals and most, if not all, of weak signals under some mild regularity conditions. Implementing the idea of model expansion is practically feasible; for example, the LASSO method allows us not only to identify strong signals, but also to rank predictors in a descending order via their solution paths. With a given expanded model, MOCE modifies the original KKT condition accordingly, where the precision matrix is estimated by . Under the sparsity assumption and some additional mild conditions, the bias term in (1.4) vanishes asymptotically with a proper rate, and consequently confidence region for a set of regression parameters is readily constructed in the paradigm of MOCE.

This paper makes new contributions to the following five domains. (i) MOCE is established under weaker sparsity conditions required for valid simultaneous inference in comparison to those given in the current literature. That is, MOCE assumes the sparsity condition , instead of the popular sup-sparsity assumption, ; more importantly, MOCE does not demand additional sparsity assumptions required by the node-wise LASSO to obtain sparse estimate of the precision matrix. (ii) MOCE is shown to achieve a smaller error bound in terms of mean squared error (MSE) in comparison to the seminal LDP debiasing method. In effect, MOCE estimator has the MSE rate with being the size of the expanded model, clearly lower than , the rate of the LDP estimator. (iii) MOCE enjoys both reproducibility and numerical stability in inference because the model expansion leaves little ambiguity for post-selection inference as opposed to many existing methods based on a selected model that may vary substantially due to different tuning procedures [berk2013]. (iv) MOCE is advantageous for its fast computation, because of the ridge-type regularization, which is known to be conceptually simple and computationally efficient. It is shown that the computational complexity of MOCE is of order , in comparison to the order of the LDP method. (v) MOCE enables us to construct a new simultaneous test similar to the classical Wald test for a set of parameters based on its asymptotic normal distribution. The proposed hypothesis test method is computationally superior to the bootstrap-based test [zhangcheng2017] based on the sup-norms of individual estimation errors. All these improvements above make the MOCE method ready to be applied in real-world applications.

The rest of the paper is organized as follows. Section 2 introduces notation and Section 3 provides preliminary results that are used in the proposed MOCE method. In Section 4 we discuss in detail about MOCE and its algorithm, including computational complexity and schemes for model expansion. Section 5 concerns theoretical guarantees for MOCE, including a new simultaneous test. Through simulation experiments, Section 6 illustrates performances of MOCE, with comparison to existing methods. Section 7 contains some concluding remarks. Some lengthy technical proofs are included in the Appendix.

## 2 Notation

For a vector , the -norm is ; the -norm is ; and the -norm is . For a matrix , the -norm is and the Frobenious norm is where is the trace of matrix . Refer to [horn2012] for other matrix norms. Let and

be the smallest and largest nonzero singular values of a positive semi-definite matrix

, respectively.

With a given index subset , vector and matrix can be partitioned as and . For two positive definite matrices and , their Löewner order indicates that is positive definite. For two sequences of real numbers and , the expression means that there exist positive constants and such that .

For the self-containedness, here we introduce restricted eigenvalue

condition and sparse eigenvalue condition; refer to [bickel2009] for more details. For a given subset and a constant , define the following subspace in :

 R(J,k):={ν∈Rp:∥νJc∥1≤k∥νJ∥1}.

A sample covariance matrix is said to satisfy the restricted eigenvalue condition if for and there exists a constant such that

 minJ⊂{1,…,p}|J|≤s minν∈R(J,k)∥Xν∥22n∥ν∥22≥ϕ0. (2.1)

A sample covariance matrix is said to satisfy the sparse eigenvalue condition if for any with it holds

 0<λmin(s)≤λmax(s)<∞, (2.2)

where

 λmin(s):=min∥ν∥0≤sν≠0∥Xν∥22n∥ν∥22,λmax(s):=max∥ν∥0≤sν≠0∥Xν∥22n∥ν∥22.

## 3 Preliminary Results

The first regularity condition on the design matrix is given as follows.

###### Assumption 1.

The design matrix in the linear model (1.1) satisfies the condition for and , where is the number of non-null signals.

Assumption 1 above is routinely assumed for design matrix in a high-dimensional linear model; see for example, [bickel2009, zhangcun2014]. Note that the compatibility assumption given in [vandegeer2014] is slightly weaker than Assumption 1.

As discussed above, when the bias term in (1.4) is asymptotically negligible, the modified KKT (1.4) enables us to establish an asymptotic distribution for the proposed debiasing estimator of the form:

 ~βτ=^βλ+λ^Σ−1τκ. (3.1)

Lemma 3.1 below assesses both Frobenious norm and -norm of , a key term in the bias. This lemma suggests that when it is impossible to fully reduce the LASSO bias in [buhlmann2013]. Rather, in this paper, alternatively, we are able to establish an appropriate order for the ridge tuning parameters in matrix, with which the resulting is controlled at a desirable large-sample rate.

###### Lemma 3.1.

Consider the sample covariance . Let the ridge matrix with for , and and . Let . Then, the Frobenious norm and -norm of are given as follows, respectively:

 max(p−n,0)+min(n,p){ρ+max(τ−1/2Sτ−1/2)+1}2≤∥^Σ−1ττ∥2F ≤max(p−n,0)+min(n,p){ρ+min(τ−1/2Sτ−1/2)+1}2; τminρ+max(S)+τmax≤|^Σ−1ττ|∞≤⎧⎨⎩τmaxτmin,if p>n;τmaxρ+min(S)+τmin,if p≤n.

Proof of Lemma 3.1 is given in Appendix A.1. According to Lemma 3.1, when , it is interesting to note that the -norm is bounded above by . This upper bound may converge to 0 if and . On the other hand, when , its upper bound is , which is always greater than or equal to 1. Hence, when the bias term can be controlled by an appropriately small , leading to a simultaneous inference on by the means of debiasing. In contrast, the case presents the difficulty of bias reduction for . Such insight motivates us to seek for an alternative solution in the framework of post-model selection inference, resulting in our proposed MOCE.

The proposed MOCE mimics the well-known physical phenomenon of thermal contraction and expansion for materials with the tuning parameter being an analog to temperature. Specifically, MOCE reduces LASSO estimation bias in two steps as shown in Figure 1. In the step of contraction, LASSO selects a model , represented by the small circle in Figure 1, which may possibly miss some signals contained in the signal set . In the step of expansion, MOCE enlarges to form an expanded model , indicated by the large circle in Figure 1. As a result, the signal set would be completely contained by the expanded model . In other words, MOCE begins with an initial model through the LASSO regularization which contains most of important signals, and then expands into a bigger model to embrace not only strong signals, but also almost all weak signals. Refer to Section 4.2 where required specific conditions and rules are discussed for the model expansion.

We now introduce notations necessary for a further discussion on the step of model expansion. Let be a LASSO selected model, whose cardinality is denoted by . Here, both and are tuning parameter -dependent, which is suppressed for the sake of simplicity. Similarly, let be an expanded model with cardinality denoted by . Given and , model expansion leads to disjoint subsets of predictors which may conveniently be presented in Table 1 by a 2-way cross-classification, respectively, for the LASSO selected model (the left table) and (the right table), the complement of .

Among these subsets, two are of primary interest, namely, and , given as follows, respectively:

 Bfn=A∩~Ac,Btn=Ac∩~Ac, (3.2)

and let their cardinalities are and , respectively. collects signals missed by expanded model (i.e., false negatives), while collects all null signals that expanded model does not contain (i.e., true negatives). With expanded model , we assume that the design matrix satisfies Assumption 2.

###### Assumption 2.

The design matrix in the linear model (1.1) satisfies the sparse eigenvalue condition for .

Assumption 2 ensures that any main diagonal sub-matrices of sample covariance matrix has finite positive minimum and maximum singular values, which essentially requires any selected model, or , to have well-defined Hessian matrices.

## 4 MOCE Method

We first introduce MOCE and then discuss its computational complexity. In particular, procedures for model expansion are discussed in detail in section 4.2.

### 4.1 Moce

Suppose an expanded model has been given. We partition a LASSO estimator given in (1.2) as . Rewrite KKT condition (1.3) according to this partition, respectively, for and :

 −1nXT~A(y−X~A^β~A−XBfn^βBfn−XBtn^βBtn)+λκ~A =0, (4.1) −1nXT~Ac(y−X~A^β~A−X~Ac^β~Ac)+λκ~Ac =0. (4.2)

It follows from (4.1) that

 S~ABfn(^βBfn−β∗Bfn)+S~ABtn^βBtn−1nXT~Aϵ+S~A~A(^β~A−β∗~A)+λκ~A =0. (4.3)

In regard to expanded model , the corresponding -matrix is an positive diagonal matrix, denoted by , and the corresponding ridge sample covariance submatrix is denoted by . Adding and multiplying on both sides of equation (4.3), we have

 ^β~Aτa−β∗~A=1n^Σ−1~A~AXT~Aϵ+ra, (4.4)

where the debiasing estimator of subvector takes the form:

 ^β~Aτa=^β~A+λ^Σ−1~A~Aκ~A, (4.5)

and the remainder is given by

 ra=^Σ−1~A~Aτa(^β~A−β∗~A)+^Σ−1~A~AS~ABtn^βBtn+^Σ−1~A~AS~ABfn(^βBfn−β∗Bfn)def=I11+I12+I13. (4.6)

If holds, Lemma 4.1 shows that . Thus, as stated in Theorem 5.1 equation (4.4) implies that is consistent and follows asymptotically a normal distribution.

Now, consider the complementary model . Following similar steps of deriving equation (4.4), we rewrite (4.2) as follows:

 S~Ac~A(^β~A−β∗~A)+^Σ~Ac~Ac(^β~Ac−β∗~Ac)+λκ~Ac=1nXT~Acϵ+τc(^β~Ac−β∗~Ac),

where the corresponding ridge sample covariance submatrix is and is a matrix of positive diagonals. Plugging (4.4) and (4.5) into the above equation, we can show

 ^β~Acτc−β∗~Ac= 1n^Σ−1~Ac~Ac(XT~Ac−S~Ac~A^Σ−1~A~AXT~A)ϵ+rc, (4.7)

where is the debiasing estimator of subvector , which takes the following form:

 ^β~Acτc=^β~Ac+λ^Σ−1~Ac~Acκ~Ac−λ^Σ−1~Ac~AcS~Ac~A^Σ−1~A~Aκ~A, (4.8)

and the associated remainder term is

If holds, we can show in Lemma 4.1.

Now, combining the two estimators (4.5) and (4.8), namely , we express the proposed MOCE estimator for as follows,

 ^βτ=^βλ+λL−1τκ, (4.10)

where matrix is a block matrix given by

 L−1τ=⎛⎝^Σ−1~A~A0−^Σ−1~Ac~AcS~Ac~A^Σ−1~A~A^Σ−1~Ac~Ac⎞⎠.

In comparison to equation (3.1), in (4.10) the MOCE presents a different bias correction term, . Consequently, the inverse matrix of , , takes the form of

 Lτ=(^Σ~A~A0S~Ac~A^Σ~Ac~Ac),

which is different from the ridge covariance matrix in (3.1). The fact of being a lower triangular matrix implies that the MOCE estimator in (4.8) on has no impact on in (4.5) on .

### 4.2 Model expansion and size determination

A primary purpose of model expansion is to control the uncertainty of model selection at a lower level than the sampling uncertainty. This may be achieved by some regularity conditions. Intuitively, when an expanded model is too small, is likely to miss many weak signals; on the other hand, when an expanded model is too large, would include many noise signals. The size of expanded model in MOCE is critical as it pertains to a trade-off between uncertainty of model selection and efficiency of statistical inference. In this setting, the theory for the selection of tuning parameter and is also relevant.

Donoho94idealspatial show that at a hard threshold LASSO can achieve the performance of an oracle within a factor of in terms of mean squared error. Under the Donoho-Johnstone’s order , NIPS2009_3697 develops a consistent thresholding procedure for variable selection. For the purpose of inference, we want to have a relatively large model to include most weak signals, so we set . We consider a factor to scale the product , defined as the smallest integer such that

 p∑i=1min{|β∗i|,λsσ}≤a∗λsσ. (4.11)

Note that term represents a compound of model selection uncertainty and sampling uncertainty . Denote a signal set

 As={j:|β∗j|>λsσ,j=1,…,p}, (4.12)

whose cardinality is . Clearly . It is worth noting that factor measures the overall cumulative signal strength, while size of is the number of signals stronger than the corresponding factor . Essentially, the set given in (4.12) is formed by the signal-to-noise ratio, where the noise arises from both model selection uncertainty and sampling error uncertainty . Apparently, also contains the set of stronger signals defined by .  that is, .

For a given signal set , Assumption 3 below describes characteristics of expanded model .

###### Assumption 3.

.

Assumption 3 is a very weak condition; first, it holds when , that is, expanded model contains all signals. However, this full capture may be relaxed in MOCE; in other words, Assumption 3 permits expanded model to leak some weak signals with their strength being order of .

###### Assumption 4.

.

Assumption 4 is a very mild condition too, which can always be satisfied if . This assumption is imposed to protect rare occasions when an initial LASSO selection ends up with a model containing excessively many small nonzero coefficients. In this case, to proceed MOCE for inference, Assumption 4 requires to choose a relatively small which may not necessarily cover . As stated in Assumption 3, the leakage of very weak signals is allowed by MOCE in inference.

When LASSO solution paths are monotonic in , we may choose a hard threshold to directly determine the size of . The fact of being smaller than implies that more variables are included in . Assumption 4 further implies that the maximum signal strength among the false negatives and true negatives is well controlled; that is,

 max{∥^βBtn∥2,∥^βBfn∥2}≤max{∥^βBtn∩^A∥2,∥^βBfn∩^A∥2}≤√^a∥^β~Ac∩^A∥∞=op(1/√n). (4.13)

In practice, the size of may be set to where is the largest tuning value in LASSO solution paths at which all parameters are shrunk to zero. We first select variables contained in into if . Next we introduce a noise injection step to randomly select predictors into from variables with zero estimates at . This noise injection step eseentially helps reduce the sensitivity of the expanded model with variable selection relative to the sampling variability. It is worthy to comment that although LASSO has been the method of choice for our procedure in this paper, in fact, the proposed MOCE allows other methods to construct as long as a chosen expanded model satisfies Assumptions 3 and 4. Based on above assumptions, Lemma 4.1 assesses the remainder terms in (4.6) and in (4.9) in terms of -norm.

###### Lemma 4.1.

Suppose that Assumptions 14 hold. Assume , and

 ρ+max(τa)=o(√logp/n),ρ+min(τc)=O(√λmax(p−~a)). (4.14)

Then, and .

###### Proof.

By the expression of in (4.6), it suffices to show that three terms , and are all of order . Similarly, by the expression of in (4.9), the order of is established if both terms and