# Constrained High Dimensional Statistical Inference

In typical high dimensional statistical inference problems, confidence intervals and hypothesis tests are performed for a low dimensional subset of model parameters under the assumption that the parameters of interest are unconstrained. However, in many problems, there are natural constraints on model parameters and one is interested in whether the parameters are on the boundary of the constraint or not. e.g. non-negativity constraints for transmission rates in network diffusion. In this paper, we provide algorithms to solve this problem of hypothesis testing in high-dimensional statistical models under constrained parameter space. We show that following our testing procedure we are able to get asymptotic designed Type I error under the null. Numerical experiments demonstrate that our algorithm has greater power than the standard algorithms where the constraints are ignored. We demonstrate the effectiveness of our algorithms on two real datasets where we have intrinsic constraint on the parameters.

## Authors

• 9 publications
• 12 publications
• 48 publications
• ### A Flexible Framework for Hypothesis Testing in High-dimensions

Hypothesis testing in the linear regression model is a fundamental stati...
04/26/2017 ∙ by Adel Javanmard, et al. ∙ 0

• ### Statistical inference for high dimensional regression via Constrained Lasso

In this paper, we propose a new method for estimation and constructing c...
04/17/2017 ∙ by Yun Yang, et al. ∙ 0

• ### Generative discriminative models for multivariate inference and statistical mapping in medical imaging

This paper presents a general framework for obtaining interpretable mult...
07/02/2018 ∙ by Erdem Varol, et al. ∙ 0

• ### Explaining Constraint Interaction: How to Interpret Estimated Model Parameters under Alternative Scaling Methods

In this paper, we explain the reasons behind constraint interaction, whi...
04/30/2018 ∙ by Stefan Klößner, et al. ∙ 0

• ### Spatially relaxed inference on high-dimensional linear models

We consider the inference problem for high-dimensional linear models, wh...
06/04/2021 ∙ by Jérôme-Alexis Chevalier, et al. ∙ 5

• ### The empirical duality gap of constrained statistical learning

This paper is concerned with the study of constrained statistical learni...
02/12/2020 ∙ by Luiz F. O. Chamon, et al. ∙ 0

• ### Family learning: nonparametric statistical inference with parametric efficiency

Hypothesis testing and other statistical inference procedures are most e...
11/27/2017 ∙ by William Fithian, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Statistical estimation of high dimensional problems has been attracting more and more attention due to the abundance of such data in many emerging fields such as genetic studies, social network analysis, etc. High dimensional geometry is inherently different from low-dimensional geometry. As an example, for linear regression, in low dimensions the Ordinary Least Square (OLS) estimator allows for constructing confidence intervals and hypothesis tests for the true coefficient. In high dimensional models OLS is ill-conditioned so instead we have to solve for penalized estimators like LASSO. In low dimensions we can test for hypothesis such as

by partial likelihood function while in high dimensions this also fails, due to the large amount of nuisance parameters.

In this paper we consider a hypothesis testing problem in a high dimensional model under constrained parameter space. For many problems, before analyzing data and fitting models we might already know some constraints on the parameters. This can also be viewed as prior information on the parameters. For example in isotonic regression [6, 66, 11] we have a constraint that the variables are non-decreasing; in non-negative least square problem [58]

we have a constraint that the coefficients are non-negative. in real-world reinforcement learning applications, we need to take into consideration the safety of the agent

[5, 64, 72]. Also, in Gaussian process, it is sometimes assumed that the parameters satisfy some linear inequality constraints [40].

With this additional information the statistical inference and hypothesis testing for the parameters may be different. For example, consider a simple model: . In general if we want to test whether is 0 or not, i.e. test for versus , we will reject if the absolute value of the mean is relatively large. However, if we have the constraint that , then we are testing versus , and we reject only when is relatively large.

When we have constraints on parameters, a natural question we want to answer is whether the parameter lies on the boundary or is away from the boundary, since these two cases are usually very different. For example for nonnegativity constraint, we want to know whether the parameter is exactly zero or strictly positive; for monotonic constraint we want to know whether the two variables are equal or one is strictly greater than the other.

In this paper we perform statistical inference (hypothesis testing) for low dimensional parameters in a high dimensional model under cone constraint. Denote the parameter , where is the low dimensional parameter of interest and denotes nuisance parameters. Denote the constraint set as a closed and convex cone, and let be a linear space in . In most of the cases is a polyhedron and the linear space denotes the (subset of) the boundary set of . In this paper we want to test

 H0:α∈MversusHA:α∈C∖M, (1)

where we have the constraint . We develop an algorithm for this constrained testing problem in high dimensional models. Following our procedure we show that the hypothesis test method we propose has asymptotic designed Type I error, and it has much greater power than when the constraints are ignored.

### 1.1 Related Work

#### High-dimensional inference without constraint.

There is a vast literature on performing statistical inference for high dimensional models and here we provide a brief overview. Early work [34] shows that the limiting distribution of LASSO estimator is not normal even in low dimensions. More recently, several approaches have been proposed to obtain asymptotic distribution on low dimensional parameters in high dimensional linear model, mostly by approximating the inverse of the Gram matrix. [73] gives confidence intervals for low dimensional parameters in a high dimensional linear model using low dimensional projection estimator (LDPE). [29] provides asymptotic confidence interval of LASSO estimator for high dimensional linear regression by introducing the debiasing method. [62] further extends their work to a more general setting, including Generalized Linear Model and other nonlinear models. [47] deals with general model on Hessian matrix with Dantzig type estimator. Related works also include [12, 74] for simultaneous inference, [4] for double selection method, [52, 71, 70] for graphical model, [61, 67, 37] for post selective inference, [39, 9] for for synthetic control, [59] for noisy labels, etc.

#### Low-dimensional constrained inference.

The literature on constrained testing dates back to [10]

, where the authors prove the asymptotic distribution of the likelihood ratio (LR) test statistic for constrained testing to be weighted Chi-square.

[48] further considers testing with unknown covariance matrix, and gives sharp upper and lower bounds for the weights. [23] introduces the test statistics for likelihood ratio test, Wald test, and Kuhn-Tucker test with inequality constraint in linear model, and proves the equivalence of these three tests. [32] proposes one-sided -test when the coefficients’ signs are known. [54] introduces a modified Lagrange multiplier test for testing one-sided problem. [35] proposes Wald test for jointly testing equality and inequality constraints on the parameters. [65] develops asymptotically equivalent tests under linear inequality restrictions for linear models. [33] introduces a locally most mean powerful (LMMP) test. [1] introduces directed tests, which is optimal in terms of power. [8] introduces multiple-endpoint testing in clinical trials. [24] provides Order-Restricted Score Tests for generalized linear and nonlinear mixed models. [2] proposes test when nuisance parameters appear under the alternative hypothesis, but not under the null. [49] gives improved LRT and UIT test. More recently, [42] has discussed halfline test for inequality constraints. [60]

gives conservative likelihood ratio test using data-dependent degree of freedom.

[76] gives Wald test under inequality constraint in linear model with spherically symmetric disturbance. [43] proposes an extended MaxT test and gets the power improvement. However, all these existing results are for low dimensional models.

In terms of statistical inference, our work is most related to [47], where the authors establish inference for high dimensional models using decorrelation method. We will review this method in Section 2. For constrained testing, our work is most related to [55] where the authors introduce and discuss Chi-bar-squared statistic, and [57] and [44] which form the one sided test to test whether a parameter is zero or strictly positive. Recent works [28, 63] consider hypothesis testing on whether the parameters lie in some convex cone. This is still different from our setting where we know the parameters lie in the convex cone and the goal is to test whether they lie on the boundary of the cone.

### 1.2 Organization of the Paper

In Section 2 we give the detailed procedure for our algorithm. Section 3 gives assumptions under which our method is valid, and states our main theorem. Sections 4 and 5 present experimental results on synthetic datasets and real world datasets, respectively. We conclude in Section 6.

## 2 Algorithm

In this section we describe our main algorithm. Consider a high dimensional statistical model with parameters and the partition , where is dimensional parameter of interest, and is a dimensional nuisance parameter with . We write , and the true parameter as with . Moreover, we have the constraint where is a closed and convex cone. Let be a linear space in . In most of the cases is an polyhedron and the linear space denotes the (subset of) the boundary set of . The hypothesis we want to test is

 H0:α∗∈MversusHA:α∗∈C∖M, (2)

i.e. we want to test whether lies on the boundary of , or is a strict interior point of in at least one direction. For example, with nonnegativity constraint we have and . The hypothesis we want to test is

 H0:α∗=0versusHA:∃j∈{1,...,d}s.t.α∗j>0. (3)

Another example is monotonic constraint where we have and . The hypothesis we want to test is

 H0:α∗1=α∗2=...=α∗dversusHA:∃j∈{1,...,d−1}s.t.α∗j<α∗j+1. (4)

Suppose we have independent trials where we allow for . Denote the sample negative log likelihood function as

 ℓ(β)=−1nn∑i=1logLi(β), (5)

where is the likelihood function for one trial . In low dimensions we can estimate the parameter by maximum likelihood estimation (MLE). However in high dimensions, MLE may not work. Instead we use the penalized estimator

 ˆβ=argminβ{ℓ(β)+Pλ(β)}, (6)

where is some penalty function with tuning parameter . Note that this estimation can be performed with or without the cone constraint . In Section 3 we will see that all we need is the consistency of this estimator.

Let be the gradient of the negative log likelihood function and , be the corresponding partitions. Similarly let be the sample Hessian matrix, and let , , and be the corresponding partitions. Let be the population Fisher information matrix. Denote and , , , as the corresponding partitions for .

The difficulty of the problem comes from two aspects: the problem is high dimensional, and that we have the constraint on . We first deal with the difficulty from high dimensions. It is well known that in low dimensions we can test for based on the partial score function

 S(α)=∇α(α,ˆθ(α)), (7)

where

is the partial maximum likelihood estimator. Under the null hypothesis we have

 √nS(0)d→N(0,H∗α|θ), (8)

where is the partial information matrix. We then reject the null when is relatively large. However, in high dimensions this method does not work. To overcome this issue, we follow the decorrelation procedure introduced in [15, 47] as described in Step 1 in Algorithm 1.

In Step 1.2, we want to get a linear combination of to best approximate

. The population version of this vector should be

 (17)

However, in high dimensions, we cannot directly estimate by the corresponding sample version since the problem is ill-conditioned. So instead we estimate by the Dantzig selector .

In Step 1.3 we get decorrelated score function which is approximately orthogonal to any component of the nuisance score function

. This is approximately an unbiased estimating equation for

so the root of this equation should give us an approximately unbiased estimator for . Since searching for the root may be computational intensive, we use one Newton step, as stated in (11).

With the decorrelated score function, the decorrelated estimator, and the decorrelated likelihood function, under mild conditions we will specify in Section 3, we have the following asymptotic distributions [47]:

 √nˆU(α∗)→N(0,H∗α|θ), (18) √n(˜α−α∗)→N(0,H∗−1α|θ), (19) 2n(ℓde(α∗)−ℓde(˜α))→χ2d, (20)

where , and in practice it can be estimated by

 ˆHα|θ=∇2ααℓ(ˆβ)−ˆW⊤∇2θαℓ(ˆβ). (21)

We then deal with the second difficulty: cone constraint. Since we already get asymptotic normality, we follow the procedure in [55] to construct the Score, Wald and likelihood ratio test statistics, as described in Step 2 in Algorithm 1.

This two-step procedure gives us the final test statistics , and

. In the next section we will show that under null hypothesis, all of them converge weakly to the weighted Chi-square distribution, and from which we can construct valid hypothesis test with asymptotic designed Type I error.

## 3 Theoretical result

In this section, we outline the main theoretical properties of our method. We start by providing high-level conditions in Section 3.1, and state our main theorem in Section 3.2 that the null distribution is a weighted Chi-square distribution. In Section 3.3 we describe the way to calculate the weights. We analyze the power of our method in Section 3.4 and the proof of the main theorem is given in Section 3.5.

### 3.1 Assumptions

In this section we provide high-level assumptions that allow us to establish properties of each step in our procedure.

#### Sparsity Condition:

Both and are sparse: . (We use a single for notational simplicity, but this is not required for our method to work).

#### Score Condition:

The expected value of the score function at true is 0:

 E(∇ℓ(β∗))=0. (22)

#### Sparse Eigenvalue Condition:

We have and for any with . Also both , , and are bounded element-wise, i.e., the maximum element is and each element has absolute value bounded by some constant .

Denote as the maximum absolute value of elements in , i.e., . By saying the maximum element of is , we are assuming .

#### Estimation Accuracy Condition:

The penalized estimator in (6) is a consistent estimator for the true :

 ∥ˆβ−β∗∥1=O(λs)% and∥ˆβ−β∗∥2=O(λ√s), (23)

where is the hyper-parameter in the penalty .

#### Smooth Hessian Condition:

The Hessian matrix is Lipschitz continuous:

 ∥∇2ℓ(β1)−∇2ℓ(β2)∥∞≤L⋅∥β1−β2∥1, (24)

for some constant .

The score condition holds for most of the log likelihood functions. In fact, let be the likelihood function and be the parameter, then under certain regularity conditions [51], we have

 Eddθlogf =Edfdθ⋅1f=∫dfdθ⋅1f⋅fdx=∫dfdθdx=ddθ∫fdx=ddθ1=0. (25)

The sparse eigenvalue (SE) condition can be replaced by restricted eigenvalue (RE) condition: let

, RE condition requires and for any in the cone for some . Both sparse eigenvalue condition and restricted eigenvalue condition are common in high dimensional statistical estimation literature, and are known to hold for a large number of models. See Remark A in the supplementary material for the proof.

The estimation condition is also common for penalized estimators. For example, [46]

shows that, if the sample loss function

(e.g. negative log likelihood function here) is convex, differentiable, and satisfies Restricted Strong Convexity:

 L(β∗+Δ)−L(β∗)−⟨∇L(β∗),Δ⟩≥κ∥Δ∥2 (26)

for certain , then for being penalty, with we have

 ∥ˆβ−β∗∥1=O(λs)% and∥ˆβ−β∗∥2=O(λ√s). (27)

The smooth Hessian condition is to make sure the Hessian matrix is well-behaved locally, hence to make sure the Dantzig selector is consistent. This condition is also known to hold for general models.

### 3.2 Main theorem

Before we proceed with our main theorem, we first introduce the following Lemma 3.2 which shows the asymptotic distribution of the decorrelated score function and decorrelated estimator constructed in Step 1 of Algorithm 1. It is in the same spirit as and corresponds to Theorem 4.4 and 4.7 in [15]. All the other related lemmas and proofs are provided in the supplementary material. For ease of presentation, in the following Lemma 3.2 we focus on the case where is a scalar. It is straightforward to generalize to the vector case.

Suppose all the conditions in Section 3.1 are satisfied. Let in Step 1.1, in Step 1.2, and , we have

 √nˆU(α∗)→N(0,H∗α|θ), (28) √n(˜α−α∗)→N(0,H∗−1α|θ), (29) ∣∣H∗α|θ−ˆHα|θ∣∣=o\PP(1) (30)

where and is estimated by the sample version

 ˆHα|θ=∇2ααℓ(ˆβ)−ˆw⊤∇2θαℓ(ˆβ). (31)
###### Proof.

The outline of the proof follows from [15]. We start from the proof of (28) for where by mean value theorem we have:

 ˆU(α∗) =∇αℓ(α∗,ˆθ)−ˆw⊤∇θℓ(α∗,ˆθ) (32) =∇αℓ(α∗,θ∗)+∇2αθℓ(α∗,¯θ)(ˆθ−θ∗)−[ˆw⊤∇θℓ(α∗,θ∗)+ˆw⊤∇2θθℓ(α∗,˜θ)(ˆθ−θ∗)] +[∇2αθℓ(α∗,˜θ)−ˆw⊤∇2θθℓ(α∗,¯θ)](ˆθ−θ∗)E3 =E1+E2+E3,

where for some .

We consider the three terms separately. For , by taking in Lemma A, under the null hypothesis we have

 √nE1→N(0,H∗α|θ). (33)

For , according to Hölder’s inequality, Lemma A, and Lemma A we have

 |E2|≤∥ˆw−w∗∥1⋅∥∇θℓ(0,θ∗)∥∞=OP(λ′s√logp/n)=O\PP(s3logp/n). (34)

For we have

 |E3| ≤|(ˆw−w∗)⊤∇2θθℓ(α∗,¯θ)(ˆθ−θ∗)|R1+∣∣[∇2αθℓ(α∗,˜θ)−w∗⊤∇2θθℓ(α∗,¯θ)](ˆθ−θ∗)∣∣ (35) ≤R1+∣∣[∇2αθℓ(α∗,˜θ)−H∗αθ](ˆθ−θ∗)∣∣R2+∣∣[H∗αθ−w∗⊤∇2θθℓ(α∗,¯θ)](ˆθ−θ∗)∣∣ ≤R1+R2+∣∣w∗⊤[H∗θθ−∇2θθℓ(α∗,¯θ)](ˆθ−θ∗)∣∣R3 ≤R1+R2+R3.

Considering the three terms and separately, according to Lemma A and Lemma A we have

 R1≤∥ˆw−w∗∥2∥∇θθℓ(α∗,¯θ)∥2∥ˆθ−θ∗∥2≤C√sλ′c√slogp/n=O\PP(s3logp/n), (36) R2≤∥∇2αθℓ(α∗,˜θ)−H∗αθ∥∞⋅∥ˆθ−θ∗∥1=O\PP(s2logp/n), (37) R3≤∥w∗∥1∥H∗θθ−∇2θθℓ(α∗,¯θ)∥∞∥ˆθ−θ∗∥1=O\PP(s3logp/n). (38)

Combining all these terms we show that (28) holds. We then turn to the proof of (30) that is an consistent estimator. By definition we have

 |ˆHα|θ−H∗α|θ| ≤|H∗αα−∇2ααℓ(ˆα,ˆθ)|T1+|H∗αθH∗−1θθH∗θα−ˆw⊤∇2θαℓ(ˆα,ˆθ)| (39) ≤T1+|(w∗−ˆw)⊤H∗θα|T2+|ˆw⊤(H∗θα−∇2θαℓ(ˆα,ˆθ))|T3 ≤T1+T2+T3.

Considering the terms and separately, according to Lemma A and Lemma A we have

 T1=O\PP(s√logp/n), (40) T2≤∥w∗−ˆw∥1⋅∥H∗θα∥∞=O\PP(s3√logp/n), (41) T3≤∥ˆw∥1⋅∥H∗θα−∇2θαℓ(ˆα,ˆθ)∥∞=O\PP(s2√logp/n). (42)

Combining the three terms we show that (30) holds. Finally we prove the result (29) for . By construction we have

 ˜α =ˆα−(∂ˆU(ˆα)∂α)−1⋅ˆU(ˆα)=ˆα−H∗−1α|θˆU(ˆα)+ˆU(ˆα)[H∗−1α|θ−(∂ˆU(ˆα)∂α)−1]S1 (43) =ˆα−H∗−1α|θ[ˆU(α∗)+(ˆα−α∗)⋅∂ˆU(˘α)∂α]+S1 =ˆα−H∗−1α|θˆU(α∗)−(ˆα−α∗)H∗−1α|θ⋅H∗α|θ+(ˆα−α∗)H∗−1α|θ[H∗α|θ−(∂ˆU(˘α)∂α)]S2+S1 =α∗−H∗−1α|θˆU(α∗)+S1+S2,

where for some . We consider the terms and separately. For we have

 |ˆU(α∗)−ˆU(ˆα)|≤|α∗−ˆα|⋅∣∣∂ˆU(˘α)∂α∣∣=O\PP(λ). (44)

Moreover, from the analysis of above we have that . We then obtain

 (45)

For we have that

 |S2|≤|ˆα−α∗|⋅H∗−1α|θ⋅∣∣H∗α|θ−(∂ˆU(˘α)∂α)∣∣≤O\PP(s3logp/n). (46)

Plugging in (45) and (46) into (43) we obtain

 √n(˜α−α∗)=−√nH∗−1α|θˆU(α∗)+o\PP(1). (47)

According to (28), this gives

 √n(˜α−α∗)→N(0,H∗−1α|θ), (48)

and our claim (29) holds. ∎

The stated sample complexity is for a general model. For specific models we may be able to get sharper results. For example for linear model and generalized linear model suffices [47].

In Lemma 3.2 we focus on the case where is a scalar. It is straightforward to generalize to the vector case. We are now almost ready for our main theorem. For any positive definite matrix , denote and as the inner product and the norm, respectively. For the linear space , the usual orthogonal complement of associated with is defined as

 M⊥V={y:⟨x,y⟩V=0for allx∈M}. (49)

For any positive definite matrix and convex cone , let and consider

 T0=y⊤V−1y−minη∈C(y−η)⊤V−1(y−η). (50)

It can be shown [55] that is distributed as a weighted mixture of Chi-squared distribution associated with and denoted as . That is

 Pr{T0≥c}=Pr{¯χ2(V,C)≥c}=m∑i=0wi(m,V,C)⋅Pr{χ2i≥c}, (51)

where

is a Chi-squared random variable with

degrees of freedom and is the point mass at 0. Here are non-negative weights satisfying . See Section 3.3 for details. We then have the following main theorem: Suppose the hypothesis we would like to test is versus where we have the constraint , and suppose all the conditions in Section 3.1 are satisfied. Then under the null hypothesis, the test statistics , and constructed in Step 2 satisfy

 Ts,Tw,TL→¯χ2(H∗α|θ,C∗), (52)

where . The proof of Theorem 52 is postponed to Section 3.5.

Our method is also valid for cones not centered at the origin, for example and . The two-step procedure is exactly the same as before. Under the null hypothesis , by removing from both and , we see that has the same distribution with the case and . This is also validated experimentally by the sum constraint in Section 4.

In this paper we focus on hypothesis on a low dimension parameter only. It is in fact straightforward to extend Theorem 52 to the whole parameter . However, as we will see in Section 3.3, the weights of the null distribution (52) usually lack closed form expression and can only be calculated using numerical methods in practice. When dimension of parameter of interest is large, this could be computationally intractable.

With this weighted Chi-square distribution under the null, we can build hypothesis test for with any designed Type I error. It remains to calculate the weights and the critical value in (51). We describe the calculation of the weights in the next section. The critical value can be calculated numerically as follow.

#### Critical value.

The final step is to calculate the critical value. Specifically, we want to find critical value such that

 Pr{¯χ2(ˆHα|θ,C∗)≥c}=m∑i=0wi(m,ˆHα|θ,C∗)⋅Pr{χ2i≥c}=γ, (53)

where is the designed Type I error. This can be solved numerically by binary search on .

Combining all these result we are able to build valid testing procedure for the original hypothesis (2) with asymptotic designed Type I error by calculating the quantile of the weighted Chi-squared distribution, and reject when , or is greater than this quantile.

### 3.3 Weights Calculation

According to Lemma 3.2, the covariance matrix can be consistently estimated by sample version (31). The cone depends on the constraint space , and . For example for non-negative constraint, we have and hence and ; for monotonic constraint, we have and hence . Since , we have

 C∗=C∩M⊥H∗α|θ={α:α1≤α2≤...≤αd,1⊤H∗α|θ⋅α=0}. (54)

The weights depend on and and can be complicated and without closed form expression. Here we briefly review the expression of general weights for some general dimension , covariance matrix , and cone obtained in [36]. We refer to [55] for more detailed formulas. We start from the simplest case where and . From (50) we have

 T0=m∑i=1max(yi,0)2∼¯χ2(I,\RRm+). (55)

We can see that the weight depends on the number of positive components of : if of them are positive then the distribution would be . There are in total choices of signs on each component of and therefore

 wi(m,I,\RRm+)=2−m(mi). (56)

We then consider with general where the weights are given by

 wi(m,V,\RRm+)\coloneqq∑|A|=i,A⊆[m]p{(VAc)−1}⋅p{VA;Ac}