Statistical estimation in high dimensional problems has been attracting increasing attention due to the abundance of such data in many emerging fields, such as genetic studies and social network analysis. High dimensional geometry is inherently different from low-dimensional geometry. As an example, for linear regression in low dimensions the Ordinary Least Squares (OLS) estimator allows us to construct confidence intervals and hypothesis tests for the true coefficients. In high dimensional models OLS is ill-posed, so we instead solve for penalized estimators such as the LASSO. In low dimensions we can test hypotheses such as $H_0: \theta_j = 0$ using the partial likelihood function, while in high dimensions this also fails, due to the large number of nuisance parameters.
In this paper we consider a hypothesis testing problem in a high dimensional model under a constrained parameter space. For many problems, we may already know some constraints on the parameters before analyzing the data and fitting models; this can also be viewed as prior information on the parameters. For example, in isotonic regression [6, 66, 11] the variables are constrained to be non-decreasing; in the non-negative least squares problem the coefficients are constrained to be non-negative; in real-world reinforcement learning applications we need to take into consideration the safety of the agent [5, 64, 72]; and in Gaussian processes it is sometimes assumed that the parameters satisfy some linear inequality constraints.
With this additional information, statistical inference and hypothesis testing for the parameters may change. For example, consider a simple Gaussian mean model with unknown mean $\theta$. In general, if we want to test whether $\theta$ is 0 or not, i.e., test $H_0: \theta = 0$ versus $H_1: \theta \neq 0$, we reject if the absolute value of the sample mean is relatively large. However, if we have the constraint $\theta \ge 0$, then we are testing $H_0: \theta = 0$ versus $H_1: \theta > 0$, and we reject only when the sample mean itself is relatively large.
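To make the gain from the constraint concrete, here is a minimal numerical sketch (the function name and the i.i.d. $N(\theta, 1)$ setup are illustrative, not from the paper): the one-sided test uses a smaller critical value, so it can reject in cases where the two-sided test cannot.

```python
import numpy as np
from scipy.stats import norm

def mean_test(x, alpha=0.05, one_sided=False):
    """Test H0: theta = 0 for the mean of i.i.d. N(theta, 1) samples.

    Two-sided: reject when |sqrt(n) * mean(x)| > z_{1 - alpha/2}.
    One-sided (valid under the constraint theta >= 0): reject when
    sqrt(n) * mean(x) > z_{1 - alpha}, a strictly smaller threshold.
    """
    z = np.sqrt(len(x)) * np.mean(x)
    if one_sided:
        return z > norm.ppf(1 - alpha)
    return abs(z) > norm.ppf(1 - alpha / 2)

# A sample mean of 0.25 with n = 50 gives z ~ 1.77: the two-sided test
# (threshold ~1.96) fails to reject, while the one-sided test
# (threshold ~1.64) rejects.
x = np.full(50, 0.25)
print(mean_test(x), mean_test(x, one_sided=True))
```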
When we have constraints on the parameters, a natural question is whether the parameter lies on the boundary of the constraint set or away from it, since these two cases are usually very different. For example, under a nonnegativity constraint we want to know whether the parameter is exactly zero or strictly positive; under a monotonicity constraint we want to know whether two coordinates are equal or one is strictly greater than the other.
In this paper we perform statistical inference (hypothesis testing) for low dimensional parameters in a high dimensional model under a cone constraint. Denote the parameter $\theta = (\theta_1, \theta_2)$, where $\theta_1$ is the low dimensional parameter of interest and $\theta_2$ denotes the nuisance parameters. Denote the constraint set $C$, a closed and convex cone, and let $L$ be a linear space contained in $C$. In most cases $C$ is a polyhedron and the linear space $L$ denotes (a subset of) the boundary of $C$. In this paper we want to test
$$H_0: \theta_1^* \in L \quad \text{versus} \quad H_1: \theta_1^* \in C \setminus L,$$
where we have the constraint $\theta_1^* \in C$. We develop an algorithm for this constrained testing problem in high dimensional models, and we show that the proposed hypothesis test asymptotically achieves the designed Type I error and has much greater power than tests that ignore the constraints.
1.1 Related Work
High-dimensional inference without constraint.
There is a vast literature on performing statistical inference for high dimensional models; here we provide a brief overview. Early work shows that the limiting distribution of the LASSO estimator is not normal even in low dimensions. More recently, several approaches have been proposed to obtain the asymptotic distribution of low dimensional parameters in a high dimensional linear model, mostly by approximating the inverse of the Gram matrix. One line of work gives confidence intervals for low dimensional parameters in a high dimensional linear model using the low dimensional projection estimator (LDPE); another provides asymptotic confidence intervals for the LASSO estimator in high dimensional linear regression by introducing the debiasing method, and further extends this to more general settings, including generalized linear models and other nonlinear models; yet another handles general models via a Dantzig-type estimator on the Hessian matrix. Related works also include [12, 74] for simultaneous inference, the double selection method, [52, 71, 70] for graphical models, [61, 67, 37] for post-selection inference, [39, 9] for synthetic control, and work on noisy labels.
Low-dimensional constrained inference.
The literature on constrained testing dates back to early work proving that the asymptotic distribution of the likelihood ratio (LR) test statistic for constrained testing is a weighted Chi-square. Later work considers testing with an unknown covariance matrix and gives sharp upper and lower bounds for the weights; introduces the likelihood ratio, Wald, and Kuhn-Tucker test statistics with inequality constraints in the linear model and proves the equivalence of these three tests; proposes a one-sided test when the coefficients' signs are known; introduces a modified Lagrange multiplier test for one-sided problems; proposes a Wald test for jointly testing equality and inequality constraints on the parameters; develops asymptotically equivalent tests under linear inequality restrictions for linear models; introduces a locally most mean powerful (LMMP) test; introduces directed tests, which are optimal in terms of power; introduces multiple-endpoint testing in clinical trials; provides order-restricted score tests for generalized linear and nonlinear mixed models; proposes a test when nuisance parameters appear under the alternative hypothesis but not under the null; and gives improved LRT and UIT tests. More recently, the halfline test for inequality constraints has been discussed; a conservative likelihood ratio test using a data-dependent degrees of freedom has been given; a Wald test under inequality constraints in the linear model with spherically symmetric disturbances has been given; and an extended MaxT test achieving a power improvement has been proposed. However, all these existing results are for low dimensional models.
In terms of statistical inference, our work is most related to the line of work establishing inference for high dimensional models using the decorrelation method; we review this method in Section 2. For constrained testing, our work is most related to the work introducing and discussing the Chi-bar-squared statistic, and to the works forming one-sided tests of whether a parameter is zero or strictly positive. Recent works [28, 63] consider hypothesis testing on whether the parameters lie in some convex cone. This is still different from our setting, where we know the parameters lie in the convex cone and the goal is to test whether they lie on the boundary of the cone.
1.2 Organization of the Paper
In this section we describe our main algorithm. Consider a high dimensional statistical model with parameter $\theta$ and the partition $\theta = (\theta_1, \theta_2)$, where $\theta_1$ is the $d_1$ dimensional parameter of interest and $\theta_2$ is the $d_2$ dimensional nuisance parameter, with $d = d_1 + d_2$. The true parameter is $\theta^* = (\theta_1^*, \theta_2^*)$. Moreover, we have the constraint $\theta_1^* \in C$, where $C$ is a closed and convex cone. Let $L$ be a linear space contained in $C$. In most cases $C$ is a polyhedron and the linear space $L$ denotes (a subset of) the boundary of $C$. The hypothesis we want to test is
$$H_0: \theta_1^* \in L \quad \text{versus} \quad H_1: \theta_1^* \in C \setminus L,$$
i.e., we want to test whether $\theta_1^*$ lies on the boundary of $C$, or is a strict interior point of $C$ in at least one direction. For example, with the nonnegativity constraint we have $C = \{\theta_1 : \theta_1 \ge 0\}$ and $L = \{0\}$. The hypothesis we want to test is $H_0: \theta_1^* = 0$ versus $H_1: \theta_1^* \ge 0,\ \theta_1^* \neq 0$.
Another example is the monotonicity constraint, where we have $C = \{\theta_1 : \theta_{11} \le \theta_{12}\}$ and $L = \{\theta_1 : \theta_{11} = \theta_{12}\}$. The hypothesis we want to test is $H_0: \theta_{11}^* = \theta_{12}^*$ versus $H_1: \theta_{11}^* < \theta_{12}^*$.
Suppose we have $n$ independent trials $X_1, \dots, X_n$, where we allow for $d \gg n$. Denote the sample negative log likelihood function as
$$\ell_n(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \log f(X_i; \theta),$$
where $f(X_i; \theta)$ is the likelihood function for one trial $X_i$. In low dimensions we can estimate the parameter by maximum likelihood estimation (MLE). However, in high dimensions the MLE may not work. Instead we use the penalized estimator
$$\hat\theta = \mathop{\arg\min}_{\theta}\; \ell_n(\theta) + P_\lambda(\theta),$$
where $P_\lambda(\cdot)$ is some penalty function with tuning parameter $\lambda$. Note that this estimation can be performed with or without the cone constraint $\theta_1 \in C$. In Section 3 we will see that all we need is the consistency of this estimator.
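As an illustration of this penalized estimation step, the following is a minimal sketch (not the paper's implementation): an $\ell_1$ penalty on a sparse linear model, fit by iterative soft-thresholding. All names are illustrative.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """L1-penalized least squares via iterative soft-thresholding (ISTA).

    Minimizes (1/2n) * ||y - X beta||_2^2 + lam * ||beta||_1, one common
    instance of the penalized estimator; any consistent penalized
    estimator would do for this step.
    """
    n, d = X.shape
    beta = np.zeros(d)
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)  # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n           # gradient of the smooth part
        z = beta - step * grad
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return beta

# sparse recovery with d >> effective sparsity, n = 100 < d = 200
rng = np.random.default_rng(1)
n, d = 100, 200
X = rng.normal(size=(n, d))
beta_true = np.zeros(d)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.1 * rng.normal(size=n)
beta_hat = lasso_ista(X, y, lam=0.1)
```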
Let $\nabla \ell_n(\theta)$ be the gradient of the negative log likelihood function, with $\nabla_1 \ell_n(\theta)$ and $\nabla_2 \ell_n(\theta)$ the partitions corresponding to $\theta_1$ and $\theta_2$. Similarly, let $\nabla^2 \ell_n(\theta)$ be the sample Hessian matrix, with $\nabla^2_{11} \ell_n$, $\nabla^2_{12} \ell_n$, $\nabla^2_{21} \ell_n$, and $\nabla^2_{22} \ell_n$ the corresponding partitions. Let $I^* = \mathbb{E}_{\theta^*}\big[\nabla^2 \ell(\theta^*)\big]$ be the population Fisher information matrix, and denote by $I^*_{11}$, $I^*_{12}$, $I^*_{21}$, and $I^*_{22}$ the corresponding partitions of $I^*$.
The difficulty of the problem comes from two aspects: the problem is high dimensional, and we have the constraint on $\theta_1$. We first deal with the high dimensionality. It is well known that in low dimensions we can test a hypothesis on $\theta_1$ based on the partial score function $\nabla_1 \ell_n\big(\theta_1, \hat\theta_2(\theta_1)\big)$, where $\hat\theta_2(\theta_1)$ is the partial maximum likelihood estimator of the nuisance parameter. Under the null hypothesis we have
$$\sqrt{n}\, \nabla_1 \ell_n\big(\theta_1^*, \hat\theta_2(\theta_1^*)\big) \rightsquigarrow N\big(0,\ I^*_{1|2}\big),$$
where $I^*_{1|2} = I^*_{11} - I^*_{12} (I^*_{22})^{-1} I^*_{21}$ is the partial information matrix. We then reject the null when the score is relatively large in magnitude. However, in high dimensions this method does not work. To overcome this issue, we follow the decorrelation procedure introduced in [15, 47], as described in Step 1 of Algorithm 1.
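The following toy computation sketches why decorrelation helps, under the assumption of a Gaussian linear model with covariates correlated across the parameter of interest and the nuisance block (all names are illustrative): an error in the nuisance estimate biases the naive partial score, but cancels, exactly by the normal equations, in the decorrelated score.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
x2 = rng.normal(size=(n, 3))                      # nuisance design
x1 = x2 @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=n)  # correlated with x2
beta1, beta2 = 0.0, np.array([1.0, 2.0, -1.0])
y = x1 * beta1 + x2 @ beta2 + rng.normal(size=n)

beta2_hat = beta2 + 0.3                           # deliberately biased nuisance fit
resid = y - x1 * beta1 - x2 @ beta2_hat
naive_score = np.mean(x1 * resid)                 # polluted by the nuisance error

w_hat = np.linalg.solve(x2.T @ x2, x2.T @ x1)     # regress x1 on the nuisance design
decorr_score = np.mean((x1 - x2 @ w_hat) * resid) # nuisance error cancels exactly
```

Here `naive_score` concentrates around a nonzero value driven by the error in `beta2_hat`, while `decorr_score` stays near zero at the $1/\sqrt{n}$ scale.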
Get the penalized estimator $\hat\theta$ using (6) for some tuning parameter $\lambda$.
For each $j = 1, \dots, d_1$, estimate $w^*_j$ by the following Dantzig selector:
$$\hat{w}_j = \mathop{\arg\min}_{w \in \mathbb{R}^{d_2}} \|w\|_1 \quad \text{subject to} \quad \big\| \nabla^2_{2 j} \ell_n(\hat\theta) - \nabla^2_{22} \ell_n(\hat\theta)\, w \big\|_\infty \le \lambda',$$
where $\lambda'$ is a hyper-parameter whose choice we describe later. Combine them to get the matrix $\hat{W}$, i.e., $\hat{W} = (\hat{w}_1, \dots, \hat{w}_{d_1})^\top$.
Define the decorrelated score function:
$$S(\theta_1, \theta_2) = \nabla_1 \ell_n(\theta_1, \theta_2) - \hat{W}\, \nabla_2 \ell_n(\theta_1, \theta_2).$$
Define the decorrelated estimator via one Newton step:
$$\tilde\theta_1 = \hat\theta_1 - \hat{I}_{1|2}^{-1}\, S(\hat\theta_1, \hat\theta_2), \quad \text{where} \quad \hat{I}_{1|2} = \nabla^2_{11} \ell_n(\hat\theta) - \hat{W}\, \nabla^2_{21} \ell_n(\hat\theta).$$
Define the decorrelated likelihood function:
Get one-sided Wald test statistic
Get one-sided Likelihood ratio test statistic
Get one-sided Score test statistic
In Step 1.2, we want to find the linear combination of the nuisance score $\nabla_2 \ell_n$ that best approximates $\nabla_1 \ell_n$. The population version of this vector is
$$w^* = (I^*_{22})^{-1} I^*_{21}.$$
However, in high dimensions we cannot directly estimate $w^*$ by the corresponding sample version, since the problem is ill-conditioned. So instead we estimate $w^*$ by the Dantzig selector of Step 1.2.
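For a single column, a Dantzig selector of this form can be cast as a linear program. Below is a minimal sketch under assumed argument names (`H22` for the nuisance Hessian block, `h` for the cross term), using `scipy`; it is not the paper's implementation.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(H22, h, lam):
    """Solve  min ||w||_1  subject to  ||h - H22 @ w||_inf <= lam.

    Split w = u - v with u, v >= 0 so the objective is linear; the
    sup-norm constraint becomes the pair of linear inequalities
    H22 @ w <= h + lam  and  -H22 @ w <= lam - h.
    """
    d = len(h)
    c = np.ones(2 * d)                             # objective: sum(u) + sum(v)
    A = np.block([[H22, -H22], [-H22, H22]])
    b = np.concatenate([h + lam, lam - h])
    res = linprog(c, A_ub=A, b_ub=b, bounds=(0, None), method="highs")
    u, v = res.x[:d], res.x[d:]
    return u - v

# with H22 = I the constraint is |h - w| <= lam componentwise,
# so the minimal-L1 solution shrinks h toward zero by lam
w = dantzig_selector(np.eye(3), np.array([1.0, 0.0, 0.0]), lam=0.2)
# -> approximately [0.8, 0.0, 0.0]
```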
In Step 1.3 we obtain the decorrelated score function, which is approximately orthogonal to every component of the nuisance score function $\nabla_2 \ell_n$. This gives an approximately unbiased estimating equation for $\theta_1$, so its root should give an approximately unbiased estimator of $\theta_1$. Since searching for the root may be computationally intensive, we use one Newton step instead, as stated in (11).
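The one-step update itself is a single line. The sketch below (with hypothetical names) also illustrates that for a quadratic loss, one Newton step already lands on the exact root of the estimating equation:

```python
def one_step_estimator(theta1_hat, score, info):
    """One Newton step along the decorrelated score.

    theta1_hat : initial penalized estimate of the scalar parameter
    score      : decorrelated score evaluated at the initial estimate
    info       : estimated partial information I_{1|2}

    Rather than solving S(theta1) = 0 exactly, the single update
    theta1_hat - score / info suffices for first-order inference.
    """
    return theta1_hat - score / info

# for the quadratic loss 0.5 * info * (theta - 2)^2 the score at theta
# is info * (theta - 2), so one Newton step recovers the root exactly:
theta = one_step_estimator(1.5, 3.0 * (1.5 - 2.0), 3.0)   # -> 2.0
```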
With the decorrelated score function, the decorrelated estimator, and the decorrelated likelihood function, under mild conditions that we specify in Section 3, we have the following asymptotic distributions:
$$\sqrt{n}\, S(\theta_1^*, \hat\theta_2) \rightsquigarrow N\big(0,\ I^*_{1|2}\big), \qquad \sqrt{n}\, \big(\tilde\theta_1 - \theta_1^*\big) \rightsquigarrow N\big(0,\ (I^*_{1|2})^{-1}\big),$$
where $I^*_{1|2} = I^*_{11} - I^*_{12} (I^*_{22})^{-1} I^*_{21}$, and in practice it can be estimated by the sample plug-in $\hat{I}_{1|2} = \nabla^2_{11} \ell_n(\hat\theta) - \hat{W}\, \nabla^2_{21} \ell_n(\hat\theta)$.
We then deal with the second difficulty: the cone constraint. Since we have already obtained asymptotic normality, we follow the procedure in the constrained testing literature to construct the score, Wald, and likelihood ratio test statistics, as described in Step 2 of Algorithm 1.
This two-step procedure gives us the final Wald, likelihood ratio, and score test statistics. In the next section we show that, under the null hypothesis, all of them converge weakly to a weighted Chi-square distribution, from which we can construct a valid hypothesis test with the designed asymptotic Type I error.
3 Theoretical Results
In this section, we outline the main theoretical properties of our method. We start by providing high-level conditions in Section 3.1, and state our main theorem in Section 3.2 that the null distribution is a weighted Chi-square distribution. In Section 3.3 we describe the way to calculate the weights. We analyze the power of our method in Section 3.4 and the proof of the main theorem is given in Section 3.5.
In this section we provide high-level assumptions that allow us to establish properties of each step in our procedure.
Both $\theta^*$ and $w^*$ are sparse: $\|\theta^*\|_0 \le s$ and $\|w^*\|_0 \le s$. (We use a single $s$ for notational simplicity, but this is not required for our method to work.)
The expected value of the score function at the true $\theta^*$ is 0:
$$\mathbb{E}_{\theta^*}\big[\nabla \ell(\theta^*)\big] = 0.$$
Sparse Eigenvalue Condition:
We have $0 < \kappa_* \le v^\top I^* v \le \kappa^* < \infty$ for any $v$ with $\|v\|_2 = 1$ and $\|v\|_0 \le 2s$. Also $I^*_{12}$, $(I^*_{22})^{-1}$, and $w^*$ are bounded element-wise, i.e., each element has absolute value bounded by some constant $M$.
Denote $\|A\|_{\max}$ as the maximum absolute value of the elements of $A$, i.e., $\|A\|_{\max} = \max_{i,j} |A_{ij}|$. By saying $A$ is bounded element-wise, we mean $\|A\|_{\max} \le M$.
Estimation Accuracy Condition:
The penalized estimator in (6) is a consistent estimator of the true $\theta^*$:
$$\|\hat\theta - \theta^*\|_1 = O_P(s \lambda), \qquad \|\hat\theta - \theta^*\|_2 = O_P(\sqrt{s}\, \lambda),$$
where $\lambda$ is the hyper-parameter in the penalty $P_\lambda(\cdot)$.
Smooth Hessian Condition:
The Hessian matrix is Lipschitz continuous:
$$\big\| \nabla^2 \ell_n(\theta) - \nabla^2 \ell_n(\theta') \big\|_{\max} \le L_0\, \|\theta - \theta'\|_2$$
for some constant $L_0$.
The score condition holds for most log likelihood functions. In fact, let $f(x; \theta)$ be the likelihood function of one observation and $\theta$ be the parameter; then, under regularity conditions that allow exchanging differentiation and integration, we have
$$\mathbb{E}_{\theta}\big[\nabla \log f(X; \theta)\big] = \int \frac{\nabla f(x; \theta)}{f(x; \theta)}\, f(x; \theta)\, dx = \nabla \int f(x; \theta)\, dx = 0.$$
The sparse eigenvalue (SE) condition can be replaced by a restricted eigenvalue (RE) condition: letting $S$ denote the support of $\theta^*$, the RE condition requires $v^\top I^* v \ge \kappa \|v\|_2^2$ for some $\kappa > 0$ and any $v$ in the cone $\{v : \|v_{S^c}\|_1 \le 3 \|v_S\|_1\}$. Both the sparse eigenvalue condition and the restricted eigenvalue condition are common in the high dimensional statistical estimation literature, and are known to hold for a large number of models. See Remark A in the supplementary material for the proof.
The estimation accuracy condition is also common for penalized estimators. For example, it is known that if the sample loss function (e.g., the negative log likelihood here) is convex, differentiable, and satisfies restricted strong convexity,
$$\ell_n(\theta^* + \Delta) - \ell_n(\theta^*) - \langle \nabla \ell_n(\theta^*), \Delta \rangle \ge \kappa \|\Delta\|_2^2$$
for the relevant perturbations $\Delta$, then for $P_\lambda$ being the $\ell_1$ penalty with $\lambda \ge 2 \|\nabla \ell_n(\theta^*)\|_\infty$ we have
$$\|\hat\theta - \theta^*\|_2 = O_P\big(\sqrt{s}\, \lambda\big).$$
The smooth Hessian condition ensures that the Hessian matrix is well-behaved locally, and hence that the Dantzig selector is consistent. This condition is also known to hold for general models.
3.2 Main theorem
Before we proceed to our main theorem, we first introduce the following Lemma 3.2, which gives the asymptotic distribution of the decorrelated score function and the decorrelated estimator constructed in Step 1 of Algorithm 1; it corresponds to Theorems 4.4 and 4.7 in prior work. All other related lemmas and proofs are provided in the supplementary material. For ease of presentation, in the following Lemma 3.2 we focus on the case where $\theta_1$ is a scalar; it is straightforward to generalize to the vector case.
Suppose all the conditions in Section 3.1 are satisfied. Let $\lambda \asymp \sqrt{\log d / n}$ in Step 1.1 and $\lambda' \asymp \sqrt{\log d / n}$ in Step 1.2; then we have
$$\sqrt{n}\, \hat{I}_{1|2}^{-1/2}\, S(\theta_1^*, \hat\theta_2) \rightsquigarrow N(0, 1), \qquad \sqrt{n}\, \hat{I}_{1|2}^{1/2}\, \big(\tilde\theta_1 - \theta_1^*\big) \rightsquigarrow N(0, 1),$$
where $I^*_{1|2}$ is estimated by the sample version $\hat{I}_{1|2} = \nabla^2_{11} \ell_n(\hat\theta) - \hat{W}\, \nabla^2_{21} \ell_n(\hat\theta)$.
We consider the three terms separately. For , by taking in Lemma A, under the null hypothesis we have
For we have
where for some . We consider the terms and separately. For we have
Moreover, from the analysis of above we have that . We then obtain
For we have that
The stated sample complexity is for a general model. For specific models we may be able to obtain sharper results; for example, for the linear model and generalized linear models a weaker sample complexity suffices.
We are now almost ready for our main theorem. For any positive definite matrix $V$, denote $\langle u, v \rangle_V = u^\top V v$ and $\|u\|_V = \sqrt{u^\top V u}$ as the corresponding inner product and norm, respectively. For the linear space $L$, the orthogonal complement of $L$ associated with $V$ is defined as
$$L^{\perp_V} = \{u : \langle u, v \rangle_V = 0 \ \text{for all}\ v \in L\}.$$
For any positive definite matrix $V$ and closed convex cone $K$, let $z \sim N(0, V)$ and consider
$$\bar\chi^2(V, K) = \|z\|_{V^{-1}}^2 - \min_{u \in K} \|z - u\|_{V^{-1}}^2.$$
It can be shown that $\bar\chi^2(V, K)$ is distributed as a weighted mixture of Chi-squared distributions associated with $V$ and $K$. That is,
$$P\big(\bar\chi^2(V, K) \ge t\big) = \sum_{i} w_i\, P\big(\chi^2_i \ge t\big),$$
where $\chi^2_i$ is a Chi-squared random variable with $i$ degrees of freedom and $\chi^2_0$ is the point mass at 0. Here the $w_i$ are non-negative weights satisfying $\sum_i w_i = 1$; see Section 3.3 for details. We then have the following main theorem: Suppose the hypothesis we would like to test is $H_0: \theta_1^* \in L$ versus $H_1: \theta_1^* \in C \setminus L$, where we have the constraint $\theta_1^* \in C$, and suppose all the conditions in Section 3.1 are satisfied. Then under the null hypothesis, the Wald, likelihood ratio, and score test statistics constructed in Step 2 all converge weakly to the weighted Chi-square distribution above.
Our method is also valid for cones not centered at the origin, i.e., shifted cones $\theta_0 + C$. The two-step procedure is exactly the same as before. Under the null hypothesis, by removing the shift $\theta_0$ from both $C$ and $L$, the test statistic has the same distribution as in the centered case. This is also validated experimentally by the sum constraint in Section 4.
In this paper we focus on hypotheses on a low dimensional parameter only. It is in fact straightforward to extend the main theorem to the whole parameter $\theta$. However, as we will see in Section 3.3, the weights of the null distribution usually lack a closed form expression and can only be calculated using numerical methods in practice. When the dimension of the parameter of interest is large, this could be computationally intractable.
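For intuition on such numerical methods, the weights (and hence the null distribution) can always be approximated by Monte Carlo. The sketch below is illustrative, for the simplest case of the nonnegative orthant and identity covariance, where the mixture has the known binomial-weight form; it checks the simulated tail against that mixture.

```python
import numpy as np
from scipy.stats import binom, chi2

def chi_bar_sq_tail_mc(p, t, n_draws=200_000, seed=0):
    """Monte Carlo tail probability of the chi-bar-squared distribution
    for the nonnegative orthant and identity covariance: for z ~ N(0, I_p),
    ||z||^2 - min_{u >= 0} ||z - u||^2 = sum_i max(z_i, 0)^2,
    since projecting onto the orthant just clips z at zero."""
    z = np.random.default_rng(seed).normal(size=(n_draws, p))
    samples = np.sum(np.maximum(z, 0.0) ** 2, axis=1)
    return float(np.mean(samples > t))

def chi_bar_sq_tail(p, t):
    """Same tail via the mixture: weight C(p, i) / 2^p on chi^2 with i
    degrees of freedom; the i = 0 term is a point mass at 0 and
    contributes nothing for t >= 0."""
    w = binom.pmf(np.arange(p + 1), p, 0.5)
    return float(sum(w[i] * chi2.sf(t, i) for i in range(1, p + 1)))
```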
With this weighted Chi-square distribution under the null, we can build a hypothesis test for $H_0$ with any designed Type I error. It remains to calculate the weights $w_i$ and the critical value. We describe the calculation of the weights in the next section. For the critical value, we want to find $c_\alpha$ such that
$$\sum_{i} w_i\, P\big(\chi^2_i > c_\alpha\big) = \alpha,$$
where $\alpha$ is the designed Type I error. Since the left-hand side is monotone decreasing in $c_\alpha$, this can be solved numerically by binary search on $c_\alpha$.
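A sketch of this binary search, assuming the weights have already been computed (`scipy` supplies the Chi-square tail):

```python
from scipy.stats import chi2

def critical_value(weights, alpha, tol=1e-8):
    """Find c with sum_i weights[i] * P(chi^2_i > c) = alpha by bisection.

    weights[i] is the mixture weight on chi^2 with i degrees of freedom;
    weights[0], the point mass at 0, never exceeds any c > 0, so it is
    skipped. The tail is monotone decreasing in c, so bisection converges.
    """
    def tail(c):
        return sum(w * chi2.sf(c, i) for i, w in enumerate(weights) if i > 0)
    lo, hi = 0.0, 1.0
    while tail(hi) > alpha:        # grow the upper bracket until it crosses
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if tail(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For the halfline weights $(1/2, 1/2)$ and $\alpha = 0.05$, this reproduces the classical one-sided cutoff given by the upper $0.10$ quantile of $\chi^2_1$.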
3.3 Weights Calculation
According to Lemma 3.2, the covariance matrix $I^*_{1|2}$ can be consistently estimated by its sample version (31). The cone in the null distribution depends on the constraint space $C$ and the linear space $L$. For example, for the non-negative constraint we have $C = \{\theta_1 : \theta_1 \ge 0\}$ and $L = \{0\}$; for the monotonic constraint we have $C = \{\theta_1 : \theta_{11} \le \theta_{12}\}$ and $L = \{\theta_1 : \theta_{11} = \theta_{12}\}$. The weights of the null distribution are then determined by $I^*_{1|2}$, $C$, and $L$.
The weights depend on $V$ and $K$, can be complicated, and in general have no closed form expression. Here we briefly review the expression of the weights for a general dimension $p$, covariance matrix $V$, and cone $K$; we refer to the literature for more detailed formulas. We start from the simplest case where $K = \{u \in \mathbb{R}^p : u \ge 0\}$ is the nonnegative orthant and $V = I_p$. From (50), the distribution is determined by the projection of $z \sim N(0, I_p)$ onto $K$, which keeps the positive components of $z$ and zeroes out the rest.
We can see that the weight depends on the number of positive components of $z$: if $i$ of them are positive, then the conditional distribution is $\chi^2_i$. There are in total $2^p$ equally likely choices of signs for the components of $z$, and therefore
$$w_i = \binom{p}{i} 2^{-p}.$$
We then consider the case with a general covariance matrix $V$, where the weights are given by