1 Introduction
Statistical estimation of high dimensional problems has been attracting more and more attention due to the abundance of such data in many emerging fields such as genetic studies, social network analysis, etc. High dimensional geometry is inherently different from lowdimensional geometry. As an example, for linear regression, in low dimensions the Ordinary Least Square (OLS) estimator allows for constructing confidence intervals and hypothesis tests for the true coefficient. In high dimensional models OLS is illconditioned so instead we have to solve for penalized estimators like LASSO. In low dimensions we can test for hypothesis such as
by partial likelihood function while in high dimensions this also fails, due to the large amount of nuisance parameters.In this paper we consider a hypothesis testing problem in a high dimensional model under constrained parameter space. For many problems, before analyzing data and fitting models we might already know some constraints on the parameters. This can also be viewed as prior information on the parameters. For example in isotonic regression [6, 66, 11] we have a constraint that the variables are nondecreasing; in nonnegative least square problem [58]
we have a constraint that the coefficients are nonnegative. in realworld reinforcement learning applications, we need to take into consideration the safety of the agent
[5, 64, 72]. Also, in Gaussian process, it is sometimes assumed that the parameters satisfy some linear inequality constraints [40].With this additional information the statistical inference and hypothesis testing for the parameters may be different. For example, consider a simple model: . In general if we want to test whether is 0 or not, i.e. test for versus , we will reject if the absolute value of the mean is relatively large. However, if we have the constraint that , then we are testing versus , and we reject only when is relatively large.
When we have constraints on parameters, a natural question we want to answer is whether the parameter lies on the boundary or is away from the boundary, since these two cases are usually very different. For example for nonnegativity constraint, we want to know whether the parameter is exactly zero or strictly positive; for monotonic constraint we want to know whether the two variables are equal or one is strictly greater than the other.
In this paper we perform statistical inference (hypothesis testing) for low dimensional parameters in a high dimensional model under cone constraint. Denote the parameter , where is the low dimensional parameter of interest and denotes nuisance parameters. Denote the constraint set as a closed and convex cone, and let be a linear space in . In most of the cases is a polyhedron and the linear space denotes the (subset of) the boundary set of . In this paper we want to test
(1) 
where we have the constraint . We develop an algorithm for this constrained testing problem in high dimensional models. Following our procedure we show that the hypothesis test method we propose has asymptotic designed Type I error, and it has much greater power than when the constraints are ignored.
1.1 Related Work
Highdimensional inference without constraint.
There is a vast literature on performing statistical inference for high dimensional models and here we provide a brief overview. Early work [34] shows that the limiting distribution of LASSO estimator is not normal even in low dimensions. More recently, several approaches have been proposed to obtain asymptotic distribution on low dimensional parameters in high dimensional linear model, mostly by approximating the inverse of the Gram matrix. [73] gives confidence intervals for low dimensional parameters in a high dimensional linear model using low dimensional projection estimator (LDPE). [29] provides asymptotic confidence interval of LASSO estimator for high dimensional linear regression by introducing the debiasing method. [62] further extends their work to a more general setting, including Generalized Linear Model and other nonlinear models. [47] deals with general model on Hessian matrix with Dantzig type estimator. Related works also include [12, 74] for simultaneous inference, [4] for double selection method, [52, 71, 70] for graphical model, [61, 67, 37] for post selective inference, [39, 9] for for synthetic control, [59] for noisy labels, etc.
Lowdimensional constrained inference.
The literature on constrained testing dates back to [10]
, where the authors prove the asymptotic distribution of the likelihood ratio (LR) test statistic for constrained testing to be weighted Chisquare.
[48] further considers testing with unknown covariance matrix, and gives sharp upper and lower bounds for the weights. [23] introduces the test statistics for likelihood ratio test, Wald test, and KuhnTucker test with inequality constraint in linear model, and proves the equivalence of these three tests. [32] proposes onesided test when the coefficients’ signs are known. [54] introduces a modified Lagrange multiplier test for testing onesided problem. [35] proposes Wald test for jointly testing equality and inequality constraints on the parameters. [65] develops asymptotically equivalent tests under linear inequality restrictions for linear models. [33] introduces a locally most mean powerful (LMMP) test. [1] introduces directed tests, which is optimal in terms of power. [8] introduces multipleendpoint testing in clinical trials. [24] provides OrderRestricted Score Tests for generalized linear and nonlinear mixed models. [2] proposes test when nuisance parameters appear under the alternative hypothesis, but not under the null. [49] gives improved LRT and UIT test. More recently, [42] has discussed halfline test for inequality constraints. [60]gives conservative likelihood ratio test using datadependent degree of freedom.
[76] gives Wald test under inequality constraint in linear model with spherically symmetric disturbance. [43] proposes an extended MaxT test and gets the power improvement. However, all these existing results are for low dimensional models.In terms of statistical inference, our work is most related to [47], where the authors establish inference for high dimensional models using decorrelation method. We will review this method in Section 2. For constrained testing, our work is most related to [55] where the authors introduce and discuss Chibarsquared statistic, and [57] and [44] which form the one sided test to test whether a parameter is zero or strictly positive. Recent works [28, 63] consider hypothesis testing on whether the parameters lie in some convex cone. This is still different from our setting where we know the parameters lie in the convex cone and the goal is to test whether they lie on the boundary of the cone.
1.2 Organization of the Paper
2 Algorithm
In this section we describe our main algorithm. Consider a high dimensional statistical model with parameters and the partition , where is dimensional parameter of interest, and is a dimensional nuisance parameter with . We write , and the true parameter as with . Moreover, we have the constraint where is a closed and convex cone. Let be a linear space in . In most of the cases is an polyhedron and the linear space denotes the (subset of) the boundary set of . The hypothesis we want to test is
(2) 
i.e. we want to test whether lies on the boundary of , or is a strict interior point of in at least one direction. For example, with nonnegativity constraint we have and . The hypothesis we want to test is
(3) 
Another example is monotonic constraint where we have and . The hypothesis we want to test is
(4) 
Suppose we have independent trials where we allow for . Denote the sample negative log likelihood function as
(5) 
where is the likelihood function for one trial . In low dimensions we can estimate the parameter by maximum likelihood estimation (MLE). However in high dimensions, MLE may not work. Instead we use the penalized estimator
(6) 
where is some penalty function with tuning parameter . Note that this estimation can be performed with or without the cone constraint . In Section 3 we will see that all we need is the consistency of this estimator.
Let be the gradient of the negative log likelihood function and , be the corresponding partitions. Similarly let be the sample Hessian matrix, and let , , and be the corresponding partitions. Let be the population Fisher information matrix. Denote and , , , as the corresponding partitions for .
The difficulty of the problem comes from two aspects: the problem is high dimensional, and that we have the constraint on . We first deal with the difficulty from high dimensions. It is well known that in low dimensions we can test for based on the partial score function
(7) 
where
is the partial maximum likelihood estimator. Under the null hypothesis we have
(8) 
where is the partial information matrix. We then reject the null when is relatively large. However, in high dimensions this method does not work. To overcome this issue, we follow the decorrelation procedure introduced in [15, 47] as described in Step 1 in Algorithm 1.
Step 1

Get penalized estimator using (6) for some tuning parameter .

For each , estimate by the following Dantzig selector
(9) where is a hyperparameter which we describe how to choose later. Combine them to get matrix , i.e., .

Define the decorrelated score function:
(10) 
Define the decorrelated estimator:
(11) 
Define the decorrelated likelihood function:
(12)
Step 2

Get onesided Wald test statistic
(13) 
Get onesided Likelihood ratio test statistic
(14) 
Get onesided Score test statistic
(15) where
(16)
In Step 1.2, we want to get a linear combination of to best approximate
. The population version of this vector should be
(17) 
However, in high dimensions, we cannot directly estimate by the corresponding sample version since the problem is illconditioned. So instead we estimate by the Dantzig selector .
In Step 1.3 we get decorrelated score function which is approximately orthogonal to any component of the nuisance score function
. This is approximately an unbiased estimating equation for
so the root of this equation should give us an approximately unbiased estimator for . Since searching for the root may be computational intensive, we use one Newton step, as stated in (11).With the decorrelated score function, the decorrelated estimator, and the decorrelated likelihood function, under mild conditions we will specify in Section 3, we have the following asymptotic distributions [47]:
(18)  
(19)  
(20) 
where , and in practice it can be estimated by
(21) 
We then deal with the second difficulty: cone constraint. Since we already get asymptotic normality, we follow the procedure in [55] to construct the Score, Wald and likelihood ratio test statistics, as described in Step 2 in Algorithm 1.
This twostep procedure gives us the final test statistics , and
. In the next section we will show that under null hypothesis, all of them converge weakly to the weighted Chisquare distribution, and from which we can construct valid hypothesis test with asymptotic designed Type I error.
3 Theoretical result
In this section, we outline the main theoretical properties of our method. We start by providing highlevel conditions in Section 3.1, and state our main theorem in Section 3.2 that the null distribution is a weighted Chisquare distribution. In Section 3.3 we describe the way to calculate the weights. We analyze the power of our method in Section 3.4 and the proof of the main theorem is given in Section 3.5.
3.1 Assumptions
In this section we provide highlevel assumptions that allow us to establish properties of each step in our procedure.
Sparsity Condition:
Both and are sparse: . (We use a single for notational simplicity, but this is not required for our method to work).
Score Condition:
The expected value of the score function at true is 0:
(22) 
Sparse Eigenvalue Condition:
We have and for any with . Also both , , and are bounded elementwise, i.e., the maximum element is and each element has absolute value bounded by some constant .
Denote as the maximum absolute value of elements in , i.e., . By saying the maximum element of is , we are assuming .
Estimation Accuracy Condition:
The penalized estimator in (6) is a consistent estimator for the true :
(23) 
where is the hyperparameter in the penalty .
Smooth Hessian Condition:
The Hessian matrix is Lipschitz continuous:
(24) 
for some constant .
The score condition holds for most of the log likelihood functions. In fact, let be the likelihood function and be the parameter, then under certain regularity conditions [51], we have
(25) 
The sparse eigenvalue (SE) condition can be replaced by restricted eigenvalue (RE) condition: let
, RE condition requires and for any in the cone for some . Both sparse eigenvalue condition and restricted eigenvalue condition are common in high dimensional statistical estimation literature, and are known to hold for a large number of models. See Remark A in the supplementary material for the proof.The estimation condition is also common for penalized estimators. For example, [46]
shows that, if the sample loss function
(e.g. negative log likelihood function here) is convex, differentiable, and satisfies Restricted Strong Convexity:(26) 
for certain , then for being penalty, with we have
(27) 
The smooth Hessian condition is to make sure the Hessian matrix is wellbehaved locally, hence to make sure the Dantzig selector is consistent. This condition is also known to hold for general models.
3.2 Main theorem
Before we proceed with our main theorem, we first introduce the following Lemma 3.2 which shows the asymptotic distribution of the decorrelated score function and decorrelated estimator constructed in Step 1 of Algorithm 1. It is in the same spirit as and corresponds to Theorem 4.4 and 4.7 in [15]. All the other related lemmas and proofs are provided in the supplementary material. For ease of presentation, in the following Lemma 3.2 we focus on the case where is a scalar. It is straightforward to generalize to the vector case.
Suppose all the conditions in Section 3.1 are satisfied. Let in Step 1.1, in Step 1.2, and , we have
(28)  
(29)  
(30) 
where and is estimated by the sample version
(31) 
Proof.
The outline of the proof follows from [15]. We start from the proof of (28) for where by mean value theorem we have:
(32)  
where for some .
We consider the three terms separately. For , by taking in Lemma A, under the null hypothesis we have
(33) 
For , according to Hölder’s inequality, Lemma A, and Lemma A we have
(34) 
For we have
(35)  
Combining all these terms we show that (28) holds. We then turn to the proof of (30) that is an consistent estimator. By definition we have
(39)  
Combining the three terms we show that (30) holds. Finally we prove the result (29) for . By construction we have
(43)  
where for some . We consider the terms and separately. For we have
(44) 
Moreover, from the analysis of above we have that . We then obtain
(45) 
For we have that
(46) 
The stated sample complexity is for a general model. For specific models we may be able to get sharper results. For example for linear model and generalized linear model suffices [47].
In Lemma 3.2 we focus on the case where is a scalar. It is straightforward to generalize to the vector case. We are now almost ready for our main theorem. For any positive definite matrix , denote and as the inner product and the norm, respectively. For the linear space , the usual orthogonal complement of associated with is defined as
(49) 
For any positive definite matrix and convex cone , let and consider
(50) 
It can be shown [55] that is distributed as a weighted mixture of Chisquared distribution associated with and denoted as . That is
(51) 
where
is a Chisquared random variable with
degrees of freedom and is the point mass at 0. Here are nonnegative weights satisfying . See Section 3.3 for details. We then have the following main theorem: Suppose the hypothesis we would like to test is versus where we have the constraint , and suppose all the conditions in Section 3.1 are satisfied. Then under the null hypothesis, the test statistics , and constructed in Step 2 satisfy(52) 
where . The proof of Theorem 52 is postponed to Section 3.5.
Our method is also valid for cones not centered at the origin, for example and . The twostep procedure is exactly the same as before. Under the null hypothesis , by removing from both and , we see that has the same distribution with the case and . This is also validated experimentally by the sum constraint in Section 4.
In this paper we focus on hypothesis on a low dimension parameter only. It is in fact straightforward to extend Theorem 52 to the whole parameter . However, as we will see in Section 3.3, the weights of the null distribution (52) usually lack closed form expression and can only be calculated using numerical methods in practice. When dimension of parameter of interest is large, this could be computationally intractable.
With this weighted Chisquare distribution under the null, we can build hypothesis test for with any designed Type I error. It remains to calculate the weights and the critical value in (51). We describe the calculation of the weights in the next section. The critical value can be calculated numerically as follow.
Critical value.
The final step is to calculate the critical value. Specifically, we want to find critical value such that
(53) 
where is the designed Type I error. This can be solved numerically by binary search on .
3.3 Weights Calculation
According to Lemma 3.2, the covariance matrix can be consistently estimated by sample version (31). The cone depends on the constraint space , and . For example for nonnegative constraint, we have and hence and ; for monotonic constraint, we have and hence . Since , we have
(54) 
The weights depend on and and can be complicated and without closed form expression. Here we briefly review the expression of general weights for some general dimension , covariance matrix , and cone obtained in [36]. We refer to [55] for more detailed formulas. We start from the simplest case where and . From (50) we have
(55) 
We can see that the weight depends on the number of positive components of : if of them are positive then the distribution would be . There are in total choices of signs on each component of and therefore
(56) 
We then consider with general where the weights are given by
Comments
There are no comments yet.