Modern scientific studies collect huge data sets that include information on a large number of potential explanatory variables, and then attempt to discover possible associations between these variables and the response of interest. For example, in a genome-wide association study (GWAS), where researchers want to find which genetic variants are associated with a trait, one collects high dimensional single-nucleotide polymorphism (SNP) arrays and then aims to find the association between the trait and the SNPs.
Consider the following linear regression model:
where and the dimension is potentially much larger than the number of observations
. The regression coefficient vector has sparsity , which is assumed to satisfy throughout this paper, where the constant . Note that this sparsity setting includes both the classical strictly sparse model, i.e., , and the linear sparsity model, i.e., . An important aspect of any high dimensional estimation procedure is its variable selection performance. To evaluate it, we may consider the multiple hypothesis testing problem
with a particular control on its type I and type II errors. In the literature, a popular approach is to make as many rejections as possible subject to a pre-determined false discovery rate (FDR), or, said differently, to pursue the largest testing power under a given bound on the (expected) proportion of false rejections. Various works [3, 23, 26, 4] studied FDR control for independent or dependent p-values. Recently, [27, 32]
analyze the trade-off between type I and type II errors along the Lasso regularization path under Gaussian random design and linear sparsity. Unlike multiple testing problems, whose primary focus is only on the correctness of the rejection decisions, regression analysis has another important objective: the accuracy of parameter estimation. As far as we are aware, the connection between estimation and selection is rarely studied in the literature. Therefore, in this work, we try to bridge selection correctness and estimation accuracy and to understand the interplay between them. More specifically, we pursue the best type I error control, in the sense of the number of false discoveries and the false discovery rate, subject to rate-optimal estimation error.
As proved by , when the design matrix meets certain regularity conditions and , the minimax convergence rate for is of the order
(Footnote 1: When is an identity matrix, the minimax rate is where if .) Various estimation approaches attain this minimax rate [12, 1, 33, 35, 5, 22, 28, 17, 24]. Our main result shows that if an estimator is rate-minimax, i.e., there exists a constant such that holds for any -sparse vector with , then the logarithm of its false discovery rate under the worst scenario is no smaller than the order of . In other words, for any rate-minimax estimator,
for some positive constants , and , where and increase as increases. Furthermore, we show that this polynomial decay (with respect to the sparsity ratio ) is achievable. This implies the following minimax-type result for the FDR (and for the number of false discoveries, respectively):
Particularly, if , we obtain a sharper minimax result for the number of false discoveries when :
Based on this result, we characterize the optimal type I error control depending on the model sparsity:
(Polynomial sparsity) If for some fixed , then rate-minimax estimators can, at best, guarantee that the number of false discoveries decays to 0 (as long as the constant is sufficiently large);
(Near-linear sparsity) If and , the best rate-minimax estimators can guarantee that the rate of false discoveries decays to 0, but false positive selection always occurs in the worst scenario. Note that this agrees with existing results such as Theorem 3.4 of  or Corollary 5.3 of , which suggest that rate-minimax estimators select a model larger than the true model;
(Linear sparsity) If for some fixed , no rate-minimax estimator can ensure a decaying false discovery rate.
Note that the near-linear sparsity scenario is still a strictly sparse setting. If combined with a certain beta-min condition (i.e., the nonzero coordinates of are bounded away from 0) that guarantees no false negative selection, the above result implies that (a) under polynomial sparsity, rate-minimax estimators can achieve selection consistency; (b) under near-linear sparsity, rate-minimax estimators achieve, at best, almost full model recovery , that is,
To comment on other similar results,  also established the relationship between asymptotic sharp minimaxity and the false discovery rate for Gaussian means models. Their results rely on a narrower range of sparsity (i.e., polynomial sparsity), and the decay rate of the FDR is at most of logarithmic order. In contrast, the present work considers more general regression models and a broader sparsity range.
A toy simulation is conducted under normal means models, i.e., and , where , and nonzero ’s are which corresponds to the worst case of . The rate minimax estimator (4.1) with is used for estimation, and Figure 1 plots the logarithm of estimated FDR based on 100 independent simulations versus the logarithm of true sparsity ratio . The plot displays a clear and strong linear trend with .
It is worth mentioning that, in the literature, a variety of regression estimators achieve the rate- [20, 36, 10, 11, 37, 38, 34]. Our type I error control results above do not apply to this class of estimators; for example, under a proper choice of tuning parameter, the Lasso solution does not include any false positives . Note that and share exactly the same order under the polynomial sparsity setting. However, under near-linear and linear sparse models, is strictly larger than the minimax rate. Thus rate- estimators are considered (situationally) suboptimal. In this work, we distinguish suboptimal estimators from universally rate-minimax estimators, since near-linear and linear sparsity settings are of great practical interest. In many modern high dimensional studies, such as omics studies, it is not uncommon that the underlying model contains very many covariates, i.e., is a dense model. For example, in gene regulatory network studies, there are usually a huge number of regulators interacting with each other to change the expression level.
Another interesting relationship between rate-optimal estimators and rate- suboptimal estimators is that the former do yield false discoveries (under near-linear or linear sparsity), whilst the latter can achieve no false discovery. Take the hard thresholding estimator for normal means regression as an example: if and only if , it ensures no false positive selection, but in consequence its convergence rate is of . On the other hand, to attain rate-minimaxity, one must reduce , say to . A similar phenomenon occurs for the LASSO tuning parameter as well . Rigorously, we show that 1) any estimator that ensures no false positive selection (a no-false-positive estimator) has, at best, a suboptimal rate; and furthermore, 2) under proper regularity requirements, a selection-consistent estimator must be a no-false-positive estimator. Together, these results explain the phenomenon in the literature that most model selection consistent estimators only achieve suboptimal convergence rates. It is worth emphasizing that the second result above is not trivial. The term “no-false-positive” means that there is asymptotically no false positive selection regardless of the magnitude of the true parameter, whereas the term “selection-consistent” refers to consistently selecting the true underlying model under a necessary beta-min condition; therefore, no false positive is not a necessary condition for selection consistency. More discussion can be found in Sections 2 and 3.
Our study and results mainly focus on mean sequence models and regression models with independent random Gaussian covariates. This paper is organized as follows. In Section 2, we study the selection behavior of rate-minimax estimators under normal means models, and a similar theoretical investigation is conducted for regression models under Gaussian random design in Section 3. Section 4 shows that the lower bound discussed in Sections 2 and 3 can be achieved by -penalization. Further discussion and concluding remarks are provided in Section 5. All technical proofs are provided in the Appendix.
Notation of this work:
Throughout the paper, we use to denote a subset model, and be the size of this model. For any vector and matrix , and denote the sub-vector or submatrix corresponding to the model . Denote , and if is a zero vector, then we define . With a slight abuse of notation, we use as the operator that extracts the model of , i.e., . Let denote all -sparse vectors in the dimensional space. For two sequences of positive values and , means that , means , means that , and means that . Given two vectors , , is said to majorize if and for all .
2 Type I Error Control under Normal Means Models
In this section, we investigate the relationship between type I error control and rate minimaxity for the simple normal means model
where , and true parameter . We are interested in answering the following question: what is the best a rate-minimax estimator can do in terms of controlling the number of false discoveries, or the false discovery rate? Our main result proves that the false discovery rate of a rate-minimax estimator decreases, at best, at a polynomial rate of .
To state our result, we let denote the number of false discoveries yielded by estimator , and define . Hence the set is the collection of all estimators whose convergence is rate-optimal with a multiplicative constant . Our next theorem studies the minimax lower bound for the expected number of false discoveries.
Theorem 2.1. Under normal means models, if , then
for some positive constant and which depend on and the ratio . Furthermore, if and , then (2.1) holds for any asymptotically.
Given a reasonably large , the above theorem shows that any rate-minimax estimator can yield at least false discoveries (for some ) on average in the worst scenario. On the other hand, this polynomial lower bound of can be achieved by a simple hard thresholding minimax estimator
For this estimator, apparently . By the fact that where and are pdf and cdf of standard normal, we have that for any with , this hard thresholding estimator satisfies
Note that this hard-thresholding estimator is not practical since it relies on the unknown true sparsity, and in Section 4, we will discuss some adaptive estimator that can also achieve the same false positive control. In summary, we claim that
By the remarks in the proof of Theorem 2.1, the polynomial degree in (2.1) increases as increases, which implies a potential trade-off between estimation accuracy and false discovery control. In other words, estimators with larger (i.e., a worse convergence rate in terms of the multiplicative constant) will have a smaller lower bound for the expected number of false positives (i.e., potentially fewer type I errors). Such a trade-off is also reflected by the thresholding estimator (2.2): a larger penalization parameter increases the multiplicative constant of the convergence rate but, on the other hand, decreases the number of false positives, as the polynomial degree in (2.3) is larger.
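The trade-off described above can be illustrated numerically. The following is a minimal sketch, not the paper's estimator (2.2) itself: it assumes a hard threshold of the form tau = sqrt(2 c log(p/s)) with a hypothetical constant c, unit noise variance, and signals placed exactly at the threshold. The function names and the values of p, s and c are illustrative assumptions.

```python
import math
import random


def normal_tail(t):
    """P(Z > t) for a standard normal Z."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def threshold(p, s, c):
    """Assumed threshold form: tau = sqrt(2 * c * log(p / s))."""
    return math.sqrt(2.0 * c * math.log(p / s))


def expected_false_discoveries(p, s, c):
    """Exact mean number of the p - s null coordinates with |y_i| > tau."""
    return (p - s) * 2.0 * normal_tail(threshold(p, s, c))


def mills_upper_bound(p, s, c):
    """Bound from 1 - Phi(t) <= phi(t) / t; note exp(-tau^2 / 2) = (s/p)^c."""
    tau = threshold(p, s, c)
    phi = math.exp(-0.5 * tau * tau) / math.sqrt(2.0 * math.pi)
    return (p - s) * 2.0 * phi / tau


def simulate(p, s, c, reps=20, seed=0):
    """Monte Carlo average of false discoveries and squared error loss."""
    rng = random.Random(seed)
    tau = threshold(p, s, c)
    theta = [tau] * s + [0.0] * (p - s)  # assumption: signals sit at the threshold
    fd_sum, loss_sum = 0.0, 0.0
    for _ in range(reps):
        y = [t + rng.gauss(0.0, 1.0) for t in theta]
        est = [v if abs(v) > tau else 0.0 for v in y]  # hard thresholding
        fd_sum += sum(1 for e in est[s:] if e != 0.0)
        loss_sum += sum((e - t) ** 2 for e, t in zip(est, theta))
    return fd_sum / reps, loss_sum / reps


if __name__ == "__main__":
    p, s = 10000, 100
    for c in (1.0, 1.5, 2.0):
        fd, loss = simulate(p, s, c)
        print(c, round(expected_false_discoveries(p, s, c), 3), fd, round(loss, 1))
```

Under these assumptions, increasing c lowers the expected number of false discoveries roughly like p (s/p)^c, up to a logarithmic factor, which is the kind of polynomial dependence on the sparsity ratio discussed in this section.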
This minimax result implies that under polynomial sparsity, i.e., for some , as long as is sufficiently large such that , then , that is, there will be no false discovery in probability; but under linear or near-linear sparsity, we always have regardless of the value of . Note that doesn’t necessarily imply that doesn’t converge to 1. However, later on in Theorem 2.2, we will show that rate-minimax estimators indeed can never guarantee under near-linear or linear sparsity.
The result in Theorem 2.1 also implies a lower bound for the minimax false discovery rate under minimax estimations. Note that
where is the number of true positives. Combining with , we have that
In Section 4, we will show that polynomial decay of the false discovery rate is attainable, thus
This trivially implies that a good rate-minimax estimator can ensure that the FDR decays to 0 as increases under polynomial or near-linear sparsity, but not under linear sparsity. Readers should not interpret this result as saying that a rate-minimax estimator cannot achieve a small FDR under linear sparsity. Given any pre-specified FDR level, no matter how small, a rate-minimax estimator can still attain it, but at the expense of a very large ; conversely, given a pre-specified , the worst-case FDR of a rate-minimax estimator will always be bounded away from 0 as increases under linear sparsity.
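For concreteness, the false discovery proportion used in this argument can be computed as follows. This is a small sketch: the function names are hypothetical, the convention that an empty selection has proportion 0 is an assumption, and the lower bound reflects our reading that the number of true positives is at most the sparsity s.

```python
def false_discovery_proportion(selected, true_support):
    """FD / (FD + TP); an empty selection is assigned proportion 0 (assumed convention)."""
    selected, true_support = set(selected), set(true_support)
    fd = len(selected - true_support)  # selected but truly zero
    tp = len(selected & true_support)  # selected and truly nonzero
    return fd / (fd + tp) if fd + tp > 0 else 0.0


def fdp_lower_bound(fd, s):
    """Since TP <= s, FD / (FD + TP) >= FD / (FD + s)."""
    return fd / (fd + s) if fd + s > 0 else 0.0


if __name__ == "__main__":
    truth = {0, 1, 2, 3}        # hypothetical true model, s = 4
    picked = {0, 1, 2, 7, 9}    # hypothetical selected model: 3 true, 2 false
    print(false_discovery_proportion(picked, truth))
    print(fdp_lower_bound(2, len(truth)))
```

The second function makes the step from a lower bound on the number of false discoveries to a lower bound on the proportion explicit: the proportion can only be pushed down by true positives, and there are at most s of those.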
Another important aspect of the selection behavior is the type II error, or, said differently, false negative selection. In the high dimensional literature, a common result is that a nonzero covariate will be consistently selected if the magnitude of the true parameter is larger than a certain threshold value, i.e., under a proper beta-min condition. A trivial result is that, if the estimator satisfies for some constant with high probability, then in probability, as long as . Sharper beta-min conditions are obtained on a case-by-case basis and depend on which estimator is used.
Combining the above discussion with our previous results on false discovery control, we can conclude that if is a rate-minimax estimator with optimal polynomially decaying FDR control, then: under polynomial sparsity and the beta-min condition, it can still recover the exact sparsity structure; under near-linear sparsity and the beta-min condition, since , it can accomplish almost full recovery of the sparsity structure , i.e., , where denotes the symmetric difference of two sets; under linear sparsity and the beta-min condition, it can only ensure that and for some positive , and we can make the smaller at the expense of a larger multiplicative constant of its convergence rate. It is worth mentioning that the model selection behaviors described above only hold for rate-minimax estimators that have universally polynomial decay rates for the FDR, not for all rate-minimax estimators. For instance, the estimator , which selects the top covariates in terms of the absolute value , is a rate-minimax estimator, but it has no control on false discoveries at all, especially when the true nonzero coefficients are small. Hence our previous arguments do not apply here, and this estimator is indeed always selection consistent under the beta-min condition regardless of the growth of sparsity.
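The top-covariate estimator mentioned above is easy to sketch. The code below is illustrative only: the function names and the toy data are hypothetical, and the true sparsity is passed in as `k`, which a practical procedure would not know.

```python
def top_k_support(y, k):
    """Indices of the k entries of y with the largest absolute values."""
    order = sorted(range(len(y)), key=lambda i: -abs(y[i]))
    return set(order[:k])


def symmetric_difference_size(a, b):
    """Number of indices that belong to exactly one of the two models."""
    return len(set(a) ^ set(b))


if __name__ == "__main__":
    truth = {0, 1}                            # hypothetical true model
    strong = [4.2, -3.8, 0.5, -0.9, 0.1]      # signals well above the noise
    weak = [0.3, -0.2, 0.5, -0.9, 0.1]        # signals buried in the noise
    for y in (strong, weak):
        picked = top_k_support(y, len(truth))
        print(sorted(picked), symmetric_difference_size(picked, truth))
```

With strong signals the selected model coincides with the truth, whereas with weak signals every selected index is a false discovery. This matches the remark that the estimator is selection consistent under a beta-min condition yet offers no false discovery control when the nonzero coefficients are small.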
Now, we would like to investigate in depth the false discovery control of minimax estimators under the near-linear or linear sparsity setting. As discussed above, under (near-)linear sparsity, , but this does not directly imply that false discoveries occur with positive probability, i.e., . Our next result confirms this in the following way.
Theorem 2.2. For any estimator , we have
for some constant , where as and .
Result (2.4) claims that if an estimator ensures no false discovery, that is , its convergence rate is at least of order . Under the (near-)linear sparsity setting, to achieve the -rate, we must require . Equivalently, a rate-minimax estimator must satisfy under (near-)linear sparsity, that is, in the worst case, false discovery always occurs.
Theorem 2.2 essentially claims the incompatibility between rate minimaxity and no false discovery. It is quite tempting to believe that similar incompatibility phenomenon occurs between rate minimaxity and selection consistency as well, since it is indeed true for many popular penalized estimators and Bayesian shrinkage estimators proposed in literature. However, as mentioned in the Introduction section, no false discovery is not quite a prerequisite for selection consistency, since the former concept holds uniformly for all -sparse and the latter one requires beta-min condition. Counterexamples that can achieve rate-minimaxity and selection consistency simultaneously include the estimator that selects top covariates, or the following one which doesn’t rely on true sparsity ,
where and is some large constant. The rate-minimaxity of estimator (2.5) follows from arguments similar to those used in Theorem 4.1. These counterexamples, although selection consistent and rate minimax, possess an unusual selection behavior: larger data values do not always induce a larger selected model. For instance, consider two data vectors and where , i.e., is larger than in terms of data magnitude. Then for estimator (2.5), and , i.e., larger data values actually yield a smaller subset model.
It turns out that the incompatibility between selection consistency and rate minimaxity depends on whether the estimator possesses a certain monotone selection property: a monotone estimator is never both selection consistent and rate minimax. Formally, we call an estimator monotone if majorizes provided that majorizes . This monotonicity trivially implies that if majorizes . Define the class of selection consistent estimators as for some given positive function , where represents the minimal signal strength, e.g. . Let be the collection of estimators that ensure no false discovery asymptotically.
Lemma 2.1. If a monotone estimator , then .
The above result states that if a selection consistent estimator is monotone, then it must never yield false discoveries; hence, by Theorem 2.2, it cannot be rate-minimax. This result provides an explanation for the incompatibility between rate minimaxity and selection consistency observed in the literature, as most estimators in use are monotone. For instance, if a separable penalty function (i.e., for some function ) is used for a penalized estimator under normal means models, then the monotonicity of the estimator is equivalent to the monotonicity of the thresholding function . This holds as long as is symmetric and non-negative, , and is monotone on . Therefore, almost all penalty functions proposed in the literature, including LASSO, non-concave penalties [36, 15], penalty and reciprocal penalty , lead to monotone estimators. Other popular frequentist approaches, such as the FDR estimator  and the SLOPE estimator , also belong to the class of monotone estimators.
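The monotonicity property can be checked numerically for the two most familiar thresholding rules. The sketch below assumes that one vector majorizes another when their entries never have opposite signs and the first has coordinate-wise larger magnitude; this is our reading of the definition in the notation paragraph, and the helper names are hypothetical.

```python
import random


def soft_threshold(y, lam):
    """Soft thresholding: shrink each entry toward zero by lam."""
    return [(abs(v) - lam) * (1.0 if v > 0 else -1.0) if abs(v) > lam else 0.0
            for v in y]


def hard_threshold(y, lam):
    """Hard thresholding: keep entries with magnitude above lam."""
    return [v if abs(v) > lam else 0.0 for v in y]


def majorizes(a, b):
    """Assumed definition: a_i * b_i >= 0 and |a_i| >= |b_i| for every i."""
    return all(u * v >= 0 and abs(u) >= abs(v) for u, v in zip(a, b))


def inflate(y, rng):
    """Return a vector that majorizes y: same signs, larger magnitudes."""
    return [v * (1.0 + rng.random()) for v in y]


if __name__ == "__main__":
    rng = random.Random(0)
    y = [rng.gauss(0.0, 2.0) for _ in range(10)]
    y_big = inflate(y, rng)
    print(majorizes(soft_threshold(y_big, 1.5), soft_threshold(y, 1.5)))
    print(majorizes(hard_threshold(y_big, 1.5), hard_threshold(y, 1.5)))
```

In particular, when the estimate from the larger data majorizes the estimate from the smaller data, the model selected from the smaller data is contained in the model selected from the larger data.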
3 Type I Error Control under Gaussian Regression Models
In this section, we are interested in generalizing the theorems in Section 2 to regression models
where and . To facilitate theoretical analysis, we restrict our investigation to the case that the design matrix is almost orthogonal. Particularly, we consider that the design matrix follows the independent Gaussian random design, i.e.
(C1) All entries in the design matrix are i.i.d. standard normally distributed.
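To see why such a design is close to orthogonal on low dimensional submodels, one can inspect the scaled Gram matrix of a few columns. The following sketch uses hypothetical sizes n and k and a Gershgorin-type bound: if r is the returned radius, every eigenvalue of the scaled Gram matrix lies in [1 - r, 1 + r], so the singular values of the column-submatrix divided by sqrt(n) lie in [sqrt(1 - r), sqrt(1 + r)] when r < 1.

```python
import random


def gaussian_design(n, k, seed=0):
    """n-by-k matrix of i.i.d. standard normal entries, stored as a list of rows."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n)]


def scaled_gram(x):
    """G = X^T X / n for a design stored as a list of rows."""
    n, k = len(x), len(x[0])
    g = [[0.0] * k for _ in range(k)]
    for row in x:
        for i in range(k):
            for j in range(i, k):
                g[i][j] += row[i] * row[j] / n
    for i in range(k):
        for j in range(i):
            g[i][j] = g[j][i]  # fill the lower triangle by symmetry
    return g


def gershgorin_radius(g):
    """Largest row-wise value of |G_ii - 1| + sum over j != i of |G_ij|."""
    k = len(g)
    return max(
        abs(g[i][i] - 1.0) + sum(abs(g[i][j]) for j in range(k) if j != i)
        for i in range(k)
    )


if __name__ == "__main__":
    for n in (200, 2000):
        g = scaled_gram(gaussian_design(n, 5, seed=1))
        print(n, round(gershgorin_radius(g), 3))
```

Since the columns are independent, drawing an n-by-k Gaussian matrix directly is equivalent to picking any fixed k columns of a larger design. The radius shrinks as n grows relative to k, which is the near-orthogonality that the analysis relies on.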
First of all, we obtain the same lower bound for the minimax expected number of false positives as in the means models. Define the collection of rate-minimax estimators with any constant . The following result holds:
Theorem 3.1. Under condition (C1), if for some constant , and , are reasonably large, then
for some constant and , where depends on , and . In particular, if , and , the above lower bound holds for any asymptotically.
It is worth mentioning this polynomial decaying lower bound actually holds as long as where denotes the th column of , but not necessarily under random Gaussian design. The proof in the appendix shows that this lower bound can be attained by some estimation function where subscript denotes all indices but . Note that this is not an estimator since it depends on knowledge of true . In Section 4, we will show that under condition (C1) and , there exists a penalized estimator that can achieve this polynomial decaying bound. Thus, by the same arguments used in Section 2,
under Gaussian design and . Our remarks on the relationship between sparsity growth and type I error control behavior for means models, therefore apply to Gaussian linear regression model as well.
As in the means model, the following theorem establishes the relationship between the rate of convergence and the probability of making false discoveries, and claims that rate-minimax estimators always yield false positives in the worst scenario.
Theorem 3.2. Under Gaussian regression models, any estimator satisfies
for some constant , where as , and .
Under the normal means model, we connect no-false-discovery estimation and selection-consistent estimation by introducing the concept of monotonicity. But under the general regression model, due to the column dependencies of , it is difficult to introduce a similar concept or to obtain results similar to Lemma 2.1. However, by random matrix theory, e.g., , under condition (C1), with high probability, the singular values of low dimensional submatrices of are very close to 1, i.e., the columns in are nearly orthogonal. Hence, we conjecture that selection-consistent estimators which possess monotonicity under normal means models (i.e., is a monotone estimator) can still ensure no false discovery for Gaussian regression models, under a proper condition on the growth rate of dimension and sparsity. For example,  showed that the LASSO estimator is selection consistent and yields no false positives when and its tuning parameter ; a similar result holds for the penalized estimator with penalty as well, if and (refer to Lemma A.4 in the Appendix). Theoretical investigation of this matter is beyond the scope of this work. In general, we conjecture that:
Selection consistent penalized estimators induced by separable penalty function , in general, ensure that, asymptotically there is no false discovery if and grow slowly.
More discussion of the above proposition is provided in Appendix B. In particular, we show that the above proposition is true under some regularity conditions. Therefore, at least for some classes of penalized estimators, selection consistency and rate-minimaxity can never be accomplished simultaneously.
4 Optimal Penalized Estimator
In this section, we show that there exists a rate-minimax estimator that can achieve the polynomial decay rate for the false discovery control, i.e., the lower bounds derived in the previous theorems are attainable. We consider the class of estimators based on selection criterion:
where the penalty only depends on the norm. The estimator is the OLS estimation based on the selected model , which is obtained by searching the model space as follows:
Here, is the residual sum of squares under model .
The particular penalty function we use in this section is
for some user-specific parameter . Equivalently,
Thus, tuning parameter is an upper bound size for the searched models, and a trivial choice could be . Within the model size search range , this penalty function assigns smaller penalty for adding one more covariate into the current model when the size of current model is larger.
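Under the normal means model the model search simplifies, because the best model of a given size keeps the entries with the largest magnitudes. The sketch below illustrates this search with an assumed penalty of the form lam * sum_{j=1}^{k} log(p / j), chosen only because its increment decreases with the model size as described above; it is not necessarily the exact penalty of estimator (4.1). The names `lam`, `k_max` and `l0_select` are hypothetical.

```python
import math


def penalty(k, p, lam):
    """Assumed penalty: lam * sum_{j=1}^{k} log(p / j); increments shrink as k grows."""
    return lam * sum(math.log(p / j) for j in range(1, k + 1))


def l0_select(y, lam, k_max):
    """Minimize RSS(k) + penalty(k) over model sizes k = 0, ..., k_max."""
    p = len(y)
    order = sorted(range(p), key=lambda i: -y[i] ** 2)
    rss = sum(v * v for v in y)          # residual sum of squares of the empty model
    best_k, best_crit = 0, rss
    for k in range(1, k_max + 1):
        rss -= y[order[k - 1]] ** 2      # best size-k model keeps the k largest |y_i|
        crit = rss + penalty(k, p, lam)
        if crit < best_crit:             # strict inequality keeps the smaller model on ties
            best_k, best_crit = k, crit
    support = set(order[:best_k])
    # With an identity design, least squares on the selected model returns y itself.
    estimate = [v if i in support else 0.0 for i, v in enumerate(y)]
    return support, estimate


if __name__ == "__main__":
    y = [5.0, -4.0, 0.3, 0.1, -0.2, 0.05, 0.0, 0.1]
    print(l0_select(y, lam=2.0, k_max=4))
```

For a general design matrix the search over models of each size is combinatorial, so this shortcut applies only to the means model or to an orthogonal design.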
The penalty has already been extensively used for regression in the literature [18, 29, 33, 16, 6]. For example,  investigated the convergence rate of this penalization under the normal means problem, and established sharp minimaxity under and .  studied a wide class of selection penalizations under a general regression setting including (4.1). These existing results mostly focus on the convergence of estimation or the risk . In this section, we also focus on its selection behavior, especially its false discovery control behavior.
It is worth mentioning that this form of penalty also strongly links to the Benjamini-Hochberg (BH) FDR control procedure . More specifically, under means models, the step-up BH FDR estimator  is , where is the th largest entry of , is the rightmost local minimum of
where is the desired FDR level.  showed that the step-up BH FDR estimator is sharply minimax if and the sparsity ratio for some . On the other hand, the penalized estimator (4.1) is also equivalent to where the is the global minimum within the range of for the objective function
When , and are approximately the same, since . Another related work is the SLOPE estimator [28, 7], which can be viewed as a soft thresholding FDR penalization. SLOPE estimator controls the FDR under means model, and achieves sharp minimaxity under Gaussian random design.
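For reference, the textbook step-up Benjamini-Hochberg rule for the means model can be sketched as follows. It assumes unit noise variance and two-sided normal p-values, and it is written in the standard p-value form rather than the objective-function form used above; the function names are hypothetical.

```python
import math


def two_sided_p_values(y):
    """P(|Z| > |y_i|) for a standard normal Z, via the complementary error function."""
    return [math.erfc(abs(v) / math.sqrt(2.0)) for v in y]


def bh_step_up(y, q):
    """Reject the k* smallest p-values, where k* = max{k : p_(k) <= q * k / m}."""
    m = len(y)
    pvals = two_sided_p_values(y)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_star = 0
    for k in range(1, m + 1):
        if pvals[order[k - 1]] <= q * k / m:
            k_star = k  # keep the largest k that passes (step-up rule)
    return set(order[:k_star])


if __name__ == "__main__":
    y = [4.5, -4.0, 0.2, -0.5, 1.0, 0.1, 3.9, -0.3, 0.7, 0.0]
    print(sorted(bh_step_up(y, q=0.1)))
```

The cutoff q * k / m grows with the number of rejections k, so on the data scale each additional rejection faces a lower threshold, which mirrors the decreasing incremental penalty discussed in this section.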
For the sparse means model, the next theorem shows that the penalty induces a rate-minimax estimator, as well as polynomial decay of false discovery control.
Theorem 4.1. Consider estimator (4.1) under mean sequence models with parameter and , if the tuning parameters and is sufficiently large, then the following properties asymptotically hold with dominating probability ,
where and are some positive constants. Furthermore, if and tuning parameters satisfy and , then the above results asymptotically hold for any and .
This result indicates that, in probability, the number of false discoveries is bounded by , which, said differently, matches the lower bound order presented in the previous section. Furthermore, under the strict sparsity setting, the polynomial degree can be as large as , where is the upper bound of the multiplicative constant of the convergence rate. Comparing with the polynomial degree of the lower bound result (i.e., ) in Theorem 2.1, we see that the polynomial degree of estimator (4.1) is nearly optimal as well. In the statement of this theorem, the phrase “with dominating probability” means with probability at least for some , as shown in the proof in the Appendix. This hence implies that
In the literature,  established the sharp minimaxity of step-up BH FDR estimator under polynomial sparsity, where its FDR is allowed to decrease at rate of . But our result shows that estimator (4.1) is almost minimax under both polynomial and near-linear sparsity (when is sufficiently close to 2), and its FDR can decay polynomially fast.
In general, the value of balances the rate of convergence against the decay rate of the false discovery rate. A larger leads to a larger polynomial order , but at the expense of a greater multiplicative constant in the convergence rate. In other words, there is a trade-off between false discovery control and estimation accuracy in terms of the choice of .
For Gaussian design regression model, similar results can be developed, as stated in the next theorem.
Theorem 4.2. Consider estimator (4.1) for Gaussian design linear regression models with parameter , and , if we choose tuning parameters and to be a sufficiently large constant, then the following results hold asymptotically with dominating probability:
for some positive constant and . If furthermore, , then
hold with high probability for some constant .
The above theorem asserts that the number of false discoveries is bounded linearly by , and the polynomial rate of false discovery control can be attained under the dimensional condition . Under the strictly sparse setting, as shown by Theorem 1.3 of , the minimax convergence rate is , thus the estimator (4.1) is almost sharply minimax if we choose , and the polynomial degree for the type I error control is almost sharp as well. The other remarks on Theorem 4.1 apply to this theorem as well.
5 Conclusion and Discussion
In this work, we mainly investigate the selection performance of rate-minimax estimators; more precisely, we are interested in understanding the best possible type I error control under rate-optimal estimation. Our study shows that rate-optimal estimation can induce as many as false positive selections, and its FDR decays at best at a polynomial rate of . Therefore, depending on the growth rate of sparsity, rate-minimax estimators have different optimal selection performance. Under near-linear sparsity, the number of false discoveries cannot be bounded and can grow to infinity; under linear sparsity, the false discovery rate is bounded away from 0 in the worst case. These results also help us to understand the incompatibility between selection consistency and rate-minimaxity observed in the statistical literature.
A polynomial rate of false discovery control can be achieved by the adaptive penalty . Under the beta-min condition, the resulting penalized estimator can recover the true model under polynomial sparsity, and almost recover the true model under near-linear sparsity. Under linear sparsity, however, no such selection consistency is guaranteed. In addition, this penalized estimator is almost sharply minimax under polynomial or near-linear sparsity given . Notice that the SLOPE estimator employs a soft version of -penalization; hence we conjecture that the SLOPE estimator can also achieve asymptotic FDP control similar to that in our Theorem 4.2.
Appendix A Proofs
Proof of Lemma 2.1
If a monotone selection consistent estimator , then there exists a sequence of , such that , where . To derive a contradiction, it is sufficient to show that there exists a sequence of , such that and .
For any , if and for some index , we claim that if we replace this entry by a sufficiently large value, i.e., there exists a such that for all and , then . Therefore, the can be constructed by replacing all small-but-nonzero entries of by entries with large absolute values.
Now we show the existence of . Without loss of generality, let the index and denote where . We define the set where . By the monotonicity of the estimator, if , and , then . First of all, if , then is a set of Lebesgue measure zero, and the existence of is trivial. Now we only consider . If