The purpose of this paper is twofold. First, we provide a unified treatment of prediction, coefficient estimation, and variable selection properties of concave penalized least squares estimation (PLSE) in high-dimensional linear regression under the restricted eigenvalue (RE) condition on the design matrix. Second, we propose sorted concave PLSE to combine the advantages of concave and sorted penalties, and prove its superior theoretical properties and computational feasibility under the RE condition. Along the way, we study penalty level and concavity of multivariate penalty functions, including mixed penalties motivated by Bayesian considerations as well as sorted and separable penalties. Local convex approximation (LCA) is proposed and studied as a solution for the computation of sorted concave PLSE.
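Consider the linear model
$$y = X\beta + \varepsilon,$$
where $X = (x_1, \ldots, x_p) \in \mathbb{R}^{n \times p}$ is a design matrix, $y \in \mathbb{R}^n$ is a response vector, $\varepsilon \in \mathbb{R}^n$ is a noise vector, and $\beta \in \mathbb{R}^p$ is an unknown coefficient vector. For simplicity, we assume throughout the paper that the design matrix is column normalized with $\|x_j\|_2^2 = n$.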
Our study focuses on local and approximate solutions for the minimization of penalized loss functions of the form
$$\|y - Xb\|_2^2/(2n) + \mathrm{Pen}(b)$$
with a penalty function $\mathrm{Pen}(b)$ satisfying certain minimum penalty level and maximum concavity conditions as described in Section 2. The PLSE can be viewed as a statistical choice among local minimizers of the penalized loss.
Among PLSE methods, the Lasso  with the $\ell_1$ penalty is the most widely used and extensively studied. The Lasso is relatively easy to compute as it is a convex minimization problem, but it is well known that the Lasso is biased. A consequence of this bias is the requirement of a neighborhood stability/strong irrepresentable condition on the design matrix for the selection consistency of the Lasso [13, 33, 24, 26]. Fan and Li  proposed a concave penalty to remove the bias of the Lasso and proved an oracle property for one of the local minimizers of the resulting penalized loss. Zhang  proposed a path-finding algorithm, PLUS, for concave PLSE and proved the selection consistency of the PLUS-computed local minimizer under a rate optimal signal strength condition on the coefficients and the sparse Riesz condition (SRC)  on the design. The SRC, which requires bounds on both the lower and upper sparse eigenvalues of the Gram matrix and is closely related to the restricted isometry property (RIP) , is substantially weaker than the strong irrepresentable condition. This advantage of concave PLSE over the Lasso has since become well understood.
For prediction and coefficient estimation, the existing literature appears to tell the opposite story. Consider hard sparse coefficient vectors satisfying with and small . Although rate minimax error bounds were proved under the RIP and SRC respectively for the Dantzig selector and Lasso in  and , Bickel et al.  sharpened their results by weakening the RIP and SRC to the RE condition, and van de Geer and Bühlmann  proved comparable prediction and estimation error bounds under an even weaker compatibility or RE condition. Meanwhile, rate minimax error bounds for concave PLSE still require two-sided sparse eigenvalue conditions like the SRC [29, 32, 27, 10] or a proper known upper bound for the norm of the true coefficient vector . It turns out that the difference between the SRC and RE conditions is quite significant, as Rudelson and Zhou  proved that the RE condition is a consequence of a lower sparse eigenvalue condition alone. This seems to suggest a theoretical advantage of the Lasso, in addition to its relative computational simplicity, compared with concave PLSE.
Emerging from the above discussion, an interesting question is whether the RE condition alone on the design matrix is also sufficient for the above-discussed results for concave penalized prediction, coefficient estimation and variable selection, provided proper conditions on the true coefficient vector and the noise. An affirmative answer to this question, which we provide in this paper, amounts to the removal of the upper sparse eigenvalue condition on the design matrix and actually also a relaxation of the lower sparse eigenvalue condition or the restricted strong convexity (RSC) condition  imposed in ; or equivalently, to the removal of the remaining analytical advantage of the Lasso as far as error bounds for the aforementioned aims are concerned. Specifically, we prove that when the true $\beta$ is sparse, concave PLSE achieves rate minimaxity in prediction and coefficient estimation under the RE condition on the design. Furthermore, the selection consistency of concave PLSE is also guaranteed under the same RE condition and an additional uniform signal strength condition on the nonzero coefficients, and these results also cover non-separable multivariate penalties imposed on the vector as a whole, including sorted and mixed penalties such as the spike-and-slab Lasso .
In addition to the above conservative prediction and estimation error bounds for the concave PLSE that are comparable with those for the Lasso in both rates and regularity conditions on the design, we also prove faster rates for concave PLSE when the signal is partially strong. For example, instead of the prediction error rate in the worst case scenario, the prediction rate for concave PLSE is actually where is the number of small nonzero signals under the same RE condition on the design. Thus, concave PLSE adaptively benefits from signal strength with no harm to the performance in the worst case scenario where all signals are just below the radar screen. This advantage of concave PLSE is known under the sparse Riesz and comparable conditions, but not under the RE condition as presented in this paper.
The bias of the Lasso can also be reduced by taking a smaller penalty level than that required for variable selection consistency, regardless of signal strength. In the literature, PLSE is typically studied in a standard setting at penalty level . This lower bound has been referred to as the universal penalty level. However, as the bias of the Lasso is proportional to its penalty level, rate minimaxity in prediction and coefficient estimation requires a smaller penalty level [22, 3]. Unfortunately, this smaller penalty level depends on the sparsity of the true coefficient vector, which is typically unknown. For the $\ell_1$ penalty, a remedy for this issue is to apply the Slope or a Lepski type procedure [21, 3]. However, it is unclear from the literature whether the same can be done with concave penalties.
We propose a class of sorted concave penalties to combine the advantages of concave and sorted penalties. This extends the Slope beyond the $\ell_1$ penalty. Under an RE condition, we prove that the sorted concave PLSE inherits the benefits of both concave and sorted PLSE, namely bias reduction through signal strength and adaptation to the smaller penalty level. This provides prediction and estimation error bounds of the order and comparable estimation error bounds. Moreover, our results apply to approximate local solutions which can be viewed as the output of computational algorithms for sorted concave PLSE.
To prove the computational feasibility of our theoretical results in polynomial time, we develop an LCA algorithm for a large class of multivariate concave PLSE to produce approximate local solutions to which our theoretical results apply. The LCA is a majorization-minimization (MM) algorithm and is closely related to the local quadratic approximation (LQA)  and the local linear approximation (LLA)  algorithms. The development of the LCA is needed as the LLA does not majorize sorted concave penalties in general. Our analysis of the LCA can be viewed as an extension of the results in [32, 11, 14, 1, 27, 12, 10], where separable penalties are considered, typically at larger penalty levels.
The rest of this paper is organized as follows. In Section 2 we study penalty level and concavity of general multivariate penalties in a general optimization setting, including separable, multivariate mixed and sorted penalties, and also introduce the LCA for sorted penalties. In Section 3, we develop a unified treatment of prediction, coefficient estimation and variable selection properties of concave PLSE under the RE condition at penalty levels required for variable selection consistency. In Section 4 we provide error bounds for approximate solutions at smaller and sorted penalty levels and for the output of LCA algorithms.
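Notation: We denote by $\beta^*$ the true regression coefficient vector, $X^\top X/n$ the sample Gram matrix, $S = \mathrm{supp}(\beta^*)$ the support set of the coefficient vector, $s = |S|$ the size of the support, and $\Phi(\cdot)$ the standard Gaussian cumulative distribution function. For vectors $v = (v_1, \ldots, v_p)^\top$, we denote by $\|v\|_q = (\sum_j |v_j|^q)^{1/q}$ the $\ell_q$ norm, with $\|v\|_\infty = \max_j |v_j|$ and $\|v\|_0 = \#\{j : v_j \neq 0\}$. Moreover, $x_+ = \max(x, 0)$.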
2 Penalty functions
We consider minimization of penalized loss
$$L(b) + \mathrm{Pen}(b) \qquad (2.1)$$
with a general Fréchet differentiable loss function $L(b)$ and a general multivariate penalty function $\mathrm{Pen}(b)$ satisfying certain minimum penalty level and maximum concavity conditions.
Penalty level and concavity of univariate penalty functions are well understood as we will briefly describe in our discussion of separable penalties in Subsection 2.2 below. However, for multivariate penalties, we need to carefully define their penalty level and concavity in terms of sub-differential. This is done in Subsection 2.1. We then study in separate subsections three types of penalty functions, namely separable penalties, multivariate mixed penalties, and sorted penalties. Moreover, we develop the LCA for sorted penalties in Subsection 2.5.
2.1 Sub-differential, penalty level and concavity
The sub-differential of a penalty at a point , denoted by as a subset of , can be defined as follows. A vector belongs iff
As is continuous in , is always a closed convex set.
Suppose is everywhere Fréchet differentiable with derivative . It follows immediately from the definition of the sub-differential in (2.2) that
for all iff . This includes all local minimizers. Let denote a member of . We say that is a local solution for minimizing (2.1) iff the following estimating equation is feasible:
As (2.3) characterizes all minimizers of the penalized loss when the penalized loss is convex, it can be viewed as a Karush-Kuhn-Tucker (KKT) condition.
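To make the KKT interpretation concrete, the following is a minimal sketch for the familiar special case of the Lasso, assuming the squared loss $\|y - Xb\|_2^2/(2n)$ and the $\ell_1$ penalty with level `lam`. In that case the stationarity condition requires $x_j^\top (y - Xb)/n = \lambda\,\mathrm{sgn}(b_j)$ when $b_j \neq 0$ and $|x_j^\top (y - Xb)/n| \le \lambda$ otherwise. The function name `lasso_kkt_holds` and the toy data are hypothetical and chosen only for illustration.

```python
def lasso_kkt_holds(X, y, b, lam, tol=1e-9):
    """Check the Lasso stationarity condition for the loss ||y - Xb||^2 / (2n).

    X is a list of n rows, each a list of p floats; y and b are lists of floats.
    """
    n, p = len(X), len(X[0])
    # Residual vector r = y - Xb.
    resid = [y[i] - sum(X[i][j] * b[j] for j in range(p)) for i in range(n)]
    for j in range(p):
        # j-th entry of X^T r / n, i.e. minus the gradient of the loss.
        g = sum(X[i][j] * resid[i] for i in range(n)) / n
        if b[j] != 0.0:
            sign = 1.0 if b[j] > 0 else -1.0
            if abs(g - lam * sign) > tol:   # must equal lam * sgn(b_j)
                return False
        elif abs(g) > lam + tol:            # must lie in [-lam, lam]
            return False
    return True


# Toy orthogonal design with columns normalized so that ||x_j||^2 = n = 2.
X = [[1.0, 1.0], [1.0, -1.0]]
y = [3.0, 1.0]
print(lasso_kkt_holds(X, y, [0.5, 0.0], 1.5))  # True
print(lasso_kkt_holds(X, y, [2.0, 1.0], 1.5))  # False
```

The second call uses the least squares fit, which has zero residual and therefore fails the condition at the nonzero coordinates whenever the penalty level is positive.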
We define the penalty level of at a point as
This definition is designed to achieve sparsity for solutions of (2.3). Although is a function of in general, it depends solely on for many commonly used penalty functions. Thus, we may denote by for notational simplicity. For example, in the case of the $\ell_1$ penalty, (2.4) holds with for all with . We consider a somewhat weaker penalty level for the sorted penalty in Subsection 2.4.
We define the concavity of at , relative to an oracle/target coefficient vector , as
with the convention , where the supremum is taken over all choices and . We use to denote the maximum concavity of . For convex penalties, is the symmetric Bregman divergence. A penalty function is convex if and only if . Given and , we may consider a relaxed concavity of at as
where infimum is taken over all nonnegative and satisfying
with for all and . This notion of concavity is more relaxed than the one in (2.5) because always holds due to the option of picking . The relaxed concavity is quite useful in our study of multivariate mixed penalties in Subsection 2.3. To include more solutions for (2.3) and also to avoid the sometimes tedious task of fully characterizing the sub-differential, we allow to be a member of the following “completion” of the sub-differential,
in the estimating equation (2.3), as long as is replaced by the same subset in (2.3), (2.4), (2.5), (2.6) and (2.7). However, for notational simplicity, we may still use to denote . We may also impose an upper bound condition on the penalty level of :
It is common to have although by (2.4). Without loss of generality, we impose the condition .
2.2 Separable penalties
In general, separable penalty functions can be written as a sum of penalties on individual variables, . We shall focus on separable penalties of the form
where is a parametric family of penalties with the following properties:
(i) is symmetric, with ;
(ii) is monotone, for all ;
(iii) is left- and right-differentiable in for all ;
(iv) has selection property, ;
(v) for all real ,
where denote the one-sided derivatives. Condition (iv) guarantees that the index equals the penalty level defined in (2.4), and condition (v) bounds the maximum penalty level with in (2.9). We write when is between the left- and right-derivative of at , including where means , so that is defined in the sense of (2.8). By (2.5), the concavity of is defined as
where the supremum is taken over all possible choices of and between the left- and right-derivatives. Further, define the overall maximum concavity of as
gives the concavity (2.5) of the multivariate penalty .
Many popular penalty functions satisfy conditions (i)–(v) above, including the $\ell_1$ penalty for the Lasso with , the SCAD (smoothly clipped absolute deviation) penalty  with
and , and the MCP (minimax concave penalty)  with
and . An interesting way of constructing penalty functions is to mix penalties with a distribution and a real as follows,
This class of mixed penalties has a Bayesian interpretation as we discuss in Subsection 2.3. If we treat as conditional density of
under a joint probability, we have
due to and for all and . For example, if puts the entire mass in a two-point set ,
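As a concrete companion to the SCAD and MCP penalties named in this subsection, the following is a minimal sketch of their standard closed forms. It assumes the usual parametrizations from the literature, with a shape parameter `gamma` (greater than 0 for the MCP and greater than 1 for the SCAD), under which the derivative at zero equals the penalty level `lam` and the maximum concavity is `1/gamma` for the MCP and `1/(gamma - 1)` for the SCAD. The function names are hypothetical.

```python
def mcp(t, lam, gamma):
    """MCP: lam * integral over [0, |t|] of (1 - x / (gamma * lam))_+ dx."""
    a = abs(t)
    if a <= gamma * lam:
        return lam * a - a * a / (2.0 * gamma)
    return gamma * lam * lam / 2.0        # flat beyond gamma * lam


def mcp_deriv(t, lam, gamma):
    """Derivative of the MCP at |t|; equals lam at 0+ and vanishes past gamma * lam."""
    return max(lam - abs(t) / gamma, 0.0)


def scad(t, lam, gamma):
    """SCAD: linear up to lam, quadratic up to gamma * lam, then flat."""
    a = abs(t)
    if a <= lam:
        return lam * a
    if a <= gamma * lam:
        return (2.0 * gamma * lam * a - a * a - lam * lam) / (2.0 * (gamma - 1.0))
    return lam * lam * (gamma + 1.0) / 2.0


def scad_deriv(t, lam, gamma):
    """Derivative of the SCAD at |t|; equals lam on [0, lam]."""
    a = abs(t)
    if a <= lam:
        return lam
    return max(gamma * lam - a, 0.0) / (gamma - 1.0)


print(mcp(1.0, 1.0, 2.0), mcp(5.0, 1.0, 2.0))     # 0.75 1.0
print(scad(2.0, 1.0, 3.0), scad(10.0, 1.0, 3.0))  # 1.75 2.0
```

Both penalties agree with the $\ell_1$ penalty near the origin and level off for large arguments, which is the source of the bias reduction for strong signals discussed in the introduction.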
2.3 Multivariate mixed penalties
Let be a parametric family of prior density functions for . When with known and is given, the posterior mode can be written as the minimizer of
when . In a hierarchical Bayes model where has a prior distribution , the posterior mode corresponds to
with and . This gives rise to (2.19) as a general way of mixing penalties with suitable . When , it corresponds to the posterior for a proper hierarchical prior if the integration is finite, and an improper one otherwise. When , it still has a Bayesian interpretation with respect to mis-specified noise level or sample size . While leads to as the limit at , the formulation does not prohibit .
For , let be a separable penalty function with different penalty levels for different coefficients , where is a family of penalties indexed by penalty level as discussed in Subsection 2.2. As in (2.19),
with the convention for , is a mixed penalty for any probability measure . We study below the sub-differential, penalty level and concavity of such mixed penalties.
By definition, the sub-differential of (2.20) can be written as
where the conditional is proportional to . We recall that may take any value in .
with being the largest eigenvalue, and (2.7) holds with
If the components of are independent given , then (2.7) holds with
If in addition is exchangeable under , the penalty level of (2.20) is
Interestingly, (2.23) indicates that mixing with makes the penalty more convex.
For the non-separable spike-and-slab Lasso , the prior is hierarchical where are independent, are iid with for some given constants and , and . As and , the penalty can be written as
where are iid given with and . The penalty level is given by
2.4 Sorted concave penalties
Given a sequence of sorted penalty levels , the sorted penalty  is defined as
where is the j-th largest value among .
Here we extend the sorted penalty beyond the $\ell_1$ penalty. Given a family of univariate penalty functions and a vector with non-increasing nonnegative elements, we define the associated sorted penalty as
Although (2.26) seems to be a superficial extension of (2.25), it brings potentially significant benefits and its properties are nontrivial. We say that the sorted penalty is concave if is concave in in . In Section 4, we prove that under an RE condition, the sorted concave penalty inherits the benefits of both the concave and sorted penalties, namely bias reduction for strong signal components and adaptation of the penalty level to the unknown sparsity of .
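The two definitions above can be sketched in a few lines. The code below assumes that the sorted penalty pairs the $j$-th largest absolute coefficient with the $j$-th penalty level, both for the sorted $\ell_1$ penalty and for its extension to a general univariate family. The helper names and the choice of the MCP as the family are illustrative assumptions.

```python
def mcp(t, lam, gamma=2.0):
    """MCP in its standard closed form (assumed parametrization)."""
    a = abs(t)
    if a <= gamma * lam:
        return lam * a - a * a / (2.0 * gamma)
    return gamma * lam * lam / 2.0


def sorted_l1(b, lam):
    """Sorted l1 penalty: sum_j lam[j] * |b|_(j), with lam non-increasing."""
    mags = sorted((abs(x) for x in b), reverse=True)
    return sum(l * m for l, m in zip(lam, mags))


def sorted_penalty(b, lam, rho):
    """Sorted penalty for a univariate family rho(t, lam): sum_j rho(|b|_(j), lam[j])."""
    mags = sorted((abs(x) for x in b), reverse=True)
    return sum(rho(m, l) for m, l in zip(mags, lam))


b = [0.5, -3.0, 1.0]
lam = [2.0, 1.0, 0.5]                 # non-increasing penalty levels
print(sorted_l1(b, lam))              # 7.25
print(sorted_penalty(b, lam, mcp))    # 4.6875
```

With constant penalty levels the second function reduces to a separable penalty, and with `rho(t, lam) = lam * abs(t)` it reduces to the sorted $\ell_1$ penalty.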
The following proposition gives the penalty level and an upper bound for the maximum concavity for a broad class of sorted concave penalties, including the sorted SCAD penalty and MCP. In particular, the construction of the sorted penalty does not increase the maximum concavity in the class.
Proposition 2. Let be as in (2.26) with . Suppose with a certain non-decreasing in almost everywhere in positive . Let and . Then, the sub-differential of includes all vectors satisfying
Moreover, the maximum concavity of is no greater than that of the penalty family :
The monotonicity condition on holds for the $\ell_1$, SCAD and MCP. It follows from (2.27) that the maximum penalty level at each index is . Although the penalty level does not reach simultaneously for all as in (2.4), we still take as the penalty level for the sorted penalty . This is especially reasonable when decreases slowly in . In Subsection 4.6, we show that this weaker version of the penalty level is adequate for Gaussian errors provided that for certain
More importantly, (2.27) shows that sorted penalties automatically pick the penalty level from the sequence without requiring the knowledge of .
2.5 Local convex approximation
We develop here LCA for penalized optimization (2.1), especially for sorted penalties. As a majorization-minimization (MM) algorithm, it is closely related to and in fact very much inspired by the LQA  and LLA [34, 32, 11].
Suppose for a certain continuously differentiable convex function ,
is convex. The LCA algorithm can be written as
This LCA is clearly an MM-algorithm: As
is a convex majorization of with ,
Let be the sorted concave penalty in (2.26) with a penalty family and a vector of sorted penalty levels . Suppose is non-decreasing in almost everywhere in positive , so that Proposition 2 applies. Suppose for a certain continuously differentiable convex function
is convex in for . (2.34)
By (2.29), , the sorted penalty with , is convex in , so that the LCA algorithm for can be written as
Figure 1 demonstrates that for , the LCA with also majorizes the LLA with . With in (2.34), the LCA is identical to an unfolded LLA with . The situation is the same for separable penalties, i.e. . However, the LLA is not feasible for truly sorted concave penalties with . As the LCA also majorizes the LLA, it imposes a larger penalty on solutions with larger step size compared with the LLA, but this has little effect in our theoretical analysis in Subsections 4.5 and 4.6.
Algorithm 1: ISTA for LCA
where , is the reciprocal of a Lipschitz constant for or determined in the iteration by backtracking, and
is the so-called proximal mapping for convex Pen, e.g. . We may also apply FISTA  as an accelerated version of Algorithm 1.
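For orientation, the following is a minimal sketch of a generic ISTA (proximal gradient) iteration in the simplest convex case, namely the Lasso with squared loss $\|y - Xb\|_2^2/(2n)$, where the proximal mapping is coordinatewise soft thresholding. It is not the LCA-specific Algorithm 1, whose steps are not reproduced in this text. The function names, the fixed step size `step` (assumed to be the reciprocal of a Lipschitz constant of the gradient) and the iteration count are illustrative assumptions.

```python
def soft_threshold(v, t):
    """Proximal mapping of t * ||.||_1: shrink each entry toward zero by t."""
    return [max(abs(x) - t, 0.0) * (1.0 if x > 0 else -1.0) for x in v]


def ista_lasso(X, y, lam, step, n_iter=200):
    """ISTA for ||y - Xb||^2 / (2n) + lam * ||b||_1 with a fixed step size."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(n_iter):
        # Gradient of the smooth part: -X^T (y - Xb) / n.
        resid = [y[i] - sum(X[i][j] * b[j] for j in range(p)) for i in range(n)]
        grad = [-sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(p)]
        # Gradient step followed by the proximal step.
        b = soft_threshold([b[j] - step * grad[j] for j in range(p)], step * lam)
    return b


# Orthogonal toy design with ||x_j||^2 = n = 2, so X^T X / n is the identity
# and step = 1 is the reciprocal of the Lipschitz constant of the gradient.
X = [[1.0, 1.0], [1.0, -1.0]]
y = [3.0, 1.0]
print(ista_lasso(X, y, lam=1.5, step=1.0))  # [0.5, 0.0]
```

In this orthogonal example the iteration reaches its fixed point after one step, since the Lasso solution is the soft-thresholded vector $X^\top y/n = (2, 1)$ at level 1.5.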
Algorithm 2: FISTA for LCA
For sorted penalties , the proximal mapping is not separable but still preserves the sign and ordering in absolute value of the input. Thus, after removing the sign and sorting the input and output simultaneously, it can be solved with the isotonic proximal mapping,
with . Moreover, similar to the computation of the proximal mapping for the Slope in , this isotonic proximal mapping can be computed by the following algorithm.
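The algorithm itself is not reproduced in this text. As a point of comparison only, the following is a sketch of the well-known stack-based computation of the proximal mapping for the sorted $\ell_1$ penalty (the Slope case), which follows the recipe described above: remove signs, sort, solve an isotonic problem by pooling adjacent violators, clip at zero, and undo the sorting. It is an assumption that the isotonic proximal mapping for the LCA has the same overall structure; the function name is hypothetical.

```python
def prox_sorted_l1(v, lam):
    """Minimize 0.5 * ||x - v||^2 + sum_j lam[j] * |x|_(j) over x.

    lam must be non-increasing and nonnegative, with len(lam) == len(v).
    """
    p = len(v)
    # Indices ordered by decreasing absolute value of the input.
    order = sorted(range(p), key=lambda j: abs(v[j]), reverse=True)
    z = [abs(v[j]) - l for j, l in zip(order, lam)]
    # Pool adjacent violators so that block averages are strictly decreasing.
    blocks = []                       # each block is [sum, count]
    for zi in z:
        blocks.append([zi, 1])
        # Merge while the previous block average is <= the last block average.
        while len(blocks) > 1 and \
                blocks[-2][0] * blocks[-1][1] <= blocks[-1][0] * blocks[-2][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    # Clip at zero, then restore the original positions and signs.
    out = [0.0] * p
    k = 0
    for s, c in blocks:
        val = max(s / c, 0.0)
        for _ in range(c):
            j = order[k]
            out[j] = val if v[j] >= 0 else -val
            k += 1
    return out


print(prox_sorted_l1([1.0, -3.0, 2.5], [2.0, 1.0, 0.5]))  # [0.5, -1.25, 1.25]
```

In the example, the two largest magnitudes 3 and 2.5 are pooled to a common value because subtracting the levels 2 and 1 would otherwise reverse their order. With constant levels the mapping reduces to coordinatewise soft thresholding.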