Many regression and learning problems aim at identifying important explanatory factors for predicting the response variable, where each explanatory factor may be represented by a group of derived input variables (see, e.g., [9, 35, 14, 21, 25, 36, 12, 23]). Let a collection of index sets represent the group structure of the explanatory factors. This class of regression problems can be stated via the following observation model
where is the true (unknown) coefficient vector, is the design matrix corresponding to the th factor, and is the noise vector. Clearly, when each group consists of a single variable, (1) reduces to the common linear regression model.
Sparse estimation, which uses penalization or regularization techniques to perform variable selection and estimation simultaneously, has become a mainstream approach, especially for high-dimensional data. In the past decade, a number of popular penalized estimators were proposed, including the -type estimators such as the Lasso and the Dantzig selector, and the nonconvex penalized estimators such as the SCAD and the MCP. For the model (1), one may favor the -norm regularized estimator owing to its computational simplicity (see, e.g., [3, 35]), but this estimator inherits the bias of the Lasso. The major reason for this dilemma is the significant difference between the -norm and the zero-norm (or cardinality function). To enhance the quality of the -type selector, some researchers focused on estimators induced by nonconvex surrogates of the zero-norm regularized problem, such as the bridge, the SCAD and the MCP penalties. In particular, algorithms were also developed for computing these nonconvex penalized estimators [10, 41, 20, 4, 5]; for example, the local quadratic approximation (LQA) algorithm and the local linear approximation (LLA) algorithm. Recently, Fan, Xue and Zou also provided a unified theory that shows explicitly how to obtain the oracle solution via the LLA algorithm for the class of folded concave penalized estimators, which covers the SCAD and MCP as special cases.
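For reference, the SCAD and MCP penalty functions mentioned above can be sketched as follows. This is a minimal NumPy sketch using our own parameter names `lam`, `a` and `gamma` (not the paper's notation); the default `a = 3.7` for SCAD is the value suggested in the original SCAD paper.

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty evaluated elementwise on |t|: linear near zero (Lasso-like),
    a quadratic transition on (lam, a*lam], and constant beyond a*lam, so that
    large coefficients are left essentially unpenalized (near-unbiasedness)."""
    t = np.abs(np.asarray(t, dtype=float))
    small = t <= lam
    mid = (t > lam) & (t <= a * lam)
    out = np.full_like(t, lam ** 2 * (a + 1) / 2.0)   # flat part for t > a*lam
    out[small] = lam * t[small]
    out[mid] = (2 * a * lam * t[mid] - t[mid] ** 2 - lam ** 2) / (2 * (a - 1))
    return out

def mcp_penalty(t, lam, gamma=3.0):
    """MCP evaluated elementwise on |t|: a quadratically flattened Lasso penalty
    up to gamma*lam, constant afterwards."""
    t = np.abs(np.asarray(t, dtype=float))
    small = t <= gamma * lam
    out = np.full_like(t, 0.5 * gamma * lam ** 2)     # flat part
    out[small] = lam * t[small] - t[small] ** 2 / (2 * gamma)
    return out
```

Both penalties agree with the Lasso near the origin and level off for large arguments, which is the "folded concave" shape that removes the Lasso's bias on large coefficients.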
Let and for . In this work, we are interested in the following group zero-norm regularized estimator
where is the regularization parameter, denotes the zero-norm of a vector, and for some . Here, the simple constraint is imposed on (2) in order to guarantee that the group zero-norm estimator is well-defined. In fact, similar simple constraints are also used for the -regularized models (see ). The estimator may be unacceptable to many statisticians since, compared with the convex -regularized estimator and the nonconvex SCAD and MCP penalized estimators, it appears intractable due to the combinatorial nature of the zero-norm. The main motivation for us to study such an estimator comes from the following facts:
Good group sparsity and unbiasedness of . By the definition of , clearly, can automatically set small estimated coefficients to zero, which effectively reduces the model complexity. In addition, by following the analysis in [10, Section 2], is nearly unbiased when is orthonormal and the true coefficients are not too small.
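For concreteness, the group zero-norm in the penalty simply counts the groups whose coefficient block is nonzero; a minimal sketch (the names `beta` and `groups` are ours, not the paper's notation):

```python
import numpy as np

def group_zero_norm(beta, groups, tol=1e-10):
    """Group zero-norm: the number of groups whose coefficient block is nonzero,
    i.e. the count of j with ||beta_{G_j}|| != 0.  The tolerance `tol` guards
    against floating-point noise."""
    return sum(int(np.linalg.norm(beta[list(G)]) > tol) for G in groups)
```

A coefficient vector is group-sparse precisely when this count is small relative to the total number of groups.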
The estimator is the restriction of the SCAD and MCP over the ball . The SCAD and MCP estimators have been well studied in the past ten years, but few works discuss their relation to the zero-norm regularized estimator beyond noting that they are effective nonconvex surrogates of the latter. In Section 2, we shall show that the SCAD and MCP functions arise from the global exact penalty for the equivalent MPEC (mathematical program with equilibrium constraints) of (2), and so is the restriction of the SCAD and MCP estimators over the ball .
Specifically, by means of the variational characterization of the zero-norm, the group zero-norm regularized problem (2) can be rewritten as an MPEC. We show that the penalty problem, obtained by moving the equilibrium constraint into the objective, is a global exact penalty for the MPEC in the sense that it has the same global optimal solution set as the MPEC. Consequently, one may approach the estimator by solving a global exact penalization problem. This result is significant since, on one hand, the global exact penalty is a Lipschitz continuous optimization problem whose objective function possesses a structure favorable to the design of effective algorithms; and on the other hand, it provides a recipe for delivering equivalent DC (difference of convex functions) penalized functions whose global minimizers provide an estimator with the three desirable properties stated in ; for example, the popular SCAD and MCP penalized estimators.
By the biconvex structure of the global exact penalty, we solve it in an alternating way and develop a multi-stage convex relaxation approach, called GEP-MSCRA, for computing (see Section 3). The GEP-MSCRA consists of solving a sequence of weighted -norm regularized subproblems. In this sense, it is similar to the LLA algorithm for the nonconcave penalized likelihood model. However, it is worth emphasizing that the start-up of the LLA algorithm depends implicitly on an initial estimator , while the start-up of the GEP-MSCRA depends explicitly on a dual variable . In addition, the subproblems involved may differ since the subproblems of the LLA are obtained from the primal angle, while those of the GEP-MSCRA are derived from the primal-dual angle. For the proposed GEP-MSCRA, under a restricted strong convexity (RSC) assumption on , we verify in Section 4 that the error bound of the iterates to the true decreases as the number of iterations increases, and that if the smallest nonzero group element of is not too small, the iterates after finitely many steps coincide with the oracle solution, and hence the support of
is exactly identified. Since the RSC assumption holds with high probability, the GEP-MSCRA has theoretical guarantees in a statistical sense.
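The overall structure of such a multi-stage scheme can be sketched as follows. This is only a schematic reading of the approach: the subproblem solver below is a plain proximal-gradient method rather than the semismooth Newton ALM used in the paper, and the weight-update rule (dropping the penalty on groups that appear active) is purely illustrative, whereas the paper derives the weights from the dual variable of the exact penalty.

```python
import numpy as np

def group_soft_threshold(v, tau):
    """Prox of tau * ||v||_2 for a single group block v (group soft-thresholding)."""
    nrm = np.linalg.norm(v)
    return np.zeros_like(v) if nrm <= tau else (1.0 - tau / nrm) * v

def weighted_group_lasso(X, y, groups, lam, weights, n_iter=500):
    """Proximal gradient for min_b 0.5*||X b - y||^2 + lam * sum_j w_j ||b_{G_j}||_2."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the smooth part
    b = np.zeros(p)
    for _ in range(n_iter):
        z = b - X.T @ (X @ b - y) / L      # gradient step
        for j, G in enumerate(groups):     # groupwise prox step
            b[G] = group_soft_threshold(z[G], lam * weights[j] / L)
    return b

def gep_mscra_sketch(X, y, groups, lam, n_stages=5, thresh=1e-1):
    """Schematic multi-stage convex relaxation: each stage solves a weighted
    group-lasso subproblem, then updates the weights.  Stage 0 uses uniform
    weights (plain group lasso); later stages stop penalizing groups whose
    estimated block looks active, which debiases the large groups."""
    weights = np.ones(len(groups))
    for _ in range(n_stages):
        b = weighted_group_lasso(X, y, groups, lam, weights)
        weights = np.array([1.0 if np.linalg.norm(b[G]) <= thresh else 0.0
                            for G in groups])
    return b
```

On a small noiseless problem with one active group, the later stages recover the active block essentially without shrinkage while keeping the inactive groups exactly at zero, illustrating the error-decreasing behavior established in Section 4.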
We implement the GEP-MSCRA by solving the weighted -norm regularized subproblems with a semismooth Newton ALM. The semismooth Newton ALM is a dual method that can solve the weighted -norm regularized problems more efficiently than existing first-order methods by exploiting the second-order information of the objective function in an economical way. As illustrated in [16, Section 3.3], the dual structure of the weighted -norm regularized problems implies a nonsingular generalized Hessian matrix, which is well suited to the semismooth Newton method. We compare the performance of the GEP-MSCRA with that of the SLEP and the MALSAR in , the solvers for the unconstrained weighted -norm regularized least squares problems, on synthetic group sparse regression problems and real multi-task learning problems, respectively. Numerical comparisons demonstrate that the GEP-MSCRA has a remarkable advantage in reducing error and achieving exact group sparsity, although it requires a little more time than the SLEP and the MALSAR do; for example, for the synthetic group sparse regression problems, the GEP-MSCRA reduces the relative recovery error of the SLEP at least for the design matrix of Gaussian or sub-Gaussian type (see the first four subfigures in Figure 3), and for the real (School data) multi-task learning, the GEP-MSCRA reduces the prediction error of the MALSAR at least when more than examples are used as training samples (see Figure 8).
To close this section, we introduce some necessary notation. We denote by the spectral norm and by the maximum column norm of , respectively. Let and denote the vector of all ones and the identity matrix, respectively, whose dimensions are clear from the context. For a convex function , denotes its conjugate; for a given closed set , denotes the indicator function of , i.e., if and otherwise; and for a given index set , we write and , and denote by the characteristic function of , i.e., if and otherwise.
2 A new perspective on the estimator
We shall examine the estimator from an equivalent MPEC of (2) and a global exact penalty of this MPEC, and conclude that can be obtained by solving an exact penalty problem constructed by moving the complementarity (or equilibrium) constraint into the objective of the MPEC. For convenience, we write and denote by the Lipschitz constant of relative to the set . One will see that the results of this section also apply to a general continuous loss function.
Let denote the family of closed proper convex functions satisfying and , where is the unique minimizer of over . Let be the minimum element of such that , where denotes the subdifferential mapping of . The existence of such an element is guaranteed by Lemma 4.
Now we show that with an arbitrary , the problem (2) can be rewritten as an MPEC. Fix an arbitrary and . By the definition of , one may check that
This variational characterization of means that the problem (2) is equivalent to
in the following sense: if is globally optimal to (2), then is a global optimal solution of (4); and conversely, if is a global optimal solution of (4), then is globally optimal to (2) with . This means that the difficulty in computing the estimator comes from the following equilibrium condition
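The variational lifting and the equilibrium constraint above can be made concrete in their simplest instance (essentially taking the member φ(t) = t of the family Φ; the paper's general form allows any φ ∈ Φ). For a vector z ∈ ℝ^p one has

```latex
\|z\|_0 \;=\; \min_{w\in[0,1]^p}\Big\{\,\sum_{i=1}^{p}(1-w_i)\;:\;\langle w,\,|z|\rangle=0\Big\},
```

since if z_i ≠ 0 the complementarity constraint forces w_i = 0 and the i-th term contributes 1, while if z_i = 0 one may take w_i = 1 and the term contributes 0. Lifting the problem with the auxiliary variable w in this way is exactly what produces the MPEC structure.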
It is well known that MPECs form a class of very difficult optimization problems. In the past two decades, there has been active research on their theory and algorithms, especially for those over polyhedral cones, and interested readers may refer to [17, 34]. We note that most existing algorithms are generic and inappropriate for solving (4). To handle the tough equilibrium constraint, we here consider its penalized version
The proof of Theorem 2.1 is included in Appendix B. From the proof, we see that the constraint in (2) is also necessary for establishing the exact penalty for the MPEC. By Theorem 2.1 and the equivalence between (4) and (2), the estimator can be computed by solving a single penalty problem (6) associated with . Since , the function is Lipschitzian relative to by [27, Theorem 10.4], and so is the objective function of (6) relative to its feasible set . Thus, compared with the discontinuous nonconvex problem (2), the problem (6) is at least a Lipschitz-type one, for which the Clarke generalized differential can be used in its analysis.
Interestingly, the equivalence between (2) and (6) also yields a mechanism for producing equivalent DC surrogates for the group zero-norm , and the popular SCAD function and MCP function are among its products. Next we demonstrate this fact. For each , let be the associated closed proper convex function:
By using the function , the problem (6) can be rewritten in the following compact form
Together with the definition of conjugate functions and the above discussion, we have
This means that the following function provides an equivalent DC surrogate for :
which, by setting for some constant , implies that
Thus, when , by taking and , the function in (9) reduces to the SCAD function in . Similarly, when is chosen as the one in Example 2.2, by taking and , the function in (9) becomes the MCP function in .
Take with for , where is a constant. Clearly, with and . After a simple computation,
Let with where is a constant and . Clearly, with and . An elementary calculation yields that takes the following form
Let with for , where is a small constant. It is not hard to check that with and . Also, with
Proof: Since is a local optimal solution of (6) associated with , there exists such that for all with ,
Fix an arbitrary with where denotes the feasible set of (4). Then, from the last inequality, it immediately follows that
This shows that is a locally optimal solution of the problem (4).
Let where is the minimal nonzero component of . Fix an arbitrary . We shall establish the following inequality
and consequently is a local optimal solution to (2). If ,
3 GEP-MSCRA for computing the estimator
From the last section, to compute the estimator , one only needs to solve the penalty problem (6) associated with for , where the threshold is easily estimated once is given since . For a given , although the problem (6) is nonconvex due to the coupled term , its special structure makes it much easier to cope with than the problem (2). Specifically, when the variable is fixed, the problem (6) reduces to a convex minimization in ; and when the variable is fixed, it reduces to a convex minimization in , which, as will be shown below, has a closed-form solution. Motivated by this, we propose a multi-stage convex relaxation approach for computing by solving (6) in an alternating way.
Algorithm 3.1 (GEP-MSCRA for computing )
(S.0) Choose and an initial . Set and .
(S.1) Solve the following minimization problem
(a) By the definition of , clearly, is an optimal solution to (13) if and only if . Since is a convex function in , the subdifferential is easily characterized by ; for example, for the function , it holds that
Thus, the main computational work in each iteration of the GEP-MSCRA is to solve (12).
(b) When , since for , we have for all , which means that the subproblem (12) for has a form similar to the one obtained by applying the linear approximation technique in to . Together with part (a), the GEP-MSCRA is analogous to the LLA algorithm in for nonconvex penalized LS problems, except for the start-up and the weights. The initial subproblem of the GEP-MSCRA depends explicitly on the dual variable , while that of the LLA algorithm depends implicitly on an initial estimator . This means that the start-up of the GEP-MSCRA is more easily controlled.
4 Statistical guarantees
For convenience, throughout this section, we denote by the group support of the true , i.e., , and write . With and an integer , we define
Recall that the matrix is said to satisfy the RSC condition with constant over a set if
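In a commonly used form (the paper's exact scaling and constants may differ), the RSC condition with constant κ > 0 over a set S reads

```latex
\frac{1}{n}\,\|Xv\|_2^2 \;\ge\; \kappa\,\|v\|_2^2 \qquad \text{for all } v \in S,
```

i.e., the quadratic loss is strongly convex not on all of ℝ^p (impossible when p > n) but along the restricted set of directions relevant to group-sparse errors.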
In this section, under an RSC assumption on over , we shall establish an error bound for the iterate to and verify that the error sequence strictly decreases as increases; moreover, if the nonzero group vectors of are not too small, the iterate of the GEP-MSCRA after finitely many steps satisfies . Throughout the analysis, we assume that the components of the noise vector are independent (not necessarily identically distributed) sub-Gaussian random variables, i.e., the following assumption holds.
Assume that are independent (but not necessarily identically distributed) sub-Gaussian random variables, i.e., there exists such that for all and
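For reference, the standard sub-Gaussian condition (with a variance proxy σ² that we introduce here for illustration) can be written via the moment generating function as

```latex
\mathbb{E}\big[e^{t\varepsilon_i}\big] \;\le\; e^{\sigma^2 t^2/2}
\qquad \text{for all } t\in\mathbb{R},\ i=1,\dots,n,
```

which implies the tail bound \(\Pr(|\varepsilon_i|\ge s)\le 2e^{-s^2/(2\sigma^2)}\) for all s ≥ 0; Gaussian and all bounded noise distributions satisfy it.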
The proofs of the main results of this section are all included in Appendix C.
4.1 Theoretical performance bounds
First of all, we characterize the error bound for every iterate to the true vector .
Let . Suppose that satisfies the RSC condition with constant over , and . If , then
(a) When , the subproblem (12) reduces to the -regularized least squares problem, and the bound in (14) has the same order as the one in [24, Corollary 4], except that the coefficient there is improved to . From the choice interval of , the worst bound of is , and that of for is .
(b) The restriction on and implies that . Such a restriction on is also required in the analysis of the -regularized LS estimator [24, 18]. The choice interval of depends on the RSC property of over and on the noise level. Clearly, for problems in which has a better RSC property over or the noise is smaller, there is a larger choice interval for the parameter . In addition, a larger and a smaller deliver a larger choice interval for .
Theorem 4.1 provides an error bound for every iterate of the GEP-MSCRA, but it is unclear whether the error bound for the current iterate is better than that of the previous iterate , i.e., whether the error bound sequence is decreasing. We resolve this question by bounding for with , where . The following theorem states this main result; its proof involves the index sets