1 Introduction
Many regression and learning problems aim at finding important explanatory factors in predicting the response variable, where each explanatory factor may be represented by a group of derived input variables (see, e.g.,
[9, 35, 14, 21, 25, 36, 12, 23]). The most common example is the multifactor analysis-of-variance (ANOVA) problem, in which each factor may have several levels and can be expressed through a group of dummy variables. Let $\mathcal{J}=\{J_1,J_2,\ldots,J_m\}$ be a collection of index sets representing the group structure of the explanatory factors, where $J_i\cap J_j=\emptyset$ for all $i\neq j$ and $\bigcup_{i=1}^{m}J_i=\{1,\ldots,p\}$. This class of regression problems can be stated via the following observation model
(1)   $b=\sum_{i=1}^{m}X_i\beta^{*}_{J_i}+\varepsilon,$
where $\beta^{*}\in\mathbb{R}^{p}$ is the true (unknown) coefficient vector, $X_i$ is the $n\times|J_i|$ design matrix corresponding to the $i$th factor, and $\varepsilon\in\mathbb{R}^{n}$ is the noise vector. Clearly, when $|J_i|=1$ for $i=1,\ldots,m$, (1) reduces to the common linear regression model.
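As a small illustration of the grouped observation model (1), the following sketch simulates group-sparse data; all sizes, the noise level, and variable names are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

# A minimal sketch of the grouped observation model b = sum_i X_i beta*_{J_i} + eps,
# where the coefficient vector is partitioned into m disjoint groups J_1, ..., J_m.
rng = np.random.default_rng(0)

n, m, group_size = 100, 10, 5          # samples, groups, variables per group
p = m * group_size
groups = [np.arange(i * group_size, (i + 1) * group_size) for i in range(m)]

X = rng.standard_normal((n, p))        # full design matrix (columns of all factors)
beta_true = np.zeros(p)
for g in groups[:3]:                   # only 3 groups are active (group sparsity)
    beta_true[g] = rng.standard_normal(group_size)

eps = 0.1 * rng.standard_normal(n)     # noise vector
b = X @ beta_true + eps                # observation model (1)

# group zero-norm: number of groups whose sub-vector is nonzero
group_zero_norm = sum(np.linalg.norm(beta_true[g]) > 0 for g in groups)
print(group_zero_norm)                 # -> 3
```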
Sparse estimation, using penalization or regularization technique to perform variable selection and estimation simultaneously, has become a mainstream approach especially for highdimensional data
[6]. In the past decade, a number of popular penalized estimators were proposed one after another, including the $\ell_1$-type estimators such as the Lasso [31] and the Dantzig selector [8], and the nonconvex penalized estimators such as the SCAD [10] and the MCP [39]. For the model (1), one may embrace the $\ell_{2,1}$-norm regularized estimator due to its computational simplicity (see, e.g., [3, 35]), but this estimator inherits the bias of the Lasso. The major reason for this dilemma is the significant difference between the $\ell_1$-norm and the zero-norm (or cardinality function). To enhance the quality of the $\ell_1$-type selectors, some researchers focused on the estimators induced by nonconvex surrogates of the zero-norm regularized problem, such as the bridge [13], the SCAD [10] and the MCP [39]. In particular, algorithms were also developed for computing these nonconvex penalized estimators [10, 41, 20, 4, 5]; for example, the local quadratic approximation (LQA) algorithm [10] and the local linear approximation (LLA) algorithm [41]. Recently, Fan, Xue and Zou [11] also provided a unified theory showing explicitly how to obtain the oracle solution via the LLA algorithm for the class of folded concave penalized estimators, which covers the SCAD and MCP as special cases.

Let $\beta_{J_i}\in\mathbb{R}^{|J_i|}$ denote the subvector of $\beta$ indexed by $J_i$ for $i=1,\ldots,m$. In this work, we are interested in the following group zero-norm regularized estimator
(2)   $\hat{\beta}\in\mathop{\arg\min}_{\|\beta\|\le R}\Big\{\tfrac{1}{2}\|b-X\beta\|^{2}+\lambda\big\|\big(\|\beta_{J_1}\|,\ldots,\|\beta_{J_m}\|\big)\big\|_{0}\Big\}$
where $\lambda>0$ is the regularization parameter, $\|\cdot\|_0$ denotes the zero-norm (the number of nonzero entries) of a vector, and $R>0$ is some constant. Here, the simple ball constraint $\|\beta\|\le R$ is imposed in (2) in order to guarantee that the group zero-norm estimator $\hat{\beta}$ is well-defined. In fact, similar simple constraints are also used for zero-norm regularized models (see [2]). The estimator $\hat{\beta}$ may seem unacceptable to many statisticians since, compared with the convex $\ell_{2,1}$-norm regularized estimator and the nonconvex SCAD and MCP penalized estimators, $\hat{\beta}$ appears unapproachable due to the combinatorial property of the zero-norm. The main motivation for us to study such an estimator comes from the following facts:

Good group sparsity and unbiasedness of $\hat{\beta}$. By the definition of $\hat{\beta}$, clearly, $\hat{\beta}$ can automatically set small estimated coefficients to zero, which reduces the model complexity well. In addition, by following the analysis in [10, Section 2], $\hat{\beta}$ is nearly unbiased when the design is orthonormal and the true coefficients are not too small.
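To make the first fact concrete: for an orthonormal design, the group zero-norm penalized problem decouples across groups, and its solution is group hard-thresholding, which zeroes groups with small norm and keeps large groups entirely unshrunken (hence the near-unbiasedness). This is a hedged sketch; the function name and the threshold value $\sqrt{2\lambda}$ come from the decoupled scalar problem $\min_\beta \tfrac12(z-\beta)^2+\lambda\,\mathbb{1}[\beta\neq 0]$, not from the paper.

```python
import numpy as np

def group_hard_threshold(z, groups, lam):
    """Group-wise solution of min 0.5||z - beta||^2 + lam * (group zero-norm):
    keep a group untouched if its norm exceeds sqrt(2*lam), else zero it."""
    out = np.zeros_like(z)
    for g in groups:
        if np.linalg.norm(z[g]) > np.sqrt(2 * lam):
            out[g] = z[g]              # kept without any shrinkage: no bias
    return out

z = np.array([3.0, 0.1, 0.05, 2.5])
groups = [np.array([0]), np.array([1, 2]), np.array([3])]
out = group_hard_threshold(z, groups, 0.5)
print(out)   # middle group (small norm) is zeroed; the others are kept intact
```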

The estimator $\hat{\beta}$ is the restriction of the SCAD and MCP estimators to the ball $\{\beta:\|\beta\|\le R\}$. The SCAD and MCP estimators were well studied in the past ten years, but there are few works discussing their relation to the zero-norm regularized estimator beyond the fact that they are effective nonconvex surrogates for the latter. In Section 2, we shall show that the SCAD and MCP functions arise from the global exact penalty for the equivalent MPEC (mathematical program with equilibrium constraints) of (2), and so $\hat{\beta}$ is the restriction of the SCAD and MCP estimators to the ball $\{\beta:\|\beta\|\le R\}$.
Specifically, by means of the variational characterization of the zero-norm, the group zero-norm regularized problem (2) can be rewritten as an MPEC. We show that the penalty problem, yielded by moving the equilibrium constraint into the objective, is a global exact penalty for the MPEC in the sense that it has the same global optimal solution set as the MPEC does. Consequently, one may approach the estimator $\hat{\beta}$ by solving a global exact penalization problem. This result is significant since, on one hand, the global exact penalty is a Lipschitz continuous optimization problem whose objective function possesses a structure favorable to the design of effective algorithms; on the other hand, it provides a recipe to deliver equivalent DC (difference of convex functions) penalized functions whose global minimizers provide an estimator with the three desirable properties stated in [10], for example, the popular SCAD and MCP penalized estimators.
By exploiting the biconvex structure of the global exact penalty, we solve it in an alternating way and develop a multistage convex relaxation approach, called GEP-MSCRA, for computing $\hat{\beta}$ (see Section 3). The GEP-MSCRA consists of solving a sequence of weighted $\ell_{2,1}$-norm regularized subproblems. In this sense, it is similar to the LLA algorithm [41] for the nonconcave penalized likelihood model. However, it is worth emphasizing that the startup of the LLA algorithm depends implicitly on an initial estimator, while the startup of the GEP-MSCRA depends explicitly on a dual variable. In addition, the involved subproblems may differ since the subproblems of the LLA are obtained from the primal angle, while those of the GEP-MSCRA are yielded from the primal-dual angle. For the proposed GEP-MSCRA, under a restricted strong convexity (RSC) assumption on $X$, we verify in Section 4 that the error bound of the iterates to the true $\beta^{*}$ decreases as the number of iterations increases and, if the smallest nonzero group of $\beta^{*}$ is not too small, the iterates after finitely many steps coincide with the oracle solution, and hence the support of $\beta^{*}$ is exactly identified.
[24], the GEP-MSCRA has theoretical guarantees in a statistical sense.

We implement the GEP-MSCRA by solving the weighted $\ell_{2,1}$-norm regularized subproblems with a semismooth Newton augmented Lagrangian method (ALM). The semismooth Newton ALM is a dual method that can solve the weighted $\ell_{2,1}$-norm regularized problems more efficiently than the existing first-order methods by exploiting the second-order information of the objective function in an economical way. As illustrated in [16, Section 3.3], the dual structure of the weighted $\ell_{2,1}$-norm regularized problems implies a nonsingular generalized Hessian matrix, which is well suited to the semismooth Newton method. We compare the performance of the GEP-MSCRA with that of the SLEP and the MALSAR solvers [15] for the unconstrained weighted $\ell_{2,1}$-norm regularized least squares problems, on synthetic group sparse regression problems and real multi-task learning problems, respectively. Numerical comparisons demonstrate that the GEP-MSCRA has a remarkable advantage in reducing error and achieving exact group sparsity, although it requires a little more time than the SLEP and the MALSAR do; for example, for the synthetic group sparse regression problems, the GEP-MSCRA clearly reduces the relative recovery error of the SLEP for design matrices of Gaussian or sub-Gaussian type (see the first four subfigures in Figure 3), and for the real (School data) multi-task learning problem, the GEP-MSCRA reduces the prediction error of the MALSAR when enough examples are used as training samples (see Figure 8).
To close this section, we introduce some necessary notation. For a matrix $A$, we denote by $\|A\|$ the spectral norm and by $\|A\|_{2,\infty}$ the maximum column norm of $A$, respectively. Let $e$ and $I$ denote the vector of all ones and the identity matrix, respectively, whose dimensions are known from the context. For a convex function $h$, $h^{*}$ denotes the conjugate of $h$; for a given closed set $C$, $\delta_{C}$ means the indicator function of the set $C$, i.e., $\delta_{C}(x)=0$ if $x\in C$ and $\delta_{C}(x)=+\infty$ otherwise; and for a given index set $J$, we write $z_{J}$ for the subvector of $z$ indexed by $J$, and denote by $\chi_{J}$ the characteristic function of $J$, i.e., $\chi_{J}(i)=1$ if $i\in J$ and $\chi_{J}(i)=0$ otherwise.

2 A new perspective on the estimator $\hat{\beta}$
We shall examine the estimator $\hat{\beta}$ from an equivalent MPEC of (2) and a global exact penalty of this MPEC, and conclude that $\hat{\beta}$ can be obtained by solving an exact penalty problem constructed by moving the complementarity (or equilibrium) constraint into the objective of the MPEC. For convenience, we write $f(\beta):=\frac{1}{2}\|b-X\beta\|^{2}$ and denote by $L_{f}$ the Lipschitz constant of $f$ relative to the set $\{\beta:\|\beta\|\le R\}$. One will see that the results of this section are also applicable to a general continuous loss function.
Let $\Phi$ denote the family of closed proper convex functions $\phi:\mathbb{R}\to(-\infty,+\infty]$ satisfying $[0,1]\subseteq\mathrm{int}(\mathrm{dom}\,\phi)$, $\phi(1)=1$ and $\phi(t^{*})=0$, where $t^{*}\in[0,1)$ is the unique minimizer of $\phi$ over $[0,1]$. Let $\bar{t}$ be the minimum element such that $1\in\partial\phi(\bar{t})$, where $\partial\phi$ is the subdifferential mapping of $\phi$. The existence of such $\bar{t}$ is guaranteed by Lemma 4.
Now we show that with an arbitrary $\phi\in\Phi$, the problem (2) can be rewritten as an MPEC. Fix an arbitrary $\phi\in\Phi$ and $z\in\mathbb{R}^{p}$. By the definition of $\Phi$, one may check that
(3)   $\big\|\big(\|z_{J_1}\|,\ldots,\|z_{J_m}\|\big)\big\|_{0}=\min_{w\in\mathbb{R}^{m}}\Big\{\sum_{i=1}^{m}\phi(w_{i})\ \ \mathrm{s.t.}\ \ \sum_{i=1}^{m}(1-w_{i})\|z_{J_i}\|=0,\ 0\le w\le e\Big\}$
This variational characterization of the group zero-norm means that the problem (2) is equivalent to
(4)   $\min_{\beta\in\mathbb{R}^{p},\,w\in\mathbb{R}^{m}}\Big\{\tfrac{1}{2}\|b-X\beta\|^{2}+\lambda\sum_{i=1}^{m}\phi(w_{i})\ \ \mathrm{s.t.}\ \ \sum_{i=1}^{m}(1-w_{i})\|\beta_{J_i}\|=0,\ 0\le w\le e,\ \|\beta\|\le R\Big\}$
in the following sense: if $\bar{\beta}$ is globally optimal to (2), then $(\bar{\beta},\bar{w})$ with $\bar{w}_{i}=1$ for $\bar{\beta}_{J_i}\ne 0$ and $\bar{w}_{i}=t^{*}$ otherwise is a global optimal solution of (4); and conversely, if $(\bar{\beta},\bar{w})$ is a global optimal solution of (4), then $\bar{\beta}$ is globally optimal to (2) with the same optimal value. This means that the difficulty in computing the estimator $\hat{\beta}$ comes from the following equilibrium condition
(5)   $\sum_{i=1}^{m}(1-w_{i})\|\beta_{J_i}\|=0,\qquad 0\le w\le e.$
It is this equilibrium constraint that brings in the bothersome nonconvexity of (2). Since the constraint set of (4) involves the equilibrium constraint (5), we call (4) an MPEC.
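For intuition, consider the simplest member $\phi(t)=t$ of the family $\Phi$ (a hedged illustration in generic notation; the characterization (3) holds for an arbitrary member of the family). With $e$ the vector of all ones, one has

```latex
\big\|\big(\|z_{J_1}\|,\ldots,\|z_{J_m}\|\big)\big\|_{0}
  \;=\; \min_{w\in\mathbb{R}^{m}}\Big\{\langle e,w\rangle \;:\;
        (1-w_{i})\,\|z_{J_i}\| = 0,\ i=1,\ldots,m,\ \ 0\le w\le e\Big\},
```

since the complementarity constraint forces $w_{i}=1$ exactly on the groups with $z_{J_i}\neq 0$ and allows $w_{i}=0$ elsewhere; moving the products $(1-w_{i})\|z_{J_i}\|$ into the objective with a penalty factor is precisely the exact penalization step behind (6).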
It is well known that MPECs form a class of very difficult optimization problems. In the past two decades, there has been active research on their theory and algorithms, especially for MPECs over polyhedral cones; interested readers may refer to [17, 34]. We notice that most existing algorithms are generic and not appropriate for solving (4). To handle the tough equilibrium constraint, we here consider its penalized version
(6)   $\min_{\|\beta\|\le R,\;0\le w\le e}\ \tfrac{1}{2}\|b-X\beta\|^{2}+\lambda\sum_{i=1}^{m}\Big[\phi(w_{i})+\rho(1-w_{i})\|\beta_{J_i}\|\Big]$
where $\rho>0$ is the penalty factor. The following theorem states that (6) is a global exact penalty for (4), in the sense that it has the same global optimal solution set as (4) does.
Theorem 2.1
The proof of Theorem 2.1 is included in Appendix B. From the proof, we see that the ball constraint in (2) is also necessary for establishing the exact penalty for the MPEC. By Theorem 2.1 and the equivalence between (4) and (2), the estimator $\hat{\beta}$ can be computed by solving a single penalty problem (6) associated to a suitable $\rho$. Since $\phi$ is convex and finite on $[0,1]$, it is Lipschitzian relative to $[0,1]$ by [27, Theorem 10.4], and so is the objective function of (6) relative to its feasible set. Thus, compared with the discontinuous nonconvex problem (2), the problem (6) is at least a Lipschitz-type one, for which the Clarke generalized differential [7] can be used in its analysis.
Interestingly, the equivalence between (2) and (6) also implies a mechanism to yield equivalent DC surrogates for the group zero-norm, and the popular SCAD function [10] and MCP function [39] are among the products. Next we demonstrate this fact. For each $\phi\in\Phi$, let $\psi$ be the associated closed proper convex function:
(7)   $\psi(t):=\begin{cases}\phi(t) & \text{if } t\in[0,1],\\ +\infty & \text{otherwise}.\end{cases}$
By using the function $\psi$, the problem (6) can be rewritten in the following compact form
$\min_{\|\beta\|\le R,\;w\in\mathbb{R}^{m}}\ \tfrac{1}{2}\|b-X\beta\|^{2}+\lambda\sum_{i=1}^{m}\Big[\psi(w_{i})+\rho(1-w_{i})\|\beta_{J_i}\|\Big].$
Together with the definition of conjugate functions and the above discussion, we have
(8)   $\min_{w\in\mathbb{R}^{m}}\ \sum_{i=1}^{m}\Big[\psi(w_{i})+\rho(1-w_{i})\|\beta_{J_i}\|\Big]=\sum_{i=1}^{m}\Big[\rho\|\beta_{J_i}\|-\psi^{*}\big(\rho\|\beta_{J_i}\|\big)\Big]$
This means that the following function provides an equivalent DC surrogate for the group zero-norm:
(9)   $\theta_{\rho}(\beta):=\sum_{i=1}^{m}\Big[\rho\|\beta_{J_i}\|-\psi^{*}\big(\rho\|\beta_{J_i}\|\big)\Big]$
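To see the DC mechanism at work in a concrete, hypothetical instance (not one of the paper's examples): take $\phi(t)=t^{2}$, which lies in $\Phi$, and write $\psi=\phi+\delta_{[0,1]}$ as in (7). A direct computation of the conjugate gives

```latex
\psi^{*}(s) \;=\; \max_{0\le w\le 1}\,\{sw - w^{2}\}
          \;=\; \begin{cases} 0, & s\le 0,\\[2pt]
                               s^{2}/4, & 0< s\le 2,\\[2pt]
                               s-1, & s> 2, \end{cases}
\qquad\text{so}\qquad
\rho t-\psi^{*}(\rho t)\;=\;\begin{cases} \rho t-\rho^{2}t^{2}/4, & 0\le t\le 2/\rho,\\[2pt]
                                           1, & t> 2/\rho. \end{cases}
```

The resulting scalar surrogate is a difference of the convex functions $\rho t$ and $\psi^{*}(\rho t)$, is bounded by $1$, and tends pointwise to the indicator $\mathbb{1}[t\neq 0]$ as $\rho\to\infty$, which is exactly the folded-concave (MCP-type) shape.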
In particular, when $\phi$ is chosen as the one in Example 2.1, it becomes the SCAD function. Indeed, from the expression of $\psi^{*}$ in Example 2.1 below, it follows that
which, by setting the parameters in terms of a suitable constant, implies that
Thus, with suitable choices of the parameters, the function in (9) reduces to the SCAD function in [10]. Similarly, when $\phi$ is chosen as the one in Example 2.2, with suitable choices of the parameters, the function in (9) becomes the MCP function in [39].
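For reference, the standard closed forms of the SCAD penalty of [10] (with parameters $\lambda>0$ and $a>2$) and the MCP of [39] (with $\lambda>0$ and $a>1$), written for $t\ge 0$, are

```latex
p^{\mathrm{SCAD}}_{\lambda,a}(t)=
\begin{cases}
\lambda t, & 0\le t\le\lambda,\\[3pt]
\dfrac{2a\lambda t-t^{2}-\lambda^{2}}{2(a-1)}, & \lambda<t\le a\lambda,\\[5pt]
\dfrac{(a+1)\lambda^{2}}{2}, & t>a\lambda,
\end{cases}
\qquad
p^{\mathrm{MCP}}_{\lambda,a}(t)=
\begin{cases}
\lambda t-\dfrac{t^{2}}{2a}, & 0\le t\le a\lambda,\\[5pt]
\dfrac{a\lambda^{2}}{2}, & t>a\lambda.
\end{cases}
```

Both are continuous, folded concave, and constant beyond $a\lambda$, so large coefficients incur no extra shrinkage, in line with the near-unbiasedness discussed in Section 1.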
Now we give four examples for $\phi\in\Phi$. In the sequel we shall refer to $\phi_{1}$–$\phi_{4}$ as the functions in Examples 2.1–2.4, respectively, and to $\psi_{1}$–$\psi_{4}$ as the corresponding functions defined by (7).
Example 2.1
Take with for , where is a constant. Clearly, with and . After a simple computation,
Example 2.2
Let with where is a constant and . Clearly, with and . An elementary calculation yields that takes the following form
Example 2.3
Example 2.4
Let with for , where is a small constant. It is not hard to check that with and . Also, with
Theorem 2.2
Proof: Since $(\bar{\beta},\bar{w})$ is a local optimal solution of (6) associated to $\rho$, there exists $\delta>0$ such that for all $(\beta,w)$ with $\|(\beta,w)-(\bar{\beta},\bar{w})\|\le\delta$,
Fix an arbitrary $(\beta,w)\in\mathcal{F}$ with $\|(\beta,w)-(\bar{\beta},\bar{w})\|\le\delta$, where $\mathcal{F}$ denotes the feasible set of (4). Then, from the last inequality, it immediately follows that
This shows that $(\bar{\beta},\bar{w})$ is a locally optimal solution of the problem (4).
We next argue that $\bar{\beta}$ is locally optimal to (2). Since $(\bar{\beta},\bar{w})$ is a locally optimal solution of (4), there exists $\delta'>0$ such that for all $(\beta,w)\in\mathcal{F}$ with $\|(\beta,w)-(\bar{\beta},\bar{w})\|\le\delta'$,
(10) 
Let where is the minimal nonzero component of . Fix an arbitrary . We shall establish the following inequality
(11) 
and consequently is a local optimal solution to (2). If ,
This implies that inequality (11) holds. If , by using , we have , which implies that . Thus, and . From (10), we obtain or Along with , the inequality (11) follows.
3 GEP-MSCRA for computing the estimator $\hat{\beta}$
From the last section, to compute the estimator $\hat{\beta}$, one only needs to solve the penalty problem (6) associated to a sufficiently large $\rho$, whose threshold is easily estimated once the data is given. For a given $\rho$, although the problem (6) is nonconvex due to the coupled term, its special structure makes it much easier to cope with than the problem (2). Specifically, when the variable $w$ is fixed, the problem (6) reduces to a convex minimization in $\beta$; and when the variable $\beta$ is fixed, it reduces to a convex minimization in $w$ which, as will be shown below, has a closed-form solution. Motivated by this, we propose a multistage convex relaxation approach for computing $\hat{\beta}$ by solving (6) in an alternating way.
Algorithm 3.1
(GEP-MSCRA for computing $\hat{\beta}$)
(S.0) Choose the parameters and an initial point, and set the iteration counter.
(S.1) Compute the following minimization problem

Remark 3.1
(a) By the definition of $\psi$, clearly, $w$ is an optimal solution to (13) if and only if it satisfies the corresponding subdifferential inclusion. Since $\psi$ is a convex function, its subdifferential is easily characterized by [27]; for example, for the function in Example 2.1, it holds that
Thus, the main computational work in each iteration of the GEP-MSCRA is to solve (12).
(b) When $k=1$, the subproblem (12) has a form similar to the one yielded by applying the linear approximation technique in [41]. Together with part (a), the GEP-MSCRA is analogous to the LLA algorithm in [41] for nonconvex penalized LS problems, except for the startup and the weights. We see that the initial subproblem of the GEP-MSCRA depends explicitly on the dual variable, while the initial subproblem of the LLA algorithm depends implicitly on an initial estimator. This means that the startup of the GEP-MSCRA is more easily controlled.
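The alternating scheme described above can be sketched in a few lines. This is a schematic illustration only: each weighted $\ell_{2,1}$-norm subproblem is solved by plain proximal gradient rather than the semismooth Newton ALM actually used, and the reweighting rule, tolerances, and all parameter values are simplifying assumptions.

```python
import numpy as np

def group_soft_threshold(z, groups, weights, tau):
    """Prox of tau * sum_i weights[i] * ||z_{J_i}||  (group soft-thresholding)."""
    out = np.zeros_like(z)
    for g, w in zip(groups, weights):
        nrm = np.linalg.norm(z[g])
        if nrm > tau * w:
            out[g] = (1.0 - tau * w / nrm) * z[g]
    return out

def multistage_relaxation(X, b, groups, lam=0.5, stages=3, inner=300):
    """Solve a sequence of weighted l_{2,1}-regularized LS subproblems,
    reweighting the groups between stages (schematic GEP-MSCRA-like loop)."""
    beta = np.zeros(X.shape[1])
    weights = np.ones(len(groups))           # stage 1: plain l_{2,1} subproblem
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1/L for the least-squares gradient
    for _ in range(stages):
        for _ in range(inner):               # proximal gradient on the subproblem
            grad = X.T @ (X @ beta - b)
            beta = group_soft_threshold(beta - step * grad, groups,
                                        weights, step * lam)
        # reweight: groups identified as active get weight 0, i.e. no further
        # shrinkage (a crude stand-in for the dual-variable update)
        norms = np.array([np.linalg.norm(beta[g]) for g in groups])
        weights = np.where(norms > 1e-6, 0.0, 1.0)
    return beta

# small demo: 5 groups of size 2, with only groups 0 and 2 active
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 10))
groups = [np.arange(2 * i, 2 * i + 2) for i in range(5)]
beta_true = np.zeros(10)
beta_true[0:2] = [2.0, -1.5]
beta_true[4:6] = [1.0, 2.5]
b = X @ beta_true + 0.01 * rng.standard_normal(50)
est = multistage_relaxation(X, b, groups)
print([i for i, g in enumerate(groups) if np.linalg.norm(est[g]) > 0.1])
```

Note how the zero weights on identified groups remove their shrinkage in later stages, mirroring the de-biasing effect of the multistage scheme.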
4 Statistical guarantees
For convenience, throughout this section, we denote by $S^{*}$ the group support of the true $\beta^{*}$, i.e., $S^{*}:=\{i:\beta^{*}_{J_i}\ne 0\}$, and write $s^{*}:=|S^{*}|$. With $S^{*}$ and an integer $s\ge s^{*}$, we define
Recall that the matrix $X$ is said to satisfy the RSC condition with constant $\kappa>0$ over a set $C$ if
$\frac{1}{n}\|X\Delta\|^{2}\ge\kappa\,\|\Delta\|^{2}\quad\text{for all }\Delta\in C.$
In this section, under an RSC assumption on $X$, we shall establish an error bound for each iterate to the true $\beta^{*}$ and verify that this error bound is strictly decreasing as the iteration number increases; if in addition the nonzero groups of $\beta^{*}$ are not too small, the iterate of the GEP-MSCRA after finitely many steps coincides with the oracle solution. Throughout the analysis, we assume that the components of the noise vector $\varepsilon$ are independent (not necessarily identically distributed) sub-Gaussian random variables, i.e., the following assumption holds.
Assumption 1
Assume that $\varepsilon_{1},\ldots,\varepsilon_{n}$ are independent (but not necessarily identically distributed) sub-Gaussian random variables, i.e., there exists $\sigma>0$ such that $\mathbb{E}[\exp(t\varepsilon_{i})]\le\exp(\sigma^{2}t^{2}/2)$ for all $t\in\mathbb{R}$ and $i=1,\ldots,n$.
The proofs of the main results of this section are all included in Appendix C.
4.1 Theoretical performance bounds
First of all, we characterize the error bound of every iterate to the true vector $\beta^{*}$.
Theorem 4.1
Let . Suppose that has the RSC of constant over , and If , then
(14) 
Remark 4.1
(a) When $k=1$, the subproblem (12) reduces to the $\ell_{2,1}$-norm regularized least squares problem, and the bound in (14) has the same order as the one in [24, Corollary 4], except that the coefficient there is improved. From the choice interval of the parameters, the worst bounds for the first and the subsequent iterates can be deduced.
(b) The restrictions on the parameters imply an upper bound on the number of active groups. Such a restriction is also required in the analysis of the $\ell_{2,1}$-norm regularized LS estimator [24, 18]. The choice interval of the parameter depends on the RSC property of $X$ and on the noise level: for those problems in which $X$ has a better RSC property or the noise is smaller, there is a larger choice interval for the parameter, and a larger RSC constant together with a smaller noise level delivers a larger choice interval.
Theorem 4.1 provides an error bound for every iterate of the GEP-MSCRA, but it is unclear whether the error bound for the current iterate is better than that of the previous iterate, i.e., whether the error bound sequence is decreasing. We resolve this problem by bounding the error of each iterate in terms of that of its predecessor. The following theorem states this main result, and its proof involves the index sets
(15) 
Theorem 4.2
Suppose that