# GEP-MSCRA for computing the group zero-norm regularized least squares estimator

This paper concerns with the group zero-norm regularized least squares estimator which, in terms of the variational characterization of the zero-norm, can be obtained from a mathematical program with equilibrium constraints (MPEC). By developing the global exact penalty for the MPEC, this estimator is shown to arise from an exact penalization problem that not only has a favorable bilinear structure but also implies a recipe to deliver equivalent DC estimators such as the SCAD and MCP estimators. We propose a multi-stage convex relaxation approach (GEP-MSCRA) for computing this estimator, and under a restricted strong convexity assumption on the design matrix, establish its theoretical guarantees which include the decreasing of the error bounds for the iterates to the true coefficient vector and the coincidence of the iterates after finite steps with the oracle estimator. Finally, we implement the GEP-MSCRA with the subproblems solved by a semismooth Newton augmented Lagrangian method (ALM) and compare its performance with that of SLEP and MALSAR, the solvers for the weighted ℓ_2,1-norm regularized estimator, on synthetic group sparse regression problems and real multi-task learning problems. Numerical comparison indicates that the GEP-MSCRA has significant advantage in reducing error and achieving better sparsity than the SLEP and the MALSAR do.

• 4 publications
• 6 publications
04/30/2018

### Equivalent Lipschitz surrogates for zero-norm and rank optimization problems

This paper proposes a mechanism to produce equivalent Lipschitz surrogat...
10/25/2018

### Nuclear Norm Regularized Estimation of Panel Regression Models

In this paper we investigate panel regression models with interactive fi...
11/11/2019

### Error bound of local minima and KL property of exponent 1/2 for squared F-norm regularized factorization

This paper is concerned with the squared F(robenius)-norm regularized fa...
10/04/2010

### Regularizers for Structured Sparsity

We study the problem of learning a sparse linear regression vector under...
08/24/2019

### KL property of exponent 1/2 of ℓ_2,0-norm and DC regularized factorizations for low-rank matrix recovery

This paper is concerned with the factorization form of the rank regulari...
04/04/2011

### Robust Nonparametric Regression via Sparsity Control with Application to Load Curve Data Cleansing

Nonparametric methods are widely applicable to statistical inference pro...
06/11/2018

### Adaptive Denoising of Signals with Shift-Invariant Structure

We study the problem of discrete-time signal denoising, following the li...

## 1 Introduction

Many regression and learning problems aim at finding important explanatory factors in predicting the response variable, where each explanatory factor may be represented by a group of derived input variables (see, e.g.,

[9, 35, 14, 21, 25, 36, 12, 23]

). The most common example is the multifactor analysis-of-variance problem, in which each factor may have several levels and can be expressed through a group of dummy variables. Let

be a collection of index sets to represent the group structure of explanatory factors, where for all and . This class of regression problems can be stated via the following observation model

 b=∑mi=1AJi¯¯¯xJi+ε, (1)

where is the true (unknown) coefficient vector, is an design matrix corresponding to the th factor, and is the noise vector. Clearly, when for , (1

) reduces to the common linear regression model.

Sparse estimation, using penalization or regularization technique to perform variable selection and estimation simultaneously, has become a mainstream approach especially for high-dimensional data

[6]. In the past decade, some popular penalized estimators were proposed one after another, including the -type estimators such as the Lasso [31] and the Dantzig [8], and the nonconvex penalized estimators such as the SCAD [10] and the MCP [39]. For the model (1), one may embrace the -norm regularized estimator due to the simplicity of computation (see, e.g., [3, 35]), but this estimator inherits the bias of the Lasso. The major reason for this dilemma is the significant difference between the -norm and the zero-norm (or cardinality function). To enhance the quality of the -type selector, some researchers focused on the estimator induced by nonconvex surrogates for the zero-norm regularized problem, such as the bridge [13], the SCAD [10] and the MCP [39]. In particular, some algorithms were also developed for computing these nonconvex penalized estimators [10, 41, 20, 4, 5]; for example, the local quadratic approximation (LQA) algorithm [10] and the local linear approximation approximation (LLA) algorithm [41]. Recently, Fan, Xue and Zou [11] also provided a unified theory to show explicitly how to obtain the oracle solution via the LLA algorithm for the class of folded concave penalized estimators, which covers the SCAD and MCP as special cases.

Let and for . In this work, we are interested in the following group zero-norm regularized estimator

 ˆx∈argminx∈Ω{ν2n∥Ax−b∥2+∥G(x)∥0}, (2)

where is the regularization parameter, denotes the zero-norm of a vector, and for some . Here, the simple constraint is imposed to (2) in order to guarantee that the group zero-norm estimator is well-defined. In fact, similar simple constraints are also used for the -regularized models (see [2]). The estimator may be unacceptable for many statisticians since, compared with the convex -regularized estimator and the nonconvex SCAD and MCP penalized estimators, it seems that is unapproachable due to the combinatorial property of the zero-norm. The main motivation for us to study such an estimator comes from the following facts:

• Good group sparsity and unbiasedness of . By the definition of , clearly, can automatically set small estimated coeffcients to be zero, which reduces well the model complexity. In addition, by following the analysis in [10, Section 2], is nearly unbiased when is orthonormal and the true coefficients are not too small.

• The estimator is the restriction of the SCAD and MCP over the ball . The SCAD and MCP estimators were well studied in the past ten years, but there are few works to discuss their relation with the zero-norm regularized estimator except that they are effective nonconvex surrogates of the latter. In Section 2, we shall show that the SCAD and MCP functions arise from the global exact penalty for the equivalent MPEC (mathematical program with equilibrium constraints) of (2), and so is the restriction of the SCAD and MCP estimators over the ball .

• Approachability of the estimator . As will be shown in Section 3, with the global exact penalty for the MPEC of (2) which is actually a primal-dual equivalent model of (2), there is a large space to design efficient algorithms for computing .

Specifically, by means of the variational characterization of the zero-norm, the group zero-norm regularized problem (2) can be rewritten as an MPEC. We show that the penalty problem, yielded by moving the equilibrium constraint into the objective, is a global exact penalty for the MPEC in the sense that it has the same global optimal solution set as the MPEC does. Consequently, one may approach the estimator by solving a global exact penalization problem. This result is significant since, on one hand, the global exact penalty is an Lipschitz continuous optimization problem whose objective function possesses a structure favorable to the design of effective algorithms; and on the other hand, it provides a recipe to deliver equivalent DC (difference of convex functions) penalized functions whose global minimizers provide an estimator with three desirable properties stated in [10]; for example, the popular SCAD and MCP penalized estimators.

By the biconvex structure of the global exact penalty, we solve it in an alternating way and develop a multi-stage convex relaxation approach called GEP-MSCRA for computing (see Section 3). The GEP-MSCRA consists of solving a sequence of weighted -norm regularized subproblems. In this sense, it is similar to the LLA algorithm [41] for the nonconcave penalized likelihood model. However, it is worth emphasizing that the start-up of the LLA algorithm depends implicitly on an initial estimator , while the start-up of the GEP-MSCRA depends explicitly on a dual variable . In addition, the involving subproblems may be different since the subproblems of the LLA are obtained from the primal angle, and those of the GEP-MSCRA are yielded from the primal-dual angle. For the proposed GEP-MSCRA, under a restricted strong convex (RSC) assumption on , we verify in Section 4 that the error bounds of the iterates to the true is decreasing as the number of iterates increases, and if the smallest nonzero group element of is not too small, the iterates after finite steps will coincide with the oracle solution, and hence the support of

is exactly identified. Since the RSC assumption holds with high probability by

[24], the GEP-MSCRA has the theoretical guarantees in a statistical sense.

We implement the GEP-MSCRA by solving the weighted -norm regularized subproblems with a semismooth Newton ALM. The semismooth Newton ALM is a dual method that can solve the weighted -norm regularized problems more efficiently than the existing first-order methods by exploiting the second-order information of the objective function in an economic way. As illustrated in [16, Section 3.3], the dual structure of the weighted -norm regularized problems implies a nonsingular generalized Hessian matrix, which is well suitable for the semismooth Newton method. We compare the performance of the GEP-MSCRA with that of the SLEP and the MALSAR in [15], the solvers to the unconstrained weighted -norm regularized least squares problems on synthetic group sparse regression problems and real multi-task learning problems, respectively. Numerical comparisons demonstrates that the GEP-MSCRA has a remarkable advantage in reducing error and achieving the exact group sparsity although it requires a little more time than the SLEP and the MALSAR do; for example, for the synthetic group sparse regression problems, the GEP-MSCRA reduces the relative recovery error of the SLEP at least for the design matrix of Gaussian or sub-Gaussian type (see the first four subfigures in Figure 3), and for the real (School data) multi-task learning, the GEP-MSCRA can reduce the prediction error of the MALSAR at least when there are more than examples are used as training samples (see Figure 8).

To close this section, we introduce some necessary notations. We denote by the spectral norm and by the maximum column norm of , respectively. Let and

denote the vector of all ones and the identity matrix, respectively, whose dimensions are known from the context. For a convex function

, denotes the conjugate of ; for a given closed set , means the indicator function over the set , i.e., if and otherwise ; and for a given index set , write and , and denote by , i.e., if and otherwise .

## 2 A new perspective on the estimator ˆx

We shall examine the estimator from an equivalent MPEC of (2) and a global exact penalty of this MPEC, and conclude that can be obtained by solving an exact penalty problem which is constructed by moving the complementarity (or equilibrium) constraint into the objective of the MPEC. For convenience, we write and denote by the Lipschitz constant of relative to the set

. One will see that the results of this section are also applicable to a general continuous loss function.

Let denote the family of closed proper convex functions satisfying , and where is the unique minimizer of over . Let be the minimum element in such that , where is the subdifferential mapping of . The existence of such is guaranteed by Lemma 4.

Now we show that with an arbitrary , the problem (2) can be rewritten as an MPEC. Fix an arbitrary and . By the definition of , one may check that

 ∥z∥0=minw∈Rm{∑mi=1ϕ(wi):∥z∥1−⟨w,|z|⟩=0, 0≤w≤e}. (3)

This variational characterization of means that the problem (2) is equivalent to

 minx∈Ω,w∈Rm{νf(x)+m∑i=1ϕ(wi):0≤w≤e,⟨e−w,G(x)⟩=0} (4)

in the following sense: if is globally optimal to (2), then is a global optimal solution of (4); and conversely, if is a global optimal solution of (4), then is globally optimal to (2) with . This means that the difficulty to compute the estimator comes from the following equilibrium condition

 e−w≥0, G(x)≥0  and  ⟨e−w,G(x)⟩=0. (5)

Also, it is the equilibrium constraint to bring the bothersome nonconvexity of (2). Since the constraint set of (4) involves the equilibrium constraint (5), we call it an MPEC.

It is well known that the MPEC is a class of very difficult problems in optimization. In the past two decades, there was active research on its theory and algorithms especially for the one over the polyhedral cone, and the interested readers may refer to [17, 34]. We notice that most of existing algorithms are generic and inappropriate for solving (4). To handle the tough equilibrium constraint, we here consider its penalized version

 minx∈Ω,w∈[0,e]{νf(x)+m∑i=1ϕ(wi)+ρ⟨e−w,G(x)⟩} (6)

where is the penalty factor. The following theorem states that (6) is a global exact penalty for (4) in the sense that it has the same global optimal solution set as (4) does.

###### Theorem 2.1

Let . Then, for every , we have , where is the global optimal solution set of (6) associated to , and is that of (4).

The proof of Theorem 2.1 is included in Appendix B. From the proof, we see that the constraint in (2) is also necessary to establish the exact penalty for the MPEC. By Theorem 2.1 and the equivalence between (4) and (2), the estimator can be computed by solving a single penalty problem (6) associated to . Since , the function is Lipschitzian relative to by [27, Theorem 10.4], and so is the objective function of (6) relative to its feasible set . Thus, compared with the discontinuous nonconvex problem (2), the problem (6) is at least an Lipschitz-type one, for which the Clarke generalized differential [7] can be used for its analysis.

Interestingly, the equivalence between (2) and (6) also implies a mechanism to yield equivalent DC surrogates for the group zero-norm , and the popular SCAD function [10] and MCP function [39] are one of the products. Next we demonstrate this fact. For each , let be the associated closed proper convex function:

 ψ(t):={ϕ(t)if t∈[0,1],+∞otherwise. (7)

By using the function , the problem (6) can be rewritten in the following compact form

Together with the definition of conjugate functions and the above discussion, we have

 (8)

This means that the following function provides an equivalent DC surrogate for :

 Θ(x):=1ν∑mi=1θ(ρ∥xJi∥)  with  θ(s):=s−ψ∗(s). (9)

In particular, when is chosen as the one in Example 2.1, it becomes the SCAD function. Indeed, from the expression of in Example 2.1 below, it follows that

 φ(1)θ(τ/φ(1))=⎧⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪⎩τ% if |τ|≤1,−τ2+2aτ−12(a−1)if 1<|τ|≤a,a+12if |τ|>a

which, by setting for some constant , implies that

 λ2φ(1)θ(s/λφ(1))=⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩sλif |s|≤λ,−s2+2asλ−λ22(a−1)if λ<|s|≤aλ,(a+1)λ2/2if |s|>aλ.

Thus, when , by taking and , the function in (9) reduces to the SCAD function in [10]. Similarly, when is chosen as the one in Example 2.2, by taking and , the function in (9) becomes the MCP function in [39].

Now we give four examples for . In the sequel we shall call - as the function in Example 2.1-2.4, respectively, and - as the corresponding defined by (7).

###### Example 2.1

Take with for , where is a constant. Clearly, with and . After a simple computation,

 ψ∗(s)=⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩0if |s|≤1φ(1),(φ(1)|s|−1)22(a−1)φ(1)if 1φ(1)<|s|≤aφ(1),|s|−a+12φ(1)if |s|>aφ(1).
###### Example 2.2

Let with where is a constant and . Clearly, with and . An elementary calculation yields that takes the following form

 ψ∗(s)=⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩(a−2)2+4if |s|≤a−a2/2φ(1),1a2φ(1)(a2−2a2+sφ(1))2−(a−2)2+4φ(1)if a−a2/2φ(1)<|s|≤aφ(1),|s|−1if |s|>aφ(1).
###### Example 2.3

Take for . Clearly, with and . Also,

 ψ∗(s)={s−1if s>1,0if s≤1.

In this case, the function in (9) is exactly the capped -surrogate of in [12].

###### Example 2.4

Let with for , where is a small constant. It is not hard to check that with and . Also, with

 h(t):=⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩t+q−1qϵqq−1−ϵ−q−1qif t>ϵ1q−1−1,(1+ϵ)t−1q(t+1)q+1qif (1+ϵ)1q−1−1

To close this section, we take a look at the local optimality relation of (6) and (2).

###### Theorem 2.2

If with is a local optimal solution of (6) associated to , then is locally optimal to (4) and so is to (2). If is locally optimal to (2), then with is locally optimal to (6) for any .

Proof: Since is a local optimal solution of (6) associated to , there exists such that for all with ,

 νf(¯¯¯x)+∑mi=1ϕ(¯¯¯¯wi)+ρ⟨e−¯¯¯¯w,G(¯¯¯x)⟩≤νf(x)+∑mi=1ϕ(wi)+ρ⟨e−w,G(x)⟩.

Fix an arbitrary with where denotes the feasible set of (4). Then, from the last inequality, it immediately follows that

 νf(¯¯¯x)+∑mi=1ϕ(¯¯¯¯wi)=νf(¯¯¯x)+∑mi=1ϕ(¯¯¯¯wi)+ρ⟨e−¯¯¯¯w,G(¯¯¯x)⟩ ≤νf(x)+∑mi=1ϕ(wi)+ρ⟨e−w,G(x)⟩=νf(x)+∑mi=1ϕ(wi).

This shows that is a locally optimal solution of the problem (4).

We next argue that is locally optimal to (2). Since is a locally optimal solution of (4), there exists such that for all with ,

 νf(¯¯¯x)+∑mi=1ϕ(¯¯¯¯wi)≤νf(x)+∑mi=1ϕ(wi). (10)

Let where is the minimal nonzero component of . Fix an arbitrary . We shall establish the following inequality

 νf(¯¯¯x)+∥G(¯¯¯x)∥0≤νf(x)+∥G(x)∥0, (11)

and consequently is a local optimal solution to (2). If ,

 ∥G(x)∥0−∥G(¯¯¯x)∥0≥1>νLfδ≥νLf∥x−¯¯¯x∥≥νf(¯¯¯x)−νf(x).

This implies that inequality (11) holds. If , by using , we have , which implies that . Thus, and . From (10), we obtain or Along with , the inequality (11) follows.

Since is locally optimal to the problem (2), there exists such that for all with , it holds that Fix an arbitrary with . Then, for any ,

 νf(¯¯¯x)+∑mi=1ϕ(¯¯¯¯wi)+ρ⟨e−¯¯¯¯w,G(¯¯¯x)⟩ =νf(¯¯¯x)+∥G(¯¯¯x)∥0≤νf(x)+∥G(x)∥0 ≤νf(x)+∑mi=1ϕ(wi) ≤νf(x)+∑mi=1ϕ(wi)+ρ⟨e−w,G(x)⟩.

where the second inequality is due to (3). Thus, is locally optimal to (6).

## 3 GEP-MSCRA for computing the estimator ˆx

From the last section, to compute the estimator , one only needs to solve the penalty problem (6) associated to with , where the threshold is easily estimated once is given since . For a given , although the problem (6) is nonconvex due to the coupled term , its special structure makes it much easier to cope with than do the problem (2). Specifically, when the variable is fixed, the problem (6) reduces to a convex minimization in ; and when the variable is fixed, it reduces to a convex minimization in which, as will be shown below, has a closed-form solution. Motivated by this, we propose a multi-stage convex relaxation approach for computing by solving (6) in an alternating way.

###### Algorithm 3.1
(GEP-MSCRA for computing ) (S.0) Choose , and an initial . Set and . (S.1) Compute the following minimization problem
(12)
If , by the information of select a suitable and set .
(S.2) Seek an optimal solution to the minimization problem
(13)
(S.3) Set , and then go to Step (S.1).

###### Remark 3.1

(a) By the definition of , clearly, is an optimal solution to (13) if and only if . Since is a convex function in , the subdifferential is easily characterized by [27]; for example, for the function , it holds that

Thus, the main computation work in each iterate of the GEP-MSCRA is to solve (12).

(b) When , since for , we have for all , which means that the subproblem (12) for has a similar form to the one yielded by applying the linear approximation technique in [41] to . Together with part (a), the GEP-MSCRA is analogous to the LLA algorithm in [41] for nonconvex penalized LS problems except the start-up and the weights. We see that the initial subproblem of the GEP-MSCRA depends explicitly on the dual variable , while the initial subproblem of the LLA algorithm depends implicitly on an initial estimator . This means that the start-up of the GEP-MSCRA is more easily controlled.

(c) By following the first part of proofs for Theorem 2.2, when an iterate satisfies , is a local optimal solution of (2), and then with is locally optimal to (6) for any by Theorem 2.2 .

## 4 Statistical guarantees

For convenience, throughout this section, we denote by the group support of the true , i.e., , and write . With and an integer , we define

 C(¯¯¯¯S,l):={z∈Rp:∃S⊃¯¯¯¯S with |S|≤l such that ∑i∈Sc∥zJi∥≤21−¯tϕ∑i∈S∥zJi∥}.

Recall that the matrix is said to satisfy the RSC of constant in a set if

 12n∥Ax∥2≥κ∥x∥2 ∀x∈C.

In this section, under an RSC assumption on over , we shall establish an error bound for the iterate to and verify that the error sequence is strictly decreasing as increases, and if in addition the nonzero group vectors of are not too small, the iterate of the GEP-MSCRA after finite steps satisfies . Throughout the analysis, we assume that the components of the noise vector are independent (not necessarily identically distributed) sub-Gaussians, i.e., the following assumption holds.

###### Assumption 1

Assume that are independent (but not necessarily identically distributed) sub-Gaussians, i.e., there exists such that for all and

 E[exp(tεi)]≤exp(σ2t2/2).

The proofs of the main results of this section are all included in Appendix C.

### 4.1 Theoretical performance bounds

First of all, we characterize the error bound for every iterate to the true vector .

###### Theorem 4.1

Let . Suppose that has the RSC of constant over , and If , then

 ∥xk−¯¯¯x∥≤⎧⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪⎩ν−1(4−2¯tϕ)κ(3−¯tϕ)√1.5¯¯¯rif k=1;ρν−1(4−2¯tϕ)κ(3−¯tϕ)√1.5¯¯¯rif k=2,3,…,. (14)
###### Remark 4.1

(a) When , the subproblem (12) reduces to the -regularized least squares problem, and the bound in (14) has the same order as the one in [24, Corollary 4] except that the coefficient there is improved to be . From the choice interval of , the worst bound of is and that of for is .

(b) The restriction on and implies that . Such a restriction on is also required in the analysis of the -regularized LS estimator [24, 18]. The choice interval of depends on the RSC property of in and the noise level. Clearly, for those problems in which has a better RSC property over or the noise is smaller, there is a larger choice interval for the parameter . In addition, those with larger and smaller can deliver a larger choice interval of .

(c) If or equivalently , the choice interval of in Theorem 4.1 is included in . In this case, each subproblem (12) is a convex approximation of the exact penalty problem (6) in a low dimensional space.

Theorem 4.1 provides an error bound for every iterate of the GEP-MSCRA, but it is unclear whether the error bound for the current iterate is better than that of the previous iterate , i.e., the error bound sequence is decreasing or not. We resolve this problem by bounding for with where . The following theorem states this main result, and its proof involves the index sets

 F0:=¯¯¯¯S  and  Fk:={i:∣∣∥xkJi∥−∥¯¯¯xJi∥∣∣≥1(1−t∗ϕ)ρ}  for k=1,2,…. (15)

Suppose that