# Generalized Conjugate Gradient Methods for ℓ_1 Regularized Convex Quadratic Programming with Finite Convergence

The conjugate gradient (CG) method is an efficient iterative method for solving large-scale strongly convex quadratic programming (QP). In this paper we propose some generalized CG (GCG) methods for solving the ℓ_1-regularized (possibly not strongly) convex QP that terminate at an optimal solution in a finite number of iterations. At each iteration, our methods first identify a face of an orthant and then either perform an exact line search along the direction of the negative projected minimum-norm subgradient of the objective function or execute a CG subroutine that conducts a sequence of CG iterations until a CG iterate crosses the boundary of this face or an approximate minimizer of the objective function over this face or a subface is found. We determine which type of step should be taken by comparing the magnitude of some components of the minimum-norm subgradient of the objective function to that of its remaining components. Our analysis of the finite convergence of these methods makes use of an error bound result and some key properties of the aforementioned exact line search and the CG subroutine. We also show that the proposed methods are capable of finding an approximate solution of the problem by allowing some inexactness in the execution of the CG subroutine. The overall arithmetic operation cost of our GCG methods for finding an ϵ-optimal solution depends on ϵ in O(log(1/ϵ)), which is superior to the accelerated proximal gradient method [2, 23], whose cost depends on ϵ in O(1/√ϵ). In addition, our GCG methods can be extended straightforwardly to solve box-constrained convex QP with finite convergence. Numerical results demonstrate that our methods are very favorable for solving ill-conditioned problems.

Published: 10/31/2012


## 1 Introduction

The conjugate gradient (CG) method is an efficient numerical method for solving strongly convex quadratic programming (QP) in the form of

$$\min_{x\in\mathbb{R}^n}\ \tfrac12 x^T B x - c^T x, \qquad (1.1)$$

or equivalently, the linear system $Bx = c$, where $B \in \mathbb{R}^{n\times n}$ is a symmetric positive definite matrix and $c \in \mathbb{R}^n$. It terminates at the unique optimal solution of (1.1) in a finite number of iterations. Moreover, it is suitable for solving large-scale problems since it only requires matrix-vector multiplications per iteration (e.g., see [24] for details). The CG method has also been generalized to minimize a convex quadratic function over a box or a ball (e.g., see [11, 12, 25, 26, 27]).
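As a point of reference for the generalizations developed below, a textbook CG iteration for (1.1) can be sketched as follows (a minimal Python illustration, not the paper's algorithm; `B` and `c` follow the notation of (1.1)).

```python
import numpy as np

def cg(B, c, x0, tol=1e-10, max_iter=None):
    """Plain CG for min_x 0.5 x^T B x - c^T x with B symmetric positive
    definite, i.e. for the linear system B x = c. In exact arithmetic it
    terminates at the unique minimizer in at most n iterations."""
    n = len(c)
    max_iter = max_iter or n
    x = np.asarray(x0, dtype=float).copy()
    r = B @ x - c              # gradient of the objective / residual of Bx = c
    d = -r                     # initial search direction
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol:
            break
        Bd = B @ d
        alpha = (r @ r) / (d @ Bd)          # exact line search step
        x = x + alpha * d
        r_new = r + alpha * Bd              # cheap residual update
        beta = (r_new @ r_new) / (r @ r)
        d = -r_new + beta * d               # B-conjugate direction update
        r = r_new
    return x
```

Note that each iteration costs one matrix-vector product with `B`, which is what makes the method attractive at large scale.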

In this paper we are interested in generalizing the CG method to solve the $\ell_1$-regularized convex QP:

$$F^* = \min_{x\in\mathbb{R}^n} F(x) := \tfrac12 x^T A x - b^T x + \tau\|x\|_1, \qquad (1.2)$$

where $A \in \mathbb{R}^{n\times n}$ is a symmetric positive semidefinite matrix, $b \in \mathbb{R}^n$, and $\tau > 0$ is a regularization parameter. Throughout this paper we make the following assumption for problem (1.2).

###### Assumption 1

The set of optimal solutions of problem (1.2), denoted by $X^*$, is nonempty. (Since the objective function of (1.2) is a convex piecewise quadratic function, problem (1.2) has at least one optimal solution if and only if its objective function is bounded below.)

Over the last decade, a great deal of attention has been focused on problem (1.2) due to numerous applications in image sciences, machine learning, signal processing and statistics (e.g., see [8, 16, 5, 14, 30, 29] and the references therein). Considerable effort has been devoted to developing efficient algorithms for solving (1.2) (e.g., see [2, 23, 30, 33, 15, 32, 31, 22]). These methods are iterative and capable of producing an approximate solution to (1.2); nevertheless, they generally cannot terminate at an optimal solution of (1.2). Recently, Byrd et al. [6] proposed a method called iiCG to solve (1.2) that combines the iterative soft-thresholding algorithm (ISTA) [2, 10, 30] with the CG method. Under the assumption that $A$ is symmetric positive definite, it was shown in [6] that the sequence generated by iiCG converges to the unique optimal solution of (1.2), and if additionally this solution satisfies strict complementarity, iiCG terminates in a finite number of iterations. Its convergence is, however, unknown when $A$ is positive semidefinite (but not definite), which is typical for many instances of (1.2) arising in applications.

In this paper we propose some generalized CG (GCG) methods for solving (1.2) that terminate at an optimal solution of (1.2) in a finite number of iterations with no additional assumption. At each iteration, our methods first identify a certain face of some orthant and then either perform an exact line search along the direction of the negative projected minimum-norm subgradient of $F$ or execute a CG subroutine that conducts a sequence of CG iterations until a CG iterate crosses the boundary of this face or an approximate minimizer of $F$ over this face or a subface is found. The purpose of the exact line search step is to release some zero components of the current iterate so that the value of $F$ is sufficiently reduced. The aim of executing a CG subroutine is to update the nonzero components of the current iterate, which also results in a reduction in $F$. We determine which type of step should be taken by comparing the magnitude of some components of the minimum-norm subgradient of $F$ to that of its remaining components. Our methods are substantially different from the iiCG method [6]. In fact, at each iteration, iiCG either performs a proximal gradient step or executes a single CG iteration. It determines which type of step should be conducted by comparing the magnitude of some components of a proximal gradient of $F$ to that of its remaining components.

In order to analyze the convergence of our GCG methods, we establish some error bound results for problem (1.2). We also conduct some exclusive analysis on the aforementioned exact line search and the CG subroutine. Using these results, we show that the methods terminate at an optimal solution of (1.2) in a finite number of iterations. To the best of our knowledge, the GCG methods are the first methods for solving (1.2) with finite convergence. We also show that our methods are capable of finding an approximate solution of (1.2) by allowing some inexactness in the execution of the CG subroutine. The overall arithmetic operation cost of our GCG methods for finding an $\epsilon$-optimal solution depends on $\epsilon$ in $O(\log(1/\epsilon))$, which is superior to the accelerated proximal gradient method [2, 23], whose cost depends on $\epsilon$ in $O(1/\sqrt{\epsilon})$. In addition, it shall be mentioned that these methods can be extended to solve the following box-constrained convex QP with finite convergence:

$$\min_{l\le x\le u}\ \tfrac12 x^T A x - b^T x, \qquad (1.3)$$

where $A$ is symmetric positive semidefinite and $l, u \in \mathbb{R}^n$ with $l \le u$. As for finite convergence, the existing CG-type methods [11, 12] for (1.3), however, require that $A$ be symmetric positive definite. The extension of our methods to problem (1.3) is not included in this paper due to the length limitation.

The rest of the paper is organized as follows. In Section 2, we establish some error bound results for problem (1.2). In Section 3, we propose several methods for solving problem (1.2) and establish their finite convergence. In Section 4, we discuss the application of our methods to the $\ell_1$-regularized least-squares problem and develop a practical termination criterion for them. We conduct numerical experiments in Section 5 to compare the performance of our methods with some state-of-the-art algorithms for solving problem (1.2). In Section 6 we present some concluding remarks. Finally, in the appendix we study some convergence properties of the standard CG method for solving (possibly not strongly) convex QP.

### 1.1 Notation and terminology

For a nonzero symmetric positive semidefinite matrix $A$, we define a generalized condition number of $A$ as

$$\kappa(A) = \|A\|\,\|A^+\| = \frac{\lambda_{\max}(A)}{\lambda^+_{\min}(A)}, \qquad (1.4)$$

where $A^+$ is the Moore–Penrose pseudoinverse of $A$, $\lambda_{\max}(A)$ is the largest eigenvalue of $A$ and $\lambda^+_{\min}(A)$ is the smallest positive eigenvalue of $A$. Clearly, it reduces to the standard condition number when $A$ is symmetric positive definite. In addition, for any index set $J \subseteq \{1,\dots,n\}$, $|J|$ is the cardinality of $J$ and $A_{JJ}$ is the submatrix of $A$ formed by its rows and columns indexed by $J$. Analogously, $x_J$ is the subvector of $x$ formed by its components indexed by $J$. In addition, the range space and rank of a matrix $M$ are denoted by $\mathcal{R}(M)$ and $\mathrm{rank}(M)$, respectively.
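For instance, the generalized condition number (1.4) can be computed directly from the eigenvalues of $A$. The sketch below is a hypothetical helper (not part of the paper) that uses a relative cutoff to decide which eigenvalues count as positive.

```python
import numpy as np

def generalized_condition_number(A):
    """kappa(A) = ||A|| * ||A^+|| = lambda_max(A) / lambda_min^+(A) as in (1.4),
    for a nonzero symmetric positive semidefinite A. The cutoff separating the
    positive eigenvalues from the numerically zero ones is a heuristic choice."""
    eigs = np.linalg.eigvalsh(A)                    # ascending order
    cutoff = eigs[-1] * len(A) * np.finfo(float).eps
    positive = eigs[eigs > cutoff]
    return eigs[-1] / positive[0]                   # largest / smallest positive
```

When $A$ is nonsingular this coincides with the usual spectral condition number.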

Let $\mathrm{sgn}(\cdot)$ be the standard sign operator, which is conventionally defined as follows:

$$[\mathrm{sgn}(x)]_i = \begin{cases} 1 & \text{if } x_i > 0;\\ 0 & \text{if } x_i = 0;\\ -1 & \text{if } x_i < 0, \end{cases} \qquad i = 1,\dots,n.$$

Let $F$ be defined in (1.2) and

$$f(x) = \tfrac12 x^T A x - b^T x. \qquad (1.5)$$

Let $v(x)$ be the minimum-norm subgradient of $F$ at $x$, which is the projection of the zero vector onto the subdifferential of $F$ at $x$. It follows that

$$v_i(x) = \begin{cases} \nabla_i f(x) + \tau\,\mathrm{sgn}(x_i) & \text{if } x_i \ne 0;\\ \min\!\big(\nabla_i f(x) + \tau,\ \max(0, \nabla_i f(x) - \tau)\big) & \text{if } x_i = 0, \end{cases} \qquad i = 1,\dots,n, \qquad (1.6)$$

where $\nabla_i f(x)$ denotes the $i$th partial derivative of $f$ at $x$. It is known that $x$ is an optimal solution of problem (1.2) if and only if $0 \in \partial F(x)$, where $\partial F(x)$ denotes the subdifferential of $F$ at $x$. Since $0 \in \partial F(x)$ is equivalent to $v(x) = 0$, $x$ is an optimal solution of (1.2) if and only if $v(x) = 0$.
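The componentwise formula (1.6) translates directly into code. The sketch below (an illustration, not the paper's implementation) also exhibits the optimality test $v(x) = 0$: for $A = I$, the minimizer of (1.2) is the soft-thresholding of $b$, at which the minimum-norm subgradient vanishes.

```python
import numpy as np

def min_norm_subgradient(A, b, tau, x):
    """Minimum-norm subgradient v(x) of F(x) = 0.5 x^T A x - b^T x + tau*||x||_1,
    componentwise as in (1.6). x is optimal for (1.2) iff v(x) = 0."""
    g = A @ x - b                        # gradient of the smooth part f
    return np.where(
        x != 0,
        g + tau * np.sign(x),                              # nonzero components
        np.minimum(g + tau, np.maximum(0.0, g - tau)),     # zero components
    )
```

As a quick sanity check: with $A = I$, $b = (3, 0.5, -3)$ and $\tau = 1$, the soft-thresholded point $(2, 0, -2)$ yields $v = 0$, whereas $v(0) = (-2, 0, 2)$ is nonzero, correctly flagging $x = 0$ as non-optimal.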

For any $x \in \mathbb{R}^n$, we define

$$I_-(x) = \{i : x_i < 0\}, \quad I_+(x) = \{i : x_i > 0\}, \quad I_0(x) = \{i : x_i = 0\}, \quad I_0^c(x) = \{i : x_i \ne 0\}, \qquad (1.7)$$

and also define

$$H(x) = \{y \in \mathbb{R}^n : y_i = 0,\ i \in I_0(x)\}, \qquad F^*_x = \min\{F(y) : y \in H(x)\}. \qquad (1.8)$$

In addition, given any closed set $\Omega \subseteq \mathbb{R}^n$, $\mathrm{dist}(x, \Omega)$ denotes the distance from $x$ to $\Omega$, and $P_\Omega(x)$ denotes the projection of $x$ onto $\Omega$. Finally, we define

$$\mathcal{I}^* = \{J \subseteq I_0(x^*) : x^* \in X^*\}, \qquad L(n) = \max\{\ell : C_1, \dots, C_\ell \notin \mathcal{I}^* \text{ are distinct subsets of } \{1,\dots,n\}\} + 1. \qquad (1.9)$$

## 2 Error bound results

In this section we develop some error bound results for problem (1.2). To proceed, let $S(\delta) = \{x \in \mathbb{R}^n : F(x) \le F^* + \delta\}$ for any $\delta > 0$, where $F$ and $F^*$ are defined in (1.2). We first bound the gap between $F(x)$ and $F^*$ by $\|v(x)\|$ for all $x \in S(\delta)$.

###### Theorem 2.1

Let $F$, $F^*$ and $v$ be defined in (1.2) and (1.6), respectively. Then for any $\delta > 0$, there exists some $\eta > 0$ (depending on $\delta$) such that

$$F(x) - F^* \le \eta\,\|v(x)\|^2, \qquad \forall x \in S(\delta).$$

Proof. Let $X^*$ denote the set of optimal solutions of (1.2). Notice that $F$ is a convex piecewise quadratic function. By [21, Theorem 2.7], there exists some $\eta > 0$ such that

$$\mathrm{dist}(x, X^*) \le \sqrt{\eta}\,\sqrt{F(x) - F^*}, \qquad \forall x \in S(\delta). \qquad (2.1)$$

Let $x^* \in X^*$ be such that $\|x - x^*\| = \mathrm{dist}(x, X^*)$. By $v(x) \in \partial F(x)$ and the convexity of $F$, one has

$$F(x) - F^* = F(x) - F(x^*) \le \langle v(x), x - x^* \rangle \le \|v(x)\|\,\|x - x^*\| = \|v(x)\|\,\mathrm{dist}(x, X^*),$$

which together with (2.1) implies that the conclusion holds.

We next bound the gap between $F(x)$ and $F^*_x$ by the magnitude of some components of $v(x)$ for all $x \in S(\delta)$.

###### Theorem 2.2

Let $F$ and $F^*_x$ be defined in (1.2) and (1.8), respectively. Then for any $\delta > 0$, there exists some $\hat\eta > 0$ (depending on $\delta$) such that

$$F(x) - F^*_x \le \hat\eta\,\|[v(x)]_J\|^2, \qquad \forall x \in S(\delta),$$

where $J = I_0^c(x)$.

Proof. Let $x \in S(\delta)$ be arbitrarily chosen, and let $J = I_0^c(x)$. If $J = \emptyset$, it is clear that $x = 0$ and hence $F(x) = F^*_x$. Also, by convention $\|[v(x)]_J\| = 0$. These imply that the conclusion holds. We now assume $J \ne \emptyset$. Consider the problem

$$\hat F^*_J = \min_{z \in \mathbb{R}^{|J|}} \hat F_J(z) := \tfrac12 z^T A_{JJ} z - b_J^T z + \tau\|z\|_1. \qquad (2.2)$$

In view of the definitions of $J$, $H(x)$, $F^*_x$, $F^*$ and $\hat F_J$, one can observe that

$$\hat F_J(x_J) = F(x), \qquad \hat F^*_J = F^*_x \ge F^*.$$

This together with $x \in S(\delta)$ implies that $x_J$ lies in the corresponding level set of $\hat F_J$. By (1.6), (2.2) and the definition of $J$, we also observe that $[v(x)]_J$ is the minimum-norm subgradient of $\hat F_J$ at $x_J$. In addition, notice that problem (2.2) is in the same form as (1.2). By these facts and applying Theorem 2.1 to problem (2.2), there exists some $\eta_J > 0$ (depending on $\delta$ and $J$) such that

$$F(x) - F^*_x = \hat F_J(x_J) - \hat F^*_J \le \eta_J\,\|[v(x)]_J\|^2. \qquad (2.3)$$

Let $\hat\eta = \max_J \eta_J$, which is finite due to the fact that all possible choices of $J$ are finite. The conclusion immediately follows from this and (2.3).

The error bound presented in Theorem 2.2 is a local error bound as it depends on $\delta$. In addition, Theorem 2.2 only ensures the existence of some parameter $\hat\eta$ for the error bound, but its actual value is generally unknown. We next derive a global error bound with a known parameter for problem (1.2) when $A$ is symmetric positive definite. To proceed, we first establish a lemma as follows.

###### Lemma 2.1

Suppose $b \in \mathcal{R}(A)$ and $A \ne 0$. Let $f$ be defined in (1.5) and $f^* = \min_{x \in \mathbb{R}^n} f(x)$. Then there holds:

$$\frac{1}{2\|A\|}\|\nabla f(x)\|^2 \le f(x) - f^* \le \frac{\|A^+\|}{2}\|\nabla f(x)\|^2, \qquad \forall x \in \mathbb{R}^n.$$

Proof. Let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$ be all eigenvalues of $A$ and $u_1, \dots, u_n$ the corresponding orthonormal eigenvectors. In addition, let $x^*$ be an optimal solution of the problem $\min_x f(x)$. Clearly, $Ax^* = b$. Moreover, for any $x \in \mathbb{R}^n$, we have $x - x^* = \sum_{i=1}^n \alpha_i u_i$ for some $\alpha \in \mathbb{R}^n$. These imply

$$\nabla f(x) = Ax - b = A(x - x^*) = \sum_{i=1}^n \lambda_i \alpha_i u_i. \qquad (2.4)$$

Let $\ell = \max\{i : \lambda_i > 0\}$. It follows that $\lambda_i = 0$ for all $i > \ell$. In view of this and (2.4), we have

$$\|\nabla f(x)\|^2 = \sum_{i=1}^n \lambda_i^2 \alpha_i^2 = \sum_{i=1}^\ell \lambda_i^2 \alpha_i^2.$$

This together with the fact $\lambda_\ell \le \lambda_i \le \lambda_1$ for $i = 1, \dots, \ell$ yields

$$\frac{1}{\lambda_1}\|\nabla f(x)\|^2 = \frac{1}{\lambda_1}\sum_{i=1}^\ell \lambda_i^2\alpha_i^2 \ \le\ \sum_{i=1}^\ell \lambda_i \alpha_i^2 \ \le\ \frac{1}{\lambda_\ell}\sum_{i=1}^\ell \lambda_i^2\alpha_i^2 = \frac{1}{\lambda_\ell}\|\nabla f(x)\|^2.$$

Using the definitions of $f$ and $x^*$, (2.4), and $\lambda_i = 0$ for all $i > \ell$, one can observe that

$$f(x) - f^* = \tfrac12 (x - x^*)^T A (x - x^*) = \tfrac12\sum_{i=1}^n \lambda_i\alpha_i^2 = \tfrac12\sum_{i=1}^\ell \lambda_i\alpha_i^2.$$

The conclusion then immediately follows from the last two relations and the fact that $\|A\| = \lambda_1$ and $\|A^+\| = 1/\lambda_\ell$.
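The two-sided bound of Lemma 2.1 is easy to check numerically. The following script (an illustration on a randomly generated rank-deficient instance, not part of the paper) verifies the bound at a random point; the small tolerances guard against floating-point error only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Rank-deficient symmetric positive semidefinite A with b in its range,
# as required by Lemma 2.1.
M = rng.standard_normal((5, 3))
A = M @ M.T                              # PSD with rank 3
b = A @ rng.standard_normal(5)           # guarantees b is in R(A)

f = lambda x: 0.5 * x @ A @ x - b @ x
x_star = np.linalg.pinv(A) @ b           # a minimizer: A x* = b since b in R(A)
x = rng.standard_normal(5)
g = A @ x - b                            # gradient of f at x

gap = f(x) - f(x_star)
lower = (g @ g) / (2 * np.linalg.norm(A, 2))                 # (1/(2||A||)) ||grad f||^2
upper = np.linalg.norm(np.linalg.pinv(A), 2) * (g @ g) / 2   # (||A^+||/2) ||grad f||^2
assert lower - 1e-9 <= gap <= upper + 1e-9
```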

###### Theorem 2.3

Let $F$ and $F^*_x$ be defined in (1.2) and (1.8), respectively. Suppose that $A$ is symmetric positive definite. Then there holds:

$$F(x) - F^*_x \le \frac{\|A^{-1}\|}{2}\,\|[v(x)]_J\|^2, \qquad \forall x \in \mathbb{R}^n,$$

where $J = I_0^c(x)$.

Proof. Let $x \in \mathbb{R}^n$ be arbitrarily chosen and let $J = I_0^c(x)$. If $J = \emptyset$, it is clear that $x = 0$ and hence $F(x) = F^*_x$. Also, by convention $\|[v(x)]_J\| = 0$. These imply that the conclusion holds. We now assume $J \ne \emptyset$. Consider the problem

$$\tilde F^*_J = \min_{z \in \mathbb{R}^{|J|}} \tilde F_J(z) := \tfrac12 z^T A_{JJ} z + \big(-b_J + \tau\,\mathrm{sgn}(x_J)\big)^T z.$$

Since $A$ is positive definite, so is $A_{JJ}$. It then follows that $\mathcal{R}(A_{JJ}) = \mathbb{R}^{|J|}$. By applying Lemma 2.1 to this problem, we obtain that

$$\tilde F_J(x_J) - \tilde F^*_J \le \frac{\|(A_{JJ})^{-1}\|}{2}\,\|\nabla \tilde F_J(x_J)\|^2. \qquad (2.5)$$

In addition, by the definitions of $J$, $F$ and $\tilde F_J$, one can observe that $\tilde F_J(y_J) \le F(y)$ for all $y \in H(x)$, where $H(x)$ is defined in (1.8). This together with the definitions of $F^*_x$ and $\tilde F^*_J$ implies $\tilde F^*_J \le F^*_x$. Also, we observe that $\tilde F_J(x_J) = F(x)$ and $\nabla \tilde F_J(x_J) = [v(x)]_J$. Using these relations and (2.5), we have

$$F(x) - F^*_x \ \le\ \tilde F_J(x_J) - \tilde F^*_J \ \le\ \frac{\|(A_{JJ})^{-1}\|}{2}\,\|\nabla \tilde F_J(x_J)\|^2 \ \le\ \frac{\|A^{-1}\|}{2}\,\|[v(x)]_J\|^2,$$

and hence the conclusion holds.
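On a small diagonal instance the bound of Theorem 2.3 can be verified by hand. The following script (illustrative numbers, not from the paper) checks it at a point with all components nonzero, so that $J = I_0^c(x)$ covers every index and $F^*_x = F^*$, which for diagonal $A$ is computable by componentwise soft-thresholding.

```python
import numpy as np

A = np.diag([2.0, 4.0])
b = np.array([3.0, -5.0])
tau = 1.0

F = lambda x: 0.5 * x @ A @ x - b @ x + tau * np.sum(np.abs(x))

# For diagonal A, (1.2) separates and the minimizer is a scaled soft-thresholding:
# x*_i = sign(b_i) * max(|b_i| - tau, 0) / A_ii
x_star = np.sign(b) * np.maximum(np.abs(b) - tau, 0.0) / np.diag(A)

x = np.array([0.5, -2.0])                 # all components nonzero: J = {1, 2}
v = A @ x - b + tau * np.sign(x)          # [v(x)]_J per (1.6) on nonzero components
gap = F(x) - F(x_star)                    # here F*_x = F* since H(x) = R^2
bound = np.linalg.norm(np.linalg.inv(A), 2) / 2 * (v @ v)
assert gap <= bound
```

Here the gap is $2.25$ while the right-hand side evaluates to $\|A^{-1}\|/2 \cdot \|v(x)\|^2 = 0.25 \cdot 17 = 4.25$, so the bound holds with room to spare.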

## 3 Generalized conjugate gradient methods for (1.2)

In this section we propose several methods for solving problem (1.2), which terminate at an optimal solution in a finite number of iterations. A key ingredient of these methods is to apply a truncated projected CG (TPCG) method to a sequence of convex QPs over certain faces of some orthants in $\mathbb{R}^n$.

### 3.1 Truncated projected conjugate gradient methods

In this subsection we present two TPCG methods for finding a (perhaps rough) approximate solution to a convex QP over a face of some orthant in $\mathbb{R}^n$, in the form of

$$\min_x\ q(x) := f(x) + c^T x \quad \text{s.t.} \quad x_j = 0,\ j \in J_0; \quad x_j \le 0,\ j \in J_-; \quad x_j \ge 0,\ j \in J_+, \qquad (3.1)$$

where $f$ is defined in (1.5), $c \in \mathbb{R}^n$, and $J_0$, $J_-$, $J_+$ form a partition of $\{1,\dots,n\}$. For convenience of presentation, we denote by $\Omega$ the feasible region of (3.1).

For the first TPCG method, each iterate is obtained by applying the standard projected CG (PCG) method (the PCG method applied to problem (3.2) is equivalent to the CG method applied to (3.2) viewed as an unconstrained problem in the variables $x_{\bar J_0}$, where $\bar J_0$ is the complement of $J_0$ in $\{1,\dots,n\}$) to the problem

$$\min_x\ \{q(x) : x_j = 0,\ j \in J_0\}, \qquad (3.2)$$

until an approximate solution of (3.2) is found or a PCG iterate crosses the boundary of $\Omega$. In the former case, the method outputs the resulting approximate solution; in the latter case, it outputs the intersection point between the boundary of $\Omega$ and the line segment joining the last two PCG iterates. Let $x^0$ be an arbitrary feasible point of problem (3.1) and $\epsilon > 0$ be given. We now present the first TPCG method for problem (3.1).

Subroutine 1: $\bar x = \mathrm{TPCG}(A, b, c, J_0, J_-, J_+, x^0, \epsilon)$

Input: $A$, $b$, $c$, $J_0$, $J_-$, $J_+$, $x^0$, $\epsilon$.

Set $k = 0$, $H = \{x \in \mathbb{R}^n : x_j = 0,\ j \in J_0\}$, $p^0 = P_H(Ax^0 - b + c)$, $d^0 = -p^0$.

Repeat

• $\alpha_k = \min\{\tilde\alpha_k, \bar\alpha_k\}$, where $\tilde\alpha_k = \|p^k\|^2 / \big((d^k)^T A d^k\big)$ and $\bar\alpha_k = \max\{\alpha \ge 0 : x^k + \alpha d^k \in \Omega\}$.

• $x^{k+1} = x^k + \alpha_k d^k$.

• If $\alpha_k = \bar\alpha_k$, return $\bar x = x^{k+1}$ and terminate.

• $p^{k+1} = p^k + \alpha_k P_H(A d^k)$.

• $\beta_k = \|p^{k+1}\|^2 / \|p^k\|^2$. If $\|p^{k+1}\|_\infty \le \epsilon$, return $\bar x = x^{k+1}$ and terminate.

• $d^{k+1} = -p^{k+1} + \beta_k d^k$.

• $k \leftarrow k + 1$.

Output: $\bar x$.

Remark 1: The iterations of the above TPCG method are almost identical to those of PCG applied to problem (3.2), except that the step length is chosen to be an intermediate one when an iterate of PCG crosses the boundary of $\Omega$. In addition, if $\alpha_k = \bar\alpha_k$ holds at some $k$, the output $\bar x$ is on the boundary of $\Omega$. If $\|p^{k+1}\|_\infty \le \epsilon$ holds at some $k$, the output $\bar x$ is an approximate optimal solution of problem (3.1).
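The truncated projected CG idea can be sketched in code as follows (a simplified rendering with hypothetical helper names; the exact bookkeeping in Subroutine 1 may differ). The loop runs CG on the subspace $\{x : x_{J_0} = 0\}$ and stops early, returning the boundary point, as soon as a step would leave the face defined by the sign constraints.

```python
import numpy as np

def tpcg(A, b, c, J0, Jm, Jp, x0, eps, max_iter=1000):
    """Truncated projected CG sketch: minimize q(x) = 0.5 x^T A x - b^T x + c^T x
    over {x : x_j = 0, j in J0}, stopping when the projected gradient is small
    or when an iterate would cross the face {x_j <= 0 (j in Jm), x_j >= 0 (j in Jp)}."""
    n = len(b)
    free = np.ones(n, dtype=bool)
    free[list(J0)] = False
    P = lambda z: np.where(free, z, 0.0)       # projection onto H = {x : x_J0 = 0}
    x = np.asarray(x0, dtype=float).copy()
    p = P(A @ x - b + c)                       # projected gradient of q at x
    d = -p
    for _ in range(max_iter):
        if np.max(np.abs(p)) <= eps:           # approximate minimizer on the face
            return x
        Ad = A @ d
        alpha = (p @ p) / (d @ Ad)             # exact CG step (assumes d^T A d > 0)
        # largest step length keeping x + alpha*d on the face
        abar = np.inf
        for j in Jm:
            if d[j] > 0:
                abar = min(abar, -x[j] / d[j])
        for j in Jp:
            if d[j] < 0:
                abar = min(abar, -x[j] / d[j])
        x = x + min(alpha, abar) * d
        if abar <= alpha:                      # boundary hit: truncate and return
            return x
        p_new = P(p + alpha * Ad)              # residual update, then project
        beta = (p_new @ p_new) / (p @ p)
        d = -p_new + beta * d
        p = p_new
    return x
```

With no active sign constraints along the path, this reduces to plain PCG; when a step is truncated, the returned point lies exactly on the face boundary.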

We next show that under a mild assumption the above method terminates in a finite number of iterations.

###### Theorem 3.1

Assume that problem (3.1) has at least an optimal solution. Suppose that $x^0$ is a feasible point of problem (3.1) and $\epsilon > 0$. Let $B = A_{\bar J_0 \bar J_0}$, where $\bar J_0$ is the complement of $J_0$ in $\{1,\dots,n\}$. The following statements hold:

• If problem (3.2) is bounded below, Subroutine 1 terminates in at most $M$ iterations (by convention, $M = 0$ when the expression below is nonpositive, which occurs when $\epsilon \ge 2\sqrt{\kappa(B)}\,\|p^0\|$), where

$$M = \left\lceil \frac{\log\epsilon - \log\big(2\sqrt{\kappa(B)}\,\|p^0\|\big)}{\log\big(\sqrt{\kappa(B)} - 1\big) - \log\big(\sqrt{\kappa(B)} + 1\big)} \right\rceil.$$
• If problem (3.2) is unbounded below, Subroutine 1 terminates in at most $n$ iterations.

Proof. (i) Assume that problem (3.2) is bounded below. Suppose for contradiction that Subroutine 1 does not terminate in $M$ iterations. Then the iterates $x^1, \dots, x^M$ of Subroutine 1 are identical to those generated by the PCG method applied to problem (3.2). Let $q^*$ denote the optimal value of (3.2). It follows from Theorem A.3 (iii) that for $k = 1, \dots, M$,

$$q(x^k) - q^* \le 4\left(\frac{\sqrt{\kappa(B)} - 1}{\sqrt{\kappa(B)} + 1}\right)^{2k}\big(q(x^0) - q^*\big).$$

By the definition of $p^k$ and Lemma 2.1, we have

$$\|p^k\|^2 \le 2\|B\|\big(q(x^k) - q^*\big), \qquad q(x^0) - q^* \le \|B^+\|\,\|p^0\|^2/2.$$

Using these relations, we obtain that

$$\|p^k\|^2 \le 4\kappa(B)\left(\frac{\sqrt{\kappa(B)} - 1}{\sqrt{\kappa(B)} + 1}\right)^{2k}\|p^0\|^2.$$

In view of this and Theorem A.2 (i), one can easily conclude that the PCG method must terminate at some $x^k$ satisfying $\|p^k\|_\infty \le \epsilon$ for some $k \le M$. This contradicts the above supposition.

(ii) Assume that problem (3.2) is unbounded below. Suppose for contradiction that Subroutine 1 does not terminate in $n$ iterations. Then the iterates $x^1, \dots, x^n$ of Subroutine 1 are identical to those generated by the PCG method applied to problem (3.2). By Theorem A.2 (ii), there must exist some $k$ such that $q(x^k + \alpha d^k) \to -\infty$ as $\alpha \to \infty$. Recall that $x^0$ is in $\Omega$ and problem (3.1) has at least one optimal solution. Thus there exists a least $k$ such that $x^{k+1}$ lies on the boundary of $\Omega$, and Subroutine 1 thus terminates at iteration $k$, which is a contradiction to the above supposition.

Remark 2: It follows from Theorem 3.1 that when problem (3.2) is unbounded below, TPCG executes at most (but possibly much fewer than) $n$ PCG iterations. On the other hand, when (3.2) is bounded below, the number of PCG iterations executed in TPCG depends on $\epsilon$ in $O(\log(1/\epsilon))$.

As seen from step 3) of Subroutine 1, it immediately terminates once an iterate crosses the boundary of $\Omega$. In this case, the output $\bar x$ may be a rather poor approximate solution to problem (3.1). In order to improve the quality of $\bar x$, we resort to an active-set approach by iteratively applying Subroutine 1 to minimize $q$ over a decremental subset of $\Omega$, which is formed by incorporating the active constraints of the iterate obtained from the immediately preceding execution of Subroutine 1. Let $x^0$ be an arbitrary feasible point of problem (3.1) and $\epsilon > 0$ be given. We now present this improved TPCG method for problem (3.1) as follows.

Subroutine 2: $\bar x = \mathrm{TPCG2}(A, b, c, J_0, J_-, J_+, x^0, \epsilon)$

Input: $A$, $b$, $c$, $J_0$, $J_-$, $J_+$, $x^0$, $\epsilon$.

Set $k = 0$, $J_0^0 = J_0$, $J_-^0 = J_-$, $J_+^0 = J_+$, $H^0 = \{x \in \mathbb{R}^n : x_j = 0,\ j \in J_0^0\}$.

Repeat

• If $\|P_{H^k}(Ax^k - b + c)\|_\infty \le \epsilon$, return $\bar x = x^k$ and terminate.

• $x^{k+1} = \mathrm{TPCG}(A, b, c, J_0^k, J_-^k, J_+^k, x^k, \epsilon)$.

• $J_0^{k+1} = I_0(x^{k+1})$, $J_-^{k+1} = J_-^k \setminus I_0(x^{k+1})$, $J_+^{k+1} = J_+^k \setminus I_0(x^{k+1})$, $H^{k+1} = \{x \in \mathbb{R}^n : x_j = 0,\ j \in J_0^{k+1}\}$.

• $k \leftarrow k + 1$.

Output: $\bar x$.

We next show that under some suitable assumptions, Subroutine 2 terminates in a finite number of iterations.

###### Theorem 3.2

Assume that problem (3.1) has at least an optimal solution. Let $\epsilon > 0$ and let $\Omega$ be the feasible region of problem (3.1). Suppose that $x^0$ is a feasible point of (3.1). Then the following statements hold:

• Subroutine 2 is well defined.

• Subroutine 2 terminates in at most $|I_0^c(x^0)| + 1$ iterations. Moreover, its output $\bar x$ satisfies $\bar x \in \Omega$ and $\|P_{H(\bar x)}(A\bar x - b + c)\|_\infty \le \epsilon$, where $H(\cdot)$ is defined in (1.8).

• Suppose additionally that $P_{H^0}(Ax^0 - b + c) \ne 0$ and $x^0 - \alpha P_{H^0}(Ax^0 - b + c) \in \Omega$ for sufficiently small $\alpha > 0$. Then $q(\bar x) < q(x^0)$, where $q$ is defined in (3.1).

Proof. (i) Observe that in step 2) of Subroutine 2, Subroutine 1 (namely, TPCG) is applied to the problem

$$\min_x\ q(x) \quad \text{s.t.} \quad x_j = 0,\ j \in J_0^k; \quad x_j \le 0,\ j \in J_-^k; \quad x_j \ge 0,\ j \in J_+^k, \qquad (3.3)$$

where $q$ is defined in (3.1). Let $\Omega^k$ denote the feasible region of (3.3). In view of the updating scheme of Subroutine 2 and the definitions of $J_0^k$, $J_-^k$ and $J_+^k$, it is not hard to observe that $x^k \in \Omega^k \subseteq \Omega$. By the assumption that (3.1) has at least an optimal solution, so does (3.3). It then follows from Theorem 3.1 that $x^{k+1}$ shall be successfully generated by Subroutine 1. Using this observation and an inductive argument, we can conclude that Subroutine 2 is well defined.

(ii) Suppose for contradiction that Subroutine 2 does not terminate in $K := |I_0^c(x^0)| + 1$ iterations. Then $\|P_{H^k}(Ax^k - b + c)\|_\infty > \epsilon$ for all $k \le K$. Since $x^1, \dots, x^K$ are generated by Subroutine 1, one can observe that $H^{k+1} \subseteq H^k$ and hence $P_{H^{k+1}}P_{H^k} = P_{H^{k+1}}$ for every $k$. It then follows from these and the definition of $H^k$ that for all $k < K$,

$$\|P_{H^k}(Ax^{k+1} - b + c)\|_\infty \ge \|P_{H^{k+1}}(Ax^{k+1} - b + c)\|_\infty > \epsilon.$$

This implies that whenever Subroutine 1 is applied to (3.3), it terminates at a boundary point of the feasible region of (3.3). It then follows that

$$I_0(x^0) \subsetneq I_0(x^1) \subsetneq \cdots \subsetneq I_0(x^K).$$

Thus $|I_0(x^k)|$ is strictly increasing, which along with $K = |I_0^c(x^0)| + 1$ and $|I_0(x^0)| = n - |I_0^c(x^0)|$ leads to $|I_0(x^K)| \ge K + |I_0(x^0)| = n + 1$. This contradicts the trivial fact $|I_0(x^K)| \le n$. Therefore, Subroutine 2 must terminate at some $x^k$ in at most $K$ iterations. Clearly, $\bar x \in \Omega$. We now prove $\|P_{H(\bar x)}(A\bar x - b + c)\|_\infty \le \epsilon$ by considering two separate cases as follows.

Case 1): $k = 0$. In this case, Subroutine 2 terminates at $k = 0$ and outputs $\bar x = x^0$. By $x^0 \in \Omega$ and the definition of $H^0$, one can see that $H(x^0) \subseteq H^0$ and hence

$$\|P_{H(x^0)}(Ax^0 - b + c)\|_\infty \le \|P_{H^0}(Ax^0 - b + c)\|_\infty \le \epsilon,$$

which together with $\bar x = x^0$ implies $\|P_{H(\bar x)}(A\bar x - b + c)\|_\infty \le \epsilon$.

Case 2): $k \ge 1$. In this case, Subroutine 2 must terminate at some iteration $k \ge 1$. It then follows that $\bar x = x^k$ and $\|P_{H^k}(Ax^k - b + c)\|_\infty \le \epsilon$. In addition, we observe from the definitions of $H^k$ and $H(\cdot)$ that $H(x^k) \subseteq H^k$ for $k \ge 1$. It then immediately follows that $\|P_{H(\bar x)}(A\bar x - b + c)\|_\infty \le \epsilon$.

(iii) We now prove statement (iii). Since $P_{H^0}(Ax^0 - b + c) \ne 0$, $x^1$ must be generated by calling the subroutine TPCG, whose first iteration performs a projected gradient step to find a point $x(\alpha^*)$, where

$$\alpha^* = \operatorname*{argmin}_{\alpha \ge 0}\ \{q(x(\alpha)) : x(\alpha) \in \Omega\},$$

and $x(\alpha) = x^0 - \alpha P_{H^0}(Ax^0 - b + c)$. By the assumption that $x(\alpha) \in \Omega$ for sufficiently small $\alpha > 0$, one can see that $\alpha^* > 0$ and $q(x(\alpha^*)) < q(x^0)$. We also observe that the value of $q$ is non-increasing along the subsequent iterates of the subroutine TPCG. These observations and the definition of $x^1$ imply that $q(x^1) < q(x^0)$. In addition, $q$ is non-increasing along the iterates generated in Subroutine 1. Hence, $q(x^{k+1}) \le q(x^k)$ for all $k \ge 1$. It then follows that $q(x^k) < q(x^0)$ for all $k \ge 1$. Notice that $\bar x = x^k$ for some $k \ge 1$. Hence, $q(\bar x) < q(x^0)$.

Remark 3: As seen from Theorem 3.2, the subroutine TPCG is executed at most (but possibly much fewer than) $|I_0^c(x^0)| + 1$ times. In view of this and Remark 2, one can see that when the subproblems (3.3) are unbounded below, the number of PCG iterations executed in Subroutine 2 is at most $n(|I_0^c(x^0)| + 1)$. On the other hand, when they are bounded below, its number of PCG iterations depends on $\epsilon$ in $O(\log(1/\epsilon))$.

### 3.2 The first generalized conjugate gradient method for (1.2)

In this subsection we propose a method for solving problem (1.2). We show that this method terminates at an optimal solution of (1.2) in a finite number of iterations. Before proceeding, we introduce some notation that will be used throughout the next several subsections.

Given any $x \in \mathbb{R}^n$, we define

$$I_0^0(x) = \{i \in I_0(x) : 0 \in [\nabla_i f(x) - \tau,\ \nabla_i f(x) + \tau]\}, \quad I_0^+(x) = \{i \in I_0(x) : \nabla_i f(x) + \tau < 0\}, \quad I_0^-(x) = \{i \in I_0(x) : \nabla_i f(x) - \tau > 0\}, \qquad (3.4)$$

where $I_0(x)$ is given in (1.7). Also, we define $c(x;\tau) \in \mathbb{R}^n$ as follows:

$$c_i(x;\tau) = \begin{cases} \tau & \text{if } i \in I_+(x) \cup I_0^+(x);\\ 0 & \text{if } i \in I_0^0(x);\\ -\tau & \text{if } i \in I_-(x) \cup I_0^-(x), \end{cases} \qquad i = 1,\dots,n, \qquad (3.5)$$

where $I_+(x)$ and $I_-(x)$ are defined in (1.7). It then follows from (1.6) and (3.5) that

$$v_i(x) = \nabla_i f(x) + c_i(x;\tau), \qquad \forall i \notin I_0^0(x). \qquad (3.6)$$
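The index sets (3.4) and the vector $c(x;\tau)$ of (3.5) are straightforward to compute. The sketch below (an illustrative helper, not the paper's code) builds $c(x;\tau)$ and can be checked against the identity (3.6) on the components outside $I_0^0(x)$.

```python
import numpy as np

def c_vec(A, b, tau, x):
    """The vector c(x; tau) of (3.5), built from the index sets (1.7) and (3.4);
    g = grad f(x) = A x - b."""
    g = A @ x - b
    c = np.zeros_like(x)
    for i in range(len(x)):
        if x[i] > 0 or (x[i] == 0 and g[i] + tau < 0):    # i in I_+(x) or I_0^+(x)
            c[i] = tau
        elif x[i] < 0 or (x[i] == 0 and g[i] - tau > 0):  # i in I_-(x) or I_0^-(x)
            c[i] = -tau
        # otherwise i is in I_0^0(x) and c[i] stays 0
    return c
```

For example, with $A = I$, $b = (3, 0.5, -3)$, $\tau = 1$ and $x = (1, 0, 0)$, one gets $c(x;\tau) = (1, 0, -1)$, and on the components outside $I_0^0(x)$ the sum $\nabla_i f(x) + c_i(x;\tau)$ reproduces $v_i(x)$ from (1.6).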

In addition, given any $x, y \in \mathbb{R}^n$, we define

$$Q(x;y) = f(x) + c(y;\tau)^T x.$$

The main idea of our method is as follows. Given a current iterate $x^k$, we check whether $v(x^k) = 0$ or not. If yes, then $x^k$ is an optimal solution of (