# A Generalized Matrix Splitting Algorithm

Composite function minimization captures a wide spectrum of applications in both computer vision and machine learning. It includes bound constrained optimization, ℓ_1 norm regularized optimization, and ℓ_0 norm regularized optimization as special cases. This paper proposes and analyzes a new Generalized Matrix Splitting Algorithm (GMSA) for minimizing composite functions. It can be viewed as a generalization of the classical Gauss-Seidel method and the Successive Over-Relaxation method for solving linear systems in the literature. Our algorithm is derived from a novel triangle operator mapping, which can be computed exactly using a new generalized Gaussian elimination procedure. We establish the global convergence, convergence rate, and iteration complexity of GMSA for convex problems. In addition, we also discuss several important extensions of GMSA. Finally, we validate the performance of our proposed method on three particular applications: nonnegative matrix factorization, ℓ_0 norm regularized sparse coding, and ℓ_1 norm regularized Dantzig selector problem. Extensive experiments show that our method achieves state-of-the-art performance in term of both efficiency and efficacy.

There are no comments yet.

## Authors

• 7 publications
• 36 publications
• 22 publications
• 55 publications
• ### Model Function Based Conditional Gradient Method with Armijo-like Line Search

The Conditional Gradient Method is generalized to a class of non-smooth ...
01/23/2019 ∙ by Yura Malitsky, et al. ∙ 0

• ### Online and stochastic Douglas-Rachford splitting method for large scale machine learning

Online and stochastic learning has emerged as powerful tool in large sca...
08/22/2013 ∙ by Ziqiang Shi, et al. ∙ 0

• ### KL property of exponent 1/2 of ℓ_2,0-norm and DC regularized factorizations for low-rank matrix recovery

This paper is concerned with the factorization form of the rank regulari...
08/24/2019 ∙ by Shujun Bi, et al. ∙ 0

• ### A Block Decomposition Algorithm for Sparse Optimization

Sparse optimization is a central problem in machine learning and compute...
05/27/2019 ∙ by Ganzhao Yuan, et al. ∙ 0

• ### Stochastic Orthant-Wise Limited-Memory Quasi-Newton Methods

The ℓ_1-regularized sparse model has been popular in machine learning so...
04/26/2017 ∙ by Jianqiao Wangni, et al. ∙ 0

• ### Understanding Trainable Sparse Coding via Matrix Factorization

Sparse coding is a core building block in many data analysis and machine...
09/01/2016 ∙ by Thomas Moreau, et al. ∙ 0

• ### Regularized Orthogonal Tensor Decompositions for Multi-Relational Learning

Multi-relational learning has received lots of attention from researcher...
12/26/2015 ∙ by Fanhua Shang, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

In this paper, we focus on the following composite function minimization problem:

 minx f(x)≜q(x)+h(x); q(x)=12xTAx+xTb (1)

where , is a symmetric positive semidefinite matrix, and is a piecewise separable function (i.e. ) but not necessarily convex. Typical examples of include the bound constrained function and the and norm functions. We assume that is bounded below for any feasible solution .

The optimization in (1) is flexible enough to model a variety of applications of interest in both computer vision and machine learning, including compressive sensing [9], nonnegative matrix factorization [20, 22, 11], sparse coding [21, 1, 2, 35]

[15][47], subspace clustering [10], to name a few. Although we only focus on the quadratic function , our method can be extended to handle general non-quadratic composite functions by considering a Newton approximation of the objective [42, 50] and to solve general linear constrained problems by using its associated augmented Lagrangian function of the problem [12, 13].

The most popular method for solving problem (1) is perhaps the proximal gradient method [31, 3]. It considers a fixed-point proximal iterative procedure based on the current gradient . Here the proximal operator can often be evaluated analytically, is the step size with being the local (or global) Lipschitz constant. It is guaranteed to decrease the objective at a rate of , where is the iteration number. The accelerated proximal gradient method can further boost the rate to

. Tighter estimates of the local Lipschitz constant leads to better convergence rate, but it scarifies additional computation overhead to compute

. Our method is also a fixed-point iterative method, but it does not rely on a sparse eigenvalue solver or line search backtracking to compute such a Lipschitz constant, and it can exploit the specified structure of the quadratic Hessian matrix

.

The proposed method is essentially a generalization of the classical Gauss-Seidel (GS) method and Successive Over-Relaxation (SOR) method [8, 37]. In numerical linear algebra, the Gauss-Seidel method, also known as the successive displacement method, is a fast iterative method for solving a linear system of equations. It works by solving a sequence of triangular matrix equations. The method of SOR is a variant of the GS method and it often leads to faster convergence. Similar iterative methods for solving linear systems include the Jacobi method and symmetric SOR. Our proposed method can solve versatile composite function minimization problems, while inheriting the efficiency of modern linear algebra techniques.

Our method is closely related to coordinate gradient descent and its variants such as randomized coordinate descent [15, 34], cyclic coordinate descent [39], block coordinate descent [30, 4, 14], randomized block coordinate descent [36, 26], accelerated randomized coordinate descent [30, 23, 25] and others [24, 18, 52]. However, all these work are based on gradient-descent type iterations and a constant Lipschitz step size. They work by solving a first-order majorization/surrogate function via closed form updates. Their algorithm design and convergence result cannot be applied here. In contrast, our method does not rely on computing the Lipschicz constant step size, yet it adopts a triangle matrix factorization strategy, where the triangle subproblem can be solved by an alternating cyclic coordinate strategy.

We are aware that matrix splitting algorithm has been considered to solve symmetric linear complementarity problems [27, 28, 17] and second-order cone complementarity problems [53] in the literature. However, we focus on minimizing a general separable nonsmooth composite function which is different from theirs. In addition, our algorithm is derived from a novel triangle operator mapping, which can be computed exactly using a new Gaussian elimination procedure. It is worthwhile to mention that matrix splitting has been extended to operator splitting recently to solve multi-term nonsmooth convex composite optimization problems [38].

Contributions. (i) We propose a new Generalized Matrix Splitting Algorithm (GMSA) for composite function minimization (See Section 2). Our method is derived from a novel triangle proximal operator (See Subsection 2.1). We establish the global convergence, convergence rate, and iteration complexity of GMSA for convex problems (See Subsection 2.2). (ii) We discuss several important extensions of GMSA (see Section 3). First, we consider a new correction strategy to achieve pointwise contraction for the proposed method (See Subsection 3.1). Second, we discuss using Richardson extrapolation technique to further accelerate GMSA (See Subsection 3.2). Third, we extend GMSA to solve nonconvex problems with global convergence guarantee (See Subsection 3.3). Fourth, we discuss how to adapt GMSA to minimize non-quadratic functions (See Subsection 3.4). Fifth, we show how to incorporate GMSA into the general optimization framework of Alternating Direction Method of Multipliers (ADMM) (See Subsection 3.5). (iii) Our extensive experiments on nonnegative matrix factorization, sparse coding and Danzig selectors have shown that GMSA achieves state-of-the-art performance in term of both efficiency and efficacy (See Section 4). A preliminary version of this paper appeared in [49].

Notation. We use lowercase and uppercase boldfaced letters to denote real vectors and matrices respectively. The Euclidean inner product between and is denoted by or . We denote , , and as the spectral norm (i.e.

the largest singular value) of

. We denote the element of vector as and the element of matrix as . is a column vector formed from the main diagonal of . and indicate that the matrix is positive semidefinite and positive definite, respectively. Here is not necessarily symmetric 111. We denote as a diagonal matrix of and as a strictly lower triangle matrix of 222For example, when , and take the following form:
. Thus, we have . Throughout this paper, denotes the value of at -th iteration if is a variable, and denotes the -th power of if is a constant scalar. We use to denote any solution of the optimal solution set of (1). For notation simplicity, we denote:

 rk≜xk−x∗, dk≜xk+1−xkuk≜f(xk)−f(x∗), fk≜f(xk), f∗≜f(x∗)

## 2 Proposed Algorithm

This section presents our proposed Generalized Matrix Splitting Algorithm (GMSA) for solving (1). Throughout this section, we assume that is convex and postpone the discussion for nonconvex to Section 3.3.

Our solution algorithm is derived from a fixed-point iterative method based on the first-order optimal condition of (1). It is not hard to validate that a solution is the optimal solution of (1) if and only if satisfies the following nonlinear equation (“” means define):

 0∈∂f(x)=∇q(x)+∂h(x)=Ax+b+∂h(x) (2)

where and denote the gradient of and the sub-gradient of in , respectively. In numerical analysis, a point is called a fixed point if it satisfies the equation , for some operator . Converting the transcendental equation algebraically into the form , we obtain the following iterative scheme with recursive relation:

 xk+1∈T(xk), k=0,1,2,... (3)

We now discuss how to adapt our algorithm into the iterative scheme in (3). First, we split the matrix in (2) using the following strategy:

 A=L+1ωD+ϵIB+LT+ω−1ωD−ϵIC (4)

Here, is a relaxation parameter and is a parameter for strong convexity that enforces . These parameters are specified by the user beforehand. Using these notations, we obtain the following optimality condition which is equivalent to (2):

 −Cx−b∈(B+∂h)(x)

Then, we have the following equivalent fixed-point equation:

 x∈T(x;A,b,h)≜(B+∂h)−1(−Cx−b) (5)

For notation simplicity, we denote as since can be known from the context.

We name the triangle proximal operator, which is novel in this paper333This is in contrast with Moreau’s proximal operator [33]: , where the mapping is called the resolvent of the subdifferential operator .. Due to the triangle property of the matrix and the element-wise separable structure of , the triangle proximal operator in (5) can be computed exactly and analytically, by a generalized Gaussian elimination procedure (discussed later in Section 2.1). Our generalized matrix splitting algorithm iteratively applies until convergence. We summarize our algorithm in Algorithm 1.

In what follows, we show how to compute in (5) in Section 2.1, and then we study the convergence properties of Algorithm 1 in Section 2.2.

### 2.1 Computing the Triangle Proximal Operator

We now present how to compute the triangle proximal operator in (5), which is based on a new generalized Gaussian elimination procedure. Notice that (5) seeks a solution that satisfies the following nonlinear system:

 0∈Bz∗+u+∂h(z∗), % where u=b+Cxk (6)

By taking advantage of the triangular form of and the element-wise/decomposable structure of , the elements of can be computed sequentially using forward substitution. Specifically, the above equation can be written as a system of nonlinear equations:

 0∈⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣B1,10000B2,1B2,2000⋮⋮⋱00Bn−1,1Bn−1,2⋯Bn−1,n−10Bn,1Bn,2⋯Bn,n−1Bn,n⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣z∗1z∗2⋮z∗n−1z∗n⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦+u+∂h(z∗)

If satisfies the equations above, it must solve the following one-dimensional subproblems:

 0∈Bj,jz∗j+wj+∂hj(z∗j), ∀j=1,2, ... ,n, wj=uj+∑j−1i=1Bj,iz∗i

This is equivalent to solving the following one-dimensional problem for all :

 z∗j=t∗≜argmint  12Bj,jt2+wjt+hj(t) (7)

Note that the computation of uses only the elements of that have already been computed and a successive displacement strategy is applied to find .

We remark that the one-dimensional subproblem in (7) often admits a closed form solution for many problems of interest. For example, when with denoting an indicator function on the box constraint , the optimal solution can be computed as: ; when (i.e. in the case of the norm), the optimal solution can be computed as: .

Our generalized Gaussian elimination procedure for computing is summarized in Algorithm 2. Note that its computational complexity is , which is the same as computing a matrix-vector product.

### 2.2 Convergence Analysis

In what follows, we present our convergence analysis for Algorithm 1.

The following lemma characterizes the optimality of the triangle proximal operator for any .

###### Lemma 1.

For all , it holds that:

 (i) 0∈ ∇q(T(x))+∂h(T(x))+C(x−T(x)) (8)
 (ii) h(T(x))−h(y)+⟨∇q(T(x)),T(x)−y⟩≤⟨C(T(x)−x), T(x)−y⟩ (9)
 (iii) f(T(x))−f(y)≤ ⟨C(T(x)−x),T(x)−y⟩−12∥T(x)−y∥2A (10)
###### Proof.

(i) Using the optimality of in (6), we derive the following results: , where step uses and step uses the definition of in (2).

(ii) Since is convex, we have:

 ∀s, z, h(s)−h(z)≤⟨h′,s−z⟩, ∀h′∈∂h(s). (11)

Letting , we derive the following inequalities: , where step uses (8).

(iii) Since is a quadratic function, we have:

 (12)

We naturally derive the following results: , where step uses the definition of ; step uses (12) with and ; step uses (9).

Remarks. Both (8) and (9) can be used to characterize the optimality of (1). Recall that we have the following sufficient and necessary conditions for the optimal solution: . When occurs, (8) and (9) coincide with the optimal condition and one can conclude that is the optimal solution.

###### Theorem 1.

(Proof of Global Convergence) We define and let be chosen such that . Algorithm 1 is globally convergent.

###### Proof.

(i) First, the following results hold for all :

 zT(A−2C)z= zT(B−C)z = zT(L−LT+2−ωωD+2ϵI)z = zT(2ϵI+2−ωωD)z≥δ∥z∥22 (13)

where we have used the definition of and , and the fact that .

We invoke (10) in Lemma 1 with and combine the inequality in (1) to obtain:

 fk+1−fk≤−12⟨dk,(A−2C)dk⟩≤−δ2∥dk∥22 (14)

(ii) Second, summing (14) over , we have:

 δ2∑k−1i=0∥di∥22≤f0−fk(a)≤f0−f∗ ⇒ δ2mini=0,...,k−1 ∥di∥22≤(f0−f∗)/k

where step uses the fact that . Note that is bounded below. As , we have , which implies the convergence of the algorithm. Invoking (8) in Lemma 1 with , we obtain: . The fact that implies that is the global optimal solution of the convex problem.

Note that guaranteeing can be achieved by simply choosing and setting to a small number. ∎

Remarks. (i) When is empty and , Algorithm 1 reduces to the classical Gauss-Seidel method () and Successive Over-Relaxation method (). (ii) When contains zeros in its diagonal entries, one needs to set to a strictly positive number. This is to guarantee the strong convexity of the one dimensional subproblem and a bounded solution for any in (7). The introduction of the parameter is novel in this paper and it removes the assumption that is strictly positive-definite or strictly diagonally dominant, which is used in the classical result of GS and SOR method [37, 8].

We now prove the convergence rate of Algorithm 1. We make the following assumption, which characterizes the relations between and for any .

###### Assumption 1.

If is not the optimum of (1), there exists a constant such that .

Remarks. Assumption 1 is similar to the classical local proximal error bound assumption in the literature [29, 42, 41, 51], and it is mild. Firstly, if is not the optimum, we have . This is because when , we have (refer to the optimal condition of in (8)), which contradicts with the condition that is not the optimal solution. Secondly, by the boundedness of and , there exists a sufficiently large constant such that .

We now prove the convergence rate of Algorithm 1.

###### Theorem 2.

(Proof of Convergence Rate) We define and let be chosen such that . Assuming that is bound for all , we have:

 f(xk)−f(x∗)≤(C11+C1)k[f(x0)−f(x∗)], (15) ∥xk−xk+1∥22≤2δ(C11+C1)k[f(x0)−f(x∗)]. (16)

where .

###### Proof.

Invoking Assumption 1 with , we obtain:

 ∥xk−x∗∥≤η∥xk−T(xk)∥ ⇒ ∥rk∥≤η∥dk∥ (17)

We derive the following inequalities:

 fk+1−f∗ (a)≤ ⟨rk+1,Cdk⟩−12⟨rk+1,Ark+1⟩ (18) (c)≤ −⟨rk, Bdk⟩+0−δ2∥dk∥22 (d)≤ ∥rk∥∥B∥∥dk∥−δ2∥dk∥22(e)≤ (η∥B∥−δ2)∥dk∥22 (f)≤ (η∥B∥−δ2)2δ(fk−fk+1)(g)= C1(fk−fk+1) (19)

where step uses (10) in Lemma 1 with ; step uses the fact that and ; step uses and the inequality that which is due to (1); step () uses the Cauchy-Schwarz inequality and the norm inequality ; step () uses (17); step uses the descent condition in (14); step uses the definition of .

Rearranging the last inequality in (19), we have , leading to: . Solving this recursive formulation, we obtain (15). In other words, the sequence converges to linearly in the quotient sense. Using (14), we derive the following inequalities: . Therefore, we obtain (16).

The following lemma is useful in our proof of iteration complexity.

###### Lemma 2.

Suppose a nonnegative sequence satisfies for some constant . It holds that: .

###### Proof.

The proof of this lemma can be obtained by mathematical induction. We denote . (i) When , we have . (ii) When , we assume that holds. We derive the following results: . Here, step uses ; step uses the inequality that ; step uses .

We now prove the iteration complexity of Algorithm 1.

###### Theorem 3.

(Proof of Iteration Complexity) We define and let be chosen such that . Assuming that for all , we have:

 fk−f∗≤⎧⎪⎨⎪⎩u0(2C32C3+1)k,if\leavevmode\nobreak\ √fk−fk+1≥C2C3, ∀k≤˘kC4k,if\leavevmode\nobreak\ √fk−fk+1

where , , , and is some unknown iteration index.

###### Proof.

We have the following inequalities:

 uk+1 (a)≤⟨rk+1,Cdk⟩−12⟨rk+1,Ark+1⟩ (b)≤⟨rk+dk,Cdk⟩+0 (d)≤∥rk∥⋅∥C∥⋅∥dk∥+∥C∥⋅∥dk∥22 (d)≤2R∥C∥⋅∥dk∥+∥C∥⋅∥dk∥22 (e)≤2R∥C∥⋅√2δ(uk−uk+1)+∥C∥⋅2δ⋅(uk−uk+1) (f)=C2√uk−uk+1+C3(uk−uk+1) (20)

where step uses (18); step uses the fact that ; step uses the Cauchy-Schwarz inequality and the norm inequality; step uses the fact that ; step uses (14); step uses the definition of and .

Now we consider the two cases for the recursion formula in (20): (i) for some (ii) for some . In case (i), (20) implies that we have and rearranging terms gives: . Thus, we have: . We now focus on case (ii). When , (20) implies that we have and rearranging terms yields:. Solving this quadratic inequality, we have: ; solving this recursive formulation by Lemma 2, we obtain .

Remarks. The convergence result in Theorem 3 is weaker than that in Theorem 2, however, it does not rely on Assumption 1 and the unknown constant .

We now derive the convergence rate when is strongly convex.

###### Theorem 4.

(Proof of Convergence Rate when is Strongly Convex) We define and let be chosen such that . Assuming that