Extended Gauss-Newton and Gauss-Newton-ADMM Algorithms for Low-Rank Matrix Optimization

06/10/2016 ∙ by Quoc Tran-Dinh, et al. ∙ University of North Carolina at Chapel Hill

We develop a generic Gauss-Newton (GN) framework for solving a class of nonconvex optimization problems involving low-rank matrix variables. As opposed to the standard Gauss-Newton method, our framework allows one to handle a general smooth convex cost function via its surrogate. The main per-iteration complexity consists of inverting two rank-size matrices and at most six small matrix multiplications to compute a closed-form Gauss-Newton direction, plus a backtracking linesearch. We show, under mild conditions, that the proposed algorithm converges globally and locally to a stationary point of the original nonconvex problem. We also show empirically that the Gauss-Newton algorithm achieves much more accurate solutions than the well-studied alternating direction method (ADM). Then, we specialize our Gauss-Newton framework to handle the symmetric case, where ADM is not applicable without lifting variables, and prove its convergence. Next, we incorporate our Gauss-Newton scheme into the alternating direction method of multipliers (ADMM) to design a GN-ADMM algorithm for solving the low-rank optimization problem. We prove that, under mild conditions and a proper choice of the penalty parameter, GN-ADMM globally converges to a stationary point of the original problem. Finally, we apply our algorithms to several practical problems such as low-rank approximation, matrix completion, robust low-rank matrix recovery, and matrix recovery in quantum tomography. The numerical experiments provide encouraging results that motivate the use of nonconvex optimization.


1 Introduction

Various practical models in low-rank embedded problems, function learning, matrix completion in recommender systems, inpainting and compression in image processing, robust principal component analysis in statistics, and semidefinite programming relaxations in combinatorial optimization often require recovering a low-rank matrix from either given high-rank datasets or observations corrupted by noise. Such practical models can be formulated as low-rank matrix optimization problems, see, e.g., Candes2012b ; Candes2011a ; Esser2010a ; goldfarb2011convergence ; kyrillidis2014matrix ; liu2015efficient ; wen2012solving .

1.1 Problem statement

In this paper, we consider the following class of low-rank matrix optimization problems:

(1)    min_{U ∈ R^{m×r}, V ∈ R^{n×r}} { Φ(U, V) := φ(𝒜(UV^⊤) − b) },

where φ : R^p → R ∪ {+∞} is a proper, closed and convex function; 𝒜 : R^{m×n} → R^p is a linear operator, defined as 𝒜(X) := (⟨A_1, X⟩, …, ⟨A_p, X⟩) for given matrices A_1, …, A_p in R^{m×n}; and b ∈ R^p is an observed vector. We are interested in the low-rank case, where r ≪ min{m, n}.

Clearly, (1) is nonconvex due to the bilinear term UV^⊤. Hence, it is NP-hard natarajan1995sparse , and numerical methods for solving (1) aim at obtaining a local optimum or a stationary point. While (1) covers a wide range of applications, we briefly recall some emerging problems which have recently attracted great attention. The most common case is when φ(·) = (1/2)‖·‖_2², where (1) becomes the least-squares low-rank matrix approximation problem used in compressive sensing (see, e.g., kyrillidis2014matrix ):

(2)    min_{U ∈ R^{m×r}, V ∈ R^{n×r}} (1/2) ‖𝒜(UV^⊤) − b‖_2².

Here, the linear operator 𝒜 is often assumed to satisfy a restricted isometry property (RIP) Candes2006 , which allows us to recover an exact solution from a small number of observations in b. In particular, if 𝒜 = P_Ω, the projection onto a given index subset Ω of entries, then (2) covers the matrix completion model as a special case:

(3)    min_{U ∈ R^{m×r}, V ∈ R^{n×r}} (1/2) ‖P_Ω(UV^⊤) − b_Ω‖_2²,

where b_Ω := P_Ω(M) collects the observed entries of a data matrix M. If 𝒜 is the identity operator and M is a given low-rank matrix, then (2) becomes the low-rank matrix factorization problem

(4)    min_{U ∈ R^{m×r}, V ∈ R^{n×r}} (1/2) ‖UV^⊤ − M‖_F².

In particular, if U = V and M is symmetric positive definite, then (4) reduces to

(5)    min_{U ∈ R^{n×r}} (1/2) ‖UU^⊤ − M‖_F²,

which was considered in liu2015efficient . Alternatively, with another choice in (2), (1) reduces to the case studied in Bhojanapallli2015 . While both special cases, (4) and (5), possess a closed-form solution via a truncated SVD and an eigenvalue decomposition, respectively, Gauss-Newton methods can also be applied to solve these problems. In liu2015efficient , the authors demonstrated the advantages of a Gauss-Newton method for solving (5) via several encouraging numerical examples.
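The closed forms mentioned above are easy to make concrete. The following minimal numpy sketch (an illustration, not code from the paper) computes the truncated-SVD solution of (4) and the eigenvalue-decomposition solution of (5); splitting the singular (or eigen) values symmetrically between the two factors is just one of several equivalent choices.

```python
import numpy as np

def factorize_truncated_svd(M, r):
    """Closed-form solution of (4): best rank-r factorization of M (Eckart-Young)."""
    U_full, s, Vt = np.linalg.svd(M, full_matrices=False)
    U = U_full[:, :r] * np.sqrt(s[:r])          # m x r factor
    V = Vt[:r, :].T * np.sqrt(s[:r])            # n x r factor
    return U, V                                 # U @ V.T is the best rank-r approximation

def factorize_symmetric(M, r):
    """Closed-form solution of (5) for a symmetric positive definite M."""
    w, Q = np.linalg.eigh(M)                    # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:r]               # keep the r largest eigenvalues
    return Q[:, idx] * np.sqrt(w[idx])          # U with U @ U.T close to M (rank r)

# quick check on an exactly rank-5 matrix
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 40))
U, V = factorize_truncated_svd(M, 5)
print(np.linalg.norm(U @ V.T - M))              # ~ 0 since rank(M) = 5
```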

1.2 Related work

Low-rankness is key to recasting many existing problems into new frameworks or to designing new models with the aid of regularizers that promote solution structure in concrete applications, including matrix completion (MC) Candes2012b , robust principal component analysis (RPCA) Candes2011 and their variants. Extensions to group-structured sparsity, low-rankness, tree models and tensor representations have also attracted great attention, see, e.g., fazel2002matrix ; grasedyck2013literature ; huang2011learning ; kyrillidis2015structured ; Recht2010 ; Signoretto2014 ; Yu2014 . A majority of the research on low-rank models focuses on estimating sample complexity for specific instances of (1), while various recent papers revolve around the RPCA setting, matrix completion, and their extensions Candes2011 ; Candes2012b ; johnson1990matrix . While simple convex models originating from (1) have several advantages for designing solution methods, they unfortunately do not capture the original model well. Trading off these two ingredients is key to achieving appropriate models for practical purposes.

Along with problem modeling, solution algorithms for (1) are a core step in solving practical problems in low-rank matrix completion and recovery. From an optimization viewpoint, existing methods for (1) either focus on specific applications or are limited to certain classes of problems where advanced computational techniques can be exploited. Among many others, convex optimization is perhaps one of the most powerful tools for solving several instances of (1), including MC, RPCA and their variants and extensions. Unfortunately, convex models only provide an approximation to the low-rank model (1) via convex relaxations using, e.g., nuclear or max norms, which may not adequately approximate the desired rank. Alternatively, nonconvex as well as discrete optimization methods have also been considered for solving (1), see, e.g., burer2003nonlinear ; kyrillidis2014matrix ; Lin2009 ; Shen2012 ; wen2012solving ; yu2014parallel . While these approaches work directly on the original problem (1), they can only find a local optimum or a critical point, and they depend strongly on prior knowledge of the problem, the initial points of the algorithms, and a predicted rank. Nevertheless, considerable empirical evidence supports these approaches, and, surprisingly, in many cases they outperform convex optimization approaches both in terms of “accuracy” with respect to the original model and in overall computational time kyrillidis2014matrix ; Lin2009 ; wen2012solving . Other approaches such as stochastic gradient descent, Riemannian manifold descent, greedy methods, and parallel and distributed algorithms have also been studied recently, see, e.g., bottou2010large ; johnson1990matrix ; keshavan2009gradient ; vandereycken2013low ; yu2014parallel .

1.3 Motivation and contributions

Optimization researchers have observed that Gauss-Newton (GN) methods work extremely well for nonlinear least-squares problems Bjorck1996 . For a quadratic objective φ, when the residual term in (1) is small or zero at solutions, GN methods can achieve a local superlinear or even quadratic convergence rate. With a “good” initial guess (i.e., one close to the solution set), GN methods often reach a stationary point within a few iterations Deuflhard2006 . In many practical problems, we can often predict a “good” initial point using prior knowledge of the problem and the underlying algorithm (e.g., steady states of dynamical systems, or previous iterates of an algorithm) as a warm-start.

As in classical GN methods, we design an iterative scheme for solving (1) based on a linearization of the residual term and a quadratic surrogate of φ. At each iteration, it requires solving a simple convex problem to form a GN search direction, which is then combined with a globalization strategy to compute the next iterate. In our setting, computing the GN search direction reduces to solving a linear least-squares problem, similarly to the alternating direction method (ADM) wen2012solving . While ADM alternately solves for U with V fixed and vice versa, GN solves for U and V simultaneously using a linearization of the bilinear term. We have observed (cf. Subsection 7.1) that this joint linearization can provide a much better local approximation of the objective than the alternating form; in such regimes ADM saturates and does not significantly improve the objective values. In addition, without regularization, ADM may fail to converge, as shown by a counterexample in gonzalez2005accelerating . Moreover, ADM is not applicable to the symmetric case of (1), where U = V, without reformulation or introducing lifting variables, whereas the GN method is.

While Gauss-Newton methods have been widely used for nonlinear least squares Nocedal2006 , they remain largely unexploited in nonlinear matrix optimization. Our aim in this paper is to extend the GN method to the class of problems (1) in two respects: general convex objectives φ, and low-rank matrix variable settings. This work is also inspired by the recent work liu2015efficient , in which the authors considered a simple symmetric instance of (1) and reported very encouraging numerical results for GN methods compared with other state-of-the-art algorithms.

Our contributions:

Our specific contributions can be summarized as follows:

  • We extend the Gauss-Newton method to low-rank matrix optimization problems of the form (1) involving a general convex objective. We point out the existence of GN directions and provide their closed-form formulation. We show empirically that our GN method achieves much more accurate solutions than the commonly used alternating direction methods within the same number of iterations.

  • We show that there exists an explicit step-size that guarantees a descent property of the Gauss-Newton direction. This allows us to perform a simple backtracking linesearch procedure to guarantee global convergence of the proposed method. We then specialize our framework to the symmetric case with a global convergence guarantee.

  • We prove local linear and quadratic convergence rates of the full-step Gauss-Newton variant under standard assumptions imposed on (1) at its solution set.

  • We then develop a joint treatment of ADMM and the Gauss-Newton method to obtain a new algorithm for (1) that handles a general objective φ. Under standard assumptions on (1), we prove global convergence of the proposed algorithm.

Unlike alternating direction methods, which only achieve sublinear convergence even with good initial points, GN methods may require slightly more computation to form GN directions, but they can achieve a fast local linear or quadratic convergence rate, which is key for online and real-time implementations. Gradient-descent-based methods can also achieve local linear convergence, but they often require much stronger assumptions on (1). In contrast, GN methods work in “small residual” settings under mild assumptions, and can easily achieve highly accurate solutions.

1.4 Outline of the paper

The rest of this paper is organized as follows. We first review basic concepts related to problem (1) in Section 2. Section 3 presents a linesearch Gauss-Newton method for solving (1) and proves its convergence properties. Section 4 designs a Gauss-Newton ADMM algorithm for solving (1) and investigates its global convergence. Section 5 specializes the GN algorithm to the symmetric case and proves its convergence. Section 6 discusses implementation aspects of our algorithms and their extension to nonsmooth objective functions. Numerical experiments on several applications in different fields are conducted in Section 7. For clarity of presentation, we move all proofs from the main text to the appendix.

2 Basic notations and optimality condition

We briefly state some basic notations, the optimality condition for (1), and the fundamental assumption.

2.1 Basic notation and concepts

For a matrix X, σ_min(X) and σ_max(X) denote its smallest positive and largest singular values, respectively. If X is symmetric, then λ_min(X) and λ_max(X) denote its smallest and largest eigenvalues, respectively. We use the standard notation for the singular value decomposition (SVD) and for the eigenvalue decomposition. We denote by X^† the Moore–Penrose pseudo-inverse of X; in particular, X^† = (X^⊤X)^{-1}X^⊤ when X has full column rank. We also write P_X for the projection onto the range space of X and P_X^⊥ for its orthogonal complement, i.e., P_X^⊥ = I − P_X, where I is the identity matrix. Clearly, P_X X = X and P_X^⊥ X = 0.

For a matrix X, vec(X) denotes its vectorization, obtained by stacking its columns into a single vector. Given the vec operator, we define mat as its inverse, so that mat(vec(X)) = X. For two matrices A and B, A ⊗ B denotes their Kronecker product. The standard vec–Kronecker identities are used throughout. For a linear operator 𝒜, 𝒜^* denotes its adjoint operator.
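A quick numpy check of the two standard identities presumably meant here, vec(AXB) = (B^⊤ ⊗ A) vec(X) and the mixed-product rule (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD), is given below (an illustration, not part of the paper; note the column-major vectorization):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 2))

vec = lambda M: M.reshape(-1, order="F")               # column-stacking vectorization
# vec(A X B) = (B^T kron A) vec(X)
print(np.allclose(vec(A @ X @ B), np.kron(B.T, A) @ vec(X)))

C = rng.standard_normal((4, 3))
D = rng.standard_normal((2, 6))
E = rng.standard_normal((6, 2))
# mixed-product rule: (A kron D)(C kron E) = (A C) kron (D E)
print(np.allclose(np.kron(A, D) @ np.kron(C, E), np.kron(A @ C, D @ E)))
```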

2.2 Optimality condition and fundamental assumptions

We define z := (U, V) as the joint variable in (1). First, we assume that φ in (1) is smooth. Then, the optimality condition of (1) can be written as follows:

(6)

Any point z satisfying (6) is called a stationary point of (1), and we denote by S^* the set of stationary points of (1). Since (1) is nonconvex, the solution of (6) is generally nonunique. Our aim is to design algorithms that generate a sequence converging to S^*, relying on the following fundamental assumption.

Assumption A. 2.1

Problem (1) satisfies the following conditions:

  (a) φ is L_φ-smooth and μ_φ-convex, i.e., there exist constants L_φ ≥ μ_φ ≥ 0 such that, for all x and y,

(7)    (μ_φ/2)‖y − x‖² ≤ φ(y) − φ(x) − ⟨∇φ(x), y − x⟩ ≤ (L_φ/2)‖y − x‖².

  (b) The set S^* of stationary points of (1) is nonempty.

If ∇φ is Lipschitz continuous with the Lipschitz constant L_φ, then (7) holds Nesterov2004 . We allow μ_φ = 0, which also covers the non-strongly convex case. Since φ is smooth and 𝒜 is linear, the objective of (1) is also smooth. In addition, Assumption A.2.1(a) covers a wide range of applications including the logistic loss, the Huber loss and entropy-type losses in nonlinear regression and machine learning Boyd2011 .
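As a small illustration (not code from the paper), the Huber loss mentioned above can be written together with its gradient in a few lines; its gradient is 1-Lipschitz, so L_φ = 1, and it is convex but not strongly convex, so μ_φ = 0, matching Assumption A.2.1(a):

```python
import numpy as np

def huber(x, delta=1.0):
    """Huber loss phi(x) = sum_i h(x_i) and its gradient.

    h is quadratic for |x_i| <= delta and linear beyond, so the gradient is
    1-Lipschitz (L_phi = 1) and phi is convex but not strongly convex (mu_phi = 0).
    """
    a = np.abs(x)
    quad = a <= delta
    value = np.where(quad, 0.5 * x**2, delta * (a - 0.5 * delta)).sum()
    grad = np.where(quad, x, delta * np.sign(x))
    return value, grad

# example: phi applied to a residual vector such as A(U V^T) - b
r = np.array([0.3, -2.0, 0.7])
print(huber(r))    # (value, gradient)
```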

3 Line-search Gauss-Newton method

We develop here a linesearch Gauss-Newton (Ls-GN) algorithm for solving (1) with a convergence guarantee.

3.1 Forming a surrogate of the objective

By Assumption A.2.1, it follows from (7) and the structure of the objective in (1) that

(8)

holds for any pair of points, where ∇φ denotes the gradient of φ and L_φ is the Lipschitz constant of this gradient.

Descent optimization methods rely on finding a descent direction of the objective by approximately minimizing the right-hand-side surrogate in (8). Unfortunately, this surrogate remains nonconvex due to the bilinear term UV^⊤. Our next step is to linearize this bilinear term around a given point as follows:

(9)

Then, the minimization of the right-hand side surrogate (8) is approximated by

(10)

This problem is a linear least-squares problem and can be solved by standard linear algebra.

3.2 Computing Gauss-Newton direction

Let , and . Then, we rewrite (10) as

(11)

The optimality condition of (11) becomes

(12)

As usual, we refer to (12) as the normal equation of (11). Lemma 1 below gives a closed-form solution of (12); its proof is in Appendix A.1.

Lemma 1

The rank of the square linear system (12) does not exceed . In addition, (12) has a solution. If , then the solution of (12) is given explicitly by

(13)

which forms a linear subspace in , where and are the Moore-Penrose pseudo inverses of and , respectively, and is an arbitrary matrix.

In particular, if we choose , then

(14)

Moreover, the optimal value of (11) is .

Lemma 1 also shows that if either is in the null space of or is in the null space of , then . We note that (14) only gives us one choice for . We can choose to obtain a simple GN search direction.
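To make Lemma 1 concrete, consider the simplest instance of (1), namely φ(·) = (1/2)‖·‖² with the identity operator (problem (4)). The sketch below (an illustration, not the paper's exact formula (14)) computes one member of the solution subspace described in Lemma 1 using only the two r × r Gram inverses (U^⊤U)^{-1} and (V^⊤V)^{-1}, and checks numerically that the linearized residual is stationary, which is the specialization of the normal equation (12) to this instance.

```python
import numpy as np

def gn_direction(U, V, M):
    """One solution of the linearized subproblem
         min_{dU, dV} 0.5 * || U V^T + dU V^T + U dV^T - M ||_F^2,
    i.e. a GN direction for problem (4). Only the two r x r Gram matrices
    U^T U and V^T V are inverted (via linear solves)."""
    R = M - U @ V.T                                   # current residual
    dV = np.linalg.solve(U.T @ U, U.T @ R).T          # dV^T = (U^T U)^{-1} U^T R
    G = np.linalg.solve(V.T @ V, (R @ V).T).T         # G = R V (V^T V)^{-1}
    dU = G - U @ np.linalg.solve(U.T @ U, U.T @ G)    # keep only the part outside range(U)
    return dU, dV

# numerical check: the linearized residual is orthogonal to the model directions
rng = np.random.default_rng(0)
m, n, r = 30, 20, 4
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n)) + 0.01 * rng.standard_normal((m, n))
U, V = rng.standard_normal((m, r)), rng.standard_normal((n, r))
dU, dV = gn_direction(U, V, M)
E = U @ V.T + dU @ V.T + U @ dV.T - M
print(np.linalg.norm(E @ V), np.linalg.norm(E.T @ U))   # both ~ 1e-13
```

Because the minimizer of (11) is nonunique (Lemma 1), other equally valid directions differ from this one by an element of the solution subspace; the paper's formula (14) corresponds to one particular such choice, which need not coincide with the choice above.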

Remark 1

Let . If we assume that , then and

Clearly, is a linear subspace, and its dimension is .

3.3 The damped-step Gauss-Newton scheme

Using Lemma 1, we can form a damped-step Gauss-Newton scheme as follows:

(15)

where the direction defined in (14) is the Gauss-Newton direction, and the step-size is determined in the next lemma.

Since the GN direction obtained by solving (11) is not unique, we need to choose an appropriate one that is a descent direction of the objective at the current point. We prove in the following lemma that the GN direction computed by (14) is indeed such a descent direction. The proof of this lemma is given in Appendix A.2.

Lemma 2

Let be a non-stationary point of (1) and be computed by (14). If and is chosen such that with

(16)

then we have

(17)

where , , , and . Consequently, is a descent direction of .

Lemma 2 shows that if the residual term is sufficiently small near the current iterate, then we obtain a full step, since both of the corresponding quantities are small.

The existence of the GN direction in Lemma 1 requires U and V to have full rank. This condition is established in the following lemma, whose proof is in Appendix A.3.

Lemma 3

If , then updated by (15) using the step-size in (16) satisfies

(18)

Hence, (15) preserves the rank of and , i.e., .

3.4 The algorithm and its global convergence

Theoretically, we can use the step-size from Lemma 2 in (15). However, in practice, computing this step-size is computationally expensive. We instead combine the GN scheme (15) with an Armijo backtracking linesearch to find an appropriate step-size.

Given , find the smallest number such that and

(19)

where and are given (e.g., and ).

By Lemma 2, this linesearch procedure terminates after a finite number of iterations:

(20)

where is given by (16). We now present the complete linesearch Gauss-Newton algorithm for solving (1) as Algorithm 1.

1:  Initialization:
2:  Given a tolerance . Choose . Set and .
3:  for  to  do
4:      GN direction: Let . Compute :
5:     Stopping criterion: If stopping_criterion, then TERMINATE.
6:      Backtracking linesearch: Find the smallest number such that
Set .
7:     Update as and .
8:  end for
Algorithm 1 (Linesearch Gauss-Newton algorithm (Ls-GN))
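To make the structure of Algorithm 1 concrete, here is a minimal Python sketch of the Ls-GN loop (an illustration, not the paper's implementation). The objective Phi and the direction oracle gn_direction are supplied as callables, e.g. Phi(U, V) = 0.5 * ||U V^T - M||_F^2 together with the closed-form direction sketched after Lemma 1; the tolerance tol, the sufficient-decrease constant c and the shrinking factor nu stand in for the elided parameters of Steps 2 and 6 and are illustrative defaults only.

```python
import numpy as np

def ls_gn(Phi, gn_direction, U, V, tol=1e-8, max_iter=100, c=1e-4, nu=0.5):
    """Sketch of Algorithm 1 (Ls-GN): GN direction plus Armijo backtracking."""
    f = Phi(U, V)
    for k in range(max_iter):
        dU, dV = gn_direction(U, V)                     # Step 4: GN direction
        decrease = np.linalg.norm(dU) ** 2 + np.linalg.norm(dV) ** 2
        if np.sqrt(decrease) <= tol:                    # Step 5: stopping criterion
            break
        alpha = 1.0                                     # Step 6: backtracking linesearch;
        while Phi(U + alpha * dU, V + alpha * dV) > f - c * alpha * decrease:
            alpha *= nu                                 # a standard Armijo test is used here,
            if alpha < 1e-12:                           # the exact condition (19) may differ
                break
        U, V = U + alpha * dU, V + alpha * dV           # Step 7: update the iterate
        f = Phi(U, V)
    return U, V
```

For the matrix completion instance (3), Phi would evaluate the residual only on the observed entries, in line with the per-iteration complexity remarks that follow.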

Complexity-per-iteration:

The main steps of Algorithm 1 are Steps 4 and 6, where we need to compute the GN direction and to perform the linesearch routine, respectively.

 We can show that computing requires two inverses and of the size , and two matrix-matrix multiplications (of the size or ).

 Evaluating requires one matrix-matrix multiplication and one evaluation . When is a subset projection (e.g., in matrix completion), we can compute instead of the full matrix .

 Each step of the linesearch needs one matrix-matrix multiplication and one evaluation of . It requires at most linesearch iterations. However, we observe from numerical experiments that often varies from to on average.

Global convergence:

Since (1) is nonconvex, we can only expect the sequence generated by Algorithm 1 to converge to a stationary point. However, Lemma 3 only guarantees the full-rankness of the iterates at each iteration; full-rankness may still be lost in the limit. In order to prove global convergence of Algorithm 1, we require one additional condition: there exists a constant such that:

(21)

Under Assumption A.2.1, the following sublevel set of is bounded for given :

We prove in Appendix A.4 the following global convergence result for Algorithm 1.

Theorem 3.1

Let with be the sequence generated by Algorithm 1. Then, under Assumption A.2.1, we have

(22)

If, in addition, condition (21) holds and the sequence is bounded, then

(23)

There exists a limit point of the generated sequence, and any limit point belongs to S^*.

3.5 Local linear convergence without strong convexity

We prove local convergence of the full-step Gauss-Newton scheme (15), i.e., with unit step-size. In general, problem (1) does not satisfy the standard regularity assumption: the Jacobian of the residual term in (1), expressed via the matrix form of the linear operator 𝒜, does not have full column rank. However, we can still guarantee fast local convergence under the following conditions:

Assumption A. 3.1

Problem (1) satisfies the following conditions:

  (a) is twice differentiable on a neighborhood of , and its Hessian is Lipschitz continuous in with the constant .

 (b) The Hessian of at satisfies

(24)

where , , and .

Assumption A.3.1(b) relates to a “small residual” condition. For instance, if φ(·) = (1/2)‖·‖² and 𝒜 is the identity operator, then the residual term becomes UV^⊤ − M and the objective is (1/2)‖UV^⊤ − M‖_F². In this case, condition (24) holds whenever the residual at the solution is small enough (i.e., we are in a “small residual” regime).

We now prove in Appendix A.7 the local convergence of the full-step variant of (15).

Theorem 3.2

Let with be the sequence generated by (15) with step-size , and be a given stationary point of (1) such that . Assume that Assumptions A.2.1 and A.3.1 hold. Then, there exists a neighborhood of and a constant such that

(25)

Consequently, in the zero-residual case of (24), there exists a constant such that, for any starting point sufficiently close to the stationary point, the sequence converges quadratically to it.

In the small-residual case of (24), for any starting point sufficiently close to the stationary point, the sequence converges to it at a linear rate.

4 Gauss-Newton alternating direction method of multipliers

The GN method only works well and enjoys fast local convergence in the “small residual” case. In general, it may converge very slowly or even fail to converge. In this section, we propose to combine the GN scheme (15) with the alternating direction method of multipliers (ADMM) to develop a new algorithm for solving (1), called GN-ADMM. The main steps of this algorithm are described below.

4.1 The augmented Lagrangian and ADMM scheme

We introduce a slack variable and rewrite (1) as the following constrained problem:

(26)

We can define the augmented Lagrangian function associated with (26) as

(27)

where is a penalty parameter and is the corresponding Lagrange multiplier.

Next, we apply the standard ADMM scheme to (26), which leads to the following steps:

(28a)
(28b)
(28c)

Obviously, both subproblems (28a) and (28b) remain computationally intensive. While (28a) is nonconvex, (28b) is smooth and convex. Without any further modification of (28), convergence theory for this nonconvex ADMM scheme can be found in several recent papers, including li2015global ; wang2015global ; wen2012solving . However, (28) remains impractical since (28a) and (28b) cannot be solved in closed form or to high accuracy. Our next step is to solve these two subproblems approximately.

4.2 Approximation of the alternating steps

Gauss-Newton step for the -subproblem (28a):

We first apply one step of the GN scheme (15) to solve (28a) approximately, as follows. We approximate (28a) by using a quadratic upper bound of its smooth part together with the linearization of the bilinear term, and then solve

(29)

Here, the Lipschitz constant can be computed by a power method Golub1996 . Using Lemma 1, we can compute the solution of (29) as

(30)

The corresponding objective value is . Then, we update as

(31)

where is a given step-size computed by a linesearch procedure as in (19).

Gradient step for the -subproblem (28b):

If φ does not have a tractable proximal operator, we approximate (28b) by one gradient step as

(32)

where . Solving (32) directly, we get

(33)

4.3 The Gauss-Newton ADMM algorithm

By putting (31), (28b) or (33), and (28c) together, we obtain the following GN-ADMM scheme with two options:

(34)
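To illustrate the overall structure of (34), here is a hypothetical numpy sketch of one GN-ADMM loop. For readability it assumes the identity operator, so the slack is a matrix X tied to the factors by UV^⊤ − B = X (one natural way to introduce the slack in (26), though the paper's exact splitting may differ), uses the Huber loss as φ with illustrative parameters rho and L_phi, and takes a full GN step in the (U, V)-block where the actual scheme would apply the linesearch (35).

```python
import numpy as np

def gn_direction(U, V, M):
    """Closed-form GN direction for 0.5*||U V^T - M||_F^2 (same form as the earlier sketch)."""
    R = M - U @ V.T
    dV = np.linalg.solve(U.T @ U, U.T @ R).T
    G = np.linalg.solve(V.T @ V, (R @ V).T).T
    return G - U @ np.linalg.solve(U.T @ U, U.T @ G), dV

# gradient of the Huber loss, a smooth convex phi with a 1-Lipschitz gradient
huber_grad = lambda x, d=1.0: np.where(np.abs(x) <= d, x, d * np.sign(x))

def gn_admm(B, r, rho=1.0, L_phi=1.0, iters=100, seed=0):
    """Hypothetical GN-ADMM sketch for min phi(X) s.t. U V^T - B = X (identity operator)."""
    rng = np.random.default_rng(seed)
    m, n = B.shape
    U, V = rng.standard_normal((m, r)), rng.standard_normal((n, r))
    X, Y = U @ V.T - B, np.zeros_like(B)                # slack and multiplier
    for _ in range(iters):
        # (U, V)-step: one full GN step on 0.5*rho*||U V^T - (B + X - Y/rho)||_F^2;
        # the scheme (34) would use the backtracking linesearch (35) here instead
        dU, dV = gn_direction(U, V, B + X - Y / rho)
        U, V = U + dU, V + dV
        C = U @ V.T - B
        # X-step: one gradient step on phi plus the augmented quadratic,
        # solved in closed form (in the spirit of (33))
        X = (L_phi * X - huber_grad(X) + rho * (C + Y / rho)) / (L_phi + rho)
        # dual update (28c)
        Y = Y + rho * (C - X)
    return U, V, X
```

The Huber loss is used here only because it is a simple smooth convex φ satisfying Assumption A.2.1; any φ with an inexpensive gradient fits the same template, and a φ with a tractable proximal operator could use the exact update (28b) instead.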

Clearly, computing the step-size in (34) via Lemma 2 is impractical. Similarly to Algorithm 1, we find an appropriate step-size by a backtracking linesearch as

(35)

where and with and