Various practical models in low-rank embedded problems, function learning, matrix completion in recommender systems, inpainting and compression in image processing, robust principal component analysis in statistics, and semidefinite programming relaxations in combinatorial optimization often require recovering a low-rank matrix from either given high-rank datasets or observations corrupted by noise. Such practical models can be formulated as low-rank matrix optimization problems, see, e.g., Candes2012b ; Candes2011a ; Esser2010a ; goldfarb2011convergence ; kyrillidis2014matrix ; liu2015efficient ; wen2012solving .
1.1 Problem statement
In this paper, we consider the following class of low-rank matrix optimization problems:
where $\phi$ is a proper, closed, and convex function; $\mathcal{A}$ is a linear operator acting on matrices in $\mathbb{R}^{m\times n}$; and $b$ is an observed vector. We are interested in the low-rank case, where $r \ll \min\{m,n\}$.
Clearly, (1) is nonconvex due to the bilinear term $UV^\top$. Hence, it is NP-hard natarajan1995sparse , and numerical methods for solving (1) aim at obtaining a local optimum or a stationary point. While (1) covers a wide range of applications, we briefly recall some emerging problems that have recently attracted great attention. The most common case is when $\phi(\cdot) = \tfrac{1}{2}\|\cdot\|_2^2$, where (1) becomes a least-squares low-rank matrix approximation problem arising in compressive sensing (see, e.g., kyrillidis2014matrix ):
Here, the linear operator $\mathcal{A}$ is often assumed to satisfy a restricted isometry property (RIP) Candes2006 , which allows us to recover an exact solution from a small number of observations in $b$. In particular, if $\mathcal{A} = \mathcal{P}_\Omega$, the projection onto a given index subset $\Omega$, then (2) covers the matrix completion model as a special case:
where the data consists of the observed entries of the unknown matrix indexed by $\Omega$. If $\mathcal{A}$ is the identity operator and $B$ is a given low-rank matrix, then (2) becomes the low-rank matrix factorization problem
Especially, if the given matrix $B$ is symmetric positive definite and we take $V = U$, then (4) reduces to
Although (4) and (5) possess closed-form solutions via a truncated SVD and an eigenvalue decomposition, respectively, Gauss-Newton methods can also be applied to solve these problems. In liu2015efficient , the authors demonstrated the advantages of a Gauss-Newton method for solving (5) via several encouraging numerical examples.
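As a concrete illustration of the closed form mentioned above, the following sketch (with hypothetical NumPy variable names) solves the factorization problem (4) by a rank-$r$ truncated SVD, which by the Eckart-Young theorem is exact whenever the given matrix has rank at most $r$:

```python
import numpy as np

def truncated_svd_factorization(B, r):
    """Rank-r factorization B ~ U V^T via truncated SVD (Eckart-Young).
    This is a sketch; absorbing the singular values into U is one of
    many equivalent choices of splitting."""
    P, s, Qt = np.linalg.svd(B, full_matrices=False)
    U = P[:, :r] * s[:r]   # scale the leading left singular vectors
    V = Qt[:r, :].T        # leading right singular vectors
    return U, V

# For a matrix of rank <= r, the factorization is exact.
rng = np.random.default_rng(0)
B = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 6))  # rank 3
U, V = truncated_svd_factorization(B, 3)
err = np.linalg.norm(U @ V.T - B)
```

For the symmetric case (5), the analogous closed form replaces the SVD by an eigenvalue decomposition of the symmetric matrix.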
1.2 Related work
Low-rankness is key to recasting many existing problems into new frameworks or to designing new models with the aid of regularizers that promote solution structures in concrete applications, including matrix completion (MC) Candes2012b , robust principal component analysis (RPCA) Candes2011
and their variants. So far, extensions to group structured sparsity, low-rankness, tree models, and tensor representations have attracted great attention, see, e.g., fazel2002matrix ; grasedyck2013literature ; huang2011learning ; kyrillidis2015structured ; Recht2010 ; Signoretto2014 ; Yu2014
. A majority of research on low-rank models focuses on estimating sample complexity for specific instances of (1), while various recent papers revolve around RPCA settings, matrix completion, and their extensions Candes2011 ; Candes2012b ; johnson1990matrix . While simple convex models originating from (1) have several advantages for designing solution methods, they unfortunately do not capture the original model well. Trading off these two ingredients is key to achieving appropriate models for practical purposes.
Along with problem modeling, solution algorithms for (1) are a core step in solving practical problems in low-rank matrix completion and recovery. From an optimization-based viewpoint, we observe that existing methods for (1) either focus on specific applications or are limited to certain classes of problems where advanced computational techniques can be exploited. Among many others, convex optimization is perhaps one of the most powerful tools for solving several instances of (1), including MC, RPCA, and their variants and extensions. Unfortunately, convex models only provide an approximation to the low-rank model (1) through convex relaxations using, e.g., nuclear or max norms, which may not adequately approximate the desired rank. Alternatively, nonconvex as well as discrete optimization methods have also been considered for solving (1), see, e.g., burer2003nonlinear ; kyrillidis2014matrix ; Lin2009 ; Shen2012 ; wen2012solving ; yu2014parallel . While these approaches work directly on the original problem (1), they can only find a local optimum or a critical point, and they strongly depend on prior knowledge of the problem, the initial points of the algorithms, and a predicted rank. However, substantial empirical evidence supports these approaches, and surprisingly, in many cases they outperform convex optimization approaches in terms of both “accuracy” with respect to the original model and overall computational time kyrillidis2014matrix ; Lin2009 ; wen2012solving
. Other approaches such as stochastic gradient descent, Riemannian manifold descent, greedy methods, and parallel and distributed algorithms have also been studied recently, see, e.g., bottou2010large ; johnson1990matrix ; keshavan2009gradient ; vandereycken2013low ; yu2014parallel .
1.3 Motivation and contributions
Optimization researchers have observed that Gauss-Newton (GN) methods work extremely well for nonlinear least-squares problems Bjorck1996 . With a quadratic objective $\phi$, when the residual term in (1) is small or zero at solutions, GN methods can achieve a local superlinear or even quadratic convergence rate. With a “good” initial guess (i.e., one close to the solution set), GN methods often reach a stationary point within a few iterations Deuflhard2006 . In many practical problems, we can often predict a “good” initial point as a warm-start using prior knowledge of the problem and the underlying algorithm (e.g., steady states of dynamical systems, or previous iterations of algorithms).
As in classical GN methods, we design an iterative scheme for solving (1) based on the linearization of the residual term and a quadratic surrogate of $\phi$. At each iteration, it requires solving a simple convex problem to form a GN search direction, and then incorporates a globalization strategy to compute the next iterate. In our setting, computing the GN search direction reduces to solving a linear least-squares problem, which is similar to the alternating direction method (ADM) wen2012solving . While ADM alternately solves for each factor $U$ and $V$, GN simultaneously solves for $U$ and $V$ using the linearization of the bilinear term $UV^\top$. We have observed (cf. Subsection 7.1) that this linearization provides a better local approximation of $UV^\top$ than the alternating form when one of the factors is relatively large; in such cases ADM saturates and does not significantly improve the objective values. In addition, without regularization, ADM may fail to converge, as a counterexample in gonzalez2005accelerating shows. Moreover, without reformulation or introducing lifting variables, ADM is not applicable to the symmetric case of (1), where $V = U$, but the GN method is.
While Gauss-Newton methods have been widely used in nonlinear least squares Nocedal2006 , they remain mostly unexploited for matrix nonlinear optimization. Our aim in this paper is to extend the GN method to solve the class of problems (1) in two respects: a general convex objective $\phi$, and low-rank matrix variable settings. This work is also inspired by a recent work liu2015efficient , where the authors considered a simple symmetric instance of (1) and reported very encouraging numerical experiments with GN methods compared to other state-of-the-art algorithms.
Our specific contributions can be summarized as follows:
We extend the Gauss-Newton method to solve low-rank matrix optimization problems of the form (1) involving a general convex objective. We point out the existence of GN directions and provide their closed-form formulation. We show empirically that our GN method achieves much more accurate solutions than commonly used alternating direction methods within the same number of iterations.
We show that there exists an explicit step-size that guarantees a descent property of the Gauss-Newton direction. This allows us to perform a simple backtracking linesearch procedure to guarantee global convergence of the proposed method. We then specialize our framework to the symmetric case with a global convergence guarantee.
We prove a local linear and quadratic convergence rate of the full-step Gauss-Newton variant under standard assumptions imposed on (1) at its solution set.
Unlike alternating direction methods, which only achieve sublinear convergence even with good initial points, GN methods may require slightly more computation per GN direction, but they can achieve a fast local linear or quadratic convergence rate, which is key for online and real-time implementations. Alternatively, gradient descent-based methods can achieve local linear convergence but often require much stronger assumptions on (1). In contrast, GN methods work in “small residual” settings under mild assumptions, and can easily achieve highly accurate solutions.
1.4 Outline of the paper
The rest of this paper is organized as follows. We first review basic concepts related to problem (1) in Section 2. Section 3 presents a linesearch Gauss-Newton method for solving (1) and proves its convergence properties. Section 4 designs a Gauss-Newton ADMM algorithm for solving (1) and investigates its global convergence. Section 5 specializes the GN algorithm to the symmetric case and proves its convergence. Section 6 discusses the implementation aspects of our algorithms and their extension to nonsmooth objective functions. Numerical experiments are conducted in Section 7 with several applications in different fields. For clarity of presentation, we move all the proofs from the main text to the appendix.
2 Basic notations and optimality condition
We briefly state some basic notations, the optimality condition for (1), and the fundamental assumption.
2.1 Basic notation and concepts
For a matrix $X$, $\sigma_{\min}(X)$ and $\sigma_{\max}(X)$ denote its smallest positive and largest singular values, respectively. If $X$ is symmetric, then $\lambda_{\min}(X)$ and $\lambda_{\max}(X)$ denote its smallest and largest eigenvalues, respectively. We use the standard factorizations $X = U\Sigma V^\top$ for the SVD and $X = Q\Lambda Q^\top$ for the eigenvalue decomposition.
We denote by $X^\dagger$ the Moore-Penrose pseudo-inverse of $X$, i.e., $X^\dagger = (X^\top X)^{-1}X^\top$ when $X$ has full column rank. We also define $\mathcal{P}_X := XX^\dagger$, the projection onto the range space of $X$, and $\mathcal{P}^\perp_X := \mathbb{I} - \mathcal{P}_X$, its orthogonal complement, where $\mathbb{I}$ is an identity matrix. Clearly, $\mathcal{P}_X\mathcal{P}^\perp_X = 0$.
For a matrix $X$, $\mathrm{vec}(X)$ denotes the vectorization of $X$, obtained by stacking its columns into a single vector. Given the $\mathrm{vec}$ operator, we define its inverse $\mathrm{mat}$ such that $\mathrm{mat}(\mathrm{vec}(X)) = X$. For two matrices $X$ and $Y$, $X \otimes Y$ denotes the Kronecker product of $X$ and $Y$. We always have $\mathrm{vec}(XYZ) = (Z^\top \otimes X)\,\mathrm{vec}(Y)$ and $(X \otimes Y)^\top = X^\top \otimes Y^\top$. For a linear operator $\mathcal{A}$, $\mathcal{A}^*$ denotes its adjoint operator.
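These identities are easy to verify numerically. The snippet below (illustrative variable names) checks the standard $\mathrm{vec}$/Kronecker identity and the defining properties of the range-space projection built from the pseudo-inverse:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))
X = rng.standard_normal((3, 5))
B = rng.standard_normal((5, 2))

# vec() stacks columns; NumPy stores rows first, so use order='F'.
vec = lambda M: M.reshape(-1, order="F")

# Identity vec(A X B) = (B^T kron A) vec(X).
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)
id_err = np.linalg.norm(lhs - rhs)

# Range-space projection via the Moore-Penrose pseudo-inverse:
# P_A = A A^+ satisfies P_A A = A and P_A^2 = P_A.
P = A @ np.linalg.pinv(A)
proj_err = max(np.linalg.norm(P @ A - A), np.linalg.norm(P @ P - P))
```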
2.2 Optimality condition and fundamental assumptions
Any point satisfying (6) is called a stationary point of (1). We denote the set of stationary points of (1) by $\mathcal{S}^*$. Since (1) is nonconvex, the solution of (6) is generally nonunique. Our aim is to design algorithms for generating a sequence converging to a stationary point, relying on the following fundamental assumption.
Assumption A. 2.1
Problem (1) satisfies the following conditions:
$\phi$ is $L_\phi$-smooth and $\mu_\phi$-convex, i.e., $\phi$ satisfies, with $0 \le \mu_\phi \le L_\phi < +\infty$, that
The set of stationary points of (1) is nonempty.
If $\nabla\phi$ is Lipschitz continuous with Lipschitz constant $L_\phi$, then (7) holds Nesterov2004 . We allow $\mu_\phi = 0$, which also covers the non-strongly convex case. Since $\phi$ is smooth and $\mathcal{A}$ is linear, the objective of (1) is also smooth. In addition, Assumption A.2.1(a) covers a wide range of applications including the logistic loss, the Huber loss, and entropy functions in nonlinear regression and machine learning Boyd2011 .
3 Line-search Gauss-Newton method
We develop here a linesearch Gauss-Newton (Ls-GN) algorithm for solving (1) with a convergence guarantee.
3.1 Forming a surrogate of the objective
for any $U, \hat{U} \in \mathbb{R}^{m\times r}$ and $V, \hat{V} \in \mathbb{R}^{n\times r}$, where $\nabla\phi$ is the gradient of $\phi$, and $L_\phi$ is the Lipschitz constant of the gradient of $\phi$.
Descent optimization methods rely on finding a descent direction of the objective by approximately minimizing the right-hand-side surrogate in (8). Unfortunately, this surrogate remains nonconvex due to the bilinear term $UV^\top$. Our next step is to linearize this bilinear term around a given point $(U_k, V_k)$ as follows:
Then, the minimization of the right-hand-side surrogate in (8) is approximated by
This is a linear least-squares problem, and it can be solved by standard linear algebra.
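To make this concrete, the following sketch (assuming $\phi = \tfrac12\|\cdot\|_2^2$ and $\mathcal{A}$ taken as the identity for simplicity; variable names are illustrative) assembles the linearized problem over the unknowns $(\mathrm{vec}(U), \mathrm{vec}(V^\top))$ via Kronecker products and solves it with a standard least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, r = 6, 5, 2
B = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank-r data
Uk = rng.standard_normal((m, r))
Vk = rng.standard_normal((n, r))

# Linearized model  L(U, V) = Uk V^T + U Vk^T - Uk Vk^T.
# With unknowns vec(U) and vec(V^T), vec(U Vk^T) = (Vk kron I_m) vec(U)
# and vec(Uk V^T) = (I_n kron Uk) vec(V^T), so minimizing ||L(U,V) - B||_F
# is an ordinary linear least-squares problem.
M = np.hstack([np.kron(Vk, np.eye(m)), np.kron(np.eye(n), Uk)])
rhs = (B + Uk @ Vk.T).reshape(-1, order="F")
z, *_ = np.linalg.lstsq(M, rhs, rcond=None)

U = z[: m * r].reshape(m, r, order="F")
Vt = z[m * r :].reshape(r, n, order="F")
lin_res = np.linalg.norm(Uk @ Vt + U @ Vk.T - Uk @ Vk.T - B)
old_res = np.linalg.norm(Uk @ Vk.T - B)  # residual at (Uk, Vk)
```

Since $(U_k, V_k)$ itself is feasible for the linearized model, the least-squares residual can never exceed the current residual. In practice one exploits the low-rank structure instead of forming the Kronecker matrices explicitly.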
3.2 Computing Gauss-Newton direction
Let , and . Then, we rewrite (10) as
The optimality condition of (11) becomes
which forms a linear subspace, where $U_k^\dagger$ and $V_k^\dagger$ are the Moore-Penrose pseudo-inverses of $U_k$ and $V_k$, respectively, and the remaining term is an arbitrary matrix.
In particular, if we choose , then
Moreover, the optimal value of (11) is .
Let . If we assume that , then and
Clearly, is a linear subspace, and its dimension is .
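The pseudo-inverse construction above can be sanity-checked numerically. The sketch below (illustrative names; the particular solution shown is one standard pseudo-inverse-based choice) builds a least-squares solution of the linearized subproblem $\min_{U,V}\|U_k V^\top + U V_k^\top - Z\|_F$ and verifies that its residual equals the Frobenius norm of the doubly projected matrix $\mathcal{P}^\perp_{U_k} Z \mathcal{P}^\perp_{V_k}$:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, r = 6, 5, 2
Uk = rng.standard_normal((m, r))
Vk = rng.standard_normal((n, r))
Z = rng.standard_normal((m, n))   # stand-in for the linearized target

# One least-squares solution of min_{U,V} ||Uk V^T + U Vk^T - Z||_F,
# assembled from Moore-Penrose pseudo-inverses:
Vt = np.linalg.pinv(Uk) @ Z                       # V^T = Uk^+ Z
Pu = Uk @ np.linalg.pinv(Uk)                      # projection onto range(Uk)
U = (np.eye(m) - Pu) @ Z @ np.linalg.pinv(Vk).T  # U = P_Uk^perp Z (Vk^T)^+

res = np.linalg.norm(Uk @ Vt + U @ Vk.T - Z)

# The attained residual equals ||P_Uk^perp Z P_Vk^perp||_F: the component of
# Z outside range(Uk) (in columns) and range(Vk) (in rows) is unreachable.
Pv = Vk @ np.linalg.pinv(Vk)
opt = np.linalg.norm((np.eye(m) - Pu) @ Z @ (np.eye(n) - Pv))
```

The cost is dominated by two small $r\times r$ pseudo-inversions and a few tall-skinny matrix products, which is what makes the GN direction cheap to form.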
3.3 The damped-step Gauss-Newton scheme
Using Lemma 1, we can form a damped step Gauss-Newton scheme as follows:
where and defined in (14) is the Gauss-Newton direction, and is a given step-size determined in the next lemma.
Since the GN direction computed by solving (11) is not unique, we need to choose an appropriate such that this direction is a descent direction of at . We prove in the following lemma that the GN direction computed by (14) is indeed a descent direction of at . The proof of this lemma is given in Appendix A.2.
Lemma 2 shows that if the residual term is sufficiently small near the current iterate, then we obtain the full step, since both of the corresponding quantities are small.
3.4 The algorithm and its global convergence
Theoretically, we can use the step-size in Lemma 2 for (15). However, in practice, computing this step-size is computationally expensive. We instead combine the GN scheme (15) with an Armijo backtracking linesearch to find an appropriate step-size for a given iterate.
By Lemma 2, this linesearch procedure terminates after finitely many iterations:
We can show that computing the GN direction requires two matrix inversions of size $r\times r$ and two matrix-matrix multiplications (of size $m\times r$ or $n\times r$).
Evaluating the objective requires one matrix-matrix multiplication and one evaluation of $\phi$. When $\mathcal{A}$ is a subset projection (e.g., $\mathcal{P}_\Omega$ in matrix completion), we can compute only the entries indexed by $\Omega$ instead of the full matrix $UV^\top$.
Each step of the linesearch needs one matrix-matrix multiplication and one evaluation of $\phi$. It requires at most a finite number of linesearch iterations; moreover, we observe from numerical experiments that the number of backtracking steps typically remains small on average.
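The Armijo backtracking step can be sketched generically as follows (function and parameter names are illustrative, not the paper's notation); the step-size is shrunk geometrically until a sufficient-decrease condition holds:

```python
import numpy as np

def armijo_backtracking(f, x, d, grad_dot_d, c=1e-4, rho=0.5, max_iter=50):
    """Generic Armijo backtracking: shrink alpha until
    f(x + alpha*d) <= f(x) + c*alpha*<grad f(x), d>.
    For a descent direction (grad_dot_d < 0) this terminates after
    finitely many halvings."""
    fx = f(x)
    alpha = 1.0
    for _ in range(max_iter):
        if f(x + alpha * d) <= fx + c * alpha * grad_dot_d:
            return alpha
        alpha *= rho
    return alpha

# Toy check on f(x) = ||x||^2 with the steepest-descent direction.
f = lambda x: float(x @ x)
x = np.array([2.0, -1.0])
d = -2 * x                        # d = -grad f(x)
alpha = armijo_backtracking(f, x, d, grad_dot_d=float((2 * x) @ d))
```

In the matrix setting, $x + \alpha d$ corresponds to the pair $(U_k + \alpha D_U, V_k + \alpha D_V)$, so each trial step costs one matrix-matrix product plus one evaluation of $\phi$, as noted above.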
Since (1) is nonconvex, we only expect the sequence generated by Algorithm 1 to converge to a stationary point. However, Lemma 3 only guarantees the full-rankness of the factor iterates at each iteration; this full-rankness may still degenerate in the limit. In order to prove global convergence of Algorithm 1, we require one additional condition: there exists a constant such that:
Under Assumption A.2.1, the following sublevel set of the objective is bounded for a given starting point:
3.5 Local linear convergence without strong convexity
We prove local convergence of the full-step Gauss-Newton scheme (15), i.e., with unit step-size. In general, problem (1) does not satisfy the standard regularity assumption: the Jacobian of the objective residual in (1), formed using the matrix representation of the linear operator $\mathcal{A}$, is not of full column rank. However, we can still guarantee fast local convergence under the following conditions:
Assumption A. 3.1
Problem (1) satisfies the following conditions:
$\phi$ is twice differentiable on a neighborhood of the stationary point under consideration, and its Hessian is Lipschitz continuous on this neighborhood with a fixed constant.
The Hessian of at satisfies
where , , and .
Assumption A.3.1(b) relates to a “small residual condition”. For instance, if $\phi(\cdot) = \tfrac{1}{2}\|\cdot\|_2^2$ and $\mathcal{A} = \mathbb{I}$, the identity operator, then the residual term becomes $UV^\top - B$, and condition (24) holds whenever this residual is sufficiently small at the stationary point (i.e., we have a “small residual” case).
Let the sequence be generated by the full-step scheme (15) (i.e., with unit step-size), and let a stationary point of (1) with full-rank factors be given. Assume that Assumptions A.2.1 and A.3.1 hold. Then, there exists a neighborhood of this stationary point and a constant such that
Consequently, if the residual quantity in (24) vanishes (the zero-residual case), then, starting from any point in this neighborhood, the sequence converges quadratically to the stationary point.
If the residual quantity in (24) is small but nonzero (the small-residual case), then, for any starting point sufficiently close to the stationary point, the sequence converges at a linear rate.
4 Gauss-Newton alternating direction method of multipliers
The GN method only works well and enjoys fast local convergence in the “small residual” case. In general, it may converge very slowly or even fail to converge. In this section, we propose to combine the GN scheme (15) with the alternating direction method of multipliers (ADMM) to develop a new algorithm for solving (1), called GN-ADMM. The main steps of this algorithm are described below.
4.1 The augmented Lagrangian and ADMM scheme
We introduce a slack variable and rewrite (1) as the following constrained problem:
We can define the augmented Lagrangian function associated with (26) as
where is a penalty parameter and is the corresponding Lagrange multiplier.
Next, we apply the standard ADMM scheme to (26), which leads to the following steps:
Obviously, both subproblems (28a) and (28b) remain computationally intensive. While (28a) is nonconvex, (28b) is smooth and convex. Without any further modification of (28), convergence theory for this nonconvex ADMM scheme can be found in several recent papers, including li2015global ; wang2015global ; wen2012solving . However, (28) remains impractical since (28a) and (28b) cannot be solved in closed form or to high accuracy. Our next step is therefore to solve these two subproblems approximately.
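The subproblems in (28) are specific to the splitting (26); to illustrate the generic alternate-minimize/dual-update pattern that the scheme follows, here is a minimal scaled-form ADMM on a convex toy problem (lasso). All names and the problem instance are illustrative, not the paper's GN-ADMM:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_lasso(A, b, lam=0.1, rho=1.0, iters=500):
    """Scaled-form ADMM for min 0.5||Ax-b||^2 + lam*||z||_1 s.t. x = z,
    illustrating the alternating-subproblem plus multiplier-update
    structure of a scheme like (28)."""
    m, n = A.shape
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    AtA = A.T @ A; Atb = A.T @ b
    M = np.linalg.inv(AtA + rho * np.eye(n))   # cached x-update system
    for _ in range(iters):
        x = M @ (Atb + rho * (z - u))          # first subproblem (cf. (28a))
        z = soft_threshold(x + u, lam / rho)   # second subproblem (cf. (28b))
        u = u + x - z                          # dual (multiplier) update
    return x, z

rng = np.random.default_rng(4)
A = rng.standard_normal((20, 8))
b = rng.standard_normal(20)
x, z = admm_lasso(A, b)
```

In the convex toy, both subproblems have closed-form solutions; the difficulty addressed in this section is precisely that in (28) they do not, which motivates replacing them by cheaper approximate steps.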
4.2 Approximation of the alternating steps
Gauss-Newton step for the $(U,V)$-subproblem (28a):
The corresponding objective value is . Then, we update as
where is a given step-size computed by a linesearch procedure as in (19).