1 Introduction
Various practical models, such as low-rank embedded problems, function learning, matrix completion in recommender systems, inpainting and compression in image processing, robust principal component analysis in statistics, and semidefinite programming relaxations in combinatorial optimization, often require recovering a low-rank matrix from either high-rank datasets or observations corrupted by noise. Such models can be formulated as low-rank matrix optimization problems; see, e.g.,
Candes2012b ; Candes2011a ; Esser2010a ; goldfarb2011convergence ; kyrillidis2014matrix ; liu2015efficient ; wen2012solving .
1.1 Problem statement
In this paper, we consider the following class of low-rank matrix optimization problems:
(1) 
where is a proper, closed and convex function; is a linear operator, defined as for matrices in ; and
is an observed vector. We are interested in the low-rank case, where
. Clearly, (1) is nonconvex due to the bilinear term . Hence, it is NP-hard natarajan1995sparse , and numerical methods for solving (1) aim at obtaining a local optimum or a stationary point. While (1) covers a wide range of applications, we briefly recall some emerging problems which have recently attracted great attention. The most common case is when , where (1) becomes a least-squares low-rank matrix approximation problem used in compressive sensing (see, e.g., kyrillidis2014matrix ):
(2) 
Here, the linear operator is often assumed to satisfy a restricted isometry property (RIP) Candes2006 that allows us to recover an exact solution from a small number of observations in . In particular, if , the projection onto a given index subset , then (2) covers the matrix completion model as a special case:
(3) 
where collects the observed entries of . If is an identity operator and is a given low-rank matrix, then (2) becomes a low-rank matrix factorization problem
(4) 
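Problem (4) admits a closed-form solution via a truncated SVD (by the Eckart–Young theorem); a minimal NumPy sketch (all names here are illustrative, not the paper's notation):

```python
import numpy as np

def truncated_svd_approx(M, r):
    """Best rank-r approximation of M in the Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 5))  # exactly rank 2
M2 = truncated_svd_approx(M, 2)   # recovers M exactly since rank(M) = 2
```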
In particular, if and is symmetric positive definite, then (4) reduces to
(5) 
which was considered in liu2015efficient . Alternatively, if we choose in (2), then (1) reduces to the case studied in Bhojanapallli2015 . While both special cases, (4) and (5), possess a closed-form solution via a truncated SVD and an eigenvalue decomposition, respectively, Gauss-Newton methods can also be applied to solve these problems. In
liu2015efficient , the authors demonstrated the advantages of a Gauss-Newton method for solving (5) via several encouraging numerical examples.
1.2 Related work
Low-rankness is key to recasting many existing problems into new frameworks or to designing new models with the aid of regularizers that promote solution structure in concrete applications, including matrix completion (MC) Candes2012b , robust principal component analysis (RPCA) Candes2011
and their variants. So far, extensions to group structured sparsity, low-rankness, tree models, and tensor representations have attracted great attention, see, e.g.,
fazel2002matrix ; grasedyck2013literature ; huang2011learning ; kyrillidis2015structured ; Recht2010 ; Signoretto2014 ; Yu2014. A majority of research on low-rank models focuses on estimating sample complexity results for specific instances of (
1), while various recent papers revolve around the RPCA setting, matrix completion, and their extensions Candes2011 ; Candes2012b ; johnson1990matrix . While simple convex models originating from (1) have several advantages for designing solution methods, they unfortunately do not capture the original model well. Trading off these two ingredients is key to achieving appropriate models for practical purposes. Along with problem modeling, solution algorithms for (1) are a core step in solving practical problems in low-rank matrix completion and recovery. From an optimization-based viewpoint, we observe that existing methods for (1) either focus on specific applications or are limited to certain classes of problems where advanced computational techniques can be exploited. Among many others, convex optimization is perhaps one of the most powerful tools for solving several instances of (1), including MC, RPCA, and their variants and extensions. Unfortunately, convex models only provide an approximation of the low-rank model (1) via convex relaxations using, e.g., nuclear or max norms, which may not adequately approximate the desired rank. Alternatively, nonconvex as well as discrete optimization methods have also been considered for solving (1), see, e.g., burer2003nonlinear ; kyrillidis2014matrix ; Lin2009 ; Shen2012 ; wen2012solving ; yu2014parallel . While these approaches work directly on the original problem (1), they can only find a local optimum or a critical point, and they strongly depend on a priori knowledge of the problem, the initial point of the algorithm, and the predicted rank. However, substantial empirical evidence has been provided to support these approaches, and surprisingly, in many cases, they outperform convex optimization approaches both in “accuracy” with respect to the original model and in overall computational time kyrillidis2014matrix ; Lin2009 ; wen2012solving
. Other approaches such as stochastic gradient descent, Riemannian manifold descent, greedy methods, and parallel and distributed algorithms have also been studied recently, see, e.g.,
bottou2010large ; johnson1990matrix ; keshavan2009gradient ; vandereycken2013low ; yu2014parallel .
1.3 Motivation and contributions
Optimization researchers have observed that Gauss-Newton (GN) methods work extremely well for nonlinear least-squares problems Bjorck1996 . With the quadratic objective , when the residual term in (1) is small or zero at solutions, GN methods can achieve a local superlinear or even quadratic convergence rate. With a “good” initial guess (i.e., close to the solution set), GN methods often reach a stationary point within a few iterations Deuflhard2006 . In many practical problems, we can often predict a “good” initial point as a warm start using a priori knowledge of the problem and the underlying algorithm (e.g., steady states of dynamical systems, or previous iterations of algorithms).
As in classical GN methods, we design an iterative scheme for solving (1) based on the linearization of the residual term and a quadratic surrogate of . At each iteration, it requires solving a simple convex problem to form a GN search direction, which is then combined with a globalization strategy to compute the next iterate. In our setting, computing the GN search direction reduces to solving a linear least-squares problem, which is similar to the alternating direction method (ADM) wen2012solving . While ADM alternates between solving for and , GN solves for and simultaneously using the linearization of . We have observed (cf. Subsection 7.1) that the linearization of used by GN provides a better local approximate model of than the alternating form (or ) when (or ) is relatively large. The latter causes ADM to saturate without significantly improving the objective value. In addition, without regularization, ADM may fail to converge, as shown by a counterexample in gonzalez2005accelerating . Moreover, ADM is not applicable to the symmetric case of (1), where we have , without reformulation or introducing lifting variables, but the GN method is.
While Gauss-Newton methods have been widely used in nonlinear least squares Nocedal2006 , they remain largely unexploited in nonlinear matrix optimization. Our aim in this paper is to extend the GN method to solve the class of problems (1) in two respects: a general convex objective , and low-rank matrix variable settings. This work is also inspired by a recent work liu2015efficient , where the authors considered a simple symmetric instance of (1) and demonstrated very encouraging numerical results of GN methods compared to other state-of-the-art algorithms.
Our contributions:
Our specific contributions can be summarized as follows:

We extend the Gauss-Newton method to solve low-rank matrix optimization problems of the form (1) with a general convex objective. We point out the existence of GN directions and provide a closed-form formula for them. We show empirically that our GN method achieves much more accurate solutions than commonly used alternating direction methods within the same number of iterations.

We show that there exists an explicit stepsize guaranteeing a descent property of the Gauss-Newton direction. This allows us to perform a simple backtracking line-search procedure to guarantee global convergence of the proposed method. We then specialize our framework to the symmetric case with a global convergence guarantee.

We prove local linear and quadratic convergence rates of the full-step Gauss-Newton variant under standard assumptions imposed on (1) at its solution set.
Unlike alternating direction methods, which only achieve sublinear convergence even with good initial points, GN methods may require slightly more computation per iteration for the GN direction, but they can achieve a fast local linear or quadratic convergence rate, which is key for online and real-time implementations. Alternatively, gradient-descent-based methods can achieve local linear convergence, but they often require much stronger assumptions on (1). In contrast, GN methods work in “small residual” settings under mild assumptions and can easily achieve high-accuracy solutions.
1.4 Outline of the paper
The rest of this paper is organized as follows. We first review basic concepts related to problem (1) in Section 2. Section 3 presents a line-search Gauss-Newton method for solving (1) and proves its convergence properties. Section 4 designs a Gauss-Newton ADMM algorithm for solving (1) and investigates its global convergence. Section 5 specializes the GN algorithm to the symmetric case and proves its convergence. Section 6 discusses implementation aspects of our algorithms and their extension to nonsmooth objective functions. Numerical experiments are conducted in Section 7 with several applications from different fields. For clarity of presentation, we move all proofs to the appendix.
2 Basic notations and optimality condition
We briefly state some basic notations, the optimality condition for (1), and the fundamental assumption.
2.1 Basic notation and concepts
For a matrix , and
denote its smallest positive and largest singular values, respectively. If
is symmetric, then and denote its smallest and largest eigenvalues, respectively. We use for SVD and for eigenvalue decomposition. We denote by the Moore-Penrose pseudo-inverse of , i.e., when has full column rank. We also define the projection onto the range space of , and the orthogonal projection of , i.e., , where is an identity matrix. Clearly, .
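The pseudo-inverse and the two projections just defined can be checked numerically; a small sketch with NumPy (variable names are ours, not the paper's notation):

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.standard_normal((5, 2))        # tall matrix, full column rank (generically)
U_pinv = np.linalg.pinv(U)             # Moore-Penrose pseudo-inverse
P = U @ U_pinv                         # projection onto range(U)
P_perp = np.eye(5) - P                 # orthogonal complement projection
```

One can verify that P is idempotent, fixes range(U), and that P_perp annihilates it.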
For a matrix , we define the vectorization of . Given the operator, we define the inverse of such that . For two matrices and , denotes the Kronecker product of and . We always have and . For a linear operator , denotes its adjoint operator.
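The standard identity vec(AXB) = (B^T ⊗ A) vec(X), which underlies the vectorization and Kronecker-product calculus used here, can be verified numerically (a sketch; note that column-stacking vec needs Fortran order in NumPy):

```python
import numpy as np

def vec(X):
    """Column-stacking vectorization vec(X)."""
    return X.reshape(-1, order="F")

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 2))
lhs = vec(A @ X @ B)                   # vec(A X B)
rhs = np.kron(B.T, A) @ vec(X)         # (B^T kron A) vec(X)
```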
2.2 Optimality condition and fundamental assumptions
We define as a joint variable of and in (1). First, we assume that in (1) is smooth. Then, the optimality condition of (1) can be written as follows:
(6) 
Any satisfying (6) is called a stationary point of (1). We denote by the set of stationary points of (1). Since , the solution of (6) is generally nonunique. Our aim is to design algorithms for generating a sequence converging to relying on the following fundamental assumption.
Assumption A. 2.1
Problem (1) satisfies the following conditions:
is smooth and convex, i.e., satisfies with that
(7) 
The set of stationary points of (1) is nonempty.
If is Lipschitz continuous with the Lipschitz constant , then (7) holds Nesterov2004 . We allow , which also covers the non-strongly convex case. Since is smooth and is linear, is also smooth. In addition, Assumption A.2.1
(a) covers a wide range of applications including logistic loss, Huber loss and entropy in nonlinear regression and machine learning
Boyd2011 .
3 Line-search Gauss-Newton method
We develop here a line-search Gauss-Newton (LsGN) algorithm for solving (1) with a convergence guarantee.
3.1 Forming a surrogate of the objective
By Assumption A.2.1, it follows from (7) and that
(8) 
for any , , , and , where is the gradient of , and is the Lipschitz constant of the gradient of .
Descent optimization methods rely on finding a descent direction of by approximately minimizing the right-hand-side surrogate of in (8). Unfortunately, this surrogate remains nonconvex due to the bilinear term . Our next step is to linearize this bilinear term around a given point as follows:
(9) 
Then, the minimization of the right-hand-side surrogate in (8) is approximated by
(10) 
This problem is a linear least-squares problem and can be solved by standard linear algebra routines.
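In matrix form, a problem of the type (10)–(11) is a least-squares problem min_d ||J d + r||^2, solvable by any standard routine; a generic sketch (J and r are illustrative stand-ins for the stacked linearized operator and residual, not the paper's notation):

```python
import numpy as np

rng = np.random.default_rng(3)
J = rng.standard_normal((20, 6))   # stacked linearized operator (illustrative)
r = rng.standard_normal(20)        # current residual (illustrative)
# Direction minimizing ||J d + r||^2:
d, *_ = np.linalg.lstsq(J, -r, rcond=None)
```

The computed d satisfies the normal equation J^T (J d + r) = 0, which mirrors the role of (12) below.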
3.2 Computing the Gauss-Newton direction
Let , and . Then, we rewrite (10) as
(11) 
The optimality condition of (11) becomes
(12) 
As usual, we refer to (12) as the normal equation of (11). We show in Lemma 1 a closed-form solution of (12), whose proof is in Appendix A.1.
Lemma 1
Lemma 1 also shows that if either is in the null space of or is in the null space of , then . We note that (14) only gives us one choice for . We can choose to obtain a simple GN search direction.
Remark 1
Let . If we assume that , then and
Clearly, is a linear subspace, and its dimension is .
3.3 The damped-step Gauss-Newton scheme
Using Lemma 1, we can form a damped-step Gauss-Newton scheme as follows:
(15) 
where and defined in (14) is the Gauss-Newton direction, and is a given stepsize determined in the next lemma.
Since the GN direction computed by solving (11) is not unique, we need to choose an appropriate such that this direction is a descent direction of at . We prove in the following lemma that the GN direction computed by (14) is indeed a descent direction of at . The proof of this lemma is given in Appendix A.2.
Lemma 2
Lemma 2 shows that if the residual term is sufficiently small near , then we obtain a full step since both and are small.
3.4 The algorithm and its global convergence
Theoretically, we can use the stepsize from Lemma 2 in (15). In practice, however, computing is computationally expensive. We instead combine the GN scheme (15) with an Armijo backtracking line-search to find an appropriate stepsize for given .
Given , find the smallest number such that and
(19) 
where and are given (e.g., and ).
By Lemma 2, this line-search procedure terminates after finitely many iterations :
(20) 
where is given by (16). Now, we present the complete line-search Gauss-Newton algorithm for solving (1) as Algorithm 1.
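A backtracking rule of the Armijo type (19) can be sketched generically as follows (function and constant names are ours; c1 and rho play the roles of the given constants in (19)):

```python
import numpy as np

def armijo_backtracking(f, x, d, grad_dot_d, c1=1e-4, rho=0.5, max_iter=50):
    """Find the smallest j with f(x + rho**j d) <= f(x) + c1 * rho**j * <grad f(x), d>."""
    fx, alpha = f(x), 1.0
    for _ in range(max_iter):
        if f(x + alpha * d) <= fx + c1 * alpha * grad_dot_d:
            return alpha
        alpha *= rho
    return alpha

# Toy usage on f(x) = ||x||^2 with the descent direction d = -grad f(x).
f = lambda x: float(x @ x)
x = np.array([2.0, -1.0])
g = 2.0 * x                       # gradient of f at x
d = -g
alpha = armijo_backtracking(f, x, d, grad_dot_d=float(g @ d))
```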
Complexity per iteration:
The main steps of Algorithm 1 are Steps 4 and 6, where we need to compute and perform the line-search routine, respectively.
We can show that computing requires two inverses and of size , and two matrix-matrix multiplications (of size or ).
Evaluating requires one matrix-matrix multiplication and one evaluation of . When is a subset projection (e.g., in matrix completion), we can compute instead of the full matrix .
Each step of the line-search needs one matrix-matrix multiplication and one evaluation of . It requires at most line-search iterations. However, we observe from numerical experiments that often varies from to on average.
Global convergence:
Since (1) is nonconvex, we can only expect the sequence generated by Algorithm 1 to converge to a stationary point . However, Lemma 3 only guarantees the full-rankness of and at each iteration, but we may have or . In order to prove global convergence of Algorithm 1, we require one additional condition: there exists such that:
(21) 
Under Assumption A.2.1, the following sublevel set of is bounded for given :
We prove in Appendix A.4 a global convergence of Algorithm 1 stated as follows.
3.5 Local linear convergence without strong convexity
We prove local convergence of the full-step Gauss-Newton scheme (15) when . Generally, problem (1) does not satisfy the regularity assumption: the Jacobian of the objective residual in (1) does not have full column rank, where is the matrix form of the linear operator . However, we can still guarantee fast local convergence under the following conditions:
Assumption A. 3.1
Problem (1) satisfies the following conditions:
is twice differentiable on a neighborhood of , and its Hessian is Lipschitz continuous in with the constant .
The Hessian of at satisfies
(24) 
where , , and .
Assumption A.3.1(b) relates to a “small residual condition”. For instance, if , and , the identity operator, then the residual term becomes , and the objective is . In this case, condition (24) holds if (i.e., we have a “small residual” case).
Theorem 3.2
Let with be the sequence generated by (15) with stepsize , and be a given stationary point of (1) such that . Assume that Assumptions A.2.1 and A.3.1 hold. Then, there exists a neighborhood of and a constant such that
(25) 
Consequently, if in (24), i.e., the zero-residual case, then there exists a constant such that, starting from any with , converges quadratically to .
If in (24), i.e., the small-residual case, then, for any such that , converges to at a linear rate.
4 Gauss-Newton alternating direction method of multipliers
The GN method works well and exhibits fast local convergence only in the “small residual” case. In general, it may converge very slowly or even fail to converge. In this section, we propose to combine the GN scheme (15) with the alternating direction method of multipliers (ADMM) to develop a new algorithm for solving (1), called GN-ADMM. The main steps of this algorithm are described below.
4.1 The augmented Lagrangian and ADMM scheme
We introduce a slack variable and rewrite (1) as the following constrained problem:
(26) 
We can define the augmented Lagrangian function associated with (26) as
(27) 
where is a penalty parameter and is the corresponding Lagrange multiplier.
Next, we apply the standard ADMM scheme to (26), which leads to the following steps:
(28a)  
(28b)  
(28c) 
Obviously, both subproblems (28a) and (28b) remain computationally intensive. While (28a) is nonconvex, (28b) is smooth and convex. Without any further modification of (28), convergence theory for this nonconvex ADMM scheme can be found in several recent papers, including li2015global ; wang2015global ; wen2012solving . However, (28) remains impractical since (28a) and (28b) cannot be solved in closed form or to high accuracy. Our next step is to solve these two subproblems approximately.
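For intuition, the three-step pattern (28a)–(28c) can be illustrated on a toy splitting problem in which both subproblems do have closed-form solutions (this is an illustrative sketch of generic ADMM, not the scheme for (1)):

```python
import numpy as np

# Toy ADMM: minimize ||x - a||^2 + ||z - b||^2 subject to x = z.
a = np.array([1.0, 3.0])
b = np.array([3.0, 1.0])
rho = 1.0                                      # penalty parameter
x = z = y = np.zeros(2)                        # primal and dual variables
for _ in range(200):
    x = (2 * a + rho * z - y) / (2 + rho)      # x-step (analogue of (28a))
    z = (2 * b + rho * x + y) / (2 + rho)      # z-step (analogue of (28b))
    y = y + rho * (x - z)                      # dual update (analogue of (28c))
# Both iterates approach the consensus solution (a + b) / 2.
```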
4.2 Approximation of the alternating steps
Gauss-Newton step for the subproblem (28a):
We apply one step of the GN scheme (15) to solve (28a) as follows. First, we approximate by using the quadratic upper bound of and the linearization of with and . Letting with , we solve
(29) 
Here, the Lipschitz constant can be computed by a power method Golub1996 . Using Lemma 1, we can compute as
(30) 
The corresponding objective value is . Then, we update as
(31) 
where is a given stepsize computed by a line-search procedure as in (19).
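The power-method estimate of the Lipschitz constant mentioned above (essentially the largest eigenvalue of the Gram operator) can be sketched as follows (function and variable names are illustrative):

```python
import numpy as np

def power_method_sq_norm(A, n_iter=100, seed=0):
    """Estimate ||A||_2^2, the largest eigenvalue of A^T A, by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        w = A.T @ (A @ v)                 # one application of the Gram operator
        v = w / np.linalg.norm(w)
    return float(v @ (A.T @ (A @ v)))     # Rayleigh quotient at the final iterate

A = np.diag([3.0, 1.0, 0.5])
L_est = power_method_sq_norm(A)           # largest eigenvalue of A^T A is 9.0
```

Only matrix-vector products with A and A^T are needed, so this also applies when the operator is available implicitly.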