Nonconvex Optimization Meets Low-Rank Matrix Factorization: An Overview

09/25/2018 ∙ by Yuejie Chi, et al. ∙ Princeton University Harvard University Carnegie Mellon University 0

Substantial progress has been made recently on developing provably accurate and efficient algorithms for low-rank matrix factorization via nonconvex optimization. While conventional wisdom often takes a dim view of nonconvex optimization algorithms due to their susceptibility to spurious local minima, simple iterative methods such as gradient descent have been remarkably successful in practice. The theoretical footings, however, had been largely lacking until recently. In this tutorial-style overview, we highlight the important role of statistical models in enabling efficient nonconvex optimization with performance guarantees. We review two contrasting approaches: (1) two-stage algorithms, which consist of a tailored initialization step followed by successive refinement; and (2) global landscape analysis and initialization-free algorithms. Several canonical matrix factorization problems are discussed, including but not limited to matrix sensing, phase retrieval, matrix completion, blind deconvolution, robust principal component analysis, phase synchronization, and joint alignment. Special care is taken to illustrate the key technical insights underlying their analyses. This article serves as a testament that the integrated thinking of optimization and statistics leads to fruitful research findings.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Modern information processing and machine learning often have to deal with (structured) low-rank matrix factorization. Given a few observations

about a matrix of rank , one seeks a low-rank solution compatible with this set of observations as well as other prior constraints. Examples include low-rank matrix completion [1, 2, 3], phase retrieval [4], blind deconvolution and self-calibration [5, 6], robust principal component analysis [7, 8], synchronization and alignment [9, 10]

, to name just a few. A common goal of these problems is to develop reliable, scalable, and robust algorithms to estimate a low-rank matrix from potentially noisy, nonlinear, and highly incomplete observations.

1.1 Optimization-based methods

Towards this goal, arguably one of the most popular approaches is optimization-based methods. By factorizing a candidate solution as with low-rank factors and , one attempts recovery by solving an optimization problem in the form of

(1a)
subject to (1b)
(1c)

Here, is certain empirical risk function (e.g. Euclidean loss, negative log-likelihood) that evaluates how well a candidate solution fits the observations, and the set encodes additional prior constraints, if any. This problem is often highly nonconvex and appears daunting to solve to global optimality at first sight. After all, conventional wisdom usually perceives nonconvex optimization as a computationally intractable task that is susceptible to local minima.

To bypass the challenge, one can do convex relaxation, an effective strategy that already enjoys theoretical success in addressing a large number of problems. The basic idea is to convexify the problem by, amongst others, dropping or replacing the low-rank constraint [cf. (1b)] by a nuclear norm constraint [11, 12, 1, 13, 2, 3], and solving the convexified problem in the full matrix space (i.e. the space of ). While such convex relaxation schemes exhibit intriguing performance guarantees in several aspects (e.g. near-minimal sample complexity, stability against noise), its computational cost often scales at least cubically in the size of the matrix , which often far exceeds the time taken to read the data. In addition, the prohibitive storage complexity associated with the convex relaxation approach presents another hurdle that limits its applicability to large-scale problems.

This overview article focuses on provable low-rank matrix estimation based on nonconvex optimization. This approach operates over the parsimonious factorized representation [cf. (1b)] and optimizes the nonconvex loss directly over the low-rank factors and . The advantage is clear: adopting economical representation of the low-rank matrix results in low storage requirements, affordable per-iteration computational cost, amenability to parallelization, and scalability to large problem size, when performing iterative optimization methods like gradient descent. However, despite its wide use and remarkable performance in practice [14, 15], the foundational understanding of generic nonconvex optimization is far from mature. It is often unclear whether an optimization algorithm can converge to the desired global solution and, if so, how fast this can be accomplished. For many nonconvex problems, theoretical underpinnings had been lacking until very recently.

1.2 Nonconvex optimization meets statistical models

Fortunately, despite general intractability, some important nonconvex problems may not be as hard as they seem. For instance, for several low-rank matrix factorization problems, it has been shown that: under proper statistical models, simple first-order methods are guaranteed to succeed in a small number of iterations, achieving low computational and sample complexities simultaneously (e.g. [16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]

). The key to enabling guaranteed and scalable computation is to concentrate on problems arising from specific statistical signal estimation tasks, which may exhibit benign structures amenable to computation and rule out undesired “hard” instances by focusing on the average-case performance. Two messages deserve particular attention when we examine the geometry of associated nonconvex loss functions:

  • Basin of attraction. For several statistical problems of this kind, there often exists a reasonably large basin of attraction around the global solution, within which an iterative method like gradient descent is guaranteed to be successful and converge fast. Such a basin might exist even when the sample complexity is quite close to the information-theoretic limit [16, 17, 18, 19, 20, 21, 22].

  • Benign global landscape. Several problems provably enjoy benign optimization landscape when the sample size is sufficiently large, in the sense that there is no spurious local minima, i.e. all local minima are also global minima, and that the only undesired stationary points are strict saddle points [27, 28, 29, 30, 31].

These important messages inspire a recent flurry of activities in the design of two contrasting algorithmic approaches:

  • Two-stage approach. Motivated by the existence of a basin of attraction, a large number of works follow a two-stage paradigm: (1) initialization, which locates an initial guess within the basin; (2) iterative refinement, which successively refines the estimate without leaving the basin. This approach often leads to very efficient algorithms that run in time proportional to that taken to read the data.

  • Saddle-point escaping algorithms. In the absence of spurious local minima, a key challenge boils down to how to efficiently escape undesired saddle points and find a local minimum, which is the focus of this approach. This approach does not rely on carefully-designed initialization.

The research along these lines highlights the synergy between statistics and optimization in signal processing and machine learning. The algorithmic choice often needs to properly exploit the underlying statistical models in order to be truly efficient, in terms of both statistical accuracy and computational efficiency.

1.3 This paper

Understanding the effectiveness of nonconvex optimization is currently among the most active areas of research in signal processing, machine learning, optimization and statistics. Many exciting new developments in the last several years have significantly advanced our understanding of this approach for various statistical problems. This article aims to provide a thorough technical overview of important recent results in this exciting area, targeting the broader signal processing, machine learning, statistics, and optimization communities.

The rest of this paper is organized as follows. Section 2 reviews some preliminary facts on optimization that are instrumental to understanding the materials in this paper. Section 3 uses a toy (but non-trivial) example (i.e. rank-1 matrix factorization) to illustrate why it is hopeful to solve a nonconvex problem to global optimality, through both local and global lenses. Section 4 introduces a few canonical statistical estimation problems that will be visited multiple times in the sequel. Section 5 and Section 6 review gradient descent and its many variants as a local refinement procedure, followed by a discussion of other methods in Section 7. Section 8 discusses the spectral method, which is commonly used to provide an initialization within the basin of attraction. Section 9 provides a global landscape analysis, in conjunction with algorithms that work without the need of careful initialization. We conclude the paper in Section 10 with some discussions and remarks. Furthermore, a short note is provided at the end of several sections to cover some historical remarks and provide further pointers.

1.4 Notations

It is convenient to introduce a few notations that will be used throughout. We use boldfaced symbols to represent vectors and matrices. For any vector

, we let , and denote its , , and norm, respectively. For any matrix , let , , , , and

stand for the spectral norm (i.e. the largest singular value), the Frobenius norm, the nuclear norm (i.e. the sum of the singular values), the

norm (i.e. the largest norm of the rows), and the entrywise norm (the largest magnitude of all entries), respectively. We denote by (resp. ) its

th largest singular value (resp. eigenvalue), and let

(resp. ) represent its th row (resp. column). The condition number of is denoted by . In addition, , and indicate the transpose, the conjugate transpose, and the entrywise conjugate of , respectively. For two matrices and of the same size, we define their inner product as , where stands for the trace. The matrix denotes the identity matrix, and denotes the th column of . For any linear operator , denote by its adjoint operator. For example, if maps to , then . We also let denote the vectorization of a matrix . The indicator function equals to when the event holds true, and otherwise. Further, the notation denotes the set of orthonormal matrices. Let and be two square matrices. We write (resp. ) if their difference is a positive definite (resp. positive semidefinite) matrix.

Additionally, the standard notation or means that there exists a constant such that , means that there exists a constant such that , and means that there exist constants such that .

2 Preliminaries in optimization theory

We start by reviewing some basic concepts and preliminary facts in optimization theory that play a vital role in the main development of the theory. For simplicity of presentation, this section focuses on an unconstrained problem

(2)

The optimal solution, if exists, is denoted by

(3)

When is strictly convex,111Recall that is said to be strictly convex if and only if for any and , one has . is unique. But it may be non-unique when is nonconvex.

2.1 Gradient descent for locally strongly convex functions

To solve (2), arguably the simplest method is (vanilla) gradient descent (GD), which follows the update rule

(4)

Here, is the step size or learning rate at the th iteration, and is the initial point. This method and its variants are widely used in practice, partly due to their simplicity and scalability to large-scale problems.

A central question is when GD converges fast to the global minimum . As is well-known in the optimization literature, GD is provably convergent at a linear rate when is (locally) strongly convex and smooth. Here, an algorithm is said to converge linearly if the error converges to 0 as a geometric series. To formally state this result, we define two concepts that commonly arise in the optimization literature.

Definition 1 ().

A twice continuously differentiable function is said to be -strongly convex in a set if

(5)
Definition 2 ().

A twice continuously differentiable function is said to be -smooth in a set if

(6)

With these in place, we have the following standard result:

Lemma 1.

Suppose that is -strongly convex and -smooth within a local ball , and that . If , then GD obeys

Proof of Lemma 1.

The optimality of indicates that , which allows to rewrite the GD update rule as

where . Here, the second line arises from the fundamental theorem of calculus [32, Theorem 4.2]. If , then it is self-evident that , which combined with the assumption of Lemma 1 gives

Therefore, as long as (and hence ), we have

This together with the sub-multiplicativity of yields

By setting , we arrive at the desired error contraction, namely,

(7)

A byproduct is: if , then the next iterate also falls in . Consequently, applying the above argument repetitively and recalling the assumption , we see that all GD iterates remain within . Hence, (7) holds true for all . This immediately concludes the proof. ∎

This result essentially implies that: to yield -accuracy (in a relative sense), i.e. , the number of iterations required for GD — termed iteration complexity — is at most

if we initialize GD properly such that lies in the local region . In words, the iteration complexity scales linearly with the condition number — the ratio of smoothness to strong convexity parameters. As we shall see, for multiple problems considered herein, the radius of this locally strongly convex and smooth ball can be reasonably large (e.g. on the same order of .

Figure 1: An example of taken from [20]. Here, if and if . It satisfies with some and , but is clearly nonconvex.

2.2 Convergence under regularity conditions

Another condition that has been extensively employed in the literature is the Regularity Condition (RC), which accommodates algorithms beyond vanilla GD, as well as is applicable to possibly nonsmooth functions. Specifically, consider the iterative algorithm

(8)

for some general mapping . In vanilla GD, , but can also incorporate several variants of GD; see Section 6. The regularity condition is defined as follows.

Definition 3 ().

is said to obey the regularity condition for some if

(9)

for all .

This condition basically implies that at any feasible point , the associated negative search direction is positively correlated with the error , and hence the update rule (8) — in conjunction with a sufficiently small step size — drags the current point closer towards the global solution. It follows from the Cauchy-Schwarz inequality that one must have .

It is worth noting that this condition does not require to be differentiable. Also, when , it does not require to be convex within ; see Fig. 1 for an example. Instead, the regularity condition can be viewed as a combination of smoothness and “one-point strong convexity” (as the condition is stated w.r.t. a point ) defined as follows

(10)

for all . To see this, observe that

(11)

where the second line follows from an equivalent definition of the smoothness condition [33, Theorem 5.8]. Combining (10) and (11) arrives at the regularity condition with and .

Under the regularity condition, (8) converges linearly:

Lemma 2.

Under , the iterative algorithm (8) with and obeys

Proof of Lemma 2.

Assuming that , we can obtain

where the second inequality comes from , and the last line arises if . By setting , we arrive at

(12)

which also shows that . The claim then follows by induction under the assumption .

In view of Lemma 2, the iteration complexity to reach -accuracy (i.e. ) is at most

as long as a suitable initialization is provided.

2.3 Critical points

An iterative algorithm like gradient descent often converges to one of its fixed points [34]. For gradient descent, the associated fixed points are (first-order) critical points or stationary points of the loss function, defined as follows.

Definition 4 ().

A first-order critical point (stationary point) of is any point that satisfies

Moreover, we call a point an -first-order critical point, for some , if it satisfies .

A critical point can be a local minimum, a local maximum, or a saddle point of , depending on the curvatures at / surrounding the point. Specifically, denote by the Hessian matrix at , and let be its minimum eigenvalue. Then for any first-order critical point :

  • if , then is a local maximum;

  • if , then is a local minimum;

  • , then is either a local minimum or a degenerate saddle point;

  • , then is a strict saddle point.

Another concept that will be useful is second-order critical points, defined as follows.

Definition 5 ().

A point is said to be a second-order critical point (stationary point) if and .

Clearly, second-order critical points do not encompass local maxima and strict saddle points, and, as we shall see, are of more interest for the nonconvex problems considered herein. Since we are interested in minimizing the loss function, we do not distinguish local maxima and strict saddle points.

3 A warm-up example: rank-1 matrix factorization

For pedagogical reasons, we begin with a self-contained study of a simple nonconvex matrix factorization problem, demonstrating local convergence in the basin of attraction in Section 3.1 and benign global landscape in Section 3.2. The analysis in this section requires only elementary calculations.

Figure 2: Illustration of the function , where and .

Specifically, consider a positive semidefinite matrix (which is not necessarily low-rank) with eigendecomposition . We assume throughout this section that there is a gap between the 1st and 2nd largest eigenvalues

(13)

The aim is to find the best rank- approximation. Clearly, this can be posed as the following problem:222Here, the preconstant is introduced for the purpose of normalization and does not affect the solution.

(14)

where is a degree-four polynomial and highly nonconvex. The solution to (14

) can be expressed in closed form as the scaled leading eigenvector

. See Fig. 2 for an illustration of the function when

. This problem stems from interpreting principal component analysis (PCA) from an optimization perspective, which has a long history in the literature of (linear) neural networks and unsupervised learning; see for example

[35, 36, 37, 38, 39, 40].

We attempt to minimize the nonconvex function directly in spite of nonconvexity. This problem, though simple, plays a critical role in understanding the success of nonconvex optimization, since several important nonconvex estimation problems can be regarded as randomized versions or extensions of this problem.

3.1 Local linear convergence of gradient descent

To begin with, we demonstrate that gradient descent, when initialized at a point sufficiently close to the true optimizer (i.e. ), is guaranteed to converge fast.

Theorem 1.

Consider the problem (14). Suppose and . Then the GD iterates (4) obey

Remark 1.

By symmetry, Theorem 1 continues to hold if is replaced by .

In a nutshell, Theorem 1 establishes linear convergence of GD for rank-1 matrix factorization, where the convergence rate largely depends upon the eigen-gap (relative to the largest eigenvalue). This is a local result, assuming that a suitable initialization is present in the basin of attraction . Its radius, which is not optimized in this theorem, is given by . This also depends on the relative eigen-gap .

Proof of Theorem 1.

The proof mainly consists of showing that is locally strongly convex and smooth, which allows us to invoke Lemma 1. The gradient and the Hessian of are given respectively by

(15)
(16)

For notational simplicity, let . A little algebra yields that if , then

(17)
(18)

We start with the smoothness condition. The triangle inequality gives

where the last line follows from (18).

Next, it comes from the definition of and (17) that

Substitution into (16) yields a strong convexity lower bound:

where the last inequality is an immediate consequence of (18).

In summary, for all obeying , one has

Applying Lemma 1 establishes the claim. ∎

The question then comes down to whether one can secure an initial guess of this quality. One popular approach is spectral initialization, obtained by computing the leading eigenvector of . For this simple problem, this already yields a solution of arbitrary accuracy. As it turns out, such a spectral initialization approach is particularly useful when dealing with noisy and incomplete measurements. We refer the readers to Section 8 for detailed discussions on spectral initialization methods.

3.2 Global optimization landscape

We then move on to examining the optimization landscape of this simple problem. In particular, what kinds of critical points does have? This is addressed as follows.

Theorem 2.

Consider the problem (14). All local minima of are global optima. The rest of the critical points are either local maxima or strict saddle points.

Proof of Theorem 2.

Recall that is critical point if and only if , or equivalently,

As a result, is a critical point if either it aligns with an eigenvector of or . Given that the eigenvectors of obey , by properly adjusting the global scaling, we determine the set of critical points

To further categorize the critical points, we need to examine the Hessian matrices as given by (16). Regarding the critical points , we have

We can then categorize them as follows:

  1. With regards to the points , one has

    and hence they are (equivalent) local minima of ;

  2. For the points , one has

    and therefore they are strict saddle points of .

  3. Finally, the critical point at the origin satisfies

    and is hence either a local maxima (if ) or a strict saddle point (if ).

This result reveals the benign geometry of the problem (14) amenable to optimization. All undesired fixed points of gradient descent are strict saddles, which have negative directional curvature and may not be difficult to escape or avoid.

4 Formulations of a few canonical problems

For a paper of this length, it is impossible to cover all nonconvex statistical problems of interest. Instead, we decide to focus on a few concrete and fundamental matrix factorization problems. This section presents formulations of several such examples that will be visited multiple times throughout this article. Unless otherwise noted, the assumptions made in this section (e.g. restricted isometry for matrix sensing, Gaussian design for phase retrieval) will be imposed throughout the rest of the paper.

4.1 Matrix sensing

Suppose that we are given a set of measurements of of the form

(19)

where is a collection of sensing matrices known a priori. We are asked to recover — which is assumed to be of rank — from these linear matrix equations [12, 41].

When is positive semidefinite, this can be cast as solving the least-squares problem

(20)

Clearly, we cannot distinguish from for any orthonormal matrix , as they correspond to the same low-rank matrix . This simple fact implies that there exist multiple global optima for (20), a phenomenon that holds for most problems discussed herein.

For the general case where , we wish to minimize

(21)

Similarly, we cannot distinguish from for any invertible matrix , as . Throughout the paper, we denote by

(22)

the true low-rank factors, where

stands for its singular value decomposition.

In order to make the problem well-posed, we need to make proper assumptions on the sensing operator defined by:

(23)

A useful property for the sensing operator that enables tractable algorithmic solutions is the Restricted Isometry Property (RIP), which says that the operator preserves the Euclidean norm of the input matrix when restricted to the set of low-rank matrices. More formally:

Definition 6 ([42]).

An operator is said to satisfy -RIP with RIP constant if

(24)

holds simultaneously for all of rank at most .

As an immediate consequence, the inner product between two low-rank matrices is also nearly preserved if satisfies the RIP. Therefore, behaves like an isometry when restricting its operations over low-rank matrices.

Lemma 3 ([42]).

If an operator satisfies -RIP with RIP constant , then

(25)

holds simultaneously for all and of rank at most .

Notably, many random sensing designs are known to satisfy the RIP with high probability, with one example given below.

Fact 1 ( [12, 41]).

If the entries of are i.i.d. Gaussian entries , then [cf. (23)] satisfies the -RIP with RIP constant with high probability as soon as .

4.2 Phase retrieval and quadratic sensing

Imagine that we have access to quadratic measurements of a rank-1 matrix :

(26)

where is the design vector known a priori. How can we reconstruct — or equivalently, — from this collection of quadratic equations about ? This problem, often dubbed as phase retrieval, arises for example in X-ray crystallography, where one needs to recover a specimen based on intensities of the diffracted waves scattered by the object [43, 4, 44, 45]. Mathematically, the problem can be posed as finding a solution to the following program

(27)

More generally, consider the quadratic sensing problem, where we collect quadratic measurements of a rank- matrix with :

(28)

This subsumes phase retrieval as a special case, and comes up in applications such as covariance sketching for streaming data [46, 47], and phase space tomography under the name coherence retrieval [48, 49]. Here, we wish to solve

(29)

Clearly, this is equivalent to the matrix sensing problem by taking . Here and throughout, we assume i.i.d. Gaussian design, a tractable model that has been extensively studied recently.

Assumption 1.

Suppose that the design vectors .

4.3 Matrix completion

Suppose we observe partial entries of a low-rank matrix of rank , indexed by the sampling set . It is convenient to introduce a projection operator such that for an input matrix ,

(30)

The matrix completion problem then boils down to recovering from (or equivalently, from the partially observed entries of ) [1, 13]. This arises in numerous scenarios; for instance, in collaborative filtering, we may want to predict the preferences of all users about a collection of movies, based on partially revealed ratings from the users. Throughout this paper, we adopt the following random sampling model:

Assumption 2.

Each entry is observed independently with probability , i.e.

(31)

For the positive semidefinite case where , the task can be cast as solving

(32)

When it comes to the more general case where , the task boils down to solving

(33)

One parameter that plays a crucial role in determining the feasibility of matrix completion is a certain coherence measure, defined as follows [1].

Definition 7 ().

A rank- matrix with singular value decomposition (SVD) is said to be -incoherent if