Recovering a low-rank matrix from noisy linear measurements has become an increasingly central task in data science. Important and well-studied examples include phase retrieval[55, 12, 42], blind deconvolution [1, 38, 41, 57], matrix completion [9, 21, 56]
, covariance matrix estimation[18, 40]
, and robust principal component analysis[15, 11]
. Optimization-based approaches for low-rank matrix recovery naturally lead to nonconvex formulations, which are NP hard in general. To overcome this issue, in the last two decades researchers have developed convex relaxations that succeed with high probability under appropriate statistical assumptions. Convex techniques, however, have a well-documented limitation: the parameter space describing the relaxations is usually much larger than that of the target problem. Consequently, standard algorithms applied on convex relaxations may not scale well to the large problems. Consequently, there has been a renewed interest in directly optimizing nonconvex formulations with iterative methods within the original parameter space of the problem. Aside from a few notable exceptions on specific problems[33, 3, 32], most algorithms of this type proceed in two-stages. The first stage—initialization—yields a rough estimate of an optimal solution, often using spectral techniques. The second stage—local refinement—uses a local search algorithm that rapidly converges to an optimal solution, when initialized at the output of the initialization stage.
This work focuses on developing provable low-rank matrix recovery algorithms based on nonconvex problem formulations. We focus primarily on local refinement and describe a set of unifying sufficient conditions leading to rapid local convergence of iterative methods. In contrast to the current literature on the topic, which typically relies on smooth problem formulations and gradient-based methods, our primary focus is on nonsmooth formulations that exhibit sharp growth away from the solution set. Such formulations are well-known in the nonlinear programming community to be amenable to rapidly convergent local-search algorithms. Along the way, we will observe an apparent benefit of nonsmooth formulations over their smooth counterparts. All nonsmooth formulations analyzed in this paper are “well-conditioned,” resulting in fast “out-of-the-box” convergence guarantees. In contrast, standard smooth formulations for the same recovery tasks can be poorly conditioned, in the sense that classical convergence guarantees of nonlinear programming are overly pessimistic. Overcoming the poor conditioning typically requires nuanced problem and algorithmic specific analysis (e.g. [57, 42, 46, 17]), which nonsmooth formulations manage to avoid for the problems considered here.
Setting the stage, consider a rank matrix and a linear map from the space of matrices to the space of measurements. The goal of low-rank matrix recovery is to recover
from the image vector, possibly corrupted by noise. Typical nonconvex approaches proceed by choosing some penalty function with which to measure the residual for a trial solution . Then, in the case that is symmetric and positive semidefinite, one may focus on the formulation
or when is rectangular, one may instead use the formulation
Here, is a convex set that incorporates prior knowledge about and is often used to enforce favorable structure on the decision variables. The penalty is chosen specifically to penalize measurement misfit and/or enforce structure on the residual errors.
Algorithms and conditioning for smooth formulations
Most widely-used penalties are smooth and convex. Indeed, the squared -norm is ubiquitous in this context. With such penalties, problems (1.1) and (1.2) are smooth and thus are amenable to gradient-based methods. The linear rate of convergence of gradient descent is governed by the “local condition number” of . Indeed, if the estimate, holds for all in a neighborhood of the solution set, then gradient descent converges to the solution set at the linear rate . It is known that for several widely-studied problems including phase retrieval, blind deconvolution, and matrix completion, the ratio scales inversely with the problem dimension. Consequently, generic nonlinear programming guarantees yield efficiency estimates that are far too pessimistic. Instead, near-dimension independent guarantees can be obtained by arguing that is well conditioned along the “relevant” directions or that is well-conditioned within a restricted region of space that the iterates never escape (e.g. [57, 42, 46]). Techniques of this type have been elegantly and successfully used over the past few years to obtain algorithms with near-optimal sample complexity. One byproduct of such techniques, however, is that the underlying arguments are finely tailored to each particular problem and algorithm at hand. We refer the reader to the recent surveys  for details.
Algorithms and conditioning for nonsmooth formulations
The goal of our work is to justify the following principle:
where is a smooth map on the space of matrices and is a closed convex set. Indeed, in the symmetric and positive semidefinite case, we identify with matrices and define , while in the asymmetric case, we identify with pairs of matrices and define . Though compositional problems (1.3) have been well-studied in nonlinear programming [6, 7, 31], their computational promise in data science has only begun recently to emerge. For example, the papers [28, 22, 26] discuss stochastic and inexact algorithms on composite problems, while the papers [27, 24], , and  investigate applications to phase retrieval, blind deconvolution, and matrix sensing, respectively.
A number of algorithms are available for problems of the form
(1.3), and hence for (1.1)
and (1.2). Two most notable ones are the projected subgradient111 Here, the subdifferential is
formally obtained through the chain rule
Here, the subdifferential is formally obtained through the chain rule, where is the subdifferential in the sense of convex analysis. method [23, 34]
Notice that each iteration of the subgradient method is relatively cheap, requiring access only to the subgradients of and the nearest-point projection onto . The prox-linear method in contrast requires solving a strongly convex problem in each iteration. That being said, the prox-linear method has much stronger convergence guarantees than the subgradient method, as we will review shortly.
The local convergence guarantees of both methods are straightforward to describe, and underlie what we mean by “good conditioning”. Define , and for any define the convex model . Suppose there exist constants satisfying the two properties:
(approximation) for all ,
(sharpness) for all .
The approximation and sharpness properties have intuitive meanings. The former says that the nonconvex function is well approximated by the convex model , with quality that degrades quadratically as deviates from . In particular, this property guarantees that the quadratically perturbed function is convex on . Yet another consequence of the approximation property is that the epigraph of admits a supporting concave quadratic with amplitute at each of its points. Sharpness, in turn, asserts that must grow at least linearly as moves away from the solution set. In other words, the function values should robustly distinguish between optimal and suboptimal solutions. In statistical contexts, one can interpret sharpness as strong identifiability of the statistical model. The three figures below illustrate the approximation and sharpness properties for idealized objectives in phase retrieval, blind deconvolution, and robust PCA problems.
Approximation and sharpness, taken together, guarantee rapid convergence of numerical methods when initialized within the tube:
For common low-rank recovery problems, has an intuitive interpretation: it consists of those matrices that are within constant relative error of the solution. We note that standard spectral initialization techniques, in turn, can generate such matrices with nearly optimal sample complexity. We refer the reader to the survey , and references therein, for details.
The following is the guiding algorithmic principle of this work:
When initialized at , the prox-linear algorithm converges quadratically to the solution set ; the subgradient method, in turn, converges linearly with a rate governed by ratio , where is the Lipschitz constant of on .222Both the parameters and must be properly chosen for these guarantees to take hold.
In light of this observation, our strategy can be succinctly summarized as follows. We will show that for a variety of low-rank recovery problems, the parameters (or variants) are dimension independent under standard statistical assumptions. Consequently, the formulations (1.1) and (1.2) are “well-conditioned”, and subgradient and prox-linear methods converge rapidly when initialized within constant relative error of the optimal solution.
Approximation and sharpness via the Restricted Isometry Property
We begin verifying our thesis by showing that the composite problems, (1.1) and (1.2), are well-conditioned under the following Restricted Isometry Property (RIP): there exists a norm and numerical constants so that
for all matrices of rank at most . We argue that under RIP,
the nonsmooth norm is a natural penalty function to use.
Indeed, as we will show, the composite loss in the symmetric setting admits constants
that depend only on the RIP parameters and the extremal singular values of:
In particular, the initialization ratio scales as and the condition number scales as . Consequently, the rapid local convergence guarantees previously described immediately take-hold. The asymmetric setting is slightly more nuanced since the objective function is sharp only on bounded sets. Nonetheless, it can be analyzed in a similar way leading to analogous rapid convergence guarantees. Incidentally, we show that the prox-linear method converges rapidly without any modification; this is in contrast to smooth methods, which typically require incorporating an auxiliary regularization term into the objective (e.g. ). We note that similar results in the symmetric setting were independently obtained in the complimentary work , albeit with a looser estimate of ; the two treatments of the asymmetric setting are distinct, however.333The authors of  provide a bound on that scales with the Frobenius norm . We instead derive a sharper bound that scales as . As a byproduct, the linear rate of convergence for the subgradient method scales only with the condition number instead of .
After establishing basic properties of the composite loss, we turn our attention to verifying RIP in several concrete scenarios. We note that the seminal works [50, 13] showed that if arises from a Gaussian ensemble, then in the regime RIP holds with high probability for the scaled norm . More generally when is highly structured, RIP may be most naturally measured in a non-Euclidean norm. For example, RIP with respect to the scaled norm holds for phase retrieval [29, 27], blind deconvolution , and quadratic sensing ; in contrast, RIP relative to the scaled norm fails for all three problems. In particular, specializing our results to the aforementioned recovery tasks yields solution methodologies with best known sample and computational complexity guarantees. Notice that while one may “smooth-out” the norm by squaring it, we argue that it may be more natural to optimize the norm directly as a nonsmooth penalty. Moreover, we show that penalization enables exact recovery even if a constant fraction of measurements is corrupted by outliers.
Beyond RIP: matrix completion and robust PCA
The RIP assumption provides a nice vantage point for analyzing the problem parameters . There are, however, a number of important problems, which do not satisfy RIP. Nonetheless, the general paradigm based on the interplay of sharpness and approximation is still powerful. We consider two such settings, matrix completion and robust principal component analysis (PCA), leveraging some intermediate results from .
The goal of the matrix completion problem  is to recover a low rank matrix from its partially observed entries. We focus on the formulation
where is the projection onto the index set of observed entries and
is the set of incoherent matrices. To analyze the conditioning of this formulation, we assume that the indices in are chosen as i.i.d. Bernoulli with parameter and that all nonzero singular values of are equal to one. Using results of , we quickly deduce sharpness with high probability. The error in approximation, however, takes the following nonstandard form. In the regime for some constants and , the estimate holds with high probability:
The following modification of the prox-linear method therefore arises naturally:
We show that subgradient methods and the prox-linear method, thus modified, both converge at a dimension independent linear rate when initialized near the solution. Namely, as long as and are below some constant thresholds, both the subgradient and the modified prox-linear methods converge linearly with high probability:
Here is a numerical constant. Notice that the prox-linear method enjoys a much faster rate of convergence that is independent of any unknown constants or problem parameters—an observation fully supported by our numerical experiments.
As the final example, we consider the problem of robust PCA [11, 15], which aims to decompose a given matrix into a sum of a low-rank and a sparse matrix. We consider two different problem formulations:
where and are appropriately defined convex regions. Under standard incoherence assumptions, we show that the formulation (1.5) is well-conditioned, and therefore subgradient and prox-linear methods are applicable. Still, formulation (1.5) has a major drawback in that one must know properties of the optimal sparse matrix in order to define the constraint set , in order to ensure good conditioning. Consequently, we analyze formulation (1.6) as a more practical alternative.
The analysis of (1.6) is more challenging than that of (1.5). Indeed, it appears that we must replace the Frobenius norm in the approximation/sharpness conditions with the sum of the row norms . With this set-up, we verify the convex approximation property in general:
and sharpness only when . We conjecture, however, that an analogous sharpness bound holds for all . It is easy to see that the quadratic convergence guarantees for the prox-linear method do not rely on the Euclidean nature of the norm, and the algorithm becomes applicable. To the best of our knowledge, it is not yet known how to adapt linearly convergent subgradient methods to the non-Euclidean setting.
Robust recovery with sparse outliers and dense noise
The aforementioned guarantees lead to exact recovery of under noiseless or sparsely corrupted measurements . A more realistic noise model allows for further corruption by a dense noise vector of small norm. Exact recovery is no longer possible with such errors. Instead, we should only expect to recover up to a tolerance proportional to the size of . Indeed, we show that appropriately modified subgradient and prox-linear algorithms converge linearly and quadratically, respectively, up to the tolerance for an appropriate norm . Finally, we discuss in detail the case of recovering a low rank PSD matrix from the corrupted measurements , where represents sparse outliers and represents small dense noise. To the best of our knowledge, theoretical guarantees for this error model have not been previously established in the nonconvex low-rank recovery literature. Surprisingly, we show it is possible to recover the matrix up to a tolerance independent of the norm or location of the outliers .
We conclude with an experimental evaluation of our theoretical findings on quadratic and bilinear matrix sensing, matrix completion, and robust PCA problems. In the first set of experiments, we test the robustness of the proposed methods against varying combinations of rank/corruption level by reporting the empirical recovery rate across independent runs of synthetic problem instances. All the aforementioned model problems exhibit sharp phase transitions, yet our methods succeed for more than moderate levels of corruption (or unobserved entries in the case of matrix completion). For example, in the case of matrix sensing, we can corrupt almost half of the measurementsand still retain perfect recovery rates. Interestingly, our experimental findings indicate that the prox-linear method can tolerate slightly higher levels of corruption compared to the subgradient method, making it the method of choice for small-to-moderate dimensions.
We then demonstrate that the convergence rate analysis is fully supported by empirical evidence. In particular, we test the subgradient and prox-linear methods for different rank/corruption configurations. In the case of quadratic/bilinear sensing and robust PCA, we observe that the subgradient method converges linearly and the prox-linear method converges quadratically, as expected. In particular, our numerical experiments appear to support our sharpness conjecture for the robust PCA problem. In the case of matrix completion, both algorithms converge linearly. The prox-linear method in particular, converges extremely quickly, reaching high accuracy solutions in under iterations for reasonable values of .
In the noiseless setting, we compare against gradient descent with constant step-size on smooth formulations of each problem (except for robust PCA). We notice that the Polyak subgradient method outperforms gradient descent in all cases. That being said, one can heuristically equip gradient descent with the Polyak step-size as well. To the best of our knowledge, the gradient method with Polyak step-size has has not been investigated on smooth problem formulations we consider here. Experimentally, we see that the Polyak (sub)gradient methods on smooth and nonsmooth formulations perform comparably in the noiseless setting.
Outline of the paper
The outline of the paper is as follows. Section 2 records some basic notation we will use. Section 3 informally discusses the sharpness and approximation properties, and their impact on convergence of the subgradient and prox-linear methods. Section 4 analyzes the parameters under RIP. Section 5 rigorously discusses convergence guarantees of numerical methods under regularity conditions. Section 6 reviews examples of problems satisfying RIP and deduces convergence guarantees for subgradient and prox-linear algorithms. Sections 7 and 8 discuss the matrix completion and robust PCA problems, respectively. Section 9 discusses robust recovery up to a noise tolerance. The final Section 10 illustrates the developed theory and algorithms with numerical experiments on quadratic/bi-linear sensing, matrix completion, and robust PCA problems.
In this section, we summarize the basic notation we will use throughout the paper. Henceforth, the symbol will denote a Euclidean space with inner product and the induced norm . The closed unit ball in will be denoted by , while a closed ball of radius around a point will be written as . For any point and a set , the distance and the nearest-point projection in -norm are defined by
respectively. For any pair of functions and on , the notation will mean that there exists a numerical constant such that for all . Given a linear map between Euclidean spaces, , the adjoint map will be written as . We will use for the
-dimensional identity matrix and
for the zero matrix with variable sizes. The symbolwill be shorthand for the set
We will always endow the Euclidean space of vectors with the usual dot-product and the induced -norm. More generally, the norm of a vector will be denoted by . Similarly, we will equip the space of rectangular matrices with the trace product and the induced Frobenius norm . The operator norm of a matrix will be written as . The symbol will denote the vector of singular values of a matrix in nonincreasing order. We also define the row-wise matrix norms . The symbols , , , and will denote the sets of symmetric, positive semidefinite, orthogonal, and invertible matrices, respectively.
Nonsmooth functions will play a central role in this work. Consequently, we will require some basic constructions of generalized differentiation, as described for example in the monographs [52, 45, 4]. Consider a function and a point , with finite. The subdifferential of at , denoted by , is the set of all vectors satisfying
Here denotes any function satisfying as . Thus, a vector lies in the subdifferential precisely when the linear function lower-bounds up to first-order around . Standard results show that for a convex function the subdifferential reduces to the subdifferential in the sense of convex analysis, while for a differentiable function it consists only of the gradient: . For any closed convex functions and and -smooth map , the chain rule holds [52, Theorem 10.6]:
We say that a point is stationary for whenever the inclusion holds. Equivalently, stationary points are precisely those that satisfy first-order necessary conditions for minimality: the directional derivative is nonnegative in every direction.
We say a that a random vector in is -sub-gaussian whenever for all unit vectors . The sub-gaussian norm
of a real-valued random variableis defined to be , while the sub-exponential norm is defined by .
3 Regularity conditions and algorithms (informal)
As outlined in Section 1, we consider the low-rank matrix recovery problem within the framework of compositional optimization:
where is a closed convex set, is a finite convex function and is a -smooth map. We depart from previous work on low-rank matrix recovery by allowing to be nonsmooth. We primary focus on those algorithms for (3.1) that converge rapidly (linearly or faster) when initialized sufficiently close to the solution set.
Such rapid convergence guarantees rely on some regularity of the optimization problem. In the compositional setting, regularity conditions take the following appealing form.
Suppose that the following properties hold for the composite optimization problem (3.1) for some real numbers .
(Approximation accuracy) The convex models satisfy the estimate
(Sharpness) The set of minimizers is nonempty and we have
(Subgradient bound) The bound, , holds for any in the tube
As pointed out in the introduction, these three properties are quite intuitive: The approximation accuracy guarantees that the objective function is well approximated by the convex model , up to a quadratic error relative to the basepoint . Sharpness stipulates that the objective function should grow at least linearly as one moves away from the solution set. The subgradient bound, in turn, asserts that the subgradients of are bounded in norm by on the tube . In particular, this property is implied by Lipschitz continuity on .
Lemma 3.1 (Subgradient bound and Lipschitz continuity [52, Theorem 9.13]).
Suppose a function is -Lipschitz on an open set . Then the estimate holds for all .
The definition of the tube might look unintuitive at first. Some thought, however, shows that it arises naturally since it provably contains no extraneous stationary points of the problem. In particular, will serve as a basin of attraction of numerical methods; see the forthcoming Section 5 for details. The following general principle has recently emerged [23, 27, 24, 16]. Under Assumption A, basic numerical methods converge rapidly when initialized within the tube . Let us consider three such procedures and briefly describe their convergence properties. Detailed convergence guarantees are deferred to Section 5.
Algorithm 1 is the so-called Polyak subgradient method. In each iteration , the method travels in the negative direction of a subgradient , followed by a nearest-point projection onto . The step-length is governed by the current functional gap . In particular, one must have the value explicitly available to implement the procedure. This value is sometimes known; case in point, the minimal value of the penalty formulations (1.1) and (1.2) for low-rank recovery is zero when the linear measurements are exact. When the minimal value is not known, one can instead use Algorithm 2, which replaces the step-length with a preset geometrically decaying sequence. Notice that the per iteration cost of both subgradient methods is dominated by a single subgradient evaluation and a projection onto . Under appropriate parameter settings, Assumption A guarantees that both methods converge at a linear rate governed by the ratio , when initialized within . The prox-linear algorithm (Algorithm 2), in contrast, converges quadratically to the optimal solution, when initialized within . The caveat is that each iteration of the prox-linear method requires solving a strongly convex subproblem. Note that for low-rank recovery problems (1.1) and (1.2), the size of the subproblems is proportional to the size of the factors and not the size of the matrices.
In the subsequent sections, we show that Assumption A (or a close variant) holds with favorable parameters for common low-rank matrix recovery problems.
4 Regularity under RIP
In this section, we consider the low-rank recovery problems (1.1) and (1.2), and show that restricted isometry properties of the map naturally yield well-conditioned compositional formulations.444The guarantees we develop in the symmetric setting are similar to those in the recent preprint , albeit we obtain a sharper bound on ; the two sets of results were obtained independently. The guarantees for the asymmetric setting are different and are complementary to each other: we analyze the conditioning of the basic problem formulation (1.2), while  introduces a regularization term that improves the basin of attraction for the subgradient method by a factor of the condition number of . The arguments are short and elementary, and yet apply to such important problems as phase retrieval, blind deconvolution, and covariance matrix estimation.
Setting the stage, consider a linear map , an arbitrary rank matrix , and a vector modeling a corrupted estimate of the measurements . Recall that the goal of low-rank matrix recovery is to determine given and . By the term symmetric setting, we mean that is symmetric and positive semidefinite, whereas by asymmetric setting we mean that is an arbitrary rank matrix. We will treat the two settings in parallel. In the symmetric setting, we use to denote any fixed matrix for which the factorization holds. Similarly, in the asymmetric case, and denote any fixed and matrices, respectively, satisfying .
We are interested in the set of all possible factorization of . Consequently, we will often appeal to the following representations:
Henceforth, fix an arbitrary norm on . The following property, widely used in the literature on low-rank recovery, will play a central role in this section.
Assumption B (Restricted Isometry Property (RIP)).
There exist constants such that for all matrices of rank at most the following bound holds:
Assumption B is classical and is satisfied in various important problems with the rescaled -norm and -norm .555In the latter case, RIP also goes by the name of Restricted Uniform Boundedness (RUB) . In Section 6 we discuss a number of such examples including matrix sensing under (sub-)Gaussian design, phase retrieval, blind deconvolution, and quadratic/bilinear sensing. We summarize the RIP properties for these examples in Table 1 and refer the reader to Section 6 for the precise statements.
|Quadratic sensing I|
|Quadratic sensing II|
while the asymmetric formulation (1.2) becomes
Our immediate goal is to show that under Assumption B, the problems (4.3) and (4.4) are well-conditioned in the sense of Assumption A. We note that the asymmetric setting is more nuanced that its symmetric counterpart because Assumption A can only be guaranteed to hold on bounded sets. Nonetheless, as we discuss in Section 5, a localized version of Assumption A suffices to guarantee rapid local convergence of subgradient and prox-linear methods. In particular, our analysis of the local sharpness in the asymmetric setting is new and illuminating; it shows that the regularization technique suggested in  is not needed at all for the prox-linear method. This conclusion contrasts with known techniques in the smooth setting, where regularization is often used.
4.1 Approximation and Lipschitz continuity
We begin with the following elementary proposition, which estimates the subgradient bound and the approximation modulus in the symmetric setting. In what follows, we will use the expressions
Proposition 4.1 (Approximation accuracy and Lipschitz continuity (symmetric)).
Suppose Assumption B holds. Then for all the following estimates hold:
The estimates of and in the asymmetric setting are completely analogous; we record them in the following proposition.
Proposition 4.2 (Approximation accuracy and Lipschitz continuity (asymmetric)).
Suppose Assumption B holds. Then for all and the following estimates hold:
To see the first estimate, observe that
where the last estimate follows from Young’s inequality Next, we successively compute:
The result follows by noting that for all .
We next move on to estimates of the sharpness constant . We first deal with the noiseless setting in Section 4.2.1, and then move on to the general case when the measurements are corrupted by outliers in Section 4.2.2.
4.2.1 Sharpness in the noiseless regime
We begin with with the symmetric setting in the noiseless case . By Assumption B, we have the estimate
It follows that the set of minimizers coincides with the set of minimizers of the function , namely
Thus to argue sharpness of it suffices to estimate the sharpness constant of the function . Fortunately, this calculation was already done in [57, Lemma 5.4].
Proposition 4.3 ([57, Lemma 5.4]).
For any matrices , we have the bound
Consequently if Assumption B holds in the noiseless setting , then the bound holds:
We next consider the asymmetric case. By exactly the same reasoning as before, the set of minimizers of coincides with the set of minimizers of the function , namely
Thus to argue sharpness of it suffices to estimate the sharpness constant of the function . Such a sharpness guarantee in the rank one case was recently shown in [16, Proposition 4.2].
Proposition 4.4 ([16, Proposition 4.2]).
Fix a rank matrix and a constant . Then for any and satisfying
the following estimate holds:
Notice that in contrast to the symmetric setting, the sharpness estimate is only valid on bounded sets. Indeed, this is unavoidable even in the setting . To see this, define and for any set and . It is routine to compute
Therefore letting tend to zero (or infinity) the quotient tends to zero.
The following corollary is a higher rank extension of Proposition 4.4.
Theorem 4.5 (Sharpness (asymmetric and noiseless)).
Fix a constant and define and , where is any compact singular value decomposition of
is any compact singular value decomposition of. Then for all and satisfying