1 Introduction
The ubiquity of advanced sensing and imaging technologies produces vast amounts of data at an unprecedented rate. A fundamental goal of signal processing is to extract, and possibly track the evolution of, the relevant structural information faithfully from such high-dimensional data, ideally with a minimal amount of computation, storage, and human intervention. To overcome the curse of dimensionality, it is important to exploit the fact that real-world data often possess some
low-dimensional geometric structures. In particular, such structures allow for a succinct description of the data by a number of parameters much smaller than the ambient dimension. One popular postulate of low-dimensional structure is sparsity, that is, a signal can be represented using a few nonzero coefficients in a proper domain. For instance, a natural image often has a sparse representation in the wavelet domain. The field of compressed sensing [1, 2] has made tremendous progress in capitalizing on sparsity structures, particularly in solving underdetermined linear systems arising from sample-starved applications such as medical imaging, spectrum sensing, and network monitoring. In these applications, compressed sensing techniques allow for faithful estimation of the signal of interest from a number of measurements proportional to the sparsity level, much fewer than required by traditional techniques. The power of compressed sensing has made it a disruptive technology in many applications such as magnetic resonance imaging (MRI): a cardiac cine scan can now be performed within 25 seconds with the patient breathing freely. This is in sharp contrast to the previous status quo, where the scan takes up to six minutes and the patient needs to hold his or her breath several times [3]. While the sparsity model is powerful, the original framework of compressed sensing mainly focuses on vector-valued signals that admit sparse representations in an a priori known domain. However, knowledge of such sparsifying domains is not always available, thus limiting its applications. Fortunately, one can resort to a more general notion of sparsity that is more versatile when handling matrix-valued signals, or an ensemble of vector-valued signals, without the need of specifying a sparsifying basis. In this paper, we review this powerful generalization of sparsity, termed the low-rank model, which captures a much broader class of low-dimensional structures. Roughly speaking, this model postulates that the matrix-valued signal is approximately low-rank. If we view each column of the matrix as a data vector, then this is equivalent to saying that the data approximately lie in a low-dimensional but unknown subspace. Historically, the exploitation of low-rank structures may begin even earlier than that of sparsity. In particular, the low-rank assumption is what underlies classical principal component analysis (PCA)
[4], which builds on the observation that real-world data have most of their variance concentrated in the first few principal components. Such low-rank structures may arise due to various physical reasons and engineering designs. In face recognition, face images are found to trace out a low-dimensional subspace if the faces are approximately convex and reflect light according to Lambert's law [5]. In radar and sonar signal processing, the signals reside approximately in a low-dimensional subspace due to transmitting a small set of waveforms to construct certain beam patterns. Low-rank structures also arise from modeling interactions between different objects. For example, in clustering or embedding, the pairwise interactions between objects can often be expressed as a low-rank matrix [6]. Given the collected data, the key problem is to infer the hidden low-dimensional subspace that captures most of the information relevant for subsequent tasks such as detection, clustering, and parameter estimation. Traditional methods such as the singular value decomposition (SVD) for finding principal subspaces typically require the data to be fully observed. However, modern data applications often involve estimation problems with a number of measurements that is much smaller than the ambient dimension, a regime similar to the setting of compressed sensing. We refer to this problem as
low-rank matrix estimation, emphasizing the fact that one only has undersampled measurements or partial observations. Examples of such problems abound. In recommendation systems, the goal is to estimate the missing ratings given a small number of observed ones. In sensor networks, an important problem is to infer the locations of the sensors from pairwise distance measurements, which are available only for sensors within a certain radius of each other. In wideband spectrum sensing, to reduce the sampling rate, a popular approach is to estimate the signal subspace and bearing parameters by randomly subsampling the outputs of the array. In these applications, it is desirable to develop low-rank matrix estimation algorithms that are both statistically efficient, achieving low estimation errors with a minimal amount of (noisy) measurements, and computationally efficient, having low running time and storage cost. A particular focus of this paper is on algorithms that come with provable guarantees for their statistical and computational efficiency. The search for such algorithms is in part motivated by the remarkable success story of compressed sensing, for which many provable methods have been developed for sparse models. Handling the more general low-rank structures poses a new set of challenges as well as opportunities. The study of low-rank matrix estimation has attracted the attention of many researchers from diverse communities including signal processing, machine learning, statistics, mathematical programming, and computer science [7]–[11]. As we elaborate below, this enterprise has been very fruitful, resulting in many powerful algorithms, novel analytical techniques, and deep theoretical insights.
This survey article is complementary to the overview article on low-rank matrix recovery by Davenport and Romberg [12], with a different focus. In particular, by concentrating on recent algorithmic advances with computational and statistical guarantees, we highlight the effectiveness of first-order methods in both convex and nonconvex optimization. We also place specific emphasis on a unique set of applications involving structured matrix completion.
1.1 Paper Organization
The rest of this paper is organized as follows. Section 2 motivates low-rank models from the perspectives of modeling data correlations and lifting vector-valued problems. Section 3 describes the basic mathematical setup of the low-rank estimation problem. Section 4 discusses the theory and algorithms for low-rank matrix estimation via convex optimization. Section 5 discusses the theory and algorithms for low-rank matrix estimation via nonconvex optimization. Section 6 discusses structured matrix completion, where the low-rank matrices have additional structural constraints, using several concrete examples. Numerical examples on a real-world recommendation dataset are showcased in Section 7. The paper is concluded in Section 8.
1.2 Notation
Throughout this paper, we use boldface capital letters such as $\boldsymbol{A}$ to denote matrices, with $\boldsymbol{A}^\top$ being the transpose of $\boldsymbol{A}$ and $A_{ij}$ being its $(i,j)$-th entry. Similarly, we use boldface lowercase letters such as $\boldsymbol{a}$ to denote vectors, with $\boldsymbol{a}^*$ being its conjugate transpose and $a_i$ being its $i$-th entry. The expectation is denoted by $\mathbb{E}$. In addition, $\|\boldsymbol{A}\|$, $\|\boldsymbol{A}\|_{\mathrm{F}}$, $\|\boldsymbol{A}\|_{2,\infty}$, $\operatorname{tr}(\boldsymbol{A})$, and $\|\boldsymbol{A}\|_*$ stand for the spectral norm (i.e., the largest singular value), the Frobenius norm, the $\ell_2/\ell_\infty$ norm (i.e., the largest $\ell_2$ norm of the rows), the trace, and the nuclear norm (i.e., the sum of the singular values) of the matrix $\boldsymbol{A}$. For two matrices $\boldsymbol{A}$ and $\boldsymbol{B}$ of the same size, $\langle \boldsymbol{A},\boldsymbol{B}\rangle=\operatorname{tr}(\boldsymbol{A}^\top\boldsymbol{B})$ denotes their trace inner product. The notation $\operatorname{diag}(\boldsymbol{a})$ denotes a diagonal matrix whose diagonal entries are given by the vector $\boldsymbol{a}$. We use $\boldsymbol{e}_i$ to denote the $i$-th standard basis vector of $\mathbb{R}^n$, for each $i=1,\ldots,n$.
2 The Ubiquity of Low-Rank Models
In this section, we elucidate the motivations for studying low-rank modeling and low-rank matrix estimation problems. We start with a classical viewpoint and justify the low-rank priors of a data matrix from the perspective of bias-variance trade-offs for modeling correlations in data observations. We next argue that low-rank structures arise from a powerful reformulation of quadratic optimization problems by lifting them into a matrix space. Last but not least, we provide a list of other sources of low-rank structures in a wide range of science and engineering problems.
2.1 Correlation-Aware Modeling of Data Matrices
We first motivate the use of low-rank models as a general principle for bias-variance trade-off in signal estimation and processing given noisy data, which is a classical viewpoint articulated by Scharf and Tufts in [13]. Consider a stationary signal $\boldsymbol{x}\in\mathbb{R}^n$ with covariance matrix $\boldsymbol{\Sigma}=\mathbb{E}[\boldsymbol{x}\boldsymbol{x}^\top]$. Suppose that the eigenvalue decomposition of $\boldsymbol{\Sigma}$ is given by $\boldsymbol{\Sigma}=\sum_{i=1}^{n}\lambda_i\boldsymbol{u}_i\boldsymbol{u}_i^\top$, where $\{\boldsymbol{u}_i\}_{i=1}^n$ are the eigenvectors, and $\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n\ge 0$ are the eigenvalues arranged as a non-increasing sequence. Define the matrix $\boldsymbol{P}_r=\boldsymbol{U}_r\boldsymbol{U}_r^\top$, where $\boldsymbol{U}_r=[\boldsymbol{u}_1,\ldots,\boldsymbol{u}_r]$ is the matrix whose columns are the eigenvectors corresponding to the $r$ largest eigenvalues. One sees that $\boldsymbol{P}_r$ is the projection operator onto the subspace spanned by the top $r$ eigenvectors. Imagine that we receive a noisy copy of the signal as
$$\boldsymbol{y}=\boldsymbol{x}+\boldsymbol{w},$$
where the elements of the noise vector $\boldsymbol{w}$ are independent with variance $\sigma^2$. Let us estimate the signal using a reduced-rank model with rank $r$ by projecting the observed data onto the $r$-dimensional principal subspace of $\boldsymbol{\Sigma}$. Such an estimate is given by $\hat{\boldsymbol{x}}_r=\boldsymbol{P}_r\boldsymbol{y}$.
We make the crucial observation that one may decompose the mean squared error of the estimate into two terms:
$$\mathbb{E}\big[\|\hat{\boldsymbol{x}}_r-\boldsymbol{x}\|_2^2\big]=\sum_{i=r+1}^{n}\lambda_i + r\sigma^2. \qquad (1)$$
Here the first term, $\sum_{i=r+1}^{n}\lambda_i$, corresponds to the model bias, which arises due to approximating a full-rank signal by a rank-$r$ model. The second term, $r\sigma^2$, corresponds to the model variance, which arises due to the presence of noise.
From the above decomposition, we see that as one increases the rank $r$, the bias of the estimate decreases, whereas the corresponding variance increases. Therefore, the choice of the rank controls the trade-off between the bias and the variance, whose sum constitutes the total estimation error. Importantly, many real-world datasets have a decaying spectrum, which in the above notation means that the eigenvalues of the covariance matrix decrease rapidly. As mentioned, this insight is the foundation of PCA [4] and, moreover, is observed across a wide range of applications including power systems, recommendation systems, Internet traffic, and weather data. Consequently, it is beneficial to employ a small rank, so that the variance is controlled while the bias remains small as long as the residual eigenvalues decrease quickly. With this in mind, we show in Fig. 1 the mean squared error as a function of the rank, as well as its decomposition into the bias and the variance, for a synthetic covariance spectrum with polynomial decay at a moderate signal-to-noise ratio (SNR). It is clear from the figure that employing an appropriate low-rank estimator induces a much lower mean squared error than a full-rank one. In particular, the optimal rank may be much smaller than the ambient dimension when the spectrum decays fast.
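The bias-variance trade-off in (1) is easy to reproduce numerically. The following sketch uses a hypothetical polynomially decaying spectrum and a hypothetical noise level; the specific values are illustrative, not those used in Fig. 1.

```python
import numpy as np

n = 50                                        # ambient dimension (hypothetical)
eigvals = 1.0 / np.arange(1, n + 1) ** 2      # decaying spectrum, lambda_i ~ 1/i^2
sigma2 = 1e-3                                 # per-coordinate noise variance (hypothetical)

def reduced_rank_mse(r):
    """MSE of the rank-r estimator per (1): residual eigenvalues (bias)
    plus r * sigma2 (variance)."""
    return eigvals[r:].sum() + r * sigma2

mses = np.array([reduced_rank_mse(r) for r in range(1, n + 1)])
r_opt = int(np.argmin(mses)) + 1              # rank minimizing the total error
```

The optimal rank sits strictly between 1 and the ambient dimension: truncating too aggressively incurs a large bias, while the full-rank estimator pays the maximal variance.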
2.2 Lifting for Quadratic and Bilinear Optimization Problems
Another important source of low-rank structures is solving quadratic/bilinear optimization problems. As an example, consider the phase retrieval problem [14], an important routine in X-ray crystallography and optical imaging, where the goal is to recover a vector $\boldsymbol{x}$ in $\mathbb{R}^n$ or $\mathbb{C}^n$ given only the magnitudes of its linear measurements, that is,
$$y_i=|\langle \boldsymbol{a}_i,\boldsymbol{x}\rangle|^2,\qquad i=1,\ldots,m. \qquad (2)$$
Due to the nonlinear nature of these equations, it is difficult to solve them directly, particularly when the problem size is large. A popular approach for solving such equations is called lifting: one rewrites the above equations in terms of the matrix variable $\boldsymbol{X}=\boldsymbol{x}\boldsymbol{x}^*$, and casts the problem as recovering the rank-one matrix $\boldsymbol{X}$ from a set of linear measurements [15]. A similar formulation has been used for the blind deconvolution problem; cf. [16].
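The lifting identity underlying this reformulation is $|\langle \boldsymbol{a},\boldsymbol{x}\rangle|^2=\langle \boldsymbol{a}\boldsymbol{a}^\top,\boldsymbol{x}\boldsymbol{x}^\top\rangle$, which can be checked numerically. The sizes below are arbitrary, and the real-valued case is used for simplicity.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
x = rng.standard_normal(n)          # unknown signal (hypothetical)
a = rng.standard_normal(n)          # one sensing vector

# Quadratic measurement |<a, x>|^2 ...
y_quadratic = np.dot(a, x) ** 2

# ... equals a *linear* measurement of the lifted rank-one matrix X = x x^T:
X = np.outer(x, x)                  # lifted variable, rank one
A = np.outer(a, a)                  # lifted measurement matrix
y_linear = np.trace(A @ X)          # <a a^T, X>
```

The quadratic measurement in $\boldsymbol{x}$ thus becomes a linear measurement in the lifted variable $\boldsymbol{X}$, at the cost of squaring the dimension and imposing a rank-one constraint.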
The lifting approach can be applied to other classes of quadratic equations, whose lifted formulations may lead to low-rank matrices of rank larger than one. For instance, in the problem of sensor network localization [17], the goal is to determine the locations of a set of $n$ points/sensors $\{\boldsymbol{x}_i\}_{i=1}^n$ lying in an $r$-dimensional Euclidean space, where $r\ll n$, given a subset of their pairwise distances. The complete set of pairwise Euclidean distances can be arranged as a matrix $\boldsymbol{D}=[D_{ij}]\in\mathbb{R}^{n\times n}$, where $D_{ij}=\|\boldsymbol{x}_i-\boldsymbol{x}_j\|_2^2$. Interestingly, each pairwise distance (that is, an entry of $\boldsymbol{D}$) is in fact a linear function of the rank-$r$, positive semidefinite (PSD) matrix $\boldsymbol{G}=\boldsymbol{X}^\top\boldsymbol{X}$, where $\boldsymbol{X}=[\boldsymbol{x}_1,\ldots,\boldsymbol{x}_n]$; more precisely, one has
$$D_{ij}=G_{ii}+G_{jj}-2G_{ij}. \qquad (3)$$
Therefore, the problem of determining the locations of the sensors is equivalent to recovering the low-rank lifted matrix from a set of linear measurements in the form of (3); see [17] for a more detailed treatment of this powerful reformulation.
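The linear relation (3) between the squared distances and the lifted Gram matrix can be checked directly. The sketch below uses hypothetical planar sensors (two-dimensional positions with arbitrary sizes).

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 2, 6                          # sensors in the plane (hypothetical sizes)
P = rng.standard_normal((d, n))      # sensor positions as columns

G = P.T @ P                          # lifted Gram matrix, rank <= d
g = np.diag(G)
D = g[:, None] + g[None, :] - 2 * G  # D_ij = G_ii + G_jj - 2 G_ij, as in (3)

# Compare against directly computed pairwise squared distances
D_direct = ((P[:, :, None] - P[:, None, :]) ** 2).sum(axis=0)
```

Even though the number of sensors is large, the lifted matrix has rank at most the dimension of the physical space, which is what makes the low-rank formulation effective.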
2.3 Other Sources of Low-Rank Structures
There are many potential sources of low-rank structures. Below we provide a few further examples drawn from different science and engineering domains:

- In recommendation systems [19], the matrix of user ratings for a set of items is often approximately low-rank, as user preferences typically depend on a small number of underlying factors and hence their ratings correlate with one another.

- In quantum state tomography, the density matrix of a pure or nearly pure quantum state is approximately low-rank, which can be exploited in the problem of state reconstruction from a small number of Pauli measurements [21].

- In a sparse graphical model with latent variables, one can show, using the Schur complement, that the inverse marginal covariance matrix of the observed variables can be approximated by a matrix with rank equal to the number of latent variables [22].

- Matrices with certain monotonicity properties can be well-approximated by a matrix with rank much smaller than the ambient dimension. Such matrices arise, for example, when measuring the pairwise comparison scores of a set of objects that possess an underlying ordering [23].

- The pairwise affinity matrix of a set of objects is often approximately low-rank due to the presence of clustering/community structures [24] (cf. Section 6.2).
The list goes on. The ubiquity of these structures, either as a physical property or as an engineering choice, is what makes low-rank models useful, and motivates the extensive study of the low-rank matrix estimation problem.
3 Low-Rank Matrix Estimation from Incomplete Observations
In this section, we formally define the problem of low-rank matrix estimation, that is, recovery of a low-rank matrix from a number of measurements much smaller than the dimension of the matrix. Let $\boldsymbol{X}\in\mathbb{R}^{n_1\times n_2}$ be the matrix-valued signal of interest.¹ Denote the SVD of $\boldsymbol{X}$ by
$$\boldsymbol{X}=\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^\top=\sum_{i=1}^{\min\{n_1,n_2\}}\sigma_i\boldsymbol{u}_i\boldsymbol{v}_i^\top,$$
where the singular values $\sigma_1\ge\sigma_2\ge\cdots\ge 0$ are organized in non-increasing order. The best rank-$r$ approximation of $\boldsymbol{X}$ is defined as
$$\boldsymbol{X}_r:=\operatorname*{argmin}_{\operatorname{rank}(\boldsymbol{Z})\le r}\|\boldsymbol{X}-\boldsymbol{Z}\|_{\mathrm{F}}.$$
By the Eckart-Young theorem, the optimal approximation is given by
$$\boldsymbol{X}_r=\sum_{i=1}^{r}\sigma_i\boldsymbol{u}_i\boldsymbol{v}_i^\top. \qquad (4)$$
Correspondingly, the rank-$r$ approximation error is given by $\|\boldsymbol{X}-\boldsymbol{X}_r\|_{\mathrm{F}}$, and we say that the matrix $\boldsymbol{X}$ is approximately low-rank if its rank-$r$ approximation error is small for some $r\ll\min\{n_1,n_2\}$.

¹Our discussions can be extended to complex-valued matrices straightforwardly.
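As a sanity check on the Eckart-Young characterization, the following numpy sketch (with arbitrary hypothetical sizes) forms the best rank-$r$ approximation by truncating the SVD and verifies that the Frobenius approximation error equals the energy of the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((20, 15))    # a generic (full-rank) test matrix
r = 3

U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]   # best rank-r approximation, as in (4)

# Eckart-Young: the Frobenius error equals the tail singular values' energy
err = np.linalg.norm(M - M_r, 'fro')
tail = np.sqrt((s[r:] ** 2).sum())
```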
As mentioned, in many modern applications one does not directly observe $\boldsymbol{X}$, but rather is given an underdetermined set of indirect noisy measurements of it. Here we assume that one has access to a set of $m$ linear measurements of the form
$$y_i=\langle \boldsymbol{A}_i,\boldsymbol{X}\rangle+w_i,\qquad i=1,\ldots,m, \qquad (5)$$
where $\boldsymbol{A}_i\in\mathbb{R}^{n_1\times n_2}$ is the $i$-th measurement matrix, and $w_i$ is a noise term. We may rewrite these equations more compactly in matrix form as
$$\boldsymbol{y}=\mathcal{A}(\boldsymbol{X})+\boldsymbol{w}, \qquad (6)$$
where $\mathcal{A}:\mathbb{R}^{n_1\times n_2}\to\mathbb{R}^m$ is the linear measurement operator defined by $[\mathcal{A}(\boldsymbol{X})]_i=\langle \boldsymbol{A}_i,\boldsymbol{X}\rangle$, and $\boldsymbol{w}=[w_1,\ldots,w_m]^\top$ is the vector of noise terms. Denote by $\mathcal{A}^*$ the adjoint operator of $\mathcal{A}$, where $\mathcal{A}^*(\boldsymbol{y})=\sum_{i=1}^m y_i\boldsymbol{A}_i$.
As we are primarily concerned with estimating $\boldsymbol{X}$ from undersampled measurements, direct approximation via the SVD and the Eckart-Young theorem is impossible. Instead, we need to develop alternative methods to find an (approximate) low-rank solution that best fits the set of noisy underdetermined linear equations (6). We further categorize the low-rank matrix estimation problem into two main types based on the structure of the measurement operator:

- Low-rank matrix sensing, in which one observes linear combinations of the entries of $\boldsymbol{X}$, where each measurement matrix $\boldsymbol{A}_i$ defining the linear combinations is typically dense.

- Low-rank matrix completion, in which one directly observes a subset of the entries of $\boldsymbol{X}$ and aims to interpolate the missing entries. In this case, each $\boldsymbol{A}_i$ is a sparse matrix with a single entry equal to 1 at the corresponding observed index.
For matrix completion, it is convenient to write the measurements in matrix form as
$$\boldsymbol{Y}=\mathcal{P}_\Omega(\boldsymbol{X})+\boldsymbol{W}, \qquad (7)$$
where $\Omega$ is the collection of indices of the observed entries, $\mathcal{P}_\Omega$ is the entrywise partial observation operator defined by
$$[\mathcal{P}_\Omega(\boldsymbol{X})]_{ij}=\begin{cases}X_{ij}, & (i,j)\in\Omega,\\ 0, & \text{otherwise},\end{cases}$$
and $\boldsymbol{W}$ is the noise matrix supported on $\Omega$. With this notation, matrix completion is the problem of (approximately) recovering $\boldsymbol{X}$ given $\boldsymbol{Y}$ and $\Omega$.
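The partial observation operator is straightforward to implement. A minimal numpy sketch follows; the mask density and matrix sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n1, n2 = 8, 6
M = rng.standard_normal((n1, n2))
mask = rng.random((n1, n2)) < 0.3      # Omega: observe ~30% of entries

def P_Omega(X, mask):
    """Entrywise partial observation: keep entries indexed by Omega, zero the rest."""
    return np.where(mask, X, 0.0)

Y = P_Omega(M, mask)
```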
4 Theory and Algorithms for Low-Rank Matrix Estimation via Convex Optimization
The development of efficient algorithms for low-rank estimation owes much of its inspiration to the success of compressed sensing [1, 2]. There, the convex relaxation approach based on $\ell_1$ minimization is widely used for recovering sparse signals. For low-rank problems, the role of the $\ell_1$ norm is played by its matrix counterpart, namely the nuclear norm (also known as the trace norm), which is a convex surrogate for the rank. This idea gives rise to convex optimization approaches for low-rank estimation based on nuclear norm minimization, an approach put forth by Fazel et al. in the seminal work [25]. This approach has since been extensively developed and expanded, and it remains the most mature and well-understood method (though not the only one) for estimating low-rank matrices. In this section we survey this algorithmic approach and the associated theoretical results.
4.1 Convex Relaxation via Nuclear Norm Minimization
We begin by deriving the nuclear norm minimization algorithm as a convex relaxation of rank minimization. Recall our linear measurement model (6), to which we seek a low-rank solution. A natural approach is to find the matrix with the minimum rank that is consistent with the measurements, which can be formulated as an optimization problem:
$$\min_{\boldsymbol{X}}\ \operatorname{rank}(\boldsymbol{X})\quad\text{subject to}\quad \boldsymbol{y}=\mathcal{A}(\boldsymbol{X}). \qquad (8)$$
The rank, however, is a nonconvex function of $\boldsymbol{X}$, and rank minimization (8) is known to be NP-hard in general. To develop a tractable formulation, one observes that the rank of $\boldsymbol{X}$ is equal to the number of its nonzero singular values. Therefore, analogously to using the $\ell_1$ norm as a convex surrogate of sparsity, we may replace the rank of $\boldsymbol{X}$ by the sum of its singular values, a quantity known as the nuclear norm:
$$\|\boldsymbol{X}\|_*:=\sum_i\sigma_i(\boldsymbol{X}).$$
Then, instead of solving (8) directly, one solves for a matrix that minimizes the nuclear norm:
$$\min_{\boldsymbol{X}}\ \|\boldsymbol{X}\|_*\quad\text{subject to}\quad \boldsymbol{y}=\mathcal{A}(\boldsymbol{X}). \qquad (9)$$
In the case where the measurements are noisy, one seeks a matrix with a small nuclear norm that is approximately consistent with the measurements, which can be formulated either as a regularized optimization problem,
$$\min_{\boldsymbol{X}}\ \frac{1}{2}\|\boldsymbol{y}-\mathcal{A}(\boldsymbol{X})\|_2^2+\lambda\|\boldsymbol{X}\|_*, \qquad (10)$$
or as a constrained optimization problem,
$$\min_{\boldsymbol{X}}\ \|\boldsymbol{y}-\mathcal{A}(\boldsymbol{X})\|_2^2\quad\text{subject to}\quad \|\boldsymbol{X}\|_*\le\tau, \qquad (11)$$
where $\lambda$ and $\tau$ are tuning parameters. Note that the nuclear norm can be represented using the solution to a semidefinite program [25],
$$\|\boldsymbol{X}\|_*=\min_{\boldsymbol{W}_1,\boldsymbol{W}_2}\ \frac{1}{2}\big(\operatorname{tr}(\boldsymbol{W}_1)+\operatorname{tr}(\boldsymbol{W}_2)\big)\quad\text{subject to}\quad \begin{bmatrix}\boldsymbol{W}_1 & \boldsymbol{X}\\ \boldsymbol{X}^\top & \boldsymbol{W}_2\end{bmatrix}\succeq \boldsymbol{0}. \qquad (12)$$
Consequently, the optimization problems (9)–(11) are convex, and can be solved as semidefinite programs.
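To make the surrogate concrete, the following numpy sketch checks two elementary facts used above: the rank counts the nonzero singular values while the nuclear norm sums them, and for a PSD matrix the nuclear norm reduces to the trace (consistent with the SDP characterization). The matrix sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
# A rank-2 matrix built as the sum of two random outer products
M = np.outer(rng.standard_normal(10), rng.standard_normal(8)) \
  + np.outer(rng.standard_normal(10), rng.standard_normal(8))

s = np.linalg.svd(M, compute_uv=False)
nuclear_norm = s.sum()                 # ||M||_* = sum of singular values
rank = int((s > 1e-10).sum())          # rank = number of nonzero singular values

B = M @ M.T                            # a PSD matrix: singular values = eigenvalues
```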
4.2 Guarantees for Matrix Sensing via RIP
For there to be any hope of recovering $\boldsymbol{X}$ from the output of the sensing process (6), the sensing operator needs to possess certain desirable properties so that it can distinguish between different low-rank matrices. One such property is called the restricted isometry property (RIP). RIP stipulates that $\mathcal{A}$, viewed as a mapping to a lower-dimensional space, approximately preserves the Euclidean distances between low-rank matrices. Below we give a general notion of RIP, where the distances after mapping may be measured in different norms:
Definition 1 (Restricted Isometry Property)
The operator $\mathcal{A}$ is said to satisfy the RIP of rank $r$ if, for all matrices $\boldsymbol{X}$ of rank at most $r$, there holds the inequality
$$\alpha\,\|\boldsymbol{X}\|_{\mathrm{F}}\le\|\mathcal{A}(\boldsymbol{X})\|\le\beta\,\|\boldsymbol{X}\|_{\mathrm{F}},$$
where $\alpha$ and $\beta$ are some universal constants satisfying $0<\alpha\le\beta<\infty$.
This definition is reminiscent of a similar notion with the same name used in the sparse signal recovery literature, where it is imposed on sparse vectors [1]. Certifying whether RIP holds for a given operator is known to be NP-hard [26]. Nevertheless, it turns out that a "generic" sensing operator, drawn from certain random distributions, satisfies RIP with high probability. For example, this is the case when the measurement matrices have i.i.d. Gaussian entries and the number of measurements is on the order of $nr$, where $n=\max\{n_1,n_2\}$.
When RIP holds, the nuclear norm minimization approach guarantees exact and stable recovery of the low-rank matrix in both noise-free and noisy cases, as shown in the following theorem adapted from [27, 28, 29].
Theorem 1
Suppose that the noise satisfies $\|\boldsymbol{w}\|_2\le\epsilon$. If $\mathcal{A}$ satisfies RIP with a sufficiently small RIP constant,² then the solution $\hat{\boldsymbol{X}}$ to the nuclear norm minimization algorithms (9)–(11) (with appropriate values of the tuning parameters) satisfies the error bound
²The precise requirements on the RIP constant differ between [27] and [28, 29]; both requirements can be met with a sample complexity on the order of $nr$.
$$\|\hat{\boldsymbol{X}}-\boldsymbol{X}\|_{\mathrm{F}}\le C_1\frac{\|\boldsymbol{X}-\boldsymbol{X}_r\|_*}{\sqrt{r}}+C_2\,\epsilon \qquad (13)$$
simultaneously for all $\boldsymbol{X}$, where $C_1$ and $C_2$ are positive numerical constants.
For studying the performance of nuclear norm minimization via other notions such as the null space property, we refer interested readers to [30].
4.3 Guarantees for Matrix Completion via Incoherence
In the case of matrix completion, an additional complication arises: it is impossible to recover a low-rank matrix that is also sparse. In particular, when one only samples a small subset of the entries of $\boldsymbol{X}$, it is very likely that most, if not all, of the nonzero entries of a sparse $\boldsymbol{X}$ are missed. This means that the sensing operator used in matrix completion cannot satisfy the RIP. Therefore, for the problem to be well posed, we need to restrict attention to low-rank matrices whose mass does not concentrate on a few entries. This property can be formalized by the notion of incoherence, which measures the alignment between the column/row spaces of the low-rank matrix and the standard basis vectors:
Definition 2 (Incoherence)
For a matrix $\boldsymbol{U}\in\mathbb{R}^{n\times r}$ with orthonormal columns, let $\boldsymbol{P}_U=\boldsymbol{U}\boldsymbol{U}^\top$ be the orthogonal projection onto the column space of $\boldsymbol{U}$. The incoherence parameter of $\boldsymbol{U}$ is defined as
$$\mu(\boldsymbol{U})=\frac{n}{r}\max_{1\le i\le n}\|\boldsymbol{P}_U\boldsymbol{e}_i\|_2^2. \qquad (14)$$
For a matrix $\boldsymbol{X}$ with SVD $\boldsymbol{X}=\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^\top$, the incoherence parameter of $\boldsymbol{X}$ is defined as $\mu=\max\{\mu(\boldsymbol{U}),\mu(\boldsymbol{V})\}$.
It is easy to see that the incoherence parameter satisfies the bound $1\le\mu(\boldsymbol{U})\le n/r$; with a smaller $\mu(\boldsymbol{U})$, the column space of $\boldsymbol{U}$ is more spread out over its coordinates. For a matrix $\boldsymbol{X}$, its incoherence parameter is determined by its singular vectors and is independent of its singular values. In the noiseless setting, nuclear norm minimization can perfectly recover an incoherent low-rank matrix as soon as the number of measurements is slightly larger than the degrees of freedom of the matrix. Such recovery guarantees were proved and refined in a series of works [8, 10, 11, 31, 32, 33]. The theorem below is adapted from [31], which is state-of-the-art.

Theorem 2
Suppose that each entry of $\boldsymbol{X}\in\mathbb{R}^{n\times n}$ is observed independently with probability $p\in(0,1]$. If $p$ is on the order of $\mu r\log^2 n/n$ or larger, then with high probability, the nuclear norm minimization algorithm (9) exactly recovers $\boldsymbol{X}$.
By a coupon-collecting argument [8], one can in fact show that it is impossible to recover the matrix with fewer than on the order of $\mu n r\log n$ measurements using any algorithm. Therefore, Theorem 2 shows that nuclear norm minimization is near-optimal in terms of sample complexity (off by only a logarithmic factor), a remarkable fact considering that we are using a convex relaxation of the rank.
In the noisy case, one can study the performance of nuclear norm minimization in terms of its recovery error, as done for example in [9, 34]. Here we state one such performance guarantee taken from [9]. Assume that the entries of the noise matrix are independent with variance $\sigma^2$. Then, by solving the regularized nuclear norm minimization problem (10) with a parameter $\lambda$ proportional to the noise level, we obtain a solution whose estimation error scales with the noise level, with high probability, in the moderate-to-low SNR regime.
4.4 First-Order Algorithms for Nuclear Norm Minimization
In principle, it is possible to solve the nuclear norm minimization problems (9)–(11) to high numerical accuracy using off-the-shelf semidefinite programming solvers (such as SDPT3 [35]). However, these solvers, typically based on interior-point methods, can be extremely slow when the size of the matrix is large. For example, SDPT3 can only handle matrices with dimensions no larger than a few thousand due to memory requirements. This computational issue motivates the development of fast alternatives that can handle significantly larger problems. First-order algorithms are an appealing candidate due to their low per-iteration cost, as well as the flexibility to incorporate the specific structures of the semidefinite programs that arise in low-rank matrix estimation. There is a long and still growing list of such algorithms, including singular value thresholding [36], accelerated proximal gradient descent [37], which is a variant of FISTA for matrix completion [38], augmented Lagrangian multiplier methods [39], Frank-Wolfe [40, 41], CoGENT [42], and ADCG [43], just to name a few. Below, we discuss two representative algorithms: FISTA for solving the regularized problem (10), and Frank-Wolfe for solving the constrained problem (11). These two algorithms set the stage for understanding many other algorithms.
An important subroutine in many of the aforementioned algorithms is the singular value thresholding (SVT) operator [36]. Mathematically, the SVT operator $\mathcal{D}_\tau(\cdot)$ is defined as the proximal mapping with respect to the nuclear norm:
$$\mathcal{D}_\tau(\boldsymbol{Y})=\operatorname*{argmin}_{\boldsymbol{X}}\ \frac{1}{2}\|\boldsymbol{X}-\boldsymbol{Y}\|_{\mathrm{F}}^2+\tau\|\boldsymbol{X}\|_*. \qquad (15)$$
The SVT operator admits a closed-form expression; in particular, if the SVD of $\boldsymbol{Y}$ is $\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^\top$ with $\boldsymbol{\Sigma}=\operatorname{diag}(\{\sigma_i\})$, then $\mathcal{D}_\tau(\boldsymbol{Y})=\boldsymbol{U}\boldsymbol{\Sigma}_\tau\boldsymbol{V}^\top$, where $\boldsymbol{\Sigma}_\tau=\operatorname{diag}(\{(\sigma_i-\tau)_+\})$ with
$$(\sigma_i-\tau)_+=\max\{\sigma_i-\tau,\,0\}. \qquad (16)$$
The fact that the SVT operator can be efficiently computed via the SVD is leveraged in many of the first-order algorithms.
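Following the closed form in (15)–(16), here is a minimal numpy implementation of the SVT operator (the function name `svt` is ours):

```python
import numpy as np

def svt(Z, tau):
    """Singular value thresholding: soft-threshold the singular values of Z at tau."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ (np.maximum(s - tau, 0.0)[:, None] * Vt)

rng = np.random.default_rng(6)
Z = rng.standard_normal((12, 9))
X = svt(Z, tau=2.0)
```

Since soft-thresholding preserves the ordering of the singular values, the output's singular values are exactly those of the input shifted down by tau and clipped at zero.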
The FISTA algorithm for the regularized problem (10) is given in Algorithm 1, where the step size is set according to (an upper bound of) the Lipschitz constant of the gradient of the smooth part of the objective. FISTA makes use of Nesterov's momentum acceleration to speed up convergence: denoting the objective function of (10) by $F(\boldsymbol{X})$, to achieve $\epsilon$-accuracy, i.e., $F(\boldsymbol{X}_t)-F(\hat{\boldsymbol{X}})\le\epsilon$, we need $O(1/\sqrt{\epsilon})$ iterations. In each iteration, only a partial SVD is needed to evaluate the SVT operator. Doing so for large-scale problems may still be too slow and require large memory; in this case, one can make use of modern randomized techniques from numerical linear algebra to further speed up the computation of the SVD [44, 45].
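For concreteness, the following sketch applies the same proximal-gradient skeleton to a small matrix completion instance. It is the un-accelerated (ISTA-style) variant, not Algorithm 1 itself, which additionally uses Nesterov momentum; the rank-one test instance, sampling rate, and parameter values are all hypothetical.

```python
import numpy as np

def svt(Z, tau):
    # Proximal mapping of tau * nuclear norm (singular value soft-thresholding)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ (np.maximum(s - tau, 0.0)[:, None] * Vt)

def prox_grad_complete(Y, mask, tau, n_iter):
    """Un-accelerated proximal gradient for
    min_X 0.5 * ||P_Omega(X) - Y||_F^2 + tau * ||X||_*.
    The smooth part has Lipschitz constant 1, so a unit step size is valid."""
    X = np.zeros_like(Y)
    for _ in range(n_iter):
        grad = np.where(mask, X - Y, 0.0)   # gradient of the data-fit term
        X = svt(X - grad, tau)              # proximal (SVT) step
    return X

rng = np.random.default_rng(7)
M = np.outer(rng.standard_normal(15), rng.standard_normal(15))   # rank-1 truth
mask = rng.random(M.shape) < 0.6                                 # ~60% observed
X_hat = prox_grad_complete(np.where(mask, M, 0.0), mask, tau=0.1, n_iter=300)
rel_err = np.linalg.norm(X_hat - M) / np.linalg.norm(M)
```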
The standard Frank-Wolfe method [40] (also known as conditional gradient descent) for solving the constrained problem (11) is presented in Algorithm 2. Each iteration of the algorithm only requires computing a rank-one SVD, which can be done using power iterations or Lanczos methods. Therefore, Frank-Wolfe typically has a much lower per-iteration cost than methods based on the SVT operation. However, standard Frank-Wolfe may converge very slowly in practice: to achieve $\epsilon$-accuracy, it requires $O(1/\epsilon)$ iterations. Variants of Frank-Wolfe with faster convergence or lower memory footprint have been actively developed recently by exploiting the problem structure. The list, including CoGENT [42], In-Face Extended Frank-Wolfe [41], ADCG [43], Block Frank-Wolfe [46], and SketchyCGM [47], is still growing.
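A minimal sketch of the standard Frank-Wolfe iteration for the constrained completion problem follows. For simplicity it computes a full SVD of the gradient, although only the top singular pair is needed; the test instance, radius, and iteration count are hypothetical.

```python
import numpy as np

def frank_wolfe_complete(Y, mask, tau, n_iter):
    """Frank-Wolfe for min 0.5 * ||P_Omega(X) - Y||_F^2 s.t. ||X||_* <= tau."""
    X = np.zeros_like(Y)
    for t in range(n_iter):
        grad = np.where(mask, X - Y, 0.0)
        # Linear minimization oracle over the nuclear-norm ball:
        # -tau * (top left singular vector)(top right singular vector)^T.
        # (Full SVD used here for brevity; only the top pair is required.)
        U, s, Vt = np.linalg.svd(grad, full_matrices=False)
        S = -tau * np.outer(U[:, 0], Vt[0, :])
        gamma = 2.0 / (t + 2)               # standard step-size schedule
        X = (1 - gamma) * X + gamma * S     # convex combination keeps ||X||_* <= tau
    return X

rng = np.random.default_rng(8)
M = np.outer(rng.standard_normal(12), rng.standard_normal(12))
mask = rng.random(M.shape) < 0.6
Y = np.where(mask, M, 0.0)
tau = np.linalg.svd(M, compute_uv=False).sum()   # nuclear norm of the truth
X_fw = frank_wolfe_complete(Y, mask, tau, n_iter=150)
```

Because every iterate is a convex combination of points in the nuclear-norm ball of radius tau, the constraint is maintained automatically, without any projection.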
5 Provable and Fast Low-Rank Matrix Estimation via Nonconvex Factorization
As we have seen, the computational concern of solving rank minimization problems is assuaged to some extent by the use of convex relaxation: the resulting semidefinite programs can be solved in time polynomial in the matrix dimension. However, for large-scale problems where the dimension is on the order of millions, solving these semidefinite programs, even using first-order methods, can still be computationally infeasible due to the fundamental bottleneck of storing and optimizing over a full matrix variable. This issue severely limits the applicability of convex relaxation methods.
To overcome this difficulty, a recent line of work studies more computationally efficient methods based on nonconvex optimization. These methods work directly with the original nonconvex, rank-constrained optimization problem, which can generally be written as
$$\min_{\boldsymbol{X}}\ f(\boldsymbol{X})\quad\text{subject to}\quad \operatorname{rank}(\boldsymbol{X})\le r, \qquad (17)$$
where $f(\cdot)$ is a given loss function, typically convex in $\boldsymbol{X}$. The key idea is a reparametrization trick: by writing a rank-$r$ matrix in its factored form $\boldsymbol{X}=\boldsymbol{L}\boldsymbol{R}^\top$, where $\boldsymbol{L}\in\mathbb{R}^{n_1\times r}$ and $\boldsymbol{R}\in\mathbb{R}^{n_2\times r}$, we enforce the low-rank constraint directly, leading to the following equivalent formulation of (17):
$$\min_{\boldsymbol{L},\boldsymbol{R}}\ f(\boldsymbol{L}\boldsymbol{R}^\top). \qquad (18)$$
We refer to this formulation as the Burer-Monteiro factorization, after the seminal work [48]. The optimization problem (18) can then be solved over the factor variables $\boldsymbol{L}$ and $\boldsymbol{R}$. The low-rank factorization is in general not unique; in fact, any pair $\boldsymbol{L}\boldsymbol{Q}$ and $\boldsymbol{R}\boldsymbol{Q}$, with $\boldsymbol{Q}\in\mathbb{R}^{r\times r}$ an orthonormal matrix, corresponds to the same matrix $\boldsymbol{X}$, since $\boldsymbol{L}\boldsymbol{Q}(\boldsymbol{R}\boldsymbol{Q})^\top=\boldsymbol{L}\boldsymbol{R}^\top$. These pairs are all global optima of the problem under certain conditions to be discussed below. This reformulation brings a significant computational gain: since the rank $r$ is often much smaller than $\min\{n_1,n_2\}$, the size of the factor variables is roughly linear in the dimension rather than quadratic, leading to the possibility of designing linear-time algorithms that are amenable to problems of very large scale.
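To make the computational gain concrete, here is a minimal sketch of plain gradient descent on the Burer-Monteiro factors for matrix completion; the balancing regularizer discussed later in this section is omitted, and the rank-one instance, initialization scale, and step size are all hypothetical choices.

```python
import numpy as np

def factored_gd(Y, mask, r, eta, n_iter, seed=0):
    """Gradient descent on the factors (L, R) for
    f(L, R) = 0.5 * ||P_Omega(L R^T) - Y||_F^2
    (a sketch; the balancing regularizer is omitted)."""
    rng = np.random.default_rng(seed)
    n1, n2 = Y.shape
    L = 0.1 * rng.standard_normal((n1, r))    # small random initialization
    R = 0.1 * rng.standard_normal((n2, r))
    for _ in range(n_iter):
        res = np.where(mask, L @ R.T, 0.0) - Y          # residual on Omega
        L, R = L - eta * res @ R, R - eta * res.T @ L   # gradient steps
    return L, R

rng = np.random.default_rng(11)
M = np.outer(rng.standard_normal(15), rng.standard_normal(15))
mask = rng.random(M.shape) < 0.6
Y = np.where(mask, M, 0.0)
L, R = factored_gd(Y, mask, r=1, eta=0.02, n_iter=1500)
fit = np.linalg.norm(np.where(mask, L @ R.T, 0.0) - Y, 'fro')
```

Note that the iterates only ever store the two thin factors, never the full matrix variable, which is exactly the memory saving the reformulation is after.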
Surprisingly, even though the Burer-Monteiro formulation (18) is nonconvex, global optima can sometimes be found (or approximated) efficiently using various iterative procedures; moreover, rigorous guarantees can be derived for the statistical accuracy of the resulting solution. Indeed, several iterative schemes have a per-iteration computational cost proportional to the rank $r$ and the size of the input, which is typically far lower than the cost of operating on the full matrix variable. These results are developed in a still-growing line of recent work [49]–[69], and we devote the rest of this section to presenting the most representative results therein.
This line of work considers several major classes of iterative schemes for solving the Burer-Monteiro formulation (18), including the following:

- (Projected) gradient descent [48, 49, 67]: one runs (projected) gradient descent directly on the loss function $f(\boldsymbol{L}\boldsymbol{R}^\top)$ with respect to the factor variables $(\boldsymbol{L},\boldsymbol{R})$:
$$\boldsymbol{L}_{t+1}=\mathcal{P}_{\mathcal{L}}\big(\boldsymbol{L}_t-\eta\,\nabla_{\boldsymbol{L}}f(\boldsymbol{L}_t\boldsymbol{R}_t^\top)\big), \qquad (19a)$$
$$\boldsymbol{R}_{t+1}=\mathcal{P}_{\mathcal{R}}\big(\boldsymbol{R}_t-\eta\,\nabla_{\boldsymbol{R}}f(\boldsymbol{L}_t\boldsymbol{R}_t^\top)\big), \qquad (19b)$$
where $\eta$ is the step size and $\mathcal{P}_{\mathcal{L}}$, $\mathcal{P}_{\mathcal{R}}$ denote the Euclidean projections onto the sets $\mathcal{L}$ and $\mathcal{R}$, which are constraint sets that encode additional structures of the desired low-rank factors.

- Singular value projection (SVP) [62, 63, 64]: one performs a gradient descent step of $f$ in the "full" matrix space, then projects back to the factor space via the SVD:
$$(\boldsymbol{L}_{t+1},\boldsymbol{R}_{t+1})=\mathcal{P}_r\big(\boldsymbol{X}_t-\eta\,\nabla f(\boldsymbol{X}_t)\big),\qquad \boldsymbol{X}_t=\boldsymbol{L}_t\boldsymbol{R}_t^\top, \qquad (21)$$
where $\eta$ is the step size, and $\mathcal{P}_r(\boldsymbol{X})$ returns the top rank-$r$ factors of $\boldsymbol{X}$, that is, the pair $(\boldsymbol{U}\boldsymbol{\Sigma}^{1/2},\boldsymbol{V}\boldsymbol{\Sigma}^{1/2})$, assuming that $\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^\top$ is the SVD of the best rank-$r$ approximation of $\boldsymbol{X}$.
Because neither the function $f(\boldsymbol{L}\boldsymbol{R}^\top)$ nor the set of low-rank matrices is convex, standard global convergence theory for convex optimization does not apply here. The recent breakthrough is based on the realization that convexity is in fact not necessary for the convergence of these iterative schemes; instead, as long as the gradients of the loss function always point (approximately) towards the desired solution, the iterates will make progress along the right direction. Note that this property concerns the geometry of the loss function itself, and is largely independent of the specific choice of the algorithm [50]. Among the options above, the projected gradient descent approach stands out due to its simple form, cheap per-iteration cost (no SVD or inner optimization is needed), and efficiency in handling constrained problems. We thus use projected gradient descent as the focal point of our survey.
Playing a key role here is the use of statistical modeling and probabilistic analysis: we will show that the loss function has the desired geometric properties with high probability under probabilistic generative models of the data, thereby circumventing the worst-case hardness of the low-rank matrix estimation problem and instead focusing on its average-case behavior. Existing results in this direction can be divided into two categories. In the first line of work, reviewed in Section 5.1, one shows that iterative algorithms converge to the desired solution rapidly when initialized within a large neighborhood around the ground truth; moreover, a good initial solution can be obtained efficiently by simple procedures (which typically involve computing a partial SVD). The second line of work, reviewed in Section 5.2, concerns the global landscape of the loss function, and aims to show that, in spite of the nonconvexity, all of its local minima are in fact close to the desired solution in an appropriate sense, whereas all other stationary points (e.g., saddle points) possess a descent direction; these properties guarantee the convergence of iterative algorithms from any initial solution. Both types of results have their own merits and hence are complementary to each other. Below we review the most representative results in each of these two categories for noiseless matrix sensing and matrix completion. The reader is referred to [49, 67, 70] for extensions to the noisy case.
5.1 Convergence Guarantees with Proper Initialization
For simplicity, we assume that the ground truth $M$ is exactly rank $r$ and has a bounded condition number $\kappa := \sigma_1(M)/\sigma_r(M) = O(1)$, thus effectively hiding the dependence on $\kappa$. Moreover, to measure the convergence of the algorithms for the purpose of reconstructing $M$, we shall consider directly the reconstruction error $\|L^t (R^t)^\top - M\|_F$ with respect to $M$.
For matrix sensing, we take the loss function as
$$f(L, R) = \frac{1}{2}\big\|\mathcal{A}(LR^\top) - y\big\|_2^2 + \frac{1}{16}\big\|L^\top L - R^\top R\big\|_F^2, \qquad (22)$$
where the second regularization term encourages $L$ and $R$ to have the same scale and ensures algorithmic stability. We can perform gradient descent as specified in (19) with $\mathcal{L} = \mathbb{R}^{n_1 \times r}$ and $\mathcal{R} = \mathbb{R}^{n_2 \times r}$ (that is, without additional projections).
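As a concrete illustration, the following sketch runs gradient descent on a factored loss of this form (a quadratic data-fit term plus a balancing regularizer). Everything here is an assumption for illustration: the dimensions, the Gaussian sensing matrices, the constants in the loss, and the step size; the iterates are also started near the truth as a stand-in for the spectral initialization described later.

```python
import numpy as np

rng = np.random.default_rng(0)
n1 = n2 = 20; r = 2; m = 400           # illustrative sizes

# Ground-truth rank-r matrix M = Lstar @ Rstar.T
Lstar = rng.standard_normal((n1, r))
Rstar = rng.standard_normal((n2, r))
M = Lstar @ Rstar.T

# Gaussian sensing matrices A_i (scaled so the map A is roughly an isometry)
A = rng.standard_normal((m, n1, n2)) / np.sqrt(m)
y = np.einsum('kij,ij->k', A, M)       # y_k = <A_k, M>

def grads(L, R):
    """Gradients of 0.5*||A(L R^T) - y||^2 + (1/16)*||L^T L - R^T R||_F^2."""
    res = np.einsum('kij,ij->k', A, L @ R.T) - y
    G = np.einsum('k,kij->ij', res, A)           # adjoint A^*(residual)
    bal = L.T @ L - R.T @ R
    return G @ R + 0.25 * L @ bal, G.T @ L - 0.25 * R @ bal

# Start near the truth (a stand-in for spectral initialization)
L = Lstar + 0.1 * rng.standard_normal((n1, r))
R = Rstar + 0.1 * rng.standard_normal((n2, r))
eta = 0.005                            # illustrative step size
for _ in range(2000):
    gL, gR = grads(L, R)
    L, R = L - eta * gL, R - eta * gR

rel_err = np.linalg.norm(L @ R.T - M) / np.linalg.norm(M)
```

With these sizes the iterates contract geometrically toward the truth, and the relative reconstruction error drops to numerical noise.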
Similarly, for matrix completion, we use the loss function
$$f(L, R) = \frac{1}{2p}\big\|\Pi_\Omega(LR^\top - Y)\big\|_F^2 + \frac{1}{16}\big\|L^\top L - R^\top R\big\|_F^2. \qquad (23)$$
Since we can only hope to recover matrices that satisfy the incoherence property, we perform projected gradient descent by projecting to the constraint set
$$\mathcal{L} := \Big\{ L \in \mathbb{R}^{n_1 \times r} : \|e_i^\top L\|_2 \le \sqrt{\tfrac{2\mu r}{n_1}}\,\|L^0\|, \; 1 \le i \le n_1 \Big\},$$
with $\mathcal{R}$ defined similarly. Note that $\mathcal{L}$ is convex, and depends on the initial solution $L^0$. The projection $\mathcal{P}_{\mathcal{L}}$ is given by the row-wise "clipping" operation
$$\big(\mathcal{P}_{\mathcal{L}}(L)\big)_{i\cdot} = \min\left\{1, \frac{\sqrt{2\mu r/n_1}\,\|L^0\|}{\|L_{i\cdot}\|_2}\right\} L_{i\cdot}$$
for $1 \le i \le n_1$, where $L_{i\cdot}$ is the $i$-th row of $L$; the projection $\mathcal{P}_{\mathcal{R}}$ is given by a similar formula. This projection ensures that the iterates of projected gradient descent (19) remain incoherent.
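The row-wise clipping operation can be sketched in a few lines; the helper name `clip_rows` and the test dimensions are our own, and the row-norm bound follows the constraint set described above.

```python
import numpy as np

def clip_rows(L, mu, spectral_norm_L0):
    """Row-wise clipping: rescale any row of L whose norm exceeds
    sqrt(2*mu*r/n1) * ||L0||, leaving the remaining rows untouched."""
    n1, r = L.shape
    bound = np.sqrt(2 * mu * r / n1) * spectral_norm_L0
    row_norms = np.linalg.norm(L, axis=1, keepdims=True)
    scale = np.minimum(1.0, bound / np.maximum(row_norms, 1e-12))
    return L * scale

rng = np.random.default_rng(1)
L = rng.standard_normal((100, 3))
L[0] *= 50                                   # one heavy row violating incoherence
bound = np.sqrt(2 * 2.0 * 3 / 100) * np.linalg.norm(L, 2)
P = clip_rows(L, mu=2.0, spectral_norm_L0=np.linalg.norm(L, 2))
```

Only the offending row is rescaled; rows already within the bound pass through unchanged, which is what makes the operation a projection onto the constraint set.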
The following theorems guarantee that if the initial solution is reasonably close to the desired solution, then the iterates converge linearly to the ground truth (in terms of the reconstruction error), under conditions similar to those for nuclear norm minimization. Moreover, we can find a provably good initial solution using the so-called spectral method, which involves performing a partial SVD. In particular, we compute the best rank-$r$ approximation of $\mathcal{A}^*(y) = \sum_{i=1}^m y_i A_i$ for matrix sensing, and of $\frac{1}{p}\Pi_\Omega(Y)$ for matrix completion, and factorize it to obtain $(L^0, R^0)$.
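The spectral method for matrix completion can be sketched directly; the sizes, the sampling rate, and the use of a full (rather than partial) SVD below are simplifications for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, p = 60, 2, 0.8
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank-r truth
mask = rng.random((n, n)) < p                                  # Bernoulli(p) sampling
Y = np.where(mask, M, 0.0)                                     # observed entries

# Spectral initialization: rank-r SVD of the rescaled observations (1/p) * P_Omega(Y)
U, s, Vt = np.linalg.svd(Y / p)
L0 = U[:, :r] * np.sqrt(s[:r])
R0 = Vt[:r].T * np.sqrt(s[:r])

rel_err = np.linalg.norm(L0 @ R0.T - M) / np.linalg.norm(M)
```

The initialization is not accurate by itself, but it lands within a constant-factor neighborhood of the truth, which is all the local convergence theory requires.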
Theorem 3 (Matrix Sensing)
[49, 62, 51] Suppose that the sensing operator $\mathcal{A}$ satisfies RIP with parameter $\delta_{4r} \le c_0$ for some sufficiently small constant $c_0$. Then, the initial solution $(L^0, R^0)$ satisfies
$$\|L^0 (R^0)^\top - M\|_F \le c_1\, \sigma_r(M), \qquad (24)$$
where $c_1$ is some sufficiently small constant. Furthermore, starting from any $(L^0, R^0)$ that satisfies (24), the gradient descent iterates $(L^t, R^t)$ with an appropriate step size satisfy the bound
$$\|L^t (R^t)^\top - M\|_F \le \rho^t\, \|L^0 (R^0)^\top - M\|_F,$$
where $0 < \rho < 1$ is a universal constant.
Theorem 4 (Matrix Completion)
[71] There exists a positive constant $C$ such that if
$$n_1 n_2\, p \ge C\, \mu^2 r^2 (n_1 + n_2) \log(n_1 + n_2), \qquad (25)$$
then with probability at least $1 - c\,(n_1 + n_2)^{-1}$ for some positive constant $c$, the initial solution satisfies
$$\|L^0 (R^0)^\top - M\|_F \le c_1\, \sigma_r(M). \qquad (26)$$
Furthermore, starting from any $(L^0, R^0)$ that satisfies (26), the projected gradient descent iterates $(L^t, R^t)$ with an appropriate step size satisfy the bound
$$\|L^t (R^t)^\top - M\|_F \le \rho^t\, \|L^0 (R^0)^\top - M\|_F,$$
where $0 < \rho < 1$ is a universal constant.
The above theorems guarantee that the gradient descent iterates enjoy geometric convergence to a global optimum when the initial solution is sufficiently close to the ground truth. Compared with the guarantees for nuclear norm minimization, (projected) gradient descent succeeds under a similar sample complexity condition (up to a polynomial term in $r$ and $\mu$), but the computational cost is significantly lower. To obtain $\epsilon$-accuracy, meaning that the final estimate $(L^T, R^T)$ satisfies
$$\|L^T (R^T)^\top - M\|_F \le \epsilon \cdot \sigma_r(M), \qquad (27)$$
we only need to run a total of $T = O(\log(1/\epsilon))$ iterations for both matrix sensing and matrix completion.
We now discuss the overall computational complexity, and for simplicity we shall assume $n_1 = n_2 = n$. For matrix sensing, let $T_0$ be the maximum time of multiplying one of the matrices $A_i$ with a vector of compatible dimension. Each gradient step (19) can be performed in time $O(m r T_0 + n r^2)$. The complexity of computing the initial solution is a bit subtler. To this end, we first note that by standard matrix perturbation bounds, one can show that the matrix $\mathcal{A}^*(y)$ and its singular values are sufficiently close to those of $M$ under the condition of Theorem 3; in particular, one has $\|\mathcal{A}^*(y) - M\| \le c\, \sigma_r(M)$, and the $r$-th and $(r+1)$-th singular values of $\mathcal{A}^*(y)$ are at most $c\, \sigma_r(M)$ away from the corresponding singular values of $M$, where $c$ is a small constant. With such properties of $\mathcal{A}^*(y)$, one does not need to compute the exact singular values/vectors of $\mathcal{A}^*(y)$ in order to meet the initialization condition (24); rather, it suffices to find a rank-$r$ approximation $M^0$ of $\mathcal{A}^*(y)$ with the property $\|M^0 - \mathcal{A}^*(y)\| \le 2\, \sigma_{r+1}(\mathcal{A}^*(y))$. This can be done using for example the Randomized SVD procedure in [45], which takes $O\big((m r T_0 + n r^2) \log n\big)$ time to compute; see [45, Theorem 1.2 and Eq. (1.11)] with $q = O(\log n)$. Put together, the overall time complexity is $O\big((m r T_0 + n r^2)(\log n + \log(1/\epsilon))\big)$ for achieving $\epsilon$-accuracy.
For matrix completion, to obtain the initial solution, we can again follow similar arguments as above to show that it suffices to compute a rank-$r$ approximation of the matrix $\frac{1}{p}\Pi_\Omega(Y)$, which is close to $M$ and has a sufficiently large spectral gap. Since $\Pi_\Omega(Y)$ is a sparse matrix with support $\Omega$, computing such an approximation can be done in time $O\big((|\Omega| r + n r^2) \log n\big)$ using the Randomized SVD procedure in [45]. Each step of projected gradient descent requires computing the gradient and the projection onto $\mathcal{L}$ and $\mathcal{R}$. Both of them only involve operations on sparse matrices supported on $\Omega$ and thin matrices of size $n \times r$, and can be done in time $O(|\Omega| r + n r^2)$. Therefore, projected gradient descent achieves $\epsilon$-accuracy with running time $O\big((|\Omega| r + n r^2)(\log n + \log(1/\epsilon))\big)$.
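Exploiting sparsity in the initialization can be sketched with SciPy's `svds`, our choice of routine standing in for the Randomized SVD discussed above; it computes a rank-r factorization without ever forming a dense matrix, and the sizes and sampling rate below are illustrative.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

rng = np.random.default_rng(3)
n, r, p = 300, 3, 0.3
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank-r truth

# Store only the observed entries as a sparse matrix supported on Omega
rows, cols = np.nonzero(rng.random((n, n)) < p)
Y = sp.csr_matrix((M[rows, cols], (rows, cols)), shape=(n, n))

# Rank-r factors of (1/p) * P_Omega(Y) via a sparse partial SVD
U, s, Vt = svds(Y / p, k=r)
L0 = U * np.sqrt(s)
R0 = Vt.T * np.sqrt(s)

rel_err = np.linalg.norm(L0 @ R0.T - M) / np.linalg.norm(M)
```

The per-call cost scales with the number of observed entries rather than with $n^2$, which is the point of the complexity accounting above.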
Remark 1
Via a refined analysis of gradient descent, it is in fact possible to drop the projection step onto $\mathcal{L}$ and $\mathcal{R}$ in matrix completion without performance loss; cf. [67]. In particular, as long as $n^2 p \ge C \mu^3 r^3 n \log^3 n$ for some constant $C$, gradient descent converges geometrically, and needs $O(\log(1/\epsilon))$ iterations to reach $\epsilon$-accuracy for the reconstruction error measured not only in the Frobenius norm $\|\cdot\|_F$, but also in the spectral norm $\|\cdot\|$ and the entrywise infinity norm $\|\cdot\|_\infty$. For the Singular Value Projection algorithm (21), geometric convergence in the entrywise infinity norm is also established in the work [72] without the need of additional regularization or a separate initialization procedure.
Remark 2
In the noisy setting, the algorithms can be applied without change, and the same error bounds hold with an additional term that depends on the noise. For matrix completion [49], this term is proportional to $\frac{\sqrt{r}}{p}\|\Pi_\Omega(W)\|$, where $W$ is the noise matrix supported on the observed indices. This term can be bounded under various noise models. For example, when $W$ has i.i.d. Gaussian or other sub-Gaussian entries with zero mean and variance $\sigma^2$, then $\|\Pi_\Omega(W)\| \lesssim \sigma\sqrt{np}$ with high probability, yielding an error bound of order $\sigma\sqrt{nr/p}$. The resulting error bound is optimal in an information-theoretic sense [9]. See also [67] for the near-optimal error control in the spectral norm and the entrywise infinity norm.
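Under a Bernoulli sampling model with i.i.d. Gaussian noise of variance sigma^2 on the observed entries, the spectral norm of the observed noise is expected to scale like sigma*sqrt(n*p). This can be checked empirically (the sizes are illustrative, and the check is a sanity check rather than a proof):

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = 0.1
ratios = []
for n, p in [(200, 0.2), (400, 0.1)]:
    mask = rng.random((n, n)) < p                    # observed index set Omega
    W = sigma * rng.standard_normal((n, n)) * mask   # noise only on observed entries
    # Compare the spectral norm against the predicted sigma * sqrt(n * p) scaling
    ratios.append(np.linalg.norm(W, 2) / (sigma * np.sqrt(n * p)))
```

The ratios concentrate around a small constant (close to 2, the random-matrix edge value for matrices with i.i.d. entries of variance sigma^2 * p), consistent with the stated scaling.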
5.2 Global Geometry and Saddle-Point Escaping Algorithms
A very recent line of work studies the global geometry of the Burer-Monteiro formulation (18), as well as its computational implications for algorithms starting at an arbitrary initial solution [54, 55, 69, 57]. These results are based on a geometric notion called the strict saddle property [54, 70].
Definition 3 (Strict Saddle)
A function $f(x)$ is said to be $(\epsilon, \gamma, \zeta)$-strict saddle, if for each $x$ at least one of the following holds:

$\|\nabla f(x)\|_2 \ge \epsilon$;

$\lambda_{\min}\big(\nabla^2 f(x)\big) \le -\gamma$;

there exists a local minimum $x^\star$ such that $\|x - x^\star\|_2 \le \zeta$.
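To make the definition concrete, one can verify the strict saddle property numerically for the one-dimensional function f(x) = (x^2 - 1)^2, which has minima at x = +/-1 and a strict saddle (local maximum) at x = 0; the parameter values below are our own choices, which happen to certify the property on the interval [-2, 2].

```python
import numpy as np

# f(x) = (x^2 - 1)^2: minima at x = -1, +1; strict saddle (local max) at x = 0
f = lambda x: (x**2 - 1)**2
df = lambda x: 4 * x * (x**2 - 1)        # gradient
d2f = lambda x: 12 * x**2 - 4            # Hessian (second derivative)

eps, gamma, zeta = 1.0, 1.0, 0.3         # assumed strict-saddle parameters
minima = np.array([-1.0, 1.0])

def strict_saddle_holds(x):
    """At least one of: large gradient, negative curvature, near a minimum."""
    return (abs(df(x)) >= eps
            or d2f(x) <= -gamma
            or np.min(np.abs(x - minima)) <= zeta)

grid = np.linspace(-2.0, 2.0, 2001)
all_ok = all(strict_saddle_holds(x) for x in grid)
```

Near x = 0 the Hessian is about -4 (negative curvature), near x = +/-1 the points fall within zeta of a minimum, and everywhere else the gradient is bounded away from zero, so every grid point satisfies one of the three conditions.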
In Figure 2 and Figure 3 we provide examples of a one-dimensional function and a two-dimensional function, respectively, that satisfy the strict saddle property in Definition 3. Intuitively, the strict saddle property of $f$ ensures that whenever one is not already close to a local minimum, the current solution will have either a large gradient, or a descent direction due to the Hessian having a negative eigenvalue. Therefore, any local search algorithm that is capable of finding such a descent direction will make progress in decreasing the value of $f$ and eventually converge to a local minimum. Many algorithms have been shown to enjoy this property, including cubic regularization [73], trust-region algorithms [74],
[75], and other more recent variants [76, 77, 78]. In Algorithm 3, we describe one such algorithm, namely the Perturbed Gradient Descent (PGD) algorithm from [76]. PGD is based on the standard gradient descent algorithm with the following additional steps: (i) when the gradient is small, indicating potential closeness to a saddle point, PGD adds a random perturbation to the current iterate (which is done at most once every $t_{\mathrm{thres}}$ iterations); (ii) if the last perturbation occurred $t_{\mathrm{thres}}$ iterations ago and the function value has not decreased sufficiently since then, PGD terminates and outputs the iterate obtained before the last perturbation. We do not delve further into the details of these saddle-escaping algorithms, as their parameter choices and runtime guarantees are somewhat technical. Rather, for the purpose of analyzing low-rank matrix estimation problems, we simply rely on the existence of such algorithms and the fact that their running time depends polynomially on the problem parameters. This is summarized in the following theorem, which abstracts out the key results in the works cited above.
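The perturbation mechanism can be sketched as follows. This is a simplified illustration in the spirit of the description above, not a transcription of Algorithm 3: the parameter names and values are our own, and the termination test based on function decrease is omitted for brevity.

```python
import numpy as np

def perturbed_gd(grad, x0, eta=0.01, g_thres=1e-3, radius=1e-2,
                 t_thres=50, n_iters=5000, seed=0):
    """Gradient descent that adds a small random perturbation whenever the
    gradient is small (at most once every t_thres iterations), which helps
    the iterates escape strict saddle points."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    last_perturb = -t_thres
    for t in range(n_iters):
        g = grad(x)
        if np.linalg.norm(g) < g_thres and t - last_perturb >= t_thres:
            x = x + radius * rng.standard_normal(x.shape)  # random kick
            last_perturb = t
        else:
            x = x - eta * g                                # plain gradient step
    return x

# f(x, y) = (x^2 - 1)^2 + y^2 has a strict saddle at the origin
grad = lambda v: np.array([4 * v[0] * (v[0]**2 - 1), 2 * v[1]])
x_final = perturbed_gd(grad, np.zeros(2))
```

Started exactly at the saddle (0, 0), where the gradient vanishes, plain gradient descent would stay put forever; the random kick breaks the symmetry and the iterates then descend into one of the two minima at (+/-1, 0).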
Theorem 5 (Optimizing strict saddle functions)
Assume that $f$ is smooth and $(\epsilon, \gamma, \zeta)$-strict saddle. There exist algorithms (such as PGD in Algorithm 3 with appropriate choices of the parameters) that output a solution that is $\zeta$-close to a local minimum of $f$, with the required number of iterations upper bounded by a polynomial function of $1/\epsilon$, $1/\gamma$ and $1/\zeta$.
Specializing to the low-rank matrix estimation problem, it remains to verify that (i) the loss function defined on $(L, R)$ is strict saddle and (ii) all its local minima are "good", in the sense that they correspond to a low-rank matrix equal to (in the noisy case, close to) the true matrix $M$. To this end, for matrix sensing we consider the same loss function $f(L, R)$ as before, defined in (22). For matrix completion, we consider a regularized loss function
$$\widetilde{f}(L, R) = f(L, R) + Q(L, R),$$
where $f$ is given in (23), and the regularizer $Q$ is given by [50]
$$Q(L, R) = \lambda_1 \sum_{i=1}^{n_1} \big(\|e_i^\top L\|_2 - \alpha_1\big)_+^4 + \lambda_2 \sum_{j=1}^{n_2} \big(\|e_j^\top R\|_2 - \alpha_2\big)_+^4,$$
where $\lambda_1, \lambda_2, \alpha_1, \alpha_2$ are regularization parameters, and $(x)_+ := \max\{x, 0\}$. The regularization plays a similar role as the projections previously used: it encourages incoherence of $L$ and $R$. Replacing projections with regularization leads to an unconstrained formulation that fits into the strict-saddle framework above. The following theorems show that these loss functions indeed have the desired strict-saddle property with high probability under sample complexity conditions similar to before.
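A regularizer of this kind, penalizing only the rows of L and R whose norms exceed given thresholds, can be transcribed directly; the parameter values below are hypothetical and chosen only to exercise both the zero and nonzero regimes.

```python
import numpy as np

def Q(L, R, lam1, lam2, alpha1, alpha2):
    """Incoherence regularizer: fourth power of the positive part of the
    excess row norm, so rows within the thresholds incur zero penalty."""
    pL = np.maximum(np.linalg.norm(L, axis=1) - alpha1, 0.0)
    pR = np.maximum(np.linalg.norm(R, axis=1) - alpha2, 0.0)
    return lam1 * np.sum(pL**4) + lam2 * np.sum(pR**4)

rng = np.random.default_rng(5)
L = rng.standard_normal((50, 2))
R = rng.standard_normal((50, 2))
q_small = Q(L, R, 1.0, 1.0, alpha1=10.0, alpha2=10.0)       # all rows in bounds
q_large = Q(10 * L, R, 1.0, 1.0, alpha1=10.0, alpha2=10.0)  # some rows too heavy
```

Because the penalty vanishes on the feasible region, $Q$ does not bias incoherent solutions, while heavy (incoherent-violating) rows are penalized smoothly, which keeps the overall objective amenable to the smooth strict-saddle analysis.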