# Provable Efficient Online Matrix Completion via Non-convex Stochastic Gradient Descent

Matrix completion, where we wish to recover a low rank matrix by observing a few of its entries, is a widely studied problem, in both theory and practice, with a wide range of applications. Most of the provable algorithms so far for this problem have been restricted to the offline setting, where they provide an estimate of the unknown matrix using all observations simultaneously. However, in many applications, the online version, where we observe one entry at a time and dynamically update our estimate, is more appealing. While existing algorithms are efficient for the offline setting, they could be highly inefficient for the online setting. In this paper, we propose the first provable, efficient online algorithm for matrix completion. Our algorithm starts from an initial estimate of the matrix and then performs non-convex stochastic gradient descent (SGD). After every observation, it performs a fast update involving only one row of two tall matrices, giving near linear total runtime. Our algorithm can be naturally used in the offline setting as well, where it gives sample complexity and runtime competitive with state of the art algorithms. Our proofs introduce a general framework to show that SGD updates tend to stay away from saddle surfaces, which may be of broader interest for proving tight rates for other non-convex problems.


## 1 Introduction

Low rank matrix completion refers to the problem of recovering a low rank matrix by observing the values of only a tiny fraction of its entries. This problem arises in several applications such as video denoising [14], phase retrieval [3] and, most famously, movie recommendation engines [16]. In the context of recommendation engines, for instance, the matrix we wish to recover would be the user-item rating matrix, where each row corresponds to a user and each column corresponds to an item. Each entry of the matrix is the rating given by a user to an item. The low rank assumption on the matrix is inspired by the intuition that the rating of an item by a user depends on only a few hidden factors, which are much fewer than the number of users or items. The goal is to estimate the ratings of all items by all users given only partial ratings, which would then be helpful in recommending new items to users.

The seminal work of Candès and Recht [4] first identified regularity conditions under which low rank matrix completion can be solved in polynomial time using convex relaxation; low rank matrix completion can be ill-posed and NP-hard in general without such regularity assumptions [10]. Since then, a number of works have studied various algorithms under different settings for matrix completion: weighted and noisy matrix completion, fast convex solvers, fast iterative non-convex solvers, parallel and distributed algorithms, and so on.

Most of this work however deals only with the offline setting where all the observed entries are revealed at once and the recovery procedure does computation using all these observations simultaneously. However in several applications [5, 19], we encounter the online setting where observations are only revealed sequentially and at each step the recovery algorithm is required to maintain an estimate of the low rank matrix based on the observations so far. Consider for instance recommendation engines, where the low rank matrix we are interested in is the user-item rating matrix. While we make an observation only when a user rates an item, at any point of time, we should have an estimate of the user-item rating matrix based on all prior observations so as to be able to continuously recommend items to users. Moreover, this estimate should get better as we observe more ratings.

Algorithms for offline matrix completion can be used to solve the online version by rerunning the algorithm after every additional observation. However, performing so much computation for every observation seems wasteful and is also impractical. For instance, using alternating minimization, which is among the fastest known algorithms for the offline problem, would mean that we take several passes over the entire data for every additional observation. This is simply not feasible in most settings. Another natural approach is to group observations into batches and do an update only once for each batch. This, however, induces a lag between observations and estimates, which is undesirable. To the best of our knowledge, there is no known provable, efficient, online algorithm for matrix completion.

On the other hand, in order to deal with the online matrix completion scenario in practical applications, several heuristics (with no convergence guarantees) have been proposed in the literature [2, 20]. Most of these approaches are based on starting with an estimate of the matrix and doing fast updates of this estimate whenever a new observation is presented. One of the update procedures used in this context is that of stochastic gradient descent (SGD) applied to the following non-convex optimization problem

  \min_{U,V} \|M - UV^\top\|_F^2 \quad \text{s.t.} \quad U \in \mathbb{R}^{d_1 \times k},\; V \in \mathbb{R}^{d_2 \times k}, \qquad (1)

where M is the unknown matrix of size d₁ × d₂, k is the rank of M, and UV⊤ is the low rank factorization of M we wish to obtain. The algorithm starts with some U₀ and V₀, and given a new observation M_{ij}, SGD updates the i-th row and the j-th row of the current iterates U_t and V_t respectively by

  U^{(i)}_{t+1} = U^{(i)}_t + 2\eta d_1 d_2 \big(M - U_t V_t^\top\big)_{ij} V^{(j)}_t, \qquad V^{(j)}_{t+1} = V^{(j)}_t + 2\eta d_1 d_2 \big(M - U_t V_t^\top\big)_{ij} U^{(i)}_t \qquad (2)

where η is an appropriately chosen stepsize, and U^{(i)} and V^{(j)} denote the i-th row of U and the j-th row of V respectively. Note that each update modifies only one row of each of the factor matrices U and V, and the computation involves only one row of U, one row of V and the newly observed entry, and hence is extremely fast. These fast updates make SGD extremely appealing in practice. Moreover, SGD, in the context of matrix completion, is also useful for parallelization and distributed implementation [24].
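As a concrete illustration, the row update in (2) can be sketched in a few lines of numpy. This is a hypothetical sketch, not the paper's reference implementation; the function name, the uniform-sampling assumption, and the constants are ours.

```python
import numpy as np

def sgd_update(U, V, i, j, M_ij, eta, d1, d2):
    """One online SGD step of (2) on observing entry M_ij (illustrative sketch).

    Only row i of U and row j of V change, so the step costs O(k) time.
    The d1*d2 factor compensates for observing a single uniformly sampled entry.
    """
    residual = M_ij - U[i] @ V[j]              # (M - U V^T)_{ij}
    step = 2.0 * eta * d1 * d2 * residual
    ui, vj = U[i].copy(), V[j].copy()          # read old rows before writing
    U[i] = ui + step * vj
    V[j] = vj + step * ui
    return U, V
```

Running this update over a stream of uniformly sampled entries of a low rank matrix, with a small stepsize, drives the reconstruction error down while touching only 2k numbers per observation.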

### 1.1 Our Contributions

In this work we present the first provable, efficient algorithm for online matrix completion, by showing that SGD (2) with a good initialization converges to a true factorization of M at a geometric rate. Our main contributions are as follows.

• We provide the first provable, efficient, online algorithm for matrix completion. Starting with a good initialization, after each observation, the algorithm makes a quick update touching only two rows of the factor matrices, and it requires a number of observations near linear in d to reach ε accuracy, where μ is the incoherence parameter, d = max{d₁, d₂}, k is the rank and κ is the condition number of M.

• Moreover, our result features both sample complexity and total runtime linear in d, and is competitive with even the best existing offline results for matrix completion (it either improves over these results or is incomparable to them, i.e., better in some parameters and worse in others). See Table 1 for the comparison.

• To obtain our results, we introduce a general framework to show that SGD updates tend to stay away from saddle surfaces. In order to do so, we consider distances from saddle surfaces, show that they behave like sub-martingales under SGD updates, and use martingale convergence techniques to conclude that the iterates stay away from saddle surfaces. While [25] shows that SGD updates stay away from saddle surfaces, the stepsizes they can handle are quite small (scaling inversely with polynomial factors of d), leading to suboptimal computational complexity. Our framework makes it possible to establish the same statement for much larger stepsizes, giving us near-optimal runtime. We believe these techniques may be applicable in other non-convex settings as well.

### 1.2 Related Work

In this section we will mention some more related work.

Offline matrix completion: There has been a lot of work on designing offline algorithms for matrix completion; we provide a detailed comparison with our algorithm in Table 1. The nuclear norm relaxation algorithm [23] has near-optimal sample complexity for this problem but is computationally expensive. Motivated by the empirical success of non-convex heuristics, a long line of works, [15, 9, 13, 25] and so on, has obtained convergence guarantees for alternating minimization, gradient descent, projected gradient descent, etc. Even the best of these are suboptimal in sample complexity by polynomial factors. Our sample complexity is better than that of [15] and is incomparable to those of [9, 13]. To the best of our knowledge, the only provable online algorithm for this problem is that of Sun and Luo [25]. However, the stepsizes they suggest are quite small, leading to computational complexity that is suboptimal by polynomial factors of d. The runtime of our algorithm is linear in d, which improves over theirs.

Other models for online matrix completion: Another variant of online matrix completion studied in the literature is where observations are made on a column by column basis e.g., [17, 27]. These models can give improved offline performance in terms of space and could potentially work under relaxed regularity conditions. However, they do not tackle the version where only entries (as opposed to columns) are observed.

Non-convex optimization: Over the last few years, there has also been a significant amount of work on designing efficient algorithms for other non-convex problems. Examples include eigenvector computation [6, 12], sparse coding [21, 1], etc. For general non-convex optimization, an interesting line of recent work is [7], which proves that gradient descent with noise can escape saddle points, but only provides a polynomial rate without explicit dependence on the dimension. Later, [18, 22] show that without noise, the set of points from which gradient descent converges to a saddle point has measure zero. However, they do not provide a rate of convergence. Another piece of work related to ours is [11], which proves global convergence, along with rates of convergence, for the special case of computing the matrix square root. During the preparation of this draft, the recent work [8] was announced, which proves global convergence of SGD for matrix completion and can also be applied to the online setting. However, their result only deals with the case where M is positive semidefinite (PSD), and their rate is still suboptimal by polynomial factors of d.

### 1.3 Outline

The rest of the paper is organized as follows. In Section 2 we formally describe the problem and all relevant parameters. In Section 3, we present our algorithms, results and some of the key intuition behind our results. In Section 4 we give proof outline for our main results. We conclude in Section 5. All formal proofs are deferred to the Appendix.

## 2 Preliminaries

In this section, we introduce our notation, formally define the matrix completion problem and regularity assumptions that make the problem tractable.

### 2.1 Notation

We use [d] to denote the set {1, 2, …, d}. We use bold capital letters to denote matrices and bold lowercase letters to denote vectors. M_{ij} means the (i, j)-th entry of matrix M. ∥v∥ denotes the ℓ₂-norm of vector v, and ∥M∥ / ∥M∥_F / ∥M∥_∞ denotes the spectral / Frobenius / infinity norm of matrix M. σ_max(M) denotes the largest singular value of M and σ_min(M) denotes the smallest nonzero singular value of M. We also let κ(M) = σ_max(M)/σ_min(M) denote the condition number of M (i.e., the ratio of the largest to the smallest nonzero singular value). Finally, for an orthonormal basis W of a subspace, we also use P_W to denote the projection onto the subspace spanned by W.

### 2.2 Problem statement and assumptions

Consider a general rank-k matrix M ∈ ℝ^{d₁×d₂}. Let Ω ⊆ [d₁] × [d₂] be a subset of coordinates, which are sampled uniformly and independently from [d₁] × [d₂]. We denote P_Ω(M) to be the projection of M on the set Ω, so that:

  [P_\Omega(M)]_{ij} = \begin{cases} M_{ij}, & \text{if } (i,j) \in \Omega \\ 0, & \text{if } (i,j) \notin \Omega \end{cases}

Low rank matrix completion is the task of recovering M by only observing P_Ω(M). This task is ill-posed and NP-hard in general [10]. In order to make it tractable, we make by now standard assumptions about the structure of M.
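For concreteness, the sampling of Ω and the projection P_Ω can be sketched as follows (an illustrative implementation; the function names are ours):

```python
import numpy as np

def sample_omega(d1, d2, m, rng):
    """Sample m coordinates uniformly and independently from [d1] x [d2]."""
    return list(zip(rng.integers(0, d1, size=m), rng.integers(0, d2, size=m)))

def project_omega(M, omega):
    """P_Omega(M): keep the observed entries of M and zero out the rest."""
    out = np.zeros_like(M)
    for (i, j) in omega:
        out[i, j] = M[i, j]
    return out
```

In practice P_Ω(M) would be stored as a sparse matrix, since only m of the d₁d₂ entries are nonzero.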

###### Definition 2.1.

Let W be an orthonormal basis of a subspace of ℝ^d of dimension k. The coherence of W is defined to be

  \mu(W) \overset{\text{def}}{=} \frac{d}{k} \max_{1 \le i \le d} \|P_W e_i\|^2 = \frac{d}{k} \max_{1 \le i \le d} \|e_i^\top W\|^2
###### Assumption 2.2 (μ-incoherence[4, 23]).

We assume M is μ-incoherent, i.e., max{μ(X), μ(Y)} ≤ μ, where X, Y are the left and right singular vectors of M.
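The coherence μ(W) is just a rescaled maximum squared row norm of the basis matrix, which makes it easy to compute; a small sketch (our own helper, not from the paper):

```python
import numpy as np

def coherence(W):
    """mu(W) = (d/k) * max_i ||P_W e_i||^2 for an orthonormal basis W (d x k).

    ||P_W e_i||^2 = ||e_i^T W||^2 is the squared norm of row i of W.
    """
    d, k = W.shape
    row_norms_sq = np.sum(W**2, axis=1)
    return (d / k) * np.max(row_norms_sq)
```

Coherence ranges from 1 (mass perfectly spread across rows, e.g. a normalized all-ones column) up to d/k (a basis of standard coordinate vectors, whose entries concentrate in k rows).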

## 3 Main Results

In this section, we present our main results. We will first state the result for a special case where M is a symmetric positive semi-definite (PSD) matrix, where the algorithm and analysis are much simpler. We will then discuss the general case.

### 3.1 Symmetric PSD Case

Consider the special case where M is symmetric PSD. We let d = d₁ = d₂, and we can parametrize a rank-k symmetric PSD matrix by UU⊤ where U ∈ ℝ^{d×k}. Our algorithm for this case is given in Algorithm 1. The algorithm starts by using an initial set of samples to construct a crude approximation to the low rank factorization of M. It then observes samples from M one at a time and updates its factorization after every observation. Note that each update step modifies two rows of U_t and hence takes time O(k). The following theorem provides guarantees on the performance of Algorithm 1.

###### Theorem 3.1.

Let M be a rank-k, symmetric PSD matrix with μ-incoherence. There exist absolute constants such that if the number m of initial observations is sufficiently large and the learning rate η is sufficiently small, then for any fixed horizon T, with high probability, we will have for all t ≤ T that:

  \|U_t U_t^\top - M\|_F^2 \le \left(1 - \tfrac{1}{2}\eta\,\sigma_{\min}(M)\right)^t \left(\tfrac{1}{10}\sigma_{\min}(M)\right)^2.

Remarks:

• The algorithm uses an initial set of observations to produce a warm start iterate U₀, then enters the online stage, where it performs SGD.

• The sample complexity of the warm start phase is near linear in d. The initialization consists of a top-k SVD of a sparse matrix built from the initial observations, whose runtime is also near linear in d.

• For the online phase (SGD), with an appropriately chosen stepsize η, the number of observations required for the error to drop below ε scales near linearly in d and only logarithmically in 1/ε, by the geometric convergence above.

• Since each SGD step modifies two rows of U_t, its runtime is O(k), giving a total runtime for the online phase that is near linear in d.

Our proof approach is to essentially show that the objective function is well-behaved (i.e., smooth and strongly convex) in a local neighborhood of the warm start region, and then use standard techniques to show that SGD obtains geometric convergence in this setting. The most challenging and novel part of our analysis consists of showing that the iterates do not leave this local neighborhood while performing SGD updates. Refer to Section 4 for more details on the proof outline.
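To make the two-phase structure concrete, here is a minimal numpy sketch of an Algorithm-1-style procedure for the PSD case: a warm start via a top-k SVD of the rescaled observation matrix, followed by online SGD touching only two rows per observation. This is our own illustrative sketch with arbitrary constants, not the paper's pseudocode; for a genuinely sparse observation matrix one would use a sparse truncated SVD rather than a dense one.

```python
import numpy as np

def initialize(M, omega, d, k, m):
    """Warm start: top-k SVD of the rescaled observation matrix (d^2/m) P_Omega(M)."""
    P = np.zeros((d, d))
    for (i, j) in omega:
        P[i, j] = M[i, j]
    A = (d * d / m) * P
    # dense SVD for simplicity; a sparse truncated SVD is the realistic choice
    W, s, _ = np.linalg.svd(A)
    return W[:, :k] * np.sqrt(s[:k])        # U0 with U0 @ U0.T close to M

def online_sgd(M, U, eta, steps, rng):
    """Online phase: each observed entry (i, j) updates only rows i and j of U."""
    d = M.shape[0]
    for _ in range(steps):
        i, j = int(rng.integers(d)), int(rng.integers(d))
        ui, uj = U[i].copy(), U[j].copy()   # read old rows before writing
        r = ui @ uj - M[i, j]               # (U U^T - M)_{ij}
        U[i] = U[i] - 2 * eta * d * d * r * uj
        U[j] = U[j] - 2 * eta * d * d * r * ui
    return U
```

Note the copies of the old rows: when i = j, the two sequential row writes then correctly accumulate the doubled diagonal term of the stochastic gradient.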

### 3.2 General Case

Let us now consider the general case where M ∈ ℝ^{d₁×d₂} can be factorized as UV⊤ with U ∈ ℝ^{d₁×k} and V ∈ ℝ^{d₂×k}. In this scenario, we denote d = max{d₁, d₂}. We recall our remarks from the previous section that our analysis of the performance of SGD depends on the smoothness and strong convexity properties of the objective function in a local neighborhood of the iterates. Having two factors introduces additional challenges in this approach, since for any nonsingular k-by-k matrix C, the pairs (U, V) and (UC, VC^{-⊤}) give the same product: UV⊤ = (UC)(VC^{-⊤})⊤. Suppose for instance C is a very small scalar times the identity, i.e., C = εI for some small ε > 0. In this case, ∥VC^{-⊤}∥ will be large while ∥UC∥ will be small. This drastically deteriorates the smoothness and strong convexity properties of the objective function in a neighborhood of (UC, VC^{-⊤}).
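The ambiguity in the factorization is easy to see numerically: rescaling the two factors in opposite directions leaves the product (and hence the objective value) unchanged while making the factors arbitrarily unbalanced. A small demonstration (our own, with arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, k = 50, 40, 3
U = rng.standard_normal((d1, k))
V = rng.standard_normal((d2, k))

eps = 1e-3
C = eps * np.eye(k)               # an ill-scaled but invertible k x k matrix
U_bad = U @ C                     # shrinks U by eps
V_bad = V @ np.linalg.inv(C).T    # blows V up by 1/eps

# The product, and hence the objective value, is unchanged ...
assert np.allclose(U_bad @ V_bad.T, U @ V.T)
# ... but the factors are badly unbalanced, which ruins local smoothness
assert np.linalg.norm(V_bad) > 1000 * np.linalg.norm(U_bad)
```

The renormalization step discussed next removes exactly this degree of freedom by keeping the two factors balanced.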

To preclude such a scenario, we would ideally like to renormalize the factors after each step, replacing (U_t, V_t) by an equivalent balanced pair, obtained via an SVD of the current iterate, so that both factors share the same singular values. This algorithm is described in Algorithm 2. However, a naive implementation of Algorithm 2, especially the SVD step, would incur computation scaling with d per iteration, resulting in a large runtime overhead over both the online PSD case (i.e., Algorithm 1) and the near linear time offline algorithms (see Table 1). It turns out that we can take advantage of the fact that in each iteration we only update a single row of U_t and a single row of V_t, and perform efficient (but more complicated) update steps instead of an SVD of a d × k matrix. The resulting algorithm is given in Algorithm 3. The key idea is that in order to implement the updates, it suffices to compute SVDs of k × k matrices, so the runtime of each iteration is at most O(k³). The following lemma shows the equivalence between Algorithms 2 and 3.

###### Lemma 3.2.

Algorithm 2 and Algorithm 3 are equivalent in the sense that: given the same observations from M and the same other inputs, the outputs U_t, V_t of Algorithm 2 and the outputs Ũ_t, Ṽ_t of Algorithm 3 satisfy U_t V_t⊤ = Ũ_t Ṽ_t⊤.

Since the outputs of both algorithms coincide, we can analyze Algorithm 2 (whose analysis is easier than that of Algorithm 3) while implementing Algorithm 3 in practice. The following theorem is the main result of our paper, which presents guarantees on the performance of Algorithm 2.

###### Theorem 3.3.

Let M be a rank-k matrix with μ-incoherence and let d = max{d₁, d₂}. There exist absolute constants such that if the number m of initial observations is sufficiently large and the learning rate η is sufficiently small, then for any fixed horizon T, with high probability, we will have for all t ≤ T that:

  \|U_t V_t^\top - M\|_F^2 \le \left(1 - \tfrac{1}{2}\eta\,\sigma_{\min}(M)\right)^t \left(\tfrac{1}{10}\sigma_{\min}(M)\right)^2.

Remarks:

• Just as in the case of PSD matrix completion (Theorem 3.1), Algorithm 2 needs an initial set of observations to provide a warm start, after which it performs SGD.

• The sample complexity and runtime of the warm start phase are the same as in the symmetric PSD case. The stepsize and the number of observations needed to achieve error ε in the online phase (SGD) are also the same as in the symmetric PSD case.

• However, the runtime of each update step in the online phase is O(k³) due to the renormalization, so the total runtime of the online phase is O(k³) times the number of observations.

The proof of this theorem again follows a similar line of reasoning as that of Theorem 3.1: first showing that the local neighborhood of the warm start iterate has good smoothness and strong convexity properties, and then using them to show geometric convergence of SGD. Proving that the iterates do not move away from this local neighborhood is, however, significantly more challenging due to the renormalization steps in the algorithm. Please see Appendix C for the full proof.

## 4 Proof Sketch

In this section we will provide the intuition and proof sketch for our main results. For simplicity, and to highlight the most essential ideas, we will mostly focus on the symmetric PSD case (Theorem 3.1). For the asymmetric case, though the high-level ideas remain valid, a lot of additional effort is required to address the renormalization step in Algorithm 2, which makes the proof more involved.

First, note that our algorithm for the PSD case consists of an initialization followed by stochastic descent steps. The following lemma provides guarantees on the error achieved by the initial iterate U₀.

###### Lemma 4.1.

Let M be a rank-k PSD matrix with μ-incoherence. There exists a universal constant c such that, if the number m of initial observations is large enough, then with high probability the top-k SVD of (d²/m)P_Ω(M) yields an iterate U₀ satisfying:

  \|M - U_0 U_0^\top\|_F \le \tfrac{1}{20}\sigma_{\min}(M) \quad \text{and} \quad \max_j \|e_j^\top U_0\|^2 \le \frac{10\mu k\,\kappa(M)}{d}\|M\| \qquad (3)

By Lemma 4.1, we know the initialization algorithm already gives U₀ in the local region described by Eq. (3). Intuitively, the stochastic descent steps should keep doing local search within this local region.

To establish linear convergence of f(U_t) and obtain the final result, we first establish several important lemmas describing the properties of this local region. Throughout this section, we always denote the eigendecomposition of M by M = XΣX⊤, where X ∈ ℝ^{d×k} has orthonormal columns and Σ is a k × k diagonal matrix. We postpone all formal proofs to the Appendix.

###### Lemma 4.2.

For the function f(U) = ∥UU⊤ − M∥²_F and any U₁, U₂ with max{∥U₁∥, ∥U₂∥} ≤ Γ, we have:

  \|\nabla f(U_1) - \nabla f(U_2)\|_F \le 16\max\{\Gamma^2, \|M\|\}\cdot\|U_1 - U_2\|_F
###### Lemma 4.3.

For the function f(U) = ∥UU⊤ − M∥²_F and any U with σ_min(X⊤U) ≥ γ, we have:

  \|\nabla f(U)\|_F^2 \ge 4\gamma^2 f(U)

Lemma 4.2 tells us the function f is smooth if the spectral norm of U is not too large. On the other hand, σ_min(X⊤U) being not too small requires both that σ_min(U) is not too small and that the angle between the column space of U and X is not too large, where X is the top-k eigenspace of M. That is, Lemma 4.3 tells us f has a property similar to strong convexity in the standard optimization literature, provided U is rank k in a robust sense (σ_min(U) is not too small) and the angle between the column space of U and the top-k eigenspace of M is not large.

###### Lemma 4.4.

Within the region {U : ∥UU⊤ − M∥_F ≤ (1/10)σ_min(M)}, we have:

  \|U\| \le \sqrt{2\|M\|}, \qquad \sigma_{\min}(X^\top U) \ge \sqrt{\sigma_k(M)/2}

Lemma 4.4 tells us that inside this region, the matrix U always has good spectral properties, which supply the preconditions of both Lemma 4.2 and Lemma 4.3; inside the region, f is therefore both smooth and endowed with a property very similar to strong convexity.

With the above three lemmas, we can already see the intuition behind the linear convergence in Theorem 3.1. Denote the stochastic gradient by

  SG(U) = 2d^2\,(UU^\top - M)_{ij}\,(e_i e_j^\top + e_j e_i^\top)\,U \qquad (4)

where (i, j) is the uniformly sampled coordinate, so SG(U) is a random matrix depending on the randomness of the sampled entry (i, j) of matrix M. Then, the stochastic update step in Algorithm 1 can be rewritten as:

  U_{t+1} \leftarrow U_t - \eta\,SG(U_t)

Let f(U) = ∥UU⊤ − M∥²_F. By an easy calculation, we know 𝔼 SG(U) = ∇f(U), that is, SG(U) is unbiased. Combining Lemma 4.4 with Lemma 4.2 and Lemma 4.3, we know that within the region specified by Lemma 4.4, the function f is 32∥M∥-smooth and satisfies ∥∇f(U)∥²_F ≥ 2σ_min(M)·f(U).
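The unbiasedness claim 𝔼 SG(U) = ∇f(U) can be checked numerically: averaging (4) over all d² coordinates (i, j) collapses to 2(E + E⊤)U = 4EU with E = UU⊤ − M, which is exactly ∇f(U). A small verification sketch (our own):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 2
M = (lambda A: A @ A.T)(rng.standard_normal((d, k)))   # random rank-k PSD matrix
U = rng.standard_normal((d, k))

E = U @ U.T - M
grad = 4 * E @ U                  # full gradient of f(U) = ||U U^T - M||_F^2

# Average SG(U) = 2 d^2 E_{ij} (e_i e_j^T + e_j e_i^T) U over all d^2 pairs (i, j)
avg = np.zeros_like(U)
for i in range(d):
    for j in range(d):
        G = np.zeros((d, d))
        G[i, j] += 1.0
        G[j, i] += 1.0
        avg += 2 * d * d * E[i, j] * (G @ U)
avg /= d * d

assert np.allclose(avg, grad)     # E[SG(U)] matches the full gradient
```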

Let us suppose, ideally, that we always have U_t inside the region of Lemma 4.4. This directly gives:

  \mathbb{E}f(U_{t+1}) \le \mathbb{E}f(U_t) - \eta\,\mathbb{E}\|\nabla f(U_t)\|_F^2 + 16\eta^2\|M\|\cdot\mathbb{E}\|SG(U_t)\|_F^2 \le \left(1 - 2\eta\,\sigma_{\min}(M)\right)\mathbb{E}f(U_t) + 16\eta^2\|M\|\cdot\mathbb{E}\|SG(U_t)\|_F^2

One interesting aspect of our main result is that we actually show linear convergence in the presence of noise in the gradient. This is true because, for the second-order (η²) term above, we can roughly see from Eq. (4) that 𝔼∥SG(U)∥²_F ≤ β·f(U), where β is a factor that depends on d, ∥M∥ and the incoherence of U and is always bounded. That is, SG(U) enjoys a self-bounded property: its variance goes to zero as the objective function goes to zero. Therefore, by choosing the learning rate η appropriately small, the first-order term always dominates the second-order term, which establishes the linear convergence.

Now, the only remaining issue is to prove that the iterates U_t always stay inside the local region of Lemma 4.4. In reality, we can only prove this statement with high probability, due to the stochastic nature of the updates. This is also the most challenging part of our proof; it makes our analysis different from standard convex analysis and is uniquely required by the non-convex setting.

Our key theorem is presented as follows:

###### Theorem 4.5.

Let f(U) = ∥UU⊤ − M∥²_F and g_i(U) = ∥e_i⊤U∥². Suppose the initial U₀ satisfies:

  f(U_0) \le \left(\frac{\sigma_{\min}(M)}{20}\right)^2, \qquad \max_i g_i(U_0) \le \frac{10\mu k\,\kappa(M)^2}{d}\|M\|

Then, there exists some absolute constant c such that, for any sufficiently small learning rate η and any fixed horizon T, with high probability we will have for all t ≤ T that:

  f(U_t) \le \left(1 - \tfrac{1}{2}\eta\,\sigma_{\min}(M)\right)^t\left(\frac{\sigma_{\min}(M)}{10}\right)^2, \qquad \max_i g_i(U_t) \le \frac{20\mu k\,\kappa(M)^2}{d}\|M\| \qquad (5)

Note that the function g_i measures the incoherence of the matrix U. Theorem 4.5 guarantees that if the initial U₀ lies in a local region that is incoherent and in which U₀U₀⊤ is close to M, then with high probability, for all steps t, the iterate U_t will always stay in a slightly relaxed local region, and f(U_t) converges linearly.

It is not hard to show that all saddle points of f satisfy σ_min(X⊤U) = 0, and that all local minima are global minima. Since U_t automatically stays, with high probability, in a region where σ_min(X⊤U_t) is bounded away from zero, we know U_t also stays away from all saddle points. The claim that U_t stays incoherent is essential for better controlling the variance and the almost-sure bound on SG(U_t), so that we can use a large stepsize and obtain a tight convergence rate.

The major challenge in proving Theorem 4.5 is to simultaneously prove that U_t stays in the local region and achieve good sample complexity and running time (linear in d). This also requires the learning rate η in Algorithm 1 to be relatively large. Let E_t denote the good event in which the iterates up to step t satisfy Eq. (5). Theorem 4.5 claims that the probability of the good event is large. The essential steps in the proof are constructing two supermartingales related to f(U_t)·1_{E_t} and g_i(U_t)·1_{E_t} (where 1_{E_t} denotes the indicator function of E_t), and using a Bernstein inequality to show the concentration of these supermartingales. The indicator term allows us to claim that all previous iterates have all the desired properties inside the local region.

Finally, Theorem 3.1 follows as an immediate corollary of Theorem 4.5.

## 5 Conclusion

In this paper, we presented the first provable, efficient online algorithm for matrix completion, based on nonconvex SGD. In addition to the online setting, our results are also competitive with state of the art results in the offline setting. We obtain our results by introducing a general framework that helps us show how SGD updates self-regulate to stay away from saddle points. We hope our paper and results help generate interest in online matrix completion, and our techniques and framework prompt tighter analysis for other nonconvex problems.

## Appendix A Proof of Initialization

In this section, we will prove Lemma 4.1 together with a corresponding lemma for the asymmetric case, stated as follows (it will be used to prove Theorem 3.3):

###### Lemma A.1.

Assume M is a rank-k matrix with μ-incoherence, and Ω is a subset of m coordinates sampled uniformly i.i.d. from all coordinates. Let U₀DV₀⊤ be the top-k SVD of (d₁d₂/m)P_Ω(M), and let d = max{d₁, d₂}. Then there exists a universal constant c such that, for m large enough, with high probability, we have:

  \|M - U_0 V_0^\top\|_F \le \tfrac{1}{20}\sigma_{\min}(M), \qquad \max_i \|e_i^\top U_0 V_0^\top\|^2 \le \frac{10\mu k}{d_1}\|M\|, \qquad \max_j \|e_j^\top V_0 U_0^\top\|^2 \le \frac{10\mu k}{d_2}\|M\| \qquad (6)

We will focus mostly on Lemma A.1, and prove Lemma 4.1 as a special case. Most of the argument in this section follows from [15]; we include it here for completeness. The remainder of this section can be viewed as proving the Frobenius norm claim and the incoherence claim of Lemma A.1 separately.

In this section, we always denote d = max{d₁, d₂}. For simplicity, and WLOG, we assume d₁ ≤ d₂ throughout the proofs. When it is clear from the context, we use P_Ω to represent the rescaled operator (d₁d₂/m)P_Ω. Also, in the proofs, we denote the SVD of M by XΣY⊤ and the top-k SVD of (d₁d₂/m)P_Ω(M) by U₀DV₀⊤, where Σ and D are diagonal matrices.

### A.1 Frobenius Norm of Initialization

###### Theorem A.2 (Matrix Bernstein [26]).

Consider a finite sequence {X_t} of independent random matrices of dimension d₁ × d₂. Assume that each matrix satisfies:

  \mathbb{E} X_t = 0 \quad \text{and} \quad \|X_t\| \le R \ \text{almost surely}

Define

  \sigma^2 = \max\left\{\Big\|\sum_t \mathbb{E}(X_t X_t^\top)\Big\|,\ \Big\|\sum_t \mathbb{E}(X_t^\top X_t)\Big\|\right\}

Then, for all s ≥ 0,

  \Pr\left(\Big\|\sum_t X_t\Big\| \ge s\right) \le (d_1 + d_2)\cdot\exp\left(\frac{-s^2/2}{\sigma^2 + Rs/3}\right)
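As a sanity check, the tail bound can be exercised numerically on the sum of centered matrices ψ_ij = M_ij(Z_ij − m/(d₁d₂))e_i e_j⊤ that appears in Lemma A.3 below, under the Bernoulli sampling model. This is our own illustration with arbitrary sizes; for the maximally incoherent all-ones matrix, the realized spectral norm sits comfortably below the high-probability threshold s:

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, m = 30, 30, 400
M = np.ones((d1, d2))             # a maximally incoherent rank-1 matrix
p = m / (d1 * d2)

Z = rng.random((d1, d2)) < p      # Bernoulli(p) observation mask
S = M * (Z - p)                   # equals P_Omega(M) - p*M, a sum of psi_ij terms

R = np.abs(M).max()               # almost-sure bound on each summand's norm
sigma2 = p * (1 - p) * max(
    np.linalg.norm(np.sum(M**2, axis=1), np.inf),   # row variance proxy
    np.linalg.norm(np.sum(M**2, axis=0), np.inf),   # column variance proxy
)
# choose s so that (d1+d2) * exp(-s^2/2 / (sigma2 + R*s/3)) <= 1e-3
s = 1.0
while (d1 + d2) * np.exp(-(s**2 / 2) / (sigma2 + R * s / 3)) > 1e-3:
    s += 0.1

assert np.linalg.norm(S, 2) <= s  # the high-probability bound holds on this draw
```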
###### Lemma A.3.

Let Ω be a set of m coordinates sampled as above. Then there exists a universal constant c such that, with high probability, we have:

  \left\|\frac{d_1 d_2}{m} P_\Omega(M) - M\right\| \le c\sqrt{\frac{\mu d k \log d}{m}}\,\|M\|

###### Proof.

We know 𝔼 P_Ω(M) = (m/(d₁d₂))·M, and note:

  P_\Omega(M) - \frac{m}{d_1 d_2} M = \sum_{ij} M_{ij}\left(Z_{ij} - \frac{m}{d_1 d_2}\right) e_i e_j^\top

where the Z_{ij} are independent Bernoulli(m/(d₁d₂)) random variables. Let the matrix

  \psi_{ij} = M_{ij}\left(Z_{ij} - \frac{m}{d_1 d_2}\right) e_i e_j^\top

By construction, we have:

  \Big\|\sum_{ij}\psi_{ij}\Big\| = \Big\|P_\Omega(M) - \frac{m}{d_1 d_2} M\Big\|

Clearly 𝔼ψ_{ij} = 0. By the μ-incoherence of M, with probability 1:

  \|\psi_{ij}\| \le |M_{ij}| \le \frac{\mu k}{\sqrt{d_1 d_2}}\|M\| =: R

Also:

  \Big\|\sum_{ij}\mathbb{E}(\psi_{ij}\psi_{ij}^\top)\Big\| = \Big\|\sum_{ij}\mathbb{E}\,M_{ij}^2\Big(Z_{ij} - \frac{m}{d_1 d_2}\Big)^2 e_i e_i^\top\Big\| \le \frac{m}{d_1 d_2}\Big(1 - \frac{m}{d_1 d_2}\Big)\Big\|\sum_{ij} M_{ij}^2\, e_i e_i^\top\Big\| \le \frac{m\mu k}{d_1^2 d_2}\|M\|^2

  \Big\|\sum_{ij}\mathbb{E}(\psi_{ij}^\top\psi_{ij})\Big\| = \Big\|\sum_{ij}\mathbb{E}\,M_{ij}^2\Big(Z_{ij} - \frac{m}{d_1 d_2}\Big)^2 e_j e_j^\top\Big\| \le \frac{m}{d_1 d_2}\Big(1 - \frac{m}{d_1 d_2}\Big)\Big\|\sum_{ij} M_{ij}^2\, e_j e_j^\top\Big\| \le \frac{m\mu k}{d_1 d_2^2}\|M\|^2

Then, by matrix Bernstein (Theorem A.2), we have:

  \Pr\Big(\Big\|\sum_{ij}\psi_{ij}\Big\| \ge s\Big) \le 2(d_1 + d_2)\cdot\exp\left(\frac{-s^2/2}{\frac{2m\mu d k}{d_1^2 d_2^2}\|M\|^2 + \frac{\mu k\|M\|}{3\sqrt{d_1 d_2}}\,s}\right)

That is, with high probability, for some universal constant c, we have:

  \Big\|\sum_{ij}\psi_{ij}\Big\| \le c\sqrt{\frac{m\mu d k \log d}{d_1^2 d_2^2}}\,\|M\|

Choosing s accordingly and rescaling by d₁d₂/m finishes the proof. ∎

###### Theorem A.4.

Let U₀DV₀⊤ be the top-k SVD of (d₁d₂/m)P_Ω(M). Then there exists a universal constant c such that, for m large enough, with high probability, we have:

  \|M - U_0 V_0^\top\|_F \le \frac{1}{20\kappa}\|M\| = \frac{1}{20}\sigma_{\min}(M)
###### Proof.

Since M is a rank-k matrix, we know σ_{k+1}(M) = 0, thus

  \sigma_{k+1}\Big(\frac{d_1 d_2}{m}P_\Omega(M)\Big) \le \sigma_{k+1}(M) + \Big\|\frac{d_1 d_2}{m}P_\Omega(M) - M\Big\| = \Big\|\frac{d_1 d_2}{m}P_\Omega(M) - M\Big\|

Therefore:

  \|M - U_0 V_0^\top\| \le \Big\|M - \frac{d_1 d_2}{m}P_\Omega(M)\Big\| + \Big\|\frac{d_1 d_2}{m}P_\Omega(M) - U_0 V_0^\top\Big\| \le \Big\|M - \frac{d_1 d_2}{m}P_\Omega(M)\Big\| + \sigma_{k+1}\Big(\frac{d_1 d_2}{m}P_\Omega(M)\Big) \le 2\Big\|M - \frac{d_1 d_2}{m}P_\Omega(M)\Big\|

Meanwhile, since rank(M) = k and rank(U₀V₀⊤) ≤ k, we know rank(M − U₀V₀⊤) ≤ 2k, and therefore:

  \|M - U_0 V_0^\top\|_F \le \sqrt{2k}\,\|M - U_0 V_0^\top\| \le 2\sqrt{2k}\,\Big\|M - \frac{d_1 d_2}{m}P_\Omega(M)\Big\|

By choosing m ≥ c·μdk²κ²log d for a large enough constant c and applying Lemma A.3, we finish the proof. ∎

### A.2 Incoherence of Initialization

###### Lemma A.5.

Let U₀DV₀⊤ be the top-k SVD of (d₁d₂/m)P_Ω(M). Then there exists a universal constant c such that, for m large enough, with high probability, we have:

  \max_j \big\|e_j^\top\big(M^\top - V_0 U_0^\top\big)\big\| \le 2\sqrt{\frac{\mu k}{d_2}}\,\|M\|
###### Proof.

Suppose . Denote and