1 Introduction
Low rank matrix completion refers to the problem of recovering a low rank matrix by observing the values of only a tiny fraction of its entries. This problem arises in several applications such as video denoising [14], phase retrieval [3] and, most famously, movie recommendation engines [16]. In the context of recommendation engines, for instance, the matrix we wish to recover would be the user-item rating matrix, where each row corresponds to a user and each column corresponds to an item. Each entry of the matrix is the rating given by a user to an item. The low rank assumption on the matrix is inspired by the intuition that the rating of an item by a user depends on only a few hidden factors, which are much fewer than the number of users or items. The goal is to estimate the ratings of all items by users given only partial ratings of items by users, which would then be helpful in recommending new items to users.
The seminal work of Candès and Recht [4] first identified regularity conditions under which low rank matrix completion can be solved in polynomial time using convex relaxation; without such regularity assumptions, low rank matrix completion could be ill-posed and NP-hard in general [10]. Since then, a number of works have studied various algorithms under different settings for matrix completion: weighted and noisy matrix completion, fast convex solvers, fast iterative nonconvex solvers, parallel and distributed algorithms and so on.
Most of this work, however, deals only with the offline setting, where all the observed entries are revealed at once and the recovery procedure does computation using all these observations simultaneously. In several applications [5, 19], however, we encounter the online setting, where observations are only revealed sequentially and at each step the recovery algorithm is required to maintain an estimate of the low rank matrix based on the observations so far. Consider for instance recommendation engines, where the low rank matrix we are interested in is the user-item rating matrix. While we make an observation only when a user rates an item, at any point of time we should have an estimate of the user-item rating matrix based on all prior observations, so as to be able to continuously recommend items to users. Moreover, this estimate should get better as we observe more ratings.
Algorithms for offline matrix completion can be used to solve the online version by rerunning the algorithm after every additional observation. However, performing so much computation for every observation seems wasteful and is also impractical. For instance, using alternating minimization, which is among the fastest known algorithms for the offline problem, would mean that we take several passes of the entire data for every additional observation. This is simply not feasible in most settings. Another natural approach is to group observations into batches and do an update only once for each batch. This however induces a lag between observations and estimates which is undesirable. To the best of our knowledge, there is no known provable, efficient, online algorithm for matrix completion.
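On the other hand, to deal with the online matrix completion scenario in practical applications, several heuristics (with no convergence guarantees) have been proposed in the literature [2, 20]. Most of these approaches start with an estimate of the matrix and perform fast updates of this estimate whenever a new observation is presented. One of the update procedures used in this context is stochastic gradient descent (SGD) applied to the following nonconvex optimization problem:
$$\min_{\mathbf{U}, \mathbf{V}} \; \|\mathbf{M} - \mathbf{U}\mathbf{V}^\top\|_F^2, \qquad (1)$$
where $\mathbf{M}$ is the unknown matrix of size $d_1 \times d_2$, $k$ is the rank of $\mathbf{M}$ and $\mathbf{U}\mathbf{V}^\top$ is a low rank factorization of $\mathbf{M}$ we wish to obtain. The algorithm starts with some $\mathbf{U}_0$ and $\mathbf{V}_0$, and given a new observation $(\mathbf{M})_{ij}$, SGD updates the $i$-th row and the $j$-th row of the current iterates $\mathbf{U}_t$ and $\mathbf{V}_t$ respectively by
$$\mathbf{U}_{t+1}^{(i)} = \mathbf{U}_t^{(i)} - 2\eta d_1 d_2 (\mathbf{U}_t\mathbf{V}_t^\top - \mathbf{M})_{ij}\, \mathbf{V}_t^{(j)}, \qquad \mathbf{V}_{t+1}^{(j)} = \mathbf{V}_t^{(j)} - 2\eta d_1 d_2 (\mathbf{U}_t\mathbf{V}_t^\top - \mathbf{M})_{ij}\, \mathbf{U}_t^{(i)}, \qquad (2)$$
where $\eta$ is an appropriately chosen stepsize, and $\mathbf{U}^{(i)}$ denotes the $i$-th row of matrix $\mathbf{U}$. Note that each update modifies only one row of the factor matrices $\mathbf{U}$ and $\mathbf{V}$, and the computation involves only one row of $\mathbf{U}$, one row of $\mathbf{V}$ and the newly observed entry $(\mathbf{M})_{ij}$, so the updates are extremely fast. These fast updates make SGD extremely appealing in practice. Moreover, SGD, in the context of matrix completion, is also useful for parallelization and distributed implementation [24].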
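The row-wise update described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function name `sgd_step`, the list-of-rows representation, and the `d1 * d2` rescaling of the single-entry gradient (chosen so that the step is an unbiased gradient of the objective under uniform sampling of entries) are assumptions made here.

```python
def sgd_step(U, V, i, j, m_ij, eta):
    """One SGD step on ||M - U V^T||_F^2 from a single observed entry M[i][j].

    U is d1 x k and V is d2 x k, both given as lists of rows. Only row i of U
    and row j of V are modified, so the cost of a step is O(k).
    """
    d1, d2 = len(U), len(V)
    u_i, v_j = U[i][:], V[j][:]              # copies of the old rows
    residual = sum(a * b for a, b in zip(u_i, v_j)) - m_ij
    # Assumed scaling: d1 * d2 makes the single-entry gradient unbiased.
    scale = 2.0 * eta * d1 * d2 * residual
    U[i] = [a - scale * b for a, b in zip(u_i, v_j)]
    V[j] = [b - scale * a for a, b in zip(u_i, v_j)]
    return residual


# Hypothetical rank-1 target M = [[1, 2], [2, 4]]; we observe M[0][1] = 2.
U = [[0.9], [2.1]]
V = [[1.1], [1.9]]
before = abs(U[0][0] * V[1][0] - 2.0)
sgd_step(U, V, 0, 1, 2.0, eta=0.01)
after = abs(U[0][0] * V[1][0] - 2.0)
```

With a small stepsize the error on the observed entry shrinks, while the rows that were not touched (`U[1]` and `V[0]`) stay exactly as they were.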
1.1 Our Contributions
In this work we present the first provable, efficient algorithm for online matrix completion by showing that SGD (2) with a good initialization converges to a true factorization of $\mathbf{M}$ at a geometric rate. Our main contributions are as follows.

We provide the first provable, efficient, online algorithm for matrix completion. Starting with a good initialization, after each observation, the algorithm makes quick updates each taking time and requires observations to reach accuracy, where is the incoherence parameter, , is the rank and is the condition number of .

Moreover, our result features both sample complexity and total runtime linear in $d$, and is competitive with even the best existing offline results for matrix completion (it either improves over these results or is incomparable to them, i.e., better in some parameters and worse in others). See Table 1 for the comparison.

To obtain our results, we introduce a general framework to show that SGD updates tend to stay away from saddle surfaces. To do so, we consider distances from saddle surfaces, show that they behave like submartingales under SGD updates, and use martingale convergence techniques to conclude that the iterates stay away from saddle surfaces. While [25] shows that SGD updates stay away from saddle surfaces, the stepsizes they can handle are quite small (scaling as ), leading to suboptimal computational complexity. Our framework makes it possible to establish the same statement for much larger stepsizes, giving us near-optimal runtime. We believe these techniques may be applicable in other nonconvex settings as well.
1.2 Related Work
In this section we will mention some more related work.
Offline matrix completion: There has been a lot of work on designing offline algorithms for matrix completion; we provide a detailed comparison with our algorithm in Table 1. The nuclear norm relaxation algorithm [23] has near-optimal sample complexity for this problem but is computationally expensive. Motivated by the empirical success of nonconvex heuristics, a long line of work ([15, 9, 13, 25] and so on) has obtained convergence guarantees for alternating minimization, gradient descent, projected gradient descent, etc. Even the best of these are suboptimal in sample complexity by factors. Our sample complexity is better than that of [15] and is incomparable to those of [9, 13]. To the best of our knowledge, the only provable online algorithm for this problem is that of Sun and Luo [25]. However, the stepsizes they suggest are quite small, leading to suboptimal computational complexity by factors of . The runtime of our algorithm is linear in $d$, which improves over it.
Other models for online matrix completion: Another variant of online matrix completion studied in the literature is where observations are made on a column by column basis e.g., [17, 27]. These models can give improved offline performance in terms of space and could potentially work under relaxed regularity conditions. However, they do not tackle the version where only entries (as opposed to columns) are observed.
Nonconvex optimization: Over the last few years, there has also been a significant amount of work on designing other efficient algorithms for solving nonconvex problems. Examples include eigenvector computation [6, 12], sparse coding [21, 1], etc. For general nonconvex optimization, an interesting line of recent work is that of [7], which proves that gradient descent with noise can also escape saddle points, but provides only a polynomial rate without explicit dependence. Later, [18, 22] show that without noise, the set of points from which gradient descent converges to a saddle point has measure zero. However, they do not provide a rate of convergence. Another piece of work related to ours is [11], which proves global convergence, along with rates of convergence, for the special case of computing the matrix squareroot. During the preparation of this draft, the recent work [8] was announced, which proves the global convergence of SGD for matrix completion and can also be applied to the online setting. However, their result only deals with the case where $\mathbf{M}$ is positive semidefinite (PSD), and their rate is still suboptimal.

Table 1: Comparison of our algorithm with existing algorithms.
Algorithm  Sample complexity  Total runtime  Online?
Nuclear Norm [23]  No
Alternating minimization [15]  No
Alternating minimization [9]  No
Projected gradient descent [13]  No
SGD [25]  Yes
SGD [8] (this result only applies to the case where $\mathbf{M}$ is symmetric PSD)  Yes
Our result  Yes
1.3 Outline
The rest of the paper is organized as follows. In Section 2 we formally describe the problem and all relevant parameters. In Section 3, we present our algorithms, results and some of the key intuition behind our results. In Section 4 we give a proof outline for our main results. We conclude in Section 5. All formal proofs are deferred to the Appendix.
2 Preliminaries
In this section, we introduce our notation, formally define the matrix completion problem and regularity assumptions that make the problem tractable.
2.1 Notation
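We use $[d]$ to denote $\{1, \dots, d\}$. We use bold capital letters to denote matrices and bold lowercase letters to denote vectors. $\mathbf{M}_{ij}$ means the $(i,j)$-th entry of matrix $\mathbf{M}$. $\|\mathbf{w}\|$ denotes the $\ell_2$ norm of vector $\mathbf{w}$ and $\|\mathbf{M}\|$ / $\|\mathbf{M}\|_F$ / $\|\mathbf{M}\|_\infty$ denotes the spectral/Frobenius/infinity norm of matrix $\mathbf{M}$. $\sigma_i(\mathbf{M})$ denotes the $i$-th largest singular value of $\mathbf{M}$ and $\sigma_{\min}(\mathbf{M})$ denotes the smallest singular value of $\mathbf{M}$. We also let $\kappa(\mathbf{M}) = \|\mathbf{M}\| / \sigma_{\min}(\mathbf{M})$ denote the condition number of $\mathbf{M}$ (i.e., the ratio of largest to smallest singular value). Finally, for an orthonormal basis $\mathbf{W}$ of a subspace, we also use $\mathcal{P}_{\mathbf{W}} = \mathbf{W}\mathbf{W}^\top$ to denote the projection to the subspace spanned by $\mathbf{W}$.
2.2 Problem statement and assumptions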
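Consider a general rank $k$ matrix $\mathbf{M} \in \mathbb{R}^{d_1 \times d_2}$. Let $\Omega \subset [d_1] \times [d_2]$ be a subset of coordinates, which are sampled uniformly and independently from $[d_1] \times [d_2]$. We denote $\mathcal{P}_\Omega(\mathbf{M})$ to be the projection of $\mathbf{M}$ on set $\Omega$ so that:
$$[\mathcal{P}_\Omega(\mathbf{M})]_{ij} = \begin{cases} \mathbf{M}_{ij}, & \text{if } (i,j) \in \Omega, \\ 0, & \text{otherwise.} \end{cases}$$
Low rank matrix completion is the task of recovering $\mathbf{M}$ by only observing $\mathcal{P}_\Omega(\mathbf{M})$. This task is ill-posed and NP-hard in general [10]. In order to make this tractable, we make by-now standard assumptions about the structure of $\mathbf{M}$.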
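As a small illustration of the observation model, the sketch below implements the projection onto an observed set as "keep the observed entries, zero out the rest". The function name `project` and the representation of the observed set as a Python set of `(i, j)` pairs are assumptions made for this example.

```python
def project(M, omega):
    """Return the projection of M onto omega: entries whose (i, j) is in
    omega are kept, all other entries are set to zero."""
    return [[M[i][j] if (i, j) in omega else 0.0 for j in range(len(M[0]))]
            for i in range(len(M))]


# Hypothetical 2 x 2 matrix with two observed coordinates.
M = [[1.0, 2.0], [3.0, 4.0]]
omega = {(0, 1), (1, 0)}
observed = project(M, omega)
```

Matrix completion then asks to recover `M` from `observed` and `omega` alone, which is only possible under structural assumptions such as low rank and incoherence.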
Definition 2.1.
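Let $\mathbf{W} \in \mathbb{R}^{d \times k}$ be an orthonormal basis of a subspace of $\mathbb{R}^d$ of dimension $k$. The coherence of $\mathbf{W}$ is defined to be
$$\mu(\mathbf{W}) = \frac{d}{k} \max_{1 \le i \le d} \|\mathcal{P}_{\mathbf{W}} e_i\|^2 = \frac{d}{k} \max_{1 \le i \le d} \|e_i^\top \mathbf{W}\|^2.$$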
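To make the notion of coherence concrete, the following sketch computes it for two extreme bases, assuming the standard definition $(d/k)\max_i \|e_i^\top \mathbf{W}\|^2$, i.e., the largest squared row norm of the basis rescaled by $d/k$. The function name `coherence` and the example bases are hypothetical.

```python
import math


def coherence(W):
    """Coherence (d / k) * max_i ||e_i^T W||^2 of a d x k orthonormal basis W,
    given as a list of d rows of length k."""
    d, k = len(W), len(W[0])
    return (d / k) * max(sum(x * x for x in row) for row in W)


# Spread-out basis: every coordinate carries equal mass.
flat = [[1 / math.sqrt(4)] for _ in range(4)]
# Spiky basis: a standard basis vector concentrates all mass on one row.
spiky = [[1.0], [0.0], [0.0], [0.0]]
```

Here `coherence(flat)` attains the smallest possible value 1, while `coherence(spiky)` attains the largest possible value `d / k = 4`. A matrix whose singular vectors look like `spiky` cannot be recovered from a few random entries, since most samples carry no information about it.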
3 Main Results
In this section, we present our main result. We will first state the result for the special case where $\mathbf{M}$ is a symmetric positive semidefinite (PSD) matrix, where the algorithm and analysis are much simpler. We will then discuss the general case.
3.1 Symmetric PSD Case
Consider the special case where $\mathbf{M}$ is symmetric PSD. We let $d = d_1 = d_2$, and we can parametrize a rank $k$ symmetric PSD matrix by $\mathbf{U}\mathbf{U}^\top$ where $\mathbf{U} \in \mathbb{R}^{d \times k}$. Our algorithm for this case is given in Algorithm 1. The following theorem provides guarantees on the performance of Algorithm 1. The algorithm starts by using an initial set of samples to construct a crude approximation to the low rank factorization of $\mathbf{M}$. It then observes samples one at a time and updates its factorization after every observation. Note that each update step modifies two rows of $\mathbf{U}$ and hence takes time $O(k)$.
Theorem 3.1.
Let be a rank , symmetric PSD matrix with incoherence. There exist some absolute constants and such that if , learning rate , then for any fixed
, with probability at least
, we will have for all that:
Remarks:

The algorithm uses an initial set of observations to produce a warm start iterate , then enters the online stage, where it performs SGD.

The sample complexity of the warm start phase is . The initialization consists of a top SVD on a sparse matrix, whose runtime is .

For the online phase (SGD), if we choose , the number of observations required for the error to be smaller than is .

Since each SGD step modifies two rows of , its runtime is with a total runtime for online phase of .
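The online phase for the symmetric PSD case can be sketched as follows. This is a hedged illustration rather than the paper's Algorithm 1: the function name `psd_sgd_step`, the `d * d` rescaling of the single-entry gradient, and the handling of a diagonal observation (`i == j`) are assumptions made here, and the warm start via a top-$k$ SVD is omitted.

```python
def psd_sgd_step(U, i, j, m_ij, eta):
    """One online SGD step for the symmetric PSD case, where M is modeled as
    U U^T with U a d x k list of rows. Only rows i and j of U change."""
    d = len(U)
    u_i, u_j = U[i][:], U[j][:]              # copies of the old rows
    residual = sum(a * b for a, b in zip(u_i, u_j)) - m_ij
    scale = 2.0 * eta * d * d * residual     # assumed d^2 rescaling
    if i == j:
        # A diagonal entry touches the same row twice.
        U[i] = [a - 2.0 * scale * a for a in u_i]
    else:
        U[i] = [a - scale * b for a, b in zip(u_i, u_j)]
        U[j] = [b - scale * a for a, b in zip(u_i, u_j)]
    return residual


# Hypothetical 3 x 1 iterate; we observe the entry M[0][1] = 1.
U = [[1.0], [0.5], [0.2]]
before = abs(U[0][0] * U[1][0] - 1.0)
psd_sgd_step(U, 0, 1, 1.0, eta=0.01)
after = abs(U[0][0] * U[1][0] - 1.0)
```

Each step computes one inner product of two length-$k$ rows and rewrites at most two rows, which is where the per-step cost linear in $k$ comes from.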
Our proof approach is essentially to show that the objective function is well-behaved (i.e., is smooth and strongly convex) in a local neighborhood of the warm start region, and then use standard techniques to show that SGD obtains geometric convergence in this setting. The most challenging and novel part of our analysis consists of showing that the iterate does not leave this local neighborhood while performing SGD updates. Refer to Section 4 for more details on the proof outline.
3.2 General Case
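Let us now consider the general case where $\mathbf{M} \in \mathbb{R}^{d_1 \times d_2}$ can be factorized as $\mathbf{U}\mathbf{V}^\top$ with $\mathbf{U} \in \mathbb{R}^{d_1 \times k}$ and $\mathbf{V} \in \mathbb{R}^{d_2 \times k}$. In this scenario, we denote $d = \max(d_1, d_2)$. We recall our remarks from the previous section that our analysis of the performance of SGD depends on the smoothness and strong convexity properties of the objective function in a local neighborhood of the iterates. Having two separate factors $\mathbf{U}$ and $\mathbf{V}$ introduces additional challenges in this approach, since for any nonsingular $k$-by-$k$ matrix $\mathbf{C}$, and $\mathbf{U}' = \mathbf{U}\mathbf{C}^{-1}$, $\mathbf{V}' = \mathbf{V}\mathbf{C}^\top$, we have $\mathbf{U}'\mathbf{V}'^\top = \mathbf{U}\mathbf{V}^\top$. Suppose for instance $\mathbf{C}$ is a very small scalar times the identity, i.e., $\mathbf{C} = \delta \mathbf{I}$ for some small $\delta > 0$. In this case, $\mathbf{U}'$ will be large while $\mathbf{V}'$ will be small. This drastically deteriorates the smoothness and strong convexity properties of the objective function in a neighborhood of $(\mathbf{U}', \mathbf{V}')$.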
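The scaling ambiguity discussed above is easy to check numerically. The sketch below rescales one factor by $1/\delta$ and the other by $\delta$ (one particular choice of the nonsingular matrix, namely a multiple of the identity) and confirms that the product is unchanged while the two factors become badly unbalanced. The helper names `matmul_t` and `frob_sq` and the example values are hypothetical.

```python
def matmul_t(U, V):
    """Return U V^T for U (d1 x k) and V (d2 x k), given as lists of rows."""
    return [[sum(a * b for a, b in zip(u, v)) for v in V] for u in U]


def frob_sq(A):
    """Squared Frobenius norm of a matrix given as a list of rows."""
    return sum(x * x for row in A for x in row)


U = [[1.0, 2.0], [3.0, 4.0]]
V = [[0.5, 1.0], [1.5, 2.0]]
delta = 0.125                                # small scalar, C = delta * I
U_scaled = [[x / delta for x in row] for row in U]
V_scaled = [[x * delta for x in row] for row in V]

same_product = all(
    abs(a - b) < 1e-9
    for row_a, row_b in zip(matmul_t(U, V), matmul_t(U_scaled, V_scaled))
    for a, b in zip(row_a, row_b)
)
```

Both pairs represent the same matrix, yet `frob_sq(U_scaled)` is 64 times larger than `frob_sq(U)` and `frob_sq(V_scaled)` is 64 times smaller than `frob_sq(V)`. This imbalance is what the renormalization step is meant to rule out.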
To preclude such a scenario, we would ideally like to renormalize after each step by doing , where is the SVD of matrix . This algorithm is described in Algorithm 2. However, a naive implementation of Algorithm 2, especially the SVD step, would incur computation per iteration, resulting in a runtime overhead of over both the online PSD case (i.e., Algorithm 1) as well as the near linear time offline algorithms (see Table 1). It turns out that we can take advantage of the fact that in each iteration we only update a single row of and a single row of , and do efficient (but more complicated) update steps instead of doing an SVD on matrix. The resulting algorithm is given in Algorithm 3. The key idea is that in order to implement the updates, it suffices to do an SVD of and which are matrices. So the runtime of each iteration is at most . The following lemma shows the equivalence between Algorithms 2 and 3.
Lemma 3.2.
Since the output of both algorithms is the same, we can analyze Algorithm 2 (whose analysis is easier than that of Algorithm 3), while implementing Algorithm 3 in practice. The following theorem is the main result of our paper and presents guarantees on the performance of Algorithm 2.
Theorem 3.3.
Let be a rank matrix with incoherence and let . There exist some absolute constants and such that if , learning rate , then for any fixed , with probability at least , we will have for all that:
Remarks:

The sample complexity and runtime of the warm start phase are the same as in symmetric PSD case. The stepsize and the number of observations to achieve error in online phase (SGD) are also the same as in symmetric PSD case.

However, runtime of each update step in online phase is with total runtime for online phase .
The proof of this theorem again follows a similar line of reasoning as that of Theorem 3.1, first showing that the local neighborhood of the warm start iterate has good smoothness and strong convexity properties and then using them to show geometric convergence of SGD. The proof that the iterates do not move away from this local neighborhood, however, is significantly more challenging due to the renormalization steps in the algorithm. Please see Appendix C for the full proof.
4 Proof Sketch
In this section we provide the intuition and a proof sketch for our main results. For simplicity, and to highlight the most essential ideas, we will mostly focus on the symmetric PSD case (Theorem 3.1). For the asymmetric case, though the high-level ideas are still valid, a lot of additional effort is required to address the renormalization step in Algorithm 2. This makes the proof more involved.
First, note that our algorithm for the PSD case consists of an initialization followed by stochastic descent steps. The following lemma provides guarantees on the error achieved by the initial iterate $\mathbf{U}_0$.
Lemma 4.1.
Let be a rank PSD matrix with incoherence. There exists a constant such that if , then with probability at least , the top SVD of satisfies Then there exists universal constant , for any , we have:
(3) 
By Lemma 4.1, we know the initialization algorithm already gives $\mathbf{U}_0$ in the local region given by Eq.(3). Intuitively, stochastic descent steps should keep doing local search within this local region.
To establish linear convergence and obtain the final result, we first establish several important lemmas describing the properties of this local region. Throughout this section, we always denote the SVD of $\mathbf{M}$ by $\mathbf{X}\mathbf{S}\mathbf{X}^\top$, where $\mathbf{X} \in \mathbb{R}^{d \times k}$ and $\mathbf{S} \in \mathbb{R}^{k \times k}$ is a diagonal matrix. We postpone all the formal proofs to the Appendix.
Lemma 4.2.
For function and any , we have:
Lemma 4.3.
For function and any , we have:
Lemma 4.2 tells us that the function is smooth if the spectral norm of is not very large. On the other hand, not too small requires both and are not too small, where is the top-k eigenspace of . That is, Lemma 4.3 tells us that the function has a property similar to strong convexity in the standard optimization literature, if is rank k in a robust sense ( is not too small), and the angle between the top-k eigenspace of and the top-k eigenspace is not large.
Lemma 4.4.
Within the region , we have:
Lemma 4.4 tells inside region , matrix always has a good spectral property which gives preconditions for both Lemma 4.2 and 4.3, where is both smooth and has a property very similar to strongly convex.
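With the above three lemmas, we are already able to see the intuition behind linear convergence in Theorem 3.1. Denote the stochastic gradient
$$SG(\mathbf{U}) = 2 d^2 (\mathbf{U}\mathbf{U}^\top - \mathbf{M})_{ij} (e_i e_j^\top + e_j e_i^\top)\mathbf{U}, \qquad (4)$$
where $SG(\mathbf{U})$ is a random matrix that depends on the randomness of the sample $(i,j)$ of matrix $\mathbf{M}$. Then, the stochastic update step in Algorithm 1 can be rewritten as:
$$\mathbf{U}_{t+1} \leftarrow \mathbf{U}_t - \eta\, SG(\mathbf{U}_t).$$
Let $f(\mathbf{U}) = \|\mathbf{M} - \mathbf{U}\mathbf{U}^\top\|_F^2$. By an easy calculation, we know $\mathbb{E}\, SG(\mathbf{U}) = \nabla f(\mathbf{U})$, that is, $SG(\mathbf{U})$ is unbiased. Combining Lemma 4.4 with Lemma 4.2 and Lemma 4.3, we know that within the region specified by Lemma 4.4, the function $f$ is smooth and has the property similar to strong convexity given by Lemma 4.3.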
Let’s suppose ideally, we always have inside region , this directly gives:
One interesting aspect of our main result is that we actually show linear convergence in the presence of noise in the gradient. This is true because for the second-order ($\eta^2$) term above, we can roughly see from Eq.(4) that $\|SG(\mathbf{U})\|_F^2 \le h(\mathbf{U}) \cdot f(\mathbf{U})$, where $h(\mathbf{U})$ is a factor that depends on $\mathbf{U}$ and is always bounded. That is, $SG(\mathbf{U})$ enjoys a self-bounded property: $\|SG(\mathbf{U})\|_F^2$ goes to zero as the objective function $f(\mathbf{U})$ goes to zero. Therefore, by choosing the learning rate $\eta$ appropriately small, we can have the first-order term always dominate the second-order term, which establishes the linear convergence.
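The argument above can be written out as a short worked derivation. This is a generic sketch under assumed constants, not the paper's exact statement: $\beta$ stands for a smoothness constant, $\alpha$ for the constant in the strong-convexity-like property, and $C$ for a bound on the self-bounding factor; all three symbols are introduced here for illustration only.

```latex
% Assumptions (hypothetical constants), valid inside the local region:
%   (i)   f is \beta-smooth,
%   (ii)  \|\nabla f(U)\|_F^2 \ge \alpha f(U),
%   (iii) E\|SG(U)\|_F^2 \le C f(U)  and  E\,SG(U) = \nabla f(U).
\begin{align*}
\mathbb{E}\, f(U_{t+1})
  &\le f(U_t) - \eta \,\langle \nabla f(U_t), \mathbb{E}\, SG(U_t) \rangle
       + \frac{\beta \eta^2}{2}\, \mathbb{E}\, \|SG(U_t)\|_F^2 \\
  &\le f(U_t) - \eta \,\|\nabla f(U_t)\|_F^2 + \frac{\beta C \eta^2}{2}\, f(U_t) \\
  &\le \Big(1 - \eta \alpha + \frac{\beta C \eta^2}{2}\Big) f(U_t)
   \;\le\; \Big(1 - \frac{\eta \alpha}{2}\Big) f(U_t)
   \qquad \text{whenever } \eta \le \frac{\alpha}{\beta C}.
\end{align*}
```

The first-order term contributes $-\eta\alpha f(U_t)$ and the noise contributes only $O(\eta^2) f(U_t)$, so for a small enough stepsize the objective contracts by a constant factor in expectation at every step, despite the gradient noise.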
Now, the only remaining issue is to prove that the iterate always stays inside the local region. In reality, we can only prove this statement with high probability due to the stochastic nature of the update. This is also the most challenging part of our proof, which makes our analysis different from standard convex analysis and is uniquely required by the nonconvex setting.
Our key theorem is presented as follows:
Theorem 4.5.
Let and . Suppose initial satisfying:
Then, there exist some absolute constant such that for any learning rate , with at least probability, we will have for all that:
(5) 
Note function indicates the incoherence of matrix . Theorem 4.5 guarantees if inital is in the local region which is incoherent and is close to , then with high probability for all steps , , will always stay in a slightly relaxed local region, and has linear convergence.
It is not hard to show that all saddle points of satisfy , and all local minima are global minima. Since the iterates automatically stay in the region with high probability, we know they also stay away from all saddle points. The claim that the iterate stays incoherent is essential to better control the variance and probability 1 bound of , so that we can have a large step size and a tight convergence rate.
The major challenge in proving Theorem 4.5 is to both prove that the iterate stays in the local region and achieve good sample complexity and running time (linear in $d$) at the same time. This also requires the learning rate in Algorithm 1 to be relatively large. Let the event denote the good event where satisfies Eq.(5). Theorem 4.5 is claiming that is large. The essential step in the proof is constructing two supermartingales related to and (where denotes the indicator function), and using the Bernstein inequality to show the concentration of the supermartingales. The term allows us to claim that all previous have all the desired properties inside the local region.
5 Conclusion
In this paper, we presented the first provable, efficient online algorithm for matrix completion, based on nonconvex SGD. In addition to the online setting, our results are also competitive with state of the art results in the offline setting. We obtain our results by introducing a general framework that helps us show how SGD updates selfregulate to stay away from saddle points. We hope our paper and results help generate interest in online matrix completion, and our techniques and framework prompt tighter analysis for other nonconvex problems.
References
 [1] Sanjeev Arora, Rong Ge, Tengyu Ma, and Ankur Moitra. Simple, efficient, and neural algorithms for sparse coding. arXiv preprint arXiv:1503.00778, 2015.
 [2] Matthew Brand. Fast online svd revisions for lightweight recommender systems. In SDM, pages 37–46. SIAM, 2003.
 [3] Emmanuel J Candes, Yonina C Eldar, Thomas Strohmer, and Vladislav Voroninski. Phase retrieval via matrix completion. SIAM Review, 57(2):225–251, 2015.
 [4] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, December 2009.
 [5] James Davidson, Benjamin Liebald, Junning Liu, Palash Nandy, Taylor Van Vleet, Ullas Gargi, Sujoy Gupta, Yu He, Mike Lambert, Blake Livingston, et al. The youtube video recommendation system. In Proceedings of the fourth ACM conference on Recommender systems, pages 293–296. ACM, 2010.
 [6] Christopher De Sa, Kunle Olukotun, and Christopher Ré. Global convergence of stochastic gradient descent for some nonconvex matrix problems. arXiv preprint arXiv:1411.1134, 2014.
 [7] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. arXiv preprint arXiv:1503.02101, 2015.
 [8] Rong Ge, Jason D. Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. arXiv preprint arXiv:1605.07272, 2016.
 [9] Moritz Hardt. Understanding alternating minimization for matrix completion. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pages 651–660. IEEE, 2014.
 [10] Moritz Hardt, Raghu Meka, Prasad Raghavendra, and Benjamin Weitz. Computational limits for matrix completion. In COLT, pages 703–725, 2014.
 [11] Prateek Jain, Chi Jin, Sham M Kakade, and Praneeth Netrapalli. Computing matrix squareroot via non convex local search. arXiv preprint arXiv:1507.05854, 2015.
 [12] Prateek Jain, Chi Jin, Sham M Kakade, Praneeth Netrapalli, and Aaron Sidford. Matching matrix Bernstein with little memory: Near-optimal finite sample guarantees for Oja’s algorithm. arXiv preprint arXiv:1602.06929, 2016.
 [13] Prateek Jain and Praneeth Netrapalli. Fast exact matrix completion with finite samples. arXiv preprint arXiv:1411.1087, 2014.
 [14] Hui Ji, Chaoqiang Liu, Zuowei Shen, and Yuhong Xu. Robust video denoising using low rank matrix completion. 2010.
 [15] Raghunandan Hulikal Keshavan. Efficient algorithms for collaborative filtering. PhD thesis, STANFORD UNIVERSITY, 2012.
 [16] Yehuda Koren. The BellKor solution to the Netflix grand prize, 2009.

 [17] Akshay Krishnamurthy and Aarti Singh. Low-rank matrix and tensor completion via adaptive sampling. In Advances in Neural Information Processing Systems, pages 836–844, 2013.
 [18] Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent converges to minimizers. University of California, Berkeley, 1050:16, 2016.
 [19] G. Linden, B. Smith, and J. York. Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Computing, 7(1):76–80, Jan 2003.
 [20] Xin Luo, Yunni Xia, and Qingsheng Zhu. Incremental collaborative filtering recommender based on regularized matrix factorization. Knowledge-Based Systems, 27:271–280, 2012.

 [21] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 11:19–60, 2010.
 [22] Ioannis Panageas and Georgios Piliouras. Gradient descent converges to minimizers: The case of non-isolated critical points. arXiv preprint arXiv:1605.00405, 2016.
 [23] Benjamin Recht. A simple approach to matrix completion, 2009.
 [24] Benjamin Recht and Christopher Ré. Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, 5(2):201–226, 2013.
 [25] Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via nonconvex factorization. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pages 270–289. IEEE, 2015.
 [26] Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.
 [27] Se-Young Yun, Marc Lelarge, and Alexandre Proutiere. Streaming, memory limited matrix completion with noise. arXiv preprint arXiv:1504.03156, 2015.
Appendix A Proof of Initialization
In this section, we will prove Lemma 4.1 and a corresponding lemma for asymmetric case as follows (which will be used to prove Theorem 3.3):
Lemma A.1.
Assume is a rank matrix with incoherence, and is a subset unformly i.i.d sampled from all coordinate. Let be the top SVD of , where . Let . Then there exists universal constant , for any , with probability at least , we have:
(6) 
We will focus mostly on Lemma A.1, and prove Lemma 4.1 as a special case. Most of the argument in this section follows [15]; we include it here for completeness. The remainder of this section proves the Frobenius norm claim and the incoherence claim of Lemma A.1 separately.
In this section, We always denote . For simplicity, WLOG, we also assume in all proof. Also, when it’s clear from the context, we use to specifically to represent . Then . Also in the proof, we always denote , and , where and are diagonal matrix.
A.1 Frobenius Norm of Initialization
Theorem A.2 (Matrix Bernstein [26]).
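Consider a finite sequence $\{\mathbf{X}_k\}$ of independent, random matrices with dimension $d_1 \times d_2$. Assume that each matrix satisfies:
$$\mathbb{E}\, \mathbf{X}_k = 0 \quad \text{and} \quad \|\mathbf{X}_k\| \le R \ \text{almost surely}.$$
Define
$$\sigma^2 = \max\Big\{ \Big\|\sum_k \mathbb{E}(\mathbf{X}_k \mathbf{X}_k^\top)\Big\|, \; \Big\|\sum_k \mathbb{E}(\mathbf{X}_k^\top \mathbf{X}_k)\Big\| \Big\}.$$
Then, for all $t \ge 0$,
$$\mathbb{P}\Big( \Big\|\sum_k \mathbf{X}_k\Big\| \ge t \Big) \le (d_1 + d_2) \exp\Big( \frac{-t^2/2}{\sigma^2 + Rt/3} \Big).$$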
Lemma A.3.
Let , then there exists universal constant , for any , with probability at least , we have:
Proof.
We know
and note:
where are independent Bernoulli random variables. Let matrix
By construction, we have:
Clearly . Let , then by incoherence of , with probability 1:
Also:
Then, by matrix Bernstein (Theorem A.2), we have:
That is, with probability at least , for some universal constant , we have:
For , this finishes the proof. ∎
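The zero-mean property that matrix Bernstein requires can be checked by brute force on a tiny example. The sketch below assumes the sampling model used in the proof, namely that each entry is observed independently with probability $p$, and assumes that the quantity being controlled is the observed matrix rescaled by $1/p$; both the rescaling and the function name `expected_rescaled_projection` are assumptions made here, since the corresponding formulas are not shown. It enumerates every possible observation pattern and computes the exact expectation.

```python
from itertools import product


def expected_rescaled_projection(M, p):
    """Exact expectation of (1/p) * (projection of M onto the observed set)
    when each entry of M is observed independently with probability p."""
    coords = [(i, j) for i in range(len(M)) for j in range(len(M[0]))]
    mean = [[0.0] * len(M[0]) for _ in M]
    for mask in product([0, 1], repeat=len(coords)):
        prob = 1.0
        for bit in mask:
            prob *= p if bit else (1.0 - p)   # probability of this pattern
        for bit, (i, j) in zip(mask, coords):
            if bit:
                mean[i][j] += prob * M[i][j] / p
    return mean


# Hypothetical 2 x 2 matrix and sampling probability.
M = [[1.0, -2.0], [3.0, 0.5]]
mean = expected_rescaled_projection(M, p=0.25)
```

Up to floating-point error, `mean` equals `M`, so the difference between the rescaled observed matrix and `M` is a sum of independent, zero-mean random matrices, which is the form to which a matrix Bernstein bound applies.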
Theorem A.4.
Let be the top SVD of , where then there exists universal constant , for any , with probability at least , we have:
Proof.
Since is a rank matrix, we know , thus
Therefore:
Meanwhile, since , , we know: , and therefore:
By choosing for a large enough constant and applying Lemma A.3, we finish the proof. ∎
A.2 Incoherence of Initialization
Lemma A.5.
Let be the top SVD of , where . then there exists universal constant , for any , with probability at least , we have:
Proof.
Suppose . Denote and