In this paper, we introduce a new technique, based on Leave-One-Out, for the study of low-rank matrix completion problems. Matrix completion concerns recovering a low-rank matrix from partial observation of a subset of its entries. To study the sample complexity and algorithmic behavior of these problems, one often needs to analyze a recursive procedure in the presence of dependency across iterations and across entries of the iterates. Such dependency creates significant difficulties in both the design and analysis of algorithms, often leading to sub-optimal bounds as well as complicated and unrealistic algorithms that are not used in practice.
Using the Leave-One-Out approach, we are able to isolate the effect of such dependency, and establish entry-wise control on the iterates of the recursive procedures. We apply this approach to the analysis of two archetypal algorithms for matrix completion—the convex relaxation method based on Nuclear Norm Minimization (NNM, Candès and Recht 2009), and the non-convex approach based on Singular Value Projection (SVP, Jain et al. 2010)—in two distinct ways. For NNM, we apply this technique to an iterative procedure that arises in the analysis, particularly for constructing a dual solution that certifies the optimality of the desired primal solution. For SVP, we employ this technique directly in studying the primal solution path of the algorithm.
1.1 Our contributions
Concretely, we consider the problem of recovering a low-rank matrix with rank given a subset of its entries for . The set of observed indices is assumed to be generated from the standard Bernoulli model; that is, each entry of is observed with probability , independently of all others. Under this setting, we summarize our main results and contributions below.
1.1.1 Nuclear norm minimization
NNM is a convex relaxation approach to matrix completion that replaces the non-convex rank function with the convex nuclear norm. To show that NNM recovers the underlying matrix exactly, it suffices to prove that satisfies the first-order optimality conditions by constructing a so-called dual certificate. The best existing sample complexity results are obtained using a celebrated argument called the Golfing Scheme, which builds the dual certificate via an iterative procedure (Gross et al., 2009; Gross, 2011). An essential ingredient of the Golfing Scheme is to split the observations into independent subsets that are used in different iterations. Doing so allows one to circumvent the dependency across iterations even when a uniform bound does not hold. While this argument has proved fruitful (Recht, 2011; Candès et al., 2011; Chen, 2015), it is also well recognized that sample splitting is inherently artificial and leads to fundamentally sub-optimal sample complexity results.
Using Leave-One-Out, we are able to completely avoid sample splitting in the Golfing Scheme and obtain new sample complexity bounds. In particular, we show in Theorem 1 that NNM recovers with high probability given a number of observed entries, where is an absolute constant and is the incoherence parameter of (see Definition 2). To the best of our knowledge, this is the first sample complexity result for a tractable algorithm that enjoys the following two properties simultaneously: (a) it has absolutely no dependence on the condition number of , and (b) it has optimal dependence on the matrix dimension , without any superfluous factors. While the improvement of our bound over existing ones may appear quantitatively small, we believe that it paves the way to finally matching the lower bound of (Candès and Tao, 2010).
1.1.2 Singular Value Projection
SVP solves a non-convex rank-constrained least squares formulation of matrix completion by projected gradient descent. Analyzing SVP is more challenging than analyzing NNM, as the projection step of the SVP iterations involves an implicit and highly nonlinear mapping, one that is given by the Singular Value Decomposition (SVD). A major roadblock in the analysis is showing that the (differences of) iterates remain incoherent/non-spiky (which roughly means that they are entrywise well-bounded) in the presence of dependent data.
In the presence of this difficulty, previous work has again commonly made use of the sample splitting trick; that is, using a fresh set of independent observations in each iteration (Jain et al., 2013; Hardt and Wootters, 2014; Jain and Netrapalli, 2015). Unlike in NNM, where sample splitting is done only in the analysis, using this trick for SVP requires changing the actual algorithmic procedure. Doing so has several major drawbacks: (a) sample splitting is a wasteful way of using the data, and moreover leads to sample complexity bounds that scale (unnecessarily) with the desired accuracy of the algorithm output; (b) the trick is artificial, and results in a complicated algorithm that is rarely used in practice, creating a gap between theory and practice; (c) as observed by Hardt and Wootters (2014) and Sun and Luo (2016), naive sample splitting (i.e., partitioning into disjoint subsets) actually does not ensure the required independence; rigorously addressing this technical subtlety leads to even more complicated algorithms that are sensitive to the generative model of and hardly practical.
Using the Leave-One-Out technique, we are able to study the original form of SVP, without any sample splitting. In particular, we show that the iterates remain well-bounded entrywise, even when the procedure itself does not have any explicit regularization mechanism for ensuring such a property—hence establishing a form of implicit regularization for SVP. In fact, we establish an even stronger conclusion that the iterates converge linearly in entrywise norm to the desired solution; see Theorem 2.
To put the results above in perspective, we note that NNM, and more broadly convex relaxation methods, remain among the most important and versatile approaches to matrix completion and other high-dimensional problems. Similarly, SVP plays a unique role, both conceptually and algorithmically, in a growing line of recent work on non-convex optimization based approaches for matrix completion. In particular, SVP is recognized as a natural and simple algorithm, and does not require a two-step procedure of “initialization + local refinement”, which is common in other non-convex algorithms. In fact, other algorithms often rely on (one or a few steps of) SVP for initialization; examples include alternating minimization (Jain et al., 2013), projected gradient descent (Chen and Wainwright, 2015) and gradient descent with regularized objective (Sun and Luo, 2016; Zheng and Lafferty, 2016). Moreover, SVP involves one of the most well-understood non-convex problems and most well-studied numerical procedures: computing the best low-rank approximation of a matrix using SVD. Many other algorithms for matrix completion can often be viewed as approximate or noisy versions of SVP. More importantly, both NNM and SVP have proven themselves as prototypes for designing more sophisticated algorithms tailored to additional problem structures (e.g., clustering, ranking and matching/alignment problems). Hence, an in-depth understanding of these two fundamental algorithms, which is the goal of this paper, paves the way for designing and analyzing algorithms for more challenging settings.
In this sense, while we apply Leave-One-Out to NNM and SVP specifically, we believe this technique is useful more broadly in studying other iterative procedures—convex or non-convex—for matrix completion and related statistical problems with complex probabilistic structures. When preparing this manuscript, we became aware of the concurrent work of Ma et al. (2017), who use a related technique to analyze gradient descent procedures for non-convex formulations of matrix completion and phase retrieval problems.
1.2 Related work and comparison
The Leave-One-Out technique, broadly defined, is a classical idea in studying statistical problems with complicated probabilistic structures. It has been used to analyze high-dimensional problems with random designs such as robust M-estimation (El Karoui et al., 2013) and linear regression with de-biased Lasso estimators (Javanmard and Montanari, 2015). More recently, the work of Zhong and Boumal (2017) applies this technique to analyze the generalized power method (GPM) for the phase synchronization problem, and the work of Abbe et al. (2017) uses it to study one-shot spectral algorithms for stochastic block models, matrix completion and -synchronization. This technique has also been used in Chen et al. (2017) to study spectral and gradient descent methods for computing the maximum likelihood estimators with convex objectives in ranking problems with pairwise comparisons. As mentioned, the contemporary work of Ma et al. (2017) applies it to study the convergence behaviors of gradient descent for non-convex formulations.
Compared to these recent results, our use of the Leave-One-Out technique is different in three major aspects:
The work of Zhong and Boumal (2017) and Ma et al. (2017) studies the GPM and gradient descent procedures, respectively. Both procedures involve relatively simple iterations with explicit and mostly linear operations. In contrast, the SVP iterations, which involve computing singular values and vectors, are far more complicated. In particular, the SVD of a matrix is a highly nonlinear function that is defined variationally and implicitly. This makes our problem significantly harder, and our analysis requires quite delicate use of matrix perturbation and concentration bounds.
The work of Abbe et al. (2017) and Chen et al. (2017) studies “one-shot” spectral methods, which involve only a single SVD operation. In contrast, SVP is an iterative procedure with multiple sequential SVD operations. As will become clear in our analysis, even studying the second iteration of SVP requires an analysis very different from those for one-shot algorithms. To further track the propagation of errors and dependency across a potentially infinite number of iterations, we need to make use of careful induction and probabilistic arguments.
For analyzing NNM, the Leave-One-Out technique is used in a very different context. In particular, instead of studying an actual algorithmic procedure, here we use leave-one-out to study an iterative procedure that arises in the dual analysis of a convex program.
The exact low-rank matrix completion problem is studied in the seminal work of Candès and Recht (2009), who initiated the use of the nuclear norm minimization approach. Follow-up work on NNM includes Candès and Tao (2010); Gross (2011); Recht (2011). The state-of-the-art sample complexity bound for NNM in exact matrix completion, established in Chen (2015), takes the form (note the factor due to the use of sample splitting in the Golfing Scheme). The work of Balcan et al. (2018) also establishes a competing sample complexity result for a related convex optimization approach, along with complementary results for an exponential-time algorithm. The SVP algorithm was first proposed in Jain et al. (2010), although no rigorous guarantees were provided for matrix completion. The follow-up work of Jain and Netrapalli (2015) establishes rigorous sample complexity bounds via splitting the samples.
Many other iterative algorithms have been proposed and studied for matrix completion along with sample complexity bounds; a partial list includes Keshavan et al. (2010); Hardt and Wootters (2014); Sun and Luo (2016); Chen and Wainwright (2015); Zheng and Lafferty (2016). A remarkable recent result shows that a nonconvex regularized formulation of matrix completion in fact has no spurious local minima (Ge et al., 2016). Most of these iterative algorithms require a number of observations that depends on the condition number of the underlying matrix. We provide a detailed quantitative comparison of these results after presenting our main theorems.
In Section 2, we present our main results on the sample complexity for NNM and the linear entrywise convergence of SVP. In Section 3 we discuss the main intuitions of our Leave-One-Out based technique, and provide a general recipe of using this technique. Following this recipe, we prove our sample complexity bounds for NNM in Section 4. The paper is concluded with a discussion in Section 5. The proof for SVP is deferred to the appendix.
For an integer , we write . We use the standard big-/ notations that hide universal constants, with meaning . Denote by the set of symmetric matrices in , the -th standard basis vector in appropriate dimension, and the all-one vector. For a matrix , let be its -th row and its -th column. The Frobenius norm, operator norm and nuclear norm of a matrix are denoted by and . We also use the notations for the entrywise norm, and for the maximum row norm. For two matrices with compatible dimensions, we write . The best rank- approximation of in Frobenius norm is , and its -th largest singular value is (or simply if it is clear in the context). We denote by the identity operator on matrices. The operator norm of a linear map on matrices is . By with high probability (w.h.p.), we mean with probability at least for some universal constants , where and are the dimensions of the low-rank matrix to be recovered.
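To make the norms listed above concrete, here is a minimal NumPy sketch that computes each of them, together with the best rank-r approximation. The function names `matrix_norms` and `best_rank_r` and the dictionary keys are illustrative choices, not notation from the paper.

```python
import numpy as np

def matrix_norms(A):
    """Return the matrix norms used in the text for a 2-D array A (names are illustrative)."""
    A = np.asarray(A, dtype=float)
    return {
        "frobenius": np.linalg.norm(A, "fro"),       # sqrt of the sum of squared entries
        "operator": np.linalg.norm(A, 2),            # largest singular value
        "nuclear": np.linalg.norm(A, "nuc"),         # sum of singular values
        "entrywise_max": np.abs(A).max(),            # largest absolute entry
        "max_row": np.linalg.norm(A, axis=1).max(),  # largest Euclidean row norm
    }

def best_rank_r(A, r):
    """Best rank-r approximation in Frobenius norm, via a truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]
```

The entrywise and maximum-row norms are the ones that matter most in what follows, since the leave-one-out analysis is aimed at controlling individual entries and rows rather than aggregate quantities.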
2 Problem setup and main theorems
In this section, we formally set up the matrix completion problem, and state our main results on nuclear norm minimization and Singular Value Projection.
2.1 Nuclear norm minimization
To present our results on NNM, we consider the following standard Bernoulli sampling model for the matrix completion problem.
Definition 1 (MC model). Suppose that has rank and . Under the model MC, we are given the partially observed matrix , where the sampling operator is given by for each , and the observation indicators are independent Bernoulli variables with distribution . We denote by the set of indices of the observed entries.
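The sampling model can be simulated in a few lines. The sketch below is only an illustration: the names `bernoulli_mask` and `sample_entries`, the matrix sizes, and the choice to represent unobserved entries by zeros are assumptions made here.

```python
import numpy as np

def bernoulli_mask(shape, p, rng):
    """Independent Bernoulli(p) observation indicators, one per entry."""
    return (rng.random(shape) < p).astype(float)

def sample_entries(M, mask):
    """Sampling operator: keep the observed entries and set the rest to zero."""
    return mask * M

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))  # a rank-2 example
mask = bernoulli_mask(M.shape, p=0.5, rng=rng)
observed = sample_entries(M, mask)
```

Note that every indicator is drawn once and reused; the dependency that the paper addresses arises exactly because the same `mask` enters every iteration of the procedures studied later.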
The NNM approach involves solving the following convex program:
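In notation that is assumed here for illustration ($M^*$ for the underlying $d_1 \times d_2$ matrix, $\Pi_\Omega$ for the sampling operator, and $\|\cdot\|_*$ for the nuclear norm), the standard nuclear norm minimization program of Candès and Recht (2009) can be written as:

```latex
\begin{equation*}
\min_{X \in \mathbb{R}^{d_1 \times d_2}} \; \|X\|_*
\quad \text{subject to} \quad
\Pi_\Omega(X) = \Pi_\Omega(M^*).
\end{equation*}
```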
Our goal is to characterize when the NNM program (1) recovers the underlying low-rank matrix as the unique optimal solution. In order to quantify the difficulty of recovering , we consider the following standard measure of the incoherence of a matrix.
Definition 2 (Incoherence). A matrix with rank- SVD is -incoherent if
The incoherence condition will be imposed on to avoid pathological situations where most of the entries of are equal to zero. In such situations, it is well-known that it is impossible to complete unless all of its entries are observed (Candès and Recht, 2009). Also denote by the condition number of .
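The role of incoherence can be checked numerically. The sketch below assumes the definition that is usual in the matrix completion literature, namely that the squared row norms of the singular vector matrices are at most the incoherence parameter times r divided by the corresponding dimension; the function name `incoherence` is hypothetical. It returns the smallest parameter for which that condition holds.

```python
import numpy as np

def incoherence(M, r):
    """Smallest mu such that the rank-r singular subspaces of M are mu-incoherent
    (under the usual row-norm definition, assumed here)."""
    d1, d2 = M.shape
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    U, V = U[:, :r], Vt[:r, :].T
    mu_U = d1 / r * (np.linalg.norm(U, axis=1) ** 2).max()
    mu_V = d2 / r * (np.linalg.norm(V, axis=1) ** 2).max()
    return max(mu_U, mu_V)

flat = np.ones((4, 4))      # mass spread evenly over all entries
spiky = np.zeros((4, 4))
spiky[0, 0] = 1.0           # a single non-zero entry
```

Under this definition the all-ones matrix attains the smallest possible value, while the single-spike matrix attains the largest, which matches the pathological situation described above.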
With the set-up above, we state our sample complexity result for NNM.
Theorem 1. Under the model MC, if is -incoherent and , then with high probability is the unique minimizer of the NNM program (1).
We prove this theorem in Section 4. In the setting with , the theorem shows that NNM recovers with high probability provided that the expected number of observed entries satisfies . Note that the sample complexity bound has only one logarithmic term , and is independent of the condition number of . A lower bound on the sample complexity of matrix completion is established by Candès and Tao (2010), who show that is necessary for any algorithm to uniquely recover with probability at least . Our bound hence has the correct dependence on and , and is sub-optimal in terms of the incoherence parameter and the rank .
Table 1: Comparison of sample complexity results for matrix completion. Rows: Keshavan et al. (2010); Sun and Luo (2016); Zheng and Lafferty (2016); Balcan et al. (2018); lower bound in Candès and Tao (2010).
In Table 1 we compare our bound with the state-of-the-art sample complexity results for polynomial-time algorithms (we omit other existing results that are strictly dominated by the bounds presented here). The work of Chen (2015), which builds on earlier results on NNM, establishes a bound that scales sub-optimally with , which is a fundamental consequence of having to split into subsets in the Golfing Scheme. The other previous results in the table all have non-trivial dependence on the condition number . While it is common to see dependency of the time complexity on , the appearance of in the sample complexity seems unnecessary. To the best of our knowledge, our result is the only one on tractable algorithms that achieves optimal dependence on both the condition number and the dimension; in particular, our result is not dominated by any existing results.
Before we proceed, we would like to note that the very recent work in Balcan et al. (2018) obtains a sample complexity bound that matches the lower bound; their bound, however, is achieved by an inefficient algorithm that has running time exponential in .
2.2 Singular Value Projection
We now turn to the analysis of the SVP algorithm. For simplicity, we will work with the setting where the true matrix is symmetric and positive semidefinite. Our results can be extended to the general asymmetric case either via a direct analysis, or by using an appropriate form of the standard dilation argument (see, e.g., Zheng and Lafferty 2016), though the proofs will become more tedious.
The symmetric matrix completion problem is defined as follows.
Definition 3 (SMC model). Suppose that is symmetric positive semidefinite with rank , and . Under the model SMC, we are given the partially observed matrix , where the sampling operator is given by for each , and the observation indicators are independent Bernoulli variables with distribution and . We denote by the set of indices of the observed entries.
To motivate the SVP algorithm, we consider a natural optimization formulation for the above SMC model, which involves solving the following rank-constrained least-squares problem:
SVP can be viewed as the projected gradient descent method applied to the above problem (2). In particular, let be the projection operator onto the set of matrices with rank at most ; that is, is the best rank- approximation of in Frobenius norm. Then the SVP iteration with a step size is given by
We will later fix a constant step size . The rationale behind this choice is that if were independent of the quantity , then the expectation of would be exactly . As is standard, we assume that is known. (In practice, can be estimated accurately by the empirical observation frequency .)
The SVP algorithm was first proposed by Jain et al. (2010). They observe empirically that the objective value of the SVP iterate converges quickly to zero. Based on this observation, they conjecture that the SVP iterate is guaranteed to converge to (Jain et al., 2010, Conjecture 4.3). We reproduce below their conjecture, rephrased under our symmetric setting:
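The SVP iteration described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the step size is the reciprocal of the observation probability, that the iteration starts from the zero matrix, and that the symmetric indicators are generated by sampling the upper triangle and mirroring it. All function names are hypothetical.

```python
import numpy as np

def project_rank_r(X, r):
    """Best rank-r approximation in Frobenius norm, via a truncated SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

def symmetric_mask(d, p, rng):
    """Symmetric Bernoulli(p) indicators: sample the upper triangle, then mirror it."""
    upper = np.triu(rng.random((d, d)) < p)
    return (upper | upper.T).astype(float)

def svp(observed, mask, r, p, num_iters):
    """Projected gradient descent with step size 1/p (assumed), started from zero (assumed).

    The same mask is used in every iteration; there is no sample splitting.
    """
    X = np.zeros_like(observed)
    for _ in range(num_iters):
        residual = mask * X - observed           # error on the observed entries only
        X = project_rank_r(X - residual / p, r)  # gradient step, then rank-r projection
    return X
```

The choice of step size can be read off the code: if `mask` were independent of the current error, `residual / p` would equal the full error in expectation, so the gradient step would land on the true matrix before projection. With every entry observed (`p = 1`), a single iteration recovers a rank-r matrix exactly.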
Conjecture (Jain et al., 2010). For some constants depending on and , the following holds with high probability under the model SMC with . SVP with some fixed step size outputs a matrix of rank at most such that after iterations; moreover, converges to .
Jain et al. provide theoretical evidence supporting this conjecture, though no complete proof is given. The conjecture has remained open since the proposal of SVP.
Using the Leave-One-Out technique, we are able to essentially establish the conjecture. In fact, we prove a stronger result showing that the SVP iterate converges linearly to entrywise in norm.
Theorem 2. Under the model SMC, if is -incoherent and , then with high probability the SVP iterates in equation (3) satisfy the bound
We prove Theorem 2 in Appendix C. The theorem establishes, for the first time, the convergence of the original form of SVP where the same set of observed entries is used in all iterations, without any resampling or sample splitting.
2.2.1 Warm-up: the equal eigenvalue case
The proof of Theorem 2 turns out to be quite long and technical. To provide intuitions, we shall first prove the result in a simpler version of the SMC model, where we assume in addition that the non-zero singular values of are known and all equal to . We call this the SNM model. This simpler case captures most of the essential elements and difficulties of the full analysis.
Knowing the singular values, we only need to compute an estimate of the singular vectors of . We therefore consider the following simplified SVP procedure:
Here is the operator such that for each , is the -by- matrix whose columns are the eigenvectors associated with the largest eigenvalues of . We have the following guarantee for this simplified SVP procedure. The proof is given in Appendix B.
Under the model SNM, if is -incoherent with all non-zero eigenvalues equal to and , then with high probability the SVP iterates in equation (5) satisfy the bound
Challenges in the analysis:
Standard approaches for analyzing SVP often aim to prove that converges to in Frobenius norm. To do this, one may try to show that the operator satisfies a form of the Restricted Isometry Property (RIP), so that preserves the Frobenius norm of the error matrix (Jain et al., 2010). However, RIP cannot hold uniformly for all low-rank matrices. Even when one restricts to incoherent iterates , the error matrix itself may not be incoherent but rather is the difference of two incoherent matrices. A more severe challenge arises in showing that the error matrix retains the property needed for RIP to hold throughout the iterations, since SVP has no explicit mechanism to ensure such properties. We instead employ a different argument, based on Leave-One-Out, that directly controls the quantity . This allows us to show that every row of the factorized difference , and hence every entry of the error matrix as well, converges to zero linearly and simultaneously. A byproduct of our analysis is that remains incoherent throughout the iterations, although incoherence and RIP are no longer needed explicitly in our convergence proof.
3 Intuitions and recipe of our Leave-One-Out-based approach
In this section, we describe the high-level intuitions of our Leave-One-Out-based approach for studying a stochastic iterative procedure. We then provide a general recipe of employing this approach in the analysis of concrete algorithms including NNM and SVP.
3.1 Intuitions of Leave-One-Out
Consider an iterative procedure as follows: Starting from a given , one performs the iterations
where represents the random data. (In a more general situation, each can be a vector.) The map is a possibly nonlinear and implicit transformation, and let us assume that is a fixed point of . Our goal is to understand the behaviors of individual coordinates of the iterates. If is a contraction in norm in the sense that
for some with high probability, then it is easy to show that the distance of iterate to the fixed point decreases geometrically to zero.
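The geometric decay follows from a one-line induction. The symbols below are assumptions made for illustration: $\theta^t$ denotes the iterates, $\theta^*$ the fixed point, $F(\cdot\,;\delta)$ the map, and $\rho \in (0,1)$ the contraction factor. Using the fixed-point property $F(\theta^*;\delta) = \theta^*$ and then the contraction bound repeatedly,

```latex
\begin{align*}
\|\theta^{t} - \theta^*\|_2
  &= \|F(\theta^{t-1};\delta) - F(\theta^*;\delta)\|_2 \\
  &\le \rho \, \|\theta^{t-1} - \theta^*\|_2 \\
  &\le \cdots \le \rho^{t} \, \|\theta^{0} - \theta^*\|_2 .
\end{align*}
```

In particular, reaching accuracy $\epsilon$ in this norm takes on the order of $\log(1/\epsilon) / \log(1/\rho)$ iterations.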
In some situations, however, one is interested in controlling the entrywise behaviors of , e.g., bounding its norm . Using the worst-case inequality , together with the above distance bound, is often far too loose. Worse yet, there are cases where the contraction may not hold uniformly for all and ; instead, only a restricted version of it holds:
for some small number . In this case, establishing convergence of again requires one to first control the norm of . We see these two types of difficulties in the analysis of SVP and nuclear norm minimization for matrix completion problems.
To overcome these challenges, we need to make use of the more fine-grained structures of and . For illustration, we assume that iteration (7) is separable w.r.t. the data , in the sense that
That is, the -th coordinate of the iterate has explicit dependence only on the -th coordinate of the data . However, also depends on all the coordinates of , which in turn depend on the entire data vector . Consequently, still depends implicitly on the entire random vector .
On the other hand, the mapping is often not too sensitive to individual coordinates of . In this case, we expect that the randomness of propagates slowly across the coordinates of , so the correlation between and is weak (though they are not completely independent). To formalize the above insensitivity property quantitatively, we assume that satisfies, in addition to the restricted -contraction bound (8), the following Lipschitz condition
The value of is often small, sometimes , since we are comparing norm with norm. However, one should not expect that ; otherwise we would have contraction, which is rare for the problems we consider.
To exploit the above properties (8)–(10), we employ a leave-one-out argument, which allows us to isolate the effect of the dependency across coordinates. For each , let be the vector obtained from the original data vector by zeroing out the -th coordinate. Consider the fictitious iteration (which is used only in the analysis)
By construction, is independent of . Then we have
where the last line is due to the Lipschitz assumption (10), which bounds the “Lipschitz term” that measures the variation of applied to two different iterates and . The “discrepancy term” captures the effect of zeroing out the -th coordinate of the data . This term involves two independent quantities and , which is often easy to handle. To proceed, we need to bound the term , that is, to show that and are close to each other. This is true intuitively, as and are computed from two sets of data that differ at only one coordinate. More precisely, we have the bound
here step is due to the separability assumption (9), and step can be established using the -contraction bound (8) under the induction hypothesis . The second term in line above again involves independent quantities and is easy to bound. In this case, the bound above combined with an induction argument establishes an (often contracting) upper bound on , so and are close (and often become closer) to each other throughout the iterations.
To sum up, by using the above leave-one-out argument, we reduce the harder problem of controlling the individual coordinates of to two easier problems:
Controlling a quantity of the form when and are independent;
Controlling a quantity of the form in various norms, for which we can make use of the restricted -contraction and Lipschitz properties of . The latter properties can often be established even when - or -contraction fails to hold uniformly.
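The reduction above can be illustrated on a toy problem. The sketch below is not the procedure analyzed in the paper: the map `step`, the vector `theta_star` and every other name are hypothetical, chosen only so that coordinate i of the new iterate sees the data through `delta[i]` alone, mirroring the separability assumption, and so that `theta_star` is a fixed point.

```python
import numpy as np

def step(theta, theta_star, delta, p):
    """Toy separable map: coordinate i depends on the data only through delta[i]."""
    err = theta - theta_star
    return theta_star + (1.0 - delta / p) * err.mean()

def iterate(theta0, theta_star, delta, p, num_iters):
    """Run the toy iteration with the same data vector in every step."""
    theta = theta0.copy()
    for _ in range(num_iters):
        theta = step(theta, theta_star, delta, p)
    return theta

def leave_one_out(theta0, theta_star, delta, p, num_iters, i):
    """Fictitious iteration: same map, but with the i-th data coordinate zeroed out."""
    delta_loo = delta.copy()
    delta_loo[i] = 0.0
    return iterate(theta0, theta_star, delta_loo, p, num_iters)

rng = np.random.default_rng(0)
n, p, i = 200, 0.8, 3
theta_star = rng.standard_normal(n)
theta0 = np.zeros(n)
delta = (rng.random(n) < p).astype(float)

theta = iterate(theta0, theta_star, delta, p, num_iters=5)
theta_loo = leave_one_out(theta0, theta_star, delta, p, num_iters=5, i=i)
gap = np.linalg.norm(theta - theta_loo)  # distance between the two sequences
```

By construction `theta_loo` does not change when `delta[i]` is flipped, which is the independence used in the first of the two problems above; `gap` is the quantity that the second problem asks to control.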
3.2 Using the leave-one-out argument in our proofs
The analysis of NNM and SVP for matrix completion involves stochastic iteration procedures of the form (7) given in the last sub-section. Here the data consists of the observation indicators given in Definitions 1 and 3. For NNM, the dual certificate is constructed using a procedure with given by , where is a linear operator to be defined later. The (simplified) SVP iteration in (5) corresponds to an given by .
Dealing with these problems is of course more complicated than the example given in the last sub-section. In particular, the separability and Lipschitz properties (8)–(10) may not hold exactly as stated, or may be non-trivial to recognize and establish. Moreover, the high probability bounds obtained from exploiting independence only hold for a finite number of iterations due to the use of union bounds. Consequently, proving the desired results requires additional steps that are sometimes quite technical. Nevertheless, the intuitions remain valid. Below we summarize our strategy of using the leave-one-out technique for our problems:
Introduce the leave-one-out sequence to the analysis as in equation (11) by appropriately defining the coordinate system and recognizing which coordinate of should be zeroed out. This can be done using a calculation similar to equation (12), which exposes which can be left out to induce independence in the term .
To control an infinite number of iterations, we use the fact that the iterate is already close to after finitely many iterations. In this case, we may exploit local properties near to establish uniform/deterministic bounds. It may also suffice to use a crude bound , as the right hand side is already sufficiently small.
4 Proof of Theorem 1
In this section, we apply our leave-one-out based technique to prove the sample complexity bound in Theorem 1 for NNM. We assume that for simplicity; the proof of the general case follows exactly the same lines.
Let the singular value decomposition of be . We define the projections and for a matrix . Introduce the shorthand , which can be explicitly expressed as . We also define the operator and the linear subspace .
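The two projections can be written down explicitly in code. The sketch below assumes the formulas that are standard in the matrix completion literature, with `U` and `V` holding the left and right singular vectors as orthonormal columns; the function names `proj_T` and `proj_T_perp` are illustrative.

```python
import numpy as np

def proj_T(Z, U, V):
    """Projection onto the tangent-type subspace T spanned by U and V (standard formula, assumed)."""
    PU, PV = U @ U.T, V @ V.T
    return PU @ Z + Z @ PV - PU @ Z @ PV

def proj_T_perp(Z, U, V):
    """Projection onto the orthogonal complement of T."""
    d1, d2 = Z.shape
    return (np.eye(d1) - U @ U.T) @ Z @ (np.eye(d2) - V @ V.T)
```

The two maps sum to the identity, both are idempotent, and the product of `U` with the transpose of `V` lies in T; the dual certificate conditions below are stated separately on the T and complementary components.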
We make use of a standard result that provides sufficient conditions for the optimality of to the NNM formulation.
The first condition in Proposition 1 can be verified using the following well-known result from the matrix completion literature.
To prove Theorem 1 and establish optimality of , it remains to construct the dual certificate that satisfies the second set of conditions in Proposition 1. This is usually done using the Golfing Scheme developed in Candès et al. (2011); Gross (2011), where is split into independent subsets and is constructed iteratively using a different subset in each iteration. Doing so induces independence across iterations and makes certain entrywise bounds easier to establish; however, it leads to sub-optimal sample complexity bounds. We instead consider a procedure that uses the same in all iterations. Let us first describe our dual certificate construction procedure, and discuss why we need to establish entrywise bounds.
Constructing the dual certificate:
We consider an iterative procedure where and
We claim that with , the matrix
is the desired dual certificate. Indeed, with high probability satisfies the conditions 2(a) and 2(b) in Proposition 1, as shown below:
Verifying the dual certificate:
For condition 2(b), Lemma 1 ensures that w.h.p.
Applying the above inequality recursively with gives the desired bound:
For condition 2(a), let us assume for the moment that satisfies the norm bound
Combining with the assumed bound (16) gives
where we use the assumption that . It then follows from the definition of in (15) that
To complete the proof of Theorem 1, it remains to establish the entrywise bound (16) for the iterative procedure (14). This step has been the bottleneck in the analysis of previous work using the Golfing Scheme. In the rest of this section, we establish inequality (16) by using our leave-one-out based technique.
4.2 Proving inequality (16) by leave-one-out
The iterative procedure (14) can be written abstractly as , where . Note that this is a special case of the general iterative procedure (7) described in Section 3. We now analyze this procedure following the strategy outlined in Section 3.2.
Introduce the leave-one-out sequence:
Since we need to bound the matrix entrywise norm, for each index we define the operator by
Also define . For each , we introduce the leave-one-out sequence
These iterates have the property that is independent of and , namely, the randomness of in the -th row and -th column. If we define to match the notations in Section 3, then and
Verify the Lipschitz condition and bound the discrepancy terms:
We now state a few lemmas useful in the subsequent analysis; their proofs are deferred to Appendix A. In the language of Section 3, these lemmas are used to establish the -Lipschitz condition (10) and to bound the discrepancy terms in (12) and (13). Note that the contraction condition is already established in Lemma 1, as is linear and all iterates belong to .
We first introduce a lemma corresponding to the -Lipschitz condition in (10).
If , we have w.h.p.
(To see the correspondence, note that all iterates are in and is linear; therefore the Lipschitz condition in (10) is equivalent to , where .)
The next lemma bounds the discrepancy term in equation (12).
If , for each we have w.h.p.
Our last lemma bounds the discrepancy term in equation (13).
Suppose that . There exists a numerical constant such that for each , we have w.h.p.
Induction on :
Equipped with the lemmas above, we are ready to prove the desired inequality (16) by induction on . Our induction hypothesis is
For inequality (19), we have w.h.p. for each ,
For inequality (20), we have w.h.p. for each