1 Introduction
Non-convex matrix factorization problems have been an emerging object of study in theoretical computer science [JNS13, Har14, SL15, RSW16], optimization [WYZ12, SWZ14], machine learning [BNS16, GLM16, GHJY15, JMD10, LLR16, WX12], and many other domains. In theoretical computer science and optimization, the study of such models has led to significant advances in provable algorithms that converge to local minima in linear time [JNS13, Har14, SL15, AAZB16, AZ16]. In machine learning, matrix factorization serves as a building block for large-scale prediction and recommendation systems, e.g., the winning submission for the Netflix prize [KBV09]. Two prototypical examples are matrix completion and robust Principal Component Analysis (PCA).
This work develops a novel framework to analyze a class of non-convex matrix factorization problems with strong duality, which leads to exact recoverability for matrix completion and robust PCA via the solution to a convex problem. The matrix factorization problems can be stated as finding a target matrix in the form of , by minimizing the objective function over factor matrices and with a known value of , where is some function that characterizes the desired properties of .
Our work is motivated by several promising areas where our analytical framework for non-convex matrix factorizations is applicable. The first area is low-rank matrix completion, where it has been shown that a low-rank matrix can be exactly recovered by finding a solution of the form that is consistent with the observed entries (assuming that it is incoherent) [JNS13, SL15, GLM16]. This problem has received a tremendous amount of attention due to its important role in optimization and its wide applicability in many areas such as quantum information theory and collaborative filtering [Har14, ZLZ16, BZ16]. The second area is robust PCA, a fundamental problem of interest in data processing that aims at recovering both the low-rank and the sparse components exactly from their superposition [CLMW11, NNS14, GWL16, ZLZC15, ZLZ16, YPCC16], where the low-rank component corresponds to the product of and while the sparse component is captured by a proper choice of function , e.g., the norm [CLMW11, ABHZ16]. We believe our analytical framework can potentially be applied to other non-convex problems more broadly, e.g., matrix sensing [TBSR15], dictionary learning [SQW17a], weighted low-rank approximation [RSW16, LLR16], and deep linear neural networks [Kaw16], which may be of independent interest.
Without assumptions on the structure of the objective function, direct formulations of matrix factorization problems are NP-hard to optimize in general [HMRW14, ZLZ13]. With standard assumptions on the structure of the problem and with sufficiently many samples, these optimization problems can be solved efficiently, e.g., by convex relaxation [CR09, Che15]. Some other methods run local search algorithms given an initialization close enough to the global solution in the basin of attraction [JNS13, Har14, SL15, GHJY15, JGN17]. However, these methods have sample complexity significantly larger than the information-theoretic lower bound; see Table 1 for a comparison. The problem becomes more challenging when the number of samples is small enough that the sample-based initialization is far from the desired solution, in which case the algorithm can run into a local minimum or a saddle point.
Another line of work has focused on studying the loss surface of matrix factorization problems, providing positive results for approximately achieving global optimality. One nice property established in this line of research is that there are no spurious local minima for specific applications such as matrix completion [GLM16], matrix sensing [BNS16], dictionary learning [SQW17a], phase retrieval [SQW16], and linear deep neural networks [Kaw16]. However, these results are based on concrete forms of objective functions. Also, even when every local minimum is guaranteed to be globally optimal, in general it remains NP-hard to escape high-order saddle points [AG16], and additional arguments are needed to show that a local minimum is actually reached. Most importantly, all existing results rely on strong assumptions on the sample size.
1.1 Our Results
Our work studies the exact recoverability problem for a variety of non-convex matrix factorization problems. The goal is to provide a unified framework to analyze a large class of matrix factorization problems, and to achieve efficient algorithms. Our main results show that although matrix factorization problems are hard to optimize in general, under certain dual conditions the duality gap is zero, and thus the problem can be converted to an equivalent convex program. The main theorem of our framework is the following.
Theorem 4 (Strong Duality. Informal). Under certain dual conditions, strong duality holds for the non-convex optimization problem
(1) |
where “the function is closed” means that for each , the sub-level set is a closed set. In other words, problem (1) and its bi-dual problem
(2) |
have exactly the same optimal solutions in the sense that , where is a convex function defined by and is the sum of the first largest squared singular values.
Theorem 4 connects the non-convex program (1) to its convex counterpart via strong duality; see Figure 1. We mention that strong duality rarely holds in non-convex optimization: low-rank matrix approximation [OW92] and quadratic optimization with two quadratic constraints [BE06] are among the few examples that enjoy this property. Given strong duality, the computational issues of the original problem can be overcome by solving the convex bi-dual problem (2).
The positive result of our framework is complemented by a lower bound to formalize the hardness of the above problem in general. Assuming that the random 4-SAT problem is hard (see Conjecture 1) [RSW16], we give a strong negative result for deterministic algorithms. If also BPP = P (see Section 6 for a discussion), then the same conclusion holds for randomized algorithms succeeding with probability at least .
Theorem 6.1 (Hardness Statement. Informal). Assuming that random 4-SAT is hard on average, there is a problem in the form of (1) such that any deterministic algorithm achieving in the objective function value with requires time, where OPT is the optimum and is an absolute constant. If BPP = P, then the same conclusion holds for randomized algorithms succeeding with probability at least .
Our framework only requires the dual conditions in Theorem 4 to be verified. We will show that two prototypical problems, matrix completion and robust PCA, obey the conditions. They belong to the linear inverse problems of form (1) with a proper choice of function , which aim at exactly recovering a hidden matrix with given a limited number of linear observations of it.
For matrix completion, the linear measurements are of the form , where is the support set which is uniformly distributed among all subsets of of cardinality . With strong duality, we can either study the exact recoverability of the primal problem (1), or investigate the validity of its convex dual (or bi-dual) problem (2). Here we study the former with tools from geometric analysis. Recall that in the analysis of matrix completion, one typically requires an -incoherence condition for a given rank- matrix with skinny SVD [Rec11, CT10]:
(3) |
where ’s are vectors with -th entry equal to and other entries equal to . The incoherence condition asserts that information spreads throughout the left and right singular vectors and is quite standard in the matrix completion literature. Under this standard condition, we have the following results.
Theorems 4.1, 4.2, and 4.3 (Matrix Completion. Informal). is the unique matrix of rank at most that is consistent with the measurements with minimum Frobenius norm with high probability, provided that and satisfies incoherence (3). In addition, there exists a convex optimization problem for matrix completion in the form of (2) that exactly recovers with high probability, provided that , where is the condition number of .
Work | Sample Complexity | -Incoherence |
---|---|---|
[JNS13] | | Condition (3) |
[Har14] | | Condition (3) |
[GLM16] | | |
[SL15] | | Condition (3) |
[ZL16] | | Condition (3) |
[GLZ17] | | Condition (3) |
[ZWL15] | | Condition (3) |
[KMO10a] | | Similar to (3) and (14) |
[Gro11] | | Conditions (3) and (14) |
[Che15] | | Condition (3) |
Ours | | Condition (3) |
Lower Bound [CT10] | | Condition (3) |
Footnote 1: This lower bound is information-theoretic.
To the best of our knowledge, our result is the first to connect convex matrix completion to non-convex matrix completion, two parallel lines of research that have received significant attention in the past few years. Table 1 compares our result with prior results.
For robust PCA, instead of studying exact recoverability of problem (1) as for matrix completion, we investigate problem (2) directly. The robust PCA problem is to decompose a given matrix into the sum of a low-rank component and a sparse component [ANW12]. We obtain the following theorem for robust PCA.
Theorem 5.1 (Robust PCA. Informal). There exists a convex optimization formulation for robust PCA in the form of problem (2) that exactly recovers the incoherent matrix and with high probability, even if and the size of the support of is , where the support set of is uniformly distributed among all sets of cardinality , and the incoherence parameter satisfies constraints (3) and .
The bounds in Theorem 5.1 match the best known results in the robust PCA literature when the supports of are uniformly sampled [CLMW11], while our assumption is arguably more intuitive; see Section 5. Note that our results hold even when is close to full rank and a constant fraction of the entries have noise. Independently of our work, Ge et al. [GJY17] developed a framework to analyze the loss surface of low-rank problems, and applied the framework to matrix completion and robust PCA. Their bounds are: for matrix completion, the sample complexity is ; for robust PCA, the outlier entries are deterministic and the number that the method can tolerate is . Zhang et al. [ZWG17] also studied the robust PCA problem using non-convex optimization, where the outlier entries are deterministic and the number of outliers that their algorithm can tolerate is . The strong duality approach is unique to our work.
1.2 Our Techniques
Reduction to Low-Rank Approximation. Our results are inspired by the low-rank approximation problem:
(4) |
We know that all local solutions of (4) are globally optimal (see Lemma 3.1) and that strong duality holds for any given matrix [GRG16]. To extend this property to our more general problem (1), our main insight is to reduce problem (1) to the form of (4) using the -regularization term. While some prior work attempted to apply a similar reduction, their conclusions either depended on unrealistic conditions on local solutions, e.g., all local solutions are rank-deficient [HYV14, GRG16], or their conclusions relied on strong assumptions on the objective functions, e.g., that the objective functions are twice-differentiable [HV15]. Instead, our general results formulate strong duality via the existence of a dual certificate . For concrete applications, the existence of a dual certificate is then converted to mild assumptions, e.g., that the number of measurements is sufficiently large and the positions of measurements are randomly distributed. We will illustrate the importance of randomness below.
The Blessing of Randomness. The desired dual certificate may not exist in the deterministic world. A hardness result [RSW16] shows that for the problem of weighted low-rank approximation, which can be cast in the form of (1), without some randomization in the measurements made on the underlying low rank matrix, it is NP-hard to achieve a good objective value, not to mention to achieve strong duality. A similar phenomenon was observed for deterministic matrix completion [HM12]. Thus we should utilize such randomness to analyze the existence of a dual certificate. For matrix completion, the assumption that the measurements are random is standard, under which, the angle between the space (the space of matrices which are consistent with observations) and the space (the space of matrices which are low-rank) is small with high probability, namely, is almost the unique low-rank matrix that is consistent with the measurements. Thus, our dual certificate can be represented as another form of a convergent Neumann series concerning the projection operators on the spaces and . The remainder of the proof is to show that such a construction obeys the dual conditions.
To prove the dual conditions for matrix completion, we use the fact that the subspace and the complement space are almost orthogonal when the sample size is sufficiently large. This implies the projection of our dual certificate on the space has a very small norm, which exactly matches the dual conditions.
Non-Convex Geometric Analysis. Strong duality implies that the primal problem (1) and its bi-dual problem (2) have exactly the same solutions in the sense that . Thus, to show exact recoverability of linear inverse problems such as matrix completion and robust PCA, it suffices to study either the non-convex primal problem (1) or its convex counterpart (2). Here we do the former analysis for matrix completion. We mention that traditional techniques [CT10, Rec11, CRPW12] for convex optimization break down for our non-convex problem, since the subgradient of a non-convex objective function may not even exist [BV04]. Instead, we apply tools from geometric analysis [Ver09] to analyze the geometry of problem (1). Our non-convex geometric analysis is in stark contrast to prior techniques of convex geometric analysis [Ver15] where convex combinations of non-convex constraints were used to define the Minkowski functional (e.g., in the definition of atomic norm) while our method uses the non-convex constraint itself.
For matrix completion, problem (1) has two hard constraints: a) the rank of the output matrix should be no larger than , as implied by the form of ; b) the output matrix should be consistent with the sampled measurements, i.e., . We study the feasibility condition of problem (1) from a geometric perspective: is the unique optimal solution to problem (1) if and only if starting from , either the rank of or increases for all directions ’s in the constraint set . This can be geometrically interpreted as the requirement that the set and the constraint set must intersect uniquely at (see Figure 2). This can then be shown by a dual certificate argument.
Putting Things Together. We summarize our new analytical framework with the following figure.
Other Techniques. An alternative method is to investigate the exact recoverability of problem (2) via standard convex analysis. We find that the sub-differential of our induced function is very similar to that of the nuclear norm. With this observation, we prove the validity of robust PCA in the form of (2) by combining this property of with standard techniques from [CLMW11].
2 Preliminaries
We will use calligraphy to represent a set, bold capital letters to represent a matrix, bold lower-case letters to represent a vector, and lower-case letters to represent scalars. Specifically, we denote by the underlying matrix. We use () to indicate the -th column (row) of . The entry in the -th row, -th column of is represented by . The condition number of is . We let and . For a function on an input matrix , its conjugate function is defined by . Furthermore, let denote the conjugate function of .
We will frequently use to constrain the rank of . This can be equivalently represented as , by restricting the number of columns of and rows of to be . For norms, we denote by the Frobenius norm of matrix . Let be the non-zero singular values of . The nuclear norm (a.k.a. trace norm) of is defined by , and the operator norm of is . Denote by . For two matrices and of equal dimensions, we denote by . We denote by the sub-differential of function evaluated at . We define the indicator function of convex set by For any non-empty set , denote by .
We denote by the set of indices of observed entries, and its complement. Without confusion, also indicates the linear subspace formed by matrices with entries in being . We denote by the orthogonal projector of subspace . We will consider a single norm for these operators, namely, the operator norm denoted by and defined by . For any orthogonal projection operator to any subspace , we know that whenever . For distributions, denote by a standard Gaussian random variable, the uniform distribution of cardinality , and the Bernoulli distribution with success probability .
3 -Regularized Matrix Factorizations: A New Analytical Framework
In this section, we develop a novel framework to analyze a general class of -regularized matrix factorization problems. Our framework can be applied to different specific problems and leads to nearly optimal sample complexity guarantees. In particular, we study the -regularized matrix factorization problem
We show that under suitable conditions the duality gap between (P) and its dual (bi-dual) problem is zero, so problem (P) can be converted to an equivalent convex problem.
3.1 Strong Duality
We first consider an easy case where for a fixed , leading to the objective function . For this case, we establish the following lemma.
Lemma 3.1.
For any given matrix , any local minimum of over and is globally optimal, given by . The objective function around any saddle point has a negative second-order directional curvature. Moreover, has no local maximum.
Footnote 2: Prior work studying the loss surface of low-rank matrix approximation assumes that the matrix is of full rank and does not have the same singular values [BH89]. In this work, we generalize this result by removing these two assumptions.
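The global-optimality claim in Lemma 3.1 can be checked numerically on a small instance. The following is a minimal sketch, assuming the objective has the form of a squared Frobenius distance between the product of the two factors and a fixed matrix (the sign and scaling conventions of the lemma may differ); the names `best_rank_r`, `Y`, `A`, and `B` are illustrative and not taken from the text. It builds the factors from a truncated SVD, confirms that the gradient vanishes there, and compares the objective value against random rank-r candidates.

```python
import numpy as np

def best_rank_r(Y, r):
    """Factors (A, B) of the truncated SVD of Y, plus all singular values."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    A = U[:, :r] * s[:r]   # n1 x r factor, columns scaled by singular values
    B = Vt[:r, :]          # r x n2 factor
    return A, B, s

rng = np.random.default_rng(0)
n1, n2, r = 8, 6, 2
Y = rng.standard_normal((n1, n2))
A, B, s = best_rank_r(Y, r)

# Objective value at the truncated SVD and the energy in the discarded tail.
opt_err = np.linalg.norm(A @ B - Y, "fro") ** 2
tail = np.sum(s[r:] ** 2)

# Gradients of ||AB - Y||_F^2 with respect to A and B at this point.
grad_A = 2.0 * (A @ B - Y) @ B.T
grad_B = 2.0 * A.T @ (A @ B - Y)

# Smallest gap between random rank-r candidates and the truncated SVD.
min_gap = min(
    np.linalg.norm(
        rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2)) - Y, "fro"
    ) ** 2 - opt_err
    for _ in range(200)
)
```

Random candidates only give a sanity check, not a proof; the point is that the stationary point built from the top singular triplets attains the tail energy, which is the best possible value for a rank-r product under this assumed objective.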
The proof of Lemma 3.1 essentially computes the gradient of and sets it to zero; see Appendix B for details. Given this lemma, we can reduce to the form for some plus an extra term:
(5) |
where we define as the Lagrangian of problem (P), and the second equality holds because is closed and convex w.r.t. the argument . For any fixed value of , by Lemma 3.1, any local minimum of is globally optimal, because minimizing is equivalent to minimizing for a fixed .
Footnote 3: One can easily check that , where is the Lagrangian of the constrained optimization problem . With a slight abuse of notation, we call the Lagrangian of the unconstrained problem (P) as well.
The remaining part of our analysis is to choose a proper such that is a primal-dual saddle point of , so that and problem (P) have the same optimal solution . For this, we introduce the following condition, and later we will show that the condition holds with high probability.
Condition 1.
For a solution (, ) to problem (P), there exists an such that
(6) |
Explanation of Condition 1. We note that for a fixed . In particular, if we set to be the in (6), then and . So Condition 1 implies that is either a saddle point or a local minimizer of as a function of for the fixed .
The following lemma states that if it is a local minimizer, then strong duality holds.
Lemma 3.2 (Dual Certificate).
Let be a global minimizer of . If there exists a dual certificate satisfying Condition 1 and the pair is a local minimizer of for the fixed , then strong duality holds. Moreover, we have the relation .
Proof Sketch. By the assumption of the lemma, we can show that is a primal-dual saddle point to the Lagrangian ; see Appendix C. To show strong duality, by the fact that and that , we have for any where the inequality holds because is a primal-dual saddle point of . So on the one hand, On the other hand, by weak duality, we have Therefore, , i.e., strong duality holds. Therefore, as desired.
This lemma then leads to the following theorem.
Theorem 3.3.
Denote by the optimal solution of problem (P). Define a matrix space
Then strong duality holds for problem (P), provided that there exists such that
(7) |
Proof.
The proof idea is to construct a dual certificate so that the conditions in Lemma 3.2 hold. should satisfy the following:
(8) |
It turns out that for any matrix , and so , a fact that we will frequently use in the sequel. Denote by the left singular space of and the right singular space. Then the linear space can be equivalently represented as . Therefore, . With this, we note that: (b) and imply and (so ), and vice versa. And (c) implies that for an orthogonal decomposition , we have . Conversely, and condition (b) imply . Therefore, the dual conditions in (8) are equivalent to (1) ; (2) ; (3) . ∎
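The subspace used in the proof above can be made concrete. The following is a minimal sketch, assuming the space is the usual one spanned by matrices that share the left singular space or the right singular space of the solution; the names `tangent_projector`, `U`, `V`, and `P_T` are illustrative and not taken from the text. It implements the orthogonal projector onto that space and computes the component left over in the orthogonal complement.

```python
import numpy as np

def tangent_projector(U, V):
    """Orthogonal projector onto {U X^T + Y V^T} for orthonormal U and V."""
    def P_T(Z):
        left = U @ (U.T @ Z)                 # part living in the column space
        right = (Z @ V) @ V.T                # part living in the row space
        both = U @ (U.T @ Z @ V) @ V.T       # overlap, counted twice above
        return left + right - both
    return P_T

rng = np.random.default_rng(1)
n1, n2, r = 7, 5, 2
X = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))

# Skinny SVD of the rank-r matrix X.
U_full, s, Vt_full = np.linalg.svd(X, full_matrices=False)
U, V = U_full[:, :r], Vt_full[:r, :].T

P_T = tangent_projector(U, V)
Z = rng.standard_normal((n1, n2))
PZ = P_T(Z)
residual = Z - PZ   # component in the orthogonal complement
```

Under this assumption the residual equals the matrix obtained by projecting out the column space on the left and the row space on the right, so it is annihilated by both `U.T` and `V`.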
4 Matrix Completion
In matrix completion, there is a hidden matrix with rank . We are given measurements , where , i.e., is sampled uniformly at random from all subsets of of cardinality . The goal is to exactly recover with high probability. Here we apply our unified framework in Section 3 to matrix completion, by setting .
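The sampling model described above can be simulated directly. The following is a minimal sketch, assuming a small synthetic low-rank matrix; the names `sample_support`, `project_onto_support`, `X_true`, and the sizes are illustrative choices, not values from the text. It draws a support uniformly at random among all index sets of a fixed cardinality and zeroes out the unobserved entries.

```python
import numpy as np

def sample_support(n1, n2, m, rng):
    """Boolean mask of a support drawn uniformly among all subsets of size m."""
    flat = rng.choice(n1 * n2, size=m, replace=False)
    mask = np.zeros(n1 * n2, dtype=bool)
    mask[flat] = True
    return mask.reshape(n1, n2)

def project_onto_support(Z, mask):
    """Keep the entries of Z on the support and set the others to zero."""
    return np.where(mask, Z, 0.0)

rng = np.random.default_rng(2)
n1, n2, r, m = 6, 5, 2, 18

# Hidden rank-r matrix and its observed entries.
X_true = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))
mask = sample_support(n1, n2, m, rng)
observed = project_onto_support(X_true, mask)
```

Sampling without replacement is what makes the support uniform over sets of exactly the requested size; a Bernoulli model, where each entry is kept independently, is a different (though closely related) sampling scheme.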
A quantity governing the difficulty of matrix completion is the incoherence parameter . Intuitively, matrix completion is possible only if the information spreads evenly throughout the low-rank matrix. This intuition is captured by the incoherence conditions. Formally, denote by the skinny SVD of a fixed matrix of rank . Candès et al. [CLMW11, CR09, Rec11, ZLZ16] introduced the -incoherence condition (3) for the low-rank matrix . For condition (3), it can be shown that . The condition holds for many random matrices with incoherence parameter about [KMO10a].
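To give a feel for the incoherence parameter, the following is a minimal sketch that computes it for two extreme examples. It assumes condition (3) takes the standard form from the matrix completion literature, in which the squared row norms of the left and right singular vector matrices are bounded by the parameter times the rank divided by the corresponding dimension; the function name `incoherence` and the two test matrices are illustrative.

```python
import numpy as np

def incoherence(X, r):
    """Smallest parameter for which the standard incoherence bounds hold."""
    U_full, _, Vt_full = np.linalg.svd(X, full_matrices=False)
    U, V = U_full[:, :r], Vt_full[:r, :].T
    n1, n2 = X.shape
    # Largest squared row norm of U and V, rescaled by dimension over rank.
    mu_U = n1 / r * np.max(np.sum(U ** 2, axis=1))
    mu_V = n2 / r * np.max(np.sum(V ** 2, axis=1))
    return max(mu_U, mu_V)

# All-ones matrix: the singular vectors are perfectly spread out.
flat = np.ones((4, 4))

# Single nonzero entry: all the mass sits on one coordinate.
spiky = np.zeros((4, 4))
spiky[0, 0] = 1.0

mu_flat = incoherence(flat, 1)
mu_spiky = incoherence(spiky, 1)
```

Under this assumed form, the all-ones matrix attains the smallest possible value and the single-spike matrix the largest possible value for rank one, which matches the intuition that a matrix whose information sits in one entry cannot be recovered from random samples.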
We first propose a non-convex optimization problem whose unique solution is indeed the ground truth , and then apply our framework to show that strong duality holds for this non-convex optimization and its bi-dual optimization problem.
Theorem 4.1 (Uniqueness of Solution).
Let be the support set uniformly distributed among all sets of cardinality . Suppose that for an absolute constant and obeys -incoherence (3). Then is the unique solution of non-convex optimization
(9) |
with probability at least .
Proof Sketch. Here we sketch the proof and defer the details to Appendix F. We consider the feasibility of the matrix completion problem:
(10) |
Our proof first identifies a feasibility condition for problem (10), and then shows that is the only matrix which obeys this feasibility condition when the sample size is large enough. More specifically, we note that obeys the conditions in problem (10). Therefore, is the only matrix which obeys condition (10) if and only if does not follow the condition for all , i.e., , where is defined as
This can be shown by combining the satisfiability of the dual conditions in Theorem , and the well known fact that when the sample size is large.
Given the non-convex problem, we are ready to state our main theorem for matrix completion.
Theorem 4.2 (Efficient Matrix Completion).
Let be the support set uniformly distributed among all sets of cardinality . Suppose has condition number . Then there are absolute constants and such that with probability at least , the output of the convex problem
(11) |
is unique and exact, i.e., , provided that and obeys -incoherence (3). Namely, strong duality holds for problem (9).
Footnote 4: In addition to our main results on strong duality, in a previous version of this paper we also claimed a tight information-theoretic bound on the number of samples required for matrix completion; the proof of that latter claim was problematic as stated, and so we have removed that claim in this version.
Proof Sketch. We have shown in Theorem 4.1 that the problem exactly recovers , i.e., , with small sample complexity. So if strong duality holds, this non-convex optimization problem can be equivalently converted to the convex program (11). Then Theorem 4.2 is straightforward from strong duality.
It now suffices to apply our unified framework in Section 3 to prove the strong duality. We show that the dual condition in Theorem 4 holds with high probability by the following arguments. Let be a global solution to problem (11). For , we have
where the third equality holds since . Then we only need to show
(12) |
It is interesting to see that dual condition (12) can be satisfied if the angle between subspace and subspace is very small; see Figure 3. When the sample size becomes larger and larger, the angle becomes smaller and smaller (e.g., when , the angle is zero as ). We show that the sample size is a sufficient condition for condition (12) to hold.
This positive result matches a lower bound from prior work up to a logarithmic factor, which shows that the sample complexity in Theorem 4.1 is nearly optimal.
Theorem 4.3 (Information-Theoretic Lower Bound. [CT10], Theorem 1.7).
Denote by the support set uniformly distributed among all sets of cardinality . Suppose that for an absolute constant . Then there exist infinitely many matrices of rank at most obeying -incoherence (3) such that , with probability at least .
5 Robust Principal Component Analysis
In this section, we develop our theory for robust PCA based on our framework. In the problem of robust PCA, we are given an observed matrix of the form , where is the ground-truth matrix and is the corruption matrix which is sparse. The goal is to recover the hidden matrices and from the observation . We set .
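The observation model above can be illustrated with a small synthetic instance. The following is a minimal sketch, assuming a random low-rank component and a corruption matrix whose support is drawn uniformly among index sets of a fixed size; the sizes, the corruption magnitude, and the names `low_rank`, `sparse`, and `observed` are illustrative choices, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n2, r, k = 8, 8, 2, 10

# Low-rank ground truth built as a product of two thin factors.
low_rank = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))

# Support of the corruptions: uniform among all index sets of size k.
support = np.zeros(n1 * n2, dtype=bool)
support[rng.choice(n1 * n2, size=k, replace=False)] = True
support = support.reshape(n1, n2)

# Sparse corruption with random signs and a large (assumed) magnitude.
sparse = np.zeros((n1, n2))
sparse[support] = 10.0 * rng.choice([-1.0, 1.0], size=k)

# The only matrix available to the recovery procedure.
observed = low_rank + sparse
```

The recovery task is to split `observed` back into its two components without knowing the support; this sketch only generates data and does not implement any recovery method.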
To ensure that information spreads evenly throughout the matrix, the matrix cannot have one entry whose absolute value is significantly larger than the other entries. For the robust PCA problem, Candès et al. [CLMW11] introduced an extra incoherence condition (recall that is the skinny SVD of ):
(13) |
In this work, we make the following incoherence assumption for robust PCA instead of (13):
(14) |
Note that condition (14) is very similar to the incoherence condition (13) for the robust PCA problem, but the two notions are incomparable. Note that condition (14) has an intuitive explanation, namely, that the entries must scatter almost uniformly across the low-rank matrix.
We have the following results for robust PCA.