is a very fundamental and ubiquitous technique for unsupervised learning and dimensionality reduction; basically, this involves finding the best low-rank approximation to the given data matrix. To be precise, one common formulation of PCA is the following:
where is the input data matrix, where denotes the Frobenius norm of a matrix and . It is well-known that the constrained optimization problem given by Equation (1
) can be solved via the Singular Value Decomposition (SVD) and truncating the resultant decomposition to the top-
singular values and singular vectors yields the optimal solution(Eckart & Young, 1936)
. While this machine learning technique has umpteen number of applications, one of its main shortcomings is that it is not robust to the presence of gross outliers since the optimization involves just anobjective. To address this issue, the robust PCA technique – given such that , our aim is to find and which are low-rank and sparse matrix components respectively – was developed. Precisely, one hopes to solve the following problem (or its equivalent formulations):
where denotes the number of non-zero entries in a matrix, and . While Equation (1) may not be always well-posed, under certain identifiability conditions, many recent works over the past decade have advanced our understanding of this problem; we briefly recap some of the existing relevant results in Section 1.2.
|2 Work||Features||Approach||Incoherence||Sparsity||Corruption||Comp. complexity|
|2 (Candès et al. , 2011)||✗||Convex||Strong||Random|
|(Hsu et al. , 2011)||✗||Convex||Weak||Deterministic|
|(Netrapalli et al. , 2014)||✗||Non-convex||Weak||Deterministic|
|(Yi et al. , 2016)||✗||Non-convex||Weak||Deterministic|
|(Chiang et al. , 2016)||✓||Convex||Strong||Random|
1.1 Robust Inductive Learning: Motivation
A key point to be noted is that Equation (1) does not incorporate feature information; this is the so-called transductive setting. In practical applications, we often have feature information available in the form of feature matrices and . In the low-rank matrix recovery literature, this is often incorporated as a bilinear form, , which models the feature interactions via the latent space characterized by matrix ; this is the so-called inductive setting. We now present a motivating real-life situation.
Example 1.1 (Using features for collaborative filtering with grossly corrupted observations).
In recommendation systems, it is often the case that we have user-product ratings matrix along with side information in the form of features corresponding to each user and product. It is common in large-scale machine learning applications that the number of products and users is very large compared to the features available for each user or product. Though a user might not have used a product, we would like to infer how the user might rate that product given the user and product features – unlike the transductive setting this is possible in, and is a key application of, the inductive learning setting. Moreover, the ratings matrix is subject to various kinds of noise including erasures and outliers – in this work, we consider a general noise model using which robust recovery of ratings is possible.
It is the goal of this paper to focus on the practically useful regime of .
1.2 Related Work
We now present the related work in both transductive and inductive settings.
This is the relatively more well-explored setting. There are two main solution approaches that have been considered in the literature namely, the convex and the non-convex methods.
Convex methods entail understanding the properties of the convex relaxation of Equation (1) given by:
The works of (Chandrasekaran et al. , 2011) and (Hsu et al. , 2011) characterize the recovery properties of the convex program assuming a weak deterministic assumption on the support of the sparse matrix that the fraction of corrupted entries; the tightest bounds are that this fraction scales as . Under a stronger model of the sparse matrix namely, uniformly sampled support, (Candès et al. , 2011) show that it is possible to have when
for exact recovery with high probability. Numerically, the convex program in Equation (1.2) is most commonly solved by variants of sub-gradient descent (involving iterative soft-thresholding); the convergence rate known for trace-norm programs is (Ji & Ye, 2009) for an -close solution.
The underlying theme in non-convex methods involves retaining the formulation in Equation (1), starting with a suitable initialization and performing alternating projections onto non-convex sets (involving iterative hard-thresholding) until convergence. The work of (Netrapalli et al. , 2014) provides recovery guarantees under the weaker deterministic support assumptions matching the conditions of (Hsu et al. , 2011). However, the computational complexity of their algorithm scales with rank quadratically – to improve this, (Yi et al. , 2016) propose a (non-convex) projected gradient approach while paying a cost in the permissible number of sparse corruptions, i.e., as opposed to . A consequence of the analysis of these non-convex methods is that they admit a faster convergence rate – specifically, iterations for an -close solution – as opposed to convex methods.
It is noteworthy that the matrix completion problem (see, for instance, (Recht, 2011) and (Jain & Netrapalli, 2014)), where the goal is to recover an incomplete low-rank matrix, is a special case of the robust PCA problem where is taken to be for the non-observed entries. Finally, we note that the robust PCA problem has been invoked in several applications including topic modeling (Min et al. , 2010), object detection (Li et al. , 2004) and so on.
To the best of our knowledge, currently, there is only one other work due to (Chiang et al. , 2016) which considers the robust PCA problem in the inductive setting and presents a guaranteed convex optimization procedure for solving it; incorporating additional feature information into the robust PCA problem, they solve the following convex program, known as PCPF:
For this paragraph, let , be the SVD of , , and denote the standard basis vector in ; the key recovery guarantee states that and ; most notably, these guarantees are derived under stronger assumptions namely, (1) strong incoherence property, i.e., , , , , (2) random sparsity, i.e., the support of is drawn uniformly at random from all subsets of of size . Note that assumptions such as uniform support sampling may not be realistic in practice. In contrast, as we explain in Sections 1.3 and 2.2, our work relaxes the assumptions they require while admitting a simpler algorithm, novel analysis approach and faster convergence result.
In this context, it is also to be mentioned that for the related problem of inductive matrix completion is relatively better understood; recovery guarantees are known for both the convex (see, for instance, (Xu et al. , 2013) and (Chiang et al. , 2015)) and the non-convex (e.g., (Jain & Dhillon, 2013)) approaches. Other related works based on probabilistic modeling include (Zhou et al. , 2012) and (Porteous et al. , 2010).
To summarize, we position this paper with respect to other works in Table 1. While we have highlighted the most relevant existing results, note that the list provided here is by no means comprehensive – such a list is beyond the scope of this work.
1.3 Our Contributions
To the best of our knowledge, our work is the first to derive a provable and efficient non-convex method for robust PCA in the inductive setting. Our novelty and technical contributions can be summarized along the following axes:
Assumptions (Section 2.2): We use the weakest assumptions, i.e., (1) weak incoherence conditions on only the feature matrices and (2) (weak) deterministic support of the sparse matrix.
Algorithm (Section 2.4): Our algorithm (IRPCA-IHT) performs simple steps involving spectral and entry-wise hard-thresholding operations.
Experiments (Section 4): We substantiate our theoretical results by demonstrating gains on both synthetic and real-world experiments.
2 Problem Setup
2.1 Notation and Preliminaries
Let , i.e., are matrices such that the input data matrix is the superposition of two component matrix signals namely, the low-rank component and the sparse component . Here, is a sparse perturbation matrix with unknown (deterministic) support and arbitrary magnitude. In our inductive setting, side information or features are present in the bilinear form specified . The feature matrices are denoted as and . Note that the feature dimensions are and such that and is the rank-
latent matrix to be estimated where; intuitively, this latent matrix parameter describes the interaction and correlation among the feature vectors. Now, our optimization problem is given by:
Here, for a matrix , we define the relevant functions, , , , Frobenius norm , spectral norm for unit vectors and . Next, for a matrix , we denote its maximum and minimum singular value by and respectively, and further the condition number of is denoted by . The pseudoinverse of a matrix is denoted by and is computed as where is assumed to be of full rank. Let
denote the identity matrix whose size will be clear from the context. Finally, we useto denote the standard basis vector in the appropriate dimension, which will also be clear from the context.
Remark 2.1 (Noisy case: motivation and setup).
Note that, so far, for simplicity and clarity, we have been focusing on the case when where . This model posits that is exactly a rank- matrix and is exactly a sparse matrix which might not be the case in practice. Our approach, for solving Equation 5, in terms of both the algorithm and the analysis, also handles the noisy case wherein is some generic bounded additive noise that renders approximately low-rank or approximately sparse.
We now state and explain the intuition behind the (by now standard) identifiability assumptions on the quantities involved in our optimization problem so that it is well-posed. Also, we re-emphasize specifically that Assumptions 2 and 3 are much weaker and generic than previous works such as (Chiang et al. , 2016).
Feasibility condition: We assume that and .
Weak incoherence of the feature matrices: Let be the SVD of the feature matrix such that , are the matrices of left and right singular vectors respectively, and is the diagonal matrix of singular values. Then, we assume where is called the incoherence constant of matrix . Similarly, we assume incoherence of as well.
Bounded deterministic sparsity: Let the number of non-zeros per row of the sparse matrix satisfy ; similarly, let the number of non-zeros per column of the sparse matrix satisfy . Here, and .
Bounded latent matrix: Without loss of generality, we assume that the latent matrix is bounded, ie, for a global constant .
Having side information always need not help; otherwise, we may always generate random features and obtain improvement over transductive learning. In disallowing this, Assumption 1 is a necessary condition, which ensures that we have informative features and in the sense that they are correlated meaningfully in the latent space given by .
In order to make the low-rank component not too sparse and distinguishable from the sparse perturbation, we make the weak incoherence assumption on the feature matrices which says that the energy of the right singular vectors of the matrices is well-spread with respect to all the co-ordinate axes. This is precisely quantified by Assumption 2.
In our problem setup we assume that a generic (possibly adversarial) deterministic sparse perturbation is added to the low-rank matrix. This is quantified by Assumption 3. In particular, we do not have any specific distributional assumptions on the support of the sparse matrix, and the magnitudes and signs of its non-zero entries.
Remark 2.2 (Noisy case: assumptions).
To obtain recovery guarantees for the noisy case described in Remark 2.1, the only assumption on we have is that it is suitably well-behaved – this is quantified by assuming .
2.3 Corruption Rate
In this work, as give in Table 1, we refer to the rank-sparsity trade-off in Assumption 3 as ‘corruption rate’ – this is the allowable extent to which the model is robust to gross outliers while retaining identifiability, ie, the number of non-zeros in the sparse corruption matrix. Note that, by using features, we are always able to tolerate (resp. ) gross corruptions per row (resp. column). This is a gain over the transductive setting as in (Netrapalli et al. , 2014) where the permissible number of outliers is (resp. ) per row (resp. column) and could be potentially .
Our method, presented in Algorithm 1, uses two non-convex projection operations as building blocks. Our algorithm essentially applies these projections to the low-rank and sparse residuals in an alternating manner until convergence, i.e., at the iteration, the residuals and are projected onto the set of sparse and low-rank matrices respectively via the following hard-thresholding operations:
Spectral hard thresholding: This is used for projecting a matrix onto the set of low-rank matrices. It is achieved via the truncated-SVD operation and is denoted by . Here, we are finding a matrix rank- matrix which best approximates .
Entry-wise hard thresholding: This is used for projecting a matrix onto the set of sparse matrices. We compute a matrix where if and if .
Note that the above hard thresholding operations result in in rank-restricted and sparsity-restricted matrices for appropriate choices of and
. It is noteworthy that our algorithm, unlike many non-convex optimization procedures, employs the very simple initialization scheme of setting the initial iterates to the all-zeros matrix () while achieving global convergence.
The algorithm needs (a) the true rank of , and (b) the noise parameter (for which it suffices to have the knowledge of a reasonable bound on ). In practice, the knowledge of and can be obtained using cross-validation, grid search or leveraging domain knowledge of the specific application; for instance, in the noiseless setting, and hence, is set to zero. Furthermore, efficient ways of estimating the incoherence of a matrix have been studied in the literature; see for instance, (Mohri & Talwalkar, 2011) and (Drineas et al. , 2012).
A key difference from related approaches in the transductive setting (Netrapalli et al. , 2014) is the more efficient spectral hard thresholding that is possible due to the available feature information, i.e., our approach involves a truncated SVD operation in the feature space rather than the ambient space which is computationally inexpensive. Specifically, since , in Step 7 of Algorithm 1, we find the best matrix such that for every . This is achieved via a bilinear transformation of the residual given by followed by a truncated -SVD of the resulting matrix to obtain . Note that the low-rank iterates may then be computed as ; specifically, at termination.
2.5 Computational Complexity
We now infer the per-iteration computational complexity from Algorithm 1, specifically Steps 6-8. The entry-wise hard-thresholding in Step 6 has a time complexity of . The spectral hard-thresholding in Step 7 has a time complexity of due to the involved matrix multiplication followed by the truncated SVD operation. Step 8 has a complexity of . Unlike previous (Chiang et al. , 2016) trace norm based approaches in the inductive setting, we directly perform rank- SVD in Step 7 leading to a complexity of just as opposed to ; this is a significant gain when . In the transductive setting as well, our method has significant computational gains over the state-of-the art AltProj algorithm of (Netrapalli et al. , 2014), especially in the regime while maintaining the corruption rate guarantees as in Section 2.3.
3.1 Proof Outline
For simplicity, we first begin with the symmetric noiseless case (Section 3.2). Upon presenting the convergence result for this case, we show how to extend our analysis and result to general cases including the noisy case (Section 3.3) and the asymmetric matrix case (Section 3.4).
The key steps in the proof of convergence of Algorithm 1 involve analyzing the two main hard-thresholding operations and controlling the error decrease, in terms of a suitably chosen potential function, as a result of performing these operations. Since we care about recovering every entry of both the low-rank and the sparse matrix components, we choose the infinity norm of appropriate error matrices as our potential function to track the progress of our algorithm. Bounds in the infinity norm are trickier to obtain than the more usual spectral norm. Consequently, our guarantees are stronger as opposed to showing faithful recovery in the spectral or Frobenius norms. Specifically, for a given , we show that . Upon showing this geometric reduction in error, we use induction to stitch up argument across iterations.
At a high level, the proof techniques involved for a fixed are as follows:
Entry-wise hard thresholding: The are two aspects here. First, given that is close to , we show, by using a case-by-case argument, that is also close to . Second, we show, by contradiction, that the does not have any spurious entries that are not present in originally.
Spectral hard thresholding: Given that is close to , we show that gets closer to than
. There are three aspects here. First, we use the weak incoherence property of features to obtain infinity norm bounds. Second, we use Weyl’s eigenvalue pertubation lemma to quantify how close the estimateis to the true latent matrix . Third, we bound the spectral norm of a sparse matrix tightly in terms of its infinity norm.
For the noisy case, using Remark 2.1, we simply account for the noise terms as well in the error reduction argument. Extension to the asymmetric case proceeds via the standard symmetric embedding technique, both for the noiseless and the noisy setting, as detailed in Section 3.4; a key point to be noted here is that we maintain the rank-sparsity conditions in the symmetrized matrix.
3.2 Symmetric Noiseless Case
Let , and . For simplicity, let the features be equal i.e., and . Further, let , and . Also, recall that in the noiseless case. We now state our main result.
Theorem 3.1 (Noiseless case: fast and correct convergence).
Several implications are immediate from Theorem 3.1 : (1) our algorithm converges to the true parameters at a linear rate; (2) we have faithful latent space recovery as well as outlier detection; (3) assumptions used for deriving the recovery guarantee are weaker than previous works in the inductive setting; (4) we achieve improved corruption rate; (5) guarantees for the transductive robust PCA problem are recovered if the features are identity matrices and
: (1) our algorithm converges to the true parameters at a linear rate; (2) we have faithful latent space recovery as well as outlier detection; (3) assumptions used for deriving the recovery guarantee are weaker than previous works in the inductive setting; (4) we achieve improved corruption rate; (5) guarantees for the transductive robust PCA problem are recovered if the features are identity matrices and; in particular, our corruption rate bounds match up to a factor of .
We now prove Theorem 3.1.
We prove this by induction over . Note that Step 3 of Algorithm 1 initializes (as ) and sets for all . For , since by our initialization, it is clear that and hence the base case holds.
Lemma 3.1 (Noiseless case: faithful support recovery due to entry-wise hard thresholding).
Let satisfy the error condition that . Then, we have and .
Note that . By the definition of our entry-wise hard thresholding operation, we have the following:
Term when .Thus .
Term when . Using the triangle inequality, we have .
Thus, the above two cases show the validity of the entry-wise hard thresholding operation. To show correct support recovery, we show that for any given , if then is also zero for all . Noting that and , iff . But this is a contradiction since by the inductive assumption. ∎
Lemma 3.2 (Noiseless case: error decay due to spectral hard thresholding).
Let satisfy the error condition that . Then, we have and .
Using the fact that and , we have
where follows by substituting the SVD of , i.e., and follows from the sub-multiplicative property of the spectral norm. Now, from Assumption 2, we have
Recall from Step 7 of Algorithm 1 that is computed as where . Let . Further, let be the full SVD of , where and span orthogonal sub-spaces of dimensions and respectively, and is the pseudoinverse of . Next, using these and the unitary invariance property of the spectral norm, we have
where is obtained by substituting , by triangle inequality. Inequality is obtained by using Weyl’s eigenvalue perturbation lemma (Bhatia, 2013), which is:
3.3 Symmetric Noisy Case
|(a) sparsity||(b) rank||(c) feature dimension||(d) condition number|
Next we consider the general noisy case of , where , and are symmetric and is a bounded additive noise matrix satisfying properties as given in Remark 2.2. Note that, in practice, by setting for a suitably chosen constant , Algorithm 1 works unchanged. However, in order to establish convergence in theory, the key challenge is to be able to control the perturbation effects of in each iteration. In making this precise, we now state our main result for this section whose proof is given in Appendix A due to space limitations.
Theorem 3.2 (Noisy case: fast and correct convergence).
To prove the above theorem, we need the following key lemmas whose proofs are given in Appendix A as well.
Lemma 3.3 (Noisy case: faithful support recovery due to entry-wise hard thresholding).
Let satisfy the error condition that . Then, we have and .
Lemma 3.4 (Noisy case: error decay due to spectral hard thresholding).
Let satisfy the error condition that . Then, we have and .
3.4 Asymmetric Case
We now show how to extend our analysis for any general asymmetric matrix, both in the noiseless and the noisy inductive settings. Let be the input data matrix. The main result can be stated as:
Consider the standard symmetric embedding of a matrix given by:
With as input, the intermediate iterates of our algorithm also have a similar form. Moreover, note that this embedding preserves the rank, incoherence and sparsity properties – due to space constraints, these details which are needed as the key components of the proof of Claim 3.1 are deferred to Appendix B.
In this section, we conduct a systematic empirical investigation of the performance of our robust subspace recovery method (IRPCA-IHT) and justify our theoretical claims in the previous sections. Specifically, the goal of this study is to show: (1) the correctness of our algorithm, (2) that informative features and feature correlations are indeed useful, and (3) that our algorithm is computationally efficient.
4.1 Synthetic Simulations
We set the problem size as ; for simplicity, we take , and ; let be the condition number of the feature matrix . First, we generate approximately well-conditioned weakly incoherent feature matrices by computing where the entries of and
are drawn iid from the standard normal distribution followed by row normalization, and the diagonal entries ofare set to one. Next, the latent matrix is generated by sampling each entry independently and uniformly at random from the interval , performing SVD of this sampled matrix and retaining its top singular values. The low-rank component is then computed as . Note that this also ensure the feasibility condition in Assumption 1. Next, we generate the sparse matrix as follows. We first choose the support according to the Bernoulli sampling model, ie, each entry is chosen to be included in the support with probability and then its value is chosen independently and uniformly at random from .
There are four main parameters in the problem namely, (a) the sparsity level of , (b) the rank of , (c) the feature dimension , and (d) condition number of the feature matrix ; we vary each of these while fixing the others. We compare the performance of our algorithm to that of two existing algorithms namely, (i) the convex relaxation approach ‘PCPF’ due to (Chiang et al. , 2016) which is a state-of-the-art robust PCA method in the inductive setting, and (ii) ‘AltProj’ due to (Netrapalli et al. , 2014) which is a state-of-the-art robust PCA method in the transductive setting. We execute these algorithms until an accuracy of is achieved and time them individually. All the results presented in the running time plots in Figure 1 are obtained by averaging over five runs.
We note that our algorithm outperforms PCPF and AltProj consistently while increasing the problem hardness in three situations (Figures 1-(a), 1-(b) and 1-(c)) in terms of running time. The gain in terms of scalability of our method over the convex PCPF method is attributed to the fact that the soft thresholding operation for solving the nuclear-norm objective involves computing the partial-SVD of the intermediate iterates which could of potentially much higher rank than – this leads to worst-case time complexity for the SVD step in PCPF as opposed to our algorithm which has worst-case complexity for spectral hard thresholding. The time gain over the transductive AltProj method is attributed to the fact that our spectral hard-thresholding is performed in the -dimensional (feature) space rather than the -dimensional (ambient) space; moreover, another factor that adds to the running time of AltProj is that it proceeds in stages unlike Algorithm 1. An interesting point to be noted from the relatively flat plot in Figure 1-(d) is that the condition number dependence in Assumption 3 is merely an artifact of our analysis and is not inherent to the problem; we leave tightening this bound in theory to future work.
4.2 Real-Data Experiments
As described in Example 1.1, we consider an important application of our method – to robustify estimation in recommendation systems while leveraging feature information; specifically, the task is to predict user-movie ratings accurately despite the presence of gross sparse corruptions. We take the MovieLens 111http://grouplens.org/datasets/movielens/ dataset which consists of ratings from users on movies. The ground-truth in this dataset is, per se, unavailable. Hence, as the first step, we apply matrix completion techniques (specifically, using the OptSpace algorithm of (Keshavan et al. , 2010)) to obtain a baseline complete user-movie ratings matrix, ; we take . Next, we form features while ensuring the feasibility condition. For this, we compute the SVD of the baseline matrix, followed by setting (resp. ) where (resp. ) are random rotation matrices; we take and . Note that forming features using the SVD result, as we have done here, is a common technique in inductive matrix estimation problems (see, for instance, (Natarajan & Dhillon, 2014)). We then add a sparse perturbation matrix whose each entry is chosen to be included in the support with probability and the entries are chosen independently and uniformly at random from