1 Introduction
Many machine learning and artificial intelligence tasks involve the separation of a data matrix into a lowrank structure and a sparse part capturing different information. Robust principal component analysis (RPCA)
Candes et al. (2011); Chandrasekaran et al. (2011) is a popular framework that logically characterizes this matrix separation problem.Nevertheless, prior side information, oftentimes in the form of features, may also be present in practice. For instance, features are available for the following tasks:

Collaborative filtering: apart from ratings of an item by other users, the profile of the user and the description of the item can also be exploited in making recommendations Chiang et al. (2015);

Relationship prediction: user behaviours and message exchanges can assist in finding missing links on social media networks Xu et al. (2013);

Personspecific facial deformable models: an orthonormal subspace learnt from manually annotated data captured inthewild, when fed into an image congealing procedure, can help produce more correct fittings Sagonas et al. (2014).
It is thus reasonable to investigate how propitious it is for RPCA to exploit the available features. Indeed,
recent results Liu et al. (2017) indicate that features are not redundant at all. In the setting of multiple subspaces, RPCA degrades as the number of subspaces grows because of the increased rowcoherence. On the other hand, the use of feature dictionaries allows accurate lowrank recovery by removing the dependency on rowcoherence. Despite the theoretical and practical merits of convexified RPCA with features, such as LRR Liu et al. (2010) and PCPF Chiang et al. (2016), convex relaxations of the rank function and norm necessarily lead into algorithm weakening Chandrasekarana and Jordan (2013).
On a separate note, recent advances in nonconvex optimization algorithms continue to undermine their convex counterparts Gong et al. (2013); Ge et al. (2016); Kohler and Lucchi (2017). In particular, nonconvex RPCA algorithms such as fast RPCA Yi et al. (2016) and AltProj Netrapalli et al. (2014) exhibit better properties than the convex formulation. Most recently, Niranjan et al. (2017) embedded features into a nonconvex RPCA framework known as IRPCAIHT with faster speed. However, it remains unclear as to whether features have been effectively incorporated into nonconvex RPCA and the benefits of accuracy, speed and so on have been exploited as much as possible.
In this work, we give positive answers to the above questions by proposing a novel, nonconvex scheme that fully leverages features, which reveal true row and column subspaces, to decompose an observation matrix into a core matrix with given rank and a residual part with informed sparsity. Even though the proposed algorithm is inspired by the recently proposed fast RPCA Yi et al. (2016), our contributions are by no means trivial, especially from a theoretical perspective. First, fast RPCA cannot be easily extended to consistently take account of features. Second, as we show in this paper, incoherence assumptions on the observation matrix and features play a decisive role in determining the corruption bound and the computational complexity of the nonconvex algorithm. Third, fast RPCA is limited to a corruption rate of due to their choice of the hard threshold, whereas our algorithm ups this rate to . Fourth, we prove that the costly projection onto factorized spaces is entirely optional when features satisfy certain incoherence conditions. Although our algorithm maintains the same corruption rate of and complexity of as fast RPCA, we show empirically that massive gains in accuracy and speed can still be obtained. Besides, the transfer of coherence dependency from observation to features means that our algorithm is capable of dealing with highly incoherent data.
Unavoidably, features adversely affect tolerance to corruption in IRPCAIHT () compared to its predecessor AltProj (). This is not always true with our algorithm in relation to fast RPCA. And when the underlying rank is low but features are only weakly informative, i.e. , which is often the case, our tolerance to corruption is arguably better. IRPCAIHT also has a higher complexity of than that of our algorithm. Although featurefree convex and nonconvex algorithms have higher asymptotic error bounds than our algorithm, we show in our experiments that this does not translate as accuracy in reality. Our algorithm still has the best performance in recovering accurately the lowrank part from highly corrupted matrices. This may be attributed to the fact that our bounds are not tight. Besides, PCPF and AltProj have much higher complexity ( and ) than ours. For PCPF, there does not exist any theoretical analysis under the deterministic sparsity model. Nonetheless, we show in our experiments that our algorithm is superior with regard to both recoverability and running time. The overall contribution of this paper is as follows:

A novel nonconvex algorithm integrating features with informed sparsity is proposed in order to solve RPCA problem.

We establish theoretical guarantees of exact recovery under different assumptions regarding the incoherence of features and observation.

Extensive experimental results on synthetic data indicate that the proposed algorithm is faster and more accurate in lowrank matrix recovery than the compared stateoftheart convex and nonconvex methods for RPCA (with and without features).

Experiments on two realworld datasets, namely MNIST and Yale B database demonstrate the practical merits of the proposed algorithm.
2 Notations
Lowercase letters denote scalars and uppercase letters denote matrices, unless otherwise stated. and represent the row and the column of . Projection onto support set is given by . is the elementwise absolute value of matrix . For norms of matrix , is the Frobenius norm; is the nuclear norm;
is the largest singular value; otherwise,
is thenorm of vectorized
; and is the maximum of matrix row norms. Moreover, represents tr() for real matrices . Additionally, is the largest singular value of a matrix.The Euclidean metric is not applicable here because of the nonuniqueness of the bifactorisation , which corresponds to a manifold rather than a point. Hence, we define the following distance between and any of the optimal pair such that :
(1) 
where is an orthogonal matrix.
3 Related Work
RPCA concerns a known observation matrix which we are seeking to decompose into matrices , such that is lowrank and is sparse and of arbitrary magnitude. Conceptually, it is equivalent to solving the following optimization problem:
(2) 
for appropriate . This problem, regrettably, is NPhard.
PCP Wright et al. (2009) replaces (2
) with convex heuristics:
(3) 
for some . In spite of the simplification, PCP can exactly recover the solution of RPCA under the random model Candes et al. (2011) and the deterministic model Chandrasekaran et al. (2011); Hsu et al. (2011).
If feasible feature dictionaries, and , regarding row and column spaces are available, PCPF Chiang et al. (2016) makes use of these to generalize (3) to the below objective:
(4) 
for the same as in (3). Convergence to the RPCA solution has only been established for the random sparsity model.
AltProj Netrapalli et al. (2014) addresses RPCA by minimizing an entirely different objective:
(5) 
where the search consists of alternating nonconvex projections. That is, during each cycle, hardthresholding takes place first to remove large entries and projection of appropriate residuals onto the set of lowrank matrices with increasing ranks is carried out next. Exact recovery has also been established.
Fast RPCA Yi et al. (2016) follows yet another nonconvex approach to solve RPCA. After an initialization stage, fast RPCA updates bilinear factors , such that
through a series of projected gradient descent and sparse estimations, where
, minimize the following loss:(6) 
for , properly constrained. Recovery guarantee is ensured.
IRPCAIHT Niranjan et al. (2017) includes features , in an iterative nonconvex projection algorithm. Similar to AltProj, at each step, a new sparse estimate is calculated from hard thresholding via a monotonically decreasing threshold. After that, spectral hard thresholding takes place to attain the lowrank estimate. IRPCAIHT provably converges to the solution of RPCA.
We also mention here several works of nonconvex objectives Oh et al. (2015); Shang et al. (2017), though exact recovery guarantees are lacking.
4 Problem Setup
Suppose that there is a known data matrix , which can be decomposed into a lowrank component and a sparse error matrix of compatible dimensions. Our aim is to identify these underlying matrices and hence robustly recover the lowrank component with the help of available side information in the form of feature matrices and .
Concretely, let
be the singular value decomposition and
and . follows the random sparsity model. That is, the support of is chosen uniformly at random from the collection of all support sets of the same size. Furthermore, let us be informed of the proportion of nonzero entries per row and column, denoted by . Assume that there are also available features and such that they are feasible, i.e. col()col() and col()col() where col() is the column space of and ^{1}^{1}1This can always achieved via orthogonalisation..In this paper, we discuss robust lowrank recovery using the above mentioned features and three different incoherence conditions: (i) and ; (ii) and ; (iii) both (i) and (ii), where is the given rank of and , are constants.
5 Algorithm
We use a nonconvex approach to achieve the above objective. The algorithm consists of an initialization phase followed by a gradient descent phase. At each stage, we keep track of the factors , such that .
5.1 Hardthresholding
We first introduce the sparse estimator via hardthresholding which is used in both phases. Given a threshold , removes elements of that are not among the largest fraction of elements in their respective rows and columns, breaking ties arbitrarily for equal elements:
(7) 
where are the and largest element in absolute value in row and column respectively.
5.2 Initialization
is first initialized as . Next, we obtain as the truncated SVD of , which is calculated via . We can then construct and . Such an initialization scheme gives , the desirable properties for use in the second phase.
5.3 Gradient Descent
In case (i), we need the following sets:
(8) 
(9) 
Otherwise, we can simply take as and as .
To proceed, we first regularise and :
(10) 
At each iteratiion, we first update with the sparse estimator using a threshold of :
(11) 
For , , we define the following objective function
(12) 
and are updated by minimizing the above function subject to the constraints imposed by the sets and . That is,
(13) 
(14) 
where the step size is determined analytically below. With properly initialized and , such an optimization design converges to and . The procedure is summarized in Algorithm 1.
6 Analysis
We first provide theoretical justification of our proposed approach. Then we evaluate its computational complexity. The proofs can be found in the supplementary material.
6.1 Convergence
The initialization phase provides us with the following guarantees on and .
Theorem 1.
In cases (i) and (iii), if , we have
(15) 
In case (ii), if , we have
(16) 
where is the condition number of and is a distance metric defined in the appendix.
Theorem 2.
For , there exist constants , , , , and such that, in case (i), when , we have the following relationship
(17) 
in case (ii), when , we have
(18) 
and in case (iii), when , we have
(19) 
6.2 Complexity
From Theorem 2, it follows that our algorithm converges at a linear rate under assumptions (ii) and (iii). To converge below of the initial error, iterations are needed. At each iteration, the most costly step is matrix multiplication which takes time. Overall, our algorithm has total running time of .
7 Experimental results
We have found that when the step size is set to 0.5, reasonable results can be obtained. For all algorithms in comparison, we run a total of 3000 iterations or until is met.
7.1 Phase transition
Here, we vary the rank and the error sparsity to investigate the behavior of both our algorithm and existing stateofart algorithms in terms of recoverability. True lowrank matrices are created via , where matrices
have independent elements drawn randomly from a Gaussian distribution of mean
and variance
so becomes the rank of . Next, we corrupt each column of such that of the elements are set independently with magnitude . However, this does not guarantee row corruption. We thus select only matrices whose maximum row corruption does not exceed but we still feed to the algorithms in order to demonstrate that our algorithm does not need the exact value of corruption ratio. We consider two types of signs for error: Bernoulli and . The resulting thus becomes the simulated observation. In addition, let be the SVD of . Feature is formed by randomly interweaving column vectors of with 5 arbitrary orthonormal bases for the null space of , while permuting the expanded columns of with 5 random orthonormal bases for the kernel of forms feature . Hence, the feasibility conditions are fulfilled: colcol, colcol. For each pair, three observations are constructed. The recovery is successful if for all these three problems,(20) 
from the recovered .
Figures 1(I) plot results from algorithms incorporating features. Besides, our algorithm contrasts with fast RPCA in Figure 1(II). Other featurefree algorithms are investigated in Figure 1(III). Figures 1(a) illustrate the random sign model and Figures 1(b) for the coherent sign model. All previous nonconvex attempts fail to outperform their convex equivalents. IRPCAIHT is unable to deal with even moderate levels of corruption. The frontier of recoverability that has been advanced by our algorithm over PCPF is phenomenal, massively ameliorating fast RPCA. The anomalous asymmetry in the two sign models is no longer observed in nonconvex algorithms.
7.2 Running Time
Next, we highlight the speed of our algorithm for largescale matrices, typical of video sequences Xiong et al. (2016). 15001500 to 25002500 random observation matrices are generated, where the rank is chosen to be 20 of the column number and random sign error corrupts 11 of the entries, with features having a dimension of 50 of the column number. The running times of all algorithms except IRPCAIHT are plotted in 2 (i) because IRPCAIHT is not able to achieve a relative error () less than 1 for larger matrices. For fair comparison, we have relaxed the rank to 0.3 of the column number and error rate to 0.1 to compare our algorithm with IRPCAIHT for matrices ranging from 20002000 to 1000010000. We have used features having a dimension of 80 of the column number to speed up the process. The result is shown in Figure 2 (ii). All times are averaged over three trials. It is evident that, for large matrices, our algorithm overtakes all existing algorithms in terms of speed. Note that features in PCPF even slow down the recovery process.
clean  noisy  PCP  PCPF  AltProj  IRPCAIHT  fast RPCA  our algorithm  
10  30.45  82.75  83.35  81.4  65.2  81.1  86.9  
15  25.1  82.95  83.4  81.15  49.65  79.65  84.8  
20  89.65  23.15  83.5  84  79.3  37.8  78.65  83.8 
25  18.65  81.35  82.65  74.05  30.35  75.3  83.15  
30  18.6  77.95  79  71.5  24.1  72.9  82.05  
35  16.95  71.2  73.4  67.75  21.05  71.45  79.05 
clean  noisy  PCP  PCPF  AltProj  IRPCAIHT  fast RPCA  our algorithm  
10  87  87.25  87.3  86.45  89.3  89.25  90.3  
15  75.85  87.15  87.4  86.75  82.85  87.2  89.8  
20  92.25  64.35  87.6  87.55  84.65  71.2  85.55  88.55 
25  55.85  87  86.95  79.4  62.35  82.65  87.8  
30  47.15  81.15  81.55  76.75  53.5  78.3  85.65  
35  40.55  74.8  75.7  71  47.4  76.75  85.15 
7.3 Image Classification
Once images are denoised, classification can be performed on them. The classification results directly reflect the image denoising ability. For a set of correlated images, lowrank algorithms are normally used to remove noise that is sparse. The same classifier is thus able to compare the different lowrank models.
The MNIST dataset is such an example which contains handwritten digits divided into training and testing sets. Let the observation matrix be composed of 2000 vectorized random images from the test set stacked columnwise. In this case, the left feature obtained from the training set is also applicable to the test set because of the Eigendigit nature. This imparts our algorithm to supervised learning where there are clean related training samples available. The right feature does not posses such property and is set to the identity matrix. We add a range of sparse noise to the test set separately where the noise sets the pixel to 255. For PCPF, we take
as in Chiang et al. (2016) and for IRPCAIHT and our algorithm we use instead.The relative error between the recovered matrix by the competing algorithms and the clean test matrix is plotted in Figure 3. Our algorithm is most accurate in removing the added artificial noise. To evaluate how classifiers perform on the recovered matrices, we train the linear and kernel SVM using the training set and test the corresponding models on the recovered images. Table 1 tabulates the linear SVM. Table 2 tabulates the kernel SVM. Both classifiers confirm the recovery result obtained by various models corroborating our algorithm’s preeminent accuracy.
7.4 Face denoising
It is common practice to decompose raw facial images as a lowrank component for faithful face representation and a sparse component for defects. This is because the face is a convex Lambertian surface which under distant and isotropic lighting has an underlying model that spans a 9D linear subspace Basri and Jacobs (2003), but theoretical lighting conditions cannot be realised and there are unavoidable occlusion and albedo variations in real images. We demonstrate that there can be a substantial boost to the performance of facial denoising by leveraging dictionaries learnt from the images themselves.
The extended Yale B database is used as our observation which consists images under different illuminations for a fixed pose. We study all 64 images of a randomly chosen person. A observation matrix is formed by vectorizing each image. For fast RPCA and our algorithm, a sparsity of 0.2 is adopted. We learn the feature dictionary as in Xue et al. (2017). In a nutshell, the feature learning process can be treated as a sparse encoding problem. More specifically, we simultaneously seek a dictionary and a sparse representation such that:
(21) 
where is the number of atoms, ’s count the number of nonzero elements in each sparsity code and is the sparsity constraint factor. This can be solved by the KSVD algorithm. Here, feature is the dictionary , feature corresponds to a similar solution using the transpose of the observation matrix as input. We set to , to and used iterations.
As a visual illustration, recovered images from all algorithms are exhibited in Figure 4. For this challenging scenario, our algorithm totally removed all shadows. PCPF is smoother than PCP but still suffers from shade. AltProj and fast RPCA both introduced extra artefacts. Although IRPCAIHT managed to remove the shadows but brought back a severely distorted image. To quantitatively verify the improvement made by our proposed method, we examine the structural information contained within the denoised eigenfaces. Singular values of the recovered lowrank matrices from all algorithms are plotted in Figure 5. All nonconvex algorithms are competent in incorporating the rank information to keep only 9 singular values, vastly outperforming convex approaches. Among them, our algorithm has the most rapid decay that is found naturally Wright et al. (2011).
8 Conclusion
This work proposes a new nonconvex algorithm to solve RPCA with the help of features when the error sparsity is roughly known. Exact recovery guarantee has been established for three different assumptions about the incoherence conditions on features and the data observation matrix. Simulation experiments suggest that our algorithm is able to recover matrices of higher ranks corrupted by errors of higher sparsity than previous stateoftheart approaches. Large synthetic matrices also show that our algorithm scales best with observation matrix dimension. MNIST and Yale B datasets further justify that our algorithm leads other approaches by a fair margin. Future work may involve finding a more accurate initialization scheme.
Appendix A Convex Projection
Given , the problem of finding can be seen as projection onto the intersection of a series of closed convex sets , that is , where . We have emperically found that the Cyclic Dykstra algorithm Reich and Zaslavski (2012) has the fastest rate of convergence. Let , and , the Cyclic Dykstra algorithm updates, at each iteration, and .
For , we formulate the equivalent optimisation problem below
(22) 
for . Its solution is given by
(23) 
For , follows similarly.
We have also run experiments to see how much improvement can be gained by convex projection. 200200 highincoherence matrices are created with ranks from 140 to 155 and corrupted by 10 random sign errors. Our algorithm is applied with projection several times. Each uses a different number of iterative steps ranging from 0 to 2000. Recoverability is plotted against the number of iterative projections in Figure 6. There is hardly any noticeable improvement so we do not use convex projection in our comparison experiments. Further analysis is demanded to justify the redundency of convex projection.
Appendix B Proofs
For simplicity, we assume that , .
b.1 Proof of Theorem 1
We first declare some lemmas that will be essential to our result.
Lemma B.1.
Let be obtained from the initialisation phase, we have
(24) 
Proof.
See Yi et al. (2016) theorem 1. ∎
Lemma B.2.
For any matrix for which the proportion of nonzero entries per row and column is , we have
(25) 
Proof.
See Netrapalli et al. (2014) lemma 4. ∎
Lemma B.3.
For two rank matrices and of the same dimension whose compact SVDs are and , we have
(26) 
provided .
Proof.
See Tu et al. (2016) lemma 5.14. ∎
Lemma B.4.
For any matrices and of consistent sizes, we have
(27) 
Proof.
See Liu et al. (2017) lemma 4.2. ∎
Lemma B.5.
For any matrix with compact SVD ,
(28) 
Proof.
See Yi et al. (2016) theorem 1. ∎
Lemma B.6.
Let be obtained from the initialisation phase, we have
(29) 
Proof.
Weyl’s theorem tells us that, for , . When , and because has rank and SVD. ∎
Lemma B.7.
For
(30) 
Proof.
(31) 
∎
We begin by deriving a bound on ,
(32) 
where the first inequality follows from Lemma B.2 with , the second from Lemma B.1 and the third from Lemma B.5. Next, we look at :
(33) 
where we have used Lemma B.6 and (32).
In cases (i) and (iii), the condition gives and we have
(34) 
using Lemma B.7, Lemma B.3 and (33). So, we have
(35) 
In case (ii), we have
(36) 
(37) 
The condition gives and we have similar to (35)
(38) 
b.2 Proof of Theorem 2
To ease our exposition, we define the following auxiliary quantities.
Let the solution set be
(39) 
For any , the corresponding solution is given by
(40) 
Let , and , from which we have
(41) 
Let and , we have
(42) 
We also have
(43) 
(44) 
Let , we have
(45) 
(46) 
Let , and , then we have .
We now state several lemmas that will help us construct the proof.
Lemma B.8.
For any and , we have
(47) 
Proof.
(48) 
∎
Lemma B.9.
For , in case (i), if and , then
(49) 
and in cases (ii) and (iii), if and , then
(50) 
Proof.
(51) 
where we have used and