Informed Non-convex Robust Principal Component Analysis with Features

09/14/2017, by Niannan Xue et al.

We revisit the problem of robust principal component analysis with features acting as prior side information. To this aim, a novel, elegant, non-convex optimization approach is proposed to decompose a given observation matrix into a low-rank core and the corresponding sparse residual. Rigorous theoretical analysis of the proposed algorithm results in exact recovery guarantees with low computational complexity. Aptly designed synthetic experiments demonstrate that our method is the first to wholly harness the power of non-convexity over convexity in terms of both recoverability and speed. That is, the proposed non-convex approach is more accurate and faster compared to the best available algorithms for the problem under study. Two real-world applications, namely image classification and face denoising further exemplify the practical superiority of the proposed method.


1 Introduction

Many machine learning and artificial intelligence tasks involve the separation of a data matrix into a low-rank structure and a sparse part capturing different information. Robust principal component analysis (RPCA) Candes et al. (2011); Chandrasekaran et al. (2011) is a popular framework that naturally characterizes this matrix separation problem.
Nevertheless, prior side information, oftentimes in the form of features, may also be present in practice. For instance, features are available for the following tasks:

  • Collaborative filtering: apart from ratings of an item by other users, the profile of the user and the description of the item can also be exploited in making recommendations Chiang et al. (2015);

  • Relationship prediction: user behaviours and message exchanges can assist in finding missing links on social media networks Xu et al. (2013);

  • Person-specific facial deformable models: an orthonormal subspace learnt from manually annotated data captured in-the-wild, when fed into an image congealing procedure, can help produce more correct fittings Sagonas et al. (2014).

It is thus reasonable to investigate how propitious it is for RPCA to exploit the available features. Indeed, recent results Liu et al. (2017) indicate that features are not redundant at all. In the setting of multiple subspaces, RPCA degrades as the number of subspaces grows because of the increased row-coherence. On the other hand, the use of feature dictionaries allows accurate low-rank recovery by removing the dependency on row-coherence. Despite the theoretical and practical merits of convexified RPCA with features, such as LRR Liu et al. (2010) and PCPF Chiang et al. (2016), convex relaxations of the rank function and the ℓ0-norm necessarily lead to algorithm weakening Chandrasekaran and Jordan (2013).
On a separate note, recent advances in non-convex optimization continue to challenge their convex counterparts Gong et al. (2013); Ge et al. (2016); Kohler and Lucchi (2017). In particular, non-convex RPCA algorithms such as fast RPCA Yi et al. (2016) and AltProj Netrapalli et al. (2014) exhibit better properties than the convex formulation. Most recently, Niranjan et al. (2017) embedded features into a non-convex RPCA framework, known as IRPCA-IHT, achieving faster speed. However, it remains unclear whether features have been effectively incorporated into non-convex RPCA and whether the potential benefits in accuracy, speed and so on have been exploited as much as possible.
In this work, we give positive answers to the above questions by proposing a novel non-convex scheme that fully leverages features, which reveal the true row and column subspaces, to decompose an observation matrix into a core matrix with a given rank and a residual part with informed sparsity. Even though the proposed algorithm is inspired by the recently proposed fast RPCA Yi et al. (2016), our contributions are by no means trivial, especially from a theoretical perspective. First, fast RPCA cannot be easily extended to consistently take account of features. Second, as we show in this paper, incoherence assumptions on the observation matrix and the features play a decisive role in determining the corruption bound and the computational complexity of the non-convex algorithm. Third, fast RPCA is limited in the corruption rate it can tolerate by its choice of hard threshold, whereas our algorithm raises this rate. Fourth, we prove that the costly projection onto factorized spaces is entirely optional when the features satisfy certain incoherence conditions. Although our algorithm maintains the same corruption rate and complexity as fast RPCA, we show empirically that massive gains in accuracy and speed can still be obtained. Besides, the transfer of the coherence dependency from the observation to the features means that our algorithm is capable of dealing with highly incoherent data.
Unavoidably, features adversely affect the tolerance to corruption of IRPCA-IHT compared to its predecessor AltProj. This is not always true of our algorithm in relation to fast RPCA. Moreover, when the underlying rank is low but the features are only weakly informative, which is often the case, our tolerance to corruption is arguably better. IRPCA-IHT also has a higher computational complexity than our algorithm. Although feature-free convex and non-convex algorithms enjoy higher asymptotic error bounds than our algorithm, we show in our experiments that this advantage does not translate into better accuracy in practice. Our algorithm still has the best performance in accurately recovering the low-rank part from highly corrupted matrices. This may be attributed to the fact that our bounds are not tight. Besides, PCPF and AltProj have much higher complexity than ours. For PCPF, no theoretical analysis exists under the deterministic sparsity model. Nonetheless, we show in our experiments that our algorithm is superior with regard to both recoverability and running time. The overall contributions of this paper are as follows:

  • A novel non-convex algorithm integrating features with informed sparsity is proposed in order to solve the RPCA problem.

  • We establish theoretical guarantees of exact recovery under different assumptions regarding the incoherence of features and observation.

  • Extensive experimental results on synthetic data indicate that the proposed algorithm is faster and more accurate in low-rank matrix recovery than the compared state-of-the-art convex and non-convex methods for RPCA (with and without features).

  • Experiments on two real-world datasets, namely the MNIST and extended Yale B databases, demonstrate the practical merits of the proposed algorithm.

2 Notations

Lowercase letters denote scalars and uppercase letters denote matrices, unless otherwise stated. M_{i,:} and M_{:,j} represent the i-th row and the j-th column of M. Projection onto a support set Ω is given by Π_Ω. |M| is the element-wise absolute value of the matrix M. For norms of a matrix M: ‖M‖_F is the Frobenius norm; ‖M‖_* is the nuclear norm; ‖M‖ is the largest singular value; otherwise, ‖M‖_p is the ℓ_p-norm of the vectorized M; and ‖M‖_{2,∞} is the maximum of the matrix row ℓ2-norms. Moreover, ⟨A, B⟩ represents tr(AᵀB) for real matrices A, B. Additionally, σ1(·) denotes the largest singular value of a matrix.
The Euclidean metric is not applicable here because of the non-uniqueness of the bi-factorisation L = UVᵀ, which corresponds to a manifold rather than a point. Hence, we define the following distance between (U, V) and any optimal pair (U*, V*) satisfying L* = U*(V*)ᵀ:

(1)   d((U, V), (U*, V*)) = min_R ( ‖U − U*R‖_F² + ‖V − V*R‖_F² )^{1/2}

where R is an orthogonal matrix.

3 Related Work

RPCA concerns a known observation matrix M which we seek to decompose into matrices L and S, such that L is low-rank and S is sparse and of arbitrary magnitude. Conceptually, it is equivalent to solving the following optimization problem:

(2)   min_{L,S} rank(L) + γ‖S‖_0   subject to   M = L + S

for an appropriate γ. This problem, regrettably, is NP-hard.
PCP Wright et al. (2009) replaces (2) with convex heuristics:

(3)   min_{L,S} ‖L‖_* + λ‖S‖_1   subject to   M = L + S

for some λ > 0. In spite of the simplification, PCP can exactly recover the solution of RPCA under the random model Candes et al. (2011) and the deterministic model Chandrasekaran et al. (2011); Hsu et al. (2011).
If feasible feature dictionaries X and Y regarding the row and column spaces are available, PCPF Chiang et al. (2016) makes use of these to generalize (3) to the objective below:

(4)   min_{H,S} ‖H‖_* + λ‖S‖_1   subject to   M = XHYᵀ + S

for the same λ as in (3). Convergence to the RPCA solution has only been established under the random sparsity model.
AltProj Netrapalli et al. (2014) addresses RPCA by minimizing an entirely different objective:

(5)

where the search consists of alternating non-convex projections. That is, during each cycle, hard-thresholding takes place first to remove large entries and projection of appropriate residuals onto the set of low-rank matrices with increasing ranks is carried out next. Exact recovery has also been established.
Fast RPCA Yi et al. (2016) follows yet another non-convex approach to solving RPCA. After an initialization stage, fast RPCA updates bilinear factors U and V, with L = UVᵀ, through a series of projected gradient descent steps and sparse estimations, where U and V minimize the following loss:

(6)

for U and V properly constrained. An exact recovery guarantee is ensured.
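For reference, a loss of the kind used in (6) can be written, up to the choice of balancing weight λ (our own notation, a sketch in the spirit of Yi et al. (2016) rather than a verbatim statement), as

\[
\mathcal{L}(U, V; S) \;=\; \tfrac{1}{2}\, \big\| U V^{\top} + S - M \big\|_F^{2}
\;+\; \tfrac{\lambda}{8}\, \big\| U^{\top} U - V^{\top} V \big\|_F^{2},
\]

where the second term balances the scales of the two factors.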
IRPCA-IHT Niranjan et al. (2017) includes features , in an iterative non-convex projection algorithm. Similar to AltProj, at each step, a new sparse estimate is calculated from hard thresholding via a monotonically decreasing threshold. After that, spectral hard thresholding takes place to attain the low-rank estimate. IRPCA-IHT provably converges to the solution of RPCA.
We also mention here several works of non-convex objectives Oh et al. (2015); Shang et al. (2017), though exact recovery guarantees are lacking.

4 Problem Setup

Suppose that there is a known data matrix M which can be decomposed into a low-rank component L and a sparse error matrix S of compatible dimensions. Our aim is to identify these underlying matrices and hence robustly recover the low-rank component with the help of available side information in the form of feature matrices X and Y.
Concretely, let the compact singular value decomposition of L be given. S follows the random sparsity model. That is, the support of S is chosen uniformly at random from the collection of all support sets of the same size. Furthermore, let us be informed of the proportion of non-zero entries per row and column of S, denoted by α. Assume that features X and Y are also available and that they are feasible, i.e. col(L) ⊆ col(X) and col(Lᵀ) ⊆ col(Y), where col(·) denotes the column space, and that X and Y have orthonormal columns (this can always be achieved via orthogonalisation).

In this paper, we discuss robust low-rank recovery using the above-mentioned features under three different incoherence settings, referred to as cases (i), (ii) and (iii), where case (iii) assumes both (i) and (ii); here r is the given rank of L and the incoherence parameters are constants.

5 Algorithm

We use a non-convex approach to achieve the above objective. The algorithm consists of an initialization phase followed by a gradient descent phase. At each stage, we keep track of the bilinear factors U and V such that L = UVᵀ.

5.1 Hard-thresholding

We first introduce the sparse estimator via hard-thresholding, which is used in both phases. Given a threshold α, T_α removes the elements of M that are not among the largest α-fraction of elements in their respective rows and columns, breaking ties arbitrarily for equal elements:

(7)   [T_α(M)]_{ij} = M_{ij} if |M_{ij}| ≥ |M_{i,(αn)}| and |M_{ij}| ≥ |M_{(αm),j}|, and 0 otherwise,

where M_{i,(αn)} and M_{(αm),j} are the (αn)-th and (αm)-th largest elements in absolute value in row i and column j of the m×n matrix M, respectively.
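As a concrete illustration, the following minimal NumPy sketch implements this row/column-wise hard-thresholding; the function name and the tie-breaking behaviour are our own choices, with alpha standing for the informed corruption fraction.

import numpy as np

def sparse_estimator(M, alpha):
    """Keep only the entries of M that are simultaneously among the largest
    alpha-fraction (in absolute value) of their row and of their column;
    everything else is set to zero."""
    m, n = M.shape
    A = np.abs(M)
    k_row = max(1, int(np.ceil(alpha * n)))            # entries kept per row
    k_col = max(1, int(np.ceil(alpha * m)))            # entries kept per column
    row_thr = np.sort(A, axis=1)[:, -k_row][:, None]   # k_row-th largest per row
    col_thr = np.sort(A, axis=0)[-k_col, :][None, :]   # k_col-th largest per column
    return np.where((A >= row_thr) & (A >= col_thr), M, 0.0)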

5.2 Initialization

S is first initialized by applying the sparse estimator to the observation M. Next, we obtain an r-truncated SVD of the resulting residual, from which the initial factors U0 and V0 are constructed. Such an initialization scheme gives U0 and V0 the desirable properties for use in the second phase.
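One natural instantiation of this scheme is sketched below, reusing the sparse_estimator sketch from Section 5.1; the projection X.T @ (M - S0) @ Y and the square-root split of the singular values are our own assumptions for illustration, not a verbatim transcription of the paper.

import numpy as np

def initialize(M, X, Y, r, alpha):
    """Hedged sketch: hard-threshold M, then build rank-r factors that live in
    the feature subspaces spanned by X and Y."""
    S0 = sparse_estimator(M, alpha)                # initial sparse estimate
    core = X.T @ (M - S0) @ Y                      # residual seen through the features
    A, sig, Bt = np.linalg.svd(core, full_matrices=False)
    A, sig, B = A[:, :r], sig[:r], Bt[:r, :].T     # r-truncated SVD
    U0 = (X @ A) * np.sqrt(sig)                    # U0 = X A Sigma^{1/2}
    V0 = (Y @ B) * np.sqrt(sig)                    # V0 = Y B Sigma^{1/2}
    return U0, V0, S0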

5.3 Gradient Descent

In case (i), we need the following constraint sets:

(8)
(9)

Otherwise, we can simply take the constraint sets to be the entire, unconstrained factor spaces.
To proceed, we first regularise U0 and V0:

(10)

At each iteration, we first update S with the sparse estimator using a suitably chosen threshold:

(11)

For the current factors U and V, we define the following objective function

(12)

and the factors are updated by minimizing the above function subject to the constraints imposed by the sets defined above. That is,

(13)
(14)

where the step size is determined analytically below. With properly initialized U0 and V0, such an optimization design converges to the underlying factors. The procedure is summarized in Algorithm 1.

1: Observation M, features X and Y, rank r, corruption approximation α and step size.
2:Initialization:
3:
4:-SVD
5:
6:
7:Gradient descent:
8:
9:
10:while not converged do
11:   
12:   
13:   
14:end while
15: Output: the recovered factors U, V.
Algorithm 1 Non-convex solver for robust principal component analysis with features
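The loop below is only a hedged sketch of Algorithm 1's overall shape, reusing the sparse_estimator and initialize sketches above: the squared-error data term, the factor parameterisation U = X A and V = Y B, the step-size normalisation, the stopping rule and the omission of the case-(i) projections are illustrative assumptions; only the alternation of sparse estimation and gradient steps, the step size 0.5 and the 3000-iteration budget come from the text.

import numpy as np

def informed_rpca(M, X, Y, r, alpha, eta=0.5, max_iter=3000, tol=1e-7):
    """Hedged sketch of Algorithm 1: alternate a sparse update with gradient
    steps on the factor parameters A, B of U = X A and V = Y B."""
    U, V, S = initialize(M, X, Y, r, alpha)
    A, B = X.T @ U, Y.T @ V                        # valid when X, Y are orthonormal
    step = eta / max(np.linalg.norm(M, 2), 1e-12)  # crude step normalisation (assumption)
    for _ in range(max_iter):
        L = (X @ A) @ (Y @ B).T                    # current low-rank estimate
        S = sparse_estimator(M - L, alpha)         # sparse update via hard thresholding
        R = L + S - M                              # residual of the data-fit term
        grad_A = X.T @ R @ (Y @ B)                 # gradients of 0.5 * ||X A B' Y' + S - M||_F^2
        grad_B = Y.T @ R.T @ (X @ A)
        A_new, B_new = A - step * grad_A, B - step * grad_B
        converged = np.linalg.norm(A_new - A) + np.linalg.norm(B_new - B) < tol
        A, B = A_new, B_new
        if converged:
            break
    return (X @ A) @ (Y @ B).T, S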

6 Analysis

We first provide theoretical justification of our proposed approach. Then we evaluate its computational complexity. The proofs can be found in the supplementary material.

6.1 Convergence

The initialization phase provides us with the following guarantees on U0 and V0.

Theorem 1.

In cases (i) and (iii), if the corresponding initialization condition holds, we have

(15)

In case (ii), if the analogous condition holds, we have

(16)

where κ is the condition number of the low-rank component and d is a distance metric defined in the appendix.

Theorem 2.

For a suitable choice of parameters, there exist constants such that, in case (i), when the corresponding condition holds, we have the following relationship

(17)

in case (ii), under the analogous condition, we have

(18)

and in case (iii), under the corresponding condition, we have

(19)

6.2 Complexity

From Theorem 2, it follows that our algorithm converges at a linear rate under assumptions (ii) and (iii). To bring the error below a fraction ε of the initial error, O(log(1/ε)) iterations are needed. At each iteration, the most costly step is a matrix multiplication. Overall, the total running time of our algorithm is the per-iteration cost multiplied by the O(log(1/ε)) iteration count.
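To make the iteration count explicit, suppose the contraction in Theorem 2 has factor 1 − c for some constant c ∈ (0, 1) (the exact constant is not restated here). Then

\[
d_t \;\le\; (1 - c)^{t}\, d_0 \;\le\; \varepsilon\, d_0
\quad \Longleftarrow \quad
t \;\ge\; \frac{\log(1/\varepsilon)}{\log\!\big(1/(1 - c)\big)} \;=\; O\!\big(\log(1/\varepsilon)\big),
\]

so the number of iterations grows only logarithmically in the target accuracy ε.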

7 Experimental results

Figure 1: Domains of recovery by various algorithms: (a) for random signs and (b) for coherent signs.

We have found that when the step size is set to 0.5, reasonable results can be obtained. For all the algorithms in comparison, we run a total of 3000 iterations or stop early once the convergence criterion is met.

7.1 Phase transition

Here, we vary the rank and the error sparsity to investigate the behavior of both our algorithm and existing state-of-the-art algorithms in terms of recoverability. True low-rank matrices are created as the product of two factor matrices whose elements are drawn independently from a Gaussian distribution, so that the product has the prescribed rank r. Next, we corrupt each column so that a given fraction of its elements are set independently to errors of a fixed magnitude. However, this does not guarantee row corruption. We thus select only matrices whose maximum row corruption does not exceed the target level, but we still feed the column-wise corruption ratio to the algorithms in order to demonstrate that our algorithm does not need the exact value of the corruption ratio. We consider two types of error signs: random Bernoulli signs and coherent signs. The resulting sum of the low-rank and sparse parts thus becomes the simulated observation. In addition, consider the SVD of the true low-rank matrix. Feature X is formed by randomly interweaving its left singular vectors with 5 arbitrary orthonormal bases for the null space, while permuting its right singular vectors expanded with 5 random orthonormal bases for the kernel forms feature Y. Hence, the feasibility conditions are fulfilled: col(L) ⊆ col(X), col(Lᵀ) ⊆ col(Y). For each (rank, sparsity) pair, three observations are constructed. The recovery is deemed successful if, for all three problems,

(20)

holds for the recovered low-rank matrix.
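A hedged sketch of this synthetic protocol follows; the corruption magnitude, the use of exactly 5r extra orthonormal directions arranged this way, and the success tolerance are illustrative choices (it also assumes 6r ≤ n).

import numpy as np

def make_problem(n, r, rho, seed=0):
    """Generate M = L + S with a rank-r L, a fraction rho of corrupted entries
    per column with random signs, and feasible features X, Y."""
    rng = np.random.default_rng(seed)
    L = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))   # rank-r core
    S = np.zeros((n, n))
    for j in range(n):                                              # corrupt each column
        idx = rng.choice(n, size=int(rho * n), replace=False)
        S[idx, j] = rng.choice([-1.0, 1.0], size=len(idx)) * np.abs(L).mean()
    U, _, Vt = np.linalg.svd(L)
    X = U[:, :6 * r][:, rng.permutation(6 * r)]      # true column space plus 5r extra bases
    Y = Vt.T[:, :6 * r][:, rng.permutation(6 * r)]   # true row space plus 5r extra bases
    return L + S, L, S, X, Y

# Success in the spirit of (20): relative error below a small tolerance,
# e.g. norm(L_hat - L, 'fro') / norm(L, 'fro') <= 1e-3 (the tolerance is an assumption).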

Figure 1(I) plots results from the algorithms that incorporate features. Our algorithm is contrasted with fast RPCA in Figure 1(II), and the remaining feature-free algorithms are investigated in Figure 1(III). Figures 1(a) correspond to the random sign model and Figures 1(b) to the coherent sign model. All previous non-convex attempts fail to outperform their convex equivalents. IRPCA-IHT is unable to deal with even moderate levels of corruption. The frontier of recoverability advanced by our algorithm over PCPF is remarkable, and it massively improves upon fast RPCA. The anomalous asymmetry between the two sign models is no longer observed in the non-convex algorithms.

7.2 Running Time

Figure 2: Running times for observation matrices of increasing dimensions for (i) PCP, PCPF, fast RPCA, AltProj and our algorithm, and (ii) IRPCA-IHT and our algorithm.

Next, we highlight the speed of our algorithm for large-scale matrices, typical of video sequences Xiong et al. (2016). Random observation matrices ranging from 1500×1500 to 2500×2500 are generated, where the rank is chosen to be 20% of the column number, random-sign errors corrupt 11% of the entries, and the features have a dimension of 50% of the column number. The running times of all algorithms except IRPCA-IHT are plotted in Figure 2(i), because IRPCA-IHT is not able to achieve a relative error of less than 1 for larger matrices. For a fair comparison with IRPCA-IHT, we relax the rank to 0.3 of the column number and the error rate to 0.1 for matrices ranging from 2000×2000 to 10000×10000, and use features with a dimension of 80% of the column number to speed up the process. The result is shown in Figure 2(ii). All times are averaged over three trials. It is evident that, for large matrices, our algorithm overtakes all existing algorithms in terms of speed. Note that the features in PCPF even slow down the recovery process.

noise sparsity (%)  noisy  PCP    PCPF   AltProj  IRPCA-IHT  fast RPCA  our algorithm
10                  30.45  82.75  83.35  81.4     65.2       81.1       86.9
15                  25.1   82.95  83.4   81.15    49.65      79.65      84.8
20                  23.15  83.5   84     79.3     37.8       78.65      83.8
25                  18.65  81.35  82.65  74.05    30.35      75.3       83.15
30                  18.6   77.95  79     71.5     24.1       72.9       82.05
35                  16.95  71.2   73.4   67.75    21.05      71.45      79.05
Table 1: Classification accuracy (%) obtained by a linear SVM on the recovered images; the clean test set yields 89.65.
noise sparsity (%)  noisy  PCP    PCPF   AltProj  IRPCA-IHT  fast RPCA  our algorithm
10                  87     87.25  87.3   86.45    89.3       89.25      90.3
15                  75.85  87.15  87.4   86.75    82.85      87.2       89.8
20                  64.35  87.6   87.55  84.65    71.2       85.55      88.55
25                  55.85  87     86.95  79.4     62.35      82.65      87.8
30                  47.15  81.15  81.55  76.75    53.5       78.3       85.65
35                  40.55  74.8   75.7   71       47.4       76.75      85.15
Table 2: Classification accuracy (%) obtained by an SVM with RBF kernel on the recovered images; the clean test set yields 92.25.

7.3 Image Classification

Figure 3: Relative error of the recovered matrices for sparsity values ranging from 10% to 35%.

Once images are denoised, classification can be performed on them. The classification results directly reflect the image denoising ability. For a set of correlated images, low-rank algorithms are normally used to remove noise that is sparse. The same classifier is thus able to compare the different low-rank models.

The MNIST dataset is such an example; it contains hand-written digits divided into training and testing sets. Let the observation matrix be composed of 2000 vectorized random images from the test set stacked column-wise. In this case, the left feature obtained from the training set is also applicable to the test set because of the eigendigit nature of the data. This extends our algorithm to supervised settings where clean, related training samples are available. The right feature does not possess such a property and is set to the identity matrix. We add a range of sparse noise to the test set separately, where the noise sets the affected pixels to 255. For PCPF, we construct the features as in Chiang et al. (2016), while for IRPCA-IHT and our algorithm we use the features described above instead.
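One natural reading of how such an eigendigit left feature can be built from the training set is sketched below; the use of a plain truncated SVD and the number of components k are our own assumptions, and the variable names are placeholders.

import numpy as np

def eigendigit_feature(train_matrix, k):
    """train_matrix: (d, n_train) array of vectorized training digits.
    Returns an orthonormal basis of the k leading left singular vectors
    ('eigendigits'); the right feature is left as the identity."""
    U, _, _ = np.linalg.svd(train_matrix, full_matrices=False)
    return U[:, :k]

# Hypothetical usage: X = eigendigit_feature(train_matrix, k=100); Y = np.eye(2000)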

The relative error between the matrices recovered by the competing algorithms and the clean test matrix is plotted in Figure 3. Our algorithm is the most accurate in removing the added artificial noise. To evaluate how classifiers perform on the recovered matrices, we train linear and kernel SVMs on the training set and test the corresponding models on the recovered images. Table 1 reports the results of the linear SVM and Table 2 those of the kernel SVM. Both classifiers confirm the recovery results obtained by the various models, corroborating our algorithm's pre-eminent accuracy.
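The evaluation protocol can be reproduced roughly as follows; scikit-learn is an assumed tool (the text does not name the SVM implementation) and the variable names are placeholders.

from sklearn.svm import SVC, LinearSVC

def classify_recovered(train_matrix, train_labels, recovered_test, test_labels):
    """Train on clean training images (columns are vectorized images), then score
    the classifiers on the images recovered by a given RPCA variant."""
    linear_acc = LinearSVC().fit(train_matrix.T, train_labels).score(recovered_test.T, test_labels)
    rbf_acc = SVC(kernel="rbf").fit(train_matrix.T, train_labels).score(recovered_test.T, test_labels)
    return linear_acc, rbf_acc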

7.4 Face denoising

Figure 4: (i) original; (ii) PCPF; (iii) our algorithm; (iv) IRPCA-IHT; (v) PCP; (vi) fast RPCA; (vii) AltProj.

It is common practice to decompose raw facial images as a low-rank component for faithful face representation and a sparse component for defects. This is because the face is a convex Lambertian surface which under distant and isotropic lighting has an underlying model that spans a 9-D linear subspace Basri and Jacobs (2003), but theoretical lighting conditions cannot be realised and there are unavoidable occlusion and albedo variations in real images. We demonstrate that there can be a substantial boost to the performance of facial denoising by leveraging dictionaries learnt from the images themselves.
The extended Yale B database is used as our observation source; it consists of images under different illuminations for a fixed pose. We study all 64 images of a randomly chosen person. An observation matrix is formed by vectorizing each image. For fast RPCA and our algorithm, a sparsity of 0.2 is adopted. We learn the feature dictionary as in Xue et al. (2017). In a nutshell, the feature learning process can be treated as a sparse encoding problem. More specifically, we simultaneously seek a dictionary and a sparse representation such that:

(21)

where k is the number of atoms, the ℓ0 terms count the number of non-zero elements in each sparse code and T is the sparsity constraint factor. This can be solved by the K-SVD algorithm. Here, feature X is the learnt dictionary, while feature Y corresponds to a similar solution using the transpose of the observation matrix as input. The number of atoms, the sparsity constraint factor and the number of K-SVD iterations are fixed in advance.
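For concreteness, a standard way to write the sparse dictionary-learning problem behind (21), in our own notation, is

\[
\min_{D,\,Z}\; \big\| M - D Z \big\|_F^{2}
\quad \text{subject to} \quad \| z_i \|_0 \le T \ \ \text{for every column } z_i \text{ of } Z,
\]

where D holds the k atoms as columns and Z collects the sparse codes; K-SVD alternates sparse coding (e.g. by orthogonal matching pursuit) with an SVD-based update of one atom at a time.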

As a visual illustration, the images recovered by all the algorithms are exhibited in Figure 4. For this challenging scenario, our algorithm removed all of the shadows. PCPF is smoother than PCP but still suffers from shading. AltProj and fast RPCA both introduced extra artefacts. Although IRPCA-IHT managed to remove the shadows, it returned a severely distorted image. To quantitatively verify the improvement made by our proposed method, we examine the structural information contained within the denoised eigenfaces. The singular values of the low-rank matrices recovered by all algorithms are plotted in Figure 5. All non-convex algorithms are competent in incorporating the rank information to keep only 9 singular values, vastly outperforming the convex approaches. Among them, our algorithm has the most rapid decay, which is what is found naturally Wright et al. (2011).

Figure 5: Log-scale singular values of the denoised matrices.

8 Conclusion

This work proposes a new non-convex algorithm to solve RPCA with the help of features when the error sparsity is roughly known. Exact recovery guarantees have been established under three different assumptions about the incoherence conditions on the features and the data observation matrix. Simulation experiments suggest that our algorithm is able to recover matrices of higher rank corrupted by errors of higher sparsity levels than previous state-of-the-art approaches. Large synthetic matrices also show that our algorithm scales best with the observation matrix dimension. The MNIST and Yale B datasets further confirm that our algorithm leads other approaches by a fair margin. Future work may involve finding a more accurate initialization scheme.

Appendix A Convex Projection

Given a point, the problem of finding its projection onto the constraint set can be seen as projection onto the intersection of a series of closed convex sets. We have empirically found that the cyclic Dykstra algorithm Reich and Zaslavski (2012) has the fastest rate of convergence. At each iteration, the cyclic Dykstra algorithm cycles through the sets, projecting the appropriately corrected iterate onto one set at a time while updating the stored correction terms.

For each individual set, we formulate the equivalent optimisation problem below

(22)

whose solution is given in closed form by

(23)

The projections onto the remaining sets follow similarly.
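Since the specific constraint sets and the closed forms in (22)–(23) are not reproduced here, the sketch below only shows the generic cyclic Dykstra iteration for projecting a point onto an intersection of closed convex sets; the individual projections are supplied as callables and are placeholders.

import numpy as np

def cyclic_dykstra(x0, projections, n_iter=100):
    """Project x0 onto the intersection of convex sets, each described by its
    Euclidean projection operator, using Dykstra's cyclic correction terms."""
    x = x0.copy()
    corrections = [np.zeros_like(x0) for _ in projections]   # one correction per set
    for _ in range(n_iter):
        for i, proj in enumerate(projections):
            y = proj(x + corrections[i])                      # project the corrected point
            corrections[i] = x + corrections[i] - y           # update the correction term
            x = y
    return x

# Hypothetical usage: intersection of the non-negative orthant and the unit ball
# cyclic_dykstra(z, [lambda v: np.clip(v, 0, None),
#                    lambda v: v / max(1.0, np.linalg.norm(v))])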

We have also run experiments to see how much improvement can be gained by the convex projection. 200×200 high-incoherence matrices are created with ranks from 140 to 155 and corrupted by 10% random-sign errors. Our algorithm is applied with the projection several times, each run using a different number of iterative steps ranging from 0 to 2000. Recoverability is plotted against the number of iterative projections in Figure 6. There is hardly any noticeable improvement, so we do not use the convex projection in our comparison experiments. Further analysis is needed to justify the redundancy of the convex projection.

Figure 6: Effectiveness of convex projection.

Appendix B Proofs

For simplicity, we assume that , .

b.1 Proof of Theorem 1

We first declare some lemmas that will be essential to our result.

Lemma B.1.

Let the initial estimates be obtained from the initialisation phase; we have

(24)
Proof.

See Yi et al. (2016) theorem 1. ∎

Lemma B.2.

For any matrix for which the proportion of non-zero entries per row and column is at most α, we have

(25)
Proof.

See Netrapalli et al. (2014) lemma 4. ∎

Lemma B.3.

For two rank matrices and of the same dimension whose compact SVDs are and , we have

(26)

provided .

Proof.

See Tu et al. (2016) lemma 5.14. ∎

Lemma B.4.

For any matrices and of consistent sizes, we have

(27)
Proof.

See Liu et al. (2017) lemma 4.2. ∎

Lemma B.5.

For any matrix with compact SVD ,

(28)
Proof.

See Yi et al. (2016) theorem 1. ∎

Lemma B.6.

Let the initial estimates be obtained from the initialisation phase; we have

(29)
Proof.

Weyl’s theorem tells us that, for , . When , and because has rank and -SVD. ∎

Lemma B.7.

For

(30)
Proof.
(31)


We begin by deriving a bound on ,

(32)

where the first inequality follows from Lemma B.2 with , the second from Lemma B.1 and the third from Lemma B.5. Next, we look at :

(33)

where we have used Lemma B.6 and (32).
In cases (i) and (iii), the condition gives and we have

(34)

using Lemma B.7, Lemma B.3 and (33). So, we have

(35)

In case (ii), we have

(36)
(37)

The condition gives and we have similar to (35)

(38)

b.2 Proof of Theorem 2

To ease our exposition, we define the following auxiliary quantities.

Let the solution set be

(39)

For any , the corresponding solution is given by

(40)

Let , and , from which we have

(41)

Let and , we have

(42)

We also have

(43)
(44)

Let , we have

(45)
(46)

Let , and , then we have .

We now state several lemmas that will help us construct the proof.

Lemma B.8.

For any and , we have

(47)
Proof.
(48)

Lemma B.9.

For , in case (i), if and , then

(49)

and in cases (ii) and (iii), if and , then

(50)
Proof.
(51)

where we have used and