Principal component analysis (PCA) bas been one of the most widely used techniques for unsupervised learning in various applications. The classic PCA aims at seeking a low-rank approximation of a given data matrix. Mathematically, it uses the
norm to fit the reconstruction error, which is known to be sensitive to noise and outliers. The harder problem of seeking a PCA effective for outlier-corruped data is called robust PCA (RPCA). There has been no mathematically precise meaning for the term “outlier”. Thus multiple methods have been attempted to define or quantify this term, such as alternating minimization , random sampling techniques [17, 9], multivariate trimming , and so on [27, 7].
Among these methods, a recently emerged one treats an outlier as an additive sparse corruption , which leads to decomposing the data into a low-rank and a sparse part. Given data matrix , based on such a decomposition assumption, the corresponding RPCA method aims to mathematically solve the following problem [25, 6]:
where is a balancing parameter, and is the (pseudo) norm that counts the number of nonzero elements of the matrix. It is generally NP-hard to solve the rank function and norm-based optimization problems. Hence, in practice creftype 1 is often relaxed to the following convex problem :
where is the nuclear norm that adds all singular values of the input matrix and is the norm of a matrix. A number of algorithms have been developed to solve creftype 2, such as singular value thresholding (SVT) , accelerated proximal gradient (APG) , and inexact agumented Lagrange multipliers (IALM) . These algorithms, however, need to compute SVDs of matrices of size at each iteration, which, is known to generally have at least a quadratic or even cubic complexity . Thus, due to the use of SVDs, high complexity of these algorithms renders them less applicable to large-scale data. To improve efficiency, an augmented Lagrange multipliers (ALM)-based algorithm adopts the PROPACK package  to solve partial, instead of full, SVDs. Even with partial SVD, it is still computationally costy when and are both large.
The convex RPCA in creftype 2 has two known limitations: 1) Without the incoherence guarantee of the underlying matrix, or when the data is grossly corrupted, the results can be much deviated from the truth 
; 2) When the matrix has large singular values, its nuclear norm may lead to an estimation far from the rank. To combact these drawbacks, several approaches to a better rank approximation have been proposed. For example, the rank of is fixed and used as a hard constraint in , and a nonconvex rank approximation is adopted to more accurately approximate the rank function in . However, these nonconvex approaches also need to solve full SVDs of matrices. Two methods in [15, 19] need only to solve partial SVDs, which significantly reduces the complexity compared to full SVDs; for example, AltProj has a complexity of , with being the ground truth rank of . However, if is not known a priori,  usually fails to recover .
As large-scale data is increasingly ubiquitous, it is crucial to handle them with more efficient and scalable RPCA methods which, nonetheless, are still largely missing. To address such a need and challenge, in this paper, we propose a new RPCA method, called RES-PCA. This model does not depend on rank approximation to recover the low-rank component; rather, it effectivelly exploits the underlying group structural information of the low-rank component for the recovery. Consequently, the new method does not need to solve any SVDs as current state-of-the-art methods typically do, which avoids any quadratic or higher complexity; more specifically, the proposed method has a linear complexity in both and , rendering it lightweight, scalable, and thus suitable for large-scale data applications. We summarize the contributions of this paper as follows:
We propose a new type of RPCA model exploiting the underlying group structures of the low-rank component.
We develop an ALM-based algorithm for optimization, which uses no matrix decomposition and has linearly efficient computation at each iteration. The new method is scalable in data dimension and sample size, suitable for large-scale data.
Extensive experiments have demonstrated the effectiveness of the proposed method quantitatively and qualitatively.
The rest of this paper is organized as follows. We first briefly review some related work. Then we introduce the new method and its optimization. Next, we conduct experiments to evaluate the new method. Finally, we conclude the paper.
2 Related Work
The convex RPCA in creftype 2 considers the sparsity of the sparse component in an element-wise manner . To exploit example-wise sparsity, the norm has been adopted by replacing the norm in (2) [26, 18]:
It is ponited out that the nuclear norm may be far from accurate in approximating the rank function . To alleviate this defficiency, some new rank approximations have been used to replace the nuclear norm in creftypeplural 3 and 2, such as -norm . The -norm based RPCA solves the following optimization problem:
where , , and is the -th largest singular value of . Here, with different values used for , the -norm may have different performance in approximating the rank function.
Another recent nonconvex approach to RPCA, AltProj, cobmines the simplicity of PCA and elegant theory of convex RPCA . It alternatively projects the fitting residuals onto the low-rank and sparse sets. Given that the desired rank of is , AltProj computes a rank- projection in each of the total stages, with . During this process, matrix elements with large fitting errors are discarded such that sparse errors are suppressed. This method enjoys several nice properties; however, it needs the precise knowledge of the ground truth rank of , which is not always available. Without such knowledge, AltProj may fail to recover the low-rank component.
3 New Robust PCA Method
The classic RPCA and its variants usually require to solve SVDs, which has a high complexity. To overcome this drawback, in this paper we consider a new type of RPCA model that has a linear complexity. Motivated by the convex RPCA approach, we assume that the data can be decomposed as . Here, is the low-rank component of and its columns are linearly dependent in linear algebra; hence, it is true that many columns of
share high similarities and thus are close geometrically in Euclidean space. In the case of a single rank-1 subspace, the above assumption naturally leads to the minimization of the sum of squared mutual distances, or equivalently the variance (scaled by
), of the column vectors of:
where is a balancing parameter, is the th column of , and is the norm of a vector. It is noted that, though not necessary, it is sufficient that the minimization of the first term in creftype 5 leads to low-rank structure for . To see this, we reformulate it as , which is the sum of squares of residuals (SSR) from each data point to the average of all data points. Thus, by minimizing it, all columns are close to their average and the average is the minimizer of SSR, which ideally lead to rank-1 solution to . Under some mild conditions, we have the following theorem.
Given a matrix , with , and , , we have that is sufficient and necessary for
where , and 1 is an all- vector of dimension .
It is noted that the double summation in the first term of creftype 5 can be written as , by minimizing which we can obtain the desired low-rank structure. It is natural to generalize the above idea. To this end, we consider the case of multiple rank-1 subspaces with the following model, which we refer to as Robust, linearly Efficient, Scalable PCA (RES-PCA):
is an identity matrix of size, is an -dimensional column vector containing 1’s, is an operator that returns a diagonal matrix from an input vector, and is a binary vector with the positions of s indicating which of the column vectors belong to the -th subspace. It is evident that by automatically learning ’s we are able to obtain the structural information about the low-rank subspaces. It is noted that different norms can be used for , such as and norms; in this paper, without loss of generality, we adopt the norm to capture the sparse structure of . In next section, we will develop an efficient algorithm to optimize creftype 8.
Remark In the case that data have nonlinear relationships, i.e., and are close on manifold rather than in Euclidean space if they come from the same subspace, a direct extension of our method can be made, which is presented in Section 4.2. Since the linear model provides with us the key ideas and contributions of this paper, and the experiments have confirmed its effectiveness in several real world applications, we focus on the linear model in our paper. Due to space limit, we do not fully expand the nonlinear model and will consider it in further research and more applications.
Then we adpot the alternating decent approach to optimization, where at each step we optimize a subproblem with respect to a variable while keeping the others fixed. The detailed optimization strategies for each variable are described in the following.
The -subproblem is to solve the following problem:
Omitting the factor , it is seen that the first term above can be derived as
where the operator returns the submatrix of that contains the columns of corresponding to nonzeros of . Correspondingly, it is straightforward to see that the second term of creftype 10 can be decomposed in a similar way:
Hence, can be solved by individually solving the following subproblems for :
The above subproblems are convex and according to the first-order optimality condition we have
where, for ease of presentation, we denote , and . Hence, creftype 14 leads to the soluation of :
It is notable that due to the special structure of creftype 16 its inversion has a simple analytic expression by using the Sherman-Morrison-Woodbury formula:
Hence, it is apparent that that creftype 15 can be written as follows:
which has a linear complexity in both and by exploiting matrix-vector multiplications. can be obtained accordingly after obtaining all , for .
The subproblem associated with -minimization is given as follows:
It is seen that
where denotes the -th element of . Hence, the -subproblems can be converted to
which is simply the standard K-means problem. This is surprising in that we only need to perform K-means toand then the optimal simply corresponds to the group indicator matrix:
as a clustering rather than optimization problem. For example, if we consider nonlinear clustering algorithms, such as spectral clustering, the recoveredand actually reflect nonlinear structures of the data, which can be treated as a direct nonlinear extension of our method to account for nonlinear relationships of the data.
For the updating of and , we follow a standard approach in ALM framework:
where is a parameter that controls the increasing speed of .
Regarding the complexity of the above optimization procedure, it should be noted that each step requires complexity and typically ALM converges in a finite number of steps , thus the overall complexity of our method is .
In this section, we evaluate the proposed method in comparison with several current state-of-the-art algorithms, including variational Bayesian RPCA (VBRPCA) , IALM for convex RPCA , AltProj , NSA , and PCP . In particular, we follow [21, 13] and evaluate RES-PCA in three applications, including foreground-background separation from video sequences, shadow removal from face images, and anamoly detection from hand-written digits. All these experiments are conducted under Ubuntu system with 12 Intel(R) Xeon(R) W-2133 CPR 3.60GHz. All algorithms are terminated if a maximum of 500 iterations is reached or is satisfied.
5.1 Foreground-Background Seperation
Foreground-background separation is to detect moving objects or interesting activities in a scene, and remove background(s) from a video sequence. The background(s) and moving objects correspond to the low-rank and sparse parts, respectively. For this task, we use 9 datasets, whose characteristics are summarized in Table 1. Among these video datasets, the first 5 contain a single background while the remaining sequences have 2 backgrounds.
|Data Set||data size||# of backgrounds|
|Escalator Airport||130160 3,417||1|
|Hall Airport||144176 3,584||1|
|Shopping Mall||256320 1,286||1|
|Camera Parameter||240320 5,001||2|
|Light Switch-1||120160 2,800||2|
|Light Switch-2||120160 2,715||2|
|Data||Method||Rank()||# of Iter.||Time|
We set the rank to be the minimal number of singular values that contribute more
than 99.5% information to avoid the noise effect of small singular values.
“——” presents an “out of memory” issue.
For the parameters, we set them as follows. For IALM, we use the theoretically optimal balancing parameter . The same balancing parameter is used for PCP and NSA as suggested in the original papers. For fair comparison, we use for the proposed method. For AltProj, we specify the ground truth rank; for VBRPCA, we use the ground truth rank as its initial rank parameter. For fair comparison, we set to be ground truth rank for RES-PCA. For all methods that relay on ALM-optimization, we set the parameters to be and . These settings remain the same throughout this paper unless specified otherwise.
We show the results in Table 2. It is observed that AltProj, VBRPCA, and RES-PCA are able to recover the backgrounds from the video with low rank while IALM, NSA and PCP with much higher ranks. However, it is noted that VBRPCA may recover with ranks lower than the ground truth. For example, on Light Switch-1, Light Switch-2, and Camera Parameter data sets, the ground truth rank of the background is 2 whereas VBRPCA recovers the low rank parts with rank 1. This may be a potential problem, as will be clear later on in visual illustration. Although IALM, NSA amd PCP do not recover with desired low ranks, they recovery more sparsely than AltProj, VBRPCA, and RES-PCA. Besides, we observe that the speed of the proposed method is superior to that of the other methods. From Table 2, it is observed that the proposed method is about 3 times faster than AltProj, the second fastest one, and more than 10 (even about 60 on some data sets) times faster than IALM. Although the proposed method does not obtain the smallest errors at convergence on some data, it is noted that the levels of the errors are well comparable to the other methods.
It should be noted that for mthods such as IALM, PCP, and NSA, though they do not recover with desired low ranks, it is possible that by tunning their balancing parameters they may work well. However, tunning parameter for unsupervised learning method is usually time consuming. The proposed method has one balancing parameter, which has been empirically verified that the theoretical parameter as provided in  works well. A possible explaination is that RES-PCA has a close connection and thus enjoies the same optimal parameter with the convex RPCA. More theoretical validation is to be explored in further work.
Moreover, to visually compare the algorithms and illustrate the effectiveness of the proposed method, we show some decomposition results in Figs. 2 and 1. Since IALM, NSA and PCP cannot recover with desired low ranks, they cannot recover the backgrounds well. For example, we can observe shadows of car on highway in Fig. 1. VBRPCA reocvers with ranks lower than the ground truth on some data sets; consequently, on such data as Light Switch-2 in Fig. 2 we can see that VBRPCA cannot work well on data with different backgrounds. AltProj and RES-PCA can separate the backgrounds and foregrounds well.
To further assess the performance of the proposed method, we conduct the following experiments to compare the two methods that have achieved the top performance: AltProj and RES-PCA. In this test we asume that the ground truth rank of is unknown, and we set it to 5 for AltProj and for the proposed method. Some obtained results are given in Figs. 4 and 3. It is seen that RES-PCA can still separate the background and foreground well while AltProj fails. The success of RES-PCA in this kind of scenarios can be explained as follows: With greater than the ground truth rank of , a large group of backgrounds is usually divided into smaller groups such that the backgrounds within each group still share the same structure; as a consequence, RES-PCA can still recover the low-rank matrices correctly. This observation reveals that RES-PCA has superior performance to AltProj when the precise knowledge of the ground truth is unknown a priori.
5.2 Shadow removal from face images
Face recognition is an important topic; however, it is often plagued by heavy noise and shadows on face images . Therefore, there is a need to handle shadows. In this test, low-rank methods are used because the (unknown) clean images reside in a low-rank subspace, corresponding to , while the shadows correspond to . We use the Extended Yale B (EYaleB) data set for comparative study. EYaleB data contains face images from 38 persons, among which we select images of the first 2 persons, namely, subject 1 and subject 2. For each there are 64 images of pixels. Following the common approach as in [6, 13], we construct a data matrix for each person by vectorizing the images and perform different RPCA algorithms on the matrix. We show some results in Fig. 5 for visual inspection. It is observed that all methods can successfully remove shadows on subject 2, but some fail on subject 1. The proposed method removes shadows from face images on both subject 1 and subject 2, which confirms its effectiveness.
5.3 Anomaly Detection
Given a number of images from a subject, they form a low-dimensional subspace. Those images with stark differences from the majority can be regarded as outliers; besides, a few images from another subject are also treated as outliers. Anomaly detection is to identify such kinds of outliers from the dominant images. It is modeled thatis comprised of the dominant images while captures the outliers. For this test, we use USPS data set which consists of 9,298 hand-written digits of size . We follow  and vectorize the first 190 images of ‘1’s and the last 10 of ‘7’s to construct a data matrix. Since the dat set contains much more ‘1’s than ‘7’s, we regard the former as the dominant digit while the latter outlier. For visual illustration, we show examples of these digit images in Fig. 6.
It is observed that all the ‘7’s are outliers. Besides, some ‘1’s are quite different from the majority, such as the one with an underline. We apply RES-PCA to this data set and obtain the separated and . In , those columns corresponding to outliers have relatively larger values. Following , we use the norm to measure the columns of and show their values in Fig. 7, where we have vanished values smaller than 5 for clearer visualization. Then we show the corresponding digits in Fig. 8, which are the detected outliers. It is noted that RES-PCA has detected all the ‘7’s as well as some ‘1’s, such as the one with an underline. This has verified the effectiveness of RES-PCA in anomaly detection.
We have analyzed the scalability of the proposed method in previous sections. In this test, we empirically verify the result from our analysis regarding the linearity with and using the data sets in Table 1. For each of these data sets, we use different sampling ratios in sample size and data dimension, respectively, to collect its subsets of different sizes. On each subset, we perform RES-PCA 10 times. From Table 2, it is seen that all experiments are terminated within about 23-25 iterations; hence, in this test we temporarily ignore the terminating tolerance and terminate the experiment within a reasonable number of iterations, which is set to be 30. Then we report the average time cost and show the results in Fig. 9. It is observed that the time cost of RES-PCA increases linearly in both and , which confirms the scalability of the proposed method.
Existing RPCA methods typically need to solve SVDs of large matrices, which generally has at least a quadratic or even cubic complexity. To combat this drawback, in this pape we propose a new type of RPCA method. The new method recovers the low-rank component by exploiting geometrical similarities of the data, without performing any SVD that current state-of-the-art RPCA methods usually have to do. We develop an ALM-based optimization algorithm which is linearly efficient and scalable in both data dimension and sample size. Extensive experiments in different applications testify to the effectivenss of the proposed method, in which we observe superior performance in speed and visual quality to several current state-of-the-art methods. These observations suggest that the proposed method is suitable for large-scale data applications in real world problems.
This work is supported by National Natural Science Foundation of China under grants 61806106, 61802215, 61806045, 61502261, 61572457, and 61379132. C. Chen and Q. Cheng are corresponding authors.
-  Necdet Serhat Aybat, Donald Goldfarb, and Garud Iyengar. Fast first-order methods for stable principal component pursuit. arXiv preprint arXiv:1105.2126, 2011.
-  Ronen Basri and David W Jacobs. Lambertian reflectance and linear subspaces. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 25(2):218–233, 2003.
-  Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.
Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al.
Distributed optimization and statistical learning via the alternating
direction method of multipliers.
Foundations and Trends® in Machine learning, 3(1):1–122, 2011.
-  Jian-Feng Cai, Emmanuel J Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
-  Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):11, 2011.
-  Christophe Croux and Gentiane Haesbroeck. Principal component analysis based on robust estimators of the covariance or correlation matrix: influence functions and efficiencies. Biometrika, 87(3):603–618, 2000.
-  Ingrid Daubechies, Michel Defrise, and Christine De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on pure and applied mathematics, 57(11):1413–1457, 2004.
Fernando De La Torre and Michael J Black.
A framework for robust subspace learning.
International Journal of Computer Vision, 54(1-3):117–142, 2003.
-  Xinghao Ding, Lihan He, and Lawrence Carin. Bayesian robust principal component analysis. IEEE Transactions on Image Processing, 20(12):3419–3430, 2011.
Ramanathan Gnanadesikan and John R Kettenring.
Robust estimates, residuals, and outlier detection with multiresponse data.Biometrics, pages 81–124, 1972.
-  Gene H Golub and Charles F Van Loan. Matrix computations, volume 3. JHU Press, 2012.
-  Zhao Kang, Chong Peng, and Qiang Cheng. Robust pca via nonconvex rank approximation. In Data Mining (ICDM), 2015 IEEE International Conference on, pages 211–220. IEEE, 2015.
Qifa Ke and Takeo Kanade.
Robust l norm factorization in the presence of outliers and
missing data by alternative convex programming.
Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 739–746. IEEE, 2005.
-  Wee Kheng Leow, Yuan Cheng, Li Zhang, Terence Sim, and Lewis Foo. Background recovery by fixed-rank robust principal component analysis. In Computer Analysis of Images and Patterns, pages 54–61. Springer, 2013.
-  Zhouchen Lin, Minming Chen, and Yi Ma. The augmented lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv preprint arXiv:1009.5055, 2010.
-  Ricardo A Maronna, R. Douglas Martin, and Victor J Yohai. Robust statistics. John Wiley & Sons Ltd Chichester, 2006.
-  Michael McCoy, Joel A Tropp, et al. Two proposals for robust pca using semidefinite programming. Electronic Journal of Statistics, 5:1123–1160, 2011.
-  Praneeth Netrapalli, UN Niranjan, Sujay Sanghavi, Animashree Anandkumar, and Prateek Jain. Non-convex robust pca. In Advances in Neural Information Processing Systems, pages 1107–1115, 2014.
-  Chong Peng, Zhao Kang, Shuting Cai, and Qiang Cheng. Integrate and conquer: Double-sided two-dimensional k-means via integrating of projection and manifold construction. ACM Trans. Intell. Syst. Technol., 9(5):57:1–57:25, June 2018.
-  Chong Peng, Zhao Kang, and Qiang Cheng. A fast factorization-based approach to robust pca. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 1137–1142. IEEE, 2016.
-  Chong Peng, Zhao Kang, Huiqing Li, and Qiang Cheng. Subspace clustering using log-determinant rank approximation. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 925–934. ACM, 2015.
-  Kim-Chuan Toh and Sangwoon Yun. An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pacific Journal of Optimization, 6(615-640):15, 2010.
-  N. Vaswani, T. Bouwmans, S. Javed, and P. Narayanamurthy. Robust subspace learning: Robust pca, robust subspace tracking, and robust subspace recovery. IEEE Signal Processing Magazine, 35(4):32–55, July 2018.
-  John Wright, Arvind Ganesh, Shankar Rao, Yigang Peng, and Yi Ma. Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization. In Advances in neural information processing systems, pages 2080–2088, 2009.
-  Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust pca via outlier pursuit. In Advances in Neural Information Processing Systems, pages 2496–2504, 2010.
-  Lei Xu and Alan L Yuille. Robust principal component analysis by self-organizing rules based on statistical physics approach. Neural Networks, IEEE Transactions on, 6(1):131–143, 1995.
-  Zihan Zhou, Xiaodong Li, John Wright, Emmanuel Candes, and Yi Ma. Stable principal component pursuit. In Information Theory Proceedings (ISIT), 2010 IEEE International Symposium on, pages 1518–1522. IEEE, 2010.