Principal Component Analysis (PCA) and related subspace learning techniques based on matrix factorization have been widely used in dimensionality reduction, data compression, image processing, feature extraction and data visualization. It is well known that PCA based on a Gaussian noise model (i.e., Gaussian PCA) is sensitive to noise of large magnitude, because the quadratic loss induced by the Gaussian distribution exaggerates the effect of large noise. To robustify PCA, a number of improvements have been proposed (De La Torre & Black, 2003; Khan & Dellaert, 2004; Ke & Kanade, 2005; Archambeau et al., 2006; Ding et al., 2006; Brubaker, 2009; Candes et al., 2009; Eriksson & Van Den Hengel, 2010). Roughly, these methods can be grouped into a non-probabilistic paradigm and a probabilistic paradigm according to whether they are built on a probabilistic assumption about the noise. The non-probabilistic approaches (De La Torre & Black, 2003; Ding et al., 2006; Brubaker, 2009) either use a robust $\rho$-function to weight the squared loss of each data item according to how well it fits the subspace, or try to robustly estimate the covariance matrix by either removing or down-weighting samples corrupted with large noise. Such a non-probabilistic paradigm makes it difficult, if not impossible, to take advantage of the desirable utilities offered by sophisticated probabilistic models such as the Bayesian treatment of PCA, mixtures of probabilistic PCAs (Tipping & Bishop, 1999) and probabilistic matrix factorization (Salakhutdinov & Mnih, 2008), and it does not facilitate statistical testing or comparison with other probabilistic techniques.
The probabilistic robust PCA methods (Khan & Dellaert, 2004; Ke & Kanade, 2005; Archambeau et al., 2006; Candes et al., 2009; Eriksson & Van Den Hengel, 2010) are derived by replacing the Gaussian assumption on the noise with a Laplace assumption (Ke & Kanade, 2005; Candes et al., 2009; Eriksson & Van Den Hengel, 2010) or a Student-t assumption (Khan & Dellaert, 2004; Archambeau et al., 2006). Both the Laplace distribution and the Student-t distribution have heavy tails, which can reasonably explain data far away from the mean; thus they are suitable for modeling spiky noise of large magnitude. However, these methods suffer from new problems. A major drawback of Laplace PCA is that it is incapable of coping with dense noise. This is because a Laplace distribution and the resultant $\ell_1$ norm induce sparsity in the solution (Tibshirani, 1996), thereby falsely using a sparse model to explain dense noise. In principle, the Student-t distribution can avoid the drawbacks of both the Laplace and the Gaussian distribution. Khan & Dellaert (2004) and Archambeau et al. (2006) tried to robustify probabilistic PCA (PPCA) (Tipping & Bishop, 1999) by replacing the Gaussian noise assumption with a Student-t assumption. In practice, however, we find that their methods perform very similarly to Gaussian PCA and work much worse than Laplace PCA on large noise.
In our opinion, data noise can be roughly partitioned into four patterns according to its abundance and magnitude: sparse small noise, sparse large noise, dense small noise and dense large noise. Gaussian PCA is limited to small noise and Laplace PCA is only suitable for sparse noise. Once noise is both large and dense, neither of the two PCA methods will suffice. In reality, dense large noise is ubiquitous. For instance, in factorization based structure from motion (Tomasi & Kanade, 1992), grossly mistracked features are quite common due to bad illumination, fast optical flow, occlusions, and the deficiencies of tracking algorithms. On photo sharing websites (like Flickr and Instagram), many user generated tags are irrelevant to the images and many objects and attributes in images are not tagged by users. Thus considerable false positives and false negatives are present in the data. With the popularity of low cost cameras on mobile devices, millions of user generated videos are published to video sharing websites like YouTube. Due to high capturing rates, poor lighting conditions and users' unprofessional capturing habits, videos are usually contaminated with gross noise affecting nearly every pixel, especially when videos are taken at night or on fast moving vehicles. On the other hand, in many problems, noise patterns are mixed. It is common that most entries of the low rank matrix are corrupted by small noise while a small part are contaminated by large noise. Gaussian PCA and Laplace PCA are not applicable in this case since neither of them is able to deal with the two types of noise simultaneously. Zhou et al. (2010) proposed Stable Principal Component Pursuit to recover matrices corrupted by small entry-wise noise and gross sparse errors. However, their method requires a good estimate of the magnitude of the small noise, which is infeasible in many real applications.
In this paper, we propose an alternative probabilistic robust PCA method called Cauchy PCA, which is robust to all kinds of noise patterns. We use the Cauchy distribution to model noise and derive Cauchy PCA under a maximum likelihood estimation framework with rank constraints. We present a simple yet efficient projected gradient optimization method. Experiments demonstrate the robustness of Cauchy PCA to various noise patterns, and in particular, its superior capability in dealing with large dense noise.
The rest of the paper is organized as follows. Section 2 introduces related work. We propose Cauchy PCA in Section 3. Section 4 gives experimental results on both simulated data and real world data. Section 5 concludes the paper.
2 Related Works
Robust PCA methods can be categorized into two paradigms: non-probabilistic approaches and probabilistic approaches. The basic strategy of non-probabilistic methods is to remove or down-weight the influence of data items corrupted by large noise. De La Torre & Black (2003) proposed robust subspace learning by replacing the squared loss function in PCA with the Geman-McClure error function, which is less sensitive to large noise, and used the iteratively reweighted least squares (IRLS) method to solve the problem. Ding et al. (2006) proposed a rotational invariant $\ell_1$-norm PCA ($R_1$-PCA), whose principal components are eigenvectors of a re-weighted covariance matrix that softens the contribution of samples corrupted with large noise. Brubaker (2009) estimated the subspace by alternately removing outliers and projecting to a lower dimensional subspace. These methods lack the ability to build or integrate with sophisticated probabilistic models.
In probabilistic approaches, one popular family is based on the Laplace noise assumption and the $\ell_1$ norm. Ke & Kanade (2005) proposed a matrix factorization formulation with the $\ell_1$ norm and used alternating convex optimization as a solver. Candes et al. (2009) proposed Principal Component Pursuit (PCP) to recover a low rank matrix corrupted by sparse errors of arbitrarily large magnitude. They prove that, as long as the noise is sufficiently sparse and the rank of the true underlying subspace is sufficiently small, the low rank matrix can be exactly recovered. Eriksson & Van Den Hengel (2010) generalized the Wiberg algorithm to solve the $\ell_1$ norm based low rank matrix approximation problem in the presence of missing data. All these $\ell_1$ norm (Laplace noise assumption) based methods are incapable of dealing with dense noise. Ganesh et al. (2010) claimed that by choosing a proper value of the tradeoff parameter, Principal Component Pursuit is also robust to dense noise. However, we find that in practice their suggested way of choosing the parameter yields very poor results. Zhou et al. (2010) incorporated an additional term into the PCP model to account for small dense noise, but it is still unable to handle large dense noise. Xu et al. (2010) formulated a problem similar to that of Candes et al. (2009) to identify entirely corrupted points by imposing a column-wise $\ell_{1,2}$ norm on the noise matrix. Their method assumes column level corruption of the low rank matrix and is not applicable to entry level corruption.
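The other family of probabilistic approaches is based on a Student-t noise assumption. Khan & Dellaert (2004) and Archambeau et al. (2006) robustify probabilistic PCA (Tipping & Bishop, 1999) with the generative model $x = Wz + \mu + \epsilon$, where $z$ is a latent vector in a lower-dimensional space, $W$ is the projection matrix, $\mu$ is the data offset and $\epsilon$ is noise drawn from a Student-t distribution. They utilize the property that a Student-t distribution is an infinite mixture of Gaussian distributions with the same mean and varying variances, and propose an expectation maximization (EM) algorithm to infer the latent vectors $z$ and learn the parameters $W$ and $\mu$. Empirically, these methods are comparable with Gaussian PCA on small noise. For large noise, they work slightly better than Gaussian PCA but much worse than Laplace PCA.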
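The scale-mixture property used by these EM algorithms can be written out explicitly. The identity below is the standard one for a univariate Student-t density; the symbols $\mu$ (mean), $\sigma$ (scale), $\nu$ (degrees of freedom) and $u$ (latent precision weight) are notation assumed here for illustration and are not taken from the cited papers.

```latex
\mathrm{St}(x \mid \mu, \sigma^2, \nu)
  = \int_0^\infty
      \mathcal{N}\!\left(x \,\middle|\, \mu, \tfrac{\sigma^2}{u}\right)
      \mathrm{Gamma}\!\left(u \,\middle|\, \tfrac{\nu}{2}, \tfrac{\nu}{2}\right)
    \, du
```

Here $\mathrm{Gamma}(u \mid a, b)$ denotes a Gamma density with shape $a$ and rate $b$. Conditioned on $u$, each sample is Gaussian with its own variance $\sigma^2 / u$, which is what makes an EM treatment with per-sample weights possible.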
3 Cauchy Principal Component Analysis
In this section, we first present location-scale family distributions and show how they can be used to derive PCA methods. Then we give some intuition for choosing the Cauchy distribution to model noise by comparing its density curve with those of other distributions in the location-scale family. We introduce Cauchy PCA by specializing the general location-scale family PCA framework with the Cauchy distribution, interpret its robustness from a robust statistics view and propose an efficient singular value projection solver.
3.1 Location-Scale Family PCA
The location-scale family is a class of distributions parameterized by a location parameter and a scale parameter. The most important property of this family is that it is closed under linear transformation: if $x$ is a random variable drawn from this family, then $ax + b$ is also from this family. This property makes it convenient to model additive noise. In the PCA setting, assume each entry of the noise matrix $E$ is drawn i.i.d. from a location-scale family distribution

$$E_{ij} \sim f(0, s) \tag{1}$$

with location parameter zero and scale parameter $s$. According to the closure under linear transformation and the additive noise assumption $X = L + E$, the observation matrix $X$ can be modeled as

$$X_{ij} \sim f(L_{ij}, s) \tag{2}$$

with shifted location parameter $L_{ij}$. The low rank matrix $L$ can be estimated by maximizing the likelihood of the observations (or minimizing the negative log likelihood) under a low rank constraint

$$\max_{L} \; \sum_{i,j} \log f(X_{ij}; L_{ij}, s) \quad \text{s.t.} \quad \operatorname{rank}(L) \le k. \tag{3}$$
Gaussian PCA and Laplace PCA (Candes et al., 2009; Ke & Kanade, 2005) are special cases of the general framework by specifying the distribution in Eq.(1) to Gaussian and Laplace distribution respectively.
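To make the connection between the noise model and the loss concrete, the following minimal sketch evaluates the per-entry negative log-likelihood for three location-scale densities. The function names and the residual convention `r = X - L` are assumptions introduced here for illustration; the rank constraint is not handled in this snippet.

```python
import numpy as np

def gaussian_nll(r, s=1.0):
    """Negative log-density of N(0, s^2) at residual r: quadratic in r."""
    r = np.asarray(r, dtype=float)
    return 0.5 * (r / s) ** 2 + np.log(s * np.sqrt(2.0 * np.pi))

def laplace_nll(r, s=1.0):
    """Negative log-density of Laplace(0, s) at residual r: absolute value of r."""
    r = np.asarray(r, dtype=float)
    return np.abs(r) / s + np.log(2.0 * s)

def cauchy_nll(r, s=1.0):
    """Negative log-density of Cauchy(0, s) at residual r: logarithmic in r^2."""
    r = np.asarray(r, dtype=float)
    return np.log(np.pi * s) + np.log1p((r / s) ** 2)

def objective(X, L, nll, s=1.0):
    """Sum of per-entry negative log-likelihoods of the residual X - L."""
    return float(np.sum(nll(np.asarray(X, dtype=float) - np.asarray(L, dtype=float), s)))

# How much each loss penalizes a residual relative to a perfect fit.
residuals = np.array([0.1, 1.0, 10.0, 100.0])
for name, nll in [("gaussian", gaussian_nll), ("laplace", laplace_nll), ("cauchy", cauchy_nll)]:
    print(name, nll(residuals) - nll(0.0))
```

The printed penalties grow quadratically, linearly and logarithmically respectively, which is one way to see why a single large residual dominates the Gaussian objective but not the other two.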
3.2 Cauchy PCA
Figure 1 shows the density curves of the univariate Gaussian, Laplace, logistic and Cauchy distributions. To enable a clear comparison, the density curves are aligned to the same location and peak. The motivation for aligning their peaks is to inspect an interesting phenomenon: if we put the same amount of probability on the mode of each distribution, how much probability will each distribution allocate to other values? This gives a good sense of heavy-tailedness. As data points move far away from the center, the Gaussian density drops quickly to zero while the Laplace and Cauchy densities retain a certain amount of mass, as shown in Figure 1(b). In other words, the Laplace and Cauchy density curves have longer tails than the Gaussian curve. A distribution (centered at zero) with a heavy tail allocates a reasonable amount of probability to values far from zero. In terms of noise modeling under a probabilistic framework, large noise can be reasonably explained by a heavy-tailed distribution since a certain amount of probability is granted to it. Thereby, Laplace PCA and Cauchy PCA naturally possess the ability to deal with large noise due to their heavy tails. At location zero (Figure 1(c)), the Laplace density is not differentiable. This non-smoothness induces sparsity, which makes the Laplace distribution unsuitable for modeling dense noise. The logistic distribution closely resembles the Gaussian distribution in shape except for a slightly heavier tail. Therefore its behavior in modeling noise should be very similar to that of the Gaussian distribution. Among the four, the Cauchy distribution has two appealing advantages. First, it is smooth at zero and does not induce sparsity, and thus is suitable for modeling dense noise. Second, it has a much heavier tail than the others, and therefore is highly capable of modeling large noise.
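The peak-aligned comparison described above can be reproduced numerically. The sketch below is a minimal illustration, assuming a common peak height of $1/\pi$ (the peak of a unit-scale Cauchy); the function names and this choice of peak are not from the paper. Each scale is solved from the closed-form peak height of the corresponding density.

```python
import numpy as np

PEAK = 1.0 / np.pi  # assumed common peak height (unit-scale Cauchy)

def gaussian_pdf(x, peak=PEAK):
    sigma = 1.0 / (peak * np.sqrt(2.0 * np.pi))  # peak = 1 / (sigma * sqrt(2 pi))
    return peak * np.exp(-0.5 * (np.asarray(x, dtype=float) / sigma) ** 2)

def laplace_pdf(x, peak=PEAK):
    b = 1.0 / (2.0 * peak)                       # peak = 1 / (2 b)
    return peak * np.exp(-np.abs(x) / b)

def logistic_pdf(x, peak=PEAK):
    s = 1.0 / (4.0 * peak)                       # peak = 1 / (4 s)
    z = np.exp(-np.abs(x) / s)                   # symmetric, numerically stable form
    return z / (s * (1.0 + z) ** 2)

def cauchy_pdf(x, peak=PEAK):
    gamma = 1.0 / (np.pi * peak)                 # peak = 1 / (pi gamma)
    return peak / (1.0 + (np.asarray(x, dtype=float) / gamma) ** 2)

# Density left at increasing distances from the shared mode.
for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, gaussian_pdf(x), logistic_pdf(x), laplace_pdf(x), cauchy_pdf(x))
```

Far from the mode, the printed values are ordered Cauchy, Laplace, logistic, Gaussian from largest to smallest, matching the qualitative description of the tails.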
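We use a Cauchy distribution with location parameter zero to model each noise entry $E_{ij}$

$$p(E_{ij}; 0, \gamma) = \frac{1}{\pi\gamma\left[1 + (E_{ij}/\gamma)^2\right]} \tag{4}$$

where $\gamma$ is the scale parameter. Substituting it into Eq.(1), we specialize the general location-scale family PCA framework to Cauchy PCA

$$\min_{L} \; \sum_{i,j} \log\left(\gamma^2 + (X_{ij} - L_{ij})^2\right) \quad \text{s.t.} \quad \operatorname{rank}(L) \le k \tag{5}$$

where $X$ is the observation matrix, $L$ is the low rank matrix to be estimated and $k$ is the rank bound.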
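The substitution step can be spelled out as a short derivation. It is a sketch under assumed notation: $X_{ij}$ is an observed entry, $L_{ij}$ the corresponding entry of the low rank matrix, $\gamma$ the Cauchy scale and $k$ the rank bound.

```latex
\begin{aligned}
p(X_{ij}; L_{ij}, \gamma)
  &= \frac{1}{\pi\gamma\left[1 + \left(\frac{X_{ij} - L_{ij}}{\gamma}\right)^{2}\right]}
   = \frac{\gamma}{\pi\left[\gamma^{2} + (X_{ij} - L_{ij})^{2}\right]}, \\
-\log p(X_{ij}; L_{ij}, \gamma)
  &= \log\!\left(\gamma^{2} + (X_{ij} - L_{ij})^{2}\right) + \log\pi - \log\gamma, \\
\hat{L}
  &= \arg\min_{\operatorname{rank}(L) \le k} \;
     \sum_{i,j} \log\!\left(\gamma^{2} + (X_{ij} - L_{ij})^{2}\right).
\end{aligned}
```

The term $\log\pi - \log\gamma$ does not depend on $L$, so it drops out of the optimization. The remaining loss grows only logarithmically in the squared residual, in contrast to the quadratic growth of the Gaussian loss.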
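Cauchy PCA can be naturally extended to deal with missing data. We use $\Omega_{ij} = 1$ to denote that the entry at the $i$th row and $j$th column of the observation matrix $X$ is observed, and $\Omega_{ij} = 0$ otherwise. We maximize the following data likelihood over the low rank matrix $L$

$$\max_{L} \; \prod_{i,j} p(X_{ij}; L_{ij}, \gamma)^{\Omega_{ij}} \quad \text{s.t.} \quad \operatorname{rank}(L) \le k \tag{6}$$

where $p(\cdot\,; L_{ij}, \gamma)$ is the Cauchy density with location $L_{ij}$ and scale $\gamma$, which is equivalent to introducing a 0-1 weight matrix $\Omega$ to weight each item in Eq.(5).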
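The 0-1 weighting can be sketched in a few lines. This is an illustrative snippet, not the authors' implementation: the names `masked_cauchy_objective` and `mask`, and the convention that missing entries may be stored as NaN, are assumptions made here.

```python
import numpy as np

def masked_cauchy_objective(X, L, mask, gamma=0.1):
    """Cauchy negative log-likelihood (up to a constant) over observed entries only.

    mask[i, j] is True/1 where X[i, j] is observed and False/0 where it is missing.
    """
    X = np.asarray(X, dtype=float)
    L = np.asarray(L, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    # Zero out residuals at missing positions so NaN placeholders never propagate.
    r = np.where(mask, X - L, 0.0)
    return float(np.sum(mask * np.log(gamma ** 2 + r ** 2)))

# Example: one missing entry stored as NaN.
X = np.array([[1.0, np.nan], [0.0, 2.0]])
L = np.zeros((2, 2))
mask = np.array([[1, 0], [1, 1]])
print(masked_cauchy_objective(X, L, mask, gamma=1.0))
```

Only the three observed entries contribute to the sum; the missing entry has weight zero regardless of the value stored in its slot.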
3.3 A Robust Statistics Interpretation
In this section, we explain the robustness of Cauchy PCA from a robust statistics view (Hampel et al., 2011). Robust statistics seeks to provide estimators that resist gross noise. To be consistent with Sections 3.1 and 3.2, we assume distributions are located at zero and parameters are estimated using maximum likelihood estimation (MLE). For a set of distributions $\mathcal{F}$ and an estimator $T$ defined on $\mathcal{F}$, Hampel (1974) introduced the influence function:
The influence function (IF) of $T$ at $F \in \mathcal{F}$ is given by

$$\mathrm{IF}(x; T, F) = \lim_{t \to 0^+} \frac{T\big((1 - t)F + t\Delta_x\big) - T(F)}{t}$$

in those $x$ where this limit exists, where $\Delta_x$ denotes the point mass at $x$.
Heuristically, the influence function describes the effect of an infinitesimal contamination at the point $x$ on the estimate. Based on the IF, Hampel (1974) defined the gross-error sensitivity:
The gross-error sensitivity of $T$ at $F$ is measured by

$$\gamma^* = \sup_{x} \left|\mathrm{IF}(x; T, F)\right|,$$

the supremum being taken over all $x$ where the IF exists.
The gross-error sensitivity measures the worst influence that a small amount of contamination of fixed size can have on the estimator. A desirable robust estimator should have finite $\gamma^*$.
Corollary 1. $\gamma^*$ of the Cauchy and Laplace MLE estimators is bounded; $\gamma^*$ of the Gaussian MLE estimator is unbounded. (Due to space limits, the proofs of Corollaries 1 and 2 are provided in the supplementary material.)
Corollary 1 explains why Cauchy and Laplace PCA are robust to gross noise while Gaussian PCA is not.
Another quantity, the local-shift sensitivity (Hampel, 1974), measures the effect on the estimator of shifting an observation slightly from one point $x$ to a neighboring point $y$:
The local-shift sensitivity of an estimator $T$ at a distribution $F$ is defined as

$$\lambda^* = \sup_{x \ne y} \frac{\left|\mathrm{IF}(y; T, F) - \mathrm{IF}(x; T, F)\right|}{|y - x|}.$$

A stable and robust estimator should possess a low $\lambda^*$.
Corollary 2. $\lambda^*$ of the Cauchy and Gaussian MLE estimators is bounded; $\lambda^*$ of the Laplace MLE estimator is unbounded.
From Corollary 2, we can see Laplace estimator is very sensitive to local shifting around zero.
Cauchy MLE estimator has both bounded gross-error sensitivity and bounded local-shift sensitivity. Therefore, it is robust to gross noise and local shifting around zero.
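The two sensitivities can be illustrated numerically. The sketch below relies on a standard fact about maximum likelihood estimation of a location parameter: the influence function is proportional to the score $\psi(x) = -\frac{d}{dx}\log f(x)$. It therefore inspects $\psi$ on a finite grid as a stand-in for the influence function; the function names and grid are assumptions made here, and a grid can only suggest, not prove, boundedness.

```python
import numpy as np

def psi_gaussian(x, s=1.0):
    return x / s ** 2                    # grows without bound in |x|

def psi_laplace(x, s=1.0):
    return np.sign(x) / s                # bounded, but jumps at x = 0

def psi_cauchy(x, s=1.0):
    return 2.0 * x / (s ** 2 + x ** 2)   # bounded and smooth everywhere

def sup_abs(psi, x):
    """Grid proxy for gross-error sensitivity: largest |psi| on the grid."""
    return float(np.max(np.abs(psi(x))))

def sup_slope(psi, x):
    """Grid proxy for local-shift sensitivity: largest slope between neighbors."""
    y = psi(x)
    return float(np.max(np.abs(np.diff(y) / np.diff(x))))

x = np.linspace(-100.0, 100.0, 20001)    # grid spacing 0.01
for name, psi in [("gaussian", psi_gaussian), ("laplace", psi_laplace), ("cauchy", psi_cauchy)]:
    print(name, sup_abs(psi, x), sup_slope(psi, x))
```

On this grid the Gaussian score reaches the grid boundary value, so its maximum grows with the grid range, while the Laplace slope across zero grows as the grid is refined. The Cauchy score stays within $1/s$ in magnitude and $2/s^2$ in slope, consistent with both corollaries.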
Gaining insights from Meka et al. (2009), we adopt a projected gradient descent method to solve the problem defined in Eq.(5). It is an iterative approach in which each iteration consists of a gradient update and a projection. Algorithm 1 outlines the optimization method. The low rank matrix $L$ to be estimated is initialized to the observation matrix $X$. At each iteration, we first compute the gradient matrix $G$ of the objective with respect to $L$, then use $G$ to update $L$: $L \leftarrow L - \eta G$, where $\eta$ is a step size. This is the ordinary gradient descent step. Then we project the newly obtained $L$ onto the feasible set $\{L : \operatorname{rank}(L) \le k\}$. The projection is done by computing the $k$ largest singular values and the corresponding singular vectors of $L$, giving $U_k \Sigma_k V_k^\top$, and then reconstructing $L \leftarrow U_k \Sigma_k V_k^\top$. Note that in the projection phase, we only need to compute the top $k$ singular values and their corresponding singular vectors, which can be done efficiently by the Lanczos SVD algorithm (Larsen, 1998). Note also that the optimization problem is not convex and the method may get stuck in local optima. It can be helpful to run the algorithm multiple times with different random initializations.
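The iteration described above can be sketched as follows. This is a minimal illustration rather than the authors' Algorithm 1: the function names, the fixed iteration count, and the default step size `gamma**2 / 2` (the reciprocal of the largest curvature of the per-entry loss, a conservative choice) are assumptions made here. A full SVD from NumPy is used for brevity; a truncated Lanczos-type solver would replace it for large matrices.

```python
import numpy as np

def project_rank_k(M, k):
    """Project M onto the set of matrices of rank at most k via truncated SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def cauchy_pca(X, k, gamma=0.1, step=None, n_iter=500):
    """Projected gradient descent on sum_ij log(gamma^2 + (X_ij - L_ij)^2)."""
    X = np.asarray(X, dtype=float)
    if step is None:
        step = gamma ** 2 / 2.0              # assumed default, not from the paper
    L = X.copy()                             # initialize at the observations
    for _ in range(n_iter):
        R = X - L
        G = -2.0 * R / (gamma ** 2 + R ** 2)     # gradient of the objective w.r.t. L
        L = project_rank_k(L - step * G, k)      # gradient step, then projection
    return L

# Small demonstration on synthetic data (sizes and noise level are illustrative).
rng = np.random.default_rng(0)
L_true = rng.uniform(size=(30, 2)) @ rng.uniform(size=(2, 20))
X = L_true.copy()
hit = rng.random(X.shape) < 0.1
X[hit] += rng.uniform(-5.0, 5.0, size=int(hit.sum()))
L_hat = cauchy_pca(X, k=2, gamma=0.1, n_iter=200)
print("relative error:", np.linalg.norm(L_hat - L_true) / np.linalg.norm(L_true))
```

Because the first gradient at `L = X` is zero, the first iteration reduces to a truncated SVD of the observations, so the iterates start from the Gaussian PCA solution. With a small `gamma` the conservative step makes progress slow; a larger step or a line search is a natural variation.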
4 Experiments
In this section, we first corroborate the ability of Cauchy PCA to recover matrices from various noise patterns on simulated data, then demonstrate its usage in a real application: face recognition with corruption.
To evaluate the robustness of Cauchy PCA to various noise patterns, we generate low rank matrices, corrupt them with noise of diverse amounts and magnitudes, and then try to recover them. We consider matrices with . Similar to Candes et al. (2009), we generate a rank matrix where , are matrices whose entries are independently sampled from a uniform distribution. To corrupt , we randomly choose entries and independently add to each of them noise sampled uniformly from . We call corruption rate and as noise magnitude. Each matrix is corrupted by 33 noise patterns by varying the corruption rate and noise magnitude . We compare Cauchy PCA with Gaussian PCA, Laplace PCA (Candes et al., 2009), and multivariate Student-t PCA (MV-Student-t PCA) (Khan & Dellaert, 2004). The rank constraint parameter in Gaussian PCA and Cauchy PCA and the dimension of the latent vectors in MV-Student-t PCA are set to the intrinsic rank of the matrix to be recovered. Throughout the experiments, we set the scale parameter of Cauchy PCA to 0.1. For Laplace PCA, we tune the trade-off parameter and choose the largest one under which the estimated low rank matrix is of rank . We use to measure recovery error, where is the estimated low rank matrix and is the true low rank matrix.
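The simulation protocol can be sketched as below. The concrete choices here are assumptions for illustration only, since the exact values are not reproduced in this text: factors drawn from a uniform distribution on [0, 1), noise drawn uniformly from a symmetric interval `[-magnitude, magnitude]`, and recovery error measured as a relative Frobenius norm. All function names are hypothetical.

```python
import numpy as np

def make_low_rank(m, n, k, rng):
    """Rank-k matrix as a product of two factors with uniform entries (assumed range [0, 1))."""
    A = rng.uniform(0.0, 1.0, size=(m, k))
    B = rng.uniform(0.0, 1.0, size=(n, k))
    return A @ B.T

def corrupt(L, rate, magnitude, rng):
    """Add uniform noise to a randomly chosen fraction `rate` of the entries."""
    shape = np.shape(L)
    X = np.array(L, dtype=float).ravel()       # flat copy of L
    n_corrupt = int(round(rate * X.size))
    idx = rng.choice(X.size, size=n_corrupt, replace=False)
    X[idx] += rng.uniform(-magnitude, magnitude, size=n_corrupt)
    return X.reshape(shape), idx

def relative_error(L_hat, L_true):
    """Relative Frobenius-norm error (an assumed metric)."""
    return float(np.linalg.norm(L_hat - L_true) / np.linalg.norm(L_true))

rng = np.random.default_rng(0)
L_true = make_low_rank(50, 40, 5, rng)
X, corrupted = corrupt(L_true, rate=0.5, magnitude=10.0, rng=rng)
print("corrupted entries:", corrupted.size)
print("error of the raw observations:", relative_error(X, L_true))
```

Sweeping `rate` and `magnitude` over a grid and passing `X` to each PCA method yields the kind of error-versus-corruption-rate curves discussed in this section.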
Figure 2 summarizes the matrix recovery results of the four PCA methods. Each subfigure shows recovery error versus corruption rate . Noise magnitude in the first, second, third row respectively. Matrix size in the first, second, third column respectively. In the second row and third row, errors at and are not displayed since they are unreasonably high (greater than 1). In the third row, errors of Gaussian PCA and MV-Student-t PCA at all corruption rates are not displayed for the same reason. As can be seen from the figure, Gaussian PCA quickly fails as noise becomes large. When , errors of Gaussian PCA are greater than 1 at all corruption rates greater than zero. In all cases, Laplace PCA works very well when noise is sparse but fails rapidly when noise becomes dense. When the corruption rate is below 0.3, Laplace PCA can perfectly recover the low rank matrix. However, once the corruption rate exceeds 0.3, errors of Laplace PCA increase sharply. This corroborates our claim and analysis that Laplace PCA is only suitable for sparse noise. Cauchy PCA shows great robustness under all kinds of noise conditions. When noise is small (shown in the first row), Cauchy PCA has performance comparable to that of Gaussian PCA. When noise is large (shown in the second and third rows), errors of Cauchy PCA are consistently small at all corruption rates. The performance of Cauchy PCA is significantly superior to the other two when noise is large and dense. For instance, when , , , the average error of Cauchy PCA is only 0.032 while Laplace PCA and Gaussian PCA suffer errors of 0.882 and 9.049. The performance of MV-Student-t PCA is very similar to that of Gaussian PCA. For small noise (the first row), MV-Student-t PCA is nearly the same as Gaussian PCA. For large noise (the second and third rows), MV-Student-t PCA works better than Gaussian PCA, but much worse than Laplace PCA and Cauchy PCA. We conjecture that the reason is that in the EM procedure of MV-Student-t PCA, each data instance is actually modeled using a Gaussian distribution with instance-specific variance, so the final result is close to that of Gaussian PCA.
4.2 Face Recognition With Corruption
In this section, we investigate the problem of face recognition where face images are contaminated by severe noise (Wright et al., 2009). As in Wright et al. (2009), we randomly corrupt a percentage of pixels and evaluate the robustness of each PCA method in recognizing corrupted faces. For face recognition, we adopt the eigenface (Turk & Pentland, 1991) methodology. Given the noisy training data, we use each PCA method to recover the low rank matrix, then obtain the basis from the low rank matrix. All training and testing images are projected into the low dimensional subspace spanned by the learned basis. For each testing face, recognition is performed by finding the nearest training face in the subspace and assigning the identity of that training face to the testing face. We measure the recognition accuracy under varying corruption rate . The recognition rate is defined as the ratio between the number of correctly recognized faces and the total number of test faces.
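The recognition pipeline described above can be sketched as follows. This is an illustrative reading, assuming that images are stored as columns, that the basis is taken as the top left singular vectors of the recovered low rank matrix, and that nearest neighbors are found with Euclidean distance in the subspace; the function names are hypothetical.

```python
import numpy as np

def learn_basis(L_train, k):
    """Top-k left singular vectors of the recovered training matrix (images as columns)."""
    U, _, _ = np.linalg.svd(np.asarray(L_train, dtype=float), full_matrices=False)
    return U[:, :k]

def nearest_neighbor_predict(basis, train, train_labels, test):
    """Project train/test images onto the basis and assign the nearest training label."""
    Z_train = basis.T @ train                 # k x n_train
    Z_test = basis.T @ test                   # k x n_test
    diff = Z_test[:, :, None] - Z_train[:, None, :]
    d2 = np.sum(diff ** 2, axis=0)            # n_test x n_train squared distances
    return np.asarray(train_labels)[np.argmin(d2, axis=1)]

def recognition_rate(predicted, actual):
    """Fraction of test faces whose identity is predicted correctly."""
    return float(np.mean(np.asarray(predicted) == np.asarray(actual)))

# Toy example with 3-pixel "images": two identities along two different directions.
train = np.array([[1.0, 0.9, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.9],
                  [0.0, 0.0, 0.0, 0.0]])
labels = np.array([0, 0, 1, 1])
test = np.array([[0.98, 0.05],
                 [0.00, 1.00],
                 [0.00, 0.00]])
basis = learn_basis(train, k=2)
pred = nearest_neighbor_predict(basis, train, labels, test)
print(pred, recognition_rate(pred, [0, 1]))
```

In the experiments, `L_train` would be the low rank matrix recovered by each PCA method from the corrupted training images, so the quality of the learned basis directly reflects the robustness of that method.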
We use the Extended Yale B dataset (Lee et al., 2005) consisting of 2414 frontal-face images of 38 individuals. For each individual, we randomly choose half of the images for training and the other half for testing. All images are resized to . Following Wright et al. (2009), we corrupt pixels by randomly replacing their values with integers sampled uniformly from . For each , we take 5 replications where for each image, pixels are corrupted. All corrupted images are normalized to have zero mean and unit standard deviation. We perform two experiments under different rank constraint settings. In the first experiment, we set the rank constraint of Gaussian PCA and Cauchy PCA and the dimension of the latent vectors in MV-Student-t PCA to 30, and tune the tradeoff parameter of Laplace PCA to make sure the recovered low rank matrix is of rank 30. In the second experiment, the rank constraint is set to 60.
Figure 3 shows the recognition accuracy under different corruption rates. It can be seen that Cauchy PCA is more robust to dense large noise than Gaussian and Laplace PCA. For rank 30 (Figure 3(a)), Gaussian PCA quickly fails when the corruption rate exceeds 0.3. Cauchy and Laplace PCA have comparably stable performance when the corruption rate is below 0.5. At , the accuracy of Laplace PCA has a sharp drop while Cauchy PCA remains stable. At , the accuracy of Laplace PCA drops to 0.09 while Cauchy PCA achieves 0.27. Similar results can also be observed for rank 60 (Figure 3(b)). The recognition accuracy of MV-Student-t PCA is better than that of Gaussian PCA and worse than that of Laplace PCA and Cauchy PCA, which is consistent with the matrix recovery results reported in Section 4.1.
Figure 4 shows face reconstruction results for , . The original face images (Figure 4(a)) are heavily corrupted by noise (Figure 4(b)). The faces reconstructed by Gaussian PCA (Figure 4(c)) remain severely contaminated. Laplace PCA gets better results (Figure 4(d)), but the reconstructed images are still hard to recognize. Some reconstructions even change the original appearance. For example, the reconstruction for the girl in the fourth row is completely wrong. In contrast, as shown in Figure 4(e), Cauchy PCA can successfully remove most of the noise and restore the original appearance. Faces reconstructed by Cauchy PCA are much easier to recognize. The reconstruction results of MV-Student-t PCA are very close to those of Gaussian PCA. We do not show them to avoid cluttering Figure 4.
5 Conclusion
We propose Cauchy principal component analysis, which is robust to various noise patterns. For large dense noise, Cauchy PCA significantly outperforms Gaussian PCA and Laplace PCA. For small noise, Cauchy PCA has performance comparable to that of Gaussian PCA. For large noise, Cauchy PCA possesses robustness comparable to that of Laplace PCA. Experiments on simulated data and real world applications corroborate our intuitive and theoretical analysis of the robustness of our method. In the future, we will seek further theoretical explanations and find more efficient solvers to scale Cauchy PCA to large datasets.
References
- Archambeau et al. (2006) Archambeau, Cédric, Delannay, Nicolas, and Verleysen, Michel. Robust probabilistic projections. In ICML, pp. 33–40. ACM, 2006.
- Brubaker (2009) Brubaker, S.C. Robust PCA and clustering in noisy mixtures. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1078–1087. Society for Industrial and Applied Mathematics, 2009.
- Candes et al. (2009) Candes, E.J., Li, X., Ma, Y., and Wright, J. Robust principal component analysis? arXiv preprint arXiv:0912.3599, 2009.
- De La Torre & Black (2003) De La Torre, F. and Black, M.J. A framework for robust subspace learning. International Journal of Computer Vision, 54(1):117–142, 2003.
- Ding et al. (2006) Ding, C., Zhou, D., He, X., and Zha, H. R1-PCA: rotational invariant L1-norm principal component analysis for robust subspace factorization. In ICML, pp. 281–288. ACM, 2006.
- Eriksson & Van Den Hengel (2010) Eriksson, A. and Van Den Hengel, A. Efficient computation of robust low-rank matrix approximations in the presence of missing data using the l1 norm. In CVPR 2010, pp. 771–778. IEEE, 2010.
- Ganesh et al. (2010) Ganesh, A., Wright, J., Li, X., Candes, E.J., and Ma, Y. Dense error correction for low-rank matrices via principal component pursuit. In Information Theory Proceedings (ISIT), 2010 IEEE International Symposium on, pp. 1513–1517. IEEE, 2010.
- Hampel (1974) Hampel, F.R. The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346):383–393, 1974.
- Hampel et al. (2011) Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., and Stahel, W.A. Robust statistics: the approach based on influence functions, volume 114. Wiley, 2011.
- Ke & Kanade (2005) Ke, Q. and Kanade, T. Robust l1 norm factorization in the presence of outliers and missing data by alternative convex programming. In CVPR 2005, volume 1, pp. 739–746. IEEE, 2005.
- Khan & Dellaert (2004) Khan, Zia and Dellaert, Frank. Robust generative subspace modeling: The subspace t distribution. 2004.
- Larsen (1998) Larsen, R.M. Lanczos bidiagonalization with partial reorthogonalization. 1998.
- Lee et al. (2005) Lee, K.C., Ho, J., and Kriegman, D.J. Acquiring linear subspaces for face recognition under variable lighting. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(5):684–698, 2005.
- Meka et al. (2009) Meka, R., Jain, P., and Dhillon, I.S. Guaranteed rank minimization via singular value projection. arXiv preprint arXiv:0909.5457, 2009.
- Salakhutdinov & Mnih (2008) Salakhutdinov, R. and Mnih, A. Probabilistic matrix factorization. Advances in neural information processing systems, 20:1257–1264, 2008.
- Tibshirani (1996) Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.
- Tipping & Bishop (1999) Tipping, M.E. and Bishop, C.M. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999.
- Tomasi & Kanade (1992) Tomasi, C. and Kanade, T. Shape and motion from image streams under orthography: a factorization method. International Journal of Computer Vision, 9(2):137–154, 1992.
- Turk & Pentland (1991) Turk, M. and Pentland, A. Eigenfaces for recognition. Journal of cognitive neuroscience, 3(1):71–86, 1991.
- Wright et al. (2009) Wright, John, Yang, Allen Y, Ganesh, Arvind, Sastry, S Shankar, and Ma, Yi. Robust face recognition via sparse representation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(2):210–227, 2009.
- Xu et al. (2010) Xu, H., Caramanis, C., and Sanghavi, S. Robust PCA via outlier pursuit. Information Theory, IEEE Transactions on, (99):1–1, 2010.
- Zhou et al. (2010) Zhou, Z., Li, X., Wright, J., Candes, E., and Ma, Y. Stable principal component pursuit. In Information Theory Proceedings (ISIT), 2010 IEEE International Symposium on, pp. 1518–1522. IEEE, 2010.