1 Introduction
Principal Component Analysis (PCA) [25] is one of the key methods for dimensionality reduction and feature extraction in data analysis [4, 13], and plays an important role in machine learning [22], computer vision [31], genetics [3], and so on. As high-dimensional data are difficult to analyze effectively and bring a huge computational burden, PCA attempts to represent high-dimensional real data with a small number of principal components.
There exist two common definitions of PCA, both of which are algebraic and lack a probabilistic explanation for the observed data [30]. To overcome this shortcoming, Tipping and Bishop proposed a probabilistic model for PCA, called Probabilistic PCA (PPCA) [30]. PPCA has several advantages over traditional PCA; for example, it can deal with missing values in the observed data via the related EM algorithm.
However, despite these merits, PCA and PPCA share an inherent drawback: they are not robust, i.e., they are strongly affected by outliers [29, 28, 35]. In the presence of outliers, the principal components obtained by classical PCA and PPCA deviate greatly from the true directions. Consequently, we cannot extract the main information of the data accurately, and subsequent analysis based on these principal components cannot obtain satisfactory results.
By imitating human or animal learning, Self-Paced Learning (SPL) usually begins with simple samples of a learning task, then introduces complex examples into the training process step by step [20]. SPL has been applied to object detection [34, 27], matrix factorization [37], and mixtures of regressions [8]. It has been shown that SPL has the ability to reduce the effect of outliers [36].
To improve the robustness of dimensionality reduction algorithms, we incorporate the Self-Paced Learning mechanism [16] into PPCA, and propose a novel model called Self-Paced Probabilistic Principal Component Analysis (SPPPCA). Based on PPCA, we design a new objective function by introducing additional parameters for self-paced learning. We use an alternating search strategy to learn the original parameters of PPCA and the additionally introduced parameters. The proposed method learns from the clean training data gradually, and simultaneously prevents outliers from affecting the training process.
In summary, our main contributions in this paper are the following.

To effectively eliminate the impact of outliers, we introduce the Self-Paced Learning mechanism into Probabilistic PCA; to our knowledge, this is the earliest effort to use SPL for PPCA.

A novel approach named SPPPCA is constructed from a probabilistic perspective. We derive a corresponding optimization method based on the expectation-maximization (EM) algorithm and an alternating search strategy.

We conduct extensive experiments on both simulated and real data sets. The results show that the proposed method can obtain more accurate projection vectors from the contaminated data.
The remainder of this work is organized as follows. In Section 2, we briefly describe Probabilistic PCA. In Section 3, we propose Self-Paced Probabilistic PCA. In Section 4, we evaluate our method through experiments on synthetic and real data. In Section 5, we review related work. Finally, we summarize the paper in Section 6.
2 Probabilistic PCA
PCA can be defined as the orthogonal projection of the data onto a lower-dimensional linear space, called the principal subspace, such that the variance of the projected data is maximized [4, 10]. An equivalent definition is the linear projection that minimizes the mean squared distance between the data points and their projections [25]. Both definitions are algebraic and lack a probabilistic explanation for the observed data. PCA can also be expressed as the maximum likelihood solution of a probabilistic latent variable model [4]. This probabilistic version of PCA is known as Probabilistic PCA (PPCA) [30].
Probabilistic PCA introduces a $q$-dimensional vector of latent variables $\mathbf{z}$ corresponding to the principal-component subspace. The prior distribution of $\mathbf{z}$ is assumed to be:

$p(\mathbf{z}) = \mathcal{N}(\mathbf{z} \mid \mathbf{0}, \mathbf{I}).$

The $d$-dimensional observed data vector $\mathbf{x}$ is formulated as a linear combination of the latent variable plus noise $\boldsymbol{\epsilon}$:

$\mathbf{x} = \mathbf{W}\mathbf{z} + \boldsymbol{\mu} + \boldsymbol{\epsilon},$ (1)

where $\mathbf{W}$ is a $d \times q$ matrix which relates the observation vector $\mathbf{x}$ and the latent variable $\mathbf{z}$; the $d$-dimensional vector $\boldsymbol{\mu}$ allows the model to have nonzero mean; $\boldsymbol{\epsilon}$ is a $d$-dimensional Gaussian-distributed noise variable, i.e., $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$. Hence, equation (1) induces the conditional distribution of the observed data vector $\mathbf{x}$:

$p(\mathbf{x} \mid \mathbf{z}) = \mathcal{N}(\mathbf{x} \mid \mathbf{W}\mathbf{z} + \boldsymbol{\mu}, \sigma^2\mathbf{I}).$

By integrating out the latent variables, the marginal distribution of the observed variable is obtained:

$p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \mathbf{C}),$

where $\mathbf{C}$ is a $d \times d$ covariance matrix defined by $\mathbf{C} = \mathbf{W}\mathbf{W}^{\mathrm{T}} + \sigma^2\mathbf{I}$. To improve the efficiency of later calculations, we use the fact that

$\mathbf{C}^{-1} = \sigma^{-2}\mathbf{I} - \sigma^{-2}\mathbf{W}\mathbf{M}^{-1}\mathbf{W}^{\mathrm{T}},$

where the $q \times q$ matrix $\mathbf{M}$ is defined by

$\mathbf{M} = \mathbf{W}^{\mathrm{T}}\mathbf{W} + \sigma^2\mathbf{I}.$ (2)

So the cost of evaluating $\mathbf{C}^{-1}$ is reduced from $O(d^3)$ to $O(q^3)$ [4]. And the posterior distribution can be calculated using Bayes' rule, which is a Gaussian distribution:

$p(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}\big(\mathbf{z} \mid \mathbf{M}^{-1}\mathbf{W}^{\mathrm{T}}(\mathbf{x} - \boldsymbol{\mu}),\ \sigma^2\mathbf{M}^{-1}\big).$ (3)

Given observed data $\{\mathbf{x}_n\}_{n=1}^{N}$, the parameters $\{\mathbf{W}, \boldsymbol{\mu}, \sigma^2\}$ of PPCA are determined by maximizing the log-likelihood:

$\max_{\mathbf{W}, \boldsymbol{\mu}, \sigma^2} \sum_{n=1}^{N} \ln p(\mathbf{x}_n \mid \mathbf{W}, \boldsymbol{\mu}, \sigma^2).$ (4)
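As a rough illustration of the computational shortcut above, the following sketch checks that the inverse of the $d \times d$ covariance can be formed from the much smaller $q \times q$ matrix of equation (2), and evaluates the posterior mean of equation (3). The parameter values here are arbitrary, not fitted to any data.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q = 5, 2

# Illustrative PPCA parameters (not learned from data).
W = rng.standard_normal((d, q))
mu = np.zeros(d)
sigma2 = 0.5

# M = W^T W + sigma^2 I is only q x q, equation (2).
M = W.T @ W + sigma2 * np.eye(q)
M_inv = np.linalg.inv(M)

# C = W W^T + sigma^2 I is d x d; its inverse can be formed from M:
# C^{-1} = sigma^{-2} (I - W M^{-1} W^T).
C = W @ W.T + sigma2 * np.eye(d)
C_inv = (np.eye(d) - W @ M_inv @ W.T) / sigma2

# Posterior mean of the latent variable for one observation x, equation (3).
x = rng.standard_normal(d)
z_mean = M_inv @ W.T @ (x - mu)
```

The identity can be verified numerically against a direct $d \times d$ inverse; for large $d$ and small $q$ the $\mathbf{M}$-based route is far cheaper.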
3 SelfPaced Probabilistic PCA
In this section, we explain the proposed method and the corresponding optimization algorithm in detail.
3.1 Objective Function
We incorporate the Self-Paced Learning mechanism into PPCA, and propose Self-Paced Probabilistic Principal Component Analysis (SPPPCA). The objective function of SPPPCA is defined based on the above optimization problem (4) by introducing binary variables $v_n$ ($n = 1, \dots, N$) and adding a sparse regularizer of $\mathbf{v}$, which is given by:

$\min_{\mathbf{W}, \boldsymbol{\mu}, \sigma^2, \mathbf{v}} \; -\sum_{n=1}^{N} v_n \ln p(\mathbf{x}_n \mid \mathbf{W}, \boldsymbol{\mu}, \sigma^2) - \lambda \sum_{n=1}^{N} v_n,$ (5)

where $\mathbf{v} = (v_1, \dots, v_N)^{\mathrm{T}}$, $v_n \in \{0, 1\}$, and $\lambda > 0$ is a hyperparameter. The introduced binary variable $v_n$ indicates whether the sample $\mathbf{x}_n$ is an outlier (if $v_n = 0$) or a clean sample (if $v_n = 1$). Note that, like $\{\mathbf{W}, \boldsymbol{\mu}, \sigma^2\}$, $\mathbf{v}$ also needs to be estimated from data. In SPPPCA, the goal is to minimize equation (5). In Section 3.2, we describe how to learn the parameters $\{\mathbf{W}, \boldsymbol{\mu}, \sigma^2\}$ and $\mathbf{v}$ iteratively.
3.2 Optimization
Based on the Self-Paced Learning strategy, we solve the minimization problem (5) by beginning with simple data points, then introducing complex examples into training gradually, meanwhile filtering out outliers during the training process.
For each fixed $\lambda$, we use an alternating search strategy to obtain an approximate solution of the above problem efficiently. Specifically, given the parameters $\{\mathbf{W}, \boldsymbol{\mu}, \sigma^2\}$, we estimate $\mathbf{v}$; and for a fixed $\mathbf{v}$, we update the parameters $\{\mathbf{W}, \boldsymbol{\mu}, \sigma^2\}$. A brief description of the algorithm is presented in Algorithm 1. In the following, we describe each component in detail.
3.2.1 I. Optimization of $\mathbf{v}$.
Fixing the parameters $\{\mathbf{W}, \boldsymbol{\mu}, \sigma^2\}$, we estimate $\mathbf{v}$ by solving the following problem:

$\min_{\mathbf{v}} \; -\sum_{n=1}^{N} v_n \ln p(\mathbf{x}_n \mid \mathbf{W}, \boldsymbol{\mu}, \sigma^2) - \lambda \sum_{n=1}^{N} v_n,$

which is equivalent to

$\min_{\mathbf{v}} \; \sum_{n=1}^{N} v_n \big( -\ln p(\mathbf{x}_n \mid \mathbf{W}, \boldsymbol{\mu}, \sigma^2) - \lambda \big).$

Thus the solution for $v_n$ can be easily obtained by:

$v_n = \begin{cases} 1, & \text{if } -\ln p(\mathbf{x}_n \mid \mathbf{W}, \boldsymbol{\mu}, \sigma^2) < \lambda, \\ 0, & \text{otherwise.} \end{cases}$ (6)
Then, the following two important questions can be answered.

Why can $\mathbf{v}$ be viewed as the outlier indicator?

What role does the hyperparameter $\lambda$ play?
On the basis of equation (6), we find that $v_n$ can be estimated simply by using $\lambda$ as a threshold: when $-\ln p(\mathbf{x}_n \mid \mathbf{W}, \boldsymbol{\mu}, \sigma^2)$ is larger than $\lambda$, $v_n$ is set to 0. Since outliers are usually located far from the data center, they generally have lower likelihood according to equation (4). Thus, a training sample $\mathbf{x}_n$ is more likely to be an outlier if it has a larger negative log-likelihood, i.e., a lower likelihood. Therefore, it is reasonable to use $\mathbf{v}$ to indicate outliers in the training procedure.
It can be seen that $\lambda$ is a crucial hyperparameter. If $\lambda$ is small, the optimization problem prefers to consider only relatively "cleaner" samples with high likelihood; on the contrary, most of the samples (possibly including outliers) are introduced if $\lambda$ is very large. Thus, we use an adaptive strategy to tune $\lambda$, as shown in Algorithm 1: we increase the value of $\lambda$ by a constant factor iteratively until the objective function value converges. This strategy collects clean data for training and, at the same time, prevents outliers from skewing the results.
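The closed-form update (6) and the threshold role of the hyperparameter can be sketched in a few lines; the loss values below are made-up illustrations, not values from any experiment.

```python
import numpy as np

# Illustrative per-sample negative log-likelihoods L_n = -ln p(x_n) and a
# self-paced threshold lambda; both are invented for this sketch.
L = np.array([1.2, 0.8, 5.0, 0.9, 7.3])
lam = 2.0

# Equation (6): keep a sample (v_n = 1) iff its negative log-likelihood is
# below the threshold; otherwise flag it as an outlier (v_n = 0).
v = (L < lam).astype(int)
print(v)  # -> [1 1 0 1 0]
```

Raising `lam` admits more (possibly dirtier) samples, which is exactly the adaptive schedule described above.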
3.2.2 II. Optimization of $\{\mathbf{W}, \boldsymbol{\mu}, \sigma^2\}$.
Given a fixed $\mathbf{v}$, we estimate $\{\mathbf{W}, \boldsymbol{\mu}, \sigma^2\}$ by

$\min_{\mathbf{W}, \boldsymbol{\mu}, \sigma^2} \; -\sum_{n=1}^{N} v_n \ln p(\mathbf{x}_n \mid \mathbf{W}, \boldsymbol{\mu}, \sigma^2) - \lambda \sum_{n=1}^{N} v_n,$

which is equivalent to

$\max_{\mathbf{W}, \boldsymbol{\mu}, \sigma^2} \; \sum_{n=1}^{N} v_n \ln p(\mathbf{x}_n \mid \mathbf{W}, \boldsymbol{\mu}, \sigma^2).$

It can be seen that the above problem is similar to equation (4) in PPCA. By setting the derivative of the above equation with respect to $\boldsymbol{\mu}$ equal to zero, we get:

$\boldsymbol{\mu} = \frac{\sum_{n=1}^{N} v_n \mathbf{x}_n}{\sum_{n=1}^{N} v_n},$ (7)

and then substitute $\boldsymbol{\mu}$ with this weighted mean. Next, we use the EM algorithm to maximize the problem with respect to $\mathbf{W}$ and $\sigma^2$. The complete-data log-likelihood function of this problem is given by:

$\sum_{n=1}^{N} v_n \big\{ \ln p(\mathbf{x}_n \mid \mathbf{z}_n) + \ln p(\mathbf{z}_n) \big\},$

where $\mathbf{z}_n$ is the $n$-th row of the matrix $\mathbf{Z}$ of latent variables. Then we calculate the expectation of this function with respect to the posterior distribution of the latent variables.
From the above derivation, in the E step of the EM algorithm, we compute

$\mathbb{E}[\mathbf{z}_n] = \mathbf{M}^{-1}\mathbf{W}^{\mathrm{T}}(\mathbf{x}_n - \boldsymbol{\mu}), \qquad \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\mathrm{T}}] = \sigma^2 \mathbf{M}^{-1} + \mathbb{E}[\mathbf{z}_n]\,\mathbb{E}[\mathbf{z}_n]^{\mathrm{T}},$ (8)

where $\mathbf{M}$ is defined by equation (2). The above equations can be obtained easily using the posterior distribution (3) of the latent variables. Then, in the M step, by setting the derivatives with respect to $\mathbf{W}$ and $\sigma^2$ to zero respectively, we obtain the M-step re-estimation solutions:

$\mathbf{W}_{\mathrm{new}} = \Big[\sum_{n=1}^{N} v_n (\mathbf{x}_n - \boldsymbol{\mu})\,\mathbb{E}[\mathbf{z}_n]^{\mathrm{T}}\Big]\Big[\sum_{n=1}^{N} v_n\, \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\mathrm{T}}]\Big]^{-1},$ (9)

$\sigma^2_{\mathrm{new}} = \frac{1}{d\sum_{n=1}^{N} v_n} \sum_{n=1}^{N} v_n \Big\{ \|\mathbf{x}_n - \boldsymbol{\mu}\|^2 - 2\,\mathbb{E}[\mathbf{z}_n]^{\mathrm{T}}\mathbf{W}_{\mathrm{new}}^{\mathrm{T}}(\mathbf{x}_n - \boldsymbol{\mu}) + \mathrm{Tr}\big(\mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\mathrm{T}}]\,\mathbf{W}_{\mathrm{new}}^{\mathrm{T}}\mathbf{W}_{\mathrm{new}}\big) \Big\}.$ (10)
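A minimal sketch of one weighted E/M update, under the forms of equations (7)-(10) above; the function name, variable names, and the synthetic demo data are our own, and no convergence loop or likelihood monitoring is included.

```python
import numpy as np

def sppca_em_step(X, v, W, mu, sigma2):
    """One weighted EM update for W and sigma^2 (cf. equations (8)-(10));
    X is (n, d), v is the {0,1} outlier-indicator vector of length n."""
    n, d = X.shape
    q = W.shape[1]
    Xc = X - mu
    M = W.T @ W + sigma2 * np.eye(q)  # equation (2)
    M_inv = np.linalg.inv(M)

    # E step (8): posterior moments of the latent variables.
    Ez = Xc @ W @ M_inv  # rows are M^{-1} W^T (x_n - mu)
    # sum_n v_n E[z_n z_n^T] = (sum_n v_n) sigma^2 M^{-1} + sum_n v_n E[z_n] E[z_n]^T
    S = v.sum() * sigma2 * M_inv + (v[:, None] * Ez).T @ Ez

    # M step (9): update W.
    W_new = ((v[:, None] * Xc).T @ Ez) @ np.linalg.inv(S)

    # M step (10): update sigma^2.
    resid = float((v * (Xc ** 2).sum(axis=1)).sum())
    cross = float((v * np.einsum('nd,nd->n', Ez @ W_new.T, Xc)).sum())
    trace = float(np.trace(S @ W_new.T @ W_new))
    sigma2_new = (resid - 2.0 * cross + trace) / (d * v.sum())
    return W_new, sigma2_new

# Tiny demo on synthetic data with the first 10 samples flagged as outliers.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
v = np.ones(100)
v[:10] = 0.0
mu = (v[:, None] * X).sum(axis=0) / v.sum()  # weighted mean, equation (7)
W1, s1 = sppca_em_step(X, v, rng.standard_normal((8, 2)), mu, 1.0)
```

Because the sums are weighted by `v`, samples flagged as outliers contribute nothing to the parameter updates, which is how the self-paced weights shield the fit.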
3.2.3 III. Summary.
The overall algorithm is shown in Algorithm 1. At each iteration, we increase the value of $\lambda$ by a constant factor until the objective function (5) converges. For a fixed $\lambda$, we update the binary outlier indicator $\mathbf{v}$ and the original model parameters iteratively. Thus, the proposed method SPPPCA looks for optimal projection vectors and filters out outliers iteratively. Note that, in our later experiments, we found that changing $\lambda$ immediately after updating the parameters $\{\mathbf{W}, \boldsymbol{\mu}, \sigma^2\}$ and $\mathbf{v}$, in other words keeping only the outer loop, reduces the computational cost without significantly affecting performance.
4 Experiments
The goal of our work is to obtain principal components that are not heavily influenced by outliers. We follow the evaluation method in [21, 14]. Formally, we have a contaminated training dataset $\mathbf{X}_{\mathrm{train}}$ and a clean test dataset $\mathbf{X}_{\mathrm{test}}$. We perform dimensionality reduction on the contaminated data $\mathbf{X}_{\mathrm{train}}$ and obtain projection vectors. Then we calculate the reconstruction error on the test data by

$Err = \|\mathbf{X}_{\mathrm{test}} - \hat{\mathbf{X}}_{\mathrm{test}}\|_F,$ (11)

where $\hat{\mathbf{X}}_{\mathrm{test}}$ is the data recovered from $\mathbf{X}_{\mathrm{test}}$ using the obtained projection vectors, and $\|\cdot\|_F$ is the Frobenius norm. A small reconstruction error means that the projection vectors obtained by a dimensionality reduction method contain more favorable information about the true data and less negative information about the outliers.
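As an illustration of this evaluation protocol, the sketch below reconstructs test data from a set of projection vectors and measures the Frobenius-norm error. The random matrices stand in for real data and learned projections, and we assume the data are already mean-centered and the projection vectors orthonormal.

```python
import numpy as np

rng = np.random.default_rng(1)
X_test = rng.standard_normal((50, 10))  # stand-in (centered) test data

# Hypothetical orthonormal projection vectors U (10 x 3); in the paper
# these come from training on the contaminated data.
U, _ = np.linalg.qr(rng.standard_normal((10, 3)))

# Recover the test data from its low-dimensional projection and measure
# the discrepancy with the Frobenius norm.
X_hat = X_test @ U @ U.T
err = np.linalg.norm(X_test - X_hat, 'fro')
```

Since `X_hat` is an orthogonal projection of `X_test`, the error can never exceed the norm of the test data itself; better projection vectors drive it lower.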
The proposed algorithm SPPPCA is tested on both simulated and real data against classical PCA, PPCA [30], PCP^1 [32, 5], RAPCA [12], ROBPCA [11] and $\ell_1$-norm PCA [19] as baseline methods^2. The code of our method is publicly available at https://github.com/rumusan/SPPPCA.

^1 The PCP algorithm decomposes the observed data into a low-rank matrix and a sparse matrix. We treat the sparse matrix as the noise part, and perform standard PCA on the low-rank matrix. Note that PCP is not specifically designed for data with outliers.
^2 We implemented PCA using the module in the Sklearn library [26]. PPCA was implemented based on code at http://prml.github.io/. The code for PCP was obtained from https://github.com/wxiao0421/onlineRPCA. The code for RAPCA and ROBPCA was obtained from https://wis.kuleuven.be/stat/robust/LIBRA/LIBRAhome. The code for $\ell_1$-norm PCA is from https://ww2.mathworks.cn/matlabcentral/fileexchange/64855l1pcatoolbox.
We began our empirical evaluation by exploring performance in two synthetic experiments. In Section 4.1, we first tested on two-dimensional data so that we could visualize the projection vectors. In Section 4.2, to further demonstrate the robustness of SPPPCA, we ran experiments on high-dimensional artificial data. Then, in Sections 4.3 and 4.4, we compared the performance of SPPPCA with the other methods on two real datasets: the ISOLET dataset and the Yale Face dataset. In our experiments, the methods used their default parameters, and all results were averaged over 5 random trials.
4.1 Experiments on two-dimensional data
In this section, we compared our method with six different algorithms on two-dimensional data. In the original data set, one coordinate was sampled from a uniform distribution from 0 to 150, and the other was drawn with additive noise generated from a normal distribution. We then added some outliers to the original data to form the contaminated data, and reduced both the clean original data and the dirty data to one dimension with the different approaches. Figure 1 plots the projection vectors obtained by these algorithms on both the clean data and the dirty data. It can be seen from this figure that the performances of these methods are similar on the clean data. However, when it comes to the dirty data, PCA and PPCA are seriously affected by outliers: the outliers cause their projection vectors to deviate from the correct direction. The results of this experiment show that the proposed method is superior to traditional PCA and PPCA, and is comparable to PCP, RAPCA, ROBPCA and $\ell_1$-norm PCA.
4.2 Experiments on low-rank matrices
We built a low-rank matrix as the product of two factor matrices whose entries were drawn from a multivariate normal distribution, plus a noise matrix, each row of which was generated from a multivariate normal distribution. To keep the noise level low, the noise values were multiplied by 0.01. The dataset was divided into a training set (70%) and a test set (30%). The contaminated training set was then constructed by replacing some normal samples in the original training set with outliers, which came from another multivariate distribution. We constructed a set of data with different sizes and various percentages of outliers. We then ran SPPPCA and the other algorithms on the occluded training data, and calculated reconstruction errors on the test data.
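One possible construction of such a contaminated low-rank dataset is sketched below; all sizes, covariances, and the outlier distribution are our own assumptions standing in for the elided specifics.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 200, 50, 5  # illustrative sizes; the paper varies them

# Low-rank data: product of two Gaussian factor matrices, plus noise
# scaled by 0.01 as described in the text.
A = rng.standard_normal((n, r))
B = rng.standard_normal((d, r))
X = A @ B.T + 0.01 * rng.standard_normal((n, d))

# 70% / 30% train/test split.
idx = rng.permutation(n)
n_train = int(0.7 * n)
X_train, X_test = X[idx[:n_train]].copy(), X[idx[n_train:]].copy()

# Contaminate a fraction of the training rows with outliers from a wider
# Gaussian (a stand-in for the paper's elided outlier distribution).
n_out = int(0.1 * n_train)
X_train[:n_out] = rng.normal(0.0, 10.0, size=(n_out, d))
```

The test split stays clean, so reconstruction error on it directly reflects how well a method resisted the planted outliers.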
Figure 6 shows the average reconstruction errors obtained by varying the sizes of the low-rank matrices and the number of projection vectors. From this figure we can see that the different methods perform similarly on the clean data. Nevertheless, PCA, PPCA, PCP, RAPCA, ROBPCA and $\ell_1$-norm PCA are greatly impacted by the outlying data points in the dirty data. The proposed method can effectively withstand the influence of outliers and maintain low reconstruction errors.
4.3 Experiments on ISOLET dataset
Next, in Sections 4.3 and 4.4, we compared the performance of SPPPCA with the other methods^3 on real datasets.

^3 We did not use the $\ell_1$-norm PCA method for these experiments, because it is too time-consuming.
In the third experiment, we used the ISOLET^4 data set from the UCI repository [7]. ISOLET is a dataset for spoken letter recognition with 7797 instances and 617 attributes per sample. We randomly selected 900 samples as the training set and 600 samples as the test set. Then, we "contaminated" the clean training data with some outliers, whose attribute values were randomly sampled from the uniform distribution between the minimum and maximum values of the training data. The dimensionality reduction algorithms were then used to project the data to a low-dimensional space. Finally, the reconstruction errors were calculated by (11).

^4 http://archive.ics.uci.edu/ml/datasets/ISOLET
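The contamination scheme (uniform sampling between per-attribute minima and maxima) might be sketched as follows; the random matrix stands in for the actual ISOLET split, and the outlier count is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.standard_normal((900, 617))  # stand-in for the ISOLET training split

# Outlier attribute values are sampled uniformly between the per-attribute
# minimum and maximum of the training data, as described above.
lo, hi = X_train.min(axis=0), X_train.max(axis=0)
n_out = 45  # illustrative outlier count
rows = rng.choice(len(X_train), size=n_out, replace=False)
X_dirty = X_train.copy()
X_dirty[rows] = rng.uniform(lo, hi, size=(n_out, X_train.shape[1]))
```

Because the outliers stay inside the attribute ranges of the real data, they cannot be removed by simple range checks, which makes this a reasonably hard contamination.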
Figure 11 presents the average reconstruction errors with different numbers of principal components. The performance of most methods deteriorates rapidly as the number of outliers increases, similar to the results of our simulation experiments in Section 4.2. SPPPCA tends to remove outliers from the dirty dataset; thus, as shown in Figure 11, the proposed method obtains good results on the contaminated data as well as on the raw data.
4.4 Experiments on Yale Face dataset
We further used the well-known Yale Face dataset^5 [2] for our evaluation. The Yale Face dataset contains 165 grayscale images of 15 individuals, with 11 images per subject under different illumination and expression. Each facial image has 256 gray levels per pixel. All the images were scaled down in our experiments. The first row of Figure 12 shows some facial images from the Yale Face dataset. In the experiments, we selected 9 images of each subject as the training data, and the remaining 30 images made up the test set. We adopted the approaches in [14, 9] to add noise to a small part of the original training images as outlier samples. More specifically, we randomly selected images from the original training set, and the selected images were occluded with rectangular blocks of random black and white dots at random positions. The number of outlier samples in the training set is 15 or 30, and two noise-block sizes were used in our experiments. The corresponding noisy versions of the original facial images are shown in the second row of Figure 12. We executed the dimensionality reduction algorithms on the data with outliers, and calculated reconstruction errors on the clean test set.

^5 http://cvc.cs.yale.edu/cvc/projects/yalefaces/yalefaces.html
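The block-occlusion scheme described above can be sketched as follows; the image size, block size, and pixel range are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.uniform(0.0, 1.0, size=(64, 64))  # stand-in grayscale face image

def occlude(img, block, rng):
    """Place a block x block patch of random black/white dots at a random
    position, mimicking the outlier images described in the text."""
    h, w = img.shape
    top = rng.integers(0, h - block + 1)
    left = rng.integers(0, w - block + 1)
    out = img.copy()
    out[top:top + block, left:left + block] = rng.integers(0, 2, size=(block, block))
    return out

noisy = occlude(img, 20, rng)
```

Applying `occlude` to a handful of training images turns them into the salt-and-pepper outlier samples shown in the second row of Figure 12.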
Figure 13 compares the eigenfaces of the six different methods. As we can see, the eigenfaces of most methods are polluted by the outlying images in the training set. Compared with the other approaches, the eigenfaces of SPPPCA are less affected by the contaminated data. Figure 14 presents some images in the test set and the corresponding reconstructed images using 30 projection vectors. We can see that the images reconstructed by PCA, PPCA, PCP and RAPCA are impacted by outliers to some degree, while SPPPCA and ROBPCA perform better than the other methods. The success of SPPPCA is due to the Self-Paced Learning mechanism, which tends to filter out outliers in the training set; therefore, the projection vectors computed by SPPPCA are less influenced by the outlying images. Table 1 further provides the average reconstruction errors of each method. It shows that in most cases, the results of SPPPCA are better than those of the other methods.
num & size    #PV   PCA     PPCA    PCP     RAPCA   ROBPCA  SPPPCA
15 &          20    0.2108  0.2107  0.2106  0.2424  0.2170  0.2025
              30    0.2027  0.2026  0.2059  0.2221  0.2094  0.1963
              40    0.1862  0.1860  0.1901  0.1974  0.1886  0.1786
30 &          20    0.2140  0.2140  0.2133  0.2481  0.2232  0.2099
              30    0.2015  0.2015  0.1985  0.2176  0.2081  0.1931
              40    0.1976  0.1976  0.1922  0.2051  0.2015  0.1940
15 &          20    0.2262  0.2262  0.2208  0.2488  0.2272  0.2148
              30    0.2101  0.2101  0.2001  0.2209  0.1998  0.1875
              40    0.1922  0.1921  0.1918  0.2011  0.1856  0.1762
(#PV: number of projection vectors.)
5 Related work
Several robust PCA approaches have been proposed in the literature to handle outliers. Here we mainly introduce three kinds of methods: $\ell_1$-norm based PCA, Projection Pursuit based PCA, and the Principal Component Pursuit method.
PCA is based on the $\ell_2$ norm, which is not robust to outliers, so much of the research work resorts to the $\ell_1$ norm to counteract outliers [1, 15, 6, 24, 14]. Baccini et al. [1] introduced Gini's mean absolute differences into PCA to get heuristic estimates for the $\ell_1$-norm PCA model. Using alternative convex programming, a subspace estimation method which minimizes an $\ell_1$-norm based cost function was presented in [15]. In addition, Markopoulos et al. [19] proposed the L1-BF algorithm to efficiently calculate the principal components of $\ell_1$-norm PCA based on bit flipping. However, outliers still persist in the training process of $\ell_1$-norm PCA methods, so these approaches can only alleviate the impact of outliers to some extent. Moreover, current $\ell_1$-norm PCA methods have a high computational cost [21].
Projection Pursuit [17] based robust PCA attempts to search for the directions that maximize a robust measure of the variance of the projected observations. In order to relieve the high computational cost of [17], a faster algorithm, named RAPCA, was presented in [12]. Furthermore, Hubert et al. [11] combined projection pursuit ideas with a robust covariance estimator and proposed the method ROBPCA.
In the Principal Component Pursuit method (PCP) [5], the observed data are decomposed into a low-rank matrix and a sparse matrix, where the sparse matrix can be treated as noise. However, such a decomposition is not always realistic [23]. Moreover, PCP was proposed for uniformly corrupted coordinates of data, rather than outlying samples [35].
In addition, based on statistical physics and self-organizing rules, a robust PCA was proposed in [33]. The authors generalized an energy function by introducing binary variables. We also adopt binary variables in our algorithm; however, our method is derived under a probabilistic framework based on Self-Paced Learning and Probabilistic PCA. More importantly, we give an efficient strategy to solve the relevant optimization problem.
6 Conclusion
In this paper, with the aim of building a more robust dimensionality reduction algorithm, we propose a novel model called SPPPCA. To our knowledge, this is the first attempt to introduce the Self-Paced Learning mechanism into Probabilistic PCA. We develop a corresponding algorithm for optimizing the model parameters. Extensive experiments on simulated and real-world data demonstrate the effectiveness of our model compared with classical PCA and several variant algorithms.
SPPPCA can be easily extended to handle the missing value problem, like PPCA. In addition, SPPPCA is based on the Gaussian distribution; other robust distributions, such as heavy-tailed distributions [18], could be adopted in SPPPCA to further improve robustness in the future.
References
 [1] Baccini, A., Besse, P., Falguerolles, A.: A L1-norm PCA and a heuristic approach. Ordinal and symbolic data analysis 1(1), 359–368 (1996)
 [2] Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis & Machine Intelligence (7), 711–720 (1997)
 [3] Biffi, A., Anderson, C.D., Nalls, M.A., Rahman, R., Sonni, A., Cortellini, L., Rost, N.S., Matarin, M., Hernandez, D.G., Plourde, A., et al.: Principal-component analysis for assessment of population stratification in mitochondrial medical genetics. The American Journal of Human Genetics 86(6), 904–917 (2010)

 [4] Bishop, C.M.: Pattern recognition and machine learning. Springer (2006)
 [5] Candès, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? Journal of the ACM (JACM) 58(3), 11 (2011)
 [6] Dhanaraj, M., Markopoulos, P.P.: Novel algorithm for incremental L1-norm principal-component analysis. In: 2018 26th European Signal Processing Conference (EUSIPCO). pp. 2020–2024. IEEE (2018)
 [7] Dua, D., Graff, C.: UCI machine learning repository (2017), http://archive.ics.uci.edu/ml
 [8] Han, L., Zhang, D., Huang, D., Chang, X., Ren, J., Luo, S., Han, J.: Self-paced mixture of regressions. In: IJCAI. pp. 1816–1822 (2017)
 [9] He, R., Hu, B.G., Zheng, W.S., Kong, X.W.: Robust principal component analysis based on maximum correntropy criterion. IEEE Transactions on Image Processing 20(6), 1485–1494 (2011)
 [10] Hotelling, H.: Analysis of a complex of statistical variables into principal components. Journal of educational psychology 24(6), 417 (1933)
 [11] Hubert, M., Rousseeuw, P.J., Vanden Branden, K.: Robpca: a new approach to robust principal component analysis. Technometrics 47(1), 64–79 (2005)
 [12] Hubert, M., Rousseeuw, P.J., Verboven, S.: A fast method for robust principal components with applications to chemometrics. Chemometrics and Intelligent Laboratory Systems 60(12), 101–111 (2002)
 [13] Jolliffe, I.: Principal component analysis. Springer (2011)

 [14] Ju, F., Sun, Y., Gao, J., Hu, Y., Yin, B.: Image outlier detection and feature extraction via L1-norm-based 2D probabilistic PCA. IEEE Transactions on Image Processing 24(12), 4834–4846 (2015)
 [15] Ke, Q., Kanade, T.: Robust L1 norm factorization in the presence of outliers and missing data by alternative convex programming. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). vol. 1, pp. 739–746. IEEE (2005)
 [16] Kumar, M.P., Packer, B., Koller, D.: Self-paced learning for latent variable models. In: Advances in Neural Information Processing Systems. pp. 1189–1197 (2010)
 [17] Li, G., Chen, Z.: Projectionpursuit approach to robust dispersion matrices and principal components: primary theory and monte carlo. Journal of the American Statistical Association 80(391), 759–766 (1985)
 [18] Luttinen, J., Ilin, A., Karhunen, J.: Bayesian robust pca of incomplete data. Neural processing letters 36(2), 189–202 (2012)
 [19] Markopoulos, P.P., Kundu, S., Chamadia, S., Pados, D.A.: Efficient L1-norm principal-component analysis via bit flipping. IEEE Transactions on Signal Processing 65(16), 4252–4264 (2017)
 [20] Meng, D., Zhao, Q., Jiang, L.: What objective does self-paced learning indeed optimize? arXiv preprint arXiv:1511.06049 (2015)
 [21] Minnehan, B., Savakis, A.: Grassmann manifold optimization for fast L1-norm principal component analysis. IEEE Signal Processing Letters 26(2), 242–246 (2019)
 [22] Murphy, K.P.: Machine learning: a probabilistic perspective. MIT press (2012)
 [23] Neumayer, S., Nimmer, M., Setzer, S., Steidl, G.: On the rotational invariant L1-norm PCA. arXiv preprint arXiv:1902.03840 (2019)

 [24] Nie, F., Huang, H., Ding, C., Luo, D., Wang, H.: Robust principal component analysis with non-greedy L1-norm maximization. In: Twenty-Second International Joint Conference on Artificial Intelligence (2011)
 [25] Pearson, K.: Liii. on lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2(11), 559–572 (1901)
 [26] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikitlearn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)

 [27] Sangineto, E., Nabi, M., Culibrk, D., Sebe, N.: Self paced deep learning for weakly supervised object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 41(3), 712–725 (2019)
 [28] Serneels, S., Verdonck, T.: Principal component analysis for data containing outliers and missing elements. Computational Statistics & Data Analysis 52(3), 1712–1727 (2008)
 [29] Stanimirova, I., Daszykowski, M., Walczak, B.: Dealing with missing values and outliers in principal component analysis. Talanta 72(1), 172–178 (2007)
 [30] Tipping, M.E., Bishop, C.M.: Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61(3), 611–622 (1999)
 [31] De la Torre, F., Black, M.J.: Robust principal component analysis for computer vision. In: Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001. vol. 1, pp. 362–369. IEEE (2001)
 [32] Xiao, W., Huang, X., Silva, J., Emrani, S., Chaudhuri, A.: Online robust principal component analysis with change point detection. arXiv preprint arXiv:1702.05698 (2017)

 [33] Xu, L., Yuille, A.L.: Robust principal component analysis by self-organizing rules based on statistical physics approach. IEEE Transactions on Neural Networks 6(1), 131–143 (1995)
 [34] Zhang, D., Meng, D., Zhao, L., Han, J.: Bridging saliency detection to weakly supervised object detection based on self-paced curriculum learning. arXiv preprint arXiv:1703.01290 (2017)
 [35] Zhang, T., Lerman, G.: A novel mestimator for robust pca. The Journal of Machine Learning Research 15(1), 749–808 (2014)
 [36] Zhang, Y., Tang, Q., Niu, L., Dai, T., Xiao, X., Xia, S.T.: Self-paced mixture of t distribution model. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 2796–2800. IEEE (2018)
 [37] Zhao, Q., Meng, D., Jiang, L., Xie, Q., Xu, Z., Hauptmann, A.G.: Self-paced learning for matrix factorization. In: Twenty-Ninth AAAI Conference on Artificial Intelligence (2015)