Self-Paced Probabilistic Principal Component Analysis for Data with Outliers

04/13/2019 · Bowen Zhao et al., Tsinghua University

Principal Component Analysis (PCA) is a popular tool for dimensionality reduction and feature extraction in data analysis. There is a probabilistic version of PCA, known as Probabilistic PCA (PPCA). However, standard PCA and PPCA are not robust, as they are sensitive to outliers. To alleviate this problem, this paper introduces the Self-Paced Learning mechanism into PPCA and proposes a novel method called Self-Paced Probabilistic Principal Component Analysis (SP-PPCA). Furthermore, we design the corresponding optimization algorithm based on an alternating search strategy and the expectation-maximization algorithm. SP-PPCA looks for optimal projection vectors and filters out outliers iteratively. Experiments on both synthetic problems and real-world datasets clearly demonstrate that SP-PPCA is able to reduce or eliminate the impact of outliers.


1 Introduction

Principal Component Analysis (PCA) [25] is one of the key methods for dimensionality reduction and feature extraction in data analysis [4, 13], and it plays an important role in machine learning [22], computer vision [31], genetics [3], and other fields. As high-dimensional data is difficult to analyze effectively and imposes a heavy computational burden, PCA attempts to represent high-dimensional real data with a small number of principal components.

There exist two common definitions of PCA, both of which are algebraic and lack a probabilistic explanation of the observed data [30]. To overcome this shortcoming, Tipping and Bishop proposed a probabilistic model for PCA, called Probabilistic PCA (PPCA) [30]. PPCA has several advantages over traditional PCA; for example, the associated EM algorithm allows it to handle missing values in the observed data.

However, despite these merits, PCA and PPCA share an inherent drawback: they are not robust, i.e., they are strongly affected by outliers [29, 28, 35]. In the presence of outliers, the principal components obtained by classical PCA and PPCA deviate greatly from the true directions. Consequently, we cannot extract the main information of the data accurately, and subsequent analysis based on these principal components cannot obtain satisfactory results.

By imitating human or animal learning, Self-Paced Learning (SPL) usually begins with simple samples of a learning task, and then introduces complex examples into the training process step by step [20]. SPL has been applied to object detection [34, 27], matrix factorization [37], and mixture of regressions [8]. It has been shown that SPL has the ability to reduce the effect of outliers [36].

In order to improve the robustness of dimensionality reduction algorithms, we incorporate the Self-Paced Learning mechanism [16] into PPCA and propose a novel model called Self-Paced Probabilistic Principal Component Analysis (SP-PPCA). Based on PPCA, we design a new objective function by introducing additional parameters for self-paced learning. We use an alternating search strategy to learn the original parameters of PPCA and the additionally introduced parameters. The proposed method attempts to learn from the clean training data gradually, and simultaneously prevents outliers from affecting the training process.

In summary, our main contributions in this paper are the following.

  • To effectively eliminate the impact of outliers, we introduce the Self-Paced Learning mechanism into Probabilistic PCA; to our knowledge, this is the first effort to apply SPL to PPCA.

  • A novel approach named SP-PPCA is constructed from a probabilistic perspective. We derive a corresponding optimization method based on the expectation-maximization algorithm and an alternating search strategy.

  • We conduct extensive experiments on both simulated and real data sets. The results show that the proposed method can obtain more accurate projection vectors from the contaminated data.

The remainder of this work is organized as follows. In Section 2, we briefly describe Probabilistic PCA. In Section 3, we propose the Self-Paced Probabilistic PCA. In Section 4, we evaluate our method by experiments on synthetic data and real data. In Section 5, we introduce some related work. Finally, we summarize the paper in Section 6.

2 Probabilistic PCA

PCA can be defined as the orthogonal projection of the data onto a lower-dimensional linear space, called the principal subspace, such that the variance of the projected data is maximized [10, 4]. An equivalent definition is the linear projection that minimizes the mean squared distance between the data points and their projections [25].

The above two definitions are algebraic and lack probabilistic explanation for the observed data. PCA can also be expressed as the maximum likelihood solution of a probabilistic latent variable model [4]. The probabilistic version of PCA is known as Probabilistic PCA (PPCA) [30].

Probabilistic PCA introduces a $q$-dimensional latent variable $\mathbf{z}$ corresponding to the principal-component subspace. The prior distribution of $\mathbf{z}$ is assumed to be

$$p(\mathbf{z}) = \mathcal{N}(\mathbf{z} \mid \mathbf{0}, \mathbf{I}).$$

The $d$-dimensional observed data vector $\mathbf{x}$ is formulated as a linear combination of the latent variable $\mathbf{z}$ plus noise $\boldsymbol{\epsilon}$:

$$\mathbf{x} = \mathbf{W}\mathbf{z} + \boldsymbol{\mu} + \boldsymbol{\epsilon}, \qquad (1)$$

where $\mathbf{W}$ is a $d \times q$ matrix which relates the observation vector $\mathbf{x}$ and the latent variable $\mathbf{z}$; the $d$-dimensional vector $\boldsymbol{\mu}$ allows the model to have non-zero mean; and $\boldsymbol{\epsilon}$ is a $d$-dimensional Gaussian-distributed noise variable, i.e., $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$.

Hence, equation (1) induces the conditional distribution of the observed data vector $\mathbf{x}$:

$$p(\mathbf{x} \mid \mathbf{z}) = \mathcal{N}(\mathbf{x} \mid \mathbf{W}\mathbf{z} + \boldsymbol{\mu}, \sigma^2\mathbf{I}).$$

By integrating out the latent variables, the marginal distribution of the observed variable is obtained:

$$p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \mathbf{C}),$$

where $\mathbf{C}$ is a $d \times d$ covariance matrix defined by $\mathbf{C} = \mathbf{W}\mathbf{W}^{\mathsf{T}} + \sigma^2\mathbf{I}$. To improve the efficiency of later calculations, we use the fact that

$$\mathbf{C}^{-1} = \sigma^{-2}\left(\mathbf{I} - \mathbf{W}\mathbf{M}^{-1}\mathbf{W}^{\mathsf{T}}\right),$$

where the $q \times q$ matrix $\mathbf{M}$ is defined by

$$\mathbf{M} = \mathbf{W}^{\mathsf{T}}\mathbf{W} + \sigma^2\mathbf{I}. \qquad (2)$$

So the cost of evaluating $\mathbf{C}^{-1}$ is reduced from $O(d^3)$ to $O(q^3)$ [4]. And the posterior distribution can be calculated using Bayes' rule, which is a Gaussian distribution:

$$p(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}\!\left(\mathbf{z} \mid \mathbf{M}^{-1}\mathbf{W}^{\mathsf{T}}(\mathbf{x} - \boldsymbol{\mu}),\; \sigma^2\mathbf{M}^{-1}\right). \qquad (3)$$

Finally, based on a data set $\mathbf{X} = \{\mathbf{x}_n\}_{n=1}^{N}$, where $\mathbf{x}_n$ is the vector representation of the $n$-th observed data point, the log-likelihood function is given by

$$\mathcal{L}(\boldsymbol{\mu}, \mathbf{W}, \sigma^2) = \sum_{n=1}^{N} \ln p(\mathbf{x}_n \mid \boldsymbol{\mu}, \mathbf{W}, \sigma^2). \qquad (4)$$

The model parameters $\{\boldsymbol{\mu}, \mathbf{W}, \sigma^2\}$ can be estimated using maximum likelihood. They have exact closed-form solutions, and can also be found through an EM algorithm [30, 4].
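
To make the computation concrete, here is a minimal sketch (ours, not the authors' released code) of the per-sample PPCA log-likelihood evaluated through the $q \times q$ matrix $\mathbf{M}$ of equation (2) rather than the full $d \times d$ covariance $\mathbf{C}$:

```python
import numpy as np

def ppca_loglik(X, mu, W, sigma2):
    """Per-sample PPCA log-likelihood ln N(x_n | mu, C) with
    C = W W^T + sigma2 * I, evaluated through the q x q matrix M (eq. 2)."""
    d = X.shape[1]
    q = W.shape[1]
    M = W.T @ W + sigma2 * np.eye(q)                    # eq. (2)
    M_inv = np.linalg.inv(M)
    Xc = X - mu                                         # centered data, one row per sample
    XW = Xc @ W
    # Woodbury identity: C^{-1} = (I - W M^{-1} W^T) / sigma2
    mahal = (np.sum(Xc ** 2, axis=1)
             - np.einsum('ni,ij,nj->n', XW, M_inv, XW)) / sigma2
    # Determinant lemma: ln|C| = (d - q) ln sigma2 + ln|M|
    _, logdet_M = np.linalg.slogdet(M)
    logdet_C = (d - q) * np.log(sigma2) + logdet_M
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet_C + mahal)

# The log-likelihood (4) is the sum of these per-sample terms:
# loglik = ppca_loglik(X, mu, W, sigma2).sum()
```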

3 Self-Paced Probabilistic PCA

In this section, we explain the proposed method and the corresponding optimization algorithm in detail.

3.1 Objective Function

We incorporate the Self-Paced Learning mechanism into PPCA, and propose Self-Paced Probabilistic Principal Component Analysis (SP-PPCA). The objective function of SP-PPCA is defined based on the above optimization problem (4) by introducing binary variables $v_n \in \{0, 1\}$ ($n = 1, \dots, N$) and adding a sparse regularizer of $\mathbf{v} = (v_1, \dots, v_N)$, which is given by:

$$\min_{\boldsymbol{\mu}, \mathbf{W}, \sigma^2, \mathbf{v}} \; E(\boldsymbol{\mu}, \mathbf{W}, \sigma^2, \mathbf{v}) = -\sum_{n=1}^{N} v_n \ln p(\mathbf{x}_n \mid \boldsymbol{\mu}, \mathbf{W}, \sigma^2) - \lambda \sum_{n=1}^{N} v_n, \qquad (5)$$

where $v_n \in \{0, 1\}$, $n = 1, \dots, N$, and $\lambda > 0$ is a hyper-parameter. The introduced binary variable $v_n$ indicates whether the sample $\mathbf{x}_n$ is an outlier (if $v_n = 0$) or a clean sample (if $v_n = 1$). Note that, like $\{\boldsymbol{\mu}, \mathbf{W}, \sigma^2\}$, $\mathbf{v}$ also needs to be estimated from data. In SP-PPCA, the goal is to minimize equation (5). In Section 3.2, we describe how to learn the parameters $\{\boldsymbol{\mu}, \mathbf{W}, \sigma^2\}$ and $\mathbf{v}$ iteratively.
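
As a small illustration, the objective (5) can be evaluated directly from the per-sample log-likelihoods; the sketch below reuses the hypothetical `ppca_loglik` helper from Section 2 (our naming, not the authors' code):

```python
import numpy as np

def sp_ppca_objective(X, mu, W, sigma2, v, lam):
    """Objective (5): v-weighted negative log-likelihood minus the
    sparse self-paced regularizer lam * sum(v)."""
    nll = -ppca_loglik(X, mu, W, sigma2)   # per-sample negative log-likelihood
    return float(np.sum(v * nll) - lam * np.sum(v))
```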

3.2 Optimization

Based on the Self-Paced Learning strategy, we solve the minimization problem (5) by beginning with simple data points and introducing complex examples into training gradually, while filtering out outliers during the training process.

For each fixed $\lambda$, we use an alternating search strategy to obtain an approximate solution of the above problem efficiently. Specifically, given parameters $\{\boldsymbol{\mu}, \mathbf{W}, \sigma^2\}$, we estimate $\mathbf{v}$; and for a fixed $\mathbf{v}$, we update the parameters $\{\boldsymbol{\mu}, \mathbf{W}, \sigma^2\}$. A brief description of the algorithm is presented in Algorithm 1. In the following, we describe each component in detail.

3.2.1 I. Optimization of $\mathbf{v}$.

Fixing parameters $\{\boldsymbol{\mu}, \mathbf{W}, \sigma^2\}$, we estimate $\mathbf{v}$ by solving the following problem:

$$\min_{\mathbf{v}} \; -\sum_{n=1}^{N} v_n \ln p(\mathbf{x}_n \mid \boldsymbol{\mu}, \mathbf{W}, \sigma^2) - \lambda \sum_{n=1}^{N} v_n,$$

which is equivalent to

$$\min_{\mathbf{v}} \; \sum_{n=1}^{N} v_n \left( -\ln p(\mathbf{x}_n \mid \boldsymbol{\mu}, \mathbf{W}, \sigma^2) - \lambda \right).$$

Thus the solution of $v_n$ can be easily obtained by:

$$v_n = \begin{cases} 1, & \text{if } -\ln p(\mathbf{x}_n \mid \boldsymbol{\mu}, \mathbf{W}, \sigma^2) < \lambda, \\ 0, & \text{otherwise.} \end{cases} \qquad (6)$$

Then, the following two important questions can be answered.

  • Why can $v_n$ be viewed as an outlier indicator?

  • What role does the hyper-parameter $\lambda$ play?

On the basis of equation (6), we find that $\mathbf{v}$ can be estimated simply by using $\lambda$ as a threshold: when the negative log-likelihood $-\ln p(\mathbf{x}_n \mid \boldsymbol{\mu}, \mathbf{W}, \sigma^2)$ is larger than $\lambda$, $v_n$ is set to 0. Since outliers are usually located far from the data center, they generally have lower likelihood according to equation (4). Thus, a training sample with a larger negative log-likelihood, i.e., a lower likelihood, is more likely to be an outlier. Therefore, it is reasonable to use $v_n$ to indicate outliers in the training procedure.

It can be seen that $\lambda$ is a crucial hyper-parameter. If $\lambda$ is small, the optimization problem prefers to consider only relatively “cleaner” samples with high likelihood; on the contrary, most samples (possibly including outliers) are introduced if $\lambda$ is very large. Thus, we use an adaptive strategy to tune $\lambda$, as shown in Algorithm 1: we increase the value of $\lambda$ by a factor $\eta$ iteratively until the objective function value (5) converges. This strategy collects clean data for training and, at the same time, prevents outliers from skewing the results.
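
A minimal sketch of the closed-form update (6) together with the adaptive growth of $\lambda$ used in Algorithm 1 (again assuming the hypothetical `ppca_loglik` helper):

```python
import numpy as np

def update_v(X, mu, W, sigma2, lam):
    """Eq. (6): a sample is kept (v_n = 1) only when its negative
    log-likelihood is below the threshold lambda; otherwise v_n = 0."""
    nll = -ppca_loglik(X, mu, W, sigma2)
    return (nll < lam).astype(float)

# In the outer loop of Algorithm 1, lambda is then grown by a factor eta,
# gradually admitting harder (lower-likelihood) samples:
#     lam = eta * lam
```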

3.2.2 II. Optimization of $\{\boldsymbol{\mu}, \mathbf{W}, \sigma^2\}$.

Given fixed $\mathbf{v}$, we estimate $\{\boldsymbol{\mu}, \mathbf{W}, \sigma^2\}$ by

$$\min_{\boldsymbol{\mu}, \mathbf{W}, \sigma^2} \; -\sum_{n=1}^{N} v_n \ln p(\mathbf{x}_n \mid \boldsymbol{\mu}, \mathbf{W}, \sigma^2),$$

which is equivalent to

$$\max_{\boldsymbol{\mu}, \mathbf{W}, \sigma^2} \; \sum_{n=1}^{N} v_n \ln p(\mathbf{x}_n \mid \boldsymbol{\mu}, \mathbf{W}, \sigma^2).$$

It can be seen that the above problem is similar to equation (4) in PPCA. By setting the derivative of the above equation with respect to $\boldsymbol{\mu}$ equal to zero, we get:

$$\boldsymbol{\mu} = \frac{\sum_{n=1}^{N} v_n \mathbf{x}_n}{\sum_{n=1}^{N} v_n}, \qquad (7)$$

and then substitute this estimate of $\boldsymbol{\mu}$ into the remaining problem. Next, we use the EM algorithm to maximize the problem with respect to $\mathbf{W}$ and $\sigma^2$. The complete-data log-likelihood function of this problem is given by:

$$\sum_{n=1}^{N} v_n \left[ \ln p(\mathbf{x}_n \mid \mathbf{z}_n) + \ln p(\mathbf{z}_n) \right],$$

where $\mathbf{x}_n$ is the $n$-th row of the data matrix $\mathbf{X}$. Then we take the expectation of this quantity with respect to the posterior distribution of the latent variables. From this derivation, in the E step of the EM algorithm, we compute

$$\mathbb{E}[\mathbf{z}_n] = \mathbf{M}^{-1}\mathbf{W}^{\mathsf{T}}(\mathbf{x}_n - \boldsymbol{\mu}), \qquad
\mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\mathsf{T}}] = \sigma^2 \mathbf{M}^{-1} + \mathbb{E}[\mathbf{z}_n]\,\mathbb{E}[\mathbf{z}_n]^{\mathsf{T}}, \qquad (8)$$

where $\mathbf{M}$ is defined by equation (2). The above equations can be obtained easily using the posterior distribution (3) of the latent variables. Then, in the M step, by setting the derivatives with respect to $\mathbf{W}$ and $\sigma^2$ to zero respectively, we obtain the M-step re-estimation solutions:

$$\mathbf{W}_{\text{new}} = \left[ \sum_{n=1}^{N} v_n (\mathbf{x}_n - \boldsymbol{\mu}) \mathbb{E}[\mathbf{z}_n]^{\mathsf{T}} \right] \left[ \sum_{n=1}^{N} v_n \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\mathsf{T}}] \right]^{-1}, \qquad (9)$$

$$\sigma^2_{\text{new}} = \frac{1}{d \sum_{n=1}^{N} v_n} \sum_{n=1}^{N} v_n \left\{ \|\mathbf{x}_n - \boldsymbol{\mu}\|^2 - 2\,\mathbb{E}[\mathbf{z}_n]^{\mathsf{T}} \mathbf{W}_{\text{new}}^{\mathsf{T}} (\mathbf{x}_n - \boldsymbol{\mu}) + \mathrm{tr}\!\left( \mathbb{E}[\mathbf{z}_n \mathbf{z}_n^{\mathsf{T}}] \mathbf{W}_{\text{new}}^{\mathsf{T}} \mathbf{W}_{\text{new}} \right) \right\}. \qquad (10)$$

From equations (7), (9) and (10), only data points indicated as “clean” ($v_n = 1$) affect the values of the parameters $\{\boldsymbol{\mu}, \mathbf{W}, \sigma^2\}$. In other words, the projection vectors obtained by SP-PPCA are rarely influenced by outliers.
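
The following sketch implements one pass of the parameter updates (7)-(10) given fixed indicators $\mathbf{v}$; it follows the equations above under the same assumed naming as the earlier snippets and is not the authors' released implementation:

```python
import numpy as np

def update_parameters(X, v, W, sigma2):
    """One weighted EM iteration for {mu, W, sigma2} given the outlier
    indicators v (eqs. 7-10); only samples with v_n = 1 contribute."""
    d = X.shape[1]
    q = W.shape[1]
    n_clean = v.sum()
    mu = (v[:, None] * X).sum(axis=0) / n_clean            # eq. (7)
    Xc = X - mu
    # E step (eq. 8): posterior moments of the latent variables
    M = W.T @ W + sigma2 * np.eye(q)                        # eq. (2)
    M_inv = np.linalg.inv(M)
    Ez = Xc @ W @ M_inv                                     # E[z_n], one row per sample
    Ezz = sigma2 * M_inv[None, :, :] + Ez[:, :, None] * Ez[:, None, :]
    # M step (eqs. 9 and 10), with every sum weighted by v_n
    A = (v[:, None] * Xc).T @ Ez                            # sum_n v_n (x_n - mu) E[z_n]^T
    B = (v[:, None, None] * Ezz).sum(axis=0)                # sum_n v_n E[z_n z_n^T]
    W_new = A @ np.linalg.inv(B)                            # eq. (9)
    resid = np.sum(v * np.sum(Xc ** 2, axis=1))
    cross = np.sum(v * np.einsum('nq,dq,nd->n', Ez, W_new, Xc))
    trace = np.sum(v * np.trace(Ezz @ (W_new.T @ W_new), axis1=1, axis2=2))
    sigma2_new = (resid - 2.0 * cross + trace) / (d * n_clean)   # eq. (10)
    return mu, W_new, sigma2_new
```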

3.2.3 III. Summary.

The overall algorithm is shown in Algorithm 1. At each outer iteration, we increase the value of $\lambda$ by a factor $\eta$ until the objective function (5) converges. For a fixed $\lambda$, we update the binary outlier indicators $\mathbf{v}$ and the original model parameters $\{\boldsymbol{\mu}, \mathbf{W}, \sigma^2\}$ iteratively. Thus, the proposed method SP-PPCA looks for optimal projection vectors and filters out outliers iteratively. Note that we found that increasing $\lambda$ immediately after updating the parameters and $\mathbf{v}$, in other words, keeping only the outer loop, can reduce the computational cost without significantly affecting the performance in our later experiments.

1: Input: dataset $\mathbf{X} = \{\mathbf{x}_n\}_{n=1}^{N}$, learning rate $\eta$.
2: Output: $\boldsymbol{\mu}$, $\mathbf{W}$, $\sigma^2$.
3: Initialize $\boldsymbol{\mu}$, $\mathbf{W}$, $\sigma^2$ by the mean of the samples, a random matrix, and the scalar 1, respectively.
4: Initialize $\lambda$ to the median of $-\ln p(\mathbf{x}_n \mid \boldsymbol{\mu}, \mathbf{W}, \sigma^2)$, where the likelihood used for initialization is obtained by running only one iteration of the original PPCA algorithm.
5: repeat
6:     repeat
7:         Update $\mathbf{v}$ by equation (6).
8:         Update $\boldsymbol{\mu}$, $\mathbf{W}$, $\sigma^2$ by equations (7), (8), (9), (10).
9:     until convergence
10:    $\lambda \leftarrow \eta\,\lambda$.
11: until convergence
Algorithm 1 Self-Paced Probabilistic PCA
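
Putting the pieces together, here is a compact sketch of the nested loops of Algorithm 1; it assumes the hypothetical helpers `ppca_loglik`, `update_v`, `update_parameters`, and `sp_ppca_objective` defined in the earlier snippets, and the convergence criteria are simplified:

```python
import numpy as np

def sp_ppca(X, q, eta=1.1, max_outer=50, max_inner=20, tol=1e-4):
    """Self-Paced PPCA driver: alternate the v-update (6) and the weighted
    parameter updates (7)-(10), growing lambda by eta after each outer pass."""
    d = X.shape[1]
    mu, W, sigma2 = X.mean(axis=0), np.random.randn(d, q), 1.0
    lam = np.median(-ppca_loglik(X, mu, W, sigma2))       # initial threshold
    prev_obj = np.inf
    for _ in range(max_outer):
        for _ in range(max_inner):                        # inner convergence check omitted
            v = update_v(X, mu, W, sigma2, lam)
            mu, W, sigma2 = update_parameters(X, v, W, sigma2)
        obj = sp_ppca_objective(X, mu, W, sigma2, v, lam)
        if abs(prev_obj - obj) < tol:                     # objective (5) has converged
            break
        prev_obj, lam = obj, eta * lam
    return mu, W, sigma2, v
```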

Once we obtain the converged results $\{\boldsymbol{\mu}, \mathbf{W}, \sigma^2\}$, a $d$-dimensional point $\mathbf{x}$ in the data space can be represented by the corresponding posterior mean and covariance in the latent space [4] according to equation (3). The mean is obtained by:

$$\mathbb{E}[\mathbf{z} \mid \mathbf{x}] = \mathbf{M}^{-1}\mathbf{W}^{\mathsf{T}}(\mathbf{x} - \boldsymbol{\mu}),$$

where $\mathbf{M} = \mathbf{W}^{\mathsf{T}}\mathbf{W} + \sigma^2\mathbf{I}$. We can also reconstruct the original data point by:

$$\hat{\mathbf{x}} = \mathbf{W}\,\mathbb{E}[\mathbf{z} \mid \mathbf{x}] + \boldsymbol{\mu}.$$
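
A short sketch of this projection and reconstruction step, with the same assumed naming as above:

```python
import numpy as np

def project_and_reconstruct(X, mu, W, sigma2):
    """Posterior-mean projection E[z|x] = M^{-1} W^T (x - mu), eq. (3),
    and the corresponding reconstruction x_hat = W E[z|x] + mu."""
    q = W.shape[1]
    M = W.T @ W + sigma2 * np.eye(q)
    Z = (X - mu) @ W @ np.linalg.inv(M)   # latent representations, one row per sample
    X_hat = Z @ W.T + mu                  # reconstructions in the data space
    return Z, X_hat
```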

4 Experiments

The goal of our work is to obtain correct principal components that are not influenced much by outliers. We follow the evaluation method in [21, 14]. Formally, we have a contaminated training dataset $\mathbf{X}_{tr}$ and a clean test dataset $\mathbf{X}_{te}$. We perform dimensionality reduction on the contaminated data $\mathbf{X}_{tr}$ and obtain projection vectors. Then we calculate the reconstruction error on the test data by

$$\text{Err} = \frac{\|\mathbf{X}_{te} - \hat{\mathbf{X}}_{te}\|_F}{\|\mathbf{X}_{te}\|_F}, \qquad (11)$$

where $\hat{\mathbf{X}}_{te}$ stacks the recovered data $\hat{\mathbf{x}}_i$ for each test point $\mathbf{x}_i$ based on the obtained projection vectors, and $\|\cdot\|_F$ is the Frobenius norm. A small reconstruction error means that the projection vectors obtained by a dimensionality reduction method contain more favorable information about the true data and less negative information about the outliers.
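
A concrete sketch of this evaluation protocol; the relative-error normalization mirrors our reading of (11), and `project_and_reconstruct` is the hypothetical helper sketched in Section 3:

```python
import numpy as np

def reconstruction_error(X_test, mu, W, sigma2):
    """Project the clean test data onto the learned subspace, reconstruct it,
    and report the Frobenius-norm reconstruction error of eq. (11)."""
    _, X_hat = project_and_reconstruct(X_test, mu, W, sigma2)
    return np.linalg.norm(X_test - X_hat, 'fro') / np.linalg.norm(X_test, 'fro')
```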

The proposed algorithm SP-PPCA is tested on both simulated data and real data against classical PCA, PPCA [30], PCP¹ [32, 5], RAPCA [12], ROBPCA [11] and L1-norm PCA [19] as baseline methods². The code of our method is publicly available at https://github.com/rumusan/SP-PPCA.

¹ The PCP algorithm decomposes the observed data into a low-rank matrix and a sparse matrix. We treat the sparse matrix as the noise part, and perform standard PCA on the low-rank matrix. Note that PCP is not specifically designed for data with outliers.
² We implemented PCA using the module in the Sklearn library [26]. PPCA was implemented based on code at http://prml.github.io/. The code for PCP was obtained from https://github.com/wxiao0421/onlineRPCA. The code for RAPCA and ROBPCA was obtained from https://wis.kuleuven.be/stat/robust/LIBRA/LIBRA-home. The code for L1-norm PCA is from https://ww2.mathworks.cn/matlabcentral/fileexchange/64855-l1-pca-toolbox.

We began our empirical evaluations by exploring the performance in two synthetic experiments. In Section 4.1, we first tested on two-dimensional data so that we could visualize the projection vectors. In Section 4.2, to further demonstrate the robustness of SP-PPCA, we analyzed experiments on high-dimensional artificial data. Then, in Sections 4.3 and 4.4, we compared the performance of SP-PPCA with other methods on two real datasets: the ISOLET dataset and the Yale Face dataset. In our experiments, all methods used their default parameters and all results were averaged over 5 random trials.

4.1 Experiments on two-dimensional data

In this section, we compared our method with six different algorithms on two-dimensional data. In the original data set, the first coordinates were sampled from a uniform distribution between 0 and 150, and the second coordinates were generated from the first coordinates with an additive noise term drawn from a normal distribution. Then we added some outliers to the original data to obtain the contaminated data. We reduced both the clean original data and the dirty data to one dimension with the different approaches.

Figure 1: The projection vectors obtained by different algorithms from the clean data (left) and the dirty data (right). The plotted data is centered.

Figure 1 plots the projection vectors obtained by these algorithms on both the clean data and the dirty data. It can be seen from this figure that the methods perform similarly on the clean data. However, when it comes to the dirty data, PCA and PPCA are seriously affected by outliers: the outliers cause the projection vectors of PCA and PPCA to deviate from the correct direction. The results of this experiment show that the proposed method is superior to traditional PCA and PPCA, and is comparable to PCP, RAPCA, ROBPCA and L1-norm PCA.

4.2 Experiments on low-rank matrices

We built a low-rank data matrix as the product of two factor matrices plus noise, where the rows of the factor matrices were taken as cases from a multivariate normal distribution, and each row of the noise matrix was generated from a multivariate normal distribution; to keep the level of noise low, the noise values were multiplied by 0.01. The dataset was divided into a training set (70%) and a test set (30%). The contaminated training set was then constructed by replacing some normal samples in the original training set with outliers drawn from a multivariate distribution. We constructed a set of such datasets with different sizes and various percentages of outliers. We then ran SP-PPCA and the other algorithms on the contaminated training data, and calculated reconstruction errors on the test data.
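
For illustration only, a minimal data-generation sketch in the spirit of this setup; the dimensions, rank, covariances, outlier ratio, and outlier distribution below are hypothetical placeholders rather than the values used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and rank; the paper varies these settings.
n, m, r = 500, 50, 5
A = rng.multivariate_normal(np.zeros(r), np.eye(r), size=n)          # n x r factor
B = rng.multivariate_normal(np.zeros(r), np.eye(r), size=m)          # m x r factor
E = 0.01 * rng.multivariate_normal(np.zeros(m), np.eye(m), size=n)   # small noise
X = A @ B.T + E                                                      # low-rank data matrix

# 70% / 30% train-test split, then replace a fraction of the training
# samples with outliers drawn from a different (hypothetical) distribution.
n_train = int(0.7 * n)
X_train, X_test = X[:n_train].copy(), X[n_train:]
n_out = int(0.1 * n_train)
idx = rng.choice(n_train, size=n_out, replace=False)
X_train[idx] = rng.multivariate_normal(5.0 * np.ones(m), 4.0 * np.eye(m), size=n_out)
```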

Figure 6 shows the average reconstruction errors obtained by varying the sizes of the low-rank matrices and the number of projection vectors. From this figure we can see that the different methods perform similarly on the clean data. Nevertheless, PCA, PPCA, PCP, RAPCA, ROBPCA and L1-norm PCA are greatly impacted by the outlying data points in the dirty data. The proposed method can effectively withstand the influence of outliers and maintain low reconstruction errors.

Figure 6: Average reconstruction errors for simulation problems with varying sizes of the low-rank matrices and different numbers of projection vectors. (a) , 4 projection vectors; (b) , 2 projection vectors; (c) , 3 projection vectors; (d) , 5 projection vectors.

4.3 Experiments on ISOLET dataset

Next, in Sections 4.3 and 4.4, we compared the performance of SP-PPCA with other methods on real datasets. (We did not use the L1-norm PCA method in these experiments because it is too time-consuming.)

In our third experiment, we conducted experiments on the ISOLET data set (http://archive.ics.uci.edu/ml/datasets/ISOLET) from the UCI repository [7]. ISOLET is a dataset for spoken letter recognition with 7797 instances and 617 attributes per sample. We randomly selected 900 samples as the training set and 600 samples as the test set. Then, we “contaminated” our clean training data with some outliers: the attribute values of the outliers were randomly sampled from the uniform distribution between the minimum and maximum values of the training data. Dimensionality reduction algorithms were then used to project the data to a low-dimensional space. Finally, the reconstruction errors were calculated by (11).
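
A brief sketch of this contamination scheme; the number of injected outliers is a hypothetical placeholder:

```python
import numpy as np

def inject_uniform_outliers(X_train, n_outliers=100, seed=0):
    """Replace randomly chosen training rows with outliers whose attribute
    values are sampled uniformly between the per-attribute min and max."""
    rng = np.random.default_rng(seed)
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    X_dirty = X_train.copy()
    idx = rng.choice(len(X_train), size=n_outliers, replace=False)
    X_dirty[idx] = rng.uniform(lo, hi, size=(n_outliers, X_train.shape[1]))
    return X_dirty
```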

Figure 11 presents the average reconstruction errors with different numbers of principal components. The performance of most methods deteriorates rapidly as the number of outliers increases, which is similar to the results of our simulation experiments in Section 4.2. SP-PPCA tends to remove outliers from the dirty dataset; thus, as shown in Figure 11, the proposed method obtains good results on the contaminated data as well as on the raw data.

Figure 11: Average reconstruction errors for ISOLET with different numbers of principal components. (a) 40 principal components; (b) 30 principal components; (c) 20 principal components; (d) 10 principal components.

4.4 Experiments on Yale Face dataset

We further used the well-known Yale Face dataset (http://cvc.cs.yale.edu/cvc/projects/yalefaces/yalefaces.html) [2] for our evaluation. The Yale Face dataset contains 165 grayscale images of 15 individuals, with 11 images per subject under different illuminations and expressions. Each facial image has 256 gray levels per pixel, and all images were scaled to the same size in our experiments. The first row of Figure 12 shows some facial images from the Yale Face dataset. In the experiments, we selected 9 images of each subject as the training data, and the remaining 30 images made up the test set. We adopted the approaches in [14, 9] to produce noise on a small part of the original training images as outlier samples. More specifically, we randomly selected images from the original training set, and the selected images were occluded with rectangular blocks at random positions consisting of random black and white dots. The number of outlier samples in the training set is 15 or 30, and two different sizes of noise blocks were used in our experiments. The noisy versions of the original facial images are shown in the second row of Figure 12. We executed the dimensionality reduction algorithms on the data with outliers, and calculated reconstruction errors on the clean test set.

Figure 13 compares the eigenfaces of the six different methods. As we can see, the eigenfaces of most methods are polluted by the outlying images in the training set. Compared with the other approaches, the eigenfaces of SP-PPCA are less affected by the contaminated data. Figure 14 presents some images from the test set and the corresponding reconstructed images using 30 projection vectors. We can see that the images reconstructed by PCA, PPCA, PCP and RAPCA are impacted by outliers to some degree, while SP-PPCA and ROBPCA perform better than the other methods. The success of SP-PPCA is due to the Self-Paced Learning mechanism, which tends to filter out outliers in the training set; therefore, the projection vectors computed by SP-PPCA are less influenced by the outlying images. Table 1 further provides the average reconstruction errors of each method. In most cases, the results of SP-PPCA are better than those of the other methods.

Figure 12: Facial images in Yale Face dataset (the first row) and their corresponding noisy images (the second row).
Figure 13: Some eigenfaces obtained by different methods. The first column shows eigenfaces of PCA, the second column PPCA, the third column PCP, the fourth column RAPCA, the fifth column ROBPCA, and the last column SP-PPCA (our method).
Figure 14: Reconstructed images by six different methods. The first column shows images in the test set, the second column are reconstructed images by PCA, the third column by PPCA, the fourth column by PCP, the fifth column by RAPCA, the sixth column by ROBPCA and the last column by SP-PPCA (our method).
num & size   dim.   PCA      PPCA     PCP      RAPCA    ROBPCA   SP-PPCA
15 &         20     0.2108   0.2107   0.2106   0.2424   0.2170   0.2025*
             30     0.2027   0.2026   0.2059   0.2221   0.2094   0.1963*
             40     0.1862   0.1860   0.1901   0.1974   0.1886   0.1786*
30 &         20     0.2140   0.2140   0.2133   0.2481   0.2232   0.2099*
             30     0.2015   0.2015   0.1985   0.2176   0.2081   0.1931*
             40     0.1976   0.1976   0.1922*  0.2051   0.2015   0.1940
15 &         20     0.2262   0.2262   0.2208   0.2488   0.2272   0.2148*
             30     0.2101   0.2101   0.2001   0.2209   0.1998   0.1875*
             40     0.1922   0.1921   0.1918   0.2011   0.1856   0.1762*
Table 1: Average reconstruction errors of different methods. The “dim.” column is the dimension of the reduced subspace; “num & size” gives the number of outliers in the training dataset and the size of the noise blocks. The best result in each row is marked with an asterisk.

5 Related work

Several robust PCA approaches have been proposed in the literature to handle outliers. Here we mainly introduce three kinds of methods: L1-norm based PCA, Projection Pursuit based PCA, and the Principal Component Pursuit method.

PCA is based on the L2-norm, which is not robust to outliers, so much research work resorts to the L1-norm to counteract outliers [1, 15, 6, 24, 14]. Baccini et al. [1] introduced Gini's mean absolute differences into PCA to obtain heuristic estimates for the L1-norm PCA model. Using alternative convex programming, a subspace estimation method which minimizes an L1-norm based cost function was presented in [15]. In addition, Markopoulos et al. [19] proposed the L1-BF algorithm, based on bit-flipping, to efficiently calculate the principal components of L1-norm PCA. However, outliers still remain in the training process of L1-norm PCA methods, so these approaches can only alleviate the impact of outliers to some extent. Moreover, current L1-norm PCA methods have a high computational cost [21].

Projection Pursuit [17] based robust PCA attempts to search for the directions that maximize a robust measure of the variance of the projected observations. To relieve the high computational cost of [17], a faster algorithm, named RAPCA, was presented in [12]. Furthermore, Hubert et al. [11] combined projection pursuit ideas with a robust covariance estimator and proposed the ROBPCA method.

In the Principal Component Pursuit (PCP) method [5], the observed data is decomposed into a low-rank matrix and a sparse matrix, where the sparse matrix can be treated as noise. However, such a decomposition is not always realistic [23]. Moreover, PCP is proposed for uniformly corrupted coordinates of the data, rather than outlying samples [35].

In addition, based on statistical physics and self-organizing rules, a robust PCA was proposed in [33]. The authors generalized an energy function by introducing binary variables. We also adopt binary variables in our algorithm; however, our method is derived under a probabilistic framework based on Self-Paced Learning and Probabilistic PCA. More importantly, we give an efficient strategy to solve the relevant optimization problem.

6 Conclusion

In this paper, with the aim of building a more robust dimensionality reduction algorithm, we propose a novel model called SP-PPCA. To our knowledge, this is the first attempt to introduce the Self-Paced Learning mechanism into Probabilistic PCA. We develop the corresponding algorithm for optimizing the model parameters. Extensive experiments on simulated and real-world data demonstrate the effectiveness of our model compared with classical PCA and several robust variants.

SP-PPCA can be easily extended to handle the missing value problem, like PPCA. In addition, SP-PPCA is based on the Gaussian distribution; other robust distributions, such as heavy-tailed distributions [18], could be adopted in SP-PPCA to further improve its robustness in future work.

References

  • [1] Baccini, A., Besse, P., Falguerolles, A.: A L1-norm PCA and a heuristic approach. Ordinal and symbolic data analysis 1(1), 359–368 (1996)
  • [2] Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis & Machine Intelligence (7), 711–720 (1997)
  • [3] Biffi, A., Anderson, C.D., Nalls, M.A., Rahman, R., Sonni, A., Cortellini, L., Rost, N.S., Matarin, M., Hernandez, D.G., Plourde, A., et al.: Principal-component analysis for assessment of population stratification in mitochondrial medical genetics. The American Journal of Human Genetics 86(6), 904–917 (2010)
  • [4] Bishop, C.M.: Pattern recognition and machine learning. Springer (2006)

  • [5] Candès, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? Journal of the ACM (JACM) 58(3),  11 (2011)
  • [6] Dhanaraj, M., Markopoulos, P.P.: Novel algorithm for incremental L1-norm principal-component analysis. In: 2018 26th European Signal Processing Conference (EUSIPCO). pp. 2020–2024. IEEE (2018)
  • [7] Dua, D., Graff, C.: UCI machine learning repository (2017), http://archive.ics.uci.edu/ml
  • [8] Han, L., Zhang, D., Huang, D., Chang, X., Ren, J., Luo, S., Han, J.: Self-paced mixture of regressions. In: IJCAI. pp. 1816–1822 (2017)
  • [9] He, R., Hu, B.G., Zheng, W.S., Kong, X.W.: Robust principal component analysis based on maximum correntropy criterion. IEEE Transactions on Image Processing 20(6), 1485–1494 (2011)
  • [10] Hotelling, H.: Analysis of a complex of statistical variables into principal components. Journal of educational psychology 24(6),  417 (1933)
  • [11] Hubert, M., Rousseeuw, P.J., Vanden Branden, K.: Robpca: a new approach to robust principal component analysis. Technometrics 47(1), 64–79 (2005)
  • [12] Hubert, M., Rousseeuw, P.J., Verboven, S.: A fast method for robust principal components with applications to chemometrics. Chemometrics and Intelligent Laboratory Systems 60(1-2), 101–111 (2002)
  • [13] Jolliffe, I.: Principal component analysis. Springer (2011)
  • [14] Ju, F., Sun, Y., Gao, J., Hu, Y., Yin, B.: Image outlier detection and feature extraction via L1-norm-based 2D probabilistic PCA. IEEE Transactions on Image Processing 24(12), 4834–4846 (2015)
  • [15] Ke, Q., Kanade, T.: Robust L1 norm factorization in the presence of outliers and missing data by alternative convex programming. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). vol. 1, pp. 739–746. IEEE (2005)

  • [16] Kumar, M.P., Packer, B., Koller, D.: Self-paced learning for latent variable models. In: Advances in Neural Information Processing Systems. pp. 1189–1197 (2010)
  • [17] Li, G., Chen, Z.: Projection-pursuit approach to robust dispersion matrices and principal components: primary theory and monte carlo. Journal of the American Statistical Association 80(391), 759–766 (1985)
  • [18] Luttinen, J., Ilin, A., Karhunen, J.: Bayesian robust pca of incomplete data. Neural processing letters 36(2), 189–202 (2012)
  • [19] Markopoulos, P.P., Kundu, S., Chamadia, S., Pados, D.A.: Efficient L1-norm principal-component analysis via bit flipping. IEEE Transactions on Signal Processing 65(16), 4252–4264 (2017)
  • [20] Meng, D., Zhao, Q., Jiang, L.: What objective does self-paced learning indeed optimize? arXiv preprint arXiv:1511.06049 (2015)
  • [21] Minnehan, B., Savakis, A.: Grassmann manifold optimization for fast L1-norm principal component analysis. IEEE Signal Processing Letters 26(2), 242–246 (2019)
  • [22] Murphy, K.P.: Machine learning: a probabilistic perspective. MIT press (2012)
  • [23] Neumayer, S., Nimmer, M., Setzer, S., Steidl, G.: On the rotational invariant L1-norm PCA. arXiv preprint arXiv:1902.03840 (2019)
  • [24] Nie, F., Huang, H., Ding, C., Luo, D., Wang, H.: Robust principal component analysis with non-greedy L1-norm maximization. In: Twenty-Second International Joint Conference on Artificial Intelligence (2011)

  • [25] Pearson, K.: Liii. on lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2(11), 559–572 (1901)
  • [26] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
  • [27] Sangineto, E., Nabi, M., Culibrk, D., Sebe, N.: Self paced deep learning for weakly supervised object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 41(3), 712–725 (2019)
  • [28] Serneels, S., Verdonck, T.: Principal component analysis for data containing outliers and missing elements. Computational Statistics & Data Analysis 52(3), 1712–1727 (2008)
  • [29] Stanimirova, I., Daszykowski, M., Walczak, B.: Dealing with missing values and outliers in principal component analysis. Talanta 72(1), 172–178 (2007)
  • [30] Tipping, M.E., Bishop, C.M.: Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61(3), 611–622 (1999)
  • [31] De la Torre, F., Black, M.J.: Robust principal component analysis for computer vision. In: Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001. vol. 1, pp. 362–369. IEEE (2001)
  • [32] Xiao, W., Huang, X., Silva, J., Emrani, S., Chaudhuri, A.: Online robust principal component analysis with change point detection. arXiv preprint arXiv:1702.05698 (2017)
  • [33] Xu, L., Yuille, A.L.: Robust principal component analysis by self-organizing rules based on statistical physics approach. IEEE Transactions on Neural Networks 6(1), 131–143 (1995)
  • [34] Zhang, D., Meng, D., Zhao, L., Han, J.: Bridging saliency detection to weakly supervised object detection based on self-paced curriculum learning. arXiv preprint arXiv:1703.01290 (2017)
  • [35] Zhang, T., Lerman, G.: A novel m-estimator for robust pca. The Journal of Machine Learning Research 15(1), 749–808 (2014)
  • [36] Zhang, Y., Tang, Q., Niu, L., Dai, T., Xiao, X., Xia, S.T.: Self-paced mixture of t distribution model. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 2796–2800. IEEE (2018)
  • [37] Zhao, Q., Meng, D., Jiang, L., Xie, Q., Xu, Z., Hauptmann, A.G.: Self-paced learning for matrix factorization. In: Twenty-Ninth AAAI Conference on Artificial Intelligence (2015)