
Cauchy robust principal component analysis with applications to high-dimensional data sets

Principal component analysis (PCA) is a standard dimensionality reduction technique used in various research and applied fields. From an algorithmic point of view, classical PCA can be formulated in terms of operations on a multivariate Gaussian likelihood. As a consequence of the implied Gaussian formulation, the principal components are not robust to outliers. In this paper, we propose a modified formulation, based on the use of a multivariate Cauchy likelihood instead of the Gaussian likelihood, which has the effect of robustifying the principal components. We present an algorithm to compute these robustified principal components. We additionally derive the relevant influence function of the first component and examine its theoretical properties. Simulation experiments on high-dimensional datasets demonstrate that the estimated principal components based on the Cauchy likelihood outperform or are on par with existing robust PCA techniques.


1 Introduction

In the analysis of multivariate data, it is frequently desirable to employ statistical methods which are insensitive to the presence of outliers in the sample. To address the problem of outliers, it is important to develop robust statistical procedures. Most statistical procedures include explicit or implicit prior assumptions about the distribution of the observations, but often without taking into account the effect of outliers. The purpose of this paper is to present a novel robust version of PCA which has some attractive features.

Principal components analysis (PCA) is considered to be one of the most important techniques in statistics. However, the classical version of PCA depends on either a covariance or a correlation matrix, both of which are very sensitive to outliers. We develop an alternative to classical PCA, which is far more robust, by using a multivariate Cauchy likelihood to construct the principal components (PCs). It is an adaptation of the classical method of PCA obtained by replacing the Gaussian log-likelihood function by the Cauchy log-likelihood function, in a sense that will be explained in Section 2.2. Although the interpretation of standard PCA in terms of operations on a Gaussian likelihood is not new (see Bolton and Krzanowski, 1999), this fact does not appear to have been exploited in the development of a robust PCA procedure, as we do in this paper. One reason for using the multivariate Cauchy likelihood is that this likelihood has only one maximum point, but the single most important motivation is that it leads to a robust procedure.

In the next section we briefly review some of the techniques employed for estimating parameters, and for performing PCA, in ways which are robust to the presence of outliers. We also present robustness preliminaries, including the main tools needed to assess whether a given method is robust or not. In Section 3 we develop the Cauchy PCA, present the algorithm for computing the Cauchy PCs, and theoretically explore its robustness properties. In Section 4 we give the results of a number of very high-dimensional real-data and simulated examples; our approach is seen to be competitive with, and often superior to, the projection pursuit algorithm of Croux et al. (2007, 2013). Finally, we conclude the paper in Section 5.

1.1 Literature review on robust PCA

It is well known that PCA is an important technique for high-dimensional data reduction. PCA is based on the sample covariance matrix $S$ and involves searching for the linear combinations of the components of the observation vector that maximize the sample variance of the projected data. According to Mardia et al. (1979), the solution is given by the spectral decomposition

$$S = \Gamma \Lambda \Gamma^\top,$$

where $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_p)$ and its diagonal elements are the sample variances of the principal components, while $\Gamma$ is an orthogonal matrix, i.e. $\Gamma^\top \Gamma = I_p$, whose columns $\gamma_1, \dots, \gamma_p$ are the corresponding eigenvectors which define the linear combinations. In practice, the principal components are efficiently computed via the singular value decomposition (SVD), for example with Lanczos-type iterative algorithms.
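For illustration, here is a minimal Python sketch (not code from the paper; the data and dimensions are toy assumptions) of computing principal components via the SVD of the centred data matrix and checking them against the eigendecomposition $S = \Gamma \Lambda \Gamma^\top$:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 100, 5
    X = rng.normal(size=(n, p))                  # n samples, p variables

    Xc = X - X.mean(axis=0)                      # centre each column
    # Thin SVD: Xc = U diag(d) Vt; the rows of Vt are the eigenvectors of the
    # sample covariance matrix S = Xc.T @ Xc / (n - 1)
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)

    pc_variances = d**2 / (n - 1)                # sample variances of the PCs
    loadings = Vt.T                              # columns are the principal directions
    scores = Xc @ loadings                       # projections of the data onto the PCs

    # Sanity check against the eigendecomposition of the sample covariance matrix
    S = np.cov(Xc, rowvar=False)
    eigvals, Gamma = np.linalg.eigh(S)           # ascending order
    assert np.allclose(np.sort(eigvals)[::-1], pc_variances)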

Classical PCA, unfortunately, is non-robust, since it is based on the sample covariance or sample correlation matrix, both of which are very sensitive to outlying observations; see Section 2. This problem has been addressed by two different strategies which result in robust versions of PCA by:

i.

replacing the standard covariance or correlation matrix with a robust estimator; or

ii.

maximising (or minimising) a different objective function to obtain a robust PCA.

Many different proposals have been developed to carry out robust PCA, such as projection pursuit (PP), M-estimators, and so on.

Although maximum likelihood estimation is perhaps the most important statistical inference method, it can lead to poor results when the underlying assumptions are not satisfied, for instance when the data contain outliers or deviate slightly from the assumed model. A generalization of maximum likelihood estimation proposed by Huber (1964), called M-estimation, aims to produce robust statistics by constructing estimators that are resistant to deviations from the underlying assumptions. M-estimators were also defined for the multivariate case by Maronna (1976).

Campbell (1980) provided a procedure for robust PCA by examining estimates of means and covariances which are less affected by outlying observations, and by exploring the observations which have a large effect on the estimates. He replaced the sample covariance matrix by an M-estimator. Hubert and Verboven (2003) introduced a new approach to robust PCA which combines the advantages of two methods: the first is based on replacing the covariance or correlation matrix by a robust estimator, while the second is based on maximizing the objective function for this robust estimate.

A robust PCA based on the projection pursuit (PP) method was developed by Li and Chen (1985), using Huber's M-estimator of dispersion as the projection index. The objective of PP is to seek projections of the high-dimensional data set onto low-dimensional subspaces that optimise a measure of "interestingness". The function that is optimised is called the index or objective function, and its choice depends on the feature the researcher is concerned about. This property gives the PP technique the flexibility to handle many different statistical problems, ranging from clustering to identifying outliers in a multivariate data set.

Bolton and Krzanowski (1999) characterized the PCs obtained by PP in terms of maximum likelihood under the assumption of normality. PCA, as well as many other methods of multivariate analysis, can be considered a special case of PP.

The sample median was used as the projection index to develop a robust PCA by Xie et al. (1993), who in their simulation studies observed a PCA resistant to outliers and to deviations from the normal distribution.

Croux et al. (2007, 2013) also suggested a robust PCA using projection pursuit and we will contrast our methodology against their algorithm.

2 Preliminaries on standard PCA

PCA is an orthogonal linear transformation that projects the data onto a new coordinate system according to the variance of each direction. Given a data matrix $X \in \mathbb{R}^{n \times p}$, with each row corresponding to a sample, the first direction that maximizes the variance is defined through

$$u_1 = \operatorname*{argmax}_{\|u\|_2 = 1}\ \big\| \big(X - \mathbf{1}_n \bar{x}^\top\big) u \big\|_2^2,$$

where $\mathbf{1}_n$ is an $n$-dimensional vector whose elements are all set to 1, while $\bar{x}$ is the empirical mean. The process is repeated for as many principal directions as required, and at each iteration the to-be-estimated principal direction has to be orthogonal to all previously-computed principal directions. Thus, the $k$-th direction, which has to be orthogonal to the previous ones, is defined by

$$u_k = \operatorname*{argmax}_{\|u\|_2 = 1,\ u \perp u_1, \dots, u_{k-1}}\ \big\| \big(X - \mathbf{1}_n \bar{x}^\top\big) u \big\|_2^2.$$

2.1 Non-robustness of standard PCA

We will show that the influence functions of the largest eigenvalue of the covariance matrix and of the corresponding eigenvector are unbounded with respect to the norm of an outlying observation. Suppose that $\Sigma = \Sigma(F)$ is the covariance matrix of a population with distribution function $F$, i.e.,

$$\Sigma(F) = \int \big(x - \mu(F)\big)\big(x - \mu(F)\big)^\top \, dF(x), \qquad (1)$$

where $\mu(F) = \int x \, dF(x)$ corresponds to the mean vector. Assume that the leading eigenvalue of $\Sigma$ has multiplicity 1; we denote it by $\lambda_1$ and the leading eigenvector by $\gamma_1$ (i.e., $\Sigma \gamma_1 = \lambda_1 \gamma_1$ with $\|\gamma_1\|_2 = 1$).

Let $T$ be an arbitrary functional, $F$ a distribution and $x_0$ an arbitrary point in the relevant sample space. The influence function is defined as

$$\mathrm{IF}(x_0; T, F) = \lim_{\epsilon \downarrow 0} \frac{T\big((1-\epsilon)F + \epsilon\, \delta_{x_0}\big) - T(F)}{\epsilon}, \qquad (2)$$

where $\delta_{x_0}$ is a unit point mass located at $x_0$. An estimator defined by the functional $T$ is robust, in the sense used here, when its influence function is bounded with respect to the norm of the outlier $x_0$.

Proposition 2.1.

The influence function for the leading eigenvector $\gamma_1$ of $\Sigma(F)$ is given by¹

$$\mathrm{IF}(x_0; \gamma_1, F) = \big(\gamma_1^\top (x_0 - \mu)\big)\, (\lambda_1 I - \Sigma)^{+} (x_0 - \mu). \qquad (3)$$

Similarly, the IF for the largest eigenvalue $\lambda_1$ of $\Sigma(F)$ is

$$\mathrm{IF}(x_0; \lambda_1, F) = \big(\gamma_1^\top (x_0 - \mu)\big)^2 - \lambda_1. \qquad (4)$$

¹We use $A^{+}$ to denote the Moore-Penrose inverse of a matrix $A$.

The detailed calculations are presented in Appendix A1. The following result shows that outliers with unbounded influence function do exist.
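As a quick numerical sanity check of (4) (an illustrative sketch with toy numbers, not taken from the paper), one can compare the analytic expression with a finite-$\epsilon$ perturbation of the covariance matrix:

    import numpy as np

    Sigma = np.diag([4.0, 2.0, 1.0])             # population covariance, lambda_1 = 4
    mu = np.zeros(3)
    lam1, gamma1 = 4.0, np.array([1.0, 0.0, 0.0])

    x0 = np.array([3.0, -1.0, 2.0])              # candidate outlier
    eps = 1e-6
    d = x0 - mu

    # Covariance of the contaminated distribution (1 - eps) F + eps delta_{x0}
    Sigma_eps = (1 - eps) * Sigma + eps * (1 - eps) * np.outer(d, d)
    lam1_eps = np.linalg.eigvalsh(Sigma_eps)[-1]

    numerical_if = (lam1_eps - lam1) / eps       # finite-difference approximation
    analytic_if = (gamma1 @ d) ** 2 - lam1       # closed form (4)
    print(numerical_if, analytic_if)             # the two values nearly coincide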

Corollary 2.1.

Let $x_0 = \mu + a\gamma_1 + b v$, where $v$ is a unit vector which is orthogonal to $\gamma_1$ and does not belong to the null space of $(\lambda_1 I - \Sigma)^{+}$, and $a, b$ are real numbers. Then $\|\mathrm{IF}(x_0; \gamma_1, F)\| \to \infty$ as $|ab| \to \infty$,

and similarly for $\mathrm{IF}(x_0; \lambda_1, F)$.

Proof.

Direct substitution of $x_0 - \mu = a\gamma_1 + b v$ into the influence function (3) gives:

$$\mathrm{IF}(x_0; \gamma_1, F) = a\, (\lambda_1 I - \Sigma)^{+}\,(a\gamma_1 + b v) = a b\, (\lambda_1 I - \Sigma)^{+} v.$$

Since $v$ does not belong to the null space of $(\lambda_1 I - \Sigma)^{+}$, it holds that $(\lambda_1 I - \Sigma)^{+} v \neq 0$, thus $\|\mathrm{IF}(x_0; \gamma_1, F)\| = |ab|\, \|(\lambda_1 I - \Sigma)^{+} v\|$. Hence,

given that $\|(\lambda_1 I - \Sigma)^{+} v\| > 0$, either sending $a \to \infty$ or $b \to \infty$ completes the proof.

Similarly,

$$\mathrm{IF}(x_0; \lambda_1, F) = a^2 - \lambda_1 \to \infty$$

as $a \to \infty$. ∎
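The practical meaning of this unboundedness can be seen in a small Python sketch (the data and sizes are our own toy assumptions): a single observation of sufficiently large norm drags the leading sample eigenvector arbitrarily far from the population direction.

    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 200, 10
    variances = np.r_[25.0, np.ones(p - 1)]      # dominant variance along the first axis
    X = rng.normal(size=(n, p)) * np.sqrt(variances)
    true_dir = np.zeros(p); true_dir[0] = 1.0

    def leading_eigvec(X):
        S = np.cov(X, rowvar=False)
        return np.linalg.eigh(S)[1][:, -1]

    def angle_deg(u, v):
        c = abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return np.degrees(np.arccos(np.clip(c, 0.0, 1.0)))

    print("clean data:", angle_deg(leading_eigvec(X), true_dir))

    # One outlier along a direction orthogonal to the true eigenvector; as its
    # norm grows, the leading eigenvector chases the outlier.
    for k in [10.0, 100.0, 1000.0]:
        outlier = np.zeros(p); outlier[-1] = k
        Xc = np.vstack([X, outlier])
        print(f"outlier norm {k}:", angle_deg(leading_eigvec(Xc), true_dir))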

2.2 Generalizations of standard PCA

Standard PCA can be viewed as a special case of a more general optimization problem. We present two such generalizations: the first leads to projection pursuit algorithms while the second leads to a maximum likelihood formulation. Let $u$ be a unit vector and define the projection values

$$z_i = u^\top x_i, \qquad i = 1, \dots, n,$$

and a function $Q(\cdot)$ acting on the projected values. The first generalization of PCA is defined as the maximization of $Q$:

$$\max_{\|u\|_2 = 1}\ Q(z_1, \dots, z_n).$$

As in standard PCA, the subsequent principal directions are obtained after removing the contribution of the current principal component from the data. When $Q$ is the sample variance then we recover standard PCA.

The second generalization interprets the computation of the principal component as a maximum likelihood estimation problem. By letting

$$\ell_G(\mu, \sigma^2; z_1, \dots, z_n) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(z_i - \mu)^2 \qquad (5)$$

be the Gaussian log-likelihood of the projected values, the first principal direction can be obtained by solving the minimax problem:

$$\min_{\|u\|_2 = 1}\ \max_{\mu,\, \sigma^2}\ \ell_G(\mu, \sigma^2; u^\top x_1, \dots, u^\top x_n).$$

Indeed, the inner maximization can be solved analytically, which leads to the optimal solution

$$\hat\mu(u) = \frac{1}{n}\sum_{i=1}^{n} u^\top x_i \qquad \text{and} \qquad \hat\sigma^2(u) = \frac{1}{n}\sum_{i=1}^{n}\big(u^\top x_i - \hat\mu(u)\big)^2.$$

Unsurprisingly, the optimal values are the sample mean and the sample variance of the projections. Using the above formulas it is straightforward to show that

$$\operatorname*{argmin}_{\|u\|_2 = 1}\ \max_{\mu,\, \sigma^2}\ \ell_G(\mu, \sigma^2; u^\top x_1, \dots, u^\top x_n) \;=\; \operatorname*{argmax}_{\|u\|_2 = 1}\ \hat\sigma^2(u), \qquad (6)$$

so that the minimax formulation recovers standard PCA. Variations of PCA can be derived by changing the likelihood function, and in the next section we analyze the case of the Cauchy distribution.
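As a small numerical illustration of (6) (toy data; not from the paper), the sketch below checks that the direction minimizing the profile Gaussian log-likelihood of the projections agrees with the leading eigenvector of the sample covariance matrix:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(500, 4)) @ np.diag([3.0, 2.0, 1.0, 0.5])

    def profile_gaussian_loglik(u, X):
        z = X @ u                                    # projected values
        s2 = z.var()                                 # inner maximizers: z.mean() and z.var()
        return -0.5 * z.size * np.log(2 * np.pi * s2) - 0.5 * z.size

    # Standard PCA answer: leading eigenvector of the sample covariance matrix
    u_pca = np.linalg.eigh(np.cov(X, rowvar=False))[1][:, -1]

    # Crude search over random unit vectors for the likelihood formulation
    best_u, best_val = None, np.inf
    for _ in range(20000):
        u = rng.normal(size=4); u /= np.linalg.norm(u)
        val = profile_gaussian_loglik(u, X)
        if val < best_val:
            best_u, best_val = u, val

    print(abs(best_u @ u_pca))                       # expected to be close to 1 (up to sign)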

3 Cauchy PCA

The Cauchy log-likelihood function of the projected values is given by

$$\ell_C(\mu, \sigma; z_1, \dots, z_n) = n\log\sigma - n\log\pi - \sum_{i=1}^{n}\log\big(\sigma^2 + (z_i - \mu)^2\big), \qquad (7)$$

where $\mu$ and $\sigma > 0$ are the two parameters (location and scale) of the Cauchy distribution. The first Cauchy principal direction is also obtained by solving the minimax optimization problem:

$$\min_{\|u\|_2 = 1}\ \max_{\mu,\, \sigma}\ \ell_C(\mu, \sigma; u^\top x_1, \dots, u^\top x_n). \qquad (8)$$

In contrast to the Gaussian case, the inner maximization cannot be performed analytically, so an iterative approach needs to be utilized. Here, we apply the Newton-Raphson method with initial values the median and half the interquartile range for the location and scale parameters, respectively. According to Copas (1975), although the mean of the Cauchy distribution does not exist and it has infinite variance, the Cauchy log-likelihood function has a unique maximum, so the maximum likelihood estimate $(\hat\mu, \hat\sigma)$ is well defined.
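For the inner maximization in (8), the following is a minimal Newton-Raphson sketch for the univariate Cauchy location-scale MLE, initialized at the median and half the interquartile range as described above. It is an illustrative implementation, not the authors' code, and the name cauchy_mle is ours.

    import numpy as np

    def cauchy_mle(z, n_iter=50, tol=1e-10):
        """Maximize the Cauchy log-likelihood of the projected values z."""
        n = z.size
        mu = np.median(z)
        sigma = 0.5 * (np.quantile(z, 0.75) - np.quantile(z, 0.25))
        for _ in range(n_iter):
            r = z - mu
            d = sigma**2 + r**2
            # Gradient of l(mu, sigma) = n log sigma - sum_i log(sigma^2 + (z_i - mu)^2) + const
            g = np.array([np.sum(2 * r / d), n / sigma - np.sum(2 * sigma / d)])
            # Hessian entries
            h_mm = np.sum(2 * (r**2 - sigma**2) / d**2)
            h_ss = -n / sigma**2 + np.sum(2 * (sigma**2 - r**2) / d**2)
            h_ms = -np.sum(4 * sigma * r / d**2)
            H = np.array([[h_mm, h_ms], [h_ms, h_ss]])
            step = np.linalg.solve(H, g)
            mu, sigma = mu - step[0], max(sigma - step[1], 1e-12)
            if np.linalg.norm(step) < tol:
                break
        return mu, sigma

    # Example usage on heavy-tailed data: location near 0, scale near 1
    z = np.random.default_rng(3).standard_cauchy(1000)
    print(cauchy_mle(z))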

Fixing $\hat\mu$ and $\hat\sigma$, the outer minimization is also non-analytic and a fixed point iteration is applied to calculate $u$. The iteration is given by

$$u^{(t+1)} = \frac{w^{(t)}}{\big\|w^{(t)}\big\|_2}, \qquad (9)$$

where $w^{(t)}$ is the unnormalized direction, which is obtained from the differentiation of the Lagrangian function with respect to $u$ and is given by

$$w^{(t)} = \nabla_u\, \ell_C\big(\hat\mu, \hat\sigma; u^\top x_1, \dots, u^\top x_n\big)\Big|_{u = u^{(t)}} = -\sum_{i=1}^{n} \frac{2\big(x_i^\top u^{(t)} - \hat\mu\big)}{\hat\sigma^2 + \big(x_i^\top u^{(t)} - \hat\mu\big)^2}\, x_i. \qquad (10)$$

Once the first principal direction has been computed, its contribution is removed from the dataset and the same procedure is repeated to estimate the next principal direction. This process is repeated for as many principal directions as required. The removal of the contribution makes the principal directions orthogonal to each other. We summarize the estimation of the Cauchy principal components in the following pseudo-code (Algorithm 1).

  for $k = 1, \dots, d$ do
      Initialize $u$ and normalize: $u \leftarrow u / \|u\|_2$
     while not converged do
         Fix $u$ and set $z_i = x_i^\top u$, $i = 1, \dots, n$
         Via the Newton-Raphson algorithm find $(\hat\mu, \hat\sigma) = \operatorname*{argmax}_{\mu,\sigma}\, \ell_C(\mu, \sigma; z_1, \dots, z_n)$
         Fix $(\hat\mu, \hat\sigma)$ and using the fixed point iteration (i.e., (10) & (9)) find $u$
     end while
      Set the $k$-th Cauchy principal direction $u_k \leftarrow u$
      Remove the contribution of $u_k$ from the dataset: $x_i \leftarrow x_i - (x_i^\top u_k)\, u_k$ for all $i$
  end for
Algorithm 1 Cauchy PCA
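The following Python sketch is one possible reading of Algorithm 1, not the authors' reference implementation: the inner maximization over $(\mu, \sigma)$ is delegated to scipy.stats.cauchy.fit (initialized at the median and half the interquartile range, in place of the explicit Newton-Raphson step), and the direction update normalizes the gradient in (10) as in (9). The names cauchy_pca, Z and w are our own.

    import numpy as np
    from scipy.stats import cauchy

    def cauchy_pca(X, n_components=2, n_iter=200, tol=1e-8, seed=0):
        rng = np.random.default_rng(seed)
        n, p = X.shape
        # Robust standardization: column-wise median and mean absolute deviation
        Z = X - np.median(X, axis=0)
        Z = Z / np.mean(np.abs(Z), axis=0)

        directions = []
        for _ in range(n_components):
            u = rng.normal(size=p)
            u /= np.linalg.norm(u)
            for _ in range(n_iter):
                z = Z @ u
                mu0 = np.median(z)
                s0 = 0.5 * (np.quantile(z, 0.75) - np.quantile(z, 0.25))
                mu, sigma = cauchy.fit(z, loc=mu0, scale=s0)   # inner maximization
                r = z - mu
                # Unnormalized direction: gradient of the Cauchy log-likelihood wrt u
                w = -(Z * (2 * r / (sigma**2 + r**2))[:, None]).sum(axis=0)
                u_new = w / np.linalg.norm(w)
                if u_new @ u < 0:                              # fix the sign indeterminacy
                    u_new = -u_new
                converged = np.linalg.norm(u_new - u) < tol
                u = u_new
                if converged:
                    break
            directions.append(u)
            Z = Z - np.outer(Z @ u, u)         # deflation: remove the contribution of u
        return np.column_stack(directions)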

3.1 Robustness of the Leading Cauchy Principal Direction

Let $\theta = (\mu, \sigma)^\top$ be the parameter vector of the Cauchy distribution and consider the infinite-sample normalized Cauchy log-likelihood function

$$\ell(u, \theta; F) = \int \log f_C\big(u^\top x;\, \theta\big)\, dF(x), \qquad (11)$$

where $f_C(z; \theta) = \dfrac{\sigma/\pi}{\sigma^2 + (z - \mu)^2}$ is the Cauchy density and $F$ is the population distribution. We will estimate the influence function for the leading Cauchy principal direction

$$u(F) = \operatorname*{argmin}_{\|u\|_2 = 1}\ \ell\big(u, \theta^*(u); F\big), \qquad (12)$$

where $\theta^*(u) = \operatorname*{argmax}_{\theta}\, \ell(u, \theta; F)$ is the optimal Cauchy parameter vector for a given direction $u$.

Since $u$ is restricted to be a unit vector, the standard condition for the minimum, i.e., $\nabla_u\, \ell(u, \theta^*(u); F) = 0$, is not valid. The proper condition is defined by

$$P_u\, \nabla_u\, \ell\big(u, \theta^*(u); F\big) = 0, \qquad (13)$$

where $P_u$ is the projection matrix given by $P_u = I - u u^\top$.

Remark 3.1.

An equivalent condition is that $w^\top \nabla_u\, \ell(u, \theta^*(u); F) = 0$ for all $w$ such that $w^\top u = 0$ and $\|w\|_2 = 1$. Both derived conditions are essentially a consequence of the Lagrangian formulation of the constrained optimization problem. Indeed, the Lagrange condition implies that at the minimum the derivative of the objective function should be parallel to the derivative of the constraint, which translates to $\nabla_u\, \ell(u, \theta^*(u); F) = \nu u$, where $\nu$ is the Lagrange multiplier.
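For concreteness, the equivalence asserted in the remark can be written out explicitly. Stationarity of the Lagrangian $L(u, \nu) = \ell(u, \theta^*(u); F) - \nu\big(\|u\|_2^2 - 1\big)$ gives $\nabla_u\, \ell(u, \theta^*(u); F) = 2\nu u$, and for a unit vector $u$ this holds if and only if

$$\big(I - u u^\top\big)\, \nabla_u\, \ell\big(u, \theta^*(u); F\big) = 0,$$

since projecting $2\nu u$ with $I - u u^\top$ gives zero, while conversely $(I - u u^\top)\nabla_u \ell = 0$ forces $\nabla_u \ell = (u^\top \nabla_u \ell)\, u$, i.e. $\nabla_u \ell$ is parallel to $u$ with $\nu = \tfrac{1}{2}\, u^\top \nabla_u \ell$.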

Let $\ell^* = \ell(u, \theta^*(u); F)$ denote the likelihood function computed at $\theta^*(u)$, and let $\ell_u$ and $\ell_\theta$ denote its partial derivatives with respect to $u$ and $\theta$, respectively.

Similarly, $\ell_{uu}$, $\ell_{u\theta}$ and $\ell_{\theta\theta}$ denote the second order derivatives. The following proposition establishes the expression for the influence function of the leading Cauchy principal direction, $u(F)$.

Proposition 3.1.

Under the assumption that the relevant matrices (defined below) are invertible, the influence function of the leading Cauchy principal direction $u(F)$ is

(14)

where

and

while

is the expected Fisher information matrix under $F$ for the parameters of the Cauchy distribution computed at $\theta^*(u)$.

Proof.

The proof consists of several straightforward series expansions and implicit function calculations. The complete proof is given in Appendix A2. ∎

The following boundedness result for the influence function states the conditions under which Cauchy PCA is robust.

Corollary 3.1.

Let the assumptions of the proposition hold. If or if but then the influence function for is bounded.

Proof.

First, observe that matrix does not depend on . It is only that depends on and our goal is to prove that is bounded with respect to . Second, we have to compute the partial derivatives and . Straightforward calculations lead to

and

Let us now define an arbitrary scaling of the outlier and prove boundedness of as we send . We consider the first case where . It holds that , and therefore is bounded with respect to .

For the second case, we have

and

since by assumption. Thus is bounded with respect to for the second case, too. ∎

The only case not covered by the corollary is the remaining one, which involves outliers orthogonal to the Cauchy principal direction. Our experiments presented in the following section show that outliers that are orthogonal to the Cauchy principal direction do sometimes influence the estimation of the Cauchy principal direction, though not significantly.

3.2 Several Cauchy principal components

We briefly mention possibilities for estimating several Cauchy principal components. There are two obvious approaches: one approach, the sequential approach, is to repeat the algorithm described above on the subspace orthogonal to to obtain , the second Cauchy principal component, where is the first Cauchy principal component; then repeat the procedure on the subspace orthogonal to and to obtain ; and so on. A second approach, the simultaneous approach, is to decide in advance how many principal components we wish to determine, say, and then use a -dimensional multivariate Cauchy likelihood, which has free parameters, to obtain . It turns out that these two approaches lead to equivalent results in classical (Gaussian) PCA but when a Cauchy likelihood is used the two approaches produce different sets of principal components. Our current thinking is this: the sequential approach is easier to implement (essentially the same software can be used at each step) and it is faster. However, the simultaneous approach could potentially be preferable if we know in advance how many principal components we wish to estimate. Further investigation is required.

4 Numerical Results

4.1 Simulation studies

In this section we empirically validate our proposed methodology via simulation studies. We searched for R packages that offer robust PCA in the high-dimensional setting and came up with FastHCS (Vakili, 2018), rrcovHD (Todorov, 2016), rpca (Sykulski, 2017) and pcaPP (Filzmoser et al., 2018). Of these, pcaPP (Projection Pursuit PCA) is the only one which does not require hyper-parameter tuning, e.g. selection of the LASSO penalty or choice of the percentage of observations used to estimate a robust covariance matrix.

4.1.1 Setup of the simulations

Initially, we created an (orthonormal) basis $\Gamma$ by applying the QR decomposition to some randomly generated data. We then generated decreasing eigenvalues $\lambda_1 > \dots > \lambda_p$, and hence obtained a covariance matrix $\Sigma = \Gamma \Lambda \Gamma^\top$, where $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_p)$. The first column of $\Gamma$ served as the first "clean" eigenvector, and was the benchmark in our comparative evaluations. Following this step, we simulated random vectors with covariance matrix $\Sigma$ and, in order to check the robustness of the results to the center of the data, all observations were shifted to the right by adding a constant everywhere. A number of outliers equal to 2% of the sample size were then introduced. These outliers were of the form $\bar{x} + k\, v$, where $\bar{x}$ is the sample mean vector, $v$ denotes unit vector(s) and $k$ is a real number denoting their norm; $k$ was varied over a grid of increasing values, and the angle $\phi$ between the outliers and the first "clean" eigenvector spanned a range of values from small to large. In all cases, we subtracted the spatial median or the column-wise median² and scaled the variables by the mean absolute deviation.

²The results are quite similar for either type of median and here we show the results for the column-wise median.

In each case, we computed the first Cauchy-PCA eigenvector and the first PP-PCA eigenvector. The performance metric is the angle (in degrees) between the first robust (Cauchy or PP-PCA) eigenvector and the first "clean" eigenvector computed using classical PCA. All experiments were repeated a number of times and the results were averaged.
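A rough Python sketch of this simulation design is given below; the specific values of $p$, $n$, the shift, the contamination level and the outlier norm are illustrative assumptions rather than the exact settings used in the paper.

    import numpy as np

    rng = np.random.default_rng(4)
    p, n = 1000, 50

    # Orthonormal basis from the QR decomposition of randomly generated data
    Gamma, _ = np.linalg.qr(rng.normal(size=(p, p)))
    eigvals = np.sort(rng.uniform(0.5, 10.0, size=p))[::-1]    # decreasing eigenvalues
    Sigma_half = Gamma @ np.diag(np.sqrt(eigvals)) @ Gamma.T   # symmetric square root of Sigma
    clean_eigvec = Gamma[:, 0]                                 # benchmark "clean" eigenvector

    X = rng.normal(size=(n, p)) @ Sigma_half + 10.0            # shifted clean data

    # Contaminate 2% of the sample with outliers of norm k along a unit direction v
    n_out = max(1, int(0.02 * n))
    v = rng.normal(size=p); v /= np.linalg.norm(v)
    k = 1e3
    outlier = X.mean(axis=0) + k * v
    X_cont = np.vstack([X, np.tile(outlier, (n_out, 1))])

    # Robust standardization: column-wise median and mean absolute deviation
    Z = X_cont - np.median(X_cont, axis=0)
    Z = Z / np.mean(np.abs(Z), axis=0)

    def angle_deg(u, w):
        """Performance metric: angle in degrees between two directions."""
        c = abs(u @ w) / (np.linalg.norm(u) * np.linalg.norm(w))
        return np.degrees(np.arccos(np.clip(c, 0.0, 1.0)))

    # Non-robust benchmark: classical PCA on the contaminated, standardized data
    u_classical = np.linalg.eigh(np.cov(Z, rowvar=False))[1][:, -1]
    print(angle_deg(u_classical, clean_eigvec))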

4.1.2 Comparative results

Tables 1-3 present the performance of the first Cauchy-PCA eigenvector and of the first PP-PCA eigenvector for a variety of norms of the outliers, with different angles ($\phi$) between the outliers and the leading true eigenvector, for the $p \gg n$ case.

The $p \gg n$ case was selected as statistical inference in this setting is more challenging than the $n > p$ case³. Additionally, this case is ordinarily met in the field of bioinformatics, where -omics data involve tens of thousands of variables (genes, single nucleotide polymorphisms, etc.) but only tens or at most hundreds of observations.

³In this paper we focus on high-dimensional simulations and real-data examples, but in results not presented here we found that Cauchy PCA is also very competitive and performs strongly in low-dimensional settings.

As observed in Tables 1-3, the average angular difference between the Cauchy and the PP PCA is quite substantial, providing evidence that Cauchy PCA has performed in a superior manner to the projection pursuit method of Croux et al. (2007, 2013). In particular, the tables demonstrate that Cauchy PCA is less error prone than its competitor but, as is seen in Table 3, the error decreases for both methods with increasing sample size. Further, the mean angular difference between the two methods increases as the angle $\phi$ increases; in Table 1, for instance, the gap between the two methods widens considerably as the angle grows. The error of the Cauchy PCA is otherwise not highly affected by the angle $\phi$ or by the norm of the outliers. It can be seen in Table 2 and Table 3 that in the special case of the largest angle the error for the Cauchy PCA increases, thus corroborating the result of Corollary 3.1. However, this effect, as in Table 1, is rather small, though noticeable.

Angle Method k=-Inf k=3 k=4 k=5 k=6 k=7 k=8
Cauchy 31.17 29.79 29.54 28.83 28.86 29.24 28.78
PP 82.45 49.91 48.84 48.22 49.08 49.61 48.14
Cauchy 31.44 29.24 29.13 28.60 28.89 29.34 29.65
PP 82.45 65.28 65.34 63.42 62.96 66.63 65.43
Cauchy 31.49 29.86 29.07 29.04 29.55 29.70 29.09
PP 82.11 81.11 82.55 82.63 82.12 82.49 82.03
Cauchy 32.32 31.67 33.00 33.13 32.86 33.19 33.06
PP 82.38 82.06 81.69 82.12 81.73 81.74 81.88
Table 1: Mean angular difference between the robust eigenvectors computed in the contaminated data and the sample eigenvector computed in the clean data when and . The norm of the outliers is and their angle with the true clean eigenvector is denoted by .
Angle Method k=-Inf k=3 k=4 k=5 k=6 k=7 k=8
Cauchy 36.53 33.12 33.60 33.69 32.62 32.51 33.16
PP 83.06 80.36 80.17 81.87 80.50 80.76 80.16
Cauchy 36.55 34.72 33.91 33.09 33.11 33.16 32.79
PP 83.07 82.36 82.76 82.65 83.07 82.93 83.12
Cauchy 36.42 34.46 33.96 33.61 34.41 33.07 33.47
PP 83.78 82.86 82.71 84.05 83.46 82.71 82.78
Cauchy 36.50 36.12 36.81 37.18 39.34 39.11 38.51
PP 83.63 83.73 83.69 83.65 84.03 83.66 83.00
Table 2: Mean angular difference between the robust eigenvectors computed in the contaminated data and the sample eigenvector computed in the clean data when and . The norm of the outliers is and their angle with the true clean eigenvector is denoted by .
Angle Method k=-Inf k=3 k=4 k=5 k=6 k=7 k=8
Cauchy 19.95 18.60 18.46 18.35 18.24 18.20 17.93
PP 68.76 26.08 24.93 24.91 24.83 24.73 24.72
Cauchy 19.43 18.30 18.39 18.22 18.16 18.01 18.13
PP 68.98 39.72 38.88 38.44 38.20 38.15 38.14
Cauchy 19.76 18.60 18.12 18.20 18.40 18.19 18.01
PP 69.10 64.10 63.12 62.89 62.91 62.82 62.77
Cauchy 19.49 19.84 20.16 21.87 22.41 22.87 22.84
PP 68.99 68.62 68.59 68.70 68.45 68.73 68.43
Table 3: Mean angular difference between the robust eigenvectors computed in the contaminated data and the sample eigenvector computed in the clean data when and . The norm of the outliers is and their angle with the true clean eigenvector is denoted by .

4.2 High dimensional real datasets

Two real gene expression datasets, GSE13159 and GSE31161⁴, downloaded from the Biodataome platform (Lakiotaki et al., 2018), were used in the experiments. We randomly selected a subset of the variables and computed the outliers using the high-dimensional Minimum Covariance Determinant (MCD) of Ro et al. (2015). In accordance with the simulation studies, we removed the most extreme outliers detected by MCD and computed the first classical PC (benchmark eigenvector), the first Cauchy-PCA eigenvector and the first PP-PCA eigenvector of the "clean" data. We then added those outliers back, progressively increasing their norm, and computed the first Cauchy-PCA eigenvector and the first PP-PCA eigenvector of the contaminated data. In all cases, we subtracted the spatial median or the column-wise median and scaled the variables by the mean absolute deviation. The performance metrics are the angle (in degrees) between the first robust (Cauchy or PP-PCA) eigenvector and the first true "clean" eigenvector, and the time required by each method. This procedure was repeated a number of times and the average results are graphically displayed in Figures 1(a)-(d).

⁴From a biological standpoint, the data have already been uniformly pre-processed, curated and automatically annotated.

Broadly speaking, the performance of the PP PCA does not seem to be affected substantially by the centering method, i.e. subtraction of the spatial or the column-wise median. On the contrary, the Cauchy PCA is affected by the type of median employed. Centering with the spatial median yields high error levels for all norms of the outliers, for both datasets, whereas centering with the column-wise median produces much lower error levels. On average, the difference in the error between Cauchy PCA and PP PCA is noticeable for both the GSE13159 dataset (Figure 1(a)) and the GSE31161 dataset (Figure 1(b)). However, the error of the Cauchy PCA increases and then stabilizes in the GSE13159 dataset, whereas the error of the PP PCA is stable regardless of the norm of the outliers. A different conclusion is drawn for GSE31161, where the error of either method decreases as the norm of the outliers increases, until it reaches a plateau.

With regard to computational efficiency, the PP PCA is not affected by either centering method, whereas Cauchy PCA seems to be affected in the GSE13159 dataset but not in the GSE31161 dataset, as seen in Figures 1(c) and 1(d). Cauchy PCA centered with the column-wise median is, on average, 5 times faster than PP PCA.

Figure 1: Results for the GSE13159 (left column, panels (a) and (c)) and GSE31161 (right column, panels (b) and (d)) datasets. The first row presents the angle between the first Cauchy PC of the "contaminated" data and the leading eigenvector of the "clean" data, and the corresponding angle for the first Projection Pursuit PC, for increasing norms of the outliers. The second row contains the computation time in seconds.

5 Conclusion

The starting point for this paper is the observation that classical PCA can be formulated purely in terms of operations on a Gaussian likelihood. Although this observation is not new, the specifics of this formulation of classical PCA do not appear to be as widely known as might be expected. The novel idea underlying this paper is to formulate a version of PCA in which a Cauchy likelihood is used instead of a Gaussian likelihood, leading to what we call Cauchy PCA. Study of the resulting influence functions shows that Cauchy PCA has very good robustness properties. Moreover, we have provided an implementation of Cauchy PCA which runs quickly and reliably. Numerous simulation and real-data examples, mainly in high-dimensional settings, show that Cauchy PCA typically out-performs alternative robust versions of PCA whose implementation is in the public domain.

Appendix

A1 Proof of Proposition 2.1

Proof.

The perturbed distribution $F_\epsilon = (1-\epsilon)F + \epsilon\,\delta_{x_0}$ has perturbed mean value

$$\mu_\epsilon = (1-\epsilon)\mu + \epsilon\, x_0$$

and perturbed covariance matrix

$$\Sigma_\epsilon = (1-\epsilon)\Sigma + \epsilon(1-\epsilon)(x_0 - \mu)(x_0 - \mu)^\top.$$

Denoting by $\lambda_1^\epsilon$ the leading eigenvalue of $\Sigma_\epsilon$ and by $\gamma_1^\epsilon$ the corresponding eigenvector, it holds that

$$\Sigma_\epsilon\, \gamma_1^\epsilon = \lambda_1^\epsilon\, \gamma_1^\epsilon. \qquad (15)$$

Next, we expand the perturbed eigenvector and eigenvalue around the unperturbed ones as follows:

$$\gamma_1^\epsilon = \gamma_1 + \epsilon\, \dot\gamma_1 + O(\epsilon^2)$$

and

$$\lambda_1^\epsilon = \lambda_1 + \epsilon\, \dot\lambda_1 + O(\epsilon^2),$$

with

$$\dot\gamma_1 = \mathrm{IF}(x_0; \gamma_1, F) \qquad \text{and} \qquad \dot\lambda_1 = \mathrm{IF}(x_0; \lambda_1, F).$$

Substituting the formulas into (15), and equating the zero-th and first order terms in $\epsilon$, we get

$$\Sigma\, \gamma_1 = \lambda_1\, \gamma_1$$

and

$$\Sigma\, \dot\gamma_1 + \big((x_0 - \mu)(x_0 - \mu)^\top - \Sigma\big)\gamma_1 = \lambda_1\, \dot\gamma_1 + \dot\lambda_1\, \gamma_1, \qquad (16)$$

and, from the normalization $\|\gamma_1^\epsilon\|_2 = 1$,

$$\gamma_1^\top \dot\gamma_1 = 0.$$

Multiplying (16) from the left with $\gamma_1^\top$, we get

$$\dot\lambda_1 = \big(\gamma_1^\top (x_0 - \mu)\big)^2 - \lambda_1.$$

For $\dot\gamma_1$, we rearrange (16) to

$$(\lambda_1 I - \Sigma)\, \dot\gamma_1 = \big((x_0 - \mu)(x_0 - \mu)^\top - \Sigma - \dot\lambda_1 I\big)\gamma_1$$

and then multiply from the left with the pseudo-inverse of $\lambda_1 I - \Sigma$ to obtain

$$(\lambda_1 I - \Sigma)^{+}(\lambda_1 I - \Sigma)\, \dot\gamma_1 = \big(\gamma_1^\top(x_0 - \mu)\big)\, (\lambda_1 I - \Sigma)^{+}(x_0 - \mu).$$

Using the properties (Mardia et al. (1979)): $(\lambda_1 I - \Sigma)^{+}\gamma_1 = 0$ and $(\lambda_1 I - \Sigma)^{+}(\lambda_1 I - \Sigma) = I - \gamma_1\gamma_1^\top$, together with $\gamma_1^\top \dot\gamma_1 = 0$, we obtain

$$\dot\gamma_1 = \big(\gamma_1^\top(x_0 - \mu)\big)\, (\lambda_1 I - \Sigma)^{+}(x_0 - \mu),$$

and the proof is completed. ∎

A2 Proof of Proposition 3.1

Let us first make the notation more explicit: denote by $\ell(u, \theta; F)$ the Cauchy log-likelihood function with respect to the distribution $F$ and by $u(F)$ the respective leading Cauchy principal direction. Then, our goal is to calculate the limit of

$$\frac{u\big((1-\epsilon)F + \epsilon\,\delta_{x_0}\big) - u(F)}{\epsilon}$$

as $\epsilon \downarrow 0$, where $u\big((1-\epsilon)F + \epsilon\,\delta_{x_0}\big)$ is the leading Cauchy principal direction for the perturbed distribution. The optimality condition for the leading Cauchy principal direction reads

(17)

and

Moreover, is a unit vector which can be represented as

where is a unit vector perpendicular to and is a (small) real number. Under these assumptions, is a unit vector since

Obviously, depends on and (i.e., ) and but we choose to avoid denoting their explicit relationship because it is not required in our proof. Moreover, a Taylor expansion for the representation leads to

thus we obtain that

Next, we compute the partial derivative using the chain rule

Therefore,

The second summand equals zero because $\theta^*(u)$ maximizes the Cauchy log-likelihood function, thus it holds that $\ell_\theta(u, \theta^*(u); F) = 0$. Similarly,

Next, we further Taylor expand using

Using again the chain rule, we obtain that

The computation of the partial derivative follows. Formula