1 Introduction
In the analysis of multivariate data, it is frequently desirable to employ statistical methods which are insensitive to the presence of outliers in the sample. To address the problem of outliers, it is important to develop robust statistical procedures. Most statistical procedures include explicit or implicit prior assumptions about the distribution of the observations, but often without taking into account the effect of outliers. The purpose of this paper is to present a novel robust version of PCA which has some attractive features.
Principal components analysis (PCA) is considered to be one of the most important techniques in statistics. However, the classical version of PCA depends on either a covariance or a correlation matrix, both of which are very sensitive to outliers. We develop an alternative method to classical PCA, which is far more robust, by using a multivariate Cauchy likelihood to construct a robust principal components (PC) procedure. It is an adaptation of the classic method of PCA obtained by replacing the Gaussian loglikelihood function by the Cauchy loglikelihood function, in a sense that will be explained in section 2.2. Although we do not claim that the interpretation of standard PCA in terms of operations on a Gaussian likelihood is new, see Bolton and Krzanowski, this fact does not appear to have been exploited in the development of a robust PCA procedure, as we do in this paper. An important reason for using the multivariate Cauchy likelihood is that this likelihood has only one maximum point, but the single most important motivation is that it leads to a robust procedure.
In the next section we review briefly some of the techniques employed for estimating parameters and for directing a PCA in ways which are robust against the presence of outliers. We also present robustness preliminaries that include some important techniques which are necessary to assess whether the method used is robust or not. In Section 3 we develop the CauchyPCA and theoretically explore its robustness properties. Finally, in Section 4 we present the numerical algorithms for creating Cauchy PCs, and also give the results of a number of very highdimensional realdata and simulated examples. Our approach is seen to be competitive with, and often gives superior results to, that of the projection pursuit algorithm of Croux et al. (2007, 2013). Finally we conclude the paper in Section 5.
1.1 Literature review on robust PCA
It is well known that PCA is an important technique for highdimensional data reduction. PCA is based on the sample covariance matrix
and it involves searching for a linear combination of thecomponents of the vector that maximize the sample variance of the components of
. According to Mardia et al. (1979), the solution will be given by the equationwhere and its diagonal elements are the sample variances, while
is an orthogonal matrix, i.e.
, whose columnsare the corresponding eigenvectors which represent the linear combinations. [[The principal components are efficiently estimated in practice via Singular Value Decomposition (SVD) (cite Lanczos for an efficient algorithm).]]
Classical PCA, unfortunately, is nonrobust, since it based on the sample covariance or sample correlation matrix which are very sensitive to outlying observations; see section 2. However, this problem has been handled by two different methods which result in robust versions of PCA by:
 i.

replacing the standard covariance or correlation matrix with a robust estimator; or
 ii.

maximising (or minimising) a different objective function to obtain a robust PCA.
Many different proposes had been developed to carry out robust PCA, such that using projection pursuit PP, estimators and so on.
Despite maximum likelihood estimation, perhaps, being considered as the most important statistical inference method, sometimes this approach can lead to improper results when the underlying assumptions are not satisfied, for instance, when data contain outliers or deviate slightly from the supposed model. A generalization of maximum likelihood estimation proposed by Huber (1964) which is called estimation, aims to produce a robust statistic by constructing approaches that are resistant to deviations from the underline assumptions. estimators were also defined for the multivariate case by Maronna (1976).
Campbell (1980) provided a procedure for robust PCA by examining the estimates of means and covariances which are less affected by outlier observations, and by exploring the observations which have a large effect on the estimates. He replaced the sample covariance sample by an estimator. Hubert and Verboven (2003) introduced a new approach to create robust PCA. It combines the advantages of two methods, the first one is based on replacing the covariance or correlation matrix by its robust estimator, while the second one is based on maximizing the objective function for this robust estimate.
A robust PCA based on the projection pursuit (PP) method was developed by Li and Chen (1985), using Huber’s estimator of dispersion as the projection index. The objective of PP is to seek projections, of the highdimensional data set onto lowdimensional subspaces, that optimise a function of ”interestingness”. The function that should be optimised is called an index or objective function and its choice depends on a feature that the researcher is concerned about. This property gives the PP technique a flexibility to handle many different statistical problems range from clustering to identifying outliers in a multivariate data set.
Bolton and Krzanowski (1999)
characterized the PC’s for PP in terms of maximum likelihood under the assumption of normality. PCA can be considered as a special case of PP as well as many other methods of multivariate analysis.
Li and Chen (1985) used Huber’s estimator of dispersion as projective index to develop a robust PCA based on the PP approach. The sample median was used as a projective index to develop a robust PCA by Xie et al. (1993). In their simulation studies, Xie et al. (1993)observed a PCA resistant to outliers and deviations from the normal distribution.
Croux et al. (2007, 2013) also suggested a robust PCA using projection pursuit and we will contrast our methodology against their algorithm.2 Preliminaries on standard PCA
PCA is an orthogonal linear transformation that projects the data to a new coordinate system according to the variance of each direction. Given a data matrix
with each row correspond to a sample, the first direction that maximizes the variance is defined throughwhere is an dimensional vector whose elements are all set to 1 while is the empirical mean. The process is repeated times and at each iteration the tobeestimated principal direction has to be orthogonal to all previouslycomputed principal directions. Thus, the th direction which has to be orthogonal to the previous ones is defined by
2.1 Nonrobustness of standard PCA
We will show that the influence function for the largest eigenvalue of the covariance matrix and the respective eigenvector are unbounded with respect to the norm of an outlier sample. Suppose that
is the covariance matrix of a population with distribution function , i.e.,(1) 
where corresponds to the mean vector. Assume that the leading eigenvalue of has multiplicity 1, then we denote it by and the leading eigenvector by (i.e., ).
Let be an arbitrary functional, a distribution and an arbitrary point in the relevant sample space. The influence function is defined as
(2) 
where is a unit point mass located at .
A robust estimator for means that the influence function is bounded with respect to the norm of the outlier .
Proposition 2.1.
The influence function for the leading eigenvector of is given by^{1}^{1}1We use to denote the MoorePenrose inverse of a matrix .
(3) 
Similarly, the IF for the largest eigenvalue of is
(4) 
The detailed calculations are presented in Appendix A1. The following result shows that outliers with unbounded influence function do exist.
Corollary 2.1.
Let where is orthogonal to and does not belong to the null space of and then
and similarly for .
Proof.
Direct substitution of into the influence function gives:
Since does not belong to the null space of , it holds that thus . Hence,
Given that , either sending or completes the proof.
Similarly,
as . ∎
2.2 Generalizations of standard PCA
Standard PCA can be viewed as a special case of a more general optimization problem. We present two such generalization: the first one leads to projection pursuit algorithms while the second leads to a maximum likelihood formulation. Let be a unit vector and define the projection values
and a function acting on the projected values. The first generalization of PCA is defined as the maximization of :
As in the standard PCA, the following principal directions are obtained after removing the contribution of the current principal component from the data. When is the sample variance then we recover the standard PCA.
The second generalization interprets the computation of the principal component as a maximum likelihood estimation problem. By letting,
(5) 
be the Gaussian loglikelihood, the first principal direction can be obtained by solving the minimax problem:
Indeed, the inner maximization can be solved analytically which leads to the optimal solution
and
Unsurprisingly, the optimal values are the sample mean and the sample variance. Using the above formulas it is straightforward to show that
(6) 
Variations of PCA can be derived by changing the likelihood function and in the next section we analyze the case of Cauchy distribution.
3 Cauchy PCA
The Cauchy loglikelihood function is given by
(7) 
where and are the two parameters of the Cauchy distribution. The first Cauchy principal direction is also obtained by solving the minimax optimization problem:
(8) 
In contrast to the Gaussian case, the inner maximization cannot be performed analytically. Therefore an iterative approach needs to be utilized. Here, we apply the NewtonRaphson method with initial values the median and half the interquartile range for the location and scale parameters, respectively. According to Copas (1975), although the mean of the Cauchy distribution does not exist and it has infinite variance, the Cauchy loglikelihood function has a unique maximum likelihood estimate, .
Fixing and , the outer minimization is also nonanalytic and a fixed point iteration is applied to calculate . The iteration is given by
(9) 
where is the unnormalized direction which is obtained from the differentiation of the Lagrangian function with respect to and it is given by
(10) 
Once the first principal direction has been computed, its contribution from the dataset is removed and the same procedure to estimate the next principal direction is repeated. This iterative process is repeated times. The removal of the contribution makes the principal directions orthogonal to each other. We summarize the estimation of Cauchy principal components in the following pseudocode (Algorithm 1).
3.1 Robustness of the Leading Cauchy Principal Direction
Let be the parameter vector of the Cauchy distribution and consider the infinitesample normalized Cauchy loglikelihood function
(11) 
where and . We will estimate the influence function for the leading Cauchy principal direction
(12) 
where is the optimal Cauchy parameters for a given direction .
Since is restricted to be a unit vector, the standard condition for the minimum, i.e., is not valid. The proper condition is defined by
(13) 
where is the projection matrix given by .
Remark 3.1.
An equivalent condition is to satisfy for all such that and . Both derived conditions are essentially a consequence of the Lagrangian formulation of the constraint optimization problem. Indeed, the Lagrange condition implies that at the minimum the direction of the objective function’s derivative should be parallel to the direction of the constraint’s derivative which translates to where is the Lagrange multiplier.
Let be the likelihood function computed at and let denote its partial derivatives as
and
Similarly, , and denote the second order derivatives. The following proposition establishes the expression for the influence function of the leading Cauchy principal direction, .
Proposition 3.1.
Under the assumption of and being invertible matrices, the influence function of is
(14) 
where
and
while
is the expected Fisher information matrix under for the parameters of the Cauchy distribution computed at .
Proof.
The proof consists of several straightforward series expansions and implicit function calculations. The complete proof is given in Appendix A2. ∎
The following boundedness result for the influence function states the conditions under which Cauchy PCA is robust.
Corollary 3.1.
Let the assumptions of the proposition hold. If or if but then the influence function for is bounded.
Proof.
First, observe that matrix does not depend on . It is only that depends on and our goal is to prove that is bounded with respect to . Second, we have to compute the partial derivatives and . Straightforward calculations lead to
and
Let us now define an arbitrary scaling of the outlier and prove boundedness of as we send . We consider the first case where . It holds that , and therefore is bounded with respect to .
For the second case, we have
and
since by assumption. Thus is bounded with respect to for the second case, too. ∎
The only case not covered by the corollary is when and . Our experiments presented in the following section show that outliers that are orthogonal to the Cauchy principal direction do sometimes influence the estimation of the Cauchy principal direction yet not significantly.
3.2 Several Cauchy principal components
We briefly mention possibilities for estimating several Cauchy principal components. There are two obvious approaches: one approach, the sequential approach, is to repeat the algorithm described above on the subspace orthogonal to to obtain , the second Cauchy principal component, where is the first Cauchy principal component; then repeat the procedure on the subspace orthogonal to and to obtain ; and so on. A second approach, the simultaneous approach, is to decide in advance how many principal components we wish to determine, say, and then use a dimensional multivariate Cauchy likelihood, which has free parameters, to obtain . It turns out that these two approaches lead to equivalent results in classical (Gaussian) PCA but when a Cauchy likelihood is used the two approaches produce different sets of principal components. Our current thinking is this: the sequential approach is easier to implement (essentially the same software can be used at each step) and it is faster. However, the simultaneous approach could potentially be preferable if we know in advance how many principal components we wish to estimate. Further investigation is required.
4 Numerical Results
4.1 Simulation studies
In this section we will empirically validate our proposed methodology, via simulation studies. We searched for R packages that offer robust PCA in the case and came up with FastHCS (Vakili, 2018), rrcovHD (Todorov, 2016), rpca (Sykulski, 2017) and pcaPP (Filzmoser et al., 2018). Out of them, pcaPP (Projection Pursuit PCA) is the only one which does not require hyperparameter tuning, e.g. selection of the LASSO penalty or choice of the percentage of observations used to estimate a robust covariance matrix.
4.1.1 Setup of the simulations
Initially, we created a (orthonormal) basis
by using QR decomposition on some randomly generated data. We then generated eigenvalues
, where and hence we obtained the covariance matrix , where . The first column of served as the first “clean” eigenvector, and was the benchmark in our comparative evaluations. Following this step, we simulated random vectors and in order to check the robustness of the results to the center of the data, all observations were shifted right by adding everywhere. A number of outliers equal to 2 of the sample size were introduced. These outliers were , where is the sample mean vector, are unit vector(s) and a real number denoting their norm, where varied from up to increasing with a step size equal to and the angle between the outliers and the first “clean” eigenvector spanned from up to . In all cases, we subtracted the spatial median or the columnwise median^{2}^{2}2The results are pretty similar for either type of median and we here show the results of he columnwise median. and scaled them by the mean absolute deviation.At each case, we computed the first CauchyPCA eigenvector and the first PPPCA eigenvector. The performance metric is the angle (in degrees) between the first robust (based on Cauchy or PPPCA) eigenvector and the first ”clean” eigenvector computed using the classical PCA. All experiments were repeated times and the results were averaged.
4.1.2 Comparative results
Tables 13 present the performance of the first CauchyPCA eigenvector and of the first PPPCA eigenvector for a variety of norms of the outlier, with different angles () between the outlier and the leading true eigenvector, for the case.
The case of was selected as statistical inference in this case is more challenging than the case^{3}^{3}3In this paper we focus on highdimensional simulations and realdate examples ( but in results not presented in the paper we found that Cauchy PCA is also very competitive and performs strongly in low dimensional settings ().. Additionally, this case is also ordinarily met in the field of bioinformatics were the omics data count tens of thousands of variables (genes, single nucleotide polymorphisms, etc.) but only tens or at most hundreds of observations.
As observed in Tables 13, the average angular difference between the Cauchy and the PP PCA ranges from up to more than , which is evidently quite substantial, providing evidence that Cauchy PCA has performed in a superior manner to the projection pursuit method of Croux et al. (2007, 2013). In particular, the tables demonstrate that Cauchy PCA is less error prone than its competitor but, as is seen in Table 3, the error decreases for both methods with increasing sample size. Further, the mean angular difference between the two methods increases as the angle increases. For instance, in Table 1, when and the difference between the two methods is , whereas when the difference increases to . Further, the error is not highly affected by the angle , or the norm of the outliers. It can be seen that in Table 2 and Table 3 in the special case of , the error increases for the Cauchy PCA by , thus corroborating the result of Corollary 3.1. However, this effect, as in Table 1, is rather small, though noticeable.
Angle  Method  k=Inf  k=3  k=4  k=5  k=6  k=7  k=8 

Cauchy  31.17  29.79  29.54  28.83  28.86  29.24  28.78  
PP  82.45  49.91  48.84  48.22  49.08  49.61  48.14  
Cauchy  31.44  29.24  29.13  28.60  28.89  29.34  29.65  
PP  82.45  65.28  65.34  63.42  62.96  66.63  65.43  
Cauchy  31.49  29.86  29.07  29.04  29.55  29.70  29.09  
PP  82.11  81.11  82.55  82.63  82.12  82.49  82.03  
Cauchy  32.32  31.67  33.00  33.13  32.86  33.19  33.06  
PP  82.38  82.06  81.69  82.12  81.73  81.74  81.88 
Angle  Method  k=Inf  k=3  k=4  k=5  k=6  k=7  k=8 

Cauchy  36.53  33.12  33.60  33.69  32.62  32.51  33.16  
PP  83.06  80.36  80.17  81.87  80.50  80.76  80.16  
Cauchy  36.55  34.72  33.91  33.09  33.11  33.16  32.79  
PP  83.07  82.36  82.76  82.65  83.07  82.93  83.12  
Cauchy  36.42  34.46  33.96  33.61  34.41  33.07  33.47  
PP  83.78  82.86  82.71  84.05  83.46  82.71  82.78  
Cauchy  36.50  36.12  36.81  37.18  39.34  39.11  38.51  
PP  83.63  83.73  83.69  83.65  84.03  83.66  83.00 
Angle  Method  k=Inf  k=3  k=4  k=5  k=6  k=7  k=8 

Cauchy  19.95  18.60  18.46  18.35  18.24  18.20  17.93  
PP  68.76  26.08  24.93  24.91  24.83  24.73  24.72  
Cauchy  19.43  18.30  18.39  18.22  18.16  18.01  18.13  
PP  68.98  39.72  38.88  38.44  38.20  38.15  38.14  
Cauchy  19.76  18.60  18.12  18.20  18.40  18.19  18.01  
PP  69.10  64.10  63.12  62.89  62.91  62.82  62.77  
Cauchy  19.49  19.84  20.16  21.87  22.41  22.87  22.84  
PP  68.99  68.62  68.59  68.70  68.45  68.73  68.43 
4.2 High dimensional real datasets
Two real gene expression datasets, GSE13159 and GSE31161^{4}^{4}4From a biological standpoint, the data have already been uniformly preprocessed, curated and automatically annotated., downloaded from the Biodataome platform (Lakiotaki et al., 2018), were used in the experiments. The dimensions of the datasets were equal to and , respectively. We randomly selected variables and computed the outliers using the high dimensional Minimum Covariance Determinant (MCD) of Ro et al. (2015). In accordance with the simulations studies, we removed the
of the most extreme outliers detected by MCD and computed the first classical PC (benchmark eigenvector), the first CauchyPCA eigenvector and the first PPPCA eigenvector of the ”clean” data. We then added those outliers and increased their norm by
, where and computed computed the first CauchyPCA eigenvector and the first PPPCA eigenvector. In all cases, we subtracted the spatial median or the columnwise median and scaled them by the mean absolute deviation. The performance metric is the angle (in degrees) between the first robust (based on Cauchy or PPPCA) eigenvector and the first true “clean” eigenvectors and the time required by each method. This procedure was repeated times and the average results are graphically displayed in Figures 1(a)(d).Broadly speaking the effect of the PP PCA does not seem to have been affected substantially by the centering method, i.e. subtraction of the spatial or the columnwise median. On the contrary, the Cauchy PCA is affected by the type of median employed to this end. Centering with the spatial median yields high error levels for all norms of the outliers, for both datasets, whereas centering with the columnwise median produces much lower error levels. On average, the difference in the error between Cauchy PCA and PP PCA is about for the GSE31159 dataset (Figure 1(a)) and about for the GSE3161 dataset (Figure 1(b)). However, the error of the Cauchy PCA increases and the stabilizes in the GSE31159 dataset whereas the error of the PP PCA is stable regardless of the norm of the outliers. A different conclusion is extracted in the GSE31161 where the error of either method decreases as the norm of the outliers increases, until it reaches a plateau.
With regards to computational efficiency, the PP PCA is not affected by either centering method, whereas Cauchy PCA seems to be affected in the GSE31159 dataset but not in the GSE31161 dataset as seen in Figures 1(c) and 1(d). Cauchy PCA centered with the columnwise median is, on average, 5 times faster than PP PCA.
GSE31159  GSE31161 
(a)  (b) 
(c)  (d) 
5 Conclusion
The starting point for this paper is the observation that classical PCA can be formulated purely in terms of operations on a Gaussian likelihood. Although this observation is not new, the specifics of this formulation of classical PCA do not appear to be as widely known as might be expected. The novel idea underlying this paper is to formulate a version of PCA in which a Cauchy likelihood is used instead of a Gaussian likelihood, leading to what we call Cauchy PCA. Study of the resulting influence functions shows that Cauchy PCA has very good robustness properties. Moreover, we have provided an implementation of Cauchy PCA which runs quickly and reliably. Numerous simulation and realdata examples, mainly in highdimensional settings, show that Cauchy PCA typically outperforms alternative robust versions of PCA whose implementation is in the public domain.
Appendix
A1 Proof of Proposition 2.1
Proof.
The perturbed distribution has perturbed mean value
and perturbed covariance matrix
Denoting by the leading eigenvalue of and by the corresponding eigenvector, it holds that
(15) 
Next, we expand the perturbed eigenvector and eigenvalue around the unperturbed ones as follows:
and
with
Multiplying (16) from the left with , we get
A2 Proof of Proposition 3.1
Let us first make the symbolism more explicit and denote the Cauchy loglikelihood function with respect to the distribution and the respective leading Cauchy principal direction. Then, our goal is to calculate the limit of
as where is the leading Cauchy principal direction for the distribution . The optimality condition for the leading Cauchy principal direction reads
(17) 
and
Moreover, is a unit vector which can be represented as
where is a unit vector perpendicular to and is a (small) real number. Under these assumptions, is a unit vector since
Obviously, depends on and (i.e., ) and but we choose to avoid denoting their explicit relationship because it is not required in our proof. Moreover, a Taylor expansion for the representation leads to
thus we obtain that
Next, we compute the partial derivative using the chain rule
Therefore,
The second summand equals to zero because maximizes the Cauchy loglikelihood function thus it holds that . Similarly,
Next, we further Taylor expand using
Using again the chain rule, we obtain that
The computation of the partial derivative follows. Formula