Iteratively Reweighted Least Squares Algorithms for L1-Norm Principal Component Analysis

09/10/2016 · by Young Woong Park, et al. · Southern Methodist University · Northwestern University

Principal component analysis (PCA) is often used to reduce the dimension of data by selecting a few orthonormal vectors that explain most of the variance structure of the data. L1 PCA uses the L1 norm to measure error, whereas the conventional PCA uses the L2 norm. For the L1 PCA problem minimizing the fitting error of the reconstructed data, we propose an exact reweighted algorithm and an approximate algorithm, both based on iteratively reweighted least squares. We provide convergence analyses and compare the performance of the algorithms against benchmark algorithms in the literature. The computational experiments show that the proposed algorithms consistently perform best.

I Introduction

Principal component analysis (PCA) is a technique to find orthonormal vectors, which are linear combinations of the attributes of the data, that explain the variance structure of the data [12]. Since a few orthonormal vectors usually explain most of the variance, PCA is often used to reduce the dimension of the data by keeping only a few of the orthonormal vectors. These orthonormal vectors are called principal components (PCs).

For dimensionality reduction, we are given the target dimension p, the number of PCs. To measure accuracy, given the principal components, the original data is first projected onto the lower dimensional space using the PCs. Next, the projected data in the lower dimension is lifted back to the original dimension using the PCs. Observe that this procedure causes a loss of information if p is smaller than the dimension of the original attribute space. The reconstruction error is defined by the difference between the projected-and-lifted data and the original data. To select the best PCs, the following two objective functions are usually used:

minimization of the reconstruction error (P1),
maximization of the variance of the projected data (P2).

The traditional measure to capture the errors and variances in P1 and P2 is the L2 norm. For each observation, we have the reconstruction error vector for P1 and the projected vector for P2. Then, the squared L2 norms of these vectors are added over all observations to define the total reconstruction error and the total variance for P1 and P2, respectively. In fact, in terms of a matrix norm, we optimize the Frobenius norm of the reconstruction error matrix and of the projected data matrix for P1 and P2, respectively. With the L2 norm as the objective function, P1 and P2 are actually equivalent. Further, P2 can be efficiently solved by singular value decomposition (SVD) of the data matrix or the eigenvalue decomposition (EVD) of the covariance matrix of the data. However, the L2 norm is sensitive to outliers. As an alternative, PCA based on the L1 norm has been used to find more robust PCs.

For P1, instead of the L2 norm, we minimize the sum of the L1 norms of the reconstruction error vectors over all observations. A few heuristics have been proposed for this minimization problem. The heuristic proposed in [1] is based on a canonical correlation analysis. The iterative algorithm in [14] assumes that the projected-and-lifted data is a product of two matrices and then iteratively optimizes by fixing one of the two matrices. The algorithm in [3] sequentially reduces the dimension by one. The algorithm is based on the observation that the projection from the current dimension to the best fit subspace of one lower dimension can be found by solving several linear programming (LP) problems for least absolute deviation regression. The algorithms in [14] and [3] actually try to find the best fitting subspace, where in the objective function the original data is approximated by the product of two matrices, the PC and score matrices. This approximation is not the same as the reconstruction by the PCs, although the ultimate goal is still minimizing the reconstruction errors.

The L1 norm for P2 has also been studied. This problem is often called projection pursuit L1-PCA. In this context, we maximize the sum of the L1 norms of the projected observation vectors over all observations. However, in contrast to the conventional L2 norm based PCA, the solutions of P1 and P2 with the L1 norm might not be the same. The work in [9] studies L1-norm based covariance matrix estimation, while the works in [6, 7, 15, 17] and [20] directly consider P2. The algorithm in [15] finds a locally optimal solution by sequentially obtaining one PC that is orthogonal to the previously obtained PCs based on a greedy strategy. Recently, the work in [20] extended the algorithm in [15] using a non-greedy strategy. The works in [18] and [19] show that P2 with one PC is NP-hard when the numbers of observations and attributes are jointly arbitrarily large. The work in [18] provides a polynomial time algorithm when the number of attributes is fixed.

As the objective functions are different, solving P1 and P2 with different norms gives different solutions in terms of the signs and order of the PCs. In Figure 1, we present heat maps of the PCs obtained by solving L2-PCA, P1 with the L1 norm, and P2 with the L1 norm with p = 5 for data set cancer_2, presented in Table I and used in the experiment in Section III-C. The three heat maps represent the matrices of PCs of L2-PCA (left), P1 with the L1 norm (center), and P2 with the L1 norm (right). The rows and columns of each matrix represent the original attributes and the PCs, respectively. The blue, white, and red cells represent the intensity of positive, zero, and negative values. Note that both L1-PCA variations give Attr 1 a large negative loading in either the first or second PC, whereas L2-PCA gives Attr 1 a large positive loading in the third PC. Attr 9 has a large negative loading for the fifth PC of P1 with the L1 norm, while L2-PCA gives Attr 9 a large positive loading in the second PC.

Fig. 1: Heat maps of 5 principal components by L2-PCA (left), P1 with the L1 norm (center), and P2 with the L1 norm (right) for the cancer_2 data set

In this paper, we propose two iterative algorithms for P1 with the L1 norm and provide analytical convergence results. Although we propose iterative algorithms, our focus is the objective function and thus our algorithms are not directly comparable with iterative algorithms for the standard L2-PCA such as algorithms based on EM [21] or NIPALS [10]. For P1 with the L1 norm, the first proposed algorithm, the exact reweighted version, is based on iteratively reweighted least squares (IRLS), which gives a weight to each observation. The weighted least squares problem is then converted into the standard L2-PCA problem with a weighted data matrix, and the algorithm iterates over different weights. We show that the algorithm gives convergent weights and that the weighted data matrices have convergent eigenvalues. We also propose an approximate version in order to speed up the algorithm and to obtain convergent eigenvectors. The work in [8] provides an IRLS algorithm for sparse recovery. The reader is referred to [13] for a review of IRLS applied in different contexts. Recently, the work in [24] studied the minimization of a differentiable function of an orthogonal matrix. In the computational experiment, we compare our algorithms with the benchmark algorithms in [3, 14, 15, 20]. The experiment shows that our algorithm for P1 regularly outperforms the benchmark algorithms with respect to solution quality and its computational time is of the same order as the fastest among the other four. Even though L1-PCA can be used for building robust PCs and is an alternative to other robust PCA approaches such as the work in [5], we limit the comparison to the L1-PCA objective functions introduced in Section II, as our goal is to directly optimize the L1 norm for P1.

Our contributions can be summarized as follows.

  1. For P1 with the L1 norm, we propose the first IRLS algorithm in the literature, and show the convergence of the eigenvalues of the weighted matrices. The algorithm directly minimizes the reconstruction error, while the other benchmark algorithms primarily try to find the optimal subspace. An additional advantage of our algorithm is that it uses an L2-PCA algorithm as a black box. Hence, by using a more scalable algorithm, the practical time complexity of our algorithm can be reduced further.

  2. We propose an approximate version to speed up the algorithm and to guarantee convergent eigenvectors. The difference is that the approximate version uses a formula to update the eigenpairs when the changes in the weighted data matrices become relatively small.

  3. The results of the computational experiment show that the proposed algorithms for P1 outperform the benchmark algorithms in most cases.

The rest of the paper is organized as follows. In Section II, we present the algorithms and show all analytic results. Then in Section III, the computational experiment is presented.

II Algorithms for L1-PCA

In this section, we present the algorithms for P1 and show the underlying analytic results. We use the following notation.

  1. n: number of observations of the data

  2. m: number of attributes of the data

  3. p: number of principal components (target dimension)

  4. N = {1, ..., n}: index set of the observations

  5. M = {1, ..., m}: index set of the attributes

  6. P = {1, ..., p}: index set of the PCs

  7. A: n by m data matrix with elements a_ij for i in N, j in M

  8. X: m by p principal components matrix with elements x_jk for j in M, k in P

  9. Y: n by p projected data matrix with elements y_ik for i in N, k in P, defined as Y = AX

  10. E: n by m reconstruction error matrix with elements e_ij for i in N, j in M, defined as E = A - A X X^T

  11. I_p: p by p identity matrix

For a matrix B with elements b_ij, i in N, j in M, we denote by ||B||_F = ( sum_{i in N} sum_{j in M} b_ij^2 )^{1/2} the Frobenius norm.

The conventional PCA problem, P1 with the L2 norm, can be written as

min_X ||A - A X X^T||_F^2   subject to   X^T X = I_p.    (1)

Note that X X^T is different from the identity matrix in the objective function, since X has only p < m columns; hence the reconstruction A X X^T generally differs from A.

We consider (1) with the L1 norm instead of the L2 norm in the objective function. The resulting P1 problem is written as

min_X ||A - A X X^T||_L1   subject to   X^T X = I_p,    (2)

where ||B||_L1 = sum_{i in N} sum_{j in M} |b_ij| denotes the entry-wise L1 norm.
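
As a point of reference for the remainder of the section, the objective of (2) is straightforward to evaluate directly. The following minimal R sketch (our own illustration, not the authors' code; the function name is ours) computes the entry-wise L1 reconstruction error for a data matrix A and an orthonormal loading matrix X.

l1_recon_error <- function(A, X) {
  E <- A - A %*% X %*% t(X)  # reconstruction error matrix E = A - A X X^T
  sum(abs(E))                # entry-wise L1 norm, the objective of (2)
}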

Next we present an iterative algorithm for (2) to minimize the reconstruction error. Instead of solving (2) directly, we iteratively solve a weighted version of (1) by giving a different weight to each observation.

We rewrite (2) in the following non-matrix form:

min   sum_{i in N} sum_{j in M} |e_ij|    (3a)
s.t.  e_ij = a_ij - sum_{k in P} y_ik x_jk,   i in N, j in M,    (3b)
      y_ik = sum_{j in M} a_ij x_jk,   i in N, k in P,    (3c)
      sum_{j in M} x_jk x_jl = 0,   k, l in P, k != l,    (3d)
      sum_{j in M} x_jk^2 = 1,   k in P,    (3e)
      e_ij, y_ik, x_jk in R,   i in N, j in M, k in P.    (3f)

Note that the corresponding L2-PCA problem (1) can also be written as

min   sum_{i in N} sum_{j in M} e_ij^2   subject to (3b) - (3f),    (4)

since the only difference between (1) and (2) is the objective function. However, there is no known algorithm that solves (3) optimally, whereas (4) can be solved by SVD or EVD. Hence, we want to take advantage of the fact that (4) can be optimally solved.
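
For concreteness, the following is a minimal R sketch of such an SVD-based subroutine (referred to as L2PCA later in this section); it returns the p leading right singular vectors of its input matrix. The paper treats this routine as a black box, so any SVD or EVD implementation could be substituted.

L2PCA <- function(A, p) {
  svd(A, nu = 0, nv = p)$v  # m x p matrix of the p leading right singular vectors
}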

Let us consider a weighted version of (4):

min   sum_{i in N} w_i sum_{j in M} e_ij^2   subject to (3b) - (3f),    (5)

with w_i >= 0 for every i in N. Note that (4) and (5) are equivalent when w_i = 1 for all i in N. However, solving (5) with non-constant w_i's is not easy in its original form due to the orthogonality constraint. Instead, let us define the weighted data matrix Ā with each element defined as

ā_ij = sqrt(w_i) a_ij,   i in N, j in M.    (6)

In the following proposition, we show that an optimal solution to (5) can be obtained by SVD of Ā.

Proposition 1.

Solving (5) with A is equivalent to solving (4) with Ā.

Proof.

Let X be an optimal solution to (4) with Ā. We claim that X attains the same objective function value for (5) with A, and vice versa. Let E = A - A X X^T and Ē = Ā - Ā X X^T. Since ā_ij = sqrt(w_i) a_ij, we derive

ē_ij = sqrt(w_i) a_ij - sum_{k in P} ( sum_{l in M} sqrt(w_i) a_il x_lk ) x_jk = sqrt(w_i) e_ij,

and hence sum_{i in N} sum_{j in M} ē_ij^2 = sum_{i in N} w_i sum_{j in M} e_ij^2, i.e., the objective value of (4) with Ā equals the objective value of (5) with A for the same X. Further, since the orthonormality constraints (3d) and (3e) involve only X, they are automatically satisfied for both problems. Hence, solving (4) with Ā is equivalent to solving (5) with A. ∎
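
The equivalence in Proposition 1 can also be checked numerically. The following R snippet (an illustration on random data, assuming the L2PCA sketch given earlier) verifies that the weighted objective of (5) evaluated with A equals the objective of (4) evaluated with Ā for the same loading matrix X.

set.seed(1)
A <- matrix(rnorm(100 * 8), 100, 8)
w <- runif(100, 0.5, 2)                      # arbitrary positive weights
Abar <- sqrt(w) * A                          # weighted data matrix (6): row i scaled by sqrt(w_i)
X <- L2PCA(Abar, p = 3)
E <- A - A %*% X %*% t(X)
c(obj_5_with_A    = sum(w * rowSums(E^2)),   # objective of (5) with A
  obj_4_with_Abar = sum((Abar - Abar %*% X %*% t(X))^2))  # objective of (4) with Abar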

Since now we know that (5) can be solved optimally, the remaining task is to define appropriate weights that give a good solution to (3). We first provide intuition behind our choices.

Let (X*, Y*, E*) be an optimal solution to (3) and let

w*_i = ( sum_{j in M} |e*_ij| ) / ( sum_{j in M} (e*_ij)^2 )   if sum_{j in M} |e*_ij| != 0,   and   w*_i = C   otherwise,    (7)

be weights defined based on E*, where C is a large number. Note that the value of w*_i, for the case sum_{j in M} |e*_ij| = 0, does not affect the value of (5) because the reconstruction error is zero for the corresponding observation. However, considering the fact that we want to give less weight to the outliers in order to reduce their effect on the objective function, it is reasonable to assign a big number to the observations with zero error.

With w* defined in (7), it is trivial to show

min { sum_{i in N} w*_i sum_{j in M} e_ij^2 : (3b) - (3f) }   <=   sum_{i in N} w*_i sum_{j in M} (e*_ij)^2   =   sum_{i in N} sum_{j in M} |e*_ij|.    (8)

The equality in (8) implies that, given X* and w*, the objective function values of (3) and (5) are equal. The inequality in (8) implies that, given w*, the optimal objective function value of (5) gives a lower bound on the optimal objective function value of (3). Hence, we aim to minimize the objective function of (3) by solving (5), hoping that the optimal solutions of (3) and (5) are not far from each other.

The equality and inequality in (8) give motivation to use a weight formula similar to (7). Before presenting the weight update formula and the algorithm, let us define the following notation for the algorithm.

  1. t: current iteration counter

  2. w^t: weight vector used in iteration t, with elements w^t_i for i in N

  3. W^t: diagonal matrix in iteration t with the w^t_i's on the diagonal

  4. Ā^t: weighted data matrix defined in (6) with w^t in iteration t, defined as Ā^t = (W^t)^{1/2} A

  5. X^t: the principal component matrix obtained by SVD of Ā^t in iteration t

  6. E^t: reconstruction error matrix in iteration t, with elements e^t_ij for i in N, j in M, defined as E^t = A - A X^t (X^t)^T

  7. L2PCA(Ā^t, p): subroutine that returns X^t by solving (4) with Ā^t

  8. z(X^t): objective function value of X^t for (3), defined as z(X^t) = sum_{i in N} sum_{j in M} |e^t_ij|

  9. z_best: current best objective function value

  10. X_best: principal component matrix associated with z_best

Note that X^t is obtained by SVD of Ā^t, but z(X^t) is based on the original data matrix A and is thus different from the objective function value of (4) with Ā^t.

Motivated by (7), for iteration t + 1, we define

ŵ^{t+1}_i = ( sum_{j in M} |e^t_ij| ) / ( sum_{j in M} (e^t_ij)^2 )   if sum_{j in M} |e^t_ij| != 0,   and   ŵ^{t+1}_i = w_max^{t+1}   otherwise,    (9)

where w_max^{t+1} is the largest weight among the observations in N with nonzero error. Using E^t in (9) for the weights is natural and we empirically observe that the algorithm is convergent. However, it is not trivial to show the convergence with (9) for (5). Hence, in order to show the convergence of the algorithm, we present a modified update formula based on (9):

w^{t+1}_i = max { (1 - β^t) w^t_i,  min { (1 + β^t) w^t_i,  ŵ^{t+1}_i } },    (10)

where 0 < β < 1. Note that β^t is β to the power of t and is different from other superscript-containing notations such as w^t or E^t. The role of (10) is to bound the change of w^{t+1}_i from w^t_i. If ŵ^{t+1}_i is too small compared to w^t_i, then w^{t+1}_i is assigned the value (1 - β^t) w^t_i. If ŵ^{t+1}_i is too large compared to w^t_i, then w^{t+1}_i obtains the value (1 + β^t) w^t_i. Otherwise, w^{t+1}_i follows the weight formula in (9). Given β < 1, we have sum_t β^t < ∞, which implies prod_t (1 + β^t) < ∞. Further, since the w^t_i are bounded above and below, we can show that w^t is convergent. By setting β close to 1, we would have (9) in most cases, as 1 - β^t and 1 + β^t are close to 0 and 2 for small values of t, i.e., early iterations. From all these facts and by using elementary mathematics, the following lemma follows.
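
The bounded update can be sketched in R as follows. Since the formulas (9) and (10) above are reconstructions, the sketch should be read as an illustration of the clamping idea, not as the authors' implementation; the handling of zero-error observations is an assumption.

update_weights <- function(E, w_prev, beta, t) {
  l1 <- rowSums(abs(E))                              # per-observation L1 reconstruction error
  l2 <- rowSums(E^2)                                 # per-observation squared L2 error
  w_raw <- ifelse(l1 > 0, l1 / l2, NA)               # IRLS weight as in (9)
  if (anyNA(w_raw))
    w_raw[is.na(w_raw)] <- max(w_raw, na.rm = TRUE)  # zero-error rows get the largest weight
  lo <- (1 - beta^t) * w_prev                        # lower bound in (10)
  hi <- (1 + beta^t) * w_prev                        # upper bound in (10)
  pmin(pmax(w_raw, lo), hi)                          # clamp the change of the weights
}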

Lemma 1.

With w^t defined in (10), w^t and Ā^t are convergent in t.

We present the overall algorithmic framework in Algorithm 1. The algorithm requires the data matrix A, target dimension p, and tolerance parameter ε as input. After the initialization of the weights and the best objective function value in Step 1, the while loop is executed until w^t and w^{t-1} are close enough. In each iteration of the while loop, Ā^t is constructed based on w^t and X^t is obtained by SVD of Ā^t (Steps 3 and 4). If X^t gives a lower objective function value than z_best, then z_best and X_best are updated (Step 5). Recall that X^t is obtained by using Ā^t, but z(X^t) uses the original data matrix A. Each iteration ends after the update of the weights in Step 6. Observe that the termination criterion in Step 2 solely depends on the convergence of w^t. Hence, the algorithm terminates in a finite number of iterations.

0:  Input: A (data), p (target dimension), ε (tolerance), β
0:  Output: principal components X_best
1:  t ← 1, w^1_i ← 1 for all i in N, z_best ← ∞, X_best ← 0
2:  while t = 1 or ||w^t - w^{t-1}|| > ε do
3:     set Ā^t based on w^t
4:     X^t ← L2PCA(Ā^t, p)
5:     if z(X^t) < z_best then z_best ← z(X^t), X_best ← X^t
6:     update w^{t+1} by using (10) given E^t
7:     t ← t + 1
8:  end while
Algorithm 1 wPCA (Weight-based algorithm for L1-PCA)
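
Putting the pieces together, a compact R sketch of Algorithm 1 is given below. It reuses the illustrative helpers l1_recon_error, L2PCA, and update_weights defined earlier; the default parameter values are our own choices, not the tuned values used in Section III.

wPCA <- function(A, p, eps = 1e-4, beta = 0.99, max_iter = 200) {
  n <- nrow(A)
  w <- rep(1, n)                                  # Step 1: initial weights
  w_old <- rep(Inf, n)
  z_best <- Inf; X_best <- NULL
  t <- 1
  while (max(abs(w - w_old)) > eps && t <= max_iter) {  # Step 2: stop when the weights stabilize
    Abar <- sqrt(w) * A                           # Step 3: weighted data matrix (6)
    X <- L2PCA(Abar, p)                           # Step 4: SVD of the weighted matrix
    z <- l1_recon_error(A, X)                     # objective of (3), evaluated on the original A
    if (z < z_best) { z_best <- z; X_best <- X }  # Step 5: keep the incumbent
    E <- A - A %*% X %*% t(X)
    w_old <- w
    w <- update_weights(E, w, beta, t)            # Step 6: bounded weight update (10)
    t <- t + 1
  }
  list(loadings = X_best, objective = z_best)
}
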
Lemma 2.

The eigenvalues of (Ā^t)^T Ā^t are convergent in t.

Proof.

Recall that the weights and the weighted matrix Ā^t are convergent, which also implies that (Ā^t)^T Ā^t is convergent. Since the eigenvalues of symmetric matrices are point-wise convergent if the matrices are convergent [16], it is trivial to see that the eigenvalues of (Ā^t)^T Ā^t are convergent. ∎

Hence, Algorithm 1 gives convergent eigenvalues. Although the eigenvalues are convergent, it is not trivial to show the convergence of the eigenvectors X^t. This is because even a slight change in a matrix can cause a change in an eigenvector, and eigenvalue-eigenvector pairs are not unique.

In order to provide convergent eigenpairs and accelerate the algorithm for large scale data, we use the first order eigenpair approximation formula from [23]. Let (λ^t_k, v^t_k), k in P, be approximate eigenpairs of Q^t = (Ā^t)^T Ā^t and let ΔQ^t = Q^{t+1} - Q^t. Then, approximate eigenpairs of Q^{t+1} can be obtained by

λ^{t+1}_k = λ^t_k + (v^t_k)^T ΔQ^t v^t_k,    (11)
v^{t+1}_k = v^t_k + sum_{l != k} [ ( (v^t_l)^T ΔQ^t v^t_k ) / ( λ^t_k - λ^t_l ) ] v^t_l,    (12)

using the formula in [23]. The error is of the order of ||ΔQ^t||^2. Let L2PCA_Approx be a function that returns principal components by formulas (11) - (12). The modified algorithm is presented in Algorithm 2.
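
A small R sketch of such an update routine follows; it uses only the p tracked eigenpairs in the correction sum of (12) and assumes distinct eigenvalues, which are simplifications on our part.

approx_eigen_update <- function(lambda, V, dQ) {
  # lambda: length-p vector of eigenvalues of Q^t; V: m x p eigenvector matrix; dQ: Q^{t+1} - Q^t
  C <- t(V) %*% dQ %*% V                  # C[l, k] = v_l' dQ v_k
  lambda_new <- lambda + diag(C)          # formula (11)
  V_new <- V
  p <- ncol(V)
  for (k in 1:p) {
    for (l in setdiff(1:p, k)) {
      V_new[, k] <- V_new[, k] + C[l, k] / (lambda[k] - lambda[l]) * V[, l]  # formula (12)
    }
  }
  list(values = lambda_new, vectors = V_new)
}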

0:  Input: A (data), p (target dimension), ε (tolerance), β, δ
0:  Output: principal components X_best
1:  t ← 1, w^1_i ← 1 for all i in N, z_best ← ∞, X_best ← 0
2:  while t = 1 or ||w^t - w^{t-1}|| > ε do
3:     set Ā^t based on w^t, Q^t ← (Ā^t)^T Ā^t
4:     if t = 1 or ||Q^t - Q^{t-1}|| > δ then X^t ← L2PCA(Ā^t, p)
5:     else X^t ← L2PCA_Approx(Q^t - Q^{t-1}, X^{t-1}, λ^{t-1})
6:     if z(X^t) < z_best then z_best ← z(X^t), X_best ← X^t
7:     update w^{t+1} by using (10) given E^t
8:     t ← t + 1
9:  end while
Algorithm 2 awPCA (Approximated weight-based algorithm for L1-PCA)

The difference from Algorithm 1 is only in Lines 4 and 5. If the change in Q^t is large (greater than δ), we use the original procedure L2PCA. If the change in Q^t is small (less than or equal to δ), then we use the update formulas (11) and (12). Algorithm 2 has the following convergence result.
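
The switch in Lines 4 and 5 can be sketched in R as follows, reusing the illustrative L2PCA and approx_eigen_update helpers from above; prev collects the quantities carried over from the previous iteration and delta is the threshold on the change of Q^t.

awPCA_step <- function(Abar, p, prev, delta) {
  Q <- crossprod(Abar)                                 # Q^t = (Abar^t)' Abar^t
  if (is.null(prev) || norm(Q - prev$Q, type = "F") > delta) {
    V <- L2PCA(Abar, p)                                # Line 4: full SVD-based solve
    lambda <- diag(t(V) %*% Q %*% V)                   # eigenvalue estimates for the next update
  } else {
    eg <- approx_eigen_update(prev$lambda, prev$V, Q - prev$Q)  # Line 5: formulas (11)-(12)
    V <- eg$vectors; lambda <- eg$values
  }
  list(Q = Q, lambda = lambda, V = V)
}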

Proposition 2.

The eigenpairs of Q^t = (Ā^t)^T Ā^t in Algorithm 2 are convergent in t.

Proof.

Note that the convergence of the eigenvalues of Q^t in Lemma 2 does not depend on how the eigenvectors are obtained, and thus it holds regardless of whether Line 4 or Line 5 is executed in each iteration. Hence, the eigenvalues of Q^t are convergent in Algorithm 2.

Since ||Q^{t+1} - Q^t|| converges to zero, after a certain number of iterations T, we have ||Q^{t+1} - Q^t|| <= δ for all t >= T. For such large t, the approximation rule in Line 5 applies. In [23], it is shown that the approximation error of (11) - (12) is of the order of ||ΔQ^t||^2 when (λ^t_k, v^t_k) are the exact eigenpairs. Since all of the terms in the formulas are bounded and ΔQ^t converges to zero, we conclude that the eigenpairs of Q^t are convergent. ∎

III Computational Experiment

We compare the performance of the proposed and benchmark algorithms for varying instance sizes (m, n) and numbers of PCs p. All experiments were performed on a personal computer with 8 GB RAM and an Intel Core i7 (2.40GHz dual core). We implement Algorithms 1 and 2 in R [22], which we denote as wPCA and awPCA, respectively. The R script for wPCA and awPCA is available at http://dynresmanagement.com/uploads/3/3/2/9/3329212/wl1pca.zip. For the awPCA implementation, we use a condition on the change in the weights instead of the condition on the change in Q^t, to avoid the unnecessary calculation of Q^t - Q^{t-1} in Line 4 of Algorithm 2. The weight-based condition is similar to the original condition, as the difference in the weights captures the difference in Q^t. For the experiment, we use values of ε and β for wPCA and awPCA and of δ for awPCA that are tuned based on pilot runs to balance solution quality and execution time. We also set the maximum number of iterations to 200. We compare our algorithms with the algorithms in [3, 14, 15, 20]. The work in [4] provides R implementations of the algorithms in [3, 14, 15], which we denote as Brooks, Ke, and Kwak, respectively. We implement the algorithm in [20] in R, which we denote as Nie.

Although the algorithms Ke and Brooks are designed for (2) and Kwak and Nie are designed for the L1 norm version of P2, we evaluate the objective function value of (2) for all benchmark algorithms and compare them against our algorithms. In particular, Kwak and Nie, which solve a different L1-PCA problem, are included because

  • we observed that Kwak and Nie are better than Ke and Brooks for (2) for some instances, and

  • we found that Kwak and Nie are more scalable and solve larger data sets in a reasonable time in the experiment.

Therefore, we include Kwak and Nie in the comparison for solving P1.

It is worth noting that Ke and Brooks try to find the best fitting subspace, where the definition of E in (2) is replaced by E = A - U V^T for a score matrix U and a PC matrix V. The optimal solutions of the two formulations may be different despite both minimizing the distance from the original data.

Let z_algo represent the objective function value, with respect to (2), obtained by algo in {Ke, Brooks, Kwak, Nie, wPCA, awPCA}. For the comparison purposes for awPCA, we use the gap from the best objective function value, defined as

gap_algo = ( z_algo - min{ z_awPCA, z_Ke, z_Brooks, z_Kwak, z_Nie } ) / min{ z_awPCA, z_Ke, z_Brooks, z_Kwak, z_Nie },

for each algo in {awPCA, Ke, Brooks, Kwak, Nie}. Similarly, for wPCA, we define

gap_algo = ( z_algo - min{ z_wPCA, z_Ke, z_Brooks, z_Kwak, z_Nie } ) / min{ z_wPCA, z_Ke, z_Brooks, z_Kwak, z_Nie },

for each algo in {wPCA, Ke, Brooks, Kwak, Nie}. Note that gap_algo represents the gap between algo and the best of all compared algorithms. Note also that we set an upper bound of 1 for gap_algo. Hence, if the gap is larger than 1 (or 100%), then gap_algo is assigned the value 1 (or 100%).
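
In R, the reported gap can be computed as below; the function name and the example objective values are made up for illustration only.

gap_from_best <- function(z) {
  pmin((z - min(z)) / min(z), 1)  # relative gap to the best value, capped at 1 (100%)
}
# e.g., gap_from_best(c(awPCA = 10.2, Ke = 11.5, Brooks = 10.4, Kwak = 12.0, Nie = 25.0))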

For all of the instances used in the experiment, we first standardize each column and deal exclusively with the standardized data. Hence, the reconstruction errors are also calculated based on the standardized data.

In the computational experiment, we observed that z_wPCA and z_awPCA are very similar while the execution time of awPCA is much shorter. Hence, in this section, we first focus on presenting the performance of awPCA against the benchmark algorithms and afterwards on comparing the difference between wPCA and awPCA.

The rest of the section is organized as follows. In Section III-A, we present the synthetic instance generation procedures and explain the instances from the UCI Machine Learning Repository [2]. In Sections III-B and III-C, we present the performance of awPCA for the synthetic and UCI instances, respectively. In Section III-D, we compare the performance of wPCA and awPCA for the UCI instances.

III-A Instances

III-A1 Synthetic Instances

In order to provide a systematic analysis, we generate synthetic instances with outliers present, for various (m, n, α), where m in {20, 50}, n in {100, 300}, and α is the percentage of outlier observations (four values of α are used). For each (m, n, α) triplet, we generate 5 distinct instances. Hence, we have a total of 80 generated instances. The synthetic instances used in the experiment are available at http://dynresmanagement.com/uploads/3/3/2/9/3329212/pca_instance_park_klabjan.zip. The detailed algorithm is presented at the end of this section. In the generation procedure, α% of the observations are generated to have a higher variance than the remaining normal observations. The instance generation algorithm needs an additional parameter r (target rank), which we fix to 10 for the instances generated in this experiment. Hence, the instances used in the experiment are likely to have rank equal to 10. We consider different p values, where p in {8, 9, 10, 11, 12}. Given that r = 10, we select p values around 10.

The purpose of the instance generation algorithm is to generate instances with some of the observations as outliers, so that L1-PCA solutions are more likely to be away from L2-PCA solutions. In order to generate instances, we use the procedure described in [14] with a slight modification. The instances used in the experiment in [14] have fixed parameters and constant-valued outliers to simulate data loss. To check the performance of the algorithms over various parameters and different (non-constant) patterns of outliers, we generate our own instances. In the generation procedure of [14], a matrix with a small fixed rank is generated and then extremely large constant values randomly replace entries of the original data matrix. In their instances, outliers have the same value, which can be interpreted as data loss, but they do not consider outliers due to incorrect measurements or cases with only a few outlier observations. Our procedure addresses all these cases.

We present the procedure in Algorithm 3.

0:  Input: m, n, r (target rank), α (% outliers)
0:  Output: data matrix A
1:  Generate random matrix B with b_ij ~ Uniform(-100, 100), i in N, j in M
2:  for each row i in N
3:     Generate random number u ~ Uniform(0, 1)
4:     if u < α
5:        for each column j in M
6:           Generate v ~ Uniform(0, 1)
7:           if v < 0.1, then set s_ij to an extremely large random perturbation
8:           else set s_ij to a small random perturbation
9:        end for
10:    else set s_ij ← 0 for all j in M
11:  end for
12:  Obtain U Σ V^T, the SVD of B
13:  Construct A ← U_r Σ_r V_r^T + S, with S generated in Steps 2 - 10
14:  Adjust A to have 0 mean for each column
Algorithm 3 PCA instance generation

In Step 1, we first generate a random matrix B, where each b_ij is drawn from the uniform distribution between -100 and 100. Next, in Steps 2 - 10, we generate a random perturbation matrix S with approximately α percent of the rows having extremely large perturbations, where each such row has approximately 10% extreme value entries. After the SVD of B in Step 12, the data matrix A = U_r Σ_r V_r^T + S is generated in Step 13, where U_r is the submatrix of U with the first r columns, Σ_r is the submatrix of Σ with the first r columns and rows, and V_r is the submatrix of V with the first r columns. The final data matrix is generated in Step 14 after adjusting it to have 0 column means.
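
A possible R rendering of Algorithm 3 is sketched below. The ranges of the extreme and small perturbations (uniform on [-1000, 1000] and [-10, 10]) are illustrative assumptions, since the exact values are not specified above.

gen_instance <- function(n, m, r = 10, alpha = 0.1) {
  B <- matrix(runif(n * m, -100, 100), n, m)        # Step 1: random base matrix
  S <- matrix(0, n, m)                              # Steps 2 - 10: perturbation matrix
  for (i in 1:n) {
    if (runif(1) < alpha) {                         # row i becomes an outlier row
      extreme <- runif(m) < 0.1                     # roughly 10% of its entries are extreme
      S[i,  extreme] <- runif(sum(extreme),  -1000, 1000)  # extreme perturbations (assumed range)
      S[i, !extreme] <- runif(sum(!extreme), -10, 10)      # small perturbations (assumed range)
    }
  }
  sv <- svd(B)                                      # Step 12
  A <- sv$u[, 1:r] %*% diag(sv$d[1:r]) %*% t(sv$v[, 1:r]) + S  # Step 13: rank-r part plus S
  scale(A, center = TRUE, scale = FALSE)            # Step 14: zero column means
}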

III-A2 UCI Instances

We also consider classification datasets from the UCI Machine Learning Repository [2] and adjust them to create PCA instances. Based on the assumption that observations in the same class of a classification data set have similar attribute distributions, we consider each class of the classification datasets separately. For each dataset, we partition the observations based on labels (classes). When there are many labels, we select the top two labels with respect to the number of observations among all labels. For each partitioned data set, the labels and the attributes with zero standard deviation (hence, meaningless) are removed and the matrix is standardized to have zero mean and unit standard deviation for each attribute.
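
The preprocessing just described can be sketched in R as follows; the function and argument names are hypothetical, and the data frame is assumed to contain numeric attribute columns plus one label column.

make_pca_instance <- function(df, label_col, label_value) {
  sub <- df[df[[label_col]] == label_value, setdiff(names(df), label_col), drop = FALSE]
  sub <- sub[, sapply(sub, sd) > 0, drop = FALSE]  # drop attributes with zero standard deviation
  scale(as.matrix(sub))                            # zero mean and unit standard deviation per attribute
}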

In Table I, we list the PCA instances we used and the corresponding original datasets from [2]. In the first column, abbreviated names of the original data sets are presented. The full names of the data sets are Breast Cancer Wisconsin, Indian Liver Patient Dataset, Cardiotocography, Ionosphere, Connectionist Bench (Sonar), Landsat Satellite, Spambase, Magic Gamma Telescope, Page Blocks Classification, and Pen-Based Recognition of Handwritten Digits. Each PCA instance is classified as small or large based on m and n. In the last column in Table I, the small and large instances are indicated by S and L, respectively. For the large instances, only Kwak and Nie are compared with the proposed algorithms, due to scalability issues of the other benchmark algorithms.

Original dataset from UCI PCA instance
Name (m, n) Num labels Name (m, n) Size
cancer (9,699) 2 cancer_2 (9,444) S
cancer_4 (9,239) S
ilpd (10,583) 2 ilpd_1 (10,416) S
ilpd_2 (10,167) S
cardio (21,2126) 10 cardio_1 (19,384) S
cardio_2 (19,579) S
iono (34,351) 2 iono_b (33,126) S
iono_g (32,225) S
sonar (60,208) 1 sonar_g (60,111) S
sonar_r (60,97) S
landsat (36,4435) 7 landsat_1 (36,1072) L
landsat_3 (36,961) L
spam (57,4601) 2 spam_0 (57,2788) L
spam_1 (57,1813) L
magic (10,19020) 2 magic_g (10,12332) L
magic_h (10,6688) L
blocks (10,5473) 5 blocks_1 (10,4913) L
hand (16,10992) 10 hand_0 (16,1142) L
hand_1 (16,1143) L
TABLE I: PCA instances created based on the datasets from the UCI Machine Learning Repository [2]

III-B Performance of awPCA for Synthetic Instances

In Table II, we present the results for awPCA for the synthetic instances. Although we created synthetic instances with varying α (% of outliers) values, we observed that the performance of the algorithms is very similar over different α values for each (m, n, p) triplet. Hence, in Table II, we present the average value over all α. That is, each row of the table is the average over 20 instances for the corresponding (m, n) pair given p. The first two columns are the instance size and the number of PCs, the next five columns are the gaps for all algorithms, and the last five columns are the execution times in seconds. For each row, the lowest gap among the five algorithms is boldfaced.

Instance Gap from the best (%) Time (seconds)
(m, n) p awPCA Ke Brooks Kwak Nie awPCA Ke Brooks Kwak Nie
(20, 100) 8 1% 6% 2% 12% 18% 0.0 0.7 1.3 0.0 0.0
9 4% 22% 3% 16% 26% 0.0 0.5 1.3 0.0 0.0
10 0% 0% 1% 7% 7% 0.0 0.4 1.3 0.0 0.0
11 0% 69% 2% 7% 12% 0.0 0.4 1.3 0.0 0.0
12 0% 70% 2% 7% 16% 0.0 0.4 1.2 0.0 0.0
(20, 300) 8 2% 8% 3% 10% 12% 0.0 4.9 9.5 0.0 0.1
9 3% 10% 3% 13% 19% 0.0 2.8 9.6 0.0 0.1
10 0% 0% 1% 3% 3% 0.0 1.2 9.6 0.0 0.1
11 0% 9% 1% 3% 6% 0.0 1.2 9.6 0.0 0.1
12 0% 7% 1% 3% 8% 0.0 1.2 9.6 0.0 0.1
(50, 100) 8 1% 3% 2% 13% 16% 0.0 3.8 19.2 0.0 0.0
9 1% 7% 3% 15% 21% 0.0 2.9 19.1 0.0 0.0
10 0% 0% 3% 7% 7% 0.0 2.9 19.3 0.0 0.0
11 0% 100% 2% 7% 8% 0.0 4.3 19.2 0.0 0.0
12 0% 100% 3% 7% 10% 0.0 4.8 19.2 0.0 0.0
(50, 300) 8 1% 3% 2% 9% 13% 0.1 27.0 227.4 0.0 0.1
9 2% 5% 3% 10% 16% 0.1 22.0 226.5 0.0 0.2
10 0% 0% 1% 3% 3% 0.0 18.7 226.7 0.0 0.2
11 0% 86% 1% 3% 4% 0.0 23.1 228.1 0.0 0.2
12 0% 98% 1% 3% 5% 0.0 27.2 226.0 0.0 0.2
TABLE II: Performance of awPCA for synthetic instances

Note that the gaps of awPCA are near zero for all instances. Further, awPCA has the lowest gaps (boldfaced numbers) among all algorithms for all instance classes except one. Brooks consistently gives the second best gaps, while Ke gives 0% gaps for p = 10, moderate gaps for p < 10, and the worst gaps for p > 10. Nie and Kwak generally give similar results, as they are designed to solve the same problem.

The execution times of the algorithms can be grouped into two groups: awPCA, Kwak, and Nie are in the faster group and Ke and Brooks are in the slower group. Ke and Brooks spend much more time on larger instances compared to the other three algorithms. Although it is not easy to compare, Kwak is the fastest among all algorithms, yet its gap is not as low as that of awPCA. It is important to note that the difference in execution time between Kwak and awPCA is negligible and that awPCA is the fastest among the algorithms designed to solve P1 with the L1 norm (i.e., Ke, Brooks, and awPCA).

III-C Performance of awPCA for UCI Instances

Fig. 2: Heat map of the gap (%) from the best for small UCI instances
Fig. 3: Heat map of the execution times (seconds) for small UCI instances
Fig. 4: Heat map of the gap (%) from the best for large UCI instances
Fig. 5: Heat map of the execution times (seconds) for large UCI instances

For each PCA instance in Table I, we execute the algorithms with various p values. The number of PCs p covers the entire spectrum of each instance, in increments of 2, 3, 5, or 10 depending on the instance.

For the small UCI instances, we present heat maps of the gaps and of the execution times of all algorithms in Figures 2 and 3. In both figures, the numbers are the gap in percentage or the execution time in seconds, a white cell implies a near-zero gap or near-zero execution time, and a dark gray cell implies the opposite. In Figure 2, awPCA is consistently the best, with Brooks usually being the second best algorithm. The gap of awPCA is zero except for a few cases. For such cases with a nonzero awPCA gap, Brooks performs the best. The gaps of Ke, Kwak, and Nie tend to increase in p. In Figure 3, we observe the same trend as in Section III-B: awPCA, Kwak, and Nie are in the faster group, and Ke and Brooks become slower as the instance size increases.

For the large UCI instances, we only compare awPCA against Kwak and Nie, due to scalability issues of Ke and Brooks. Hence, here the gap is computed from the best of awPCA, Kwak, and Nie. In Figures 4 and 5, we present heat maps of the gaps and of the execution times of the three algorithms. Similar to Figures 2 and 3, a white cell implies a low value. In Figure 4, awPCA is consistently the best except for four cases, and even for these four cases the gaps of awPCA are very small. We observe that the gaps of Kwak and Nie tend to increase in p, with only a slight difference between the two in general. In Figure 5, the execution time of Kwak is the lowest, and awPCA and Nie are of the same magnitude, although awPCA is slightly faster than Nie.

Based on the results for the UCI instances, we conclude that awPCA performs the best, while the execution time of awPCA is of the same order as or lower than those of the remaining algorithms.

III-D Comparison of wPCA and awPCA for UCI Instances

In this section, we compare wPCA and awPCA for the UCI instances in terms of solution quality (gap_wPCA and gap_awPCA) and execution time. In Table III, we present the average performance of wPCA and awPCA over all p values considered in Section III-C. The fourth column is defined as diff = gap_wPCA - gap_awPCA, where a negative diff value implies that wPCA gives a better solution and a near-zero diff value implies that gap_wPCA and gap_awPCA are similar. The seventh column is defined as ratio = execution time of awPCA / execution time of wPCA, where a less-than-one ratio value implies that awPCA is faster than wPCA. In Table III, we observe that gap_wPCA and gap_awPCA are very similar except for two instances (boldfaced values), while awPCA spends only 20% of the time of wPCA on average. Note also that awPCA is not always inferior to wPCA. Although it is rare, awPCA gives a better solution than wPCA for instances cardio_1 and magic_h. In general, we found that gap_awPCA is very similar to or slightly larger than gap_wPCA, while awPCA is much faster. On the other hand, we can also ignore the time difference if the execution times are within a few seconds. The UCI instances spam_0, spam_1, magic_g, and magic_h, which show a clear time difference between wPCA and awPCA, have the largest numbers of entries mn. Therefore, we recommend using wPCA when the data size is small and awPCA when the data size is very large.

Gap from the best (%) Time (seconds)
Instance wPCA awPCA diff wPCA awPCA ratio
cancer_2 15.7% 19.0% -3.3% 1.7 0.1 0.1
cancer_4 0.8% 0.8% 0.0% 0.4 0.0 0.1
ilpd_1 1.0% 9.8% -8.7% 0.3 0.1 0.3
ilpd_2 0.9% 1.2% -0.3% 0.5 0.0 0.0
cardio_1 0.7% 0.6% 0.1% 0.3 0.1 0.3
cardio_2 1.0% 1.2% -0.2% 1.1 0.1 0.1
iono_r 0.4% 0.4% 0.0% 0.4 0.0 0.1
iono_g 2.8% 2.8% 0.0% 0.2 0.0 0.2
sonar_g 0.1% 0.1% 0.0% 1.0 0.1 0.1
sonar_r 0.0% 0.0% 0.0% 0.9 0.1 0.1
landsat_1 0.0% 0.0% 0.0% 3.2 0.6 0.2
landsat_3 0.6% 0.6% 0.0% 5.7 0.4 0.1
spam_0 0.0% 0.0% 0.0% 16.7 5.2 0.3
spam_1 0.0% 0.0% 0.0% 18.9 3.2 0.2
magic_g 0.1% 0.2% -0.1% 47.7 15.6 0.3
magic_h 0.6% 0.5% 0.1% 13.8 7.1 0.5
blocks_1 0.0% 0.0% 0.0% 1.8 0.3 0.2
hand_0 0.0% 0.0% 0.0% 1.8 0.3 0.2
hand_1 0.0% 0.0% 0.0% 2.2 0.3 0.1
average -0.7% 0.2
TABLE III: Comparison of wPCA and awPCA for UCI instances

IV Conclusions

In this paper, we consider the L1-PCA problem minimizing the L1 reconstruction error and present iterative algorithms, wPCA and awPCA, where awPCA is an approximate version of wPCA developed to avoid computationally expensive SVD operations. The core of the algorithms relies on an iteratively reweighted least squares scheme and the expressions in (8). Although we were not able to show optimality for L1-PCA, which remains an open question, we show that the eigenvalues in wPCA and awPCA converge and that the eigenpairs in awPCA converge. In the computational experiment, we observe that awPCA outperforms all of the benchmark algorithms while the execution times are competitive. Out of the four algorithms designed to minimize the L1 reconstruction error (Ke, Brooks, wPCA, awPCA), we observe that awPCA is the fastest algorithm with near-best solution quality.

References

  • [1] A. Baccini, P. Besse, and A. de Faguerolles. A L1-norm PCA and a heuristic approach. In Proceedings of the International Conference on Ordinal and Symbolic Data Analysis, pages 359–368, March 1996.
  • [2] K. Bache and M. Lichman. UCI machine learning repository, 2013.
  • [3] J. Brooks, J. Dula, and E. Boone. A pure L1-norm principal component analysis. Computational Statistics and Data Analysis, 61:83–98, May 2013.
  • [4] J. Brooks and S. Jot. pcaL1: An implementation in R of three methods for L1-norm principal component analysis. Optimization Online, 2012.
  • [5] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust Principal Component Analysis? Journal of the ACM, 58(3), 2011.
  • [6] V. Choulakian. L1-norm projection pursuit principal component analysis. Computational Statistics and Data Analysis, 50(6):1441–1451, March 2006.
  • [7] C. Croux and A. Ruiz-Gazen. High breakdown estimators for principal components: the projection-pursuit approach revisited. Journal of Multivariate Analysis, 95(1):206–226, 2005.
  • [8] I. Daubechies, R. DeVore, M. Fornasier, and C. S. Güntürk. Iteratively reweighted least squares minimization for sparse recovery. Communications on Pure and Applied Mathematics, 63(1):1–38, 2010.
  • [9] J. Galpin and D. Hawkins. Methods of L1 estimation of a covariance matrix. Computational Statistics and Data Analysis, 5(4):305–319, September 1987.
  • [10] P. Geladi and B. R. Kowalski. Partial least-squares regression: a tutorial. Analytica Chimica Acta, 185:1–17, 1986.
  • [11] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, Cambridge, 2013.
  • [12] I. Jolliffe. Principal Component Analysis. Springer, 2002.
  • [13] M. Jorgensen. Iteratively Reweighted Least Squares. John Wiley & Sons, Ltd, 2006.
  • [14] Q. Ke and T. Kanade. Robust L1 norm factorization in the presence of outliers and missing data by alternative convex programming. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 739–746, June 2005.
  • [15] N. Kwak. Principal component analysis based on L1-norm maximization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(9):1672–1680, September 2008.
  • [16] P. D. Lax. Linear algebra and its applications. John Wiley & Sons, 2007.
  • [17] G. Li and Z. Chen. Projection-pursuit approach to robust dispersion matrices and principal components: primary theory and monte carlo. Journal of the American Statistical Association, 80(391):759–766, 1985.
  • [18] P. P. Markopoulos, G. N. Karystinos, and D. A. Pados. Optimal algorithms for L1-subspace signal processing. IEEE Transactions on Signal Processing, 62(19):5046–5058, October 2014.
  • [19] M. McCoy and J. A. Tropp. Two proposals for robust PCA using semidefinite programming. Electronic Journal of Statistics, 5:1123–1160, 2011.
  • [20] F. Nie, H. Huang, C. Ding, D. Luo, and H. Wang. Robust principal component analysis with non-greedy L1-norm maximization. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence, pages 1433–1438, June 2011.
  • [21] S. Roweis. EM Algorithms for PCA and SPCA. In Advances in Neural Information Processing Systems, pages 626–632, 1998.
  • [22] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2014.
  • [23] Y. Shmueli, G. Wolf, and A. Averbuch. Updating kernel methods in spectral decomposition by affinity perturbations. Linear Algebra and its Applications, 437:1356–1365, 2012.
  • [24] Z. Wen and W. Yin. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 142(1-2):397–434, 2013.