L1-norm Kernel PCA

09/28/2017, by Cheolmin Kim, et al., Northwestern University

We present the first model and algorithm for L1-norm kernel PCA. While L2-norm kernel PCA has been widely studied, there has been no work on L1-norm kernel PCA. For this non-convex and non-smooth problem, we offer geometric understandings through reformulations and present an efficient algorithm to which the kernel trick is applicable. To attest to the efficiency of the algorithm, we provide a convergence analysis including a linear rate of convergence. Moreover, we prove that the output of our algorithm is a local optimal solution to the L1-norm kernel PCA problem. We also numerically show its robustness when extracting principal components in the presence of influential outliers, as well as its runtime comparability to L2-norm kernel PCA. Lastly, we introduce its application to outlier detection and show that the L1-norm kernel PCA based model outperforms alternatives, especially for high-dimensional data.


1 Introduction

Principal Component Analysis (PCA) is one of the most popular dimensionality reduction techniques [1]. Given a large set of possibly correlated features, it attempts to find a small set of features (principal components) that retain as much information as possible. To generate such new dimensions, it linearly transforms the original features by multiplying them by loading vectors in such a way that the newly generated features are orthogonal and have the largest variances.

In traditional PCA, variances are measured using the L2-norm. This has the nice property that although the problem itself is non-convex, the optimal solution can be easily found through matrix factorization. Thanks to this property, together with its easy interpretability, PCA has been extensively used in a variety of applications. However, despite its success, it still has some limitations. First, since it generates new dimensions through a linear combination of features, it is not able to capture non-linear relationships between features. Second, as it uses the L2-norm for measuring variance, its solutions tend to be substantially affected by influential outliers. To overcome these limitations, the following two approaches have been proposed.
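To make the matrix-factorization remark concrete, the following minimal Python sketch (an illustration of standard L2-norm PCA, not code from the paper; all names are ours) recovers the loading vectors through an eigendecomposition of the sample covariance matrix.

import numpy as np

def l2_pca(X, n_components=2):
    # Illustrative sketch of standard L2-norm PCA via matrix factorization.
    X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize each feature
    cov = X.T @ X / (X.shape[0] - 1)              # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # factorize the symmetric matrix
    order = np.argsort(eigvals)[::-1]             # sort by decreasing variance
    loadings = eigvecs[:, order[:n_components]]   # loading vectors as columns
    scores = X @ loadings                         # principal component scores
    return loadings, scores

rng = np.random.default_rng(0)
loadings, scores = l2_pca(rng.normal(size=(100, 5)))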

Kernel PCA The idea of kernel PCA is to map original features into a high-dimensional feature space and perform PCA in that feature space [2]. With non-linear mappings, we can capture non-linear relationships among features, and the required computations can be carried out efficiently using the kernel trick. With the kernel trick, principal components can be computed without an explicit mapping.
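As an illustration of the kernel trick (our own sketch, not the paper's code), the kernel matrix of pairwise feature-space inner products can be computed directly from the original inputs; the RBF kernel and the gamma value below are example choices, and the centering function implements the standard kernel centering used when the mapped data are assumed to have zero mean.

import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    # K[i, j] = exp(-gamma * ||x_i - x_j||^2) = <Phi(x_i), Phi(x_j)> without forming Phi.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * sq_dists)

def center_kernel_matrix(K):
    # Center the mapped data in feature space using only the kernel matrix.
    n = K.shape[0]
    one_n = np.ones((n, n)) / n
    return K - one_n @ K - K @ one_n + one_n @ K @ one_n

rng = np.random.default_rng(0)
K = center_kernel_matrix(rbf_kernel_matrix(rng.normal(size=(50, 3)), gamma=0.5))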

L1-norm PCA To alleviate the effects of influential observations, L1-norm PCA uses the L1-norm instead of the L2-norm to measure variances. The L1-norm is more advantageous than the L2-norm when there are outliers having large feature values since it is less influenced by them. By exploiting this property, the L1-norm based formulation yields more robust results in the presence of influential outliers.

In this paper, we combine these two approaches for the variance maximization version of L1-norm PCA (which is not the same as minimizing the reconstruction error with respect to the L1-norm). In other words, we tackle a kernel version of L1-norm PCA. Unlike L2-norm kernel PCA, the kernel version of L1-norm PCA is a hard problem in that it is not only non-convex but also non-smooth. However, through a reformulation, we make it a geometrically interpretable problem where the goal is to minimize the L2-norm of a vector subject to a linear constraint involving L1-norm terms. For this reformulated problem, we present a "fixed point" type algorithm that iteratively computes a {-1, 1} weight for each data point based on the kernel matrix and the previous weights. We show that the kernel trick is applicable to this algorithm. Moreover, we prove the efficiency of our algorithm through a convergence analysis. We show that our algorithm converges in a finite number of steps and that the objective values decrease at a linear rate. Lastly, we computationally investigate the robustness of our algorithm and illustrate its use for outlier detection.

Our work has the following contributions.

  1. We are the first to present a model and an algorithm for L1-norm kernel PCA. While L2-norm kernel PCA has been widely used, a kernel version of L1-norm PCA has never been studied before. In this work, we show that the kernel trick which made L2-norm kernel PCA successful is also applicable to L1-norm kernel PCA.

  2. We provide a rate of convergence analysis for our L1-norm kernel PCA algorithm. Although many algorithms have been proposed for L1-norm PCA, none of them provides a rate of convergence analysis. Our analysis shows that the algorithm achieves a linear rate of convergence by exploiting the structure of the problem.

  3. We introduce a methodology based on L1-norm kernel PCA for outlier detection.

In what follows, we always refer to the variance maximization version of L1-norm PCA, and we assume that every variable in the input data is standardized to have a mean of 0 and a standard deviation of 1. This paper is organized as follows. Section 2 reviews recent work on L1-norm PCA and points out how our work is distinguishable from it. Section 3 covers various formulations of the kernel version of L1-norm PCA. Through reformulations, we offer an understanding of the problem, and based on these understandings we present our algorithm in Section 4. Section 5 gives a convergence analysis of our algorithm. Lastly, we show the robustness of our algorithm and its application to outlier detection in Section 6.

2 Related Work

L1-norm PCA has been shown to be NP-hard in [3] and [4]. Nevertheless, an algorithm finding a global optimal solution is proposed in [3]. Utilizing the auxiliary-unit-vector technique [5], it computes a global optimal solution with a complexity that is polynomial in the number of observations when the rank of the data matrix and the desired number of principal components are fixed. However, if the rank or the number of components is large, its computation time can be prohibitive. Rather than finding a global optimal solution, which is intractable for general problems, our work focuses on developing an efficient algorithm that finds a local optimal solution.

Recognizing the hardness of L1-norm PCA, an approximation algorithm is presented in [4] based on the well-known theorem of Nesterov [6]. In this work, L1-norm PCA is relaxed to a semi-definite programming (SDP) problem, and the SDP relaxation is considered instead. After solving the relaxed problem, the method generates a random vector and uses randomized rounding to produce a feasible solution. This randomized algorithm attains a constant-factor approximation guarantee in expectation. To achieve the approximation guarantee with high probability, it performs randomized rounding multiple times and takes the solution with the best objective value. Instead of providing an approximation guarantee by solving a relaxed problem, our work directly considers the L1-norm kernel PCA problem and develops an efficient algorithm that finds a local optimal solution.

Another approach using a known mathematical programming model is introduced in [7]. Specifically, it proposes an iterative algorithm that solves a mixed integer programming (MIP) problem in each iteration. Given an orthonormal matrix of loading vectors, it perturbs the matrix slightly in a way that the resulting matrix yields the largest objective value. After the perturbation, it uses singular value decomposition to recover orthogonality. This algorithm is completely different from the one proposed herein; in particular, the objective values of its iterates do not necessarily improve over iterations. In contrast, our work shows monotone convergence of the objective values as well as a linear rate of convergence to a local optimal solution.

On the other hand, a simple numerical algorithm finding a local optimal solution is proposed in [8], and an extended version that finds multiple loading vectors at the same time is presented in [9]. In the former work, the optimal solution is assumed to have a certain form, and the parameters involved in that form are updated at each iteration so as to improve the objective value. For a linear kernel, our algorithm has the same form as this algorithm. However, while the algorithm in [8] is derived without any justification, we provide the reasoning behind the algorithm. Moreover, we prove a rate of convergence, and we introduce a kernel version of the algorithm.

3 Kernel-based L1-norm PCA Formulations

We consider L1-norm PCA in a high-dimensional feature space $\mathcal{F}$. Suppose we map data vectors $x_1, \dots, x_n$ into the feature space $\mathcal{F}$ by a possibly non-linear mapping $\Phi$. Assuming the mapped data $\Phi(x_i)$, $i = 1, \dots, n$, are centered in the feature space, the kernel version of L1-norm PCA can be formulated as follows.

(1)    $\max_{w \in \mathcal{F}} \ \sum_{i=1}^{n} \bigl| w^\top \Phi(x_i) \bigr|$
       subject to $\ \|w\|_2 = 1$

As shown in (1), we only consider extracting the first loading vector. This is justifiable since subsequent loading vectors can be found by iteratively running the same algorithm. Specifically, each time a new loading vector is obtained, we update the kernel matrix $K$ defined by $K_{ij} = \Phi(x_i)^\top \Phi(x_j)$ by orthogonally projecting the mapped data onto the space orthogonal to the most recently obtained loading vector, and then run the same algorithm on the updated kernel matrix $K$.

Problem (1) has a convex non-smooth objective function to maximize. Moreover, the feasible set is non-convex. To better understand this problem and derive an efficient algorithm, we reformulate (1) in the following way.

(2)    $\min_{w \in \mathcal{F}} \ \|w\|_2$
       subject to $\ \sum_{i=1}^{n} \bigl| w^\top \Phi(x_i) \bigr| = 1$

To prove the equivalence of the two formulations, we show that an optimal solution of one formulation can be derived from an optimal solution of the other.

Proposition 1.

The following holds.

  1. If $w^\ast$ is optimal to (1), then $w^\ast \big/ \sum_{i=1}^{n} \bigl| {w^\ast}^\top \Phi(x_i) \bigr|$ is an optimal solution to (2).

  2. If $z^\ast$ is optimal to (2), then $z^\ast / \|z^\ast\|_2$ is an optimal solution to (1).

Proof.

It is easy to check that is feasible to (2). If is not optimal to (2), there exists z such that . From its feasibility, it is obvious that holds. Then, for , we have

In the same way, we have

due to . From ,

must hold, which contradicts the assumption that is an optimal solution of (1).

Again, it is easy to check is feasible to (1). If is not optimal to (1), then there exists w such that

Since , for , we have

On the other hand,

follows from resulting in

This contradicts the assumption that is optimal to (2). ∎
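To make the equivalence concrete, the following small two-dimensional check (our sketch, using a linear kernel, i.e., taking the mapping $\Phi$ to be the identity) confirms numerically that the direction maximizing the L1 dispersion in (1) coincides with the direction of the minimum-norm point on the constraint set of (2).

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))                             # data points, linear kernel

thetas = np.linspace(0.0, np.pi, 100000, endpoint=False)
U = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)   # candidate unit vectors
dispersion = np.abs(U @ X.T).sum(axis=1)                 # objective of (1) at each u

w1 = U[np.argmax(dispersion)]                            # approximate maximizer of (1)
# Along direction u, the feasible point of (2) is u / dispersion(u), with norm
# 1 / dispersion(u); hence (2) is minimized along the same direction as (1).
z2 = U[np.argmin(1.0 / dispersion)] / dispersion.max()

print(np.allclose(w1, z2 / np.linalg.norm(z2)))          # True: same direction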

To understand formulation (2), we first examine the constraint set $S = \bigl\{ w : \sum_{i=1}^{n} | w^\top \Phi(x_i) | = 1 \bigr\}$. Geometrically, this constraint set is symmetric with respect to the origin and represents the boundary of the polytope $P = \bigl\{ w : \sum_{i=1}^{n} | w^\top \Phi(x_i) | \le 1 \bigr\}$. It is easy to check that $P$ is a polytope since it can be represented by a finite set of linear constraints as
$$P = \Bigl\{ w : \sum_{i=1}^{n} s_i\, w^\top \Phi(x_i) \le 1 \ \text{ for all } s \in \{-1,1\}^n \Bigr\}.$$
Therefore, formulation (2) amounts to finding the closest point to the origin on the boundary of the polytope $P$. The following proposition shows that an optimal solution must be perpendicular to the face on which it lies.

Proposition 2.

An optimal solution is perpendicular to the face which it lies on.

Proof.

Suppose that an optimal solution of (2) is . Assuming , consider the face

If is not perpendicular to face , then

is the closest point to the origin from

having . Now, let us define its scalar multiple

By construction, z is a feasible solution to (2) and has the objective value of

From

we have

As a result,

follows. This contradicts the assumption that is an optimal solution to (2). Therefore, the optimal solution must be perpendicular to face . ∎

From Proposition 2, we can easily derive the following Corollary 1.

Corollary 1.

An optimal solution of (2) must have the form $w = c \sum_{i=1}^{n} s_i \Phi(x_i)$ where $c > 0$ and $s_i \in \{-1, 1\}$ for $i = 1, \dots, n$.

The form in which the optimal solution is a scalar multiple of $\sum_{i=1}^{n} s_i \Phi(x_i)$ is assumed in [8] for the linear kernel case without any justification, but by Corollary 1 it follows that this is essentially the right form for the optimal solution. Moreover, from

(3)    $\|w\|_2 = 1 \Big/ \Bigl\| \sum_{i=1}^{n} s_i \Phi(x_i) \Bigr\|_2$

we can further show that the optimal solution of formulation (2) can be derived from an optimal solution of the following binary problem.

(4)    $\max_{s \in \{-1,1\}^n} \ \Bigl\| \sum_{i=1}^{n} s_i \Phi(x_i) \Bigr\|_2$
Proposition 3.

Let an optimal solution of the binary formulation (4) be $s^\ast$. Then, $s^\ast$ satisfies $s_i^\ast \sum_{j=1}^{n} K_{ij} s_j^\ast \ge 0$ for $i = 1, \dots, n$. Therefore, it follows that $\sum_{i=1}^{n} s_i^\ast \Phi(x_i) \big/ \bigl\| \sum_{i=1}^{n} s_i^\ast \Phi(x_i) \bigr\|_2^2$ is an optimal solution for formulation (2).

Proof.

Since is an optimal solution of (4), flipping the sign of any must not improve the objective value, .

To deduce a contradiction, let us assume that there exists some such that for .

Then, for any , flipping the sign of gives

which contradicts the assumption that is an optimal solution to (4). Therefore, must satisfy

Since maximizes ,

is a minimizer for (2) due to Corollary 1 and (3). ∎

The following result has been shown in [3] for the linear kernel case but here we generalize it.

Corollary 2.

Formulation (2) is equivalent to formulation (4).

Proof.

Due to Corollary 1 and (3), formulation (2) comes down to

subject to

Since an optimal solution to (4) satisfies the constraints by Proposition 3, the two formulations are essentially the same. ∎

Interestingly, the binary formulation (4) is NP-hard. Since expanding $\bigl\| \sum_{i=1}^{n} s_i \Phi(x_i) \bigr\|_2^2$ gives $\sum_{i=1}^{n} \sum_{j=1}^{n} s_i s_j K_{ij}$, by interpreting the kernel values $K_{ij}$ as edge weights of the complete graph on nodes $1, \dots, n$, we can show the equivalence of the quadratic binary program (4) and the max-cut problem.
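On a tiny data set the binary reformulation can be checked by brute force; the sketch below (our illustration, linear kernel) enumerates all sign vectors, picks the one maximizing the quadratic form defined by the kernel matrix, and compares the resulting direction with a direct search over unit vectors for formulation (1).

import numpy as np
from itertools import product

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 2))
K = X @ X.T                                        # linear kernel matrix

best_s, best_val = None, -np.inf
for signs in product([-1.0, 1.0], repeat=len(X)):  # all 2^n sign vectors
    s = np.array(signs)
    val = s @ K @ s                                # objective of the binary problem
    if val > best_val:
        best_s, best_val = s, val
w_binary = X.T @ best_s
w_binary /= np.linalg.norm(w_binary)               # direction from the best sign vector

thetas = np.linspace(0.0, np.pi, 100000, endpoint=False)
U = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)
w_direct = U[np.argmax(np.abs(U @ X.T).sum(axis=1))]   # direct search for (1)

# The two directions agree up to sign (and grid resolution).
print(min(np.linalg.norm(w_binary - w_direct), np.linalg.norm(w_binary + w_direct)))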

4 Algorithm

In this section, we develop an efficient algorithm that finds a local optimal solution to problem (2) based on the findings in Section 3. Before giving the details of the algorithm, we first describe the idea behind it.

The main idea of the algorithm is to move along the boundary of the polytope $P$ so that the L2-norm of the iterate decreases. Figure 1 graphically shows a step of Algorithm 1. Starting with the current iterate, we first identify the hyperplane containing the face on which it lies. After identifying the equation of this hyperplane, we find the closest point to the origin on it. We then obtain the next iterate by projecting this point back onto the constraint set, that is, by multiplying it by an appropriate scalar. We repeat this process until the iterate converges.


Fig. 1: Geometric interpretation of the algorithm

Next, we develop an algorithm based on the above idea. Let . From Corollary 1, we know that the optimal solution has the form of

Utilizing the fact that the optimal solution is characterized by the sign vector , we characterize the initial iterate with the sign vector as

With , the equation of the hyperplane is represented by

The closest point to the origin among the points in the hyperplane has the form of . By plugging into , we have

We multiply by to make it feasible and thereby get

(5)

Utilizing

(6)

we obtain the following.

(7)
(8)
(9)

By plugging (7) into (9), can be represented as

This implies that we only need to update the sign vector at each iteration. Consequently, the only quantity we need to compute at each iteration is the product of the kernel matrix with the current sign vector.
Moreover, we obtain the following termination criterion: the algorithm stops as soon as the sign vector does not change between consecutive iterations. This results in Algorithm 1.

  Input: data vectors $x_1, \dots, x_n$, kernel matrix $K$, starting sign vector $s^{(0)}$; set $k = 0$
  while $s^{(k)} \neq s^{(k-1)}$ do
     Compute $a^{(k)} = K s^{(k)}$, $\ s^{(k+1)}_i = \operatorname{sign}\bigl(a^{(k)}_i\bigr)$ for $i = 1, \dots, n$
     $k \leftarrow k + 1$
  end while
Algorithm 1 L1-norm Kernel PCA
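A minimal Python sketch of this fixed-point iteration is given below (our reading of Algorithm 1 under the assumed update "sign vector <- signs of the kernel matrix times the current sign vector"; it is not the authors' reference implementation).

import numpy as np

def l1_kernel_pca_sign_vector(K, s0=None, max_iter=1000):
    # Fixed-point iteration on the sign vector (illustrative sketch).
    n = K.shape[0]
    s = np.ones(n) if s0 is None else np.asarray(s0, dtype=float)
    for _ in range(max_iter):
        a = K @ s                                   # kernel matrix times current sign vector
        s_new = np.where(a >= 0.0, 1.0, -1.0)       # new {-1, 1} weight for each data point
        if np.array_equal(s_new, s):                # stop when the sign vector is unchanged
            break
        s = s_new
    return s

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
s = l1_kernel_pca_sign_vector(X @ X.T)              # usage with a linear kernel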

After obtaining the final sign vector from Algorithm 1, we can compute principal scores without the explicit mapping $\Phi$. For example, the principal component score of each observation can be computed using only the kernel matrix and the final sign vector. We can also proceed to find more principal components without the explicit mapping $\Phi$. As computing a loading vector and principal components only requires the kernel matrix $K$, we only need to update the kernel matrix each time a new loading vector is found, and this update can likewise be carried out without the explicit mapping $\Phi$.
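For concreteness, one way to carry out both steps using only the kernel matrix and the final sign vector is sketched below; the score and deflation formulas are our reading of the procedure, assuming the unit-norm loading vector is the normalized combination of mapped data points weighted by the sign vector (Corollary 1), and they are not quoted verbatim from the paper.

import numpy as np

def scores_and_deflate(K, s):
    # Assumed formulas: with w = sum_i s_i Phi(x_i) / ||sum_i s_i Phi(x_i)||_2,
    # the score of observation j is w^T Phi(x_j) = (K s)_j / sqrt(s^T K s), and
    # removing the w-component of every Phi(x_i) deflates the kernel matrix to
    # K' = K - (K s)(K s)^T / (s^T K s).
    Ks = K @ s
    norm_sq = s @ Ks                          # ||sum_i s_i Phi(x_i)||_2^2
    scores = Ks / np.sqrt(norm_sq)            # principal component scores
    K_deflated = K - np.outer(Ks, Ks) / norm_sq
    return scores, K_deflated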

5 Convergence Analysis

In this section, we provide a convergence analysis of Algorithm 1. We first prove finite convergence, and then provide a rate of convergence analysis.

Before proving the convergence of the algorithm, we first show that the sequence of objective values is non-increasing.

Lemma 1.

We have .

Proof.

Inequality follows from

(10)

Here, the second equality is from (6) and the last inequality holds by the Cauchy–Schwarz inequality, where equality holds if the two vectors involved are scalar multiples of each other.
Next, we have

(11)

Finally, follows from

Lemma 2.

If , we have , and .

Proof.

From Lemma 1, we have . Then, from (10), is a scalar multiple of . Assuming , follows from (6) resulting in

We can show in the same manner. As a result, holds by (8). From , we have

Theorem 1.

The sequence converges in a finite number of steps.

Proof.

Suppose the sequence does not converge. As each iterate is solely determined by its sign vector, the number of possible iterates is finite. Therefore, if the sequence does not converge, then some iterates appear more than once in the sequence.

Without loss of generality, let . By Lemma 1, we have

forcing us to have . Now, by Lemma 2, must hold, which contradicts the assumption that the sequence does not converge. In other words, the algorithm stops at iteration . Therefore, the sequence generated by Algorithm 1 converges in a finite number of steps. ∎
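The finite termination and the monotone behavior of the objective can also be observed numerically. The sketch below is our illustration under stated assumptions: a linear kernel, the update "sign vector <- signs of the kernel matrix times the current sign vector", and the objective value of the feasible iterate of (2) taken as $\sqrt{s^\top K s}$ divided by the L1-norm of $K s$ (our reading of the reformulation). It checks that the objective values never increase and that the iteration stops after finitely many steps.

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
K = X @ X.T                                         # linear kernel matrix (positive semi-definite)

s = np.where(rng.normal(size=len(X)) >= 0.0, 1.0, -1.0)
objectives = []
for _ in range(1000):
    # Assumed objective of (2) at the feasible iterate built from s.
    objectives.append(np.sqrt(s @ K @ s) / np.abs(K @ s).sum())
    s_new = np.where(K @ s >= 0.0, 1.0, -1.0)       # assumed sign-vector update
    if np.array_equal(s_new, s):                    # finite termination
        break
    s = s_new

print(len(objectives), bool(np.all(np.diff(objectives) <= 1e-10)))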

Next, we prove that the sequence generated by Algorithm 1 converges with a linear rate.

Theorem 2.

Let Algorithm 1 start from and terminate with at iteration . Then we have where for all .

Proof.

From (5), we have

Since by Lemma 1, we have

(12)

Now, we show