Principal Component Analysis (PCA) is one of the most popular dimensionality reduction techniques. Given a large set of possibly correlated features, it attempts to find a small set of new features (principal components) that retain as much information as possible. To generate these new dimensions, it linearly transforms the original features by multiplying them by loading vectors chosen so that the newly generated features are orthogonal and have the largest variances.
In traditional PCA, variances are measured using the L2-norm. This has the nice property that although the problem itself is non-convex, an optimal solution can be found easily through matrix factorization. Thanks to this property, together with its easy interpretability, PCA has been used extensively in a variety of applications. However, despite its success, it still has some limitations. First, since it generates new dimensions through linear combinations of features, it cannot capture non-linear relationships between features. Second, as it uses the L2-norm for measuring variance, its solutions tend to be substantially affected by influential outliers. To overcome these limitations, the following two approaches have been proposed.
Kernel PCA. The idea of kernel PCA is to map the original features into a high-dimensional feature space and perform PCA in that space. With non-linear mappings, we can capture non-linear relationships among features, and the computation can be done efficiently using the kernel trick: principal components can be computed without an explicit mapping.
L1-norm PCA. To alleviate the effects of influential observations, L1-norm PCA uses the L1-norm instead of the L2-norm to measure variances. The L1-norm is more advantageous than the L2-norm when there are outliers with large feature values, since it is less influenced by them. By utilizing this property, more robust results can be obtained through the L1-norm based formulation in the presence of influential outliers.
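The difference in sensitivity between the two measures can be illustrated with a toy calculation. The sketch below is only an illustration under assumed numbers: the array `projections` is hypothetical data (ten ordinary points and one outlier, already projected onto some direction), and the "share" quantities are introduced here for illustration only.

```python
import numpy as np

# Hypothetical projections of 10 ordinary points and 1 outlier onto a direction.
projections = np.array([1.0, -1.0] * 5 + [20.0])

# Share of the total dispersion contributed by the outlier under each measure.
l2_share = projections[-1] ** 2 / np.sum(projections ** 2)      # squared (L2-type) measure
l1_share = abs(projections[-1]) / np.sum(np.abs(projections))   # absolute (L1-type) measure

print(f"outlier share, L2-type measure: {l2_share:.3f}")
print(f"outlier share, L1-type measure: {l1_share:.3f}")
```

In this toy setting the single outlier accounts for a much larger fraction of the squared measure than of the absolute measure, which is the intuition behind the robustness claim.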
In this paper, we combine these two approaches for the variance maximization version of L1-norm PCA (which is not the same as minimizing the reconstruction error with respect to the L1-norm). In other words, we tackle a kernel version of L1-norm PCA. Unlike L2-norm kernel PCA, the kernel version of L1-norm PCA is a hard problem in that it is not only non-convex but also non-smooth. However, through a reformulation, we turn it into a geometrically interpretable problem where the goal is to minimize the L2-norm of a vector subject to a linear constraint involving the L1-norm terms. For this reformulated problem, we present a "fixed point" type algorithm that iteratively computes a weight in {-1, 1} for each data point based on the kernel matrix and the previous weights. We show that the kernel trick is applicable to this algorithm. Moreover, we establish the efficiency of our algorithm through a convergence analysis: we show that it converges in a finite number of steps and that the objective values decrease at a linear rate. Lastly, we computationally investigate the robustness of our algorithm and illustrate its use for outlier detection.
Our work makes the following contributions.
We are the first to present a model and an algorithm for L1-norm kernel PCA. While L2-norm kernel PCA has been widely used, a kernel version of L1-norm PCA has never been studied before. In this work, we show that the kernel trick, which made L2-norm kernel PCA successful, is also applicable to L1-norm kernel PCA.
We provide a rate of convergence analysis for our L1-norm kernel PCA algorithm. Although many algorithms have been proposed for L1-norm PCA, none of them comes with a rate of convergence analysis. This work shows that our algorithm achieves a linear rate of convergence by exploiting the structure of the problem.
We introduce a methodology based on L1-norm kernel PCA for outlier detection.
In what follows, we always refer to the variance maximization version of L1-norm PCA, and we assume that every variable in the input data is standardized to have a mean of 0 and a standard deviation of 1.
This paper is organized as follows. Section 2 reviews recent work on L1-norm PCA and points out how our work differs from it. Section 3 covers various formulations of the kernel version of L1-norm PCA. Through these reformulations, we offer an understanding of the problem, and based on this understanding we present our algorithm in Section 4. Section 5 gives a convergence analysis of our algorithm. Lastly, we show the robustness of our algorithm and its application to outlier detection in Section 6.
2 Related Work
L1-norm PCA has been shown to be NP-hard. Nevertheless, an algorithm that finds a globally optimal solution has been proposed. Utilizing the auxiliary-unit-vector technique, it computes a globally optimal solution with a complexity that depends on the number of observations, the rank of the data matrix, and the desired number of principal components. Assuming the rank and the number of components are fixed, the runtime of this algorithm is polynomial in the number of observations. However, if they are large, its computation time can be prohibitive. Rather than finding a globally optimal solution, which is intractable for general problems, our work focuses on developing an efficient algorithm that finds a locally optimal solution.
In recognition of the hardness of L1-norm PCA, an approximation algorithm based on Nesterov's theorem has been presented. In that work, L1-norm PCA is relaxed to a semi-definite programming (SDP) problem, and the SDP relaxation is solved in place of the original problem. After solving the relaxed problem, the algorithm generates a random vector and uses randomized rounding to produce a feasible solution. This randomized algorithm is an approximation algorithm in expectation. To achieve its approximation guarantee with high probability, it performs randomized rounding multiple times and takes the solution with the best objective value. Instead of providing an approximation guarantee by solving a relaxed problem, our work directly considers the L1-norm kernel PCA problem and develops an efficient algorithm that finds a locally optimal solution.
Another approach, which uses a known mathematical programming model, proposes an iterative algorithm that solves a mixed integer programming (MIP) problem in each iteration. Given an orthonormal matrix of loading vectors, it perturbs the matrix slightly in such a way that the resulting matrix yields the largest objective value. After the perturbation, it uses singular value decomposition to recover orthogonality. That algorithm is completely different from the one proposed herein, and the objective values of its iterates do not necessarily improve over iterations. In contrast, our work shows monotone convergence of the objective values as well as a linear rate of convergence to a locally optimal solution.
On the other hand, a simple numerical algorithm that finds a locally optimal solution has been proposed, along with an extended version that finds multiple loading vectors at the same time. In the former work, the optimal solution is assumed to have a certain form, and the parameters involved in that form are updated at each iteration so as to improve the objective value. For a linear kernel, our algorithm has the same form as this algorithm. However, while that algorithm is derived without any justification, we provide the understanding behind it. Moreover, we provide a rate of convergence analysis and introduce a kernel version of the algorithm.
3 Kernel-based L1-norm PCA Formulations
We consider L1-norm PCA in a high-dimensional feature space. Suppose we map the data vectors into a feature space by a possibly non-linear mapping. Assuming for every , the kernel version of L1-norm PCA can be formulated as follows.
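One way to write this formulation is sketched below. The notation is assumed for illustration rather than quoted: $a_i$, $i = 1, \dots, n$, denotes the data vectors, $\Phi$ the mapping into the feature space, and $x$ the loading vector, which is taken to have unit L2-norm.

```latex
\max_{x} \; \sum_{i=1}^{n} \left| \Phi(a_i)^{\top} x \right|
\quad \text{subject to} \quad \|x\|_2 = 1
```

Under this reading, the objective is a sum of absolute values of linear functions of $x$, hence convex and non-smooth, while the feasible set is the unit sphere, which is non-convex.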
As shown in (1), we only consider extracting the first loading vector. This restriction is justifiable since subsequent loading vectors can be found by iteratively running the same algorithm. Specifically, each time a new loading vector x is obtained, we update the kernel matrix by orthogonally projecting the mapped data vectors onto the space orthogonal to the most recently obtained loading vector, and then run the same algorithm on the updated kernel matrix.
Problem (1) has a convex, non-smooth objective function to maximize. Moreover, its feasible set is non-convex. To better understand this problem and derive an efficient algorithm, we reformulate (1) in the following way.
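A reformulation consistent with the description in the introduction (minimizing the L2-norm of a vector subject to a constraint involving L1-norm terms) can be sketched as follows. The variable name $w$ and the symbols $a_i$, $n$, and $\Phi$ are assumed notation, and the right-hand side of 1 is a normalization chosen for this sketch.

```latex
\min_{w} \; \|w\|_2
\quad \text{subject to} \quad \sum_{i=1}^{n} \left| \Phi(a_i)^{\top} w \right| = 1
```

Under these assumptions, the two problems are linked by rescaling: a unit vector $x$ corresponds to $w = x / \sum_{i} |\Phi(a_i)^{\top} x|$, so a larger L1 objective value for $x$ means a smaller L2-norm for $w$.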
To prove the equivalence of two formulations, we show that an optimal solution of one formulation can be derived from an optimal solution of the other formulation.
In the same way, we have
due to . From ,
must hold, which contradicts the assumption that is an optimal solution of (1).
To understand formulation (2), we first examine its constraint set. Geometrically, this constraint set is symmetric with respect to the origin and is the boundary of a polytope. It is easy to check that this set is a polytope, since it can be represented by a finite set of linear constraints, one for each assignment of signs to the L1-norm terms in the constraint.
Therefore, formulation (2) seeks the point on the boundary of this polytope that is closest to the origin. The following proposition shows that an optimal solution must be perpendicular to one of its faces.
An optimal solution is perpendicular to the face which it lies on.
Suppose that an optimal solution of (2) is . Assuming , consider the face
If is not perpendicular to face , then
is the closest point to the origin from
having . Now, let us define its scalar multiple
By construction, z is a feasible solution to (2) and has the objective value of
As a result,
follows. This contradicts the assumption that is an optimal solution to (2). Therefore, the optimal solution must be perpendicular to face . ∎
An optimal solution of (2) must have the form of where and .
The form that is a scalar multiple of is assumed in  for the linear kernel case without any justification but by Corollary 1, it follows that it is essentially the right form for the optimal solution. Moreover, from
we can further show that the optimal solution of formulation (2) can be derived from the optimal solution of the following binary problem.
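For very small instances, a binary problem of this kind can be checked by brute force. The sketch below assumes that the binary problem maximizes the quadratic form c^T K c over sign vectors c in {-1, 1}^n, where K is the kernel matrix; this form, the linear kernel, the toy data, and the function name `best_sign_vector` are all assumptions made for illustration.

```python
import itertools
import numpy as np

def best_sign_vector(K):
    """Enumerate all c in {-1, 1}^n and return a maximizer of c^T K c."""
    n = K.shape[0]
    best_c, best_val = None, -np.inf
    for signs in itertools.product([-1.0, 1.0], repeat=n):
        c = np.array(signs)
        val = c @ K @ c
        if val > best_val:
            best_c, best_val = c, val
    return best_c, best_val

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 3))   # toy data: 8 observations, 3 features
K = A @ A.T                   # linear kernel matrix

c_star, val_star = best_sign_vector(K)

# With a linear kernel, the loading vector can be formed explicitly.
x_star = A.T @ c_star / np.sqrt(val_star)    # unit L2-norm by construction
l1_objective = np.sum(np.abs(A @ x_star))    # L1 dispersion along x_star
print(l1_objective, np.sqrt(val_star))       # the two values coincide
```

Enumeration is exponential in the number of observations, so this is only a sanity check for tiny examples, not a practical method.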
Since is an optimal solution of (4), flipping the sign of any must not improve the objective value, .
To deduce a contradiction, let us assume that there exists some such that for .
The following result has been shown in  for the linear kernel case but here we generalize it.
4 Algorithm
In this section, we develop an efficient algorithm that finds a local optimal solution to problem (2) based on the findings in Section 3. Before giving the details of the algorithm, we first provide an idea behind the algorithm.
At each iteration, we first identify the hyperplane on which the current iterate lies. After identifying the equation of this hyperplane, we find the point on it that is closest to the origin. We then obtain the next iterate by projecting that point onto the constraint set, which amounts to multiplying it by an appropriate scalar. We repeat this process until the iterates converge.
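The geometric idea can be sketched for the linear kernel case, where the iterate can be stored explicitly. This is a minimal sketch under assumptions: the constraint set is taken to be the set of w whose absolute projections onto the data sum to 1, the rows of the hypothetical matrix `A` are the data vectors, and the function name `geometric_step` and the tie-breaking rule for zero projections are choices made here for illustration.

```python
import numpy as np

def geometric_step(A, w):
    """One step: hyperplane through w, closest point to origin, rescale."""
    c = np.sign(A @ w)
    c[c == 0] = 1.0                       # tie-break; w stays on the hyperplane either way
    v = A.T @ c                           # hyperplane through w: {u : v^T u = 1}
    z = v / (v @ v)                       # closest point to the origin on that hyperplane
    return z / np.sum(np.abs(A @ z))      # rescale so that sum_i |a_i^T w| = 1 again

rng = np.random.default_rng(1)
A = rng.normal(size=(40, 5))              # toy data
w = np.ones(5)
w = w / np.sum(np.abs(A @ w))             # feasible starting point
norms = [np.linalg.norm(w)]
for _ in range(100):
    w_next = geometric_step(A, w)
    if np.allclose(w_next, w):
        break
    w = w_next
    norms.append(np.linalg.norm(w))

print(len(norms) - 1, "updates; norms:", np.round(norms, 4))
```

In this sketch the L2-norm of the iterate never increases: the closest point on the hyperplane is no farther from the origin than the current iterate, and the rescaling factor is at most 1.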
Next, we develop an algorithm based on the above idea. Let . From Corollary 1, we know that the optimal solution has the form of
Utilizing the fact that the optimal solution is characterized by the sign vector , we characterize the initial iterate with the sign vector as
With , the equation of the hyperplane is represented by
The closest point to the origin among the points in the hyperplane has the form of . By plugging into , we have
We multiply by to make it feasible and thereby get
we get the following.
This implies that we only need to update at each iteration.
we only need to compute
at each iteration.
we get the following termination criteria:
resulting in Algorithm 1.
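A kernel-only iteration of this type can be sketched as follows. This is a minimal sketch, not the authors' listing of Algorithm 1: it assumes that each weight is set to the sign of the corresponding entry of the kernel matrix times the current weight vector, and that the iteration stops when the weights no longer change. The function names `l1_kernel_pca_signs` and `principal_scores`, the tie-breaking rule for zero entries, the `max_iter` cap, and the normalization of the scores by sqrt(c^T K c) are assumptions made for illustration.

```python
import numpy as np

def l1_kernel_pca_signs(K, c0=None, max_iter=1000):
    """Fixed-point iteration on a sign vector c using only the kernel matrix K."""
    n = K.shape[0]
    c = np.ones(n) if c0 is None else np.asarray(c0, dtype=float).copy()
    history = [float(c @ K @ c)]          # tracks c^T K c across iterations
    for _ in range(max_iter):
        Kc = K @ c
        # Sign update; a zero entry keeps its previous sign (assumed tie-break).
        c_new = np.where(Kc > 0, 1.0, np.where(Kc < 0, -1.0, c))
        if np.array_equal(c_new, c):      # weights unchanged: stop
            break
        c = c_new
        history.append(float(c @ K @ c))
    return c, history

def principal_scores(K, c):
    """Scores along the unit-norm loading vector implied by c, from K alone."""
    Kc = K @ c
    return Kc / np.sqrt(c @ Kc)

rng = np.random.default_rng(2)
A = rng.normal(size=(30, 4))              # toy data
K = A @ A.T                               # linear kernel; any PSD kernel matrix could be used
c, history = l1_kernel_pca_signs(K)
scores = principal_scores(K, c)
print(len(history) - 1, "updates; c^T K c:", np.round(history, 3))
```

For a positive semi-definite kernel matrix, the quantity c^T K c does not decrease under this update, and only matrix-vector products with K are needed, so the mapping into the feature space never has to be formed.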
After obtaining the final sign vector from Algorithm 1, we can compute principal scores without an explicit mapping. For example, the principal component of an observation can be computed from the kernel matrix and the final sign vector alone.
We can also proceed to find more principal components without an explicit mapping. As computing a loading vector and principal components only requires the kernel matrix, we only need to update the kernel matrix each time a new loading vector is found. This update can also be carried out without an explicit mapping.
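A kernel matrix update of this kind can be sketched as follows. The sketch assumes that the loading vector is the normalized, sign-weighted sum of the mapped data vectors, so that projecting the mapped data onto its orthogonal complement changes the kernel matrix by a rank-one term. The function name `deflate_kernel` and the toy data are hypothetical, and the linear kernel is used only so that the result can be compared with an explicit projection.

```python
import numpy as np

def deflate_kernel(K, c):
    """Kernel matrix of the mapped data after projecting out the loading vector."""
    Kc = K @ c
    return K - np.outer(Kc, Kc) / (c @ Kc)

rng = np.random.default_rng(3)
A = rng.normal(size=(20, 4))                          # toy data
K = A @ A.T                                           # linear kernel matrix
c = np.where(rng.normal(size=20) >= 0, 1.0, -1.0)     # some sign vector

K_new = deflate_kernel(K, c)

# Explicit check, possible here because the linear kernel has an explicit map.
x = A.T @ c / np.linalg.norm(A.T @ c)                 # unit-norm loading vector
A_proj = A - np.outer(A @ x, x)                       # project rows onto the complement of x
print(np.allclose(K_new, A_proj @ A_proj.T))
```

The same algorithm can then be run on the updated matrix to look for the next loading vector.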
5 Convergence Analysis
In this section, we provide a convergence analysis of Algorithm 1. We first prove finite convergence, and then provide a rate of convergence analysis.
Before proving the convergence of the algorithm, we first show that the sequence is non-increasing.
We have .
Inequality follows from
Here, the second equality is from (6) and the last inequality holds by the Cauchy-Schwarz inequality, where equality holds if is a scalar multiple of .
Next, we have
Finally, follows from
If , we have , and .
The sequence converges in a finite number of steps.
Suppose the sequence does not converge. As vector is solely determined by , the number of possible is finite. Therefore, if the sequence does not converge, then some vectors appear more than once in the sequence .
Without loss of generality, let . By Lemma 1, we have
forcing us to have . Now, by Lemma 2, must hold, which contradicts the assumption that the sequence does not converge. In other words, the algorithm stops at iteration . Therefore, the sequence generated by Algorithm 1 converges in a finite number of steps. ∎
Next, we prove that the sequence generated by Algorithm 1 converges at a linear rate.
Let Algorithm 1 start from and terminate with at iteration . Then we have where for all .