Maximum Margin Principal Components

05/17/2017
by Xianghui Luo, et al.

Principal Component Analysis (PCA) is a very successful dimensionality reduction technique, widely used in predictive modeling. A key factor in its widespread use in this domain is the fact that the projection of a dataset onto its first K principal components minimizes the sum of squared errors between the original data and the projected data over all possible rank K projections. Thus, PCA provides optimal low-rank representations of data for least-squares linear regression under standard modeling assumptions. On the other hand, when the loss function for a prediction problem is not the least-squares error, PCA is typically a heuristic choice of dimensionality reduction -- in particular for classification problems under the zero-one loss. In this paper we target classification problems by proposing a straightforward alternative to PCA that aims to minimize the difference in margin distribution between the original and the projected data. Extensive experiments show that our simple approach typically outperforms PCA on any particular dataset in terms of classification error, though the difference is not always statistically significant, and that, despite being a filter method, it is frequently competitive with Partial Least Squares (PLS) and Lasso on a wide range of datasets.

1 Background and Introduction

Dimensionality reduction techniques are a core part of the Statistics and Machine Learning toolbox, widely used in predictive modeling to improve a range of measures including generalization performance, interpretability or identifiability of models, and the time and space complexity of learning or prediction. There are many such techniques (see e.g. [7] for a survey), but chief amongst them are simple linear techniques, and the most widely used of these in practice is probably Principal Components Analysis (PCA) [15] and its variants. Applications of PCA include [27, 28, 33].

PCA works as follows: Suppose we have a data matrix X of N, d-dimensional observations. PCA works by linearly projecting our original d-dimensional data onto K uncorrelated (orthogonal) directions – the first K ‘Principal Components’ – where typically K ≪ d. Denote by P the d × K matrix with the K Principal Components as columns; then the Principal Components are chosen to maximize the orthogonal projection of the dataset onto the column space of P, that is, P is chosen to satisfy:

(1)   P = \arg\max_{P : P^\top P = I_K} \mathrm{tr}\left( P^\top X^\top X P \right)

As a result, PCA gives the best K-dimensional representation of the original d-dimensional data in the least-squares sense or, equivalently, if the data are centered, using PCA means that (for a fixed dataset) we discard the smallest amount of the total sample variation of any linear dimensionality reduction scheme. For data analysis tasks other than Ordinary Least Squares Regression (OLS), using PCA is a heuristic which frequently works well but, as noted in [11, 20, 16, 12], it can work very poorly (even for a linear regression task) in some natural scenarios. For the task of classification, PCA often seems to work very well experimentally [13, 14], but it is trivial to construct examples for which PCA will work badly in a classification task (for example, when the most discriminative features have small between-class variance relative to the variance of the whole dataset, or relative to the within-class variance).

In other words, since the PCA objective is disconnected from the classification task at hand and, in particular, does not take account of the class structure inherent in the problem, vanilla PCA is prone to underfitting. In supervised settings where class labels are present, it would be a waste not to use the label information in the selection of useful features for classification, so we propose a (weakly) supervised variant of PCA. There are a few supervised dimensionality reduction approaches, of which the most closely related to PCA is Fisher Linear Discriminant (FLD). FLD picks out the most useful feature for classification by maximizing the ratio of between-class variance to within-class variance, and in theory it should outperform PCA for classification tasks. However, as pointed out in [19], PCA often outperforms FLD in practice, especially for small datasets, due to poor estimation of the between-class and within-class variances. Another important and widely used supervised approach is wrapper methods based on partial least squares (PLS), e.g. [4] and references therein. These take advantage of known structure in the classification problem by rotating the uncorrelated components to minimize the least-squares error on the predicted labels. While such wrapper approaches are frequently successful in practice, it is well known that they need careful tuning to avoid overfitting, especially when the sample size is small [4]. More recently, a discriminative feature extraction method inspired by PCA was proposed in [17]. This method picks out the directions along which two classes differ most in their class-conditional second moments. However, it also suffers from the inaccurate second-moment estimation problem that plagues FLD in the case of small-sample, high-dimensional datasets. Finally, the ℓ1-regularization and sparse model selection method Lasso [26], widely used in regression, can also be used for dimensionality reduction in classification tasks, as can ℓ1-regularized SVM. LARS [5] is another similar approach.

In this paper we describe and empirically validate a PCA variant that attempts to address both issues: in particular, we choose a projection that maximally preserves (a proxy for) the sample margin distribution between two classes. Our approach is very simple to implement and only involves changing several lines of code relative to vanilla PCA but, as far as we are aware, it is neither previously published nor folklore. More importantly, it has the same time and space complexity as PCA, targets a sensible objective (for classification) yet is not a wrapper method, has a clear interpretation in terms of the problem at hand, and in our experience it works unreasonably well – our experimental outcomes are typically better than vanilla PCA, frequently significantly better, and never significantly worse, and also competitive with PLS and Lasso.

The remainder of this paper is structured as follows: In the next section we describe PCA more precisely and recall that the principal components are the solution to a straightforward eigenvector problem. Next we discuss the importance of the margin distribution as a measure of the difficulty of a classification problem, which motivates our altering the PCA objective in order to maximally preserve the sample margin distribution. Then we introduce several schemes based on our idea and provide some theoretical intuitions for how they work. We present extensive experiments on several datasets which demonstrate the utility of our approach and compare its performance to vanilla PCA, PLS, and Lasso; we use a simple non-parametric sign test to demonstrate the statistically significant superiority of our approach over vanilla PCA. Finally we summarize our findings and discuss some possible future directions for this research.

2 Preliminaries

Our main contribution is a novel PCA variant that aims to preserve the margin distribution, and in this section we therefore briefly review vanilla PCA, minimum margin and the margin distribution.

2.1 Principal Component Analysis

Given N points x_1, …, x_N in R^d and a target dimension K < d, PCA finds a linear projection π : R^d → R^K and embedding ι : R^K → R^d such that

(2)   \sum_{i=1}^{N} \left\| x_i - \iota(\pi(x_i)) \right\|^2

is minimized. In words, PCA finds the linear projection that best preserves the data in the least-squares sense when the reconstruction is also a linear map. Representing π and ι by matrices, it is easy to show that the optimal solution satisfies:

(3)   \iota = \pi^\top

Therefore, writing π as a K × d matrix P, PCA amounts to solving the optimization problem:

(4)   \min_{P \in \mathcal{P}_K} \sum_{i=1}^{N} \left\| x_i - P^\top P x_i \right\|^2

where \mathcal{P}_K is the space of K × d matrices with orthonormal rows. Solving (4) is equivalent to solving:

(5)   \max_{P \in \mathcal{P}_K} \mathrm{tr}\left( P S P^\top \right)

where S = X^\top X. Since S is positive semi-definite, the principal components can be found simply by diagonalizing S as S = U^\top \Lambda U, where U is orthogonal, U U^\top = I_d, and \Lambda is a diagonal matrix of the non-negative eigenvalues of S in descending order of magnitude. The first K principal components comprise the eigenvectors of S corresponding to the K largest eigenvalues: that is, the first K rows of U.

The computational cost of PCA is dominated by constructing the matrix S and diagonalizing it. The former step takes O(N d^2) calculations and the latter costs O(d^3), though if N < d the time complexity can be cut to O(N^2 d + N^3) by working with the N × N matrix X X^\top instead. Thus the overall time complexity for PCA is O(N d^2 + d^3) or O(N^2 d + N^3), depending on whether the sample size is greater than the dimensionality or not.
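
To make the recipe concrete, here is a minimal numpy sketch (our own code and variable names, not taken from the paper) that computes the first K principal components exactly as described: form S = X^T X and keep the eigenvectors with the largest eigenvalues.

import numpy as np

def pca_components(X, K):
    """Return the first K principal components of the N x d data matrix X
    (centre X beforehand for the usual variance-maximizing PCA), as a K x d
    matrix whose rows are the leading eigenvectors of S = X^T X."""
    S = X.T @ X                                # d x d scatter matrix, O(N d^2)
    eigvals, eigvecs = np.linalg.eigh(S)       # S is symmetric PSD, O(d^3)
    order = np.argsort(eigvals)[::-1]          # eigenvalues in descending order
    return eigvecs[:, order[:K]].T

# Example usage on synthetic data: reduce 20 dimensions to 5.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
Xc = X - X.mean(axis=0)                        # mean-centre the data
P = pca_components(Xc, K=5)                    # 5 x 20 projection matrix
Z = Xc @ P.T                                   # 200 x 5 reduced representation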

Remark.

Above we introduced PCA as a linear (lossy) compression approach. PCA as it is commonly used for dimensionality reduction follows the standard practice of running the procedure on the ‘mean-centered’ data. Denote by x̄ = (1/N) \sum_{i=1}^{N} x_i the sample mean and take S = \sum_{i=1}^{N} (x_i - x̄)(x_i - x̄)^\top, that is, S is proportional to the maximum likelihood (consistent) estimator of the covariance matrix. In this case the projection onto the column space of P maximizes the retained total sample variance. For some further perspectives on PCA, we refer the interested reader to [3].

2.2 Margin and Margin Distribution

Let H be a hypothesis class of linear classifiers (separating hyperplane normals). The minimum margin, or just margin, between two separable classes is defined to be the supremum over classifiers h ∈ H of the smallest Euclidean distance between the members of the classes and the separating hyperplane of h. That is, if D_{-1} and D_{+1} represent the distributions of the two separable classes, with labels y ∈ {-1, +1}, then we define the margin to be:

(6)   \gamma := \sup_{h \in H} \; \inf_{y \in \{-1,+1\},\, x \sim D_y} \; y \, \frac{h^\top (x - x_0)}{\|h\|}

where x_0 is any point on the separating hyperplane of h, y is the label of x, and sup and inf are the supremum and infimum operators. For non-separable classes the margin is negative (note that our definition of margin here differs from conventional definitions), so one can either enforce a non-zero ‘soft’ margin by allowing a fraction of observations to be misclassified at training, or, given a fixed classifying hyperplane, consider instead the distribution of the signed distances between the members of the classes and that fixed hyperplane, by defining the margin at the point x with respect to classifier h as:

(7)   \gamma_h(x, y) := y \, \frac{h^\top (x - x_0)}{\|h\|}

where again x_0 is any point on the separating hyperplane. We call the distribution of \gamma_h(x, y) the margin distribution with respect to h. The empirical margin and empirical margin distribution of a sample are defined similarly – for a fixed classifier h and a training set {(x_i, y_i)}_{i=1}^{N} of size N, we define the empirical margin:

(8)   \hat{\gamma}_h := \min_{i = 1, \dots, N} \; y_i \, \frac{h^\top (x_i - x_0)}{\|h\|}

where x_0 represents a point on the separating hyperplane of h, and the empirical margin at a point x_i with respect to h as:

(9)   \hat{\gamma}_h(x_i, y_i) := y_i \, \frac{h^\top (x_i - x_0)}{\|h\|}.

The empirical margin distribution with respect to h is defined as the distribution of the \hat{\gamma}_h(x_i, y_i).
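
As a small illustration (our own code; w and b play the role of the classifier h and its offset in the definitions above), the empirical margins of a labelled sample with respect to a fixed hyperplane are just the signed, label-weighted distances, and the empirical margin is their minimum.

import numpy as np

def empirical_margins(X, y, w, b):
    """Signed distances y_i * (w^T x_i + b) / ||w|| of the labelled points
    (y_i in {-1, +1}) from the fixed hyperplane {x : w^T x + b = 0};
    their empirical distribution is the empirical margin distribution."""
    return y * (X @ w + b) / np.linalg.norm(w)

def empirical_min_margin(X, y, w, b):
    """The empirical (minimum) margin; negative iff some point is misclassified."""
    return empirical_margins(X, y, w, b).min()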

The importance of the margin between classes for classification has been studied extensively; it can be viewed as a dimension-free measure of the difficulty of a classification task. For separable classes, [29] bounds the generalization performance of the support vector machine (SVM) in terms of the empirical margin of the training dataset, and indeed both hard- and soft-margin SVM learn a discriminative hyperplane which maximizes the single-point minimum margin [32]. It has been pointed out that the information in the margin distribution is largely lost in the minimum margin, which depends critically on a small number of the training data points [23]; the margin distribution has no such problem. Thus tighter bounds for the generalization error of classifiers, utilizing the margin distribution as the measure of difficulty of a classification problem, have been obtained in, for example, [8, 24, 23]. Alternative classification algorithms based on the idea of margin distribution optimization have also been proposed [9, 21, 1, 31, 32] and found empirically to outperform SVM on many real datasets. Therefore, the margin distribution provides discriminative information that is crucial for classification. However, the margin and margin distribution can only be evaluated once the classifier has been learned. The dimensionality reduction schemes we introduce in this paper are based on the idea of preserving the margin distribution, but crucially without having access to the classifier.

3 Algorithms

3.1 Motivation

From the definitions above it is clear that optimizing the margin distribution requires a classifying hyperplane from which to measure it, and applying PCA to preserve exactly the sample margin distribution would therefore require a wrapper approach which could be prone to overfitting in small sample conditions. We therefore propose to run PCA on a proxy for the optimal margin distribution to obtain uncorrelated features that approximately preserve the margin distribution, and hence the important discriminative information for classification, without the same risk of overfitting.

Our heuristic argument runs as follows: PCA is the best linear (lossy) compression method when the reconstruction process is also linear. Therefore, the strength of PCA is in preserving the data to which it is applied. For the purpose of dimensionality reduction for linear classification tasks, we are not interested in preserving the data points themselves; we are interested in selecting the features that best discriminate between them. If we know in advance what information is useful for classification, we can extract some structures containing that information from the dataset and run PCA on those structures to obtain the features that best preserve them. If we then reduce the dimension of the original data by projecting onto these features, we expect the discriminative information to be well preserved by the projection. Based on this intuition, we devised four PCA variants for the task of linear classification, which we will evaluate in Sections 4 and 5. For now, we simply present the four heuristic alternatives.

To illustrate the basic ideas, we focus on the two-class classification problem since generalization to the multi-class case is straightforward. Let {(x_i, y_i)}_{i=1}^{N} be a set of labeled training data points, where for convenience we assume x_i is a point in R^d and y_i ∈ {-1, +1} is the class label. The crucial question is how to approximate the margin distribution of the dataset without access to a classifying hyperplane. Now, the margin distribution contains the information about the differences between the data points of the two classes that is meaningful for the classification problem. Therefore, our first PCA variant represents the margin distribution as the differences between the data points of the two classes. Let I_{-1} and I_{+1} be the sets of indices of the data points that belong to class -1 and class +1 respectively, i.e. I_c = {i : y_i = c}.

3.2 Algorithm: M-PCA0

Define the following structures:

(10)   v_{ij} := x_i - x_j, \qquad i \in I_{+1},\; j \in I_{-1}.

Then the structures {v_{ij}} should represent the margin distribution reasonably well. We can then run “uncentered” PCA on these structures to obtain the most significant principal components. These principal components are the features that best represent this proxy for the margin distribution and therefore contain the most discriminative information. We call this scheme M-PCA0. The M-PCA0 algorithm is shown in Figure 1.

M-PCA0
input: data {(x_i, y_i)}_{i=1}^{N}, target dimension K
let I_{-1} = {i : y_i = -1}, I_{+1} = {i : y_i = +1}
let V = {x_i - x_j : i ∈ I_{+1}, j ∈ I_{-1}}
let D = \sum_{v ∈ V} v v^\top
let u_1, …, u_K be the eigenvectors of D
with the K largest eigenvalues
let P = [u_1, …, u_K]^\top
output: P, {P x_i}_{i=1}^{N}
Figure 1: The M-PCA0 Algorithm

It is not hard to see that the size of V is N_{-1} N_{+1}, where N_{-1} = |I_{-1}| and N_{+1} = |I_{+1}|, which is very large for large-sample data. As a result, the M-PCA0 algorithm is more computationally expensive than usual PCA: constructing D costs O(N_{-1} N_{+1} d^2).
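
The following numpy sketch (our own, unoptimized implementation of the scheme described above; function and variable names are not from the paper) makes the construction explicit: all between-class difference vectors are formed and an uncentered PCA is run on them.

import numpy as np

def m_pca0(X, y, K):
    """M-PCA0 sketch: uncentered PCA on all between-class difference vectors.
    X is the (N, d) data matrix, y the (N,) labels in {-1, +1}, K the target
    dimension.  Returns the (K, d) projection matrix and the projected data.
    Note the N_{-1} * N_{+1} pairs make this memory-hungry for large samples."""
    Xneg, Xpos = X[y == -1], X[y == +1]
    V = (Xpos[:, None, :] - Xneg[None, :, :]).reshape(-1, X.shape[1])
    D = V.T @ V                                  # uncentered scatter of the differences
    eigvals, eigvecs = np.linalg.eigh(D)
    P = eigvecs[:, np.argsort(eigvals)[::-1][:K]].T
    return P, X @ P.T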

3.3 Algorithm: M-PCA1a and M-PCA1b

With a little more consideration, it is not hard to come up with a natural alternative to V that resolves this time complexity issue. The Maximum Likelihood (ML), consistent estimates of the class means are

(11)   \hat{\mu}_c = \frac{1}{N_c} \sum_{i \in I_c} x_i, \qquad c \in \{-1, +1\}, \quad N_c = |I_c|.

Define the variables

(12)   v_i := x_i - \hat{\mu}_{-y_i}, \qquad i = 1, \dots, N.

In words, v_i is defined as the difference between x_i and the mean of the other class. Intuitively, v_i captures the same information as the v_{ij} and it can be viewed as a conditioning of the previous problem. The sample size is now N, which is typically much smaller than the size of V. We call the resulting algorithm M-PCA1a.

In the definition (12), the sample means of the two classes are used. Especially in the situation of small-sample data or very imbalanced classes, the quality of the estimates of the class-conditional means is a concern. In this case, replacing the sample means with the sample medoids may be a more robust option (the class-conditional medoid is the vector consisting of the class-conditional sample median of each individual feature). We call the corresponding algorithm M-PCA1b. Both versions of the algorithm are shown in Figure 2. The time complexity of this approach is the same as that of standard PCA.

M-PCA1a (M-PCA1b)
input: data {(x_i, y_i)}_{i=1}^{N}, target dimension K
let I_{-1} = {i : y_i = -1}, I_{+1} = {i : y_i = +1}
let m_{-1} be the mean (medoid) of {x_i : i ∈ I_{-1}}
let m_{+1} be the mean (medoid) of {x_i : i ∈ I_{+1}}
let v_i = x_i - m_{-y_i} for i = 1, …, N
let D = \sum_{i=1}^{N} v_i v_i^\top
let u_1, …, u_K be the eigenvectors of D
with the K largest eigenvalues
let P = [u_1, …, u_K]^\top
output: P, {P x_i}_{i=1}^{N}
Figure 2: The M-PCA1a (M-PCA1b) Algorithm
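
A compact numpy sketch of both variants (again our own code, not the paper's; use_medoid=False gives M-PCA1a with class means, use_medoid=True gives M-PCA1b with coordinate-wise medians):

import numpy as np

def m_pca1(X, y, K, use_medoid=False):
    """M-PCA1a / M-PCA1b sketch: uncentered PCA on each point's difference
    from the mean (or medoid, i.e. coordinate-wise median) of the other class."""
    centre = np.median if use_medoid else np.mean
    c_neg = centre(X[y == -1], axis=0)
    c_pos = centre(X[y == +1], axis=0)
    V = np.where((y == +1)[:, None], X - c_neg, X - c_pos)   # x_i minus the other class's centre
    D = V.T @ V
    eigvals, eigvecs = np.linalg.eigh(D)
    P = eigvecs[:, np.argsort(eigvals)[::-1][:K]].T
    return P, X @ P.T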

3.4 Algorithm: M-PCA2

Our final algorithm, M-PCA2, attempts to capture the margin distribution more faithfully by simulating equation (9) more closely. In this algorithm, we do not use all pairs of data points from the two classes. Instead, we construct the difference vector between each data point and its nearest neighbor in the other class as a proxy for the margin distribution.

M-PCA2
input: data {(x_i, y_i)}_{i=1}^{N}, target dimension K
let I_{-1} = {i : y_i = -1}, I_{+1} = {i : y_i = +1}
let v_i = x_i - x_{n(i)}, such that n(i) = \arg\min_{j \in I_{+1}} \|x_i - x_j\| if y_i = -1,
or n(i) = \arg\min_{j \in I_{-1}} \|x_i - x_j\| if y_i = +1
let D = \sum_{i=1}^{N} v_i v_i^\top
let u_1, …, u_K be the eigenvectors of D
with the K largest eigenvalues
let P = [u_1, …, u_K]^\top
output: P, {P x_i}_{i=1}^{N}
Figure 3: The M-PCA2 Algorithm

Our experimental results show that M-PCA2 works extremely well, especially for small-sample, high-dimensional datasets. It is not hard to see that constructing the v_i takes O(N^2 d) operations with a brute-force nearest-neighbor search, and that the number of the v_i is no more than N. Therefore, the time complexity of M-PCA2 is of the same order as that of vanilla PCA.
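
A brute-force numpy sketch of M-PCA2 (our own code; a KD-tree or similar structure would speed up the nearest-neighbor step for larger samples):

import numpy as np

def m_pca2(X, y, K):
    """M-PCA2 sketch: uncentered PCA on each point's difference from its
    nearest neighbour in the other class (brute-force neighbour search)."""
    Xneg, Xpos = X[y == -1], X[y == +1]
    sq_dists = ((Xpos[:, None, :] - Xneg[None, :, :]) ** 2).sum(-1)   # (N+, N-)
    V_pos = Xpos - Xneg[sq_dists.argmin(axis=1)]   # each +1 point minus its nearest -1 point
    V_neg = Xneg - Xpos[sq_dists.argmin(axis=0)]   # each -1 point minus its nearest +1 point
    V = np.vstack([V_pos, V_neg])
    D = V.T @ V
    eigvals, eigvecs = np.linalg.eigh(D)
    P = eigvecs[:, np.argsort(eigvals)[::-1][:K]].T
    return P, X @ P.T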

4 Theory

Analyzing our schemes in full generality is difficult. To give some insight into how they work, in this section we analyze a toy setting which suggests why our schemes can work better than PCA.

We consider a simple case where the classes have a shared class-conditional covariance matrix, for convenience we assume the class-conditional distributions are multivariate Gaussian, and the difference between the class means is aligned with one eigenvector of the class-conditional covariance matrix. Without loss of generality, we assume the class-conditional covariance matrix is diagonal. Let X_{-1} and X_{+1} be two random variables such that

(13)   X_{-1} \sim N(\mu_{-1}, \Sigma)
(14)   X_{+1} \sim N(\mu_{+1}, \Sigma)
(15)   \Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \dots, \sigma_d^2)

where

(16)   \mu_{-1} = (-\mu, 0, \dots, 0)^\top
(17)   \mu_{+1} = (+\mu, 0, \dots, 0)^\top
(18)   \mu > 0

and where the \sigma_j^2 are not in any particular order. As a result, we have

(19)   \mu_{+1} - \mu_{-1} = 2\mu e_1,

where e_1 is the first standard basis vector. It is clear in this setting that the only feature that is useful for classification is the first coordinate. A successful dimensionality reduction approach for this classification problem would have to include the first coordinate in the selected features. In the case of equal class priors, the optimal separating hyperplane is {x : x_1 = 0} and the corresponding margin distribution is N(\mu, \sigma_1^2). The whole margin distribution would be preserved if an approach includes the first coordinate in the reduced features.

Now suppose we sample N data points x_1, …, x_N from the two classes with proportions p and 1 - p; then PCA works by eigen-decomposing

(20)   \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^\top

where \bar{x} is the sample mean. In expectation the above expression is just the covariance of the mixture variable, which is equal to

(21)   \mathrm{diag}(\sigma_1^2 + 4 p (1-p) \mu^2,\; \sigma_2^2,\; \dots,\; \sigma_d^2).

Suppose the target dimension is K = 1; then for PCA to be successful, \sigma_1^2 + 4 p (1-p) \mu^2 has to be larger than \max_{j \ge 2} \sigma_j^2. In most classification problems the between-class variance is usually larger than the within-class variances, that is, \mu is usually large enough that \sigma_1^2 + 4 p (1-p) \mu^2 is larger than every other \sigma_j^2. This is why PCA can work well for classification tasks even though it is not designed for that purpose. In other situations, PCA may not work well. In particular, here it will either be as good as possible or as bad as possible.

To see how M-PCA0 works, we define the random variable

(22)   v := X_{+1} - X_{-1}

with X_{+1} and X_{-1} independent. M-PCA0 works by eigen-decomposing D = \sum_{v_{ij} \in V} v_{ij} v_{ij}^\top. In expectation (per pair), this matrix is equal to

(23)   E[v v^\top] = 2\Sigma + 4\mu^2 e_1 e_1^\top = \mathrm{diag}(2\sigma_1^2 + 4\mu^2,\; 2\sigma_2^2,\; \dots,\; 2\sigma_d^2).

Since the quantity added to the first diagonal entry is now 4\mu^2 rather than 4 p (1-p) \mu^2, M-PCA0 has a better chance of selecting the first coordinate in the case K = 1. Our experiments indeed show that M-PCA0 usually performs better than PCA.

To see how M-PCA1a and M-PCA1b work, we use similar arguments. First, define the random variables

(24)   v_c := X_c - \mu_{-c}, \qquad c \in \{-1, +1\}.

From the definition, we have

(25)   E[v_{+1}] = -E[v_{-1}] = 2\mu e_1.

M-PCA1a and M-PCA1b work by eigen-decomposing D = \sum_{i=1}^{N} v_i v_i^\top, the (per-sample) expectation of which is equal to

(26)   E[v_c v_c^\top] = \Sigma + 4\mu^2 e_1 e_1^\top = \mathrm{diag}(\sigma_1^2 + 4\mu^2,\; \sigma_2^2,\; \dots,\; \sigma_d^2).

This indicates that M-PCA1a and M-PCA1b have a much better chance than PCA of selecting the first coordinate as the useful feature in the case the target dimension is 1, since the larger quantity 4\mu^2 is added to \sigma_1^2, rather than the 4 p (1-p) \mu^2 added in the case of PCA.
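
A quick simulation in the spirit of this toy setting (our own parameter values, not from the paper: d = 2, class means at ±2 on the first coordinate, unit within-class variance there, and a nuisance second coordinate with standard deviation 3) illustrates the difference: the leading PCA direction aligns with the high-variance nuisance coordinate, while the leading M-PCA1a direction picks out the discriminative first coordinate.

import numpy as np

rng = np.random.default_rng(0)
n, mu, sigma_noise = 500, 2.0, 3.0            # assumed toy values, not from the paper

# Balanced classes: means (+/- mu, 0), shared diagonal covariance diag(1, sigma_noise^2).
Xpos = rng.normal([+mu, 0.0], [1.0, sigma_noise], size=(n, 2))
Xneg = rng.normal([-mu, 0.0], [1.0, sigma_noise], size=(n, 2))
X = np.vstack([Xpos, Xneg])

# Vanilla PCA: top eigenvector of the centred covariance (first entry ~ 1 + mu^2 = 5 < 9).
v_pca = np.linalg.eigh(np.cov(X, rowvar=False))[1][:, -1]

# M-PCA1a: top eigenvector of the scatter of differences from the other class's mean
# (first entry ~ 1 + 4 mu^2 = 17 > 9).
V = np.vstack([Xpos - Xneg.mean(axis=0), Xneg - Xpos.mean(axis=0)])
v_mpca = np.linalg.eigh(V.T @ V)[1][:, -1]

print("PCA top direction    :", np.round(v_pca, 2))    # aligns with the nuisance coordinate
print("M-PCA1a top direction:", np.round(v_mpca, 2))   # aligns with the discriminative coordinate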

Remark.

From the insight obtained above, PCA works poorly in the case of class imbalance, since the more imbalanced the classes, the smaller the between-class term 4 p (1-p) \mu^2 becomes. Our schemes do not suffer from the same problem.

Remark.

Our schemes are much less vulnerable than PCA to a low ratio of between-class to within-class variance, although they can still fail in extreme cases. To make them even less vulnerable we devised the scheme M-PCA2, which is based on our intuitive arguments.

The situation quickly becomes intractable to analysis in the more general case where the difference between the class means, \delta := \mu_{+1} - \mu_{-1}, is not aligned with any eigenvector of the class-conditional covariance matrix, with the other quantities assumed the same as above. Now the matrices to be eigen-decomposed are, in expectation, \Sigma + p(1-p)\,\delta\delta^\top, 2\Sigma + \delta\delta^\top, and \Sigma + \delta\delta^\top for PCA, M-PCA0 and M-PCA1a/M-PCA1b respectively. These are not diagonal and there is no simple analytical form for the eigenvectors. The most discriminative direction in theory is now the FLD direction

(27)   w \propto \Sigma^{-1} (\mu_{+1} - \mu_{-1}),

which is not an eigenvector of the above matrices. We shall see from our experiments in Section 5 that for small target dimensions K, our approach nevertheless outperforms PCA in general and, despite being a (supervised) filter method, is highly competitive with wrapper approaches such as PLS and Lasso.

5 Experiments

In this section, we present empirical results on the performance of our schemes M-PCA0, M-PCA1a, M-PCA1b, and M-PCA2, comparing them to vanilla PCA, PLS, and Lasso in terms of test error. To obtain a comprehensive picture of the performance of the new schemes, we run experiments evaluating several widely used classifiers on a range of publicly available datasets with different characteristics. The classifiers used were Fisher Linear Discriminant (FLD), SVM, Logistic Regression (LR), and Naive Bayes (NB). We use two groups of publicly available real datasets. The first group contains datasets with more data points than features: ionosphere [18], sonar [18], mushrooms [18], and splice [18]. The other group contains several small-sample datasets with fewer data points than features: colon [2], prostate [25], ovarian [22], leukemia [10], leukemia large [10], and duke [30]. Summary information on these datasets is shown in Table 1.

name source #instances (per class) #features
ionosphere [18] 126+225 34
sonar [18] 111+97 60
mushrooms [18] 3916+4208 112
splice [18] 1527+1648 60
colon [2] 22+40 2000
prostate [25] 50+52 6033
ovarian [22] 24+30 1536
leukemia [10] 47+25 3571
leukemia large [10] 47+25 7129
duke [30] 21+23 7129
Table 1: Datasets

5.1 Experiment Setup

In the experiments, we obtain test errors for each combination of a dimensionality reduction scheme, a target dimension K, a classifier, and a dataset. Each combination is evaluated on 50 independent partitions of the dataset into a training set and a test set, where four fifths of the data were used for training and the remainder for testing, and the sampling was stratified to preserve class membership proportions. Hence 50 test errors are produced for each combination, which are then used to compute the mean and the standard deviation of the test errors for that combination. For a particular dataset, the same 50 data splits were used for every combination. We did not use cross-validation since the sign test assumes independent observations.

We choose two representative target dimensions K for each dataset. For the small-sample datasets, the target dimensions are approximately one quarter and one half of r, where r is the rank of the training data matrix, which is roughly four fifths of the number of instances. For the other datasets, where the number of instances exceeds the dimension, the target dimensions are chosen to be approximately one sixth and one third of d, where d is the number of features (original data dimension). We do not include larger target dimensions here: on one hand, small target dimensions are the settings of practical interest; on the other hand, with increased target dimension K, the differences between dimensionality reduction schemes become less obvious. This is due to the fact that as the target dimension approaches the rank of the training data matrix, the number of features retained by any scheme is large enough to incorporate almost all the discriminative information. Our own experiments, not presented here, also bear this out.

For the datasets with more features than instances, the test errors of M-PCA0, M-PCA1a, M-PCA1b, and M-PCA2 are compared with those of PCA, PLS, and Lasso. For the datasets with fewer features than instances, due to the high computational cost of the M-PCA0 scheme, only M-PCA1a, M-PCA1b, and M-PCA2 are run and compared to PCA, PLS, and Lasso. We use the SVM and Logistic Regression implementations provided by liblinear [6]; both the SVM and the logistic regression classifier are regularized. The classifier parameters were selected by cross-validation on the whole original dataset to provide a consistent baseline, even though this may have favored the full wrapper methods since they have access to the tuned classifier. The Naive Bayes model uses a Gaussian class-conditional likelihood model and we used the built-in MATLAB version.

Finally, we run PLS and Lasso to reduce dimensionality for the classification tasks by casting the classification tasks as regression tasks with discrete targets and extracting the important features. The implementations of PLS and Lasso used are the built-in MATLAB functions plsregress and lasso.
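
For readers without MATLAB, the following scikit-learn sketch approximates this setup (our own code and parameter choices, assumed to behave comparably to plsregress and lasso rather than reproducing them exactly): the ±1 labels are treated as regression targets, the PLS component scores serve as the reduced features, and Lasso's nonzero coefficients select features.

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import Lasso

def pls_reduce(X, y, K):
    """PLS as a dimensionality reducer: regress the +/-1 labels on X and
    return the K PLS component scores as the reduced representation."""
    pls = PLSRegression(n_components=K).fit(X, y.astype(float))
    return pls.transform(X)                                   # shape (N, K)

def lasso_reduce(X, y, alpha=0.1, K=None):
    """Lasso as a feature selector: fit an l1-penalised regression on the
    +/-1 labels and keep (at most K of) the features with nonzero weights."""
    coef = Lasso(alpha=alpha).fit(X, y.astype(float)).coef_
    order = np.argsort(np.abs(coef))[::-1]
    keep = order[:K] if K is not None else order
    keep = keep[np.abs(coef[keep]) > 0]
    return X[:, keep]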

5.2 Results

Figure 4: Representative example plot of test errors for PCA-vs-M-PCA1b with SVM classifier for Ionosphere with K = 5.
K Classifier M-PCA1a M-PCA1b M-PCA2
Ionosphere
5 SVM 0.3318 0.6864
LR 0.8569 0.9878
11 SVM
LR
Sonar
10 SVM 0.5 0.9981 0.6911
LR 0.2272 0.9061 0.9061
20 SVM 0.998 0.1215 0.111
LR 0.9988 0.6358 0.2664
Mushrooms
18 SVM 0.7336
LR 0.1856
37 SVM
LR
Splice
10 SVM 1
LR 1
20 SVM 1
LR 1
Table 2: Non-parametric sign test on per-data-split test errors for PCA vs. our variants on the datasets with more instances than features. All p-values correspond to the null hypothesis that the PCA test error is strictly smaller than that of our variant. The boldfaced numbers indicate the statistically significant cases.
K Classifier M-PCA0 M-PCA1a M-PCA1b M-PCA2
Colon
12 SVM 0.25 0.75 0.25 0.3125
LR 0.9648 0.7734 0.1445 0.3036
24 SVM 0.5 0.125 0.5
LR 0.1051
Prostate
20 SVM 0.5 0.212 0.0898
LR 0.0898
40 SVM 0.5 0.2266 0.3633 0.2272
LR 0.5 0.5 0.5 0.1509
Ovarian
10 SVM 0.6762 0.6762 0.968 0.1635
LR 0.2905 0.1316
21 SVM 0.8867 0.8867 0.9961 0.2272
LR 0.8555 0.8555 0.9648
Leukemia
14 SVM 0.5 0.9375 0.875 0.2539
LR 0.3125 0.5 0.1094
28 SVM 1 1 0.5 0.9375
LR 0.5 0.5 0.875 0.6875
Leuk-large
14 SVM 0.1133 0.0547 0.1719 0.5982
LR 0.8867 0.1334 0.1938
28 SVM 0.1875 0.5 0.5 0.3633
LR 0.0625 0.5
Duke
8 SVM 0.073 0.1509 0.1334 0.3318
LR 0.068
17 SVM 0.125 0.125 0.3125 0.3872
LR 0.1875 0.3438 0.2266 0.337
Table 3: Non-parametric sign test on per-data-split test errors for PCA vs. our variants on the small-sample datasets. All p-values correspond to the null hypothesis that the PCA test error is strictly smaller than that of our variant. The boldfaced numbers indicate the statistically significant cases.

The detailed results are given in the appendix, in Tables 4–13. As can be seen from the tables, our schemes are superior to PCA in general and highly competitive with PLS and Lasso. This indicates that our schemes are indeed able to select the most discriminative features. Due to the large variance of the test errors (this is more obvious for the small-sample datasets, owing to the smaller number of observations and the fact that the test error is very sensitive to different partitions of the original datasets), it appears initially that our schemes are not statistically significantly better than PCA; however, these averages are taken across the different partitions of the datasets, and performance on the individual data splits is highly variable. To see whether our approach indeed significantly outperforms PCA, we compare test errors on independent data splits and use a non-parametric sign test to evaluate the null hypothesis: see Figure 4, in which each test error corresponds to an independent data split. We see that for this dataset M-PCA1b is at least as good as PCA on every individual split, but the high between-sample variance in the test error masks this fact if we only consider the average test error across data splits. We tabulate the sign test p-values of our algorithms versus PCA in Tables 2 and 3; here we only show results for SVM and Logistic Regression. The boldfaced numbers indicate the statistically significant cases, and we have made no correction to these values for multiple comparisons since the smallest p-values are small enough to survive such a correction in any case. We see that our approach does significantly outperform PCA, especially when the number of retained dimensions is small.
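
The per-split comparison can be reproduced with a one-sided sign test; below is a minimal sketch using scipy (our own implementation of the procedure as we have described it, with tied splits discarded).

import numpy as np
from scipy.stats import binomtest

def sign_test(err_pca, err_variant):
    """One-sided sign test on paired per-split test errors: small p-values
    indicate the variant beats PCA on significantly more than half of the
    (non-tied) splits.  Both inputs are arrays over the same data splits."""
    diff = np.asarray(err_pca) - np.asarray(err_variant)
    wins = int(np.sum(diff > 0))       # splits where the variant has lower error
    losses = int(np.sum(diff < 0))     # splits where PCA has lower error
    if wins + losses == 0:             # all splits tied
        return 1.0
    return binomtest(wins, wins + losses, p=0.5, alternative='greater').pvalue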

6 Discussion

We presented four simple filter PCA variants, each of which seems to improve classification performance on the projected data over standard PCA. Our ideas in this paper are primarily based on the observation that the margin distribution contains most of the discriminative information for a classification task, and that the PCA objective is not obviously aligned with preserving this quantity. Therefore we proposed four heuristic structures to represent the margin distribution, on which we then perform PCA. Extensive empirical evaluations suggest these do indeed represent the margin distribution well. Whether there are better structures for representing the margin distribution that can be evaluated outside of a wrapper approach is an interesting question for future research; any such structures should give a better dimensionality reduction scheme for linear classification. Whether theoretical guarantees on classification performance using our approach are possible looks like a difficult open problem: the main hurdle is how to provide typical-case guarantees for a deterministic algorithm without imposing restrictive conditions on the data generator. A further direction for future research is to consider non-linear dimensionality reduction schemes, such as kernel PCA, in a similar light.

Appendix A Detailed Experimental Results

In the tables below, the test errors (mean ± standard deviation) are shown in percent, and the boldfaced numbers represent the best performing schemes.

K Scheme SVM LR FLD NB
5 PCA 26.73.0 26.42.8 17.03.7
PLS 23.53.0 23.43.2 20.75.1
Lasso 27.72.6 27.02.8 20.03.9 13.23.7
M-PCA1a 26.63.0 26.52.8 16.43.8 17.24.3
M-PCA1b 16.73.6 17.54.3
M-PCA2 26.92.4 27.12.6 21.34.0 18.84.7
11 PCA 25.32.6 24.42.7 18.43.6 10.43.7
PLS 20.13.7 19.53.3 14.84.3
Lasso 20.23.4
M-PCA1a 24.72.9 24.12.7 18.53.7 11.23.9
M-PCA1b 21.12.9 20.43.2 18.03.8 11.13.9
M-PCA2 23.53.2 23.22.9 20.73.6 15.25.0
Table 4: Ionosphere Test Errors
K Scheme SVM LR FLD NB
10 PCA 26.85.7 25.96.2 24.46.7
PLS 30.27.0 29.06.5 25.66.8 37.58.1
Lasso 30.95.9 27.86.3 24.66.2 32.86.5
M-PCA1a 24.36.5
M-PCA1b 27.85.5 26.56.1 24.46.6 24.96.3
M-PCA2 27.65.7 26.16.0 23.86.6 24.86.9
20 PCA 27.55.7 25.86.7
PLS 27.66.2 27.95.4 28.26.1 38.58.4
Lasso 30.15.5 29.37.0 27.15.4 32.17.2
M-PCA1a 28.35.7 26.76.2 23.87.0 26.06.0
M-PCA1b 27.15.9 26.06.2 23.76.7 28.28.0
M-PCA2 24.06.1 27.28.2
Table 5: Sonar Test Errors
K Scheme SVM LR FLD NB
18 PCA 3.20.3 3.10.4 3.80.5 11.40.5
PLS 0.50.3 0.70.4 11.90.7
Lasso 1.61.1
M-PCA1a 2.80.4 2.70.4 3.50.5 11.70.6
M-PCA1b 3.20.3 3.00.4 3.10.4 11.40.6
M-PCA2 1.50.4 1.70.3 2.70.4 10.70.9
37 PCA 2.00.3 2.00.3 3.00.5 10.00.7
PLS 0.20.1 0.70.6 12.10.8
Lasso 1.30.8
M-PCA1a 1.80.3 1.90.3 2.80.5 10.80.6
M-PCA1b 1.30.3 2.00.3 3.10.5 8.50.6
M-PCA2 0.30.1 0.40.2 1.70.5 11.10.9
Table 6: Mushrooms Test Errors
K Scheme SVM LR FLD NB
10 PCA 18.21.3 18.21.1 17.91.3 16.11.3
PLS 22.41.5
Lasso 18.41.3 18.51.9 18.01.6 15.11.6
M-PCA1a 16.51.3 16.41.2 16.01.2 18.11.5
M-PCA1b 22.01.4 22.31.6 21.41.8 22.21.7
M-PCA2 17.01.3 16.81.3 16.21.2
20 PCA 17.91.3 17.71.0 17.11.5 16.01.5
PLS 15.41.2 22.31.6
Lasso 16.31.4 16.21.3
M-PCA1a 16.81.3 16.41.2 15.91.3 18.01.6
M-PCA1b 21.01.3 21.11.3 20.01.8 21.01.7
M-PCA2 17.01.3 16.81.3 16.21.3 14.31.2
Table 7: Splice Test Errors
K Scheme SVM LR FLD NB
12 PCA 11.79.1 13.58.9 17.58.6
PLS 14.09.3 13.28.9 14.77.3 17.88.6
Lasso 21.211.9 13.28.9 22.28.8
M-PCA0 11.39.2 14.09.0 12.28.1 21.89.8
M-PCA1a 11.79.1 13.79.3 12.27.9 15.88.5
M-PCA1b 12.28.5 16.38.6
M-PCA2 11.39.2 12.89.1 12.78.6 14.38.4
24 PCA 12.89.4 14.89.1 13.78.2 20.29.5
PLS 15.39.7 14.59.8 14.07.6 21.58.6
Lasso 23.511.0 15.28.7 22.28.8
M-PCA0 12.59.1 20.08.7
M-PCA1a 12.39.4 13.09.8 13.38.1 15.78.4
M-PCA1b 12.79.6 13.09.4 14.07.6 15.58.4
M-PCA2 13.88.8 13.87.3 14.08.5
Table 8: Colon Test Errors
K Scheme SVM LR FLD NB
20 PCA 10.15.2 10.46.9 9.04.8 33.912.8
PLS 9.25.1 8.26.1 8.54.8 43.97.8
Lasso 9.36.1 8.35.9 10.35.6
M-PCA0 9.95.3 9.86.8 8.85.0 39.79.9
M-PCA1a 9.44.7 9.26.6 8.75.0 42.78.8
M-PCA1b 9.14.8 8.95.9 8.05.1 35.09.0
M-PCA2 30.311.0
40 PCA 8.34.7 8.06.1 7.54.7 42.18.9
PLS 9.34.8 8.46.1 8.54.8 46.55.4
Lasso 10.05.4
M-PCA0 8.24.8 7.96.0 7.24.6 45.06.7
M-PCA1a 8.04.7 7.96.0 47.35.4
M-PCA1b 8.14.9 7.95.5 7.44.9 42.77.8
M-PCA2 7.95.2 7.46.4 7.25.2 33.411.3
Table 9: Prostate Test Errors
K Scheme SVM LR FLD NB
10 PCA 18.510.9 23.110.6 26.011.6 45.810.4
PLS 19.110.7 17.38.9 48.411.5
Lasso 22.513.3 26.710.5 27.811.4
M-PCA0 18.510.4 22.410.6 23.611.2 45.313.6
M-PCA1a 18.410.1 20.710.9 22.212.2 47.611.8
M-PCA1b 20.210.6 21.811.6 22.011.2 47.311.8
M-PCA2 20.410.2 33.612.5
21 PCA 18.710.5 19.19.6 20.010.1 42.511.4
PLS 50.48.3
Lasso 18.511.9 24.511.7 24.913.3
M-PCA0 19.310.3 19.59.2 19.110.9 44.211.8
M-PCA1a 19.311.1 19.59.7 18.911.1 45.111.6
M-PCA1b 19.811.0 19.89.3 21.311.4 45.811.3
M-PCA2 18.010.1 16.09.5 19.612.0 29.312.4
Table 10: Ovarian Test Errors
K Scheme SVM LR FLD NB
14 PCA 3.13.6 3.43.9 2.03.2 2.43.7
PLS 4.04.1 3.14.1 2.93.5 3.44.8
Lasso 5.94.9 4.15.0 5.06.2 4.45.2
M-PCA0 3.03.6 3.14.1
M-PCA1a 3.43.6 3.34.1 2.03.2 2.43.4
M-PCA1b 3.33.6 2.94.1 2.03.2 2.63.5
M-PCA2 2.13.3 2.74.1
28 PCA 2.43.4 2.44.0 2.03.2 2.94.6
PLS 4.74.2 3.34.4 2.93.5 3.64.8
Lasso 6.05.3 4.64.9 6.06.7 4.34.8
M-PCA0 2.93.5 1.93.2 4.75.7
M-PCA1a 2.43.4 2.33.9 2.03.2
M-PCA1b 2.64.0 2.33.4
M-PCA2 2.73.5 2.44.0 2.13.3 3.03.8
Table 11: Leukemia Test Errors
K Scheme SVM LR FLD NB
14 PCA 5.75.6 6.14.8 3.94.4 11.07.2
PLS 4.74.5 5.05.4 3.95.6 32.613.6
Lasso 8.97.2
M-PCA0 4.74.9 6.35.5 28.47.7
M-PCA1a 4.64.7 5.44.9 3.44.4 21.410.6
M-PCA1b 4.95.1 5.35.2 4.05.2 18.69.9
M-PCA2 5.35.0 4.45.2 3.65.6 26.48.2
28 PCA 4.74.7 5.34.7 2.94.6 16.39.7
PLS 4.44.5 4.44.8 3.95.6 30.712.2
Lasso 8.16.9
M-PCA0 4.14.4 4.64.9 33.64.4
M-PCA1a 4.64.5 4.64.7 3.14.6 25.78.9
M-PCA1b 4.64.5 4.34.6 3.14.6 24.78.9
M-PCA2 4.44.8 5.15.4 3.75.4 16.111.7
Table 12: Leuk-large Test Errors
K Scheme SVM LR FLD NB
8 PCA 14.911.6 16.412.1 16.211.7 22.015.5
PLS 42.913.1
Lasso 18.711.1 19.311.6 18.012.9 20.412.4
M-PCA0 12.99.6 13.811.6 13.810.9
M-PCA1a 13.19.2 13.311.0 12.010.2 21.816.0
M-PCA1b 13.18.9 13.111.6 12.410.7 20.215.8
M-PCA2 14.29.5 12.912.4 14.211.5 31.615.3
17 PCA 13.88.9 12.910.8 13.111.6 27.817.4
PLS 49.812.9
Lasso 16.211.0 16.410.8 18.010.3
M-PCA0 13.18.9 12.210.1 12.411.1 21.813.6
M-PCA1a 13.18.9 12.49.9 12.011.0 33.316.9
M-PCA1b 13.39.0 12.210.1 12.911.1 27.318.1
M-PCA2 12.79.8 12.210.8 12.79.5 38.916.4
Table 13: Duke Test Errors

Acknowledgement

XL is supported by an internal study award at the University of Waikato.

References

  • [1] Fabio Aiolli, Giovanni Da San Martino, and Alessandro Sperduti. A Kernel Method for the Optimization of the Margin Distribution, pages 305–314. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008.
  • [2] U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences of the United States of America, 96(12):6745–6750, 1999.
  • [3] Tijl De Bie, Nello Cristianini, and Roman Rosipal. Eigenproblems in pattern recognition. In Handbook of Geometric Computing, pages 129–167. Springer Berlin Heidelberg, 2005.
  • [4] Richard G Brereton and Gavin R Lloyd. Partial least squares discriminant analysis: taking the magic away. Journal of Chemometrics, 28(4):213–225, 2014.
  • [5] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–451, 2004.
  • [6] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
  • [7] Imola K Fodor. A survey of dimension reduction techniques, 2002.
  • [8] Yoav Freund and Robert E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277–296, 1999.
  • [9] Ashutosh Garg and Dan Roth. Margin Distribution and Learning Algorithms. In Proceedings of the Twentieth International Conference on Machine Learning, pages 210–217, 2003.
  • [10] T. R. Golub. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, 286(5439):531–537, 1999.
  • [11] Stefan Haufe, Frank Meinecke, Kai Görgen, Sven Dähne, John-Dylan Haynes, Benjamin Blankertz, and Felix Bießmann. On the interpretation of weight vectors of linear models in multivariate neuroimaging. Neuroimage, 87:96–110, 2014.
  • [12] R Carter Hill, Thomas B Fomby, and S R Johnson. Component selection norms for principal components regression. Communications in StatisticsTheory and Methods, 6(4):309–334, 1977.
  • [13] Tom Howley, Michael G Madden, Marie-Louise O’Connell, and Alan G Ryder. The effect of principal component analysis on machine learning accuracy with high-dimensional spectral data. Knowledge-Based Systems, 19(5):363–370, 2006.
  • [14] Andreas Janecek, Wilfried Gansterer, M. A. Demel, and G. F. Ecker. On the relationship between feature selection and classification accuracy. In Journal of Machine Learning Research Workshop and Conference Proceedings 4, pages 90–105, 2008.
  • [15] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag New York, 2nd edition, 2002.
  • [16] Ian T. Jolliffe. A Note on the Use of Principal Components in Regression. Journal of the Royal Statistical Society. Series C, 31(3):300–303, 1982.
  • [17] Nikos Karampatziakis and Paul Mineiro. Discriminative Features via Generalized Eigenvectors. Proceedings of The 31st International Conference on Machine Learning, pages 494–502, 2014.
  • [18] M. Lichman. UCI Machine Learning Repository, 2013.
  • [19] Aleix M. Martinez and Avinash C. Kak. PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2):228–233, 2001.
  • [20] Mykola Pechenizkiy, Alexey Tsymbal, and Seppo Puuronen. PCA-based feature transformation for classification: issues in medical diagnostics. In Proceedings of the 17th IEEE Symposium on Computer-Based Medical Systems, pages 535–540, 2004.
  • [21] Kristiaan Pelckmans, Johan Suykens, and Bart D Moor. A Risk Minimization Principle for a Class of Parzen Estimators. In J C Platt, D Koller, Y Singer, and S T Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1137–1144. Curran Associates, Inc., 2008.
  • [22] Michèl Schummer, WaiLap V. Ng, Roger E. Bumgarner, Peter S. Nelson, Bernhard Schummer, David W. Bednarski, Laurie Hassell, Rae Lynn Baldwin, Beth Y. Karlan, and Leroy Hood. Comparative hybridization of an array of 21 500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene, 238(2):375–385, 1999.
  • [23] J. Shawe-Taylor and N. Cristianini. Further Results on the Margin Distribution. In Proc. 12th Annu. Conf. on Comput. Learning Theory, pages 278–285. ACM Press, 1999.
  • [24] J. Shawe-Taylor and N. Cristianini. Margin distribution bounds on generalization. In Paul Fischer and Hans Ulrich Simon, editors, Computational Learning Theory: 4th European Conference, EuroCOLT’99, Nordkirchen, Germany, March 29–31, 1999, Proceedings, pages 263–273. Springer Berlin Heidelberg, Berlin, Heidelberg, 1999.
  • [25] Dinesh Singh, Phillip G. Febbo, Kenneth Ross, Donald G. Jackson, Judith Manola, Christine Ladd, Pablo Tamayo, Andrew A. Renshaw, Anthony V. D’Amico, Jerome P. Richie, Eric S. Lander, Massimo Loda, Philip W. Kantoff, Todd R. Golub, and William R. Sellers. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2):203–209, 2002.
  • [26] Robert Tibshirani. Regression Selection and Shrinkage via the Lasso. Journal of the Royal Statistical Society B, 58(1):267–288, 1996.
  • [27] Matthew Turk and Alex Pentland. Eigenfaces for recognition. Journal of cognitive neuroscience, 3(1):71–86, 1991.
  • [28] Matthew A Turk and Alex P Pentland. Face recognition using eigenfaces. In Computer Vision and Pattern Recognition, 1991. Proceedings CVPR’91., IEEE Computer Society Conference on, pages 586–591. IEEE, 1991.
  • [29] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag New York, 2nd edition, 2000.
  • [30] M West, C Blanchette, H Dressman, E Huang, S Ishida, R Spang, H Zuzan, J A Olson, J R Marks, and J R Nevins. Predicting the clinical status of human breast cancer by using gene expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 98(20):11462–7, 2001.
  • [31] Teng Zhang and Zhi-Hua Zhou. Large margin distribution machine. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’14, pages 313–322, 2014.
  • [32] Teng Zhang and Zhi-Hua Zhou. Optimal Margin Distribution Machine. arXiv preprint arXiv:1604.03348, 2016.
  • [33] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld. Face recognition: A literature survey. ACM Computing Surveys, 35(4):399–458, 2003.