1 Background and Introduction
Dimensionality reduction techniques are a core part of the statistics and machine learning toolbox, widely used in predictive modeling to improve a range of measures including generalization performance, interpretability or identifiability of models, and the time and space complexity of learning or prediction. There are many such techniques, see e.g.
[7] for a survey, but chief amongst them are simple linear techniques, and the most widely used of these in practice is probably Principal Components Analysis (PCA)
[15] and its variants. Applications of PCA include [27, 28, 33]. PCA works as follows: suppose we have a data matrix $X$ of $N$ $d$-dimensional observations. PCA linearly projects the original $d$-dimensional data onto $K$ uncorrelated (orthogonal) directions – the first $K$ 'Principal Components' – where typically $K \ll d$. Denote by $P$ the $d \times K$ matrix with the Principal Components as columns; then the Principal Components are chosen to maximize the orthogonal projection of the dataset onto the column space of $P$, that is, $P$ is chosen to satisfy:
$$P = \operatorname*{argmax}_{P \in \mathbb{R}^{d \times K}:\; P^\top P = I_K} \; \sum_{i=1}^{N} \bigl\| P P^\top x_i \bigr\|^2 \qquad (1)$$
As a result, PCA gives the best $K$-dimensional representation of the original
$d$-dimensional data in the least-squares sense or, equivalently, if the data are centered, using PCA means that (for a fixed dataset) we discard the smallest amount of the total sample variation of any linear dimensionality reduction scheme. For data analysis tasks other than Ordinary Least Squares regression (OLS), using PCA is a heuristic which frequently works well but, as noted in
[11, 20, 16, 12], it can work very poorly (even for a linear regression task) in some natural scenarios. For the task of classification, PCA often seems to work very well experimentally [13, 14], but it is trivial to construct examples for which PCA will work badly in a classification task (for example, when the most discriminative features have small between-class variance relative to the variance of the whole dataset, or relative to the within-class variance). In other words, since the PCA objective is disconnected from the classification task at hand and, in particular, does not take account of the class structure inherent in the problem, vanilla PCA is prone to underfitting. In supervised settings where class labels are present, it would be a waste not to use the label information when selecting useful features for classification; we therefore propose a (weakly) supervised variant of PCA. There are a few existing supervised dimensionality reduction approaches, of which the most closely related to PCA is Fisher Linear Discriminant (FLD). FLD picks out the most useful feature for classification by maximizing the ratio of between-class variance to within-class variance, and in theory it should outperform PCA for classification tasks. However, as pointed out in
[19], PCA often outperforms FLD in practice, especially for small datasets, because the between-class and within-class variances are poorly estimated. Another important and widely used supervised approach is the family of wrapper methods based on partial least squares (PLS), see e.g.
[4] and references therein. These take advantage of known structure in the classification problem by rotating the uncorrelated components to minimize the least-squares error on the predicted labels. While such wrapper approaches are frequently successful in practice, it is well known that they need careful tuning to avoid overfitting, especially when the sample size is small [4]. More recently, a discriminative feature extraction method inspired by PCA was proposed in
[17]. This method picks out the directions along which two classes differ most in their class-conditional second moments. However, it also suffers from the inaccurate second-moment estimation problem that plagues FLD on small, high-dimensional datasets. Finally, the
$\ell_1$-regularization and sparse model selection method Lasso [26], widely used in regression, can also be used for dimensionality reduction in classification tasks, as can $\ell_1$-regularized SVM; LARS [5] is another similar approach.
In this paper we describe and empirically validate a PCA variant that attempts to address both issues: in particular, we choose a projection that maximally preserves (a proxy for) the sample margin distribution between two classes. Our approach is very simple to implement and requires changing only a few lines of code relative to vanilla PCA but, as far as we are aware, it is neither previously published nor folklore. More importantly, it has the same time and space complexity as PCA, targets a sensible objective (for classification) without being a wrapper method, has a clear interpretation in terms of the problem at hand, and in our experience it works unreasonably well: our experimental outcomes are typically better than vanilla PCA, frequently significantly better, never significantly worse, and also competitive with PLS and Lasso.
The remainder of this paper is structured as follows. In the next section we describe PCA more precisely and recall that the principal components are the solution to a straightforward eigenvector problem. Next we discuss the importance of the margin distribution as a measure of the difficulty of a classification problem, which motivates altering the PCA objective in order to maximally preserve the sample margin distribution. Then we introduce several schemes based on this idea and provide some theoretical intuition for how they work. We present extensive experiments on several datasets which demonstrate the utility of our approach and compare its performance to vanilla PCA, PLS, and Lasso; we use a simple non-parametric sign test to demonstrate the statistically significant superiority of our approach over vanilla PCA. Finally we summarize our findings and discuss some possible future directions for this research.
2 Preliminaries
Our main contribution is a novel PCA variant that aims to preserve the margin distribution, and in this section we therefore briefly review vanilla PCA, minimum margin and the margin distribution.
2.1 Principal Component Analysis
Given $N$ points $x_1, \dots, x_N$ in $\mathbb{R}^d$ and a target dimension $K < d$, PCA finds a linear projection $f : \mathbb{R}^d \to \mathbb{R}^K$ and a linear embedding $g : \mathbb{R}^K \to \mathbb{R}^d$ such that
$$\sum_{i=1}^{N} \bigl\| x_i - g(f(x_i)) \bigr\|^2 \qquad (2)$$
is minimized. In words, PCA finds the linear projection that best preserves the data in the least-squares sense when the reconstruction is also a linear map. Representing $f$ and $g$ by matrices $P^\top$ and $P$ respectively, it is easy to show that the optimal solution satisfies:
$$f(x) = P^\top x, \qquad g(z) = P z, \qquad P^\top P = I_K. \qquad (3)$$
Therefore, PCA amounts to solving the optimization problem:
$$P = \operatorname*{argmax}_{P \in \mathcal{P}} \; \sum_{i=1}^{N} \bigl\| P^\top x_i \bigr\|^2 \qquad (4)$$
where $\mathcal{P}$ is the space of $d \times K$ matrices with orthonormal columns. Solving (4) is equivalent to solving:
$$P = \operatorname*{argmax}_{P \in \mathcal{P}} \; \operatorname{tr}\bigl( P^\top S P \bigr) \qquad (5)$$
where $S = \sum_{i=1}^{N} x_i x_i^\top = X^\top X$. Since $S$ is positive semi-definite, the principal components can be found simply by diagonalizing $S$ as $S = U^\top \Lambda U$, where $U$ is orthogonal, $U U^\top = U^\top U = I_d$, and
$\Lambda$ is a diagonal matrix of the non-negative eigenvalues of $S$
in descending order of magnitude. The first $K$ principal components comprise the eigenvectors of $S$ corresponding to the $K$ largest eigenvalues: that is, the first $K$ rows of $U$. The computational cost of PCA is dominated by constructing the matrix $S$ and diagonalizing it. The former step takes $O(Nd^2)$ calculations and the latter costs $O(d^3)$, though if $N < d$ the time complexity can be cut to $O(N^2 d + N^3)$ by working with the $N \times N$ Gram matrix $X X^\top$ instead. Thus the overall time complexity of PCA is $O(Nd^2 + d^3)$ or $O(N^2 d + N^3)$, depending on whether the sample size is greater than the dimensionality or not.
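For concreteness, here is a minimal numpy sketch (not the authors' code) of this eigendecomposition route to the principal components; the rows of `X` are assumed to hold the observations:

```python
import numpy as np

def pca_components(X, K):
    """Top-K principal components (columns of P) of S = X^T X;
    rows of X are observations."""
    S = X.T @ X                                   # d x d second-moment matrix, O(N d^2)
    eigvals, eigvecs = np.linalg.eigh(S)          # eigenvalues in ascending order, O(d^3)
    P = eigvecs[:, np.argsort(eigvals)[::-1][:K]] # eigenvectors of the K largest eigenvalues
    return P

# "Vanilla" (centered) PCA subtracts the sample mean first:
# P = pca_components(X - X.mean(axis=0), K); X_reduced = X @ P
```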
Remark.
Above we introduced PCA as a linear (lossy) compression approach. The familiar use of PCA as a dimensionality reduction method follows the common practice of running it on mean-centered data: denote by $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$ the sample mean and take $S = \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^\top$, that is, $S$ is proportional to the maximum-likelihood (consistent) estimator of the covariance matrix. In this case the projection onto the column space of $P$ maximizes the retained total sample variance. For some further perspectives on PCA, we refer the interested reader to [3].
2.2 Margin and Margin Distribution
Let $\mathcal{H}$
be a hypothesis class of linear classifiers (separating hyperplane normals). The
minimum margin, or simply margin, between two separable classes is defined to be the supremum over classifiers $h \in \mathcal{H}$ of the smallest Euclidean distance between the members of the classes and the separating hyperplane of $h$. That is, if $\mathcal{D}$ represents the distribution of the two separable classes, with label $y \in \{-1, +1\}$ attached to each point $x$, then we define the margin to be:
$$\gamma = \sup_{h \in \mathcal{H}} \; \inf_{(x, y) \sim \mathcal{D}} \; \frac{y \, h^\top (x - x_0)}{\| h \|} \qquad (6)$$
where $x_0$ is any point on the separating hyperplane of $h$, $y$ is the label of $x$, and $\sup$ and $\inf$ are the supremum and infimum operators. For non-separable classes the margin is negative (our definition of margin here differs from conventional definitions), so one can either enforce a non-zero 'soft' margin by allowing a fraction of observations to be misclassified at training or, given a fixed classifying hyperplane, consider instead the distribution of the signed distances between the members of the classes and that fixed hyperplane, by defining the margin at the point $x$ with respect to classifier $h$ by:
$$\gamma_h(x) = \frac{y \, h^\top (x - x_0)}{\| h \|} \qquad (7)$$
where again $x_0$ is any point on the separating hyperplane. We call the distribution of $\gamma_h(x)$ the margin distribution with respect to $h$. The empirical margin and empirical margin distribution of a sample are defined similarly: for a fixed classifier $h$ and a training set $\{(x_i, y_i)\}_{i=1}^{N}$ of size $N$, we define the empirical margin:
$$\hat{\gamma}_h = \min_{i = 1, \dots, N} \; \frac{y_i \, h^\top (x_i - x_0)}{\| h \|} \qquad (8)$$
where $x_0$ represents a point on the separating hyperplane of $h$, and the empirical margin at a point $x_i$ with respect to $h$ as:
$$\hat{\gamma}_h(x_i) = \frac{y_i \, h^\top (x_i - x_0)}{\| h \|} \qquad (9)$$
The empirical margin distribution with respect to $h$ is defined as the empirical distribution of $\{\hat{\gamma}_h(x_i)\}_{i=1}^{N}$.
The importance of the margin between classes for classification has been studied extensively, and it can be viewed as a dimension-free measure of the difficulty of a classification task. For separable classes, [29]
bounds the generalization performance of the support vector machine (SVM) in terms of the empirical margin of the training dataset, and indeed both hard- and soft-margin SVM learn a discriminative hyperplane which maximizes the single-point minimum margin
[32]. It has been pointed out that the information in the margin distribution is largely lost in the minimum margin, which depends critically on a small number of the training data points [23]; the margin distribution has no such problem. Thus tighter bounds for the generalization error of classifiers, utilizing the margin distribution as the measure of difficulty of a classification problem, have been obtained in, for example, [8, 24, 23]. Alternative classification algorithms based on the idea of margin distribution optimization have also been proposed [9, 21, 1, 31, 32] and found empirically to outperform SVM on many real datasets. Therefore, the margin distribution provides discriminative information that is crucial for classification. However, the margin and margin distribution can only be evaluated once the classifier has been learned. The dimensionality reduction schemes we introduce in this paper are based on the idea of preserving the margin distribution, but crucially without having access to the classifier.
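For concreteness, a minimal numpy sketch of the empirical margin distribution of equation (9), assuming a hyperplane parameterized as $h^\top x + b = 0$ (the parameter names here are illustrative, not taken from the paper):

```python
import numpy as np

def empirical_margins(X, y, h, b):
    """Signed distances of the points (rows of X, labels y in {-1,+1}) to the
    hyperplane h^T x + b = 0, i.e. equation (9) with x0 any point satisfying
    h^T x0 = -b.  The minimum of the returned array is the empirical margin (8)."""
    return y * (X @ h + b) / np.linalg.norm(h)
```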
3 Algorithms
3.1 Motivation
From the definitions above it is clear that measuring the margin distribution requires a classifying hyperplane from which to measure it, so applying PCA to preserve the sample margin distribution exactly would require a wrapper approach, which could be prone to overfitting in small-sample conditions. We therefore propose to run PCA on a proxy for the margin distribution, obtaining uncorrelated features that approximately preserve the margin distribution, and hence the important discriminative information for classification, without the same risk of overfitting.
Our heuristic argument runs as follows. PCA is the best linear (lossy) compression method when the reconstruction process is also linear, so its strength is in preserving the data to which it is applied. For the purpose of dimensionality reduction for linear classification, we are not interested in preserving the data points themselves; we are interested in selecting the features that best discriminate between them. If we know in advance what information is useful for classification, we can extract structures containing that information from the dataset and run PCA on those structures to obtain the features that best preserve them. If we then reduce the dimension of the original data by projecting onto these features, we expect the discriminative information to be well preserved by the projection. Based on this intuition, we devised four PCA variants for the task of linear classification, which we analyze and evaluate in Sections 4 and 5. For now, we simply present the four heuristic alternatives.
To illustrate the basic ideas, we focus on the two-class classification problem, since generalization to the multi-class case is straightforward. Let $\{(x_i, y_i)\}_{i=1}^{N}$ be a set of labeled training data points, where for convenience we assume $x_i$ is a point in $\mathbb{R}^d$ and $y_i \in \{-1, +1\}$ is the class label. The crucial question is how to approximate the margin distribution of the dataset without access to a classifying hyperplane. The margin distribution contains the information about the differences between the data points of the two classes that is meaningful for the classification problem. Therefore, our first PCA variant represents the margin distribution by the differences between the data points of the two classes. Let $I_{-1}$ and $I_{+1}$ be the sets of indices of the data points that belong to class $-1$ and class $+1$ respectively, i.e. $I_y = \{ i : y_i = y \}$.
3.2 Algorithm: M-PCA0
Define the following structures
$$Z = \{\, z_{ij} = x_i - x_j \;:\; i \in I_{+1},\ j \in I_{-1} \,\} \qquad (10)$$
These structures should represent the margin distribution reasonably well. We can then run "uncentered" PCA on them to obtain the most significant principal components. These principal components are the features that best represent this proxy for the margin distribution and therefore contain the most discriminative information. We call this scheme M-PCA0; the algorithm is shown in Figure 1.
M-PCA0
input: training set $\{(x_i, y_i)\}_{i=1}^{N}$, target dimension $K$
let $Z = \{\, x_i - x_j \;:\; i \in I_{+1},\ j \in I_{-1} \,\}$
let $S = \sum_{z \in Z} z z^\top$
let $P$ be the $d \times K$ matrix whose columns are the eigenvectors of $S$
with the $K$ largest eigenvalues
output: $P$, the projected data $\{ P^\top x_i \}_{i=1}^{N}$
Figure 1: The M-PCA0 algorithm.
It is not hard to see that the size of $Z$ is $N_{+1} N_{-1}$, where $N_{+1} = |I_{+1}|$ and $N_{-1} = |I_{-1}|$, which is very large for large-sample data. As a result, the M-PCA0 algorithm is more computationally expensive than standard PCA: its time complexity is $O(N_{+1} N_{-1} d^2)$.
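A minimal numpy sketch of M-PCA0 as described above (an illustration under our reading of Figure 1, not the authors' implementation):

```python
import numpy as np

def m_pca0(X, y, K):
    """Sketch of M-PCA0: "uncentered" PCA on all between-class differences.
    X: (N, d) data matrix, y: labels in {-1, +1}, K: target dimension."""
    Xp, Xm = X[y == +1], X[y == -1]
    # all pairwise differences x_i - x_j, i in class +1, j in class -1
    Z = (Xp[:, None, :] - Xm[None, :, :]).reshape(-1, X.shape[1])
    S = Z.T @ Z                                    # second-moment matrix of the differences
    eigvals, eigvecs = np.linalg.eigh(S)
    P = eigvecs[:, np.argsort(eigvals)[::-1][:K]]
    return P, X @ P                                # components and projected data
```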
3.3 Algorithm: M-PCA1a and M-PCA1b
With a little more thought, it is not hard to come up with a natural alternative to $Z$ that resolves this time-complexity issue. The maximum likelihood (ML), consistent estimates of the class means are
$$\hat{\mu}_{+1} = \frac{1}{N_{+1}} \sum_{i \in I_{+1}} x_i, \qquad \hat{\mu}_{-1} = \frac{1}{N_{-1}} \sum_{i \in I_{-1}} x_i \qquad (11)$$
Define the variables
$$w_i = x_i - \hat{\mu}_{-y_i}, \qquad i = 1, \dots, N \qquad (12)$$
In words, $w_i$ is defined as the difference between $x_i$ and the mean of the other class. Intuitively, $w_i$ captures the same information as $z_{ij}$, and it can be viewed as a conditioning of the previous problem. The number of constructed structures is now $N$, which is typically much smaller than the size of $Z$. We call the resulting algorithm M-PCA1a.
In definition (12), the sample means of the two classes are used. Especially for small-sample data or very imbalanced classes, the quality of the estimates of the class-conditional means is a concern. In that case, replacing the sample means with the sample medoids may be a more robust option (the class-conditional medoid is the vector consisting of the class-conditional sample median of each individual feature). We call the corresponding algorithm M-PCA1b. Both versions of the algorithm are shown in Figure 2; the time complexity of this approach is the same as that of standard PCA.
M-PCA1a (M-PCA1b)
input: training set $\{(x_i, y_i)\}_{i=1}^{N}$, target dimension $K$
let $I_{-1} = \{ i : y_i = -1 \}$, $I_{+1} = \{ i : y_i = +1 \}$
let $\hat{\mu}_{-1}$ be the mean (medoid) of $\{ x_i : i \in I_{-1} \}$
let $\hat{\mu}_{+1}$ be the mean (medoid) of $\{ x_i : i \in I_{+1} \}$
let $w_i = x_i - \hat{\mu}_{-y_i}$ for $i = 1, \dots, N$
let $S = \sum_{i=1}^{N} w_i w_i^\top$
let $P$ be the $d \times K$ matrix whose columns are the eigenvectors of $S$
with the $K$ largest eigenvalues
output: $P$, the projected data $\{ P^\top x_i \}_{i=1}^{N}$
Figure 2: The M-PCA1a (M-PCA1b) algorithm.
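A corresponding sketch of M-PCA1a and M-PCA1b (again an illustration, with the medoid implemented as the coordinate-wise median, as described above):

```python
import numpy as np

def m_pca1(X, y, K, use_medoid=False):
    """Sketch of M-PCA1a (class means) / M-PCA1b (class medoids).
    Each point is replaced by its difference from the centre of the other
    class, and "uncentered" PCA is run on these differences."""
    centre = np.median if use_medoid else np.mean
    c = {s: centre(X[y == s], axis=0) for s in (-1, +1)}
    W = np.array([x - c[-int(label)] for x, label in zip(X, y)])
    S = W.T @ W
    eigvals, eigvecs = np.linalg.eigh(S)
    P = eigvecs[:, np.argsort(eigvals)[::-1][:K]]
    return P, X @ P
```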
3.4 Algorithm: M-PCA2
Our final algorithm, M-PCA2, attempts to capture the margin distribution more closely by mimicking equation (9). In this algorithm we do not use all pairs of data points from the two classes; instead, we construct the difference vector between each data point and its nearest neighbour in the other class as a proxy for the margin distribution. The algorithm is shown in Figure 3.
M-PCA2
input: training set $\{(x_i, y_i)\}_{i=1}^{N}$, target dimension $K$
let $I_{-1} = \{ i : y_i = -1 \}$, $I_{+1} = \{ i : y_i = +1 \}$
for each $i$, let $v_i = x_i - x_{j(i)}$, where $x_{j(i)}$ is the nearest neighbour
of $x_i$ in the opposite class, $j(i) = \operatorname{argmin}_{j : y_j \neq y_i} \| x_i - x_j \|$
let $S = \sum_{i=1}^{N} v_i v_i^\top$
let $P$ be the $d \times K$ matrix whose columns are the eigenvectors of $S$
with the $K$ largest eigenvalues
output: $P$, the projected data $\{ P^\top x_i \}_{i=1}^{N}$
Figure 3: The M-PCA2 algorithm.
Our experimental results show that M-PCA2 works extremely well, especially for small, high-dimensional datasets. It is not hard to see that constructing the $v_i$ takes $O(N_{+1} N_{-1} d)$ operations and the number of constructed vectors is no more than $N$. Therefore, the time complexity of M-PCA2 is the same as that of vanilla PCA.
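A sketch of M-PCA2 under the same assumptions (brute-force nearest-neighbour search, matching the stated $O(N_{+1} N_{-1} d)$ construction cost):

```python
import numpy as np

def m_pca2(X, y, K):
    """Sketch of M-PCA2: PCA on the difference between each point and its
    nearest neighbour in the opposite class."""
    V = []
    for i in range(len(X)):
        other = X[y != y[i]]                              # opposite-class points
        j = np.argmin(np.sum((other - X[i]) ** 2, axis=1))
        V.append(X[i] - other[j])                         # nearest-neighbour difference
    V = np.array(V)
    S = V.T @ V
    eigvals, eigvecs = np.linalg.eigh(S)
    P = eigvecs[:, np.argsort(eigvals)[::-1][:K]]
    return P, X @ P
```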
4 Theory
Analyzing our schemes in full generality is difficult. To give some insight into how they work, in this section we analyze a toy setting, which suggests why our schemes can work better than PCA.
We consider a simple case where the classes have a shared class-conditional covariance matrix, where for convenience the class-conditional distributions are multivariate Gaussian, and where the difference between the class means is aligned with one eigenvector of the class-conditional covariance matrix. Without loss of generality, we assume the class-conditional covariance matrix is diagonal. Let $X_{-1}$ and $X_{+1}$
be two random variables such that
$$X_{-1} \sim \mathcal{N}(\mu_{-1}, \Sigma) \qquad (13)$$
$$X_{+1} \sim \mathcal{N}(\mu_{+1}, \Sigma) \qquad (14)$$
$$\Pr(Y = -1) = \pi_{-1}, \qquad \Pr(Y = +1) = \pi_{+1} = 1 - \pi_{-1} \qquad (15)$$
where
$$\mu_{-1} = -\tfrac{\Delta}{2}\, e_1 \qquad (16)$$
$$\mu_{+1} = +\tfrac{\Delta}{2}\, e_1, \qquad \Delta > 0 \qquad (17)$$
$$\Sigma = \operatorname{diag}(\lambda_1, \lambda_2, \dots, \lambda_d) \qquad (18)$$
and where $e_1$ is the first standard basis vector and the eigenvalues $\lambda_1, \dots, \lambda_d$ are not in any particular order. As a result, the marginal distribution of a data point $X$ is the mixture
$$X \sim \pi_{-1}\, \mathcal{N}(\mu_{-1}, \Sigma) + \pi_{+1}\, \mathcal{N}(\mu_{+1}, \Sigma) \qquad (19)$$
It is clear in this setting that the only feature useful for classification is the first coordinate, so a successful dimensionality reduction approach for this classification problem must include the first coordinate among the selected features. In the case $\pi_{+1} = \pi_{-1} = 1/2$, the optimal separating hyperplane is $\{x : x_1 = 0\}$ and the corresponding margin distribution is that of $Y X_1 \sim \mathcal{N}(\Delta/2, \lambda_1)$. The whole margin distribution would therefore be preserved by any approach that includes the first coordinate among the reduced features.
Now suppose we sample $N$ data points $x_1, \dots, x_N$; then PCA works by eigen-decomposing
$$\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^\top \qquad (20)$$
where $\bar{x}$ is the sample mean. In expectation the above expression is essentially the covariance of the variable $X$, which is equal to
$$\operatorname{Cov}(X) = \operatorname{diag}\bigl( \lambda_1 + \pi_{+1}\pi_{-1}\Delta^2,\ \lambda_2,\ \dots,\ \lambda_d \bigr) \qquad (21)$$
Suppose the target dimension is $K = 1$; then for PCA to be successful, $\lambda_1 + \pi_{+1}\pi_{-1}\Delta^2$ has to be larger than $\max_{j \geq 2} \lambda_j$. In most classification problems the between-class variance is larger than the within-class variances, that is, $\Delta^2$ is usually large enough that $\lambda_1 + \pi_{+1}\pi_{-1}\Delta^2$ exceeds $\max_{j \geq 2} \lambda_j$. This is why PCA can work well for classification tasks even though it is not designed for that purpose. In other situations, PCA may not work well; in particular, in this setting it will either be as good as possible or as bad as possible.
To see how M-PCA0 works, we define the random variable
$$Z = X_{+1} - X_{-1}, \quad \text{with } X_{+1} \text{ and } X_{-1} \text{ independent} \qquad (22)$$
M-PCA0 works by eigen-decomposing the sample second-moment matrix of $Z$. In expectation, this matrix is equal to
$$\mathbb{E}\bigl[ Z Z^\top \bigr] = \operatorname{diag}\bigl( 2\lambda_1 + \Delta^2,\ 2\lambda_2,\ \dots,\ 2\lambda_d \bigr) \qquad (23)$$
Since $\Delta^2/2 \geq \pi_{+1}\pi_{-1}\Delta^2$, the first diagonal entry is boosted relative to the others by more than in (21), so M-PCA0 has a better chance than PCA of selecting the first coordinate in the case $K = 1$. Our experiments indeed show that M-PCA0 usually performs better than PCA.
To see how M-PCA1a and M-PCA1b work, we use similar arguments. First, define the following random variable
$$W = X_{+1} - \mu_{-1} \quad \text{(and symmetrically } X_{-1} - \mu_{+1} \text{ for the other class)} \qquad (24)$$
From the definition, we have
$$W \sim \mathcal{N}(\Delta\, e_1, \Sigma) \qquad (25)$$
M-PCA1a and M-PCA1b work by eigen-decomposing the sample second-moment matrix of $W$, the expectation of which is equal to
$$\mathbb{E}\bigl[ W W^\top \bigr] = \operatorname{diag}\bigl( \lambda_1 + \Delta^2,\ \lambda_2,\ \dots,\ \lambda_d \bigr) \qquad (26)$$
This indicates that M-PCA1a and M-PCA1b have a much better chance than PCA of selecting the first coordinate as the useful feature when the target dimension is 1, since a larger quantity ($\Delta^2$ rather than $\pi_{+1}\pi_{-1}\Delta^2$) is added to $\lambda_1$.
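A quick numerical illustration of this comparison (the toy parameters here are chosen by us for illustration, not taken from the paper): with $\lambda = (1, 4, 4)$, $\Delta = 3$ and balanced classes, the leading eigenvector of the matrix in (21) lies in the span of $e_2, e_3$, while the matrices in (23) and (26) both have leading eigenvector $e_1$.

```python
import numpy as np

# Toy check of (21), (23), (26): lambda = (1, 4, 4), Delta = 3, balanced classes.
lam = np.array([1.0, 4.0, 4.0])
delta = np.array([3.0, 0.0, 0.0])       # mean difference, aligned with e_1
pi = 0.5                                 # class proportions pi_{+1} = pi_{-1}

pca_mat    = np.diag(lam) + pi * (1 - pi) * np.outer(delta, delta)   # (21)
m_pca0_mat = 2 * np.diag(lam) + np.outer(delta, delta)               # (23)
m_pca1_mat = np.diag(lam) + np.outer(delta, delta)                   # (26)

for name, M in [("PCA", pca_mat), ("M-PCA0", m_pca0_mat), ("M-PCA1", m_pca1_mat)]:
    w, V = np.linalg.eigh(M)
    print(name, "leading eigenvector:", np.round(V[:, -1], 3))
# PCA's leading eigenvector lies in the e_2/e_3 plane (it misses the
# discriminative coordinate); M-PCA0's and M-PCA1's leading eigenvector is e_1.
```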
Remark.
From the insight obtained above, PCA works poorly in the case of class imbalance because the more imbalanced the classes, the smaller $\pi_{+1}\pi_{-1}$ becomes. Our schemes do not suffer from the same problem.
Remark.
Our schemes are much less vulnerable to the low ratio of between-class variance to within-class variance that makes PCA fail, although they can still fail in extreme cases. To make the approach even less vulnerable we devised the scheme M-PCA2, which is based on the intuitive arguments above.
The situation quickly becomes intractable to analysis in the more general case where the difference between the class means is not aligned with any eigenvector of the class-conditional covariance matrix. In this case $\mu_{+1} - \mu_{-1} = \delta$, with the other quantities assumed the same as above. Now the matrices to be eigen-decomposed are $\Sigma + \pi_{+1}\pi_{-1}\delta\delta^\top$, $2\Sigma + \delta\delta^\top$ and $\Sigma + \delta\delta^\top$ respectively. These are not diagonal and there is no simple analytical form for the eigenvectors. The most discriminative direction in theory is now the FLD direction
$$w_{\mathrm{FLD}} = \Sigma^{-1}(\mu_{+1} - \mu_{-1}) = \Sigma^{-1}\delta \qquad (27)$$
which is not, in general, an eigenvector of the above matrices. We shall see from our experiments in Section 5 that for small target dimension $K$, our approach nevertheless outperforms PCA in general and, despite being a (supervised) filter method, is highly competitive with wrapper approaches such as PLS and Lasso.
5 Experiments
In this section we present empirical results on the performance of our schemes M-PCA0, M-PCA1a, M-PCA1b, and M-PCA2, comparing them to vanilla PCA, PLS, and Lasso in terms of test error. To obtain a comprehensive picture of the performance of the new schemes, we evaluate several widely used classifiers on a range of publicly available datasets with different characteristics. The classifiers used were Fisher Linear Discriminant (FLD), SVM, Logistic Regression (LR), and Naive Bayes (NB). We use two groups of publicly available real datasets. The first group contains datasets with more data points than features; these are
ionosphere [18], sonar [18], mushrooms [18], and splice [18]. The other group contains several small-sample datasets with fewer data points than features: colon [2], prostate [25], ovarian [22], leukemia [10], leukemia large [10], and duke [30]. Summary information on these datasets is shown in Table 1.

Table 1: The datasets used in the experiments (#instances lists the two class sizes).
name | source | #instances | #features
---|---|---|---
ionosphere | [18] | 126+225 | 34 |
sonar | [18] | 111+97 | 60 |
mushrooms | [18] | 3916+4208 | 112 |
splice | [18] | 1527+1648 | 60 |
colon | [2] | 22+40 | 2000 |
prostate | [25] | 50+52 | 6033 |
ovarian | [22] | 24+30 | 1536 |
leukemia | [10] | 47+25 | 3571 |
leukemia large | [10] | 47+25 | 7129 |
duke | [30] | 21+23 | 7129 |
5.1 Experiment Setup
In the experiments we obtain test errors for each combination of a dimensionality reduction scheme, a target dimension $K$, a classifier and a dataset. Each combination is fed 50 independent partitions of the dataset into a training set and a test set, where four fifths of the data are used for training and the remainder for testing, and the sampling is stratified to preserve class membership proportions. Hence 50 test errors are produced for each combination, and these are used to compute the mean and standard deviation of the test error for that combination. For a given dataset, the same 50 data splits are used across all combinations. We did not use cross-validation since the sign test assumes independent observations.
We choose two representative target dimensions for each dataset. For the small-sample datasets the target dimensions are approximately $r/4$ and $r/2$, where $r$ is the rank of the training data matrix (roughly four fifths of the number of instances). For the other datasets the target dimensions are chosen to be approximately $d/6$ and $d/3$, where $d$ is the number of features (the original data dimension). We do not include larger target dimensions here: on one hand, small target dimensions are the setting of practical interest; on the other hand, as the target dimension $K$ increases, the differences between the dimensionality reduction schemes become less pronounced, because as $K$ approaches the rank of the training data matrix the features retained by any scheme are enough to capture almost all the discriminative information. Our own experiments, not presented here, also indicate this.
For the datasets with more features than instances, the test errors of M-PCA0, M-PCA1a, M-PCA1b, and M-PCA2 are compared with those of PCA, PLS, and Lasso. For the datasets with fewer features than instances, due to the high computational cost of M-PCA0, only M-PCA1a, M-PCA1b, and M-PCA2 are run and compared to PCA, PLS, and Lasso. We use the SVM and Logistic Regression implementations from liblinear [6]; both are regularized linear models. The classifier parameters were selected by cross-validation on the whole original dataset to provide a consistent baseline, even though this may have favored the full wrapper methods, since they have access to the tuned classifier. The Naive Bayes model uses a Gaussian class-conditional likelihood and we used the built-in MATLAB version.
Finally, to use PLS and Lasso for dimensionality reduction in the classification tasks, we cast the classification tasks as regression tasks with discrete targets and extract the important features. The implementations of PLS and Lasso used are the built-in MATLAB functions plsregress and lasso.
5.2 Results

Table 2: Sign test p-values of our schemes versus PCA (SVM and LR) on the datasets with more instances than features.
K | Classifier | M-PCA1a | M-PCA1b | M-PCA2
---|---|---|---|---
Ionosphere | ||||
5 | SVM | 0.3318 | 0.6864 | |
LR | 0.8569 | 0.9878 | ||
11 | SVM | |||
LR | ||||
Sonar | ||||
10 | SVM | 0.5 | 0.9981 | 0.6911 |
LR | 0.2272 | 0.9061 | 0.9061 | |
20 | SVM | 0.998 | 0.1215 | 0.111 |
LR | 0.9988 | 0.6358 | 0.2664 | |
Mushrooms | ||||
18 | SVM | 0.7336 | ||
LR | 0.1856 | |||
37 | SVM | |||
LR | ||||
Splice | ||||
10 | SVM | 1 | ||
LR | 1 | |||
20 | SVM | 1 | ||
LR | 1 |
Table 3: Sign test p-values of our schemes versus PCA (SVM and LR) on the datasets with more features than instances.
K | Classifier | M-PCA0 | M-PCA1a | M-PCA1b | M-PCA2
---|---|---|---|---|---
Colon | |||||
12 | SVM | 0.25 | 0.75 | 0.25 | 0.3125 |
LR | 0.9648 | 0.7734 | 0.1445 | 0.3036 | |
24 | SVM | 0.5 | 0.125 | 0.5 | |
LR | 0.1051 | ||||
Prostate | |||||
20 | SVM | 0.5 | 0.212 | 0.0898 | |
LR | 0.0898 | ||||
40 | SVM | 0.5 | 0.2266 | 0.3633 | 0.2272 |
LR | 0.5 | 0.5 | 0.5 | 0.1509 | |
Ovarian | |||||
10 | SVM | 0.6762 | 0.6762 | 0.968 | 0.1635 |
LR | 0.2905 | 0.1316 | |||
21 | SVM | 0.8867 | 0.8867 | 0.9961 | 0.2272 |
LR | 0.8555 | 0.8555 | 0.9648 | ||
Leukemia | |||||
14 | SVM | 0.5 | 0.9375 | 0.875 | 0.2539 |
LR | 0.3125 | 0.5 | 0.1094 | ||
28 | SVM | 1 | 1 | 0.5 | 0.9375 |
LR | 0.5 | 0.5 | 0.875 | 0.6875 | |
Leuk-large | |||||
14 | SVM | 0.1133 | 0.0547 | 0.1719 | 0.5982 |
LR | 0.8867 | 0.1334 | 0.1938 | ||
28 | SVM | 0.1875 | 0.5 | 0.5 | 0.3633 |
LR | 0.0625 | 0.5 | |||
Duke | |||||
8 | SVM | 0.073 | 0.1509 | 0.1334 | 0.3318 |
LR | 0.068 | ||||
17 | SVM | 0.125 | 0.125 | 0.3125 | 0.3872 |
LR | 0.1875 | 0.3438 | 0.2266 | 0.337 |
The detailed results are given in the appendix, in Tables 4-13. As can be seen from the tables, our schemes are superior to PCA in general and highly competitive with PLS and Lasso, which indicates that our schemes are indeed able to select the most discriminative features. Because of the large variance of the test errors (this is most obvious for the small-sample datasets, where there are fewer observations and the test error is very sensitive to the particular partition of the data), it appears at first that our schemes are not statistically significantly better than PCA. However, these averages are taken across the different partitions of the datasets, and performance on the individual data splits is highly variable. To see whether our approach does significantly outperform PCA, we compare test errors on independent data splits and use a non-parametric sign test to evaluate the null hypothesis that our scheme is no better than PCA: see Figure 4, where each test error corresponds to an independent data split. For the dataset shown there, M-PCA1b is at least as good as PCA on every individual split, but the high between-split variance of the test error masks this fact if we only consider the average test error across data splits. We tabulate the sign test p-values of our algorithms versus PCA in Tables 2 and 3; here we only show results for SVM and Logistic Regression. The boldfaced numbers indicate the cases where the p-value is below 0.05, and we have made no correction for multiple comparisons since the smallest p-values are small enough that such a correction would not change the conclusions. We see that our approach does significantly outperform PCA, especially when the number of retained dimensions is small.
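For reference, the one-sided sign test used here can be computed from the paired per-split test errors as follows (a sketch, not the authors' code):

```python
from math import comb

def sign_test_p(errors_ours, errors_pca):
    """One-sided sign test p-value for the null that our scheme is no better
    than PCA, given paired test errors from independent data splits.
    Ties are discarded, as is standard for the sign test."""
    wins = sum(a < b for a, b in zip(errors_ours, errors_pca))
    losses = sum(a > b for a, b in zip(errors_ours, errors_pca))
    n = wins + losses
    # P(at least `wins` successes in n fair coin flips)
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n
```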
6 Discussion
We presented four simple filter PCA variants, each of which appears to improve classification performance on the projected data relative to standard PCA. Our ideas in this paper are primarily based on the observation that the margin distribution contains most of the discriminative information for a classification task, while the PCA objective is not obviously aligned with preserving this quantity. We therefore proposed four heuristic structures to represent the margin distribution, on which we then perform PCA; extensive empirical evaluation suggests these do indeed represent the margin distribution well. Whether there are better structures for representing the margin distribution that can be evaluated outside of a wrapper approach is an interesting question for future research: any such structures should yield a better dimensionality reduction scheme for linear classification. Whether theoretical guarantees on classification performance using our approach are possible looks like a difficult open problem; the main hurdle is how to provide typical-case guarantees for a deterministic algorithm without imposing restrictive conditions on the data generator. A further direction for future research is to consider non-linear dimensionality reduction schemes, such as kernel PCA, in a similar light.
Appendix A Detailed Experimental Results
In the tables below, the test errors (mean±std) are shown in percent and the boldfaced numbers represent the best performing schemes.
Table 4: Test errors (%) on the ionosphere dataset.
K | Scheme | SVM | LR | FLD | NB
---|---|---|---|---|---
5 | PCA | 26.7±3.0 | 26.4±2.8 | 17.0±3.7 |
PLS | 23.5±3.0 | 23.4±3.2 | 20.7±5.1 | |
Lasso | 27.7±2.6 | 27.0±2.8 | 20.0±3.9 | 13.2±3.7 |
M-PCA1a | 26.6±3.0 | 26.5±2.8 | 16.4±3.8 | 17.2±4.3 |
M-PCA1b | 16.7±3.6 | 17.5±4.3 |
M-PCA2 | 26.9±2.4 | 27.1±2.6 | 21.3±4.0 | 18.8±4.7 |
11 | PCA | 25.3±2.6 | 24.4±2.7 | 18.4±3.6 | 10.4±3.7
PLS | 20.1±3.7 | 19.5±3.3 | 14.8±4.3 |
Lasso | 20.2±3.4 |
M-PCA1a | 24.7±2.9 | 24.1±2.7 | 18.5±3.7 | 11.2±3.9 |
M-PCA1b | 21.1±2.9 | 20.4±3.2 | 18.0±3.8 | 11.1±3.9 |
M-PCA2 | 23.5±3.2 | 23.2±2.9 | 20.7±3.6 | 15.2±5.0
Table 5: Test errors (%) on the sonar dataset.
K | Scheme | SVM | LR | FLD | NB
---|---|---|---|---|---
10 | PCA | 26.8±5.7 | 25.9±6.2 | 24.4±6.7 |
PLS | 30.2±7.0 | 29.0±6.5 | 25.6±6.8 | 37.5±8.1 |
Lasso | 30.9±5.9 | 27.8±6.3 | 24.6±6.2 | 32.8±6.5 |
M-PCA1a | 24.3±6.5 |
M-PCA1b | 27.8±5.5 | 26.5±6.1 | 24.4±6.6 | 24.9±6.3 |
M-PCA2 | 27.6±5.7 | 26.1±6.0 | 23.8±6.6 | 24.8±6.9 |
20 | PCA | 27.5±5.7 | 25.8±6.7 |
PLS | 27.6±6.2 | 27.9±5.4 | 28.2±6.1 | 38.5±8.4 |
Lasso | 30.1±5.5 | 29.3±7.0 | 27.1±5.4 | 32.1±7.2 |
M-PCA1a | 28.3±5.7 | 26.7±6.2 | 23.8±7.0 | 26.0±6.0 |
M-PCA1b | 27.1±5.9 | 26.0±6.2 | 23.7±6.7 | 28.2±8.0 |
M-PCA2 | 24.0±6.1 | 27.2±8.2
Table 6: Test errors (%) on the mushrooms dataset.
K | Scheme | SVM | LR | FLD | NB
---|---|---|---|---|---
18 | PCA | 3.2±0.3 | 3.1±0.4 | 3.8±0.5 | 11.4±0.5
PLS | 0.5±0.3 | 0.7±0.4 | 11.9±0.7 |
Lasso | 1.6±1.1 |
M-PCA1a | 2.8±0.4 | 2.7±0.4 | 3.5±0.5 | 11.7±0.6 |
M-PCA1b | 3.2±0.3 | 3.0±0.4 | 3.1±0.4 | 11.4±0.6 |
M-PCA2 | 1.5±0.4 | 1.7±0.3 | 2.7±0.4 | 10.7±0.9 |
37 | PCA | 2.0±0.3 | 2.0±0.3 | 3.0±0.5 | 10.0±0.7
PLS | 0.2±0.1 | 0.7±0.6 | 12.1±0.8 |
Lasso | 1.3±0.8 |
M-PCA1a | 1.8±0.3 | 1.9±0.3 | 2.8±0.5 | 10.8±0.6 |
M-PCA1b | 1.3±0.3 | 2.0±0.3 | 3.1±0.5 | 8.5±0.6 |
M-PCA2 | 0.3±0.1 | 0.4±0.2 | 1.7±0.5 | 11.1±0.9
Table 7: Test errors (%) on the splice dataset.
K | Scheme | SVM | LR | FLD | NB
---|---|---|---|---|---
10 | PCA | 18.2±1.3 | 18.2±1.1 | 17.9±1.3 | 16.1±1.3
PLS | 22.4±1.5 |
Lasso | 18.4±1.3 | 18.5±1.9 | 18.0±1.6 | 15.1±1.6 |
M-PCA1a | 16.5±1.3 | 16.4±1.2 | 16.0±1.2 | 18.1±1.5 |
M-PCA1b | 22.0±1.4 | 22.3±1.6 | 21.4±1.8 | 22.2±1.7 |
M-PCA2 | 17.0±1.3 | 16.8±1.3 | 16.2±1.2 |
20 | PCA | 17.9±1.3 | 17.7±1.0 | 17.1±1.5 | 16.0±1.5
PLS | 15.4±1.2 | 22.3±1.6 |
Lasso | 16.3±1.4 | 16.2±1.3 |
M-PCA1a | 16.8±1.3 | 16.4±1.2 | 15.9±1.3 | 18.0±1.6 |
M-PCA1b | 21.0±1.3 | 21.1±1.3 | 20.0±1.8 | 21.0±1.7 |
M-PCA2 | 17.0±1.3 | 16.8±1.3 | 16.2±1.3 | 14.3±1.2
Table 8: Test errors (%) on the colon dataset.
K | Scheme | SVM | LR | FLD | NB
---|---|---|---|---|---
12 | PCA | 11.7±9.1 | 13.5±8.9 | 17.5±8.6 |
PLS | 14.0±9.3 | 13.2±8.9 | 14.7±7.3 | 17.8±8.6 |
Lasso | 21.2±11.9 | 13.2±8.9 | 22.2±8.8 |
M-PCA0 | 11.3±9.2 | 14.0±9.0 | 12.2±8.1 | 21.8±9.8 |
M-PCA1a | 11.7±9.1 | 13.7±9.3 | 12.2±7.9 | 15.8±8.5 |
M-PCA1b | 12.2±8.5 | 16.3±8.6 |
M-PCA2 | 11.3±9.2 | 12.8±9.1 | 12.7±8.6 | 14.3±8.4 |
24 | PCA | 12.8±9.4 | 14.8±9.1 | 13.7±8.2 | 20.2±9.5
PLS | 15.3±9.7 | 14.5±9.8 | 14.0±7.6 | 21.5±8.6 |
Lasso | 23.5±11.0 | 15.2±8.7 | 22.2±8.8 |
M-PCA0 | 12.5±9.1 | 20.0±8.7 |
M-PCA1a | 12.3±9.4 | 13.0±9.8 | 13.3±8.1 | 15.7±8.4 |
M-PCA1b | 12.7±9.6 | 13.0±9.4 | 14.0±7.6 | 15.5±8.4 |
M-PCA2 | 13.8±8.8 | 13.8±7.3 | 14.0±8.5
Table 9: Test errors (%) on the prostate dataset.
K | Scheme | SVM | LR | FLD | NB
---|---|---|---|---|---
20 | PCA | 10.1±5.2 | 10.4±6.9 | 9.0±4.8 | 33.9±12.8
PLS | 9.2±5.1 | 8.2±6.1 | 8.5±4.8 | 43.9±7.8 |
Lasso | 9.3±6.1 | 8.3±5.9 | 10.3±5.6 |
M-PCA0 | 9.9±5.3 | 9.8±6.8 | 8.8±5.0 | 39.7±9.9 |
M-PCA1a | 9.4±4.7 | 9.2±6.6 | 8.7±5.0 | 42.7±8.8 |
M-PCA1b | 9.1±4.8 | 8.9±5.9 | 8.0±5.1 | 35.0±9.0 |
M-PCA2 | 30.3±11.0 |
40 | PCA | 8.3±4.7 | 8.0±6.1 | 7.5±4.7 | 42.1±8.9
PLS | 9.3±4.8 | 8.4±6.1 | 8.5±4.8 | 46.5±5.4 |
Lasso | 10.0±5.4 |
M-PCA0 | 8.2±4.8 | 7.9±6.0 | 7.2±4.6 | 45.0±6.7 |
M-PCA1a | 8.0±4.7 | 7.9±6.0 | 47.3±5.4 |
M-PCA1b | 8.1±4.9 | 7.9±5.5 | 7.4±4.9 | 42.7±7.8 |
M-PCA2 | 7.9±5.2 | 7.4±6.4 | 7.2±5.2 | 33.4±11.3
Table 10: Test errors (%) on the ovarian dataset.
K | Scheme | SVM | LR | FLD | NB
---|---|---|---|---|---
10 | PCA | 18.5±10.9 | 23.1±10.6 | 26.0±11.6 | 45.8±10.4
PLS | 19.1±10.7 | 17.3±8.9 | 48.4±11.5 |
Lasso | 22.5±13.3 | 26.7±10.5 | 27.8±11.4 |
M-PCA0 | 18.5±10.4 | 22.4±10.6 | 23.6±11.2 | 45.3±13.6 |
M-PCA1a | 18.4±10.1 | 20.7±10.9 | 22.2±12.2 | 47.6±11.8 |
M-PCA1b | 20.2±10.6 | 21.8±11.6 | 22.0±11.2 | 47.3±11.8 |
M-PCA2 | 20.4±10.2 | 33.6±12.5 |
21 | PCA | 18.7±10.5 | 19.1±9.6 | 20.0±10.1 | 42.5±11.4
PLS | 50.4±8.3 |
Lasso | 18.5±11.9 | 24.5±11.7 | 24.9±13.3 |
M-PCA0 | 19.3±10.3 | 19.5±9.2 | 19.1±10.9 | 44.2±11.8 |
M-PCA1a | 19.3±11.1 | 19.5±9.7 | 18.9±11.1 | 45.1±11.6 |
M-PCA1b | 19.8±11.0 | 19.8±9.3 | 21.3±11.4 | 45.8±11.3 |
M-PCA2 | 18.0±10.1 | 16.0±9.5 | 19.6±12.0 | 29.3±12.4
Table 11: Test errors (%) on the leukemia dataset.
K | Scheme | SVM | LR | FLD | NB
---|---|---|---|---|---
14 | PCA | 3.1±3.6 | 3.4±3.9 | 2.0±3.2 | 2.4±3.7
PLS | 4.0±4.1 | 3.1±4.1 | 2.9±3.5 | 3.4±4.8 |
Lasso | 5.9±4.9 | 4.1±5.0 | 5.0±6.2 | 4.4±5.2 |
M-PCA0 | 3.0±3.6 | 3.1±4.1 |
M-PCA1a | 3.4±3.6 | 3.3±4.1 | 2.0±3.2 | 2.4±3.4 |
M-PCA1b | 3.3±3.6 | 2.9±4.1 | 2.0±3.2 | 2.6±3.5 |
M-PCA2 | 2.1±3.3 | 2.7±4.1 |
28 | PCA | 2.4±3.4 | 2.4±4.0 | 2.0±3.2 | 2.9±4.6
PLS | 4.7±4.2 | 3.3±4.4 | 2.9±3.5 | 3.6±4.8 |
Lasso | 6.0±5.3 | 4.6±4.9 | 6.0±6.7 | 4.3±4.8 |
M-PCA0 | 2.9±3.5 | 1.9±3.2 | 4.7±5.7 |
M-PCA1a | 2.4±3.4 | 2.3±3.9 | 2.0±3.2 |
M-PCA1b | 2.6±4.0 | 2.3±3.4 |
M-PCA2 | 2.7±3.5 | 2.4±4.0 | 2.1±3.3 | 3.0±3.8
Table 12: Test errors (%) on the leukemia large dataset.
K | Scheme | SVM | LR | FLD | NB
---|---|---|---|---|---
14 | PCA | 5.7±5.6 | 6.1±4.8 | 3.9±4.4 | 11.0±7.2
PLS | 4.7±4.5 | 5.0±5.4 | 3.9±5.6 | 32.6±13.6 |
Lasso | 8.9±7.2 |
M-PCA0 | 4.7±4.9 | 6.3±5.5 | 28.4±7.7 |
M-PCA1a | 4.6±4.7 | 5.4±4.9 | 3.4±4.4 | 21.4±10.6 |
M-PCA1b | 4.9±5.1 | 5.3±5.2 | 4.0±5.2 | 18.6±9.9 |
M-PCA2 | 5.3±5.0 | 4.4±5.2 | 3.6±5.6 | 26.4±8.2 |
28 | PCA | 4.7±4.7 | 5.3±4.7 | 2.9±4.6 | 16.3±9.7
PLS | 4.4±4.5 | 4.4±4.8 | 3.9±5.6 | 30.7±12.2 |
Lasso | 8.1±6.9 |
M-PCA0 | 4.1±4.4 | 4.6±4.9 | 33.6±4.4 |
M-PCA1a | 4.6±4.5 | 4.6±4.7 | 3.1±4.6 | 25.7±8.9 |
M-PCA1b | 4.6±4.5 | 4.3±4.6 | 3.1±4.6 | 24.7±8.9 |
M-PCA2 | 4.4±4.8 | 5.1±5.4 | 3.7±5.4 | 16.1±11.7
Table 13: Test errors (%) on the duke dataset.
K | Scheme | SVM | LR | FLD | NB
---|---|---|---|---|---
8 | PCA | 14.9±11.6 | 16.4±12.1 | 16.2±11.7 | 22.0±15.5
PLS | 42.9±13.1 |
Lasso | 18.7±11.1 | 19.3±11.6 | 18.0±12.9 | 20.4±12.4 |
M-PCA0 | 12.9±9.6 | 13.8±11.6 | 13.8±10.9 |
M-PCA1a | 13.1±9.2 | 13.3±11.0 | 12.0±10.2 | 21.8±16.0 |
M-PCA1b | 13.1±8.9 | 13.1±11.6 | 12.4±10.7 | 20.2±15.8 |
M-PCA2 | 14.2±9.5 | 12.9±12.4 | 14.2±11.5 | 31.6±15.3 |
17 | PCA | 13.8±8.9 | 12.9±10.8 | 13.1±11.6 | 27.8±17.4
PLS | 49.8±12.9 |
Lasso | 16.2±11.0 | 16.4±10.8 | 18.0±10.3 |
M-PCA0 | 13.1±8.9 | 12.2±10.1 | 12.4±11.1 | 21.8±13.6 |
M-PCA1a | 13.1±8.9 | 12.4±9.9 | 12.0±11.0 | 33.3±16.9 |
M-PCA1b | 13.3±9.0 | 12.2±10.1 | 12.9±11.1 | 27.3±18.1 |
M-PCA2 | 12.7±9.8 | 12.2±10.8 | 12.7±9.5 | 38.9±16.4
Acknowledgement
XL is supported by an internal study award at the University of Waikato.
References
- [1] Fabio Aiolli, Giovanni Da San Martino, and Alessandro Sperduti. A Kernel Method for the Optimization of the Margin Distribution, pages 305–314. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008.
- [2] U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences of the United States of America, 96(12):6745–6750, 1999.
- [3] Tijl De Bie, Nello Cristianini, and Roman Rosipal. Eigenproblems in pattern recognition. In Handbook of Geometric Computing, pages 129–167. Springer Berlin Heidelberg, 2005.
- [4] Richard G Brereton and Gavin R Lloyd. Partial least squares discriminant analysis: taking the magic away. Journal of Chemometrics, 28(4):213–225, 2014.
- [5] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–451, 2004.
- [6] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
- [7] Imola K Fodor. A survey of dimension reduction techniques, 2002.
- [8] Yoav Freund and Robert E. Schapire. Large Margin Classification Using the Perceptron Algorithm. Machine Learning, 37(3):277–296, 1999.
- [9] Ashutosh Garg and Dan Roth. Margin Distribution and Learning Algorithms. In Proceedings of the Twentieth International Conference on Machine Learning, pages 210–217, 2003.
- [10] T. R. Golub. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, 286(5439):531–537, 1999.
- [11] Stefan Haufe, Frank Meinecke, Kai Görgen, Sven Dähne, John-Dylan Haynes, Benjamin Blankertz, and Felix Bießmann. On the interpretation of weight vectors of linear models in multivariate neuroimaging. Neuroimage, 87:96–110, 2014.
- [12] R Carter Hill, Thomas B Fomby, and S R Johnson. Component selection norms for principal components regression. Communications in Statistics - Theory and Methods, 6(4):309–334, 1977.
- [13] Tom Howley, Michael G Madden, Marie-Louise O’Connell, and Alan G Ryder. The effect of principal component analysis on machine learning accuracy with high-dimensional spectral data. Knowledge-Based Systems, 19(5):363–370, 2006.
- [14] Andreas Janecek, Wilfried Gansterer, M. A. Demel, and G. F. Ecker. On the relationship between feature selection and classification accuracy. In JMLR Workshop and Conference Proceedings 4, pages 90–105, 2008.
- [15] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag New York, 2nd edition, 2002.
- [16] Ian T. Jolliffe. A Note on the Use of Principal Components in Regression. Journal of the Royal Statistical Society. Series C, 31(3):300–303, 1982.
- [17] Nikos Karampatziakis and Paul Mineiro. Discriminative Features via Generalized Eigenvectors. Proceedings of The 31st International Conference on Machine Learning, pages 494–502, 2014.
- [18] M. Lichman. UCI Machine Learning Repository, 2013.
- [19] Aleix M. Martinez and Avinash C. Kak. PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2):228–233, 2001.
- [20] Mykola Pechenizkiy, Alexey Tsymbal, and Seppo Puuronen. PCA-based feature transformation for classification: issues in medical diagnostics. In Proceedings of the 17th IEEE Symposium on Computer-Based Medical Systems, pages 535–540, 2004.
- [21] Kristiaan Pelckmans, Johan Suykens, and Bart D Moor. A Risk Minimization Principle for a Class of Parzen Estimators. In J C Platt, D Koller, Y Singer, and S T Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1137–1144. Curran Associates, Inc., 2008.
- [22] Michèl Schummer, WaiLap V. Ng, Roger E. Bumgarner, Peter S. Nelson, Bernhard Schummer, David W. Bednarski, Laurie Hassell, Rae Lynn Baldwin, Beth Y. Karlan, and Leroy Hood. Comparative hybridization of an array of 21 500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene, 238(2):375–385, 1999.
- [23] J Shawe-Taylor and N Cristianini. Further Results on the Margin Distribution. In Proc. 12th Annu. Conf. on Comput. Learning Theory, pages 278–285. ACM Press, 1999.
- [24] J Shawe-Taylor and N Cristianini. Margin distribution bounds on generalization. In Paul Fischer and Hans Ulrich Simon, editors, Computational Learning Theory: 4th European Conference, EuroCOLT'99, Nordkirchen, Germany, March 29–31, 1999, Proceedings, pages 263–273. Springer Berlin Heidelberg, Berlin, Heidelberg, 1999.
- [25] Dinesh Singh, Phillip G. Febbo, Kenneth Ross, Donald G. Jackson, Judith Manola, Christine Ladd, Pablo Tamayo, Andrew A. Renshaw, Anthony V. D’Amico, Jerome P. Richie, Eric S. Lander, Massimo Loda, Philip W. Kantoff, Todd R. Golub, and William R. Sellers. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2):203–209, 2002.
- [26] Robert Tibshirani. Regression Selection and Shrinkage via the Lasso. Journal of the Royal Statistical Society B, 58(1):267–288, 1996.
- [27] Matthew Turk and Alex Pentland. Eigenfaces for recognition. Journal of cognitive neuroscience, 3(1):71–86, 1991.
- [28] Matthew A Turk and Alex P Pentland. Face recognition using eigenfaces. In Computer Vision and Pattern Recognition, 1991. Proceedings CVPR’91., IEEE Computer Society Conference on, pages 586–591. IEEE, 1991.
- [29] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag New York, 2nd edition, 2000.
- [30] M West, C Blanchette, H Dressman, E Huang, S Ishida, R Spang, H Zuzan, J A Olson, J R Marks, and J R Nevins. Predicting the clinical status of human breast cancer by using gene expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 98(20):11462–11467, 2001.
- [31] Teng Zhang and Zhi-Hua Zhou. Large margin distribution machine. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’14, pages 313–322, 2014.
- [32] Teng Zhang and Zhi-Hua Zhou. Optimal Margin Distribution Machine. arXiv preprint arXiv:1604.03348, 2016.
- [33] W Zhao, R Chellappa, P. J. Phillips, and A. Rosenfeld. Face recognition: A literature survey. ACM Computing Surveys, 35(4):399–458, 2003.