1 Background and Introduction
Dimensionality reduction techniques are a core part of the Statistics and Machine Learning toolbox, widely used in predictive modeling to improve a range of measures including generalization performance, interpretability or identifiability of models, and the time and space complexity of learning or prediction. There are many such techniques, see e.g. [7] for a survey, but chief amongst them are simple linear techniques, and the most widely used of these in practice is probably Principal Components Analysis (PCA) [15] and its variants. Applications of PCA include [27, 28, 33].
PCA works as follows: Suppose we have an $n \times d$ data matrix $X$ of $n$ observations, each $d$-dimensional. PCA linearly projects the original $d$-dimensional data onto $k$ uncorrelated (orthogonal) directions – the first $k$ ‘Principal Components’ – where typically $k \ll d$. Denote by $P$ the $d \times k$ matrix with the Principal Components as columns; then the Principal Components are chosen to maximize the orthogonal projection of the dataset onto the column space of $P$, that is $P$ is chosen to satisfy:
(1)  $P = \arg\max_{P^\top P = I_k} \|XP\|_F^2$
As a result, PCA gives the best $k$-dimensional representation of the original $d$-dimensional data in the least-squares sense or, equivalently, if the data are centered, using PCA means that (for a fixed dataset) we discard the smallest amount of the total sample variation of any linear dimensionality reduction scheme. For data analysis tasks other than Ordinary Least Squares Regression (OLS), using PCA is a heuristic which frequently works well but, as noted in [11, 20, 16, 12], it can work very poorly (even for a linear regression task) in some natural scenarios. For the task of classification, PCA often seems to work very well experimentally [13, 14], but it is trivial to construct examples for which PCA will work badly in a classification task (for example, when the most discriminative features have small between-class variance relative to the variance of the whole dataset, or relative to the within-class variance). In other words, since the PCA objective is disconnected from the classification task at hand and, in particular, does not take account of the class structure inherent in the problem, vanilla PCA is prone to underfitting. In supervised settings where class labels are present, it would be a waste not to use the label information in the selection of useful features for classification; thus we propose a (weakly) supervised variant of PCA. There are a few supervised dimensionality reduction approaches, of which the most related to PCA is Fisher Linear Discriminant (FLD). FLD picks out the most useful feature for classification by maximizing the ratio of between-class variance to within-class variance. In theory it should outperform PCA for classification tasks. However, as pointed out in [19], PCA often outperforms FLD in practice, especially for small datasets. This is due to poor estimates of the between-class and within-class variances. Another important and widely used family of supervised approaches is the wrapper methods based on partial least squares (PLS), e.g. [4] and references therein. These take advantage of known structure in the classification problem by rotating the uncorrelated components to minimize the least squares error on the predicted labels. While such wrapper approaches are frequently successful in practice, it is well known that they need careful tuning to avoid overfitting, especially when the sample size is small [4]. More recently, a discriminative feature extraction method inspired by PCA was proposed in [17]. This method picks out the directions along which two classes differ most in their class-conditional second moments. However, it also suffers from the inaccurate second moment estimation problem that plagues FLD in the case of small, high-dimensional datasets. Finally, the $\ell_1$ regularization and sparse model selection method Lasso [26], widely used in regression, can also be used for dimensionality reduction for classification tasks, like $\ell_1$-regularized SVM. LARS [5] is another similar approach.
In this paper we describe and empirically validate a PCA variant that attempts to address both issues: In particular, we choose a projection that maximally preserves (a proxy for) the sample margin distribution between two classes. Our approach is very simple to implement and only involves changing a few lines of code in vanilla PCA but, as far as we are aware, it is neither previously published nor folklore. More importantly, it has the same time and space complexity as PCA, targets a sensible objective (for classification) yet is not a wrapper method, has a clear interpretation in terms of the problem at hand, and in our experience it works unreasonably well – our experimental outcomes are typically better than vanilla PCA, frequently significantly better, and never significantly worse, and are also competitive with PLS and Lasso.
The remainder of this paper is structured as follows: In the next section we describe PCA more precisely and recall that the principal components are the solution to a straightforward eigenvector problem. Next we discuss the importance of the margin distribution as a measure of the difficulty of a classification problem, which motivates our altering the PCA objective in order to maximally preserve the sample margin distribution. Then we introduce several schemes based on our idea and provide some theoretical intuitions for how they work. We present extensive experiments on several datasets which demonstrate the utility of our approach and compare its performance to vanilla PCA, PLS, and Lasso; we use a simple nonparametric sign test to demonstrate the statistically significant superiority of our approach over vanilla PCA. Finally we summarize our findings and discuss some possible future directions for this research.
2 Preliminaries
Our main contribution is a novel PCA variant that aims to preserve the margin distribution, and in this section we therefore briefly review vanilla PCA, minimum margin and the margin distribution.
2.1 Principal Component Analysis
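Given $n$ points $x_1, \dots, x_n$ in $\mathbb{R}^d$ and a target dimension $k < d$, PCA finds a linear projection $P: \mathbb{R}^d \to \mathbb{R}^k$ and embedding $E: \mathbb{R}^k \to \mathbb{R}^d$ such that
(2)  $\sum_{i=1}^{n} \| x_i - E(P(x_i)) \|_2^2$
is minimized. In words, PCA finds the linear projection that best preserves the data in the least squares sense when the reconstruction is also a linear map. Representing $P$ and $E$ by matrices, it is easy to show that the optimal solution satisfies:
(3)  $E = P^\top, \quad P P^\top = I_k$
Therefore, PCA amounts to solving the optimization problem:
(4)  $\min_{P \in \mathcal{M}_{k \times d},\; P P^\top = I_k} \sum_{i=1}^{n} \| x_i - P^\top P x_i \|_2^2$
where $\mathcal{M}_{k \times d}$ is the space of $k \times d$ matrices. Solving (4) is equivalent to solving:
(5)  $\max_{P \in \mathcal{M}_{k \times d},\; P P^\top = I_k} \operatorname{tr}(P A P^\top)$
where $A = \sum_{i=1}^{n} x_i x_i^\top$. Since $A$ is positive semidefinite the principal components can be found simply by diagonalizing $A$ as $A = V \Lambda V^\top$ where $V$ is orthogonal, $V^\top V = V V^\top = I_d$, and $\Lambda$ is a diagonal matrix of the nonnegative eigenvalues of $A$ in descending order of magnitude. The first $k$ principal components comprise the eigenvectors of $A$ corresponding to the $k$ largest eigenvalues: That is, the first $k$ rows of $V^\top$.
The computational cost of PCA is dominated by constructing the matrix $A$ and diagonalizing it. The former step takes $O(nd^2)$ calculations and the latter costs $O(d^3)$, though if $n < d$ the time complexity can be cut to $O(n^2 d)$. Thus the overall time complexity for PCA is $O(nd^2)$ or $O(n^2 d)$, depending on whether the sample size is greater than the dimensionality or not.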
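The eigenvector formulation above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the paper: the function name `uncentered_pca`, the synthetic data, and the choice to return the principal components as the rows of a $k \times d$ matrix are all assumptions made for the example.

```python
import numpy as np

def uncentered_pca(X, k):
    """Rows of the returned k x d matrix are the top-k eigenvectors of X^T X."""
    A = X.T @ X                              # d x d scatter matrix (symmetric PSD)
    eigvals, eigvecs = np.linalg.eigh(A)     # eigh returns eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]      # indices of the k largest eigenvalues
    return eigvecs[:, top].T

# Hypothetical data: 50 points in 5 dimensions with very different scales per axis.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])

P = uncentered_pca(X, k=2)
Z = X @ P.T                                  # n x k compressed representation
X_hat = Z @ P                                # n x d linear reconstruction
residual = np.sum((X - X_hat) ** 2)          # squared reconstruction error
```

The residual equals the sum of the discarded eigenvalues of $X^\top X$, which is one way to see why keeping the largest eigenvalues minimizes the reconstruction error.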
Remark.
Above we introduced PCA as a linear (lossy) compression approach. When PCA is used for dimensionality reduction, the common practice is to run it on the ‘mean-centered’ data. Denote the sample mean by $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and take the matrix to be diagonalized to be $\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^\top$, which is proportional to the maximum likelihood (consistent) estimator of the covariance matrix. In this case the projection onto the column space of P maximizes the retained total sample variance. For some further perspectives on PCA, we refer the interested reader to [3].
2.2 Margin and Margin Distribution
Let $\mathcal{H}$ be a hypothesis class of linear classifiers (separating hyperplane normals). The minimum margin or just margin between two separable classes is defined to be the supremum of the smallest Euclidean distances between the members of the classes and the separating hyperplane of classifiers $h \in \mathcal{H}$. That is, if $(x, y) \sim \mathcal{D}$ represent the distribution of the two separable classes with $y \in \{-1, +1\}$, then we define the margin to be:
(6)  $m = \sup_{h \in \mathcal{H}} \; \inf_{(x, y) \sim \mathcal{D}} \; \frac{y \, h^\top (x - x_0)}{\|h\|}$
where $x_0$ is any point on the separating hyperplane, $y$ is the label of $x$, and $\sup$, $\inf$ are the supremum and infimum operators. For nonseparable classes the margin is negative (our definition of margin here is different from conventional definitions), so one can either enforce a nonzero ‘soft’ margin by allowing a fraction of observations to be misclassified at training, or, given a fixed classifying hyperplane, consider instead the distribution of the signed distances between the members of the classes and the fixed classifying hyperplane by defining the margin at the point $x$ with respect to classifier $h$ by:
(7)  $m_h(x) = \frac{y \, h^\top (x - x_0)}{\|h\|}$
where again $x_0$ is any point on the separating hyperplane. We call the distribution of $m_h(x)$ the margin distribution with respect to $h$. The empirical margin and empirical margin distribution of a sample are defined similarly – for a fixed classifier $h$ and training set $\{(x_i, y_i)\}_{i=1}^{n}$ of size $n$, we define the empirical margin:
(8)  $\hat{m}_h = \min_{1 \le i \le n} \frac{y_i \, h^\top (x_i - x_0)}{\|h\|}$
where $x_0$ represents a point on the separating hyperplane of $h$, and the empirical margin at a point $x_i$ with respect to $h$ as:
(9)  $\hat{m}_h(x_i) = \frac{y_i \, h^\top (x_i - x_0)}{\|h\|}$
The empirical margin distribution with respect to $h$ is defined as the distribution of $\hat{m}_h(x_i)$.
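To make the empirical quantities concrete, the following sketch computes the signed distances of a labeled sample to a fixed hyperplane and takes their minimum. It is an illustration only: the hyperplane is written as $\{x : w^\top x + b = 0\}$ (equivalent to choosing a point $x_0$ on the hyperplane with $b = -w^\top x_0$), and the function name, data, and hyperplane are hypothetical.

```python
import numpy as np

def empirical_margins(X, y, w, b=0.0):
    """Signed distances y_i * (w . x_i + b) / ||w|| to the hyperplane w . x + b = 0."""
    return y * (X @ w + b) / np.linalg.norm(w)

# Hypothetical sample: two points per class, separated along the first coordinate.
X = np.array([[2.0, 0.0], [1.0, 1.0], [-1.0, 0.5], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 0.0])                 # normal of the fixed classifying hyperplane

margins = empirical_margins(X, y, w)     # one value per training point
min_margin = margins.min()               # the empirical (minimum) margin
```

The array `margins` is a sample from the empirical margin distribution; a negative entry would indicate a point on the wrong side of the hyperplane.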
The importance of the margin between classes for classification has been studied extensively, where it can be viewed as a dimension-free measure of the difficulty of a classification task. For separable classes, [29] bounds the generalization performance of the support vector machine (SVM) in terms of the empirical margin of the training dataset, and indeed both hard- and soft-margin SVM learn a discriminative hyperplane which maximizes the single-point minimum margin [32]. It has been pointed out that the information in the margin distribution is largely lost in the minimum margin, which depends critically on a small number of the training data points [23], but there is no such problem for the margin distribution. Thus tighter bounds for the generalization error of classifiers utilizing the margin distribution as the measure of difficulty of a classification problem have been obtained in, for example, [8, 24, 23]. Alternative classification algorithms based on the idea of margin distribution optimization have also been proposed [9, 21, 1, 31, 32] and found empirically to outperform SVM for many real datasets. Therefore, the margin distribution provides discriminative information that is crucial for classification. However, the margin and margin distribution can only be evaluated once the classifier has been learned. The dimensionality reduction schemes we introduce in this paper are based on the idea of preserving the margin distribution, but crucially without having access to the classifier.
3 Algorithms
3.1 Motivation
From the definitions above it is clear that optimizing the margin distribution requires a classifying hyperplane from which to measure it, and applying PCA to preserve exactly the sample margin distribution would therefore require a wrapper approach which could be prone to overfitting in small sample conditions. We therefore propose to run PCA on a proxy for the optimal margin distribution to obtain uncorrelated features that approximately preserve the margin distribution, and hence the important discriminative information for classification, without the same risk of overfitting.
Our heuristic argument runs as follows: PCA is the best linear (lossy) compression method if the reconstruction process is also linear. Therefore, the strength of PCA is in preserving the data to which it is applied. For the purpose of dimensionality reduction for linear classification tasks, we are not interested in preserving the data points, but in selecting the features that are best for discriminating between them. If we know in advance what information is useful for classification, we can extract some structures containing that information from the dataset and run PCA on those structures to obtain the features that are best for preserving them. If we then reduce the dimension of the original data by projecting onto these features, we expect the discriminative information to be well preserved by the projection. Based on this intuition, we devised four PCA variants for the task of linear classification, which we will evaluate in Sections 4 and 5. For now, we simply present the four heuristic alternatives in this section.
To illustrate the basic ideas, we focus on the two-class classification problem since generalization to multiclass cases is straightforward. Let $\{(x_i, y_i)\}_{i=1}^{n}$ be a set of labeled training data points, where for convenience we assume $x_i$ is a point in $\mathbb{R}^d$, and $y_i$ is the class label, $y_i \in \{-1, +1\}$. The crucial question is how to approximate the margin distribution of the dataset without access to a classifying hyperplane. Now the margin distribution contains the information about the differences between the data points of the two classes that is meaningful for the classification problem. Therefore, our first PCA variant represents the margin distribution as the differences between the data points of the two classes. Let $I_-$ and $I_+$ be the sets of indices of the data points that belong to class $-1$ and class $+1$ respectively, i.e. $I_- = \{i : y_i = -1\}$ and $I_+ = \{i : y_i = +1\}$.
3.2 Algorithm: MPCA0
Define the following structures
(10)  $d_{ij} = x_i - x_j, \quad i \in I_+, \; j \in I_-$
where $I_+$ and $I_-$ index the training points of class $+1$ and class $-1$. Then the structures $d_{ij}$ should reasonably represent the margin distribution well. We can then run “uncentered” PCA on these structures to obtain the $k$ most significant principal components. These principal components are the features that best represent this proxy for the margin distribution and therefore contain the most discriminative information. We call this scheme MPCA0. The algorithm of MPCA0 is shown in Figure 1.
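MPCA0

input:  training set $\{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, +1\}$
target dimension $k$
let $I_+ = \{i : y_i = +1\}$
let $I_- = \{i : y_i = -1\}$
let $D$ be the matrix with rows $(x_i - x_j)^\top$, $i \in I_+$, $j \in I_-$
let $v_1, \dots, v_k$ be the eigenvectors of $D^\top D$
with the $k$ largest eigenvalues
let $P = [v_1, \dots, v_k]$
output:  $P$, $\{P^\top x_i\}_{i=1}^{n}$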
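It is not hard to see that the number of difference vectors $d_{ij}$ is $n_+ n_-$, where $n_+$ and $n_-$ are the numbers of training points in class $+1$ and class $-1$, which is very large for large sample data. As a result, the MPCA0 algorithm is more computationally expensive than usual PCA: it has the time complexity of PCA run on $n_+ n_-$ points rather than on $n$ points.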
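The MPCA0 scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: labels are taken to be in $\{-1, +1\}$, the helper names `top_k_eigvecs` and `mpca0` are hypothetical, and the four-point dataset is constructed by hand so that centered PCA and MPCA0 pick different directions.

```python
import numpy as np

def top_k_eigvecs(A, k):
    """Columns are the eigenvectors of symmetric A with the k largest eigenvalues."""
    vals, vecs = np.linalg.eigh(A)
    return vecs[:, np.argsort(vals)[::-1][:k]]

def mpca0(X, y, k):
    """Uncentered PCA on all pairwise differences between the two classes."""
    Xp, Xm = X[y == 1], X[y == -1]
    # All n_+ * n_- difference vectors x_i - x_j, stacked as rows of D.
    D = (Xp[:, None, :] - Xm[None, :, :]).reshape(-1, X.shape[1])
    P = top_k_eigvecs(D.T @ D, k)            # d x k
    return P, X @ P                          # projection matrix and projected data

# Toy data: classes differ only in coordinate 0; coordinate 1 has more spread.
X = np.array([[1.0, 1.2], [1.0, -1.2], [-1.0, 1.2], [-1.0, -1.2]])
y = np.array([1, 1, -1, -1])

P, Z = mpca0(X, y, k=1)                      # MPCA0 selects coordinate 0
Xc = X - X.mean(axis=0)
P_pca = top_k_eigvecs(Xc.T @ Xc, 1)          # centered PCA selects coordinate 1
```

On this toy example the discriminative coordinate has the smaller total variance, so centered PCA discards it, while the between-class differences make it dominant for MPCA0.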
3.3 Algorithm: MPCA1a and MPCA1b
With a little more consideration, it is not hard to come up with a natural alternative to the pairwise differences $d_{ij}$ that resolves this time complexity issue. The Maximum Likelihood (ML) consistent estimates of the class means are
(11)  $\hat{\mu}_+ = \frac{1}{n_+} \sum_{i \in I_+} x_i, \quad \hat{\mu}_- = \frac{1}{n_-} \sum_{i \in I_-} x_i$
Define the variables
(12)  $z_i = x_i - \hat{\mu}_-$ for $i \in I_+$, and $z_i = x_i - \hat{\mu}_+$ for $i \in I_-$
In words, $z_i$ is defined as the difference between $x_i$ and the mean of the other class. Intuitively, $z_i$ captures the same information as $d_{ij}$ and it can be viewed as a conditioning of the previous problem. The sample size is now $n$, which is typically much smaller than the number $n_+ n_-$ of pairwise differences. We call the resulting algorithm MPCA1a.
In the definition (12), the sample means of the two classes are used. Especially in the situation of small sample data or very imbalanced classes, the quality of the estimates of the class-conditional means is a concern. In this case, replacing the sample means with the sample medoids may be a more robust option (here the class-conditional medoid is the vector consisting of the class-conditional sample median of each individual feature). We call the corresponding algorithm MPCA1b. Both versions of the algorithm are shown in Figure 2. The time complexity of this approach is the same as that of standard PCA.
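MPCA1a (MPCA1b)

input:  training set $\{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, +1\}$
target dimension $k$
let $I_+ = \{i : y_i = +1\}$, $I_- = \{i : y_i = -1\}$
let $\hat{\mu}_+$ be the mean (medoid) of $\{x_i : i \in I_+\}$
let $\hat{\mu}_-$ be the mean (medoid) of $\{x_i : i \in I_-\}$
let $z_i = x_i - \hat{\mu}_-$ for $i \in I_+$, $z_i = x_i - \hat{\mu}_+$ for $i \in I_-$
let $Z$ be the matrix with rows $z_i^\top$
let $v_1, \dots, v_k$ be the eigenvectors of $Z^\top Z$
with the $k$ largest eigenvalues
let $P = [v_1, \dots, v_k]$
output:  $P$, $\{P^\top x_i\}_{i=1}^{n}$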
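A compact sketch of MPCA1a and MPCA1b is given below, assuming labels in $\{-1, +1\}$. The function name `mpca1` and its `center` argument are hypothetical; following the paper's footnote, the "medoid" is taken to be the coordinate-wise class-conditional median.

```python
import numpy as np

def mpca1(X, y, k, center="mean"):
    """MPCA1a (center='mean') or MPCA1b (center='median', coordinate-wise)."""
    stat = np.mean if center == "mean" else np.median
    mu_p = stat(X[y == 1], axis=0)           # centre of class +1
    mu_m = stat(X[y == -1], axis=0)          # centre of class -1
    # Subtract from each point the centre of the *other* class.
    Z = np.where((y == 1)[:, None], X - mu_m, X - mu_p)
    vals, vecs = np.linalg.eigh(Z.T @ Z)     # uncentered PCA on the z_i
    P = vecs[:, np.argsort(vals)[::-1][:k]]  # d x k
    return P, X @ P

# Toy data: classes differ only in coordinate 0; coordinate 1 has more spread.
X = np.array([[1.0, 1.2], [1.0, -1.2], [-1.0, 1.2], [-1.0, -1.2]])
y = np.array([1, 1, -1, -1])

P_a, Z_a = mpca1(X, y, k=1, center="mean")     # MPCA1a
P_b, Z_b = mpca1(X, y, k=1, center="median")   # MPCA1b
```

Only $n$ vectors are formed here, so the cost of building the $d \times d$ matrix matches that of ordinary PCA on the same data.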
3.4 Algorithm: MPCA2
Our final algorithm, MPCA2, attempts to capture the margin distribution more faithfully by mimicking equation (9) more closely. In this algorithm, we do not use all the pairs of data points in the two classes. Instead, we construct the difference vector between a data point and its nearest neighbor in the other class as a proxy for the margin distribution.
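MPCA2

input:  training set $\{(x_i, y_i)\}_{i=1}^{n}$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, +1\}$
target dimension $k$
let $I_+ = \{i : y_i = +1\}$, $I_- = \{i : y_i = -1\}$
let $d_i = x_i - x_j$, such that $i \in I_+$, $j = \arg\min_{j' \in I_-} \|x_i - x_{j'}\|$
or $i \in I_-$, $j = \arg\min_{j' \in I_+} \|x_i - x_{j'}\|$
let $D$ be the matrix with rows $d_i^\top$
let $v_1, \dots, v_k$ be the eigenvectors of $D^\top D$
with the $k$ largest eigenvalues
let $P = [v_1, \dots, v_k]$
output:  $P$, $\{P^\top x_i\}_{i=1}^{n}$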
Our experimental results show that MPCA2 works extremely well, especially for small, high-dimensional datasets. It is not hard to see that constructing the nearest-neighbor differences by brute force takes $O(n_+ n_- d)$ operations, where $n_+$ and $n_-$ are the class sizes, and that there are no more than $n$ of them. Therefore, the time complexity of MPCA2 is the same as that of vanilla PCA.
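The nearest-neighbor construction of MPCA2 can be sketched as below. This is an illustrative reading of the description, with assumptions made explicit: labels are in $\{-1, +1\}$, nearest neighbors are found by brute force in Euclidean distance, ties are broken by taking the first minimizer, and duplicate difference vectors are not removed. The name `mpca2` is hypothetical.

```python
import numpy as np

def mpca2(X, y, k):
    """Uncentered PCA on differences to the nearest neighbour in the other class."""
    Xp, Xm = X[y == 1], X[y == -1]
    # Squared Euclidean distances between every +1 point and every -1 point.
    d2 = ((Xp[:, None, :] - Xm[None, :, :]) ** 2).sum(axis=2)
    Dp = Xp - Xm[d2.argmin(axis=1)]          # each +1 point minus its nearest -1 point
    Dm = Xm - Xp[d2.argmin(axis=0)]          # each -1 point minus its nearest +1 point
    D = np.vstack([Dp, Dm])                  # n difference vectors in total
    vals, vecs = np.linalg.eigh(D.T @ D)
    P = vecs[:, np.argsort(vals)[::-1][:k]]  # d x k
    return P, X @ P

# Toy data: classes differ only in coordinate 0; coordinate 1 has more spread.
X = np.array([[1.0, 1.2], [1.0, -1.2], [-1.0, 1.2], [-1.0, -1.2]])
y = np.array([1, 1, -1, -1])

P, Z = mpca2(X, y, k=1)
```

In this toy example every nearest-neighbor difference lies exactly along coordinate 0, so the within-class spread in coordinate 1 does not enter the matrix that is diagonalized at all.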
4 Theory
Analyzing our schemes in full generality is difficult. To give some insight into how they work, in this section we analyze a toy setting, which suggests why our schemes can work better than PCA.
We consider a simple case where the classes have a shared class-conditional covariance matrix and for convenience we assume the class-conditional distributions are multivariate Gaussian, and the difference between the class means is aligned with one eigenvector of the class-conditional covariance matrix. Without loss of generality, we assume the class-conditional covariance matrix is diagonal. Let and
be two random variables such that
(13)  
(14)  
(15) 
where
(16)  
(17)  
(18) 
and where are not in any particular order. As a result, we have
(19) 
It is clear in this setting that the only feature that is useful for classification is the first coordinate. A successful dimensionality reduction approach for this classification problem would have to include the first coordinate in the selected features. In the case , the optimal separating hyperplane is and the corresponding margin distribution is . The whole margin distribution would be preserved if an approach includes the first coordinate in the reduced features.
Now suppose we sample data points , then PCA works by eigendecomposing
(20) 
where is the sample mean. In expectation the above expression is just the covariance of the variable , which is equal to
(21) 
Suppose the target dimension is , then for PCA to be successful, has to be larger than . In most classification problems the between-class variance is usually larger than the within-class variances, that is is usually large enough that is larger than . This is why PCA can work well for classification tasks even though it is not designed for that purpose. In other situations, PCA may not work well. In particular here it will either be as good as possible or as bad as possible.
To see how MPCA0 works, we define the random variable
(22) 
MPCA0 works by eigendecomposing . In expectation, this matrix is equal to
(23) 
Since , MPCA0 has a better chance of selecting the first coordinate in the case . Our experiments indeed show that MPCA0 usually performs better than PCA.
To see how MPCA1a and MPCA1b work, we use similar arguments. First, define the following random variable
(24) 
From the definition, we have
(25) 
MPCA1a and MPCA1b work by eigendecomposing , the expectation of which is equal to
(26) 
This indicates that MPCA1a and MPCA1b have a much better chance than PCA of selecting the first coordinate as the useful feature in the case the target dimension is 1, since a larger quantity is added to rather than .
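The qualitative comparison above can be checked numerically. The sketch below is our own illustration under explicit assumptions: two Gaussian classes in two dimensions with a shared diagonal covariance, class means at $\pm\Delta/2$ along the first coordinate, an imbalanced class prior, and the true class means used in place of their estimates. All numerical values (`p`, `delta`, `sigma2`, sample sizes) are hypothetical. Under these assumptions the diagonal of the matrix each scheme diagonalizes is, in expectation, $\sigma_j^2 + p(1-p)\Delta_j^2$ for PCA, $2\sigma_j^2 + \Delta_j^2$ for MPCA0, and $\sigma_j^2 + \Delta_j^2$ for MPCA1a.

```python
import numpy as np

rng = np.random.default_rng(0)
p, delta = 0.2, 2.0                      # P(y = +1) and distance between class means
sigma2 = np.array([1.0, 2.5])            # diagonal of the shared covariance
mu_p = np.array([delta / 2, 0.0])        # class means differ only in coordinate 0
mu_m = -mu_p

n = 200_000
y = np.where(rng.random(n) < p, 1, -1)
X = rng.normal(size=(n, 2)) * np.sqrt(sigma2) + np.where((y == 1)[:, None], mu_p, mu_m)
Xp, Xm = X[y == 1], X[y == -1]

# PCA: per-coordinate variance of the pooled data.
pca_diag = X.var(axis=0)

# MPCA0: second moments of differences between random cross-class pairs.
pairs = 200_000
d = Xp[rng.integers(len(Xp), size=pairs)] - Xm[rng.integers(len(Xm), size=pairs)]
mpca0_diag = (d ** 2).mean(axis=0)

# MPCA1a: second moments of each point minus the (true) mean of the other class.
z = np.where((y == 1)[:, None], X - mu_m, X - mu_p)
mpca1_diag = (z ** 2).mean(axis=0)
```

With these values the discriminative first coordinate has the smaller entry for PCA (roughly 1.64 against 2.5) but the larger entry for MPCA0 (roughly 6 against 5) and MPCA1a (roughly 5 against 2.5), so with target dimension 1 only the margin-based schemes would retain it.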
Remark.
From the insight obtained above, PCA works poorly in the case of class imbalance due to the fact that the more class imbalance the smaller becomes. Our schemes do not suffer from the same problem.
Remark.
Our schemes are much less vulnerable to the problem of a low ratio of between-class variance to within-class variance that makes PCA fail. However, our schemes can still fail in extreme cases. To make our approach even less vulnerable, we devise the scheme MPCA2. MPCA2 is based on our intuitive arguments.
The situation quickly becomes intractable to analysis in the more general case that the difference between the class means is not aligned with any eigenvector of the class-conditional covariance matrix. In this case with other quantities assumed the same as above. Now the covariance matrices to be eigendecomposed are , , respectively. These are not diagonal and there is no simple analytical form for the eigenvectors. The most discriminative feature in theory is now FLD along the direction
(27) 
which is not an eigenvector of the above matrices. We shall see from our experiments in Section 5 that for small target dimension , our approach nevertheless outperforms PCA in general, and despite being a (supervised) filter method, is highly competitive with wrapper approaches such as PLS and Lasso.
5 Experiments
In this section, we present empirical results on the performance of our schemes: MPCA0, MPCA1a, MPCA1b, and MPCA2. We compare them to vanilla PCA, PLS, and Lasso in terms of test errors. To obtain a comprehensive picture of the performance of the new schemes, we run experiments evaluating several widely used classifiers on a range of publicly available datasets with different characteristics. The classifiers used were Fisher Linear Discriminant (FLD), SVM, Logistic Regression (LR), and Naive Bayes (NB). We use two groups of publicly available real datasets. The first group contains datasets with more data points than dimensions; these are ionosphere [18], sonar [18], mushrooms [18], and splice [18]. The other group contains several small sample datasets with fewer data points than dimensions. These datasets are colon [2], prostate [25], ovarian [22], leukemia [10], leukemia large [10], and duke [30]. The information on these datasets is shown in Table 1.
name  source  #instances  #features

ionosphere  [18]  126+225  34
sonar  [18]  111+97  60
mushrooms  [18]  3916+4208  112
splice  [18]  1527+1648  60
colon  [2]  22+40  2000
prostate  [25]  50+52  6033
ovarian  [22]  24+30  1536
leukemia  [10]  47+25  3571
leukemia large  [10]  47+25  7129
duke  [30]  21+23  7129
5.1 Experiment Setup
In the experiments, test errors are obtained for each combination of a dimensionality reduction scheme, a target dimension $k$, a classifier and a dataset. Each such combination is fed 50 independent partitions of the dataset into a training set and a test set, where four fifths of the data was used for training and the remainder for testing, and the sampling was stratified to preserve class membership proportions. Hence 50 test errors are produced for each combination. These are then used to compute the mean and the standard deviation of the test errors for that combination. For each loop iteration for a particular dataset, the data splits were held constant. We did not use cross validation since the sign test assumes independent observations.
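The partitioning protocol above can be sketched as follows. This is a minimal illustration, not the authors' experimental code: the helper name `stratified_split`, the rounding rule for per-class training sizes, and the random seed are assumptions; the class sizes are those listed for the colon dataset in Table 1.

```python
import numpy as np

def stratified_split(y, train_frac=0.8, rng=None):
    """Return (train_idx, test_idx) with class proportions preserved in both parts."""
    rng = np.random.default_rng() if rng is None else rng
    train = []
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        train.append(idx[: int(round(train_frac * len(idx)))])
    train = np.sort(np.concatenate(train))
    test = np.setdiff1d(np.arange(len(y)), train)
    return train, test

# Class sizes of the colon dataset from Table 1 (22 + 40 instances).
y = np.array([-1] * 22 + [1] * 40)

rng = np.random.default_rng(0)
splits = [stratified_split(y, rng=rng) for _ in range(50)]   # 50 independent partitions
```

Holding the list `splits` fixed while looping over schemes, target dimensions and classifiers gives paired test errors per split, which is what the sign test used later requires.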
We choose two representative target dimensions for each dataset. For small sample datasets, the target dimensions are approximately $r/4$ and $r/2$, where $r$ is the rank of the training data matrix, which is roughly four fifths of the number of instances. For the other datasets, the target dimensions are chosen to be approximately $d/6$ and $d/3$, where $d$ is the number of features (original data dimension). We do not include larger target dimensions here. This is because, on the one hand, small target dimensions are the settings of practical interest. On the other hand, with increased target dimension $k$, the differences between different dimensionality reduction schemes become less obvious. This is due to the fact that as the target dimension approaches the rank of the training data matrix, the number of features retained by the different schemes is large enough to incorporate almost all the discriminative information. Our own experiments, not presented here, also indicate this.
For the datasets with more features than instances, the test errors of MPCA0, MPCA1a, MPCA1b, and MPCA2 are compared with those of PCA, PLS, and Lasso. For the datasets with fewer features than instances, due to the high computational cost of the MPCA0 scheme, only MPCA1a, MPCA1b, and MPCA2 are run and compared to PCA, PLS, and Lasso. We use the SVM and Logistic Regression implemented by liblinear [6]. The version of SVM classifier used for the experiments is the regularized loss SVM, and the logistic regression classifier used here is regularized. The classifier parameters used were selected by cross validation on the whole original dataset to provide a consistent baseline, even though this may have favored the full wrapper methods since they have access to the tuned classifier. The Naive Bayes model uses a Gaussian class-conditional likelihood model and we used the built-in MATLAB version.
Finally, we run PLS and Lasso to reduce dimensionality for the classification tasks by casting the classification tasks as regression tasks with discrete targets and obtaining the important features. The implementations of PLS and Lasso used are the built-in MATLAB functions plsregress and lasso.
5.2 Results
K  Classifier  MPCA1a  MPCA1b  MPCA2 

Ionosphere  
5  SVM  0.3318  0.6864  
LR  0.8569  0.9878  
11  SVM  
LR  
Sonar  
10  SVM  0.5  0.9981  0.6911 
LR  0.2272  0.9061  0.9061  
20  SVM  0.998  0.1215  0.111 
LR  0.9988  0.6358  0.2664  
Mushrooms  
18  SVM  0.7336  
LR  0.1856  
37  SVM  
LR  
Splice  
10  SVM  1  
LR  1  
20  SVM  1  
LR  1 
K  Classifier  MPCA0  MPCA1a  MPCA1b  MPCA2 
Colon  
12  SVM  0.25  0.75  0.25  0.3125 
LR  0.9648  0.7734  0.1445  0.3036  
24  SVM  0.5  0.125  0.5  
LR  0.1051  
Prostate  
20  SVM  0.5  0.212  0.0898  
LR  0.0898  
40  SVM  0.5  0.2266  0.3633  0.2272 
LR  0.5  0.5  0.5  0.1509  
Ovarian  
10  SVM  0.6762  0.6762  0.968  0.1635 
LR  0.2905  0.1316  
21  SVM  0.8867  0.8867  0.9961  0.2272 
LR  0.8555  0.8555  0.9648  
Leukemia  
14  SVM  0.5  0.9375  0.875  0.2539 
LR  0.3125  0.5  0.1094  
28  SVM  1  1  0.5  0.9375 
LR  0.5  0.5  0.875  0.6875  
Leuklarge  
14  SVM  0.1133  0.0547  0.1719  0.5982 
LR  0.8867  0.1334  0.1938  
28  SVM  0.1875  0.5  0.5  0.3633 
LR  0.0625  0.5  
Duke  
8  SVM  0.073  0.1509  0.1334  0.3318 
LR  0.068  
17  SVM  0.125  0.125  0.3125  0.3872 
LR  0.1875  0.3438  0.2266  0.337 
The detailed results are in the appendix and shown in Tables 4–13. As can be seen from the tables, our schemes are superior to PCA in general and highly competitive with PLS and Lasso. This indicates that our schemes are indeed able to select the most discriminative features. Due to the large variance of the test errors (this is more obvious for small sample datasets, due to the smaller number of observations and the fact that the test error is very sensitive to different partitions of the original datasets), it appears initially that our schemes are not statistically significantly better than PCA; however, these averages are across the different partitions of the datasets and performance on the different individual data splits is highly variable. To see if our approach indeed significantly outperforms PCA, we compare test errors on independent data splits and use a nonparametric sign test to evaluate the null: See Figure 4 – Each test error corresponds to an independent data split. We see that for this dataset MPCA1b is at least as good as PCA on every individual split, but the high between-sample variance in the test error will mask this fact if we only consider the average test error across data splits. We tabulate the sign test p-values of our algorithms versus PCA in Tables 2 and 3. Here we only show results on SVM and Logistic Regression. The boldfaced numbers indicate the cases where $p < 0.05$ and we have made no correction to these values for multiple comparisons since the smallest p-values are generally below in any case. We see that our approach does significantly outperform PCA, especially when the number of retained dimensions is small.
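A paired sign test of this kind can be sketched with the standard library alone. The version below is an assumption about the exact variant used: it is one-sided (alternative: our scheme has lower test error than PCA), discards tied splits, and uses the exact Binomial$(n, 1/2)$ tail. The function name and the example error lists are hypothetical.

```python
from math import comb

def sign_test_pvalue(err_ours, err_pca):
    """One-sided exact sign test: P(at least `wins` successes) under Binomial(n, 1/2)."""
    wins = sum(a < b for a, b in zip(err_ours, err_pca))     # splits where we are better
    losses = sum(a > b for a, b in zip(err_ours, err_pca))   # splits where PCA is better
    n = wins + losses                                        # ties are discarded
    if n == 0:
        return 1.0
    return sum(comb(n, i) for i in range(wins, n + 1)) / 2 ** n

# Hypothetical per-split test errors (in percent) for one scheme and for PCA.
ours = [10.0, 12.5, 8.0, 8.0, 15.0]
pca = [12.0, 12.5, 9.0, 7.5, 16.0]
p_value = sign_test_pvalue(ours, pca)    # 3 wins, 1 loss, 1 tie
```

Because the statistic depends only on the signs of the paired differences, many splits with identical test errors leave few informative pairs, and the attainable p-values are then coarse dyadic fractions such as 0.5, 0.25 or 0.3125.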
6 Discussion
We presented four simple filter PCA variants, each of which seems to improve the performance on projected data in a classification task over standard PCA. Our ideas in this paper are primarily based on the observation that the margin distribution contains most of the discriminative information for a classification task, and the PCA objective is not obviously aligned with preserving this quantity. Therefore we propose four heuristic structures to represent the margin distribution, on which we then perform PCA. Extensive empirical evaluations suggest these do indeed represent the margin distribution well. Whether there are some better structures for representing the margin distribution that can be evaluated outside of a wrapper approach is an interesting question for future research. Any such better structures should give us a better dimensionality reduction scheme for linear classification. Whether theoretical guarantees on classification performance using our approach are possible looks like a difficult open problem – the main hurdle is how to provide typical case guarantees for a deterministic algorithm without imposing restrictive conditions on the data generator. A further direction for future research is to consider nonlinear dimensionality reduction schemes, such as kernel PCA, in a similar light.
Appendix A Detailed Experimental Results
In the tables below, the test errors (mean±std) are shown in percent and the boldfaced numbers represent the best performing schemes.
K  Scheme  SVM  LR  FLD  NB 

5  PCA  26.7±3.0  26.4±2.8  17.0±3.7
PLS  23.5±3.0  23.4±3.2  20.7±5.1
Lasso  27.7±2.6  27.0±2.8  20.0±3.9  13.2±3.7
MPCA1a  26.6±3.0  26.5±2.8  16.4±3.8  17.2±4.3
MPCA1b  16.7±3.6  17.5±4.3
MPCA2  26.9±2.4  27.1±2.6  21.3±4.0  18.8±4.7
11  PCA  25.3±2.6  24.4±2.7  18.4±3.6  10.4±3.7
PLS  20.1±3.7  19.5±3.3  14.8±4.3
Lasso  20.2±3.4
MPCA1a  24.7±2.9  24.1±2.7  18.5±3.7  11.2±3.9
MPCA1b  21.1±2.9  20.4±3.2  18.0±3.8  11.1±3.9
MPCA2  23.5±3.2  23.2±2.9  20.7±3.6  15.2±5.0
K  Scheme  SVM  LR  FLD  NB 

10  PCA  26.8±5.7  25.9±6.2  24.4±6.7
PLS  30.2±7.0  29.0±6.5  25.6±6.8  37.5±8.1
Lasso  30.9±5.9  27.8±6.3  24.6±6.2  32.8±6.5
MPCA1a  24.3±6.5
MPCA1b  27.8±5.5  26.5±6.1  24.4±6.6  24.9±6.3
MPCA2  27.6±5.7  26.1±6.0  23.8±6.6  24.8±6.9
20  PCA  27.5±5.7  25.8±6.7
PLS  27.6±6.2  27.9±5.4  28.2±6.1  38.5±8.4
Lasso  30.1±5.5  29.3±7.0  27.1±5.4  32.1±7.2
MPCA1a  28.3±5.7  26.7±6.2  23.8±7.0  26.0±6.0
MPCA1b  27.1±5.9  26.0±6.2  23.7±6.7  28.2±8.0
MPCA2  24.0±6.1  27.2±8.2
K  Scheme  SVM  LR  FLD  NB 

18  PCA  3.2±0.3  3.1±0.4  3.8±0.5  11.4±0.5
PLS  0.5±0.3  0.7±0.4  11.9±0.7
Lasso  1.6±1.1
MPCA1a  2.8±0.4  2.7±0.4  3.5±0.5  11.7±0.6
MPCA1b  3.2±0.3  3.0±0.4  3.1±0.4  11.4±0.6
MPCA2  1.5±0.4  1.7±0.3  2.7±0.4  10.7±0.9
37  PCA  2.0±0.3  2.0±0.3  3.0±0.5  10.0±0.7
PLS  0.2±0.1  0.7±0.6  12.1±0.8
Lasso  1.3±0.8
MPCA1a  1.8±0.3  1.9±0.3  2.8±0.5  10.8±0.6
MPCA1b  1.3±0.3  2.0±0.3  3.1±0.5  8.5±0.6
MPCA2  0.3±0.1  0.4±0.2  1.7±0.5  11.1±0.9
K  Scheme  SVM  LR  FLD  NB 

10  PCA  18.2±1.3  18.2±1.1  17.9±1.3  16.1±1.3
PLS  22.4±1.5
Lasso  18.4±1.3  18.5±1.9  18.0±1.6  15.1±1.6
MPCA1a  16.5±1.3  16.4±1.2  16.0±1.2  18.1±1.5
MPCA1b  22.0±1.4  22.3±1.6  21.4±1.8  22.2±1.7
MPCA2  17.0±1.3  16.8±1.3  16.2±1.2
20  PCA  17.9±1.3  17.7±1.0  17.1±1.5  16.0±1.5
PLS  15.4±1.2  22.3±1.6
Lasso  16.3±1.4  16.2±1.3
MPCA1a  16.8±1.3  16.4±1.2  15.9±1.3  18.0±1.6
MPCA1b  21.0±1.3  21.1±1.3  20.0±1.8  21.0±1.7
MPCA2  17.0±1.3  16.8±1.3  16.2±1.3  14.3±1.2
K  Scheme  SVM  LR  FLD  NB 

12  PCA  11.7±9.1  13.5±8.9  17.5±8.6
PLS  14.0±9.3  13.2±8.9  14.7±7.3  17.8±8.6
Lasso  21.2±11.9  13.2±8.9  22.2±8.8
MPCA0  11.3±9.2  14.0±9.0  12.2±8.1  21.8±9.8
MPCA1a  11.7±9.1  13.7±9.3  12.2±7.9  15.8±8.5
MPCA1b  12.2±8.5  16.3±8.6
MPCA2  11.3±9.2  12.8±9.1  12.7±8.6  14.3±8.4
24  PCA  12.8±9.4  14.8±9.1  13.7±8.2  20.2±9.5
PLS  15.3±9.7  14.5±9.8  14.0±7.6  21.5±8.6
Lasso  23.5±11.0  15.2±8.7  22.2±8.8
MPCA0  12.5±9.1  20.0±8.7
MPCA1a  12.3±9.4  13.0±9.8  13.3±8.1  15.7±8.4
MPCA1b  12.7±9.6  13.0±9.4  14.0±7.6  15.5±8.4
MPCA2  13.8±8.8  13.8±7.3  14.0±8.5
K  Scheme  SVM  LR  FLD  NB 

20  PCA  10.1±5.2  10.4±6.9  9.0±4.8  33.9±12.8
PLS  9.2±5.1  8.2±6.1  8.5±4.8  43.9±7.8
Lasso  9.3±6.1  8.3±5.9  10.3±5.6
MPCA0  9.9±5.3  9.8±6.8  8.8±5.0  39.7±9.9
MPCA1a  9.4±4.7  9.2±6.6  8.7±5.0  42.7±8.8
MPCA1b  9.1±4.8  8.9±5.9  8.0±5.1  35.0±9.0
MPCA2  30.3±11.0
40  PCA  8.3±4.7  8.0±6.1  7.5±4.7  42.1±8.9
PLS  9.3±4.8  8.4±6.1  8.5±4.8  46.5±5.4
Lasso  10.0±5.4
MPCA0  8.2±4.8  7.9±6.0  7.2±4.6  45.0±6.7
MPCA1a  8.0±4.7  7.9±6.0  47.3±5.4
MPCA1b  8.1±4.9  7.9±5.5  7.4±4.9  42.7±7.8
MPCA2  7.9±5.2  7.4±6.4  7.2±5.2  33.4±11.3
K  Scheme  SVM  LR  FLD  NB 

10  PCA  18.5 ± 10.9  23.1 ± 10.6  26.0 ± 11.6  45.8 ± 10.4
PLS  19.1 ± 10.7  17.3 ± 8.9  48.4 ± 11.5
Lasso  22.5 ± 13.3  26.7 ± 10.5  27.8 ± 11.4
MPCA0  18.5 ± 10.4  22.4 ± 10.6  23.6 ± 11.2  45.3 ± 13.6
MPCA1a  18.4 ± 10.1  20.7 ± 10.9  22.2 ± 12.2  47.6 ± 11.8
MPCA1b  20.2 ± 10.6  21.8 ± 11.6  22.0 ± 11.2  47.3 ± 11.8
MPCA2  20.4 ± 10.2  33.6 ± 12.5
21  PCA  18.7 ± 10.5  19.1 ± 9.6  20.0 ± 10.1  42.5 ± 11.4
PLS  50.4 ± 8.3
Lasso  18.5 ± 11.9  24.5 ± 11.7  24.9 ± 13.3
MPCA0  19.3 ± 10.3  19.5 ± 9.2  19.1 ± 10.9  44.2 ± 11.8
MPCA1a  19.3 ± 11.1  19.5 ± 9.7  18.9 ± 11.1  45.1 ± 11.6
MPCA1b  19.8 ± 11.0  19.8 ± 9.3  21.3 ± 11.4  45.8 ± 11.3
MPCA2  18.0 ± 10.1  16.0 ± 9.5  19.6 ± 12.0  29.3 ± 12.4
K  Scheme  SVM  LR  FLD  NB 

14  PCA  3.1 ± 3.6  3.4 ± 3.9  2.0 ± 3.2  2.4 ± 3.7
PLS  4.0 ± 4.1  3.1 ± 4.1  2.9 ± 3.5  3.4 ± 4.8
Lasso  5.9 ± 4.9  4.1 ± 5.0  5.0 ± 6.2  4.4 ± 5.2
MPCA0  3.0 ± 3.6  3.1 ± 4.1
MPCA1a  3.4 ± 3.6  3.3 ± 4.1  2.0 ± 3.2  2.4 ± 3.4
MPCA1b  3.3 ± 3.6  2.9 ± 4.1  2.0 ± 3.2  2.6 ± 3.5
MPCA2  2.1 ± 3.3  2.7 ± 4.1
28  PCA  2.4 ± 3.4  2.4 ± 4.0  2.0 ± 3.2  2.9 ± 4.6
PLS  4.7 ± 4.2  3.3 ± 4.4  2.9 ± 3.5  3.6 ± 4.8
Lasso  6.0 ± 5.3  4.6 ± 4.9  6.0 ± 6.7  4.3 ± 4.8
MPCA0  2.9 ± 3.5  1.9 ± 3.2  4.7 ± 5.7
MPCA1a  2.4 ± 3.4  2.3 ± 3.9  2.0 ± 3.2
MPCA1b  2.6 ± 4.0  2.3 ± 3.4
MPCA2  2.7 ± 3.5  2.4 ± 4.0  2.1 ± 3.3  3.0 ± 3.8
K  Scheme  SVM  LR  FLD  NB 

14  PCA  5.7 ± 5.6  6.1 ± 4.8  3.9 ± 4.4  11.0 ± 7.2
PLS  4.7 ± 4.5  5.0 ± 5.4  3.9 ± 5.6  32.6 ± 13.6
Lasso  8.9 ± 7.2
MPCA0  4.7 ± 4.9  6.3 ± 5.5  28.4 ± 7.7
MPCA1a  4.6 ± 4.7  5.4 ± 4.9  3.4 ± 4.4  21.4 ± 10.6
MPCA1b  4.9 ± 5.1  5.3 ± 5.2  4.0 ± 5.2  18.6 ± 9.9
MPCA2  5.3 ± 5.0  4.4 ± 5.2  3.6 ± 5.6  26.4 ± 8.2
28  PCA  4.7 ± 4.7  5.3 ± 4.7  2.9 ± 4.6  16.3 ± 9.7
PLS  4.4 ± 4.5  4.4 ± 4.8  3.9 ± 5.6  30.7 ± 12.2
Lasso  8.1 ± 6.9
MPCA0  4.1 ± 4.4  4.6 ± 4.9  33.6 ± 4.4
MPCA1a  4.6 ± 4.5  4.6 ± 4.7  3.1 ± 4.6  25.7 ± 8.9
MPCA1b  4.6 ± 4.5  4.3 ± 4.6  3.1 ± 4.6  24.7 ± 8.9
MPCA2  4.4 ± 4.8  5.1 ± 5.4  3.7 ± 5.4  16.1 ± 11.7
K  Scheme  SVM  LR  FLD  NB 

8  PCA  14.9 ± 11.6  16.4 ± 12.1  16.2 ± 11.7  22.0 ± 15.5
PLS  42.9 ± 13.1
Lasso  18.7 ± 11.1  19.3 ± 11.6  18.0 ± 12.9  20.4 ± 12.4
MPCA0  12.9 ± 9.6  13.8 ± 11.6  13.8 ± 10.9
MPCA1a  13.1 ± 9.2  13.3 ± 11.0  12.0 ± 10.2  21.8 ± 16.0
MPCA1b  13.1 ± 8.9  13.1 ± 11.6  12.4 ± 10.7  20.2 ± 15.8
MPCA2  14.2 ± 9.5  12.9 ± 12.4  14.2 ± 11.5  31.6 ± 15.3
17  PCA  13.8 ± 8.9  12.9 ± 10.8  13.1 ± 11.6  27.8 ± 17.4
PLS  49.8 ± 12.9
Lasso  16.2 ± 11.0  16.4 ± 10.8  18.0 ± 10.3
MPCA0  13.1 ± 8.9  12.2 ± 10.1  12.4 ± 11.1  21.8 ± 13.6
MPCA1a  13.1 ± 8.9  12.4 ± 9.9  12.0 ± 11.0  33.3 ± 16.9
MPCA1b  13.3 ± 9.0  12.2 ± 10.1  12.9 ± 11.1  27.3 ± 18.1
MPCA2  12.7 ± 9.8  12.2 ± 10.8  12.7 ± 9.5  38.9 ± 16.4
Acknowledgement
XL is supported by an internal study award at the University of Waikato.
References
 [1] Fabio Aiolli, Giovanni Da San Martino, and Alessandro Sperduti. A Kernel Method for the Optimization of the Margin Distribution, pages 305–314. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008.

 [2] U Alon, N Barkai, D A Notterman, K Gish, S Ybarra, D Mack, and A J Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences of the United States of America, 96(12):6745–6750, 1999.
 [3] Tijl De Bie, Nello Cristianini, and Roman Rosipal. Eigenproblems in pattern recognition. In Handbook of Geometric Computing, pages 129–167. Springer Berlin Heidelberg, 2005.
 [4] Richard G Brereton and Gavin R Lloyd. Partial least squares discriminant analysis: taking the magic away. Journal of Chemometrics, 28(4):213–225, 2014.
 [5] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–451, 2004.
 [6] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
 [7] Imola K Fodor. A survey of dimension reduction techniques, 2002.

 [8] Yoav Freund and Robert E. Schapire. Large Margin Classification Using the Perceptron Algorithm. Machine Learning - The Eleventh Annual Conference on Computational Learning Theory, 37(3):277–296, 1999.
 [9] Ashutosh Garg and Dan Roth. Margin Distribution and Learning Algorithms. In Proceedings of the Twentieth International Conference on Machine Learning, pages 210–217, 2003.
 [10] T. R. Golub. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, 286(5439):531–537, 1999.
 [11] Stefan Haufe, Frank Meinecke, Kai Görgen, Sven Dähne, John-Dylan Haynes, Benjamin Blankertz, and Felix Bießmann. On the interpretation of weight vectors of linear models in multivariate neuroimaging. NeuroImage, 87:96–110, 2014.
 [12] R Carter Hill, Thomas B Fomby, and S R Johnson. Component selection norms for principal components regression. Communications in Statistics - Theory and Methods, 6(4):309–334, 1977.
 [13] Tom Howley, Michael G Madden, Marie-Louise O’Connell, and Alan G Ryder. The effect of principal component analysis on machine learning accuracy with high-dimensional spectral data. Knowledge-Based Systems, 19(5):363–370, 2006.

 [14] Andreas Janecek, Wilfried Gansterer, M A Demel, and G F Ecker. On the relationship between feature selection and classification accuracy. In Journal of Machine Learning Research Workshop and Conference Proceedings 4, volume 91, pages 90–105, 2008.
 [15] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag New York, 2nd edition, 2002.
 [16] Ian T. Jolliffe. A Note on the Use of Principal Components in Regression. Journal of the Royal Statistical Society. Series C, 31(3):300–303, 1982.
 [17] Nikos Karampatziakis and Paul Mineiro. Discriminative Features via Generalized Eigenvectors. Proceedings of The 31st International Conference on Machine Learning, pages 494–502, 2014.
 [18] M Lichman. UCI Machine Learning Repository, 2013.
 [19] Aleix M. Martinez and Avinash C. Kak. PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2):228–233, 2001.
 [20] Mykola Pechenizkiy, Alexey Tsymbal, and Seppo Puuronen. PCA-based feature transformation for classification: issues in medical diagnostics. In Proceedings of the 17th IEEE Symposium on Computer-Based Medical Systems, pages 535–540, 2004.
 [21] Kristiaan Pelckmans, Johan Suykens, and Bart De Moor. A Risk Minimization Principle for a Class of Parzen Estimators. In J C Platt, D Koller, Y Singer, and S T Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1137–1144. Curran Associates, Inc., 2008.
 [22] Michèl Schummer, WaiLap V. Ng, Roger E. Bumgarner, Peter S. Nelson, Bernhard Schummer, David W. Bednarski, Laurie Hassell, Rae Lynn Baldwin, Beth Y. Karlan, and Leroy Hood. Comparative hybridization of an array of 21 500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene, 238(2):375–385, 1999.
 [23] J Shawe-Taylor and N Cristianini. Further Results on the Margin Distribution. In Proc. 12th Annu. Conf. on Comput. Learning Theory, pages 278–285. ACM Press, 1999.
 [24] J Shawe-Taylor and N Cristianini. Margin distribution bounds on generalization. In Paul Fischer and Hans Ulrich Simon, editors, Computational Learning Theory: 4th European Conference, EuroCOLT’99 Nordkirchen, Germany, March 29–31, 1999 Proceedings, pages 263–273. Springer Berlin Heidelberg, Berlin, Heidelberg, 1999.
 [25] Dinesh Singh, Phillip G. Febbo, Kenneth Ross, Donald G. Jackson, Judith Manola, Christine Ladd, Pablo Tamayo, Andrew A. Renshaw, Anthony V. D’Amico, Jerome P. Richie, Eric S. Lander, Massimo Loda, Philip W. Kantoff, Todd R. Golub, and William R. Sellers. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2):203–209, 2002.
 [26] Robert Tibshirani. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society B, 58(1):267–288, 1996.
 [27] Matthew Turk and Alex Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991.
 [28] Matthew A Turk and Alex P Pentland. Face recognition using eigenfaces. In Computer Vision and Pattern Recognition, 1991. Proceedings CVPR’91., IEEE Computer Society Conference on, pages 586–591. IEEE, 1991.

 [29] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag New York, 2nd edition, 2000.
 [30] M West, C Blanchette, H Dressman, E Huang, S Ishida, R Spang, H Zuzan, J A Olson, J R Marks, and J R Nevins. Predicting the clinical status of human breast cancer by using gene expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 98(20):11462–7, 2001.
 [31] Teng Zhang and Zhi-Hua Zhou. Large margin distribution machine. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’14, pages 313–322, 2014.
 [32] Teng Zhang and Zhi-Hua Zhou. Optimal Margin Distribution Machine. arXiv preprint arXiv:1604.03348, 2016.
 [33] W Zhao, R Chellappa, P.J. Phillips, and A Rosenfeld. Face recognition: A literature survey. ACM Computing Surveys, 35(4):399–458, 2003.