1 Introduction
Dimensionality reduction (DR) is a common preprocessing step for classification and other tasks. Learning a classifier on low-dimensional inputs is fast (though learning the DR itself may be costly). More importantly, DR can help learn a better classifier, particularly when the data does have a low-dimensional structure, and with small datasets, where DR has a regularizing effect that can help avoid overfitting. The reason is that DR can remove two types of “noise” from the input: (1) Independent random noise, which is uncorrelated with the input and the label, and mostly perturbs points away from the data manifold. Simply running PCA, or another unsupervised DR algorithm, with an adequate number of components, can achieve this to some extent. (2) Unwanted degrees of freedom, which are possibly nonlinear, along which the input changes but the label does not. This more radical form of denoising requires the DR to be informed by the labels, of course, and is commonly called supervised DR.
Call $F$ the DR mapping, which takes an input $x \in \mathbb{R}^D$ and projects it to $L$ dimensions, and $g$ the classifier, which applies to the low-dimensional vector $F(x)$ and produces a label $y$, so that the overall classifier is $g \circ F$. The great majority of supervised DR algorithms are “filter” approaches (Kohavi and John, 1998; Guyon and Elisseeff, 2003), where one first learns $F$ from the training set of pairs $(x_n, y_n)$, fixes it, and then trains $g$ on the pairs $(F(x_n), y_n)$, using a standard classification algorithm as if the inputs were $F(x)$. An example of supervised DR is linear discriminant analysis (LDA), which learns the best linear DR $F$ in the sense of minimal within-class scatter and maximal between-class scatter. The key in filter approaches is the design of a proxy objective function over $F$ that leads to learning a good overall classifier $g \circ F$. Although the particulars differ among existing supervised DR methods (e.g. Belhumeur et al., 1997; Globerson and Roweis, 2006), usually they encourage $F$ to separate inputs or manifolds having different labels from each other. While this makes intuitive sense, and filter methods (and even PCA) can often do a reasonable job, it is clear that the best $g$ and $F$ are not obtained by optimizing a proxy function over $F$ and then having $g$ minimize the classification error (our real objective), but by jointly minimizing the latter over $g$ and $F$ (the “wrapper” approach). However, filter approaches (particularly using a linear DR) are far more popular than wrapper ones. With filters, the classifier is learned as usual once $F$ has been learned. With wrappers, learning $F$ and $g$ involves a considerably more difficult, nonconvex optimization, particularly with nonlinear DR, having many more parameters that are coupled with each other. At this point, an important question arises: what is the real role of (nonlinear) DR in classification, and how does it depend on the choice of mapping $F$ and of latent dimensionality $L$? Guided by this overall goal, the contributions of this paper are as follows.
(1) We propose a simple, efficient, scalable and generic way of jointly optimizing the classifier’s loss over $(F, g)$, and in this paper we apply it to the case where $F$ is an RBF network and $g$ a linear SVM. (2) Armed with this algorithm, we study the role of nonlinear DR in the classifier’s performance and the latent space representation, and find lessons that apply to filter design. (3) We obtain a nonlinear low-dimensional SVM classifier that achieves state-of-the-art performance while being fast at test time.

A shorter version of this work appears in a conference paper (Wang and Carreira-Perpiñán, 2014).
2 Joint optimization of mapping and classifier using auxiliary coordinates
We describe first the approach for binary classification and focus on the case where $g$ is a linear SVM. We give the multiclass case in section 2.4. Given a training set of $N$ input patterns $x_n \in \mathbb{R}^D$ and corresponding labels $y_n \in \{-1, +1\}$, $n = 1, \dots, N$, we want to learn a nonlinear low-dimensional classifier $g(F(x))$ that optimizes the following objective:
(1)  $\min_{F, g, \xi} \; \lambda R(F) + \frac{1}{2}\|w\|^2 + C \sum_{n=1}^N \xi_n \quad \text{s.t.} \quad y_n (w^T F(x_n) + b) \ge 1 - \xi_n, \ \xi_n \ge 0, \ n = 1, \dots, N.$
This is the usual linear SVM objective of finding a separating hyperplane with maximum margin, but with inputs given by $F(x)$, which has a regularization term $\lambda R(F)$, where $g = \{w, b\}$ are the weights and bias of the linear SVM $g$, and with a slack variable $\xi_n$ per point $x_n$, where $C$ is a penalty parameter for the slackness. The difficulty is that the constraints are heavily nonconvex because of the nonlinear mapping $F$. However, the problem can be significantly simplified if we use the recently introduced method of auxiliary coordinates (MAC) (Carreira-Perpiñán and Wang, 2012, 2014) for nested systems. The idea is to introduce auxiliary variables that break nested functional dependences $g(F(\cdot))$ into simpler shallow mappings $g(z)$ and $F(\cdot)$. In our case, we introduce one auxiliary vector $z_n \in \mathbb{R}^L$ per input pattern and define the following constrained problem, which can be proven (Carreira-Perpiñán and Wang, 2012) to be equivalent to (1):
(2)  $\min_{F, g, \xi, Z} \; \lambda R(F) + \frac{1}{2}\|w\|^2 + C \sum_{n=1}^N \xi_n \quad \text{s.t.} \quad y_n (w^T z_n + b) \ge 1 - \xi_n, \ \xi_n \ge 0, \ z_n = F(x_n), \ n = 1, \dots, N.$
This seems like a notational trick, but now we solve this with a quadratic-penalty method (Nocedal and Wright, 2006). We optimize the following problem for fixed penalty parameter $\mu > 0$ and drive $\mu \to \infty$:
(3)  $\min_{F, g, \xi, Z} \; \lambda R(F) + \frac{1}{2}\|w\|^2 + C \sum_{n=1}^N \xi_n + \frac{\mu}{2} \sum_{n=1}^N \|z_n - F(x_n)\|^2 \quad \text{s.t.} \quad y_n (w^T z_n + b) \ge 1 - \xi_n, \ \xi_n \ge 0, \ n = 1, \dots, N.$
This defines a continuous path $(F^*(\mu), g^*(\mu), Z^*(\mu))$ which, under mild assumptions, converges to a minimum of the constrained problem (2), and thus to a minimum of the original problem (1) (Carreira-Perpiñán and Wang, 2012). Although problem (3) has more parameters, all the terms are simple and partially separable. The auxiliary vector $z_n$ can be interpreted as a target in latent space for the mapping $F$, but these targets are themselves coordinated with the classifier. Using alternating optimization of (3) over $(F, g, Z)$ results in very simple, convex steps. The $(F, g)$ step is a usual RBF regression and linear SVM classification done independently from each other, reusing existing, well-developed algorithms. The $Z$ step has a closed-form solution for each $z_n$ separately. We describe the steps next. The complete alternating optimization procedure is given in Algorithm 1.
2.1 The $g$ step
For fixed $(F, Z)$, optimizing over $g$ is just training an ordinary linear SVM with low-dimensional inputs $Z$. Much work exists on fast, scalable SVM training. We use LIBSVM (Chang and Lin, 2011). With $K$ classes, each SVM can be trained independently of the others.
2.2 The $F$ step
For fixed $(g, Z)$, optimizing over $F$ is just a regularized regression with inputs $X$ and low-dimensional outputs $Z$. So far the approach is generic over $F$, which could be a deep net or a Gaussian process, for example, and we would use its corresponding training method within this step. However, we now focus on a special case which results in a particularly efficient step over $F$, and which includes linear DR as a particular case. We use radial basis function (RBF) networks, which are universal function approximators, $F(x) = W \Phi(x)$, with $M$ Gaussian RBFs $\phi_m(x) = \exp(-\|x - c_m\|^2 / (2\sigma^2))$, and $R(F) = \|W\|^2$ is a quadratic regularizer on the weights. As commonly done in practice (Bishop, 2006), we determine the centers $c_m$ by $k$-means on $X$, and then the weights $W$ have a unique solution given by a linear system. The total cost is in memory and in training time, mainly driven by setting up the linear system for $W$ (involving the Gram matrix ); solving it exactly is a negligible since in practice, which can be reduced to by using warm starts or caching its Cholesky factor. Note we only need to run $k$-means and factorize the linear system once and for all, since its input $X$ does not change. Thus, the $F$ step simply involves a linear system for $W$.
2.3 The $Z$ step
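For fixed $(F, g)$, eq. (3) decouples on each $z_n$, so instead of one large problem on $NL$ parameters, we have $N$ independent small problems each on $L$ parameters, of the form (omitting subindex $n$):
(4)  $\min_{z, \xi} \; \|z - F(x)\|^2 + \frac{2C}{\mu} \xi \quad \text{s.t.} \quad y (w^T z + b) \ge 1 - \xi, \ \xi \ge 0$
where we have also included the slack variable, since we find this speeds up the alternating optimization. This is a convex quadratic program, whose closed-form solution is $z = F(x) + \gamma y w$, where $\gamma$ is a scalar which takes one of three possible values, and costs $O(L)$.
This can be seen using the KKT theorem. The Lagrangian of (4) is
(5)  $\mathcal{L}(z, \xi, \alpha, \beta) = \|z - F(x)\|^2 + \frac{2C}{\mu} \xi - \alpha \, (y (w^T z + b) - 1 + \xi) - \beta \xi$
where $\alpha$ and $\beta$ are Lagrange multipliers for the two inequality constraints. Its KKT system is
$2 (z - F(x)) - \alpha y w = 0, \quad \frac{2C}{\mu} - \alpha - \beta = 0,$
$\alpha \, (y (w^T z + b) - 1 + \xi) = 0, \quad \beta \xi = 0,$
$y (w^T z + b) \ge 1 - \xi, \quad \xi \ge 0, \quad \alpha \ge 0, \quad \beta \ge 0.$
From the first equation, the optimal $z$ lies on a line which passes through $F(x)$ and is parallel to the normal direction $w$ of the hyperplane. If $\alpha = 0$, we have $z = F(x)$, $\beta = 2C/\mu > 0$, and the KKT system reduces to $\xi = 0$, $y (w^T F(x) + b) \ge 1$. We have three cases.
 Case 1: $\alpha = 0$.
This means, if $y (w^T F(x) + b) \ge 1$ is satisfied (i.e., $F(x)$ is classified correctly with a margin of at least $1$), we should simply set $z = F(x)$ and get an objective function value of $0$ (the ideal case). This situation is demonstrated in fig. 1 (left plot).
Otherwise, we must have $\alpha > 0$. The KKT system reduces to (using the relation $y^2 = 1$):
$2 (z - F(x)) = \alpha y w, \quad \alpha + \beta = \frac{2C}{\mu}, \quad w^T z + b = y (1 - \xi), \quad \beta \xi = 0, \quad \beta \ge 0, \quad \xi \ge 0$
and from the first two equations we get
$z = F(x) + \frac{\alpha}{2} y w, \quad \beta = \frac{2C}{\mu} - \alpha.$
Substituting the above KKT conditions into the Lagrangian $\mathcal{L}$, we can express the dual of (4) using only $\alpha$ as
$\max_{\alpha} \; -\frac{\|w\|^2}{4} \alpha^2 + (1 - y (w^T F(x) + b)) \, \alpha \quad \text{s.t.} \quad 0 \le \alpha \le \frac{2C}{\mu}.$
The objective function is a concave parabola with maximum achieved at $\alpha^* = 2 (1 - y (w^T F(x) + b)) / \|w\|^2$.
 Case 2: $\alpha^* > 2C/\mu$.
Thus if $\alpha^* > 2C/\mu$, the optimum of (4) is achieved by $\alpha = 2C/\mu$, $z = F(x) + \frac{C}{\mu} y w$, and $\xi = 1 - y (w^T F(x) + b) - \frac{C}{\mu} \|w\|^2$.
 Case 3: $0 < \alpha^* \le 2C/\mu$.
Otherwise the optimum of (4) is achieved by $\alpha = \alpha^*$, $z = F(x) + \frac{1 - y (w^T F(x) + b)}{\|w\|^2} y w$, and $\xi = 0$.
By now, we have found the solutions to (4) completely with a cost of $O(L)$. We give illustrations of the three possible cases in figure 1. This procedure is summarized in Algorithm 2.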
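The three cases can be collected into a single update. The following is a minimal sketch, not the authors' Algorithm 2: it assumes the per-point problem is to minimize (mu/2)·||z − f||² + C·ξ subject to y(w·z + b) ≥ 1 − ξ and ξ ≥ 0, where f stands for the current projection of the point. The function name `z_step` and all argument names are hypothetical. Under that assumption the solution moves f along the direction y·w by a step gamma that is 0 (margin already satisfied), C/mu (slack stays positive), or just enough to reach the margin.

```python
def z_step(f, w, b, y, C, mu):
    """Closed-form minimizer of (mu/2)*||z - f||^2 + C*xi
    subject to y*(w.z + b) >= 1 - xi and xi >= 0 (assumed problem form).

    f: current projection of the point (list of floats)
    w, b: weights and bias of the linear SVM; y: label in {-1, +1}
    Returns (z, xi, gamma) with z = f + gamma * y * w.
    """
    margin = y * (sum(wi * fi for wi, fi in zip(w, f)) + b)
    w_norm2 = sum(wi * wi for wi in w)
    if margin >= 1.0 or w_norm2 == 0.0:
        # Case 1 (or a degenerate zero weight vector): moving z cannot help.
        gamma = 0.0
    else:
        # Case 2 uses C/mu; case 3 stops exactly on the margin.
        gamma = min(C / mu, (1.0 - margin) / w_norm2)
    z = [fi + gamma * y * wi for fi, wi in zip(f, w)]
    xi = max(0.0, 1.0 - margin - gamma * w_norm2)
    return z, xi, gamma


if __name__ == "__main__":
    print(z_step([0.0, 0.0], [1.0, 0.0], 0.0, 1, C=1.0, mu=4.0))
```

Each call costs a few inner products of length-L vectors, which matches the linear cost per point stated above.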
2.4 Formulation for the multiclass problem
Let $K$ be the number of classes. There are several ways to construct a classifier for $K > 2$ classes. We use the one-vs-all scheme, which has been shown to perform as well as any other variant of multiclass SVM (Rifkin and Klautau, 2004), and which gives rise to a simpler $Z$ step. In the one-vs-all scheme we have $K$ binary SVMs, each of which is trained to classify whether a point belongs to some class or not. The decision function on a test point $x$ is $\arg\max_k \, (w^k)^T F(x) + b^k$, i.e., the final label is determined by the SVM with the largest decision value. The objective in (1) is replaced by the sum of the $K$ SVMs’ objective functions. In the $Z$ step, we solve for each point a quadratic program of the following form (omitting subindex $n$):
$\min_{z, \{\xi^k\}} \; \|z - F(x)\|^2 + \sum_{k=1}^K \frac{2C^k}{\mu} \xi^k \quad \text{s.t.} \quad y^k ((w^k)^T z + b^k) \ge 1 - \xi^k, \ \xi^k \ge 0, \ k = 1, \dots, K$
where $C^k$ is the $k$th SVM’s penalty parameter for the hinge loss, $\xi^k$ is the slack variable of the $k$th SVM (associated with the point in consideration), and $y^k \in \{-1, +1\}$ indicates whether the point belongs to class $k$. This quadratic program contains $L + K$ variables and $2K$ inequality constraints. Since the size of this problem is typically not large, we use an active set algorithm for solving it (as implemented in Matlab’s Optimization Toolbox). For the binary case, there exists only one SVM and the problem has a closed-form solution, as shown before.
It is also possible to use the one-vs-one scheme. With $K$ classes, we have $K(K-1)/2$ binary SVMs, each trained on one pair of classes. The decision function on a test point is given by majority vote (i.e., the class that wins the most pairwise comparisons). This has a lower training time but a higher test time. Also, the one-vs-one scheme involves a little more bookkeeping in the $Z$ step than the one-vs-all scheme. To see this, take for example $K = 10$ classes. The one-vs-all scheme uses 10 SVMs, and, in the $Z$ step, each data point is involved in all 10 SVMs, either as positive or negative example. The one-vs-one scheme uses 45 SVMs, and each point is involved in only 9 of them.
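The bookkeeping difference between the two schemes, and the one-vs-all decision rule, can be illustrated with a short sketch. The function names and the list-of-lists layout for the SVM weights are assumptions made for illustration only; the decision function takes an already projected point.

```python
def num_binary_svms(K, scheme):
    """Number of binary SVMs needed for K classes."""
    if scheme == "one-vs-all":
        return K
    if scheme == "one-vs-one":
        return K * (K - 1) // 2
    raise ValueError("unknown scheme: " + scheme)


def svms_per_point(K, scheme):
    """Number of binary SVMs in which a single training point takes part."""
    if scheme == "one-vs-all":
        return K          # positive for its own class, negative for the rest
    if scheme == "one-vs-one":
        return K - 1      # only the pairs that contain its own class
    raise ValueError("unknown scheme: " + scheme)


def one_vs_all_predict(z, W, b):
    """Index of the SVM with the largest decision value w_k . z + b_k.

    z: projected point (length L); W: K weight vectors; b: K biases.
    """
    scores = [sum(wi * zi for wi, zi in zip(wk, z)) + bk
              for wk, bk in zip(W, b)]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    for scheme in ("one-vs-all", "one-vs-one"):
        print(scheme, num_binary_svms(10, scheme), svms_per_point(10, scheme))
```

For 10 classes this reproduces the counts quoted above: 10 SVMs with each point in all 10, versus 45 SVMs with each point in 9.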
2.5 Summary and practicalities
Jointly optimizing the classification error over $F$ and $g$ becomes iterating simple convex subproblems: RBF regression, linear SVM and a closed-form update for $Z$. Remarkably, we do not require any involved gradient computation, but simply reuse existing techniques for training $F$ and $g$ efficiently. Although the problem (1) is nonconvex and has multiple local optima, the algorithm is deterministic given its initialization. We run the algorithm from an initial $Z$. We observe that, given reasonable hyperparameters, even random values work well (section 3.1 shows how to construct a near-optimal initial $Z$). In the first 1–2 iterations, the $z_n$ quickly reorganize in latent space, and they only get refined afterwards so as to enlarge the margin and reduce the bias. We use an initial value of for the quadratic penalty parameter $\mu$ and increase it times when the alternating scheme converges for fixed $\mu$. We use early stopping, by exiting the iteration when the error in a validation set does not change or goes up. This helps to improve generalization and is faster: we observe a few iterations suffice, because each step updates very large, decoupled blocks of variables, and we need not drive $\mu \to \infty$ or achieve convergence. The hyperparameters are the usual ones: the number of basis functions, kernel width and regularization parameter for the RBF mapping $F$, and the penalty parameter for the SVM $g$, and can be determined by cross-validation.

Our algorithm affords massive parallelization and is suitable for large-scale learning. The $F$ step (a regression problem) decouples over latent dimensions, the $g$ step (one-versus-all SVM) decouples over classes, and the $Z$ step decouples over training samples. Indeed, our experiments show linear speedups with multiple processors.
The form of our final classifier is where , and the sign of gives the label. The number of parameters (weights and centers) is and the runtime for a test point is .
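As an illustration of the test-time computation, here is a minimal sketch of a binary classifier made of a Gaussian RBF network followed by a linear SVM. The parameter names (`centers`, `sigma`, `W`, `w`, `b`), the width convention exp(−||x − c||²/(2σ²)) and the absence of a bias term in the mapping are assumptions for illustration; they are not taken from the authors' code.

```python
import math


def rbf_features(x, centers, sigma):
    """Gaussian basis function values for one input x (one value per center)."""
    return [math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c))
                     / (2.0 * sigma ** 2))
            for c in centers]


def project(x, centers, sigma, W):
    """Low-dimensional projection: each row of W gives one latent coordinate."""
    phi = rbf_features(x, centers, sigma)
    return [sum(wm * pm for wm, pm in zip(row, phi)) for row in W]


def classify(x, centers, sigma, W, w, b):
    """Linear SVM on the projection; the sign of the score is the label."""
    z = project(x, centers, sigma, W)
    score = sum(wi * zi for wi, zi in zip(w, z)) + b
    return (1 if score >= 0 else -1), score


if __name__ == "__main__":
    centers = [[0.0, 0.0], [2.0, 0.0]]   # two basis functions in a 2D input space
    W = [[1.0, -1.0]]                    # one latent dimension
    print(classify([0.0, 0.0], centers, 1.0, W, [1.0], 0.0))
    print(classify([2.0, 0.0], centers, 1.0, W, [1.0], 0.0))
```

The work per test point is dominated by evaluating the basis functions and one small matrix-vector product, so it grows with the number of basis functions rather than with the training set size.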
3 Experiments
We explore three questions across a range of datasets: the role of dimension reduction on the classification error and the latent space representation; the performance of our classifier, compared to the stateoftheart; and the training speed of our algorithm.
3.1 The role of dimension reduction
The ideal nonlinear dimensionality reduction + linear classifier
Given that the classifier $g$ is linear, consider the ideal case where $F$ is infinitely flexible and can represent any desirable mapping from $D$ to $L$ dimensions. Then, a perfect wrapper classifier can be achieved by having $F$ map all the inputs having the same label to a single point (the class centroid), and then locating the $K$ class centroids in $\mathbb{R}^L$ such that they are linearly separable and have maximum margin. How this can be achieved depends on the dimension $L$. With $L = 1$, only $K = 2$ classes may be linearly separable. With $L = 2$, all $K$ classes are linearly separable if placing the centroids on the vertices of a regular polygon; however, this leads to a small margin as $K$ grows. In $L > 2$ dimensions, this generalizes to placing the centroids maximally apart on a hypersphere. However, when $L \ge K - 1$, the margin cannot be further improved, because the $K$ points span a space of $K - 1$ dimensions, specifically a regular simplex, which provides linear separability and maximum margin. Is this ideal actually realized?
We investigate this question experimentally. The dataset in fig. 2 contains two spirals, each of which has samples and defines a class. We change the flexibility of $F$ to see what latent representations are obtained. We try a linear DR and an RBF DR with varying number of basis functions ( centers uniformly sampled from the training set) while keeping the other hyperparameters fixed.
We observe the following from the projections shown in fig. 2. (1) Since the dataset is not linearly separable, the two classes overlap severely in the latent space for linear $F$ (first column). (2) We see from the latent representations (shown in the second row) that, the more flexible $F$ is, the more classes collapse and training samples from different classes are pushed far apart (especially if using all training samples as RBF centers). Thus, they are easily separated by $g$ with a big margin (so the nearest neighbor classifier would do very well here). Thus, the overall classifier is capable of solving linearly non-separable problems given sufficient BFs. (3) To achieve a perfect classification, we need only a few basis functions ( more than suffice here). (4) The projections approximately form a line, implying that $L = 1$ is enough to separate two classes.
Relation between the latent dimensionality and the number of classes
We now study the role of the latent dimension $L$, the number of classes $K$ and the geometric configuration of the latent projections. We vary $K$ in the spirals dataset from to , with samples in each spiral, and run our algorithm with $F$ being nonparametric RBFs and , fixing other hyperparameters at reasonable values. Fig. 3 shows the classification results and projections. We find we do not always obtain a perfect classification using only dimensions for the one-vs-all SVMs. Instead, we find a common behavior for different $K$: the classification performance improves drastically as $L$ increases in the beginning, and then stabilizes after some critical $L$, by which time the training samples are perfectly separated. This is because once the classes are separable in $\mathbb{R}^L$, they can also be (more easily) separable in higher dimensions with more degrees of freedom to arrange them. We observe in this experiment that, typically with dimensions, the classes all form a point-like cluster and approximately lie on vertices of a regular simplex, with zero classification error (decision boundaries shown in the first column). This gives a recipe to choose $L$: starting from , increase $L$ until the classification error does not improve. It also gives a recipe to initialize $Z$, namely to the ideal configuration of centroids located in the corners of a simplex, so all the $z_n$ from the same class are set to the centroid of that class. Comparing this experimentally with random initial $Z$, we observe that the simplex-based $Z$ leads to much faster convergence (usually 1 iteration), although interestingly the quality of the minimum found is similar to that of the random initial $Z$.
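The simplex-based initialization can be sketched as follows. This is an illustrative construction, not the authors' code: it assumes K ≥ 2 classes labeled 0 to K−1, builds K unit vectors in K−1 dimensions whose pairwise inner products all equal −1/(K−1) (the vertices of a regular simplex centered at the origin), and sets every latent vector to the vertex of its class. The names `simplex_vertices` and `init_latent` are hypothetical.

```python
import math


def simplex_vertices(K):
    """K unit vectors in R^(K-1) forming a regular simplex centered at 0."""
    if K < 2:
        raise ValueError("need at least two classes")
    n = K - 1
    V = [[0.0] * n for _ in range(K)]
    for i in range(n):
        # Fix coordinate i of vertex i so that the vertex has unit norm.
        used = sum(V[i][j] ** 2 for j in range(i))
        V[i][i] = math.sqrt(1.0 - used)
        # Fix coordinate i of the later vertices so that their inner
        # product with vertex i equals -1/n.
        for k in range(i + 1, K):
            dot = sum(V[i][j] * V[k][j] for j in range(i))
            V[k][i] = (-1.0 / n - dot) / V[i][i]
    return V


def init_latent(labels, K):
    """Initial latent vectors: every point sits on its class vertex."""
    V = simplex_vertices(K)
    return [list(V[y]) for y in labels]


if __name__ == "__main__":
    for v in simplex_vertices(3):
        print([round(c, 4) for c in v])
```

For three classes this yields the corners of an equilateral triangle in the plane. If the latent dimension is larger than K−1, one simple option (again an assumption) is to pad the vertices with zeros.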
[Figure panels: “linear”, “data”.]
In summary, these results are in approximate agreement with our ideal prediction, although in practice, the extent to which $F$ collapses classes depends on the number of basis functions. The more BFs, the more flexible $F$ is and the closer the latent projections are to centroids in a maximally linearly separable arrangement as described above, and this behavior arises from the joint optimization of DR and classifier. Note that finding an $F$ that maps a set of points to the same point is trivial: it is constant. But what we seek is an $F$ that maps each class’ points to the class centroid. This is a piecewise constant mapping which is much harder to learn.




Figure 4. Left: mean test error rate over 10 splits (standard deviation in parentheses). Right: projections by different algorithms.
3.2 Comparison with other classifiers
We compare with directly applying a nearest neighbor classifier (NN) and linear (LSVM) or Gaussian kernel SVMs (GSVM) on the original inputs, with unsupervised DR methods PCA/Gaussian kernel PCA (KPCA) followed by nearest neighbor, and with filter approaches such as LDA/KLDA (Gaussian kernel) followed by nearest neighbor. Hyperparameters for these algorithms (kernel width for kernel SVM, KPCA, and KLDA, penalty parameter for SVMs) are chosen by grid search on a separate validation set. In our algorithm, we initialize $Z$ randomly.
Document binary classification
We pick two classes from the newsgroup dataset (comp.sys.ibm.pc.hardware and comp.sys.mac.hardware), remove words appearing in or fewer documents, reduce the dimension to , and then extract the TF-IDF features. We further reduce the input dimension to with PCA (keeping of the variance). For evaluation purposes, we create different / splits of the training items into training and validation sets. For each split, we let all algorithms pick the optimal hyperparameters based on validation error, and evaluate the optimal models on the test set. Due to the high dimensionality and scarcity of samples, we use linear $F$ for this problem (we did try RBFs for $F$ and obtained similar results). We also run Large Margin Nearest Neighbor (LMNN; Weinberger and Saul, 2009), a metric learning algorithm, with the recommended number of target neighbors . We report the mean and standard deviation of test error rates over 10 splits for all methods in fig. 4(left).

Our classification performance is quite robust to the choice of $L$, although in general more dimensions may bring a little improvement in the error rate at higher computational cost. We thus fix the latent dimensions to be for other methods. The results show that using class information in dimensionality reduction is superior to not using it (e.g. PCA), and we outperform others consistently with different $L$. Fig. 4(right) shows the 2D projections of several algorithms, where supervised DR algorithms manage to separate the classes and PCA does not.
MNIST odd/even classification
We perform a binary classification task of discriminating odd digits (1, 3, 5, 7, 9) from the even digits (0, 2, 4, 6, 8) on the MNIST benchmark. We vary the training set size to be , , , , , , , and , including an equal number of images randomly sampled for each digit. We use in all cases a balanced validation set of images from which we pick optimal hyperparameters for all methods. The MNIST test set ( samples) is used for evaluating the classification performance. Since this dataset is likely not linearly separable (as can be inferred from the error rate of the linear SVM), we fix the dimension of our latent space to be , and use nonparametric RBFs for $F$. We compare our algorithm with nonlinear DR algorithms KPCA () and KLDA (). We also explore a two-step approach of using linear DR (PCA/LDA) followed by a Gaussian SVM. We cross-validate $L$ for PCA, while LDA uses .

Figure 5 (top) shows the test errors of all methods over different training set sizes. On the original inputs, our algorithm and KLDA (given very fine grid search for optimal kernel width) perform the best. Their clear advantage over KPCA demonstrates again that class information should be incorporated in DR. We expect our model to have a regularization effect and improve generalization more in smaller training sets. Indeed, we find that we improve over the nearest neighbor classifier (known to achieve near-optimal error for large training sets; Duda et al., 2001) most for  training samples. Fig. 5(bottom) shows the 2D projection of the training set of samples obtained by KPCA and our algorithm. The classes are perfectly separated by our algorithm but not by KPCA, which uses no label information.
Due to the limited power of its DR mapping, LDA+Gaussian SVM performs poorly. PCA+Gaussian SVM performs well (slightly better than our algorithm on original inputs), though only at a much larger (around 40). We hypothesize this is because PCA is able to remove some noise that is not learned by us from the inputs. We then trained our algorithm on the PCA projection, further reducing the dimension to , and obtained consistently better accuracy (shown in fig. 5 as PCA+Ours).
[Figure 5 axis labels: training set size (horizontal), error rate (%) (vertical).]



MNIST 10 classes
We now consider the problem of classifying the 10 digit classes of MNIST. We randomly sample images for training and for validation. The original test set is used for evaluating different algorithms. We were not able to run KPCA and KLDA because the naive implementation of these algorithms requires a huge memory space to store the kernel matrix of and solve a dense eigenvalue problem. Our algorithm uses BFs with centers chosen by $k$-means.

We searched hyperparameters as follows in order to avoid unnecessary regions of the hyperparameter space. We first do a careful grid search for the kernel width and penalty parameter for the Gaussian SVM. The optimal kernel width () is also used as the RBF kernel width of $F$. Then centers are chosen by $k$-means as RBF basis centers. The regularization parameter of $F$ is fixed to based on experiments on a much smaller training set (there exists a wide range of for which our method works well). Thus we only search for the penalty parameter of $g$, which is common to all SVMs, chosen from .

We again explored the approach of PCA/LDA+Gaussian SVM as in the other MNIST experiment and obtained a similar result: LDA ()+Gaussian SVM performs poorly; PCA+Gaussian SVM performs well at a relatively larger $L$. We trained our model on the PCA projection, further reduced the dimension to , and obtained similar performance with far fewer basis functions.
As happened in the case of binary classification, starting from a random initialization, the projections reorganize quickly in the latent space and are well classified after a few iterations.
Fig. 6(top left) shows the test error rates and the total number of support vectors (from the Gaussian SVMs)/basis functions used in each algorithm. Our accuracy is similar to that of the kernel SVM but with times fewer basis functions, obtaining a large speedup in testing time.
Fig. 6(bottom left) shows the performance of our algorithm using different values of the latent dimension . Our error rates decrease quickly as increases at first. After , it already does pretty well, and the performance does not change much for . We conjectured previously that the optimal latent configuration would be to arrange different classes on the vertices of a regular simplex. The projections achieved by our algorithm in fig. 6(bottom right) agree with that. Since we use PCA to visualize in 2D the latent representations lying in dimensions, some classes appear to overlap, but they are all completely separated, as can be seen by using other 2D views.





3.3 Training runtime: comparison with alternating optimization without auxiliary coordinates
We compared our three-step alternating optimization scheme with a two-step alternating optimization scheme over only $F$ and $g$, which optimizes the original problem (1) directly, where $F$ is an RBF network and $g$ a linear SVM. We are much faster in terms of both progress in objective function and actual runtime. The following experiment shows this in the MNIST odd/even classification problem for the case of training samples. To minimize the nested objective function in eq. (1), the two-step algorithm alternates a $g$ step, which solves a linear SVM on the projections $F(x_n)$, and an $F$ step that optimizes over the weights of $F$ by solving a quadratic program. Since the latent dimensionality is set to , there are weight parameters in the quadratic program, which we solve by calling the interior point solver of the CVX package (Grant and Boyd, 2012). Fig. 6(middle) shows the nested objective function value versus iteration number during training. The time spent per iteration for our algorithm is about of that of alternating minimization. Therefore, the combined runtime of our $F$ and $Z$ steps is significantly lower than that of the $F$ step of alternating minimization, and besides our $Z$ step has a closed-form solution. Further, a few iterations of our algorithm suffice to find a near-optimal solution (with a small bias that continues to be eliminated as the penalty parameter is increased), while the progress of alternating minimization is hopelessly slow.
If instead of an RBF network, which is linear in the parameters of $F$, we used a model which is nonlinear in the parameters (such as a neural net), the gains of our algorithm would be even larger. The reason is that, in our algorithm, the nonlinearity over $F$ is confined to the $F$ step in the form of a standard least-squares regression problem (section 2.2), thanks to the use of the auxiliary coordinates $Z$. Regression problems are well studied and can be solved with efficient algorithms for many classes of functions $F$. In the direct optimization over $(F, g)$, this nonlinearity is embedded in the constraints of (1), which is harder to deal with.
3.4 Parallel processing
We also ran a simple parallel version of our algorithm in the MNIST 10-classes problem. This solves in parallel for the SVMs in the $g$ step, and for the coordinates of all training points in the $Z$ step (within each respective step, all these problems are independent). Given that our code is in Matlab, we used the Matlab Parallel Processing Toolbox. The programming effort is insignificant: all we do is replace the “for” loop over SVMs (in the $g$ step) or over data points (in the $Z$ step) with a “parfor” loop. Matlab then sends each iteration of the loop to a different processor. We ran this in a shared-memory multiprocessor machine (an Aberdeen Stirling 148 computer having 4 physical CPUs, Intel Xeon CPU L7555 @ 1.87GHz, each with 8 individual processing cores, thus a total of 32 actual processors, and a total RAM size of 64 GB), using up to 12 processors (a limit imposed by our Matlab license). We obtain an impressive speedup of up to times, as shown in fig. 6(bottom right). Even larger speedups may be possible if using other parallel computation models, since the Matlab Parallel Processing Toolbox is quite inefficient.
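The structure of such a parallel loop over data points can be sketched in Python with the standard library. This is only an analogue of the “parfor” idea, not the authors' Matlab code. For brevity it uses the binary closed-form update under the assumed problem form (mu/2)·||z − f||² + C·ξ, and the names `z_update` and `z_step_all` are hypothetical. Note that threads in CPython do not speed up pure-Python arithmetic; a process pool or a vectorized implementation would be needed for real gains, so the sketch only shows that the per-point problems are independent.

```python
from concurrent.futures import ThreadPoolExecutor


def z_update(task):
    """Closed-form update of one latent vector (binary case, assumed form)."""
    f, y, w, b, C, mu = task
    margin = y * (sum(wi * fi for wi, fi in zip(w, f)) + b)
    w_norm2 = sum(wi * wi for wi in w)
    if margin >= 1.0 or w_norm2 == 0.0:
        gamma = 0.0
    else:
        gamma = min(C / mu, (1.0 - margin) / w_norm2)
    return [fi + gamma * y * wi for fi, wi in zip(f, w)]


def z_step_all(projections, labels, w, b, C, mu, workers=4):
    """Update all latent vectors; each point is an independent task."""
    tasks = [(f, y, w, b, C, mu) for f, y in zip(projections, labels)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns the results in the same order as the tasks.
        return list(pool.map(z_update, tasks))


if __name__ == "__main__":
    P = [[0.0, 0.0], [2.0, 0.0], [0.0, 0.0]]
    print(z_step_all(P, [1, 1, -1], [1.0, 0.0], 0.0, C=1.0, mu=4.0))
```

Because no task reads or writes another task's data, the serial loop and the parallel map give identical results.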
4 Discussion
Filters
Our algorithm illuminates the behavior of filter approaches. Such approaches optimize a proxy objective function constructed from the inputs and labels over $F$ and then learn the classifier $g$ by optimizing the classification error constructed from the projections and labels over $g$. This is like our $F$ and $g$ steps, but, firstly, it uses the “wrong” objective for $F$, and, secondly, without the coordination through the $Z$ variables, the process cannot be iterated to converge to a minimum of the joint problem. We can then view our algorithm as a corrected, iterated filter approach. Since in practice it converges in a few iterations, its cost is just a little larger, but one need not introduce a proxy objective and yet obtains a true (local) minimum. Thus, we learn the following two lessons: (1) if we want to use a filter $F$, the ideal filter would consist of mapping the inputs to class centroids located on the corners of a simplex in latent space; (2) we do not really need a filter approach, because we can train the optimal, wrapper approach nearly as efficiently and simply.
The role of dimensionality reduction in linear classification
Being able to find true optima of the classification error allowed us to study the role of nonlinear DR as a preprocessing step for linear classification. With an ideally flexible DR mapping $F$, the best possible preprocessing is precisely to remove all variation that is unrelated to the class label, including variation within a manifold, which is an extreme form of denoising. The input domains are “denoised” so they collapse to nearly zero-dimensional regions. In practice, $F$ belongs to a certain function class (given by the choice of model and number of parameters), and the ideal where classes collapse is only approached, but is clearly there. Using a latent space of dimensions is theoretically sufficient but, with a limited $F$, using up to helps to improve the separation. Note that collapsing classes requires a genuinely nonlinear DR. The problem formulation of eq. (1) does not explicitly seek to collapse classes, but this behavior emerges anyway from the assumption of low-dimensional representation, if trained jointly with the classifier. Thus, rather than making the classifier work hard to approximate a possibly complex decision boundary, we help it by moving the data around in latent space so the boundary is simpler.
This clashes with the widely held view that a good supervised DR method should produce representations (often visualized in 2D) where the manifold structure of each class is displayed. In fact, with an optimal DR the entire manifold will collapse. This is different from unsupervised DR, where we do want to extract informative features that tell us something about the data variability; and from supervised regression, where only some of the input dimensions should be collapsed (those which do not alter the output).
SVMs and kernel learning
Our method and kernel SVMs can be seen as constructing a classifier as an expansion in basis functions , with BFs in our case and support vectors for the SVM. The SVMs do this nonparametrically, at the cost of constructing an kernel matrix and solving the corresponding problem, which is expensive—although much research work has sought to approximate this (Schölkopf and Smola, 2001) and reduce the number of SVs (Bi et al., 2003). The basis functions are given by the kernel, which is selected by the user, and the space they implicitly map to is typically infinitedimensional. The number of SVs is not set by the user but comes out automatically and can be quite large in practice. Our method is a parametric approach, where the user can set the number of BFs , and the mapping (in this paper, an RBF mapping) maps to a lowdimensional space. The result is a competitive nonlinear classifier, with scalable training and efficient at test time. Having as a user parameter also allows a simple, direct way for the user to trade off test runtime for accuracy, which is crucial in realtime problems, such as embedded systems, where a typical SVM classifier is too computationally demanding in both memory required to store the SVs and in runtime. As shown in our experiments, we can obtain a classification error comparable to the kernel SVM with , thus much faster.
It is possible to train a linear SVM by learning its parameters directly given fixed basis functions, but this achieves a worse classification error, and does not do DR, as we seek. Our low-dimensional classifier can be seen as imposing a special regularization structure on the classifier’s weights, where the parameters of the DR mapping and those of the classifier are regularized separately. This effect is more pronounced in the multiclass case, since each one-vs-all SVM interacts with the same DR mapping. If a different functional form is used for the DR mapping (e.g. deep nets), this resemblance with kernel SVMs disappears.
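The expansion view discussed above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the names `rbf_features`, `predict`, `centers`, `sigma`, `A`, `b`, `w` and `c` are hypothetical, the weights are random placeholders rather than a trained model, and a Gaussian RBF with a single shared width is assumed. The sketch shows that a linear classifier applied to a linear map of RBF features collapses into a single expansion with one weight per basis function, so the test-time cost per input is governed by the number of BFs.

```python
import numpy as np

def rbf_features(X, centers, sigma):
    """Gaussian basis functions: one column per center (M columns)."""
    # Squared Euclidean distance between every input and every center.
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def predict(X, centers, sigma, A, b, w, c):
    """Binary label sign(w^T z + c), with latent point z = A phi(x) + b."""
    Phi = rbf_features(X, centers, sigma)   # (N, M) basis function values
    Z = Phi @ A.T + b                       # (N, L) latent coordinates
    return np.sign(Z @ w + c), Z

rng = np.random.default_rng(0)
D, M, L, N = 10, 5, 2, 8          # input dim, number of BFs, latent dim, test points
sigma = 1.5                       # assumed shared RBF width
centers = rng.normal(size=(M, D))
A = rng.normal(size=(L, M))       # placeholder DR weights (not trained)
b = rng.normal(size=L)
w = rng.normal(size=L)            # placeholder linear classifier weights
c = 0.1
X = rng.normal(size=(N, D))

labels, Z = predict(X, centers, sigma, A, b, w, c)

# Collapsing the two linear stages gives one weight per basis function:
# the overall weight vector is the product A^T w, a factored form whose
# two factors can be regularized separately.
v = A.T @ w                       # (M,) collapsed weights
c_total = b @ w + c
scores_collapsed = rbf_features(X, centers, sigma) @ v + c_total
```

In this form the per-input cost at test time is M basis function evaluations plus a dot product, which is the quantity the user controls when choosing the number of BFs.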
Our model can also be seen as learning a “low-dimensional kernel”, since we pass a pair of latent vectors to the linear SVM kernel, rather than applying a kernel directly to the high-dimensional inputs. If the DR mapping is linear (no need for bias in latent space) then this becomes a form of metric learning with SVMs, using a low-rank metric.
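To make the “low-dimensional kernel” reading concrete, it can be written out as follows. The notation is an assumption introduced here for illustration: $\mathbf{F}$ denotes the DR mapping, $\mathbf{x}, \mathbf{x}'$ two inputs, and, in the linear case, $\mathbf{A}$ an $L \times D$ matrix with $L$ the latent dimension and $D$ the input dimension.

```latex
% Linear SVM kernel applied to latent vectors (assumed notation):
k(\mathbf{x}, \mathbf{x}') = \mathbf{F}(\mathbf{x})^{\top} \mathbf{F}(\mathbf{x}')

% Linear DR mapping F(x) = A x, with A of size L x D:
k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^{\top} \mathbf{A}^{\top} \mathbf{A} \, \mathbf{x}',
\qquad \operatorname{rank}\bigl(\mathbf{A}^{\top} \mathbf{A}\bigr) \le L
```

Under these assumptions, $\mathbf{A}^{\top}\mathbf{A}$ plays the role of a positive semidefinite metric of rank at most $L$, which is the sense in which the linear case resembles low-rank metric learning.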
5 Related work
Filter approaches typically first learn a DR mapping using the inputs and label information, and then fit a classifier to the latent projections and labels. They are quite popular due to the ease of optimization, but they rely on the choice of objective function for the filter. Linear discriminant analysis (LDA; Belhumeur et al., 1997) and its kernel version KLDA (Mika et al., 1999) look for a transformation of the inputs such that, in the latent space, the within-class scatter is minimized while the between-class scatter is maximized. The solution can be obtained by solving an eigenvalue problem. These two algorithms can only produce up to K − 1 latent dimensions for a K-class problem, due to the singularity of the between-class scatter matrix. Among other variations, Sugiyama (2007) modifies LDA to work for manifold data, but the projection mapping is linear, and Urtasun and Darrell (2007) derive a prior distribution from LDA in latent space and use it for a GPLVM to achieve DR; the latent representation is then fed to a Gaussian process classifier. A clear disadvantage of filter approaches is the heuristic nature of the objective for DR, which acts as a proxy for classification, and is therefore not optimal for the classifier learned afterwards. As we showed, a good filter objective would be to collapse all classes and place them in the corners of a simplex.
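The LDA eigenvalue problem mentioned above, and the reason it yields only a limited number of useful directions, can be illustrated with a small NumPy sketch. It assumes the standard textbook formulation (generalized eigenproblem of the between-class and within-class scatter matrices); the function name `lda_directions`, the synthetic Gaussian data and the relative threshold `1e-8` are hypothetical choices made for this example only.

```python
import numpy as np

def lda_directions(X, y):
    """Solve Sb a = lambda Sw a; return eigenvalues (descending) and directions."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    D = X.shape[1]
    Sw = np.zeros((D, D))   # within-class scatter
    Sb = np.zeros((D, D))   # between-class scatter
    for k in classes:
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        Sw += (Xk - mk).T @ (Xk - mk)
        d = (mk - mean_all)[:, None]
        Sb += len(Xk) * (d @ d.T)
    # Whiten with the Cholesky factor of Sw (assumes Sw is positive definite),
    # which turns the generalized problem into a symmetric one.
    C = np.linalg.cholesky(Sw)
    C_inv = np.linalg.inv(C)
    evals, U = np.linalg.eigh(C_inv @ Sb @ C_inv.T)
    order = np.argsort(evals)[::-1]
    return evals[order], (C_inv.T @ U)[:, order]

rng = np.random.default_rng(0)
K, D, n_per_class = 3, 6, 40
# Synthetic data: unit-variance Gaussian classes with shifted means.
X = np.vstack([rng.normal(size=(n_per_class, D)) + 3.0 * np.eye(D)[k]
               for k in range(K)])
y = np.repeat(np.arange(K), n_per_class)

evals, directions = lda_directions(X, y)
# Count eigenvalues that are not numerically zero.
n_useful = int(np.sum(evals > 1e-8 * evals[0]))
Z = X @ directions[:, :n_useful]   # latent projections for a downstream classifier
```

Because the weighted class-mean deviations sum to zero, the between-class scatter matrix has rank at most K − 1, so `n_useful` cannot exceed K − 1 no matter how large the input dimension is.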
Metric learning algorithms (Xing et al., 2003; Goldberger et al., 2005; Globerson and Roweis, 2006; Weinberger and Saul, 2009; Davis et al., 2007) are closely related to DR for classification. Their goal is to find a Mahalanobis metric (or equivalently a linear transform) in input space such that samples in the same class are projected nearby in latent space while samples from different classes are pushed far apart. One achieves DR if the metric is low-rank. However, most metric learning algorithms first solve a positive semidefinite program without rank constraints, and then compute a low-rank approximation of the learned Mahalanobis matrix to enforce dimensionality reduction, so the optimality of the projection is no longer guaranteed.
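The two-stage procedure described above (learn a full Mahalanobis matrix, then truncate it) can be sketched as follows. This is a minimal illustration, not any specific published algorithm: the name `low_rank_projection` is hypothetical, a random positive semidefinite matrix stands in for a learned metric, and truncation by eigendecomposition is assumed as the low-rank approximation.

```python
import numpy as np

def low_rank_projection(M_full, L):
    """Keep the top-L eigenpairs of a PSD matrix; return A (L x D) with A^T A ~ M_full."""
    evals, evecs = np.linalg.eigh(M_full)        # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:L]            # indices of the L largest
    scales = np.sqrt(np.clip(evals[top], 0.0, None))
    return (evecs[:, top] * scales).T            # rows are scaled eigenvectors

rng = np.random.default_rng(0)
D, L = 6, 2
G = rng.normal(size=(D, D))
M_full = G.T @ G                  # placeholder for a learned full-rank PSD metric

A = low_rank_projection(M_full, L)
M_low = A.T @ A                   # rank-L approximation of the metric

x, x_prime = rng.normal(size=D), rng.normal(size=D)
diff = x - x_prime
d_full = diff @ M_full @ diff     # squared distance under the full metric
d_low = np.sum((A @ diff) ** 2)   # squared Euclidean distance in latent space
```

The truncation discards the directions associated with the smaller eigenvalues, so `d_low` can differ substantially from `d_full`; nothing in this second stage consults the labels, which is why the optimality of the learned metric does not carry over to the projection.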
There also exist several wrapper approaches that train the DR mapping jointly with a classifier in a unified objective function. Pereira and Gordon (2006) (further generalized by Rish et al., 2008) use an objective function that combines the approximation error of the inputs using SVD and the hinge loss from applying a linear SVM to the coordinates, so that the representation extracted this way will be good for classification. This is closely related to supervised dictionary learning (Yang et al., 2012), except that the bias (approximation error) always exists in the model. Also, in this model the latent projections are an implicit function of the inputs, i.e., to project a new input, one needs to solve an optimization problem using the learned basis. In contrast, our DR mapping is explicit and directly applicable to test samples. Ji and Ye (2009) directly minimize the hinge loss of an SVM that operates on linearly transformed inputs (therefore it is a nested error, similar to ours). They apply alternating optimization over both mappings. Because both mappings are linear, they are able to solve for the classifier in the dual and for the DR mapping using SVD. This would not be possible in general if the DR mapping had a nonlinear form, a case that our algorithm does handle. Also, the SVD solution limits the maximum meaningful latent dimension to the number of classes. In contrast with these approaches, our algorithm is bias-free, works with any , and trains a nonlinear DR mapping fast.
Auxiliary variables were used previously for unsupervised dimensionality reduction (Carreira-Perpiñán and Lu, 2008, 2010) and regression (Wang and Carreira-Perpiñán, 2012). But the unconstrained objective function defined there, while jointly optimized over a dimension reduction mapping and a regressor, differs from a true wrapper objective function, and results in optima for the combined mapping that are biased.
6 Conclusion
We have proposed an efficient algorithm to train a nonlinear low-dimensional classifier jointly over the nonlinear DR mapping and the classifier (a wrapper approach). The algorithm is easy to implement, reuses existing regression and SVM procedures, and parallelizes well. The resulting classifier achieves state-of-the-art classification error with a small number of basis functions, which can be tuned by the user. The algorithm can be seen as an iterated filter approach with provable convergence to a local minimum of the joint objective. This justifies filter approaches that use a secondary criterion over the DR mapping, such as class separability or intraclass scatter, in an effort to construct a good classifier, but also obviates them, since one can be sure to obtain the best low-dimensional classifier (under the model assumptions) with just a little more computation.
Our experiments illuminate the role of nonlinear DR in linear classification. If we optimize the classification error (the figure of merit one really cares about) jointly over the projection mapping and the classifier, the best DR in fact erases all structure (manifold and otherwise) in the input other than class membership, and uses the latent space to place collapsed classes in such a way that they are maximally linearly separable. Future work should analyze the role of DR with nonlinear classifiers.
Our algorithm generalizes beyond the specific forms of DR mapping and classifier used here, and we are exploring other combinations. In particular, one can replace the DR mapping with a complex feature-extraction mapping that can handle invariances, such as convolutional neural nets, and jointly optimize this and the classifier.
Acknowledgments
Work funded in part by NSF CAREER award IIS–0754089.
References
Belhumeur et al. (1997) P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Analysis and Machine Intelligence, 19(7):711–720, July 1997.
Bi et al. (2003) J. Bi, K. Bennett, M. Embrechts, C. Breneman, and M. Song. Dimensionality reduction via sparse support vector machines. J. Machine Learning Research, 3:1229–1243, Mar. 2003.
Bishop (2006) C. M. Bishop. Pattern Recognition and Machine Learning. Springer Series in Information Science and Statistics. Springer-Verlag, Berlin, 2006.
Carreira-Perpiñán and Lu (2008) M. Á. Carreira-Perpiñán and Z. Lu. Dimensionality reduction by unsupervised regression. In Proc. of the 2008 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR’08), Anchorage, AK, June 23–28 2008.
Carreira-Perpiñán and Lu (2010) M. Á. Carreira-Perpiñán and Z. Lu. Parametric dimensionality reduction by unsupervised regression. In Proc. of the 2010 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR’10), pages 1895–1902, San Francisco, CA, June 13–18 2010.
Carreira-Perpiñán and Wang (2012) M. Á. Carreira-Perpiñán and W. Wang. Distributed optimization of deeply nested systems. Unpublished manuscript, arXiv:1212.5921, Dec. 24 2012.
Carreira-Perpiñán and Wang (2014) M. Á. Carreira-Perpiñán and W. Wang. Distributed optimization of deeply nested systems. In S. Kaski and J. Corander, editors, Proc. of the 17th Int. Workshop on Artificial Intelligence and Statistics (AISTATS 2014), pages 10–19, Reykjavik, Iceland, Apr. 22–25 2014.
Chang and Lin (2011) C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Trans. Intelligent Systems and Technology, 2(3):27, Apr. 2011.
Davis et al. (2007) J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In Z. Ghahramani, editor, Proc. of the 24th Int. Conf. Machine Learning (ICML’07), Corvallis, Oregon, June 20–24 2007.
Duda et al. (2001) R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, New York, London, Sydney, second edition, 2001.
Globerson and Roweis (2006) A. Globerson and S. Roweis. Metric learning by collapsing classes. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems (NIPS), volume 18, pages 451–458. MIT Press, Cambridge, MA, 2006.
Goldberger et al. (2005) J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems (NIPS), volume 17, pages 513–520. MIT Press, Cambridge, MA, 2005.
Grant and Boyd (2012) M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming, version 2.0 beta. http://cvxr.com/cvx, Sept. 2012.
Guyon and Elisseeff (2003) I. Guyon and A. Elisseeff. An introduction to variable and feature selection. J. Machine Learning Research, 3:1157–1182, Mar. 2003.
Ji and Ye (2009) S. Ji and J. Ye. Linear dimensionality reduction for multilabel classification. In Proc. of the 21st Int. Joint Conf. Artificial Intelligence (IJCAI’09), pages 1077–1082, Pasadena, California, July 11–17 2009.
Kohavi and John (1998) R. Kohavi and G. H. John. The wrapper approach. In H. Liu and H. Motoda, editors, Feature Extraction, Construction and Selection. A Data Mining Perspective. Springer-Verlag, 1998.
Mika et al. (1999) S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller. Fisher discriminant analysis with kernels. In Y. H. Hu and J. Larsen, editors, Proc. of the 1999 IEEE Signal Processing Society Workshop on Neural Networks for Signal Processing (NNSP’99), pages 41–48, Madison, WI, Aug. 23–25 1999.
Nocedal and Wright (2006) J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer-Verlag, New York, second edition, 2006.
Pereira and Gordon (2006) F. Pereira and G. Gordon. The support vector decomposition machine. In W. W. Cohen and A. Moore, editors, Proc. of the 23rd Int. Conf. Machine Learning (ICML’06), pages 689–696, Pittsburgh, PA, June 25–29 2006.
Rifkin and Klautau (2004) R. Rifkin and A. Klautau. In defense of one-vs-all classification. J. Machine Learning Research, 5:101–141, Jan. 2004.
Rish et al. (2008) I. Rish, G. Grabarnik, G. Cecchi, F. Pereira, and G. Gordon. Closed-form supervised dimensionality reduction with generalized linear models. In A. McCallum and S. Roweis, editors, Proc. of the 25th Int. Conf. Machine Learning (ICML’08), pages 832–839, Helsinki, Finland, July 5–9 2008.
Schölkopf and Smola (2001) B. Schölkopf and A. J. Smola. Learning with Kernels. Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive Computation and Machine Learning Series. MIT Press, Cambridge, MA, 2001.
Sugiyama (2007) M. Sugiyama. Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. J. Machine Learning Research, 8:1027–1061, May 2007.
Urtasun and Darrell (2007) R. Urtasun and T. Darrell. Discriminative Gaussian process latent variable model for classification. In Z. Ghahramani, editor, Proc. of the 24th Int. Conf. Machine Learning (ICML’07), pages 927–934, Corvallis, Oregon, June 20–24 2007.
Wang and Carreira-Perpiñán (2012) W. Wang and M. Á. Carreira-Perpiñán. Nonlinear low-dimensional regression using auxiliary coordinates. In N. Lawrence and M. Girolami, editors, Proc. of the 15th Int. Workshop on Artificial Intelligence and Statistics (AISTATS 2012), pages 1295–1304, La Palma, Canary Islands, Spain, Apr. 21–23 2012.
Wang and Carreira-Perpiñán (2014) W. Wang and M. Á. Carreira-Perpiñán. The role of dimensionality reduction in classification. In C. E. Brodley and P. Stone, editors, Proc. of the 28th National Conference on Artificial Intelligence (AAAI 2014), Quebec City, Canada, July 27–31 2014.
Weinberger and Saul (2009) K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. J. Machine Learning Research, 10:207–244, Feb. 2009.
Xing et al. (2003) E. Xing, A. Ng, M. Jordan, and S. Russell. Distance metric learning with application to clustering with side-information. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems (NIPS), volume 15, pages 521–528. MIT Press, Cambridge, MA, 2003.
Yang et al. (2012) J. Yang, Z. Wang, Z. Lin, X. Shu, and T. Huang. Bilevel sparse coding for coupled feature spaces. In Proc. of the 2012 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR’12), pages 2360–2367, Providence, RI, June 16–21 2012.