Classification tasks on data sets with large feature dimensions are very common in real-world machine learning applications. Typical examples include microarray data for gene selection and text documents for natural language processing. Despite the large number of features present in the data sets, usually only small subsets of the features are relevant to the particular learning tasks, and local correlation among the features is often observed. Hence, feature selection is required for good model interpretability. Popular classification techniques, such as support vector machine (SVM) and logistic regression, are formulated as convex optimization problems. An extensive literature has been devoted to optimization algorithms that solve variants of these classification models with sparsity regularization[12, 17]. Many of them are based on first-order (gradient-based) methods, mainly because the size of the optimization problem is very large. The advantage of first-order methods is that their computational and memory requirements at each iteration are low and as a result they can handle the large optimization problems occurring in classification problems. Their major disadvantage is their slow convergence, especially when a good approximation of the feature support has been identified. Second-order methods exhibit fast local convergence, but their computational and memory requirements are much more demanding, since they need to store and invert the Newton matrix at every iteration. It is therefore very important to be able to intelligently combine the advantages of both the first and the second order optimization methods in such a way that the resulting algorithm can solve large classification problems efficiently and accurately. As we will demonstrate in this paper such combination is possible by taking advantage of the problem structure and the change in its size during the solution process. In addition, we will also show that our algorithmic framework is flexible enough to incorporate prior knowledge to improve classification performance.
1.1 Related Work
The above requirements demand three features from a learning algorithm: 1. it should be able to automatically select features which are possibly in groups and highly correlated; 2. it has to solve the optimization problem in the training phase efficiently and with high accuracy; and 3. the learning model needs to be flexible enough so that domain knowledge can be easily incorporated. Existing methods are available in the literature that meet some of the above requirements individually. For enforcing sparsity in the solution, efficient optimization algorithms such as that proposed in  can solve large-scale sparse logistic regression. On the other hand, the -regularization is unstable with the presence of highly correlated features - among a group of such features, essentially one of them is selected in a random manner. To handle local correlation among groups of features, the elastic-net regularization  has been successfully applied to SVM  and logistic regression . However, incorporating domain knowledge into the logistic regression formulation is not straightforward. For SVM, including such knowledge in the optimization process has been demonstrated in . Recently, an Alternating Direction Method of Multipliers (ADMM) has been proposed for the elastic-net SVM (ENSVM) . ADMM is quick to find an approximate solution to the ENSVM problem, but it is known to converge very slowly to high accuracy optimal solutions . The interior-point methods (IPM) for SVM are known to be able to achieve high accuracy in their solutions with a polynomial iteration complexity, and the dual SVM formulation is independent of the feature space dimensionality. However, the classical -norm SVM is not able to perform automatic feature selection. Although the elastic-net SVM can be formulated as a QP (in the primal form), its problem size grows substantially with the feature dimensionality. Due to the need to solve a Newton system in each iteration, the efficiency of IPM quickly deteriorates as the feature dimension becomes large.
1.2 Main Contributions
In this paper we propose a new hybrid algorithmic framework for SVM to address all of the above challenges and requirements simultaneously. Our framework combines the advantages of a first-order optimization algorithm (through the use of ADMM) and a second-order method (via IPM) to achieve both superior speed and accuracy. Through a novel algorithmic approach that is able to incorporate expert knowledge, our proposed framework is able to exploit domain knowledge to improve feature selection, and hence, prediction accuracy. Besides efficiency and generalization performance, we demonstrate through experiments on both synthetic and real data that our method is also more robust to inaccuracy in the supplied knowledge than existing approaches.
2 A Two-phase Hybrid Optimization Algorithm
As previously mentioned, for data sets with many features, the high dimensionality of the feature space still poses a computational challenge for IPM. Fortunately, many data sets of this kind are very sparse, and the resulting classifieris also expected to be sparse, i.e. only a small subset of the features are expected to carry significant weights in classification. Naturally, it is ideal for IPM to train a classifier on the most important features only.
Inspired by the Hybrid Iterative Shrinkage (HIS)  algorithm for training large-scale sparse logistic regression classifiers, we propose a two-phase algorithm to shrink the feature space appropriately so as to leverage the high accuracy of IPM while maintaining efficiency. Specifically, we propose to solve an elastic-net SVM (ENSVM) or doubly-regularized SVM (DrSVM)  problem during the first phase of the algorithm. The elastic-net regularization performs feature selection with grouping effect and has been shown to be effective on data sets with many but sparse features and high local correlations . This is the case for text classification, microarray gene expression, and fMRI data sets. The support of the weight vector for ENSVM usually stabilizes well before the algorithm converges to the optimal solution. Taking advantage of that prospect, we can terminate the first phase of the hybrid algorithm early and proceed to solve a classical SVM problem with the reduced feature set in the second phase, using an IPM solver.
2.1 Solving the Elastic Net SVM using ADMM
SVM can be written in the regularized regression form as
where the first term is an averaged sum of the hinge losses and the second term is viewed as a ridge regularization on . It is easy to see from this form that the classical SVM does not enforce sparsity in the solution, and is generally dense. The ENSVM adds an regularization on top of the ridge regularization term, giving
Compared to the Lasso (-regularized regression) , the elastic-net has the advantage of selecting highly correlated features in groups (i.e. the grouping effect) while still enforcing sparsity in the solution. This is a particularly attractive feature for text document data, which is common in the hierarchical classification setting. Adopting the elastic-net regularization as in (2) brings the same benefit to SVM for training classifiers.
To approximately solve problem (2), we adopt the alternating direction method of multipliers (ADMM) for elastic-net SVM recently proposed in . ADMM has a long history dating back to the 1970s . Recently, it has been successfully applied to problems in machine learning . ADMM is a special case of the inexact augmented Lagrangian (IAL) method  for the structured unconstrained problem
where both functions and are convex. We can decouple the two functions by introducing an auxiliary variable and convert problem (3) into an equivalent constrained optimization problem
This technique is often called variable-splitting . The IAL method approximately minimizes in each iteration the augmented Lagrangian of (4) defined by , followed by an update to the Lagrange multiplier . The IAL method is guaranteed to converge to the optimal solution of (3), as long as the subproblem of approximately minimizing the augmented Lagrangian is solved with an increasing accuracy . ADMM can be viewed as a practical implementation of IAL, where the subproblem is solved approximately by minimizing with respect to and alternatingly once. Eckstein and Bertsekas  established the convergence of ADMM for the case of two-way splitting. Now applying variable-splitting and ADMM to problem (2),  introduced auxiliary variables and linear constraints so that the non-smooth hinge loss and -norm in the objective function are decoupled, making it easy to optimize over each of the variables. Specifically, problem (2) is transformed into an equivalent constrained form
where is the -th row of , and . The augmented Lagrangian
is then minimized with respect to and c sequentially in each iteration, followed by an update to the Lagrange multipliers and
. The original problem is thus decomposed into three subproblems consisting of computing the proximal operator of the hinge loss function (with respect toa), solving a special linear system (with respect to ), and performing a soft-thresholding operation (with respect to c), which can all be done in an efficient manner. Due to lack of space in the paper, we have included the detailed solution steps in the Appendix (see Algorithm 0.A.1 ADMM-ENSVM), where we define by the proximal operator associated with the hinge loss
and is the shrinkage operator.
2.2 SVM via Interior-Point Method
Interior Point Methods enjoy fast convergence rates for a wide class of QP problems. Their theoretical polynomial convergence () was first established by Mizuno . In addition, Andersen et al  showed that the number of iterations needed by IPMs to converge is , which demonstrates that their computational effort increases in a slower rate than the size of the problem.
Both the primal and the dual SVM are QP problems. The primal formulation of SVM  is defined as
whereas the dual SVM has the form
where . By considering the KKT conditions of (SVM-P) and (SVM-D), the optimal solution is given by , where is the set of sample indices corresponding to the support vectors. The optimal bias term can be computed from the complementary slackness condition .
Whether to solve (SVM-P) or (SVM-D) for a given data set depends on its dimensions as well as its sparsity. Even if is a sparse matrix, in (SVM-D) is still likely to be dense, whereas the Hessian matrix in (SVM-P) is the identity. The primal problem (SVM-P), however, has a larger variable dimension and more constraints. It is often argued that one should solve (SVM-P) when the number of features is smaller than the number of samples, whereas (SVM-D) should be solved when the number of features is less than that of the samples. Since in the second phase of Algorithm 2.1 we expect to have identified a small number of promising features, we have decided to solve (SVM-D) by using IPM. Solving (SVM-D) is realized through the OOQP  software package that implements a primal-dual IPM for convex QP problems.
2.3 The Two-phase Algorithm
Let us keep in mind that the primary objective for the first phase is to appropriately reduce the feature space dimensionality without impacting the final prediction accuracy. As we mentioned above, the suitability of ADMM for the first phase depends on whether the support of the feature vector converges quickly or not. On an illustrative dataset from , which has 50 samples with 300 features each, ADMM converged in 558 iterations. The output classifier w contained only 13 non-zero features, and the feature support converged in approximately 50 iteration (see Figure 1 in the Appendix for illustrative plots showing the early convergence of ADMM). Although the remaining more than 500 iterations are needed by ADMM in order to satisfy the optimality criteria, they do not offer any additional information regarding the feature selection process. Hence, it is important to monitor the change in the signs and indices of the support and terminate the first phase promptly. In our implementation, we adopt the criterion used in  and monitor the relative change in the iterates as a surrogate of the change in the support, i.e.
We have observed in our experiments that when the change over the iterates is small, the evolution of the support indices stabilizes too.
Upon starting the second phase, it is desirable for IPM to warm-start from the corresponding sub-vector of the solution returned by ADMM. It should also be noted that we apply IPM during the second phase to solve the classical -regularized SVM (1), instead of the ENSVM (2) in the first phase. There are two main reasons for this decision. First, although ENSVM can be reformulated as a QP, the size of the problem is larger than the classical SVM due to the additional linear constraints introduced by the -norm. Second, since we have already identified (approximately) the feature support in the first phase of the algorithm, enforcing sparsity in the reduced feature space becomes less critical. The two-phase algorithm is summarized in Algorithm 2.1.
3 Domain Knowledge Incorporation
Very often, we have prior domain knowledge for specific classification tasks. Domain knowledge is most helpful when the training data does not form a comprehensive representation of the underlying unknown population, resulting in poor generalization performance of SVM on the unseen data from the same population. This often arises in situations where labeled training samples are scarce, while there is an abundance of unlabeled data.
For high dimensional data, ENSVM performs feature selection along with training to produce a simpler model and to achieve better prediction accuracy. However, the quality of the feature selection depends entirely on the training data. In pathological cases, it is very likely that the feature support identified by ENSVM does not form a good representation of the population. Hence, when domain knowledge about certain features is available, we should take it into consideration during the training phase and include the relevant features in the feature support should them be deemed important for classification.
In this section, we explore and propose a new approach to achieve this objective. We consider domain knowledge in the form of class-membership information associated with features. We can incorporate such information (or enforce such rules) in SVM by adding equivalent linear constraints to the SVM QP problem (KSVM) [5, 10]. To be specific, we can model the above information with the linear implication
where and . It is shown in  that by utilizing the non-homogeneous Farkas theorem of the alternative, (8) can be transformed into the following equivalent system of linear inequalities with a solution u
Similarly, for the linear implication for the negative class membership we have:
which can be represented by the set of linear constraints in v
Hence, to incorporate the domain knowledge represented by (8) and (10) into SVM, Fung et al  simply add the linear constraints (9) and (11) to (SVM-P). Their formulation, however, increases both the variable dimension and the number of linear constraints by at least , where is the number of features in the classification problem we want to solve. This is clearly undesirable when is large, which is the scenario that we consider in this paper.
In order to avoid the above increase in the size of the optimization probelm, we choose to penalize the quadratics and instead of their counterparts considered in . By doing so the resulting problem is still a convex QP but with a much smaller size. Hence, we consider the following model for domain knowledge incorporation.
We are now ready to propose a novel combination of ENSVM and KSVM, and we will explain in the next section how the combined problem can be solved in our HIPAD framework. The main motivation behind this combination is to exploit domain knowledge to improve the feature selection, and hence, the generalization performance of HIPAD. To the best of our knowledge, this is the first method of this kind.
3.1 ADMM Phase
Our strategy for solving the elastic-net SVM with domain knowledge incorporation is still to apply the ADMM method. First, we combine problems (2) and (KSVM-P) and write the resulting optimization problem in an equivalent unconstrained form (by penalizing the violation of the inequality constraints through hinge losses in the objective function)
where . We then apply variable-splitting to decouple the -norms and hinge losses and obtain the following equivalent constrained optimization problem:
with . As usual, we form the augmented Lagrangian of problem (12),
and minimize with respect to individually and in order. For the sake of readability, we do not penalize the non-negative constraints for u and v in the augmented Lagrangian.
Given , solving for again involves solving a linear system
where and . Similar to solving the linear system in Algorithm 0.A.1 ADMM-ENSVM, we can compute the solution to the above linear system through a few PCG iterations, taking advantage of the fact that the left-hand-side matrix is of low-rank.
To minimize the augmented Lagrangian with respect to u, we need to solve a convex quadratic problem with non-negative constraints
Solving problem (14) efficiently is crucial for the efficiency of the overal algorithm. We describe a novel way to do so. Introducing a slack variable s and transferring the non-negative constraint on u to s, we decompose the problem into two parts which are easy to solve. Specifically, we reformulate (14) as
Penalizing the linear constraint in the new augmented Lagrangian, the new subproblem with respect to is
Given an , we can compute by solving a linear system
where . We assume that has full row-rank. This is a reasonable assumption since otherwise there is at least one redundant domain knowledge constraint and we can simply remove it. The number of domain knowledge constraints ( and ) are usually small, so the system (16) can be solved exactly and efficiently by Cholesky factorization.
Solving for corresponding to is easy, observing that problem (15) is separable in the elements of s. For each element , the optimal solution to the one-dimensional quadratic problem with a non-negative constraint on is given by . Writing in the vector form, . Similarly, we solve for by introducing a non-negative slack variable t and solve the linear system
where , and .
The subproblem with respect to is
The solution is given by a (one-dimensional) proximal operator associated with the hinge loss
Similarly, the subproblem with respect to is
and the solution is given by
Due to lack of space in the paper, we summarize the detailed solution steps in the Appendix (see Algorithm 0.A.2 ADMM-ENK)
Although there appears to be ten additional parameters (six ’s and four ’s) in the ADMM method for ENK-SVM, we can usually set the ’s to the same value and do the same for the ’s. Hence, in practice, there is only one additional parameter to tune, and our computational experience in Section 4.2 is that the algorithm is fairly insensitive to the ’s and ’s.
3.2 IPM Phase
The second phase for solving the knowledge-based SVM problem defined by (KSVM-P) follows the same steps as that described in section 2.2. Note that in the knowledge-based case we have decided to solve the primal problem. This decision was based on extensive numerical experiments with both the primal and dual formulation which revealed that the primal formulation is more efficient.
We found in our experiments that by introducing slack variables and transforming the above problem into a linearly equality-constrained QP, Phase 2 of HIPAD usually requires less time to solve.
3.3 HIPAD with domain knowledge incorporation
We formally state the new two-phase algorithm for the elastic-net KSVM in Algorithm 3.1.
4 Numerical results
We present our numerical experience with the two main algorithms proposed in this paper: HIPAD and its knowledge-based version HIPAD-ENK. We compare their performance with their non-hybrid counterparts, i.e., ADMM-ENSVM and ADMM-ENK, which use ADMM to solve the original SVM problem. The transition condition at the end of Phase 1 is specified in (7), with . The stopping criteria for ADMM are as follows: and , with , and .
4.1 HIPAD vs ADMM
To demonstrate the practical effectiveness of HIPAD, we tested the algorithm on nine real data sets which are publicly available. rcv1  is a collection of manually categorized news wires from Reuters. The original multiple labels have been consolidated into two classes for binary classification. real-sim contains UseNet articles from four discussion groups, for simulated auto racing, simulated aviation, real autos, and real aviation. Both rcv1 and real-sim have large feature dimensions but are highly sparse. The rest of the seven data sets are all dense data. rcv1 and real-sim are subsets of the original data sets, where we randomly sampled 500 training instances and 1,000 test instances. gisette is a handwriting digit recognition problem from NIPS 2003 Feature Selection Challenge, and we also sub-sampled 500 instances for training. (For testing, we used the original test set of 1,000 instances.) duke, leukemia, and colon-cancer are data sets of gene expression profiles for breast cancer, leukemia, and colon cancer respectively. fMRIa, fMRIb, and fMRIc are functional MRI (fMRI) data of brain activities when the subjects are presented with pictures and text paragraphs. The data was compiled and made available by Tom Mitchell’s neuroinformatics research group 111http://www.cs.cmu.edu/ tom/fmri.html. Except the three fMRI data sets, all the other data sets and their references are available at the LIBSVM website 222http://www.csie.ntu.edu.tw/ cjlin/libsvmtools/datasets.
The parameters of HIPAD, ADMM-ENSVM, and LIBSVM were selected through cross validation on the training data. We summarize the experiment results in Table 1
. Clearly, HIPAD produced the best overall predication performance. In order to test the significance of the difference, we used the test statistic in based on Friedman’s , and the results are significant at . In terms of CPU-time, HIPAD consistently outperformed ADMM-ENSVM by several times on dense data. The feature support sizes selected by HIPAD were also very competitive or even better than the ones selected by ADMM-ENSVM. In most cases, HIPAD was able to shrink the feature space to below 10 of the original size.
|Accuracy ()||Support size||CPU (s)||Accuracy ()||Support size||CPU (s)||accuracy ()|
4.2 Simulation for Knowledge Incorporation
We generated synthetic data to simulate the example presented at the beginning of Section 3 in the high dimensional feature space. Specifically, four groups of multi-variate Gaussians were sampled from and for four disjoint blocks of feature values (). For positive class samples, ; for negative class samples, . All the covariance matrices have 1 on the diagonal and 0.8 everywhere else. The training samples contain blocks and
, while all four blocks are present in the test samples. A random fraction (5%-10%) of the remaining entries in all the samples are generated from the standard Gaussian distribution.
The training samples are apparently hard to separate because the values of blocks and for the two classes are close to each other. However, blocks and in the test samples are well-separated. Hence, if we are given information about these two blocks as general knowledge for the entire population, we could expect the resulting SVM classifier to perform much better on the test data. Since we know the mean values of the distributions from which the entries in and are generated, we can supply the following information about the relationship between the block sample means and class membership to the KSVM: , and where is the length of the , , and represent the positive and negative classes, and the lowercase denotes the -th entry of the sample x. Translating into the notation of (KSVM-P), we have
The information given here is not precise, in that we are confident that a sample should belong to the positive (or negative) class only when the corresponding block sample mean well exceeds the distribution mean. This is consistent with real-world situations, where the domain or expert knowledge tends to be conservative and often does not come in exact form.
We simulated two sets of synthetic data for ENK-SVM as described above, with for ksvm-s-10k and for ksvm-s-50k. The number of features in each of the four blocks () is 50 for both data sets. Clearly, HIPAD-ENK is very effective in terms of speed, feature selection, and prediction accuracy on these two data sets. Even though the features in blocks and are not discriminating in the training data, HIPAD-ENK was still able to identify all the 200 features in the four blocks correctly and exactly. This is precisely what we want to achieve as we explained at the beginning of Section 3
. The expert knowledge not only helps rectify the separating hyperplane so that it generalizes better on the entire population, but also makes the training algorithm realize the significance of the features in blocksand .
|Accuracy ()||Support size||CPU (s)||Accuracy ()||Support size||CPU (s)||accuracy ()|
We have proposed a two-phase hybrid optimization framework for solving the ENSVM, in which the first phase is solved by ADMM, followed by IPM in the second phase. In addition, we have proposed a knowledge-based extension of the ENSVM which can be solved by the same hybrid framework. Through a set of experiments, we demonstrated that our method has significant advantage over the existing method in terms of computation time and the resulting prediction accuracy. The algorithmic framework introduced in this paper is general enough and potentially applicable to other regularization-based classification or regression problems.
G. J. Andersen, A., C. Meszaros, and X. Xu.
Implementation of interior point methods for large scale linear programming.In In ”Interior point methods in mathematical programming”, T. Terlaky, editor, pages 189–252. Kluwer Academic Publishers, 1996.
-  S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
-  P. Combettes and J. Pesquet. Proximal splitting methods in signal processing. Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages 185–212, 2011.
-  J. Eckstein and D. Bertsekas. On the douglas rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1):293–318, 1992.
-  G. Fung, O. Mangasarian, and J. Shavlik. Knowledge-based support vector machine classifiers. Advances in Neural Information Processing Systems, pages 537–544, 2003.
-  D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976.
-  E. Gertz and S. Wright. Object-oriented software for quadratic programming. ACM Transactions on Mathematical Software (TOMS), 29(1):58–81, 2003.
-  R. L. Iman and J. M. Davenport. Approximations of the critical region of the fbietkan statistic. Communications in Statistics-Theory and Methods, 9(6):571–595, 1980.
-  K. Koh, S. Kim, and S. Boyd. An interior-point method for large-scale l1-regularized logistic regression. Journal of Machine learning research, 8(8):1519–1555, 2007.
-  F. Lauer and G. Bloch. Incorporating prior knowledge in support vector machines for classification: A review. Neurocomputing, 71(7-9):1578–1594, 2008.
-  D. Lewis, Y. Yang, T. Rose, and F. Li. Rcv1: A new benchmark collection for text categorization research. The Journal of Machine Learning Research, 5:361–397, 2004.
-  P. P. M. and P. E. Hansen. Data Mining and Mathematical Programming. American Mathematical Society, 2008.
-  S. Mizuno. Polynomiality of infeasible interior point algorithms for linear programming. Mathematical Programming, 67:109–119, 1994.
-  R. Rockafellar. The multiplier method of hestenes and powell applied to convex programming. Journal of Optimization Theory and Applications, 12(6):555–562, 1973.
-  S. Ryali, K. Supekar, D. Abrams, and V. Menon. Sparse logistic regression for whole-brain classification of fmri data. NeuroImage, 51(2):752–764, 2010.
-  J. Shi, W. Yin, S. Osher, and P. Sajda. A fast hybrid algorithm for large-scale l 1-regularized logistic regression. The Journal of Machine Learning Research, 11:713–741, 2010.
-  N. S. Sra, S. and S. Wright. Optimization for Machine Learning. MIT Press, 2011.
-  R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288, 1996.
The nature of statistical learning theory. Springer Verlag, 2000.
-  L. Wang, J. Zhu, and H. Zou. The doubly regularized support vector machine. Statistica Sinica, 16(2):589–615, 2006.
G.-B. Ye, Y. Chen, and X. Xie.
Efficient variable selection in support vector machines via the
alternating direction method of multipliers.
International Conference on Artificial Intelligence and Statistics, pages 832–840, 2011.
-  H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.