Sparse support vector machine (SVM) is a popular classification technique that can simultaneously learn a small set of the most interpretable features and identify the support vectors. It has achieved great success in many real-world applications. However, for large-scale problems involving a huge number of samples and extremely high-dimensional features, solving sparse SVMs remains challenging. Noting that sparse SVMs induce sparsity in both the feature and sample spaces, we propose a novel approach, based on accurate estimations of the primal and dual optima of sparse SVMs, to simultaneously identify the features and samples that are guaranteed to be irrelevant to the outputs. We can thus remove the identified inactive samples and features from the training phase, leading to substantial savings in both memory usage and computational cost without sacrificing accuracy. To the best of our knowledge, the proposed method is the first static feature and sample reduction method for sparse SVM. Experiments on both synthetic and real datasets (e.g., the kddb dataset with about 20 million samples and 30 million features) demonstrate that our approach significantly outperforms state-of-the-art methods and that the speedup gained by our approach can be orders of magnitude.
Sparse support vector machine (SVM) [1, 22] is a powerful technique that can simultaneously perform classification by margin maximization and variable selection by an $\ell_1$-norm penalty. The last few years have witnessed many successful applications of sparse SVMs, such as text mining [10, 24], bioinformatics and image processing [12, 11]. Many algorithms [7, 6, 3, 9, 16] have been proposed to efficiently solve sparse SVM problems. However, applying sparse SVMs to large-scale learning problems, which involve a huge number of samples and extremely high-dimensional features, remains challenging.
An emerging technique, called screening, has been shown to be promising in accelerating large-scale sparse learning. The essential idea of screening is to quickly identify the zero coefficients in the sparse solutions without solving any optimization problems, so that the corresponding features or samples (called inactive features or samples) can be removed from the training phase. Then, we only need to perform optimization on the reduced datasets instead of the full datasets, leading to substantial savings in computational cost and memory usage. We emphasize that screening differs greatly from feature selection methods, although they look similar at first glance. To be precise, screening is devoted to accelerating the training of many sparse models, including Lasso and sparse SVM, while feature selection is the goal of these models. In the past few years, many screening methods have been proposed for a large set of sparse learning techniques, such as Lasso [19, 23, 21], group Lasso, $\ell_1$-regularized logistic regression, and SVM. Empirical studies indicate that screening methods can lead to orders of magnitude of speedup in computation time.
However, most existing screening methods study either feature screening or sample screening individually, and their applications have very different scenarios. Specifically, to achieve better performance (say, in terms of speedup), we favor feature screening methods when the number of features $p$ is much larger than the number of samples $n$, while sample screening methods are preferable when $n \gg p$. Note that there is another class of sparse learning techniques, like sparse SVMs, which induce sparsity in both the feature and sample spaces. None of these screening methods can accelerate the training of such models when both $n$ and $p$ are large. We also cannot address this problem by simply combining the existing feature and sample screening methods, because they are designed for different sparse models and could mistakenly discard relevant data. Recently, Shibagaki et al. considered this problem and proposed a method to simultaneously identify the inactive features and samples in a dynamic manner; that is, during the optimization process, they trigger their testing rule when there is a sufficient decrease in the duality gap. Thus, their method can discard more inactive features and samples as the optimization proceeds, and one has small-scale problems to solve in the late stage of the optimization. Nevertheless, the overall speedup can be limited, as the problem size can be large in the early stage of the optimization. To be specific, their method depends heavily on the duality gap during the optimization process. The duality gap in the early stage is often large, which makes the dual and primal estimations inaccurate and results in ineffective screening rules. Hence, the method is essentially solving a large problem in the early stage.
In this paper, to address the limitations of the dynamic screening method, we propose a novel screening method that can Simultaneously identify Inactive Features and Samples (SIFS) for sparse SVMs in a static manner; that is, we only need to perform SIFS once before (instead of during) optimization. Thus, we only need to run the optimization algorithm on small-scale problems. The major technical challenge in developing SIFS is that we need to accurately estimate the primal and dual optima: the more accurate the estimations are, the more effective SIFS is in detecting inactive features and samples. Our major technical contribution is therefore a novel framework, based on the strong convexity of the primal and dual problems of sparse SVMs [see problems (P) and (D) in Section 2], for deriving accurate estimations of the primal and dual optima (see Section 3). Another appealing feature of SIFS is the so-called synergy effect. Specifically, SIFS consists of two parts, Inactive Feature Screening (IFS) and Inactive Samples Screening (ISS). We show that discarding the inactive features (samples) identified by IFS (ISS) leads to a more accurate estimation of the primal (dual) optimum, which in turn dramatically enhances the capability of ISS (IFS) in detecting inactive samples (features). Thus, SIFS applies IFS and ISS in an alternating manner until no more inactive features and samples can be identified, leading to much better performance in scaling up large-scale problems than the application of ISS or IFS individually. Moreover, SIFS (see Section 4) is safe in the sense that the detected features and samples are guaranteed to be absent from the sparse representations. To the best of our knowledge, SIFS is the first static screening rule for sparse SVM that is able to simultaneously detect inactive features and samples.
Experiments (see Section 5) on both synthetic and real datasets demonstrate that SIFS significantly outperforms the state-of-the-art method in improving the efficiency of sparse SVMs, and the speedup can be orders of magnitude. Detailed proofs of the theoretical results in the main text are in the supplementary material.
Notations: Let , , and be the , , and norms, respectively. We denote the inner product of vectors and by , and the -th component of by . Let for a positive integer . Given a subset of , let be the cardinality of . For a vector , let . For a matrix , let and , where and are the row and column of , respectively. For a scalar , we denote by .
In this section, we briefly review some basics of sparse SVMs and then motivate SIFS via the KKT conditions. Specifically, we focus on the $\ell_1$-regularized SVM with a smoothed hinge loss that has strong theoretical guarantees, which takes the form of
where is the parameter vector to be estimated, is the training set, , , , and are positive parameters, and the loss function is
where . We present the Lagrangian dual problem of problem (P) and the KKT conditions in the following theorem, which plays a fundamentally important role in developing our screening rule.
which imply that
Thus, we call the feature inactive if . The samples in are the so-called support vectors and we call the samples in and inactive samples.
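The objects discussed in this section can be sketched numerically. Since the displayed formulas are not reproduced in this text, the sketch below assumes the commonly used forms: a smoothed hinge loss with smoothing parameter `gamma`, and a primal objective made of the average loss plus a squared $\ell_2$ term weighted by `alpha` and an $\ell_1$ term weighted by `beta`. The function names and the three-way split of samples by margin are illustrative choices, not the paper's notation.

```python
import numpy as np

def smoothed_hinge(t, gamma):
    """Assumed loss: 0 for t < 0, t^2/(2*gamma) on [0, gamma], t - gamma/2 above."""
    t = np.asarray(t, dtype=float)
    quadratic = t ** 2 / (2.0 * gamma)
    linear = t - gamma / 2.0
    return np.where(t < 0, 0.0, np.where(t <= gamma, quadratic, linear))

def primal_objective(w, X, y, alpha, beta, gamma):
    """Assumed primal: mean loss + (alpha/2)*||w||^2 + beta*||w||_1."""
    margins = 1.0 - y * (X @ w)  # 1 - <y_i * x_i, w> for every sample
    return (smoothed_hinge(margins, gamma).mean()
            + 0.5 * alpha * float(w @ w)
            + beta * float(np.abs(w).sum()))

def partition_samples(w, X, y, gamma):
    """Split samples by which piece of the loss they fall on."""
    margins = 1.0 - y * (X @ w)
    zero_loss = np.flatnonzero(margins < 0)                      # dual variable pinned at one bound
    support = np.flatnonzero((margins >= 0) & (margins <= gamma))  # support vectors
    linear = np.flatnonzero(margins > gamma)                     # dual variable pinned at the other bound
    return zero_loss, support, linear
```

Under these assumptions, only the samples in the middle group need their dual variables solved for; the other two groups are the inactive samples the text refers to.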
Suppose that we are given subsets of , , and . Then, by (R), many coefficients of and are known. Thus, we may have far fewer unknowns to solve for, and the problem size can be dramatically reduced. We formalize this idea in Lemma 1.
Given index sets , and , the following hold:
, , .
Let , , and , where , , and . Then, solves the following scaled dual problem:
Suppose that is known. Then,
Lemma 1 indicates that, if we can identify index sets and and the cardinalities of and are much smaller than the feature dimension and the dataset size , we only need to solve a problem (scaled-D) that may be much smaller than problem (D) to exactly recover the optima and without sacrificing any accuracy.
However, we cannot directly apply the rules in (R) to identify subsets of , , and , as they require the knowledge of and that are usually unavailable. Inspired by the idea in , we can first estimate regions and that contain and , respectively. Then, by denoting
since it is easy to know that , the rules in (R) can be relaxed as follows:
Derive estimations and such that and , respectively.
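To make the relaxation concrete, the sketch below assumes the simplest possible estimates: a Euclidean ball for each optimum, given by a hypothetical center and radius. It also assumes the usual form of the exact rules, namely that a feature is inactive when the absolute correlation of its column with the dual optimum, divided by $n$, is at most `beta`, and that a sample is inactive when its margin term lies strictly outside $[0, \gamma]$. The rules derived later in the paper use tighter regions, so this is an illustration of the idea only.

```python
import numpy as np

def ball_feature_test(Xbar, theta_center, theta_radius, beta):
    """True where a feature is safely inactive for every theta in the ball.

    Xbar has one row per sample (rows assumed to be y_i * x_i).
    By Cauchy-Schwarz, max over the ball of |<col_j, theta>| equals
    |<col_j, center>| + radius * ||col_j||.
    """
    n = Xbar.shape[0]
    worst_case = (np.abs(Xbar.T @ theta_center)
                  + theta_radius * np.linalg.norm(Xbar, axis=0)) / n
    return worst_case <= beta

def ball_sample_test(Xbar, w_center, w_radius, gamma):
    """Return masks of samples whose loss piece is fixed for every w in the ball."""
    center = 1.0 - Xbar @ w_center
    slack = w_radius * np.linalg.norm(Xbar, axis=1)
    always_zero_loss = center + slack < 0    # margin term negative over the whole ball
    always_linear = center - slack > gamma   # margin term above gamma over the whole ball
    return always_zero_loss, always_linear
```

A smaller radius makes both worst-case bounds tighter, which is why the accuracy of the estimates directly controls how many features and samples can be discarded.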
In this section, we first show that the primal and dual optima admit closed form solutions for specific values of and (see Section 3.1). Then, in Sections 3.2 and 3.3, we present accurate estimations of the primal and dual optima, respectively.
We first show that, if the value of is sufficiently large, no matter what is, the primal solution is .
Let . Then, for and , we have
For any , the next result shows that, if is large enough, the primal and dual optima admit closed form solutions.
If we denote
then for all , we have
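The first closed-form case can be checked numerically under stated assumptions. The sketch assumes that the parameter that must be large is the $\ell_1$ weight `beta`, that the primal optimum is recovered from the dual optimum by soft-thresholding the averaged correlations and dividing by `alpha`, and that the threshold is the largest absolute averaged correlation with the all-ones dual vector. None of these formulas is shown in this text, so `beta_max`, `soft_threshold`, and `w_from_theta` are hypothetical names for illustration.

```python
import numpy as np

def beta_max(X, y):
    """Assumed threshold: ||(1/n) * sum_i y_i x_i||_inf."""
    n = X.shape[0]
    return float(np.abs(X.T @ y).max()) / n

def soft_threshold(z, beta):
    """Shrink each entry toward zero by beta."""
    return np.sign(z) * np.maximum(np.abs(z) - beta, 0.0)

def w_from_theta(X, y, theta, alpha, beta):
    """Assumed KKT map from a dual point to a primal point."""
    n = X.shape[0]
    return soft_threshold(X.T @ (y * theta) / n, beta) / alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
y = np.where(rng.random(50) < 0.5, 1.0, -1.0)
b_max = beta_max(X, y)

# At beta >= b_max the all-ones dual vector maps to w = 0, and w = 0 puts
# every sample on the linear piece of the loss, consistent with theta = 1.
w_at_max = w_from_theta(X, y, np.ones(50), alpha=0.3, beta=b_max)
w_below = w_from_theta(X, y, np.ones(50), alpha=0.3, beta=0.5 * b_max)
```

Here `w_at_max` is the zero vector for any positive `alpha`, while `w_below` has at least one nonzero entry, which matches the claim that the trivial solution holds regardless of the other parameter once the threshold is reached.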
In Section 1, we mention that the proposed SIFS consists of IFS and ISS, and an alternating application of IFS and ISS can improve the estimation of the primal and dual optima, which can in turn make ISS and IFS more effective in identifying inactive samples and features, respectively. Lemma 2 shows that discarding inactive features by IFS leads to a more accurate estimation of the primal optimum.
Suppose that the reference solution with and is known. Consider problem (P) with parameters and . Let be the index set of the inactive features identified by the previous IFS steps, i.e., . We define
Then, the following holds:
As is the index set of identified inactive features, we have . Hence, we only need to find an accurate estimation of . Lemma 2 shows that lies in a ball of radius centered at . Note that, before we perform IFS, the set is empty and thus the second term on the right hand side (RHS) of Eq. (6) is . If we apply IFS multiple times (alternating with ISS), the set will be monotonically increasing. Thus, Eq. (6) implies that the radius will be monotonically decreasing, leading to a more accurate primal optimum estimation.
Similar to Lemma 2, the next result shows that ISS can improve the estimation of the dual optimum.
Suppose that the reference solution with and is known. Consider problem (D) with parameters and . Let and be the index sets of inactive samples identified by the previous ISS steps, i.e., , , and . We define
Then, the following holds:
Similar to Lemma 2, Lemma 3 also bounds by a ball. In view of Eq. (9), an argument similar to the one for Lemma 2 applies: the index sets and grow monotonically when we perform ISS multiple times (alternating with IFS), so the last two terms on the RHS of Eq. (9) grow monotonically. This implies that the ISS steps can reduce the radius and thus improve the dual optimum estimation.
where is given by Eq. (10) and and are the index sets of inactive features and samples that have been identified in previous screening processes, respectively. The next result shows the closed form solution of problem (11).
We are now ready to present the IFS rule.
Consider problem (P). We suppose that and are known. Then,
The feature screening rule IFS takes the form of
We update the index set by
where is given by Eq. (7) and and are the index sets of inactive features and samples that have been identified in previous screening processes. We show that problems (13) and (14) admit closed form solutions.
We are now ready to present the ISS rule.
Consider problem (D). We suppose that and are known. Then,
The sample screening rule ISS takes the form of
We update the index sets and by
In real applications, the optimal parameter values of and are usually unknown. To determine appropriate parameter values, common approaches, like cross validation and stability selection, need to solve the model over a grid of parameter values with and . This can be very time-consuming. Inspired by Strong Rule  and SAFE , we develop a sequential version of SIFS in Algorithm 1.
Specifically, given the primal and dual optima and at , we apply SIFS to identify the inactive features and samples for problem (P) at . Then, we perform optimization on the reduced dataset and solve the primal and dual optima at . We repeat this process until we solve problem (P) at all pairs of parameter values.
Note that we insert into every sequence (see line 1 in Algorithm 1) to obtain a closed-form solution as the first reference solution. In this way, we avoid solving the problem at directly (without screening), which is time-consuming. Finally, we would like to point out that the values in SIFS can be specified by users arbitrarily.
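The sequential procedure described above can be sketched as a simple loop. This is a minimal sketch, not Algorithm 1 itself: it assumes the sequence runs over decreasing values of a parameter called `alpha` for a fixed `beta`, and it takes three hypothetical callables (`closed_form`, `screen`, `solve`) standing in for the closed-form reference solution, the SIFS rules, and the solver.

```python
def sequential_sifs(alphas, beta, alpha_max, closed_form, screen, solve):
    """Solve along a decreasing alpha sequence, screening before each solve."""
    # Insert alpha_max so the first reference solution comes from a closed form.
    path = [alpha_max] + sorted({a for a in alphas if a < alpha_max}, reverse=True)
    reference = closed_form(alpha_max, beta)
    solutions = {alpha_max: reference}
    for alpha in path[1:]:
        # Screen with the previous solution as the reference ...
        inactive_features, inactive_samples = screen(reference, alpha, beta)
        # ... then optimize only over what survives the screening.
        reference = solve(alpha, beta, inactive_features, inactive_samples)
        solutions[alpha] = reference
    return solutions
```

Each solve supplies the reference for the next parameter value, so no problem on the path is ever solved on the full dataset.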
SIFS applies ISS and IFS in an alternating manner to reinforce their capability in identifying inactive samples and features. In Algorithm 1, we apply ISS first, but we could equally apply IFS first. The theorem below demonstrates that the order has no impact on the performance of SIFS.
Given the optimal solutions and at as the reference solution pair at for SIFS, assume that SIFS with ISS first stops after applying IFS and ISS for times, and denote the identified inactive features and samples by and . Similarly, when we apply IFS first, the results are denoted by and . Then, the following hold:
(1) and .
(2) The two orders of applying ISS and IFS differ by at most 1 in the number of times ISS and IFS need to be applied in SIFS, that is, .
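The alternation and the role of the starting rule can be illustrated on a toy example. The sketch below is not the SIFS rules: `toy_iss` and `toy_ifs` are made-up monotone rules in which removing some features unlocks some samples and vice versa, chosen only to show how the loop behaves.

```python
def alternate(iss, ifs, first="iss"):
    """Apply the two rules in turn until a full round finds nothing new."""
    features, samples = set(), set()
    steps = ["iss", "ifs"] if first == "iss" else ["ifs", "iss"]
    rounds = 0
    while True:
        before = (len(features), len(samples))
        for step in steps:
            if step == "iss":
                samples |= iss(features, samples)
            else:
                features |= ifs(features, samples)
        rounds += 1
        if (len(features), len(samples)) == before:
            return features, samples, rounds

def toy_iss(features, samples):
    # Sample "a" is always detectable; "b" only after feature "x" is removed.
    return {"a", "b"} if "x" in features else {"a"}

def toy_ifs(features, samples):
    # Feature "x" needs sample "a" removed; feature "y" needs sample "b" removed.
    found = set()
    if "a" in samples:
        found.add("x")
    if "b" in samples:
        found.add("y")
    return found

iss_first = alternate(toy_iss, toy_ifs, first="iss")
ifs_first = alternate(toy_iss, toy_ifs, first="ifs")
```

On this toy example both orders end with the same removed sets, and the round counts (which include the final round that finds nothing) differ by one, mirroring the two statements of the theorem.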
From Remark 1, we can see that our SIFS can also be applied to solve a single problem, due to the existence of the free reference solution pair.
We evaluate SIFS on both synthetic and real datasets in terms of three measurements. The first one is the scaling ratio: , where , , , and are the numbers of inactive samples and features identified by SIFS, sample size, and feature dimension of the datasets. The second measure is rejection ratios of each triggering of ISS and IFS in SIFS: and , where and are the numbers of inactive samples and features identified in -th triggering of ISS and IFS in SIFS. and are the numbers of inactive samples and features in the solution. The third measure is speedup, i.e., the ratio of the running time of the solver without screening to that with screening.
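The three measurements can be written as small helper functions. The speedup and the rejection ratio follow the verbal definitions above. The formula for the scaling ratio is not reproduced in this text, so the version below is an assumption: the fraction of entries of the data matrix that screening removes.

```python
def scaling_ratio(n, p, n_removed, p_removed):
    """Assumed form: 1 - (remaining samples * remaining features) / (n * p)."""
    return 1.0 - (n - n_removed) * (p - p_removed) / (n * p)

def rejection_ratio(identified_in_trigger, total_inactive):
    """Share of the truly inactive items found in one triggering of a rule."""
    if total_inactive == 0:
        return 0.0
    return identified_in_trigger / total_inactive

def speedup(time_without_screening, time_with_screening):
    """Running time of the plain solver divided by that of solver plus screening."""
    return time_without_screening / time_with_screening
```

For example, removing half of the samples and half of the features gives `scaling_ratio(100, 50, 50, 25) == 0.75` under this assumed definition.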
Recall that we can integrate SIFS with any solver for problem (P). In this experiment, we use Accelerated Proximal Stochastic Dual Coordinate Ascent (Accelerated-Prox-SDCA), as it is one of the state-of-the-art solvers. As mentioned in the introduction, screening differs greatly from feature selection methods, so it is not appropriate to make comparisons with feature selection methods. Therefore, we only choose the state-of-the-art screening method for sparse SVMs, that of Shibagaki et al., as a baseline in the experiments.
For each dataset, we solve problem (P) on a grid of tuning parameter values. Specifically, we first compute by Theorem 2 and then select 10 values of that are equally spaced on the logarithmic scale of from to . Then, for each value of , we first compute by Theorem 3 and then select values of that are equally spaced on the logarithmic scale of from to . Thus, for each dataset, we solve problem (P) at pairs of parameter values in total. We write the code in C++ with the Eigen library for some numerical computations. We perform all the computations on a single core of an Intel(R) Core(TM) i7-5930K 3.50GHz with 128GB of memory.
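A nested logarithmic grid of this kind can be generated as follows. This is a sketch under assumptions: the outer parameter is called `beta` and the inner one `alpha`, the inner maximum depends on the outer value through a hypothetical callable `alpha_max_of`, and the lower ends of both ranges (`beta_min_ratio`, `alpha_min_ratio`) as well as the inner grid size are placeholders, since the actual values are not shown in this text.

```python
import numpy as np

def log_grid(v_max, min_ratio, num=10):
    """num values from v_max down to min_ratio * v_max, equally spaced in log scale."""
    return v_max * np.geomspace(1.0, min_ratio, num)

def parameter_pairs(beta_max, alpha_max_of, beta_min_ratio, alpha_min_ratio, num=10):
    """Build all (alpha, beta) pairs; the alpha range is recomputed for every beta."""
    pairs = []
    for beta in log_grid(beta_max, beta_min_ratio, num):
        for alpha in log_grid(alpha_max_of(beta), alpha_min_ratio, num):
            pairs.append((float(alpha), float(beta)))
    return pairs
```

With `num=10` for both parameters this produces 100 pairs, and within each inner sequence consecutive values have a constant ratio.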
We evaluate SIFS on 3 synthetic datasets named syn1, syn2 and syn3 with sample and feature size . We present each data point as with and . We use Gaussian distributions and to generate the data points, where and is the identity matrix. To be precise, for positive and negative points are sampled from and , respectively. Each entry in has chance to be sampled from and chance to be 0.
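A generator in the spirit of this description might look like the sketch below. All numeric settings are hypothetical placeholders, since the actual dimensions, means, variances and sparsity level are not shown in this text: `informative_frac`, `mean`, `var` and `density` are illustrative defaults, and the balanced split between positive and negative points is an assumption as well.

```python
import numpy as np

def make_synthetic(n, p, informative_frac=0.02, mean=1.5, var=0.75,
                   density=0.02, seed=0):
    """Two-block data: a dense class-dependent block and a sparse noise block."""
    rng = np.random.default_rng(seed)
    p1 = max(1, int(round(informative_frac * p)))
    # First half positive, second half negative (assumed balance).
    y = np.where(np.arange(n) < n // 2, 1.0, -1.0)
    # Dense block: Gaussian with mean +mean for positives and -mean for negatives.
    x1 = rng.normal(loc=mean * y[:, None], scale=np.sqrt(var), size=(n, p1))
    # Sparse block: each entry is standard normal with probability `density`, else 0.
    mask = rng.random((n, p - p1)) < density
    x2 = np.where(mask, rng.normal(size=(n, p - p1)), 0.0)
    return np.hstack([x1, x2]), y
```

Only the first block carries class information, so a sparse classifier fitted to such data should ideally select features from that block.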
Fig. 1 shows the scaling ratios by ISS, IFS, and SIFS on the synthetic datasets at parameter values. We can see that IFS is more effective in scaling problem size than ISS, with scaling ratios roughly against . Moreover, SIFS, which is an alternating application of IFS and ISS, significantly outperforms ISS and IFS, with scaling ratios roughly . These high scaling ratios imply that SIFS can lead to a significant speedup.
Due to the space limitation, we only report the rejection ratios of SIFS on syn2. Other results can be found in the supplementary material. Fig. 2 shows that SIFS can identify most of the inactive features and samples. However, few features and samples are identified in the second and later triggerings of ISS and IFS. The reason may be that the task here is so simple that one triggering is enough.
Table 1 reports the running time of solver without and with IFS, ISS and SIFS for solving problem (P) at pairs of parameter values. We can see that SIFS leads to significant speedups, that is, up to times. Taking syn2 for example, without SIFS, the solver takes more than two hours to solve problem (P) at pairs of parameter values. However, combined with SIFS, the solver only needs less than three minutes for solving the same set of problems. From the theoretical analysis in  for Accelerated-Prox-SDCA, we can see that its computational complexity rises proportionately to the sample size and the feature dimension . From this theoretical result, we can see that the results in Figure 1 are roughly consistent with the speedups we achieved shown in Table 1.
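The link between scaling ratio and speedup can be made explicit with a back-of-the-envelope model. It assumes that solver cost is proportional to the product of sample size and feature dimension, as stated above, and that the scaling ratio is the fraction of that product removed by screening (an assumed definition). The optional `screening_overhead_frac`, the cost of screening relative to a full solve, is a hypothetical parameter added for illustration.

```python
def predicted_speedup(scaling_ratio, screening_overhead_frac=0.0):
    """Speedup if cost scales with the remaining share of the n * p entries."""
    remaining = 1.0 - scaling_ratio
    return 1.0 / (remaining + screening_overhead_frac)
```

Under this model a scaling ratio of 0.9 predicts roughly a 10-fold speedup when screening is free, and about 5-fold if screening itself costs a tenth of a full solve, which is why observed speedups only roughly track the scaling ratios.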
In this experiment, we evaluate the performance of SIFS on 5 large-scale real datasets: real-sim, rcv1-train, rcv1-test, url, and kddb, all collected from the project page of LibSVM. See Table 2 for a brief summary. Note that the kddb dataset has about 20 million samples and 30 million features.
| Dataset | Feature size | Sample size |
Recall that SIFS detects the inactive features and samples in a static manner, i.e., we perform SIFS only once before the optimization, and thus the size of the problem we need to optimize is fixed. In contrast, the method of Shibagaki et al. detects inactive features and samples in a dynamic manner, i.e., it runs along with the optimization, and thus the size of the problem keeps decreasing during the iterative process. Comparing SIFS with their method in terms of rejection ratios is therefore inapplicable, and we instead compare the two in terms of speedup. Specifically, we compare the speedup gained by SIFS and by their method for solving problem (P) at pairs of parameter values. The code of their method is obtained from (https://github.com/husk214/s3fs).
| Dataset | Solver | Method of Shibagaki et al. + Solver | SIFS + Solver |
Fig. 3 shows the rejection ratios of SIFS on the real-sim dataset (other results are in the supplementary material). We can see that some inactive features and samples are identified in the 2nd and 3rd triggerings of ISS and IFS, which verifies the necessity of applying ISS and IFS alternately. SIFS is efficient since it always stops within three triggerings. In addition, most of the inactive features can be identified in the 1st triggering of IFS, while identifying the inactive samples requires applying ISS two or more times. There may be two reasons: 1) we run ISS first, which reinforces the capability of IFS due to the synergy effect (see Sections 4.1 and 4.2; Section A.12.1 in the supplementary material provides further verification); 2) feature screening here may be easier than sample screening.
Table 3 reports the running time of solver without and with the method in  and SIFS for solving problem (P) at pairs of parameter values on real datasets. The speedup gained by SIFS is up to times on real-sim, rcv1-train and rcv1-test. Moreover, SIFS significantly outperforms the method in  in terms of speedup—by about to times faster on the aforementioned three datasets. For datasets url and kddb, we do not report the results of the solver as the sizes of the datasets are huge and the computational cost is prohibitive. Instead, we can see that the solver with SIFS is about times faster than the solver with the method in  on both datasets url and kddb. Take the dataset kddb as an example. The solver with SIFS takes about hours to solve problem (P) for all pairs of parameter values, while the solver with the method in  needs days to finish the same task.
In this paper, we develop a novel data reduction method SIFS to simultaneously identify inactive features and samples for sparse SVM. Our major contribution is a novel framework for an accurate estimation of the primal and dual optima based on strong convexity. To the best of our knowledge, the proposed SIFS is the first static screening method that is able to simultaneously identify inactive features and samples for sparse SVMs. An appealing feature of SIFS is that all detected features and samples are guaranteed to be irrelevant to the outputs. Thus, the model learned on the reduced data is identical to the one learned on the full data. Experiments on both synthetic and real datasets demonstrate that SIFS can dramatically reduce the problem size and the resulting speedup can be orders of magnitude. We plan to generalize SIFS to more complicated models, e.g., SVM with a structured sparsity-inducing penalty.
This work was supported by the National Basic Research Program of China (973 Program) under Grant 2013CB336500, National Natural Science Foundation of China under Grant 61233011 and National Youth Top-notch Talent Support Program.
The Journal of Machine Learning Research, 3:1229–1243, 2003.
Safe feature elimination in sparse supervised learning. Pacific Journal of Optimization, 8:667–698, 2012.
In this appendix, we first present the detailed proofs of all the theorems in the main text and then report the remaining experimental results, which were omitted from the experiment section due to space limitations.