Scaling Up Sparse Support Vector Machines by Simultaneous Feature and Sample Reduction

07/24/2016, by Weizhong Zhang et al.

Sparse support vector machine (SVM) is a popular classification technique that can simultaneously learn a small set of the most interpretable features and identify the support vectors. It has achieved great successes in many real-world applications. However, for large-scale problems involving a huge number of samples and extremely high-dimensional features, solving sparse SVMs remains challenging. By noting that sparse SVMs induce sparsities in both feature and sample spaces, we propose a novel approach, based on accurate estimations of the primal and dual optima of sparse SVMs, to simultaneously identify the features and samples that are guaranteed to be irrelevant to the outputs. Thus, we can remove the identified inactive samples and features from the training phase, leading to substantial savings in both memory usage and computational cost without sacrificing accuracy. To the best of our knowledge, the proposed method is the first static feature and sample reduction method for sparse SVMs. Experiments on both synthetic and real datasets (e.g., the kddb dataset with about 20 million samples and 30 million features) demonstrate that our approach significantly outperforms state-of-the-art methods and that the speedup gained by our approach can be orders of magnitude.

1 Introduction

Sparse support vector machines (SVMs) [1, 22] are a powerful technique that can simultaneously perform classification by margin maximization and variable selection via the $\ell_1$-norm penalty. The last few years have witnessed many successful applications of sparse SVMs, such as text mining [10, 24], bioinformatics [13], and image processing [12, 11]. Many algorithms [7, 6, 3, 9, 16] have been proposed to efficiently solve sparse SVM problems. However, applying sparse SVMs to large-scale learning problems, which involve a huge number of samples and extremely high-dimensional features, remains challenging.

An emerging technique called screening [5] has been shown to be promising in accelerating large-scale sparse learning. The essential idea of screening is to quickly identify the zero coefficients in the sparse solutions without solving any optimization problem, so that the corresponding features or samples—called inactive features or samples—can be removed from the training phase. Then, we only need to perform optimization on the reduced datasets instead of the full datasets, leading to substantial savings in computational cost and memory usage. Here, we need to emphasize that screening differs greatly from feature selection methods, although they look similar at first glance. To be precise, screening is devoted to accelerating the training of many sparse models, including Lasso, sparse SVM, etc., while feature selection is the goal of these models. In the past few years, many screening methods have been proposed for a large set of sparse learning techniques, such as Lasso [19, 23, 21], group Lasso [14], $\ell_1$-regularized logistic regression [20], and SVM [15]. Empirical studies indicate that screening methods can lead to orders-of-magnitude speedup in computation time.

However, most existing screening methods handle either feature screening or sample screening individually [18], and the two are suited to very different scenarios. Specifically, to achieve better performance (say, in terms of speedup), we favor feature screening methods when the number of features $p$ is much larger than the number of samples $n$, while sample screening methods are preferable when $n \gg p$. There is, however, another class of sparse learning techniques, such as sparse SVMs, that induce sparsity in both the feature and sample spaces. None of the aforementioned screening methods can accelerate the training of these models when both $n$ and $p$ are large. Nor can we address this problem by simply combining existing feature and sample screening methods: because they are specifically designed for different sparse models, such a combination could mistakenly discard relevant data. Recently, Shibagaki et al. [18] considered this problem and proposed a method that simultaneously identifies inactive features and samples in a dynamic manner [2]; that is, during the optimization process, they trigger their testing rule whenever there is a sufficient decrease in the duality gap. Thus, the method in [18] can discard more inactive features and samples as the optimization proceeds, and only small-scale problems remain in the late stage of the optimization. Nevertheless, the overall speedup can be limited because the problem size remains large in the early stage of the optimization. To be specific, the method in [18] depends heavily on the duality gap during the optimization process. The duality gap in the early stage is typically large, which makes the primal and dual estimations inaccurate and in turn renders the screening rules ineffective. Hence, the method essentially still solves a large problem in the early stage.

In this paper, to address the limitations of the dynamic screening method, we propose a novel screening method that can Simultaneously identify Inactive Features and Samples (SIFS) for sparse SVMs in a static manner; that is, we perform SIFS only once, before (instead of during) optimization. Thus, we only need to run the optimization algorithm on small-scale problems. The major technical challenge in developing SIFS is that we need to accurately estimate the primal and dual optima: the more accurate the estimations are, the more effective SIFS is in detecting inactive features and samples. Our major technical contribution is therefore a novel framework, based on the strong convexity of the primal and dual problems of sparse SVMs [see problems (P) and (D) in Section 2], for deriving accurate estimations of the primal and dual optima (see Section 3). Another appealing feature of SIFS is the so-called synergy effect [18]. Specifically, SIFS consists of two parts, namely Inactive Feature Screening (IFS) and Inactive Sample Screening (ISS). We show that discarding the inactive features (samples) identified by IFS (ISS) leads to a more accurate estimation of the primal (dual) optimum, which in turn dramatically enhances the capability of ISS (IFS) in detecting inactive samples (features). Thus, SIFS applies IFS and ISS in an alternating manner until no more inactive features or samples can be identified, leading to much better performance in scaling up large-scale problems than applying ISS or IFS individually. Moreover, SIFS (see Section 4) is safe in the sense that the detected features and samples are guaranteed to be absent from the sparse representations. To the best of our knowledge, SIFS is the first static screening rule for sparse SVMs that is able to simultaneously detect inactive features and samples. Experiments (see Section 5) on both synthetic and real datasets demonstrate that SIFS significantly outperforms the state-of-the-art method [18] in improving the efficiency of sparse SVMs, and the speedup can be orders of magnitude. Detailed proofs of the theoretical results in the main text are given in the supplementary material.

Notations: Let $\|\cdot\|_1$, $\|\cdot\|$, and $\|\cdot\|_\infty$ be the $\ell_1$, $\ell_2$, and $\ell_\infty$ norms, respectively. We denote the inner product of vectors $\mathbf{x}$ and $\mathbf{y}$ by $\langle\mathbf{x},\mathbf{y}\rangle$, and the $i$-th component of $\mathbf{x}$ by $[\mathbf{x}]_i$. Let $[n]=\{1,2,\ldots,n\}$ for a positive integer $n$. Given a subset $\mathcal{S}$ of $[n]$, let $|\mathcal{S}|$ be the cardinality of $\mathcal{S}$. For a vector $\mathbf{x}$, let $[\mathbf{x}]_{\mathcal{S}}$ be the subvector whose components are indexed by $\mathcal{S}$. For a matrix $\mathbf{X}$, let $[\mathbf{X}]^i$ and $[\mathbf{X}]_j$ be the $i$-th row and the $j$-th column of $\mathbf{X}$, respectively. For a scalar $t$, we denote $[t]_+=\max\{0,t\}$.

2 Basics and Motivations

In this section, we briefly review some basics of sparse SVMs and then motivate SIFS via the KKT conditions. Specifically, we focus on the $\ell_1$-regularized SVM with a smoothed hinge loss, which has strong theoretical guarantees [17] and takes the form of

$$\min_{\mathbf{w}\in\mathbb{R}^p}\; P(\mathbf{w};\alpha,\beta)=\frac{1}{n}\sum_{i=1}^{n}\ell\big(1-\langle\bar{\mathbf{x}}_i,\mathbf{w}\rangle\big)+\frac{\beta}{2}\|\mathbf{w}\|^2+\alpha\|\mathbf{w}\|_1, \tag{P}$$

where $\mathbf{w}\in\mathbb{R}^p$ is the parameter vector to be estimated, $\{(\mathbf{x}_i,y_i)\}_{i=1}^{n}$ is the training set, $\mathbf{x}_i\in\mathbb{R}^p$, $y_i\in\{-1,+1\}$, $\bar{\mathbf{x}}_i=y_i\mathbf{x}_i$, $\alpha$ and $\beta$ are positive parameters, and the loss function $\ell(\cdot)$ is

$$\ell(t)=\begin{cases}0, & \text{if } t<0,\\ t-\tfrac{\gamma}{2}, & \text{if } t>\gamma,\\ \tfrac{t^2}{2\gamma}, & \text{if } 0\le t\le\gamma,\end{cases}$$

where $\gamma\in(0,1)$. We present the Lagrangian dual problem of problem (P) and the KKT conditions in the following theorem, which plays a fundamentally important role in developing our screening rule.

Theorem 1.

Let $\bar{\mathbf{X}}=(\bar{\mathbf{x}}_1,\ldots,\bar{\mathbf{x}}_n)\in\mathbb{R}^{p\times n}$ and $\mathcal{S}_\alpha(\cdot)$ be the soft-thresholding operator [8], i.e., $[\mathcal{S}_\alpha(\mathbf{u})]_i=\mathrm{sign}([\mathbf{u}]_i)\big[|[\mathbf{u}]_i|-\alpha\big]_+$. Then, for problem (P), the following holds:
The dual problem of (P) is

$$\min_{\theta\in[0,1]^n}\; D(\theta;\alpha,\beta)=\frac{1}{2\beta}\Big\|\mathcal{S}_\alpha\Big(\frac{1}{n}\bar{\mathbf{X}}\theta\Big)\Big\|^2+\frac{\gamma}{2n}\|\theta\|^2-\frac{1}{n}\langle\mathbf{1},\theta\rangle, \tag{D}$$

where $\mathbf{1}\in\mathbb{R}^n$ is the vector with all components equal to 1.
Denote the optima of (P) and (D) by $\mathbf{w}^*(\alpha,\beta)$ and $\theta^*(\alpha,\beta)$, respectively. Then,

$$\mathbf{w}^*(\alpha,\beta)=\frac{1}{\beta}\,\mathcal{S}_\alpha\Big(\frac{1}{n}\bar{\mathbf{X}}\theta^*(\alpha,\beta)\Big), \tag{KKT-1}$$
$$[\theta^*(\alpha,\beta)]_i=\begin{cases}0, & \text{if } 1-\langle\bar{\mathbf{x}}_i,\mathbf{w}^*(\alpha,\beta)\rangle<0,\\ 1, & \text{if } 1-\langle\bar{\mathbf{x}}_i,\mathbf{w}^*(\alpha,\beta)\rangle>\gamma,\\ \tfrac{1}{\gamma}\big(1-\langle\bar{\mathbf{x}}_i,\mathbf{w}^*(\alpha,\beta)\rangle\big), & \text{otherwise}. \end{cases} \tag{KKT-2}$$

According to (KKT-1) and (KKT-2), we define four index sets:

$$\mathcal{F}=\Big\{j\in[p]:\tfrac{1}{n}\big|[\bar{\mathbf{X}}\theta^*(\alpha,\beta)]_j\big|\le\alpha\Big\},$$
$$\mathcal{R}=\big\{i\in[n]:1-\langle\bar{\mathbf{x}}_i,\mathbf{w}^*(\alpha,\beta)\rangle<0\big\},\quad \mathcal{E}=\big\{i\in[n]:0\le 1-\langle\bar{\mathbf{x}}_i,\mathbf{w}^*(\alpha,\beta)\rangle\le\gamma\big\},\quad \mathcal{L}=\big\{i\in[n]:1-\langle\bar{\mathbf{x}}_i,\mathbf{w}^*(\alpha,\beta)\rangle>\gamma\big\},$$

which imply that

$$j\in\mathcal{F}\Rightarrow[\mathbf{w}^*(\alpha,\beta)]_j=0,\qquad i\in\mathcal{R}\Rightarrow[\theta^*(\alpha,\beta)]_i=0,\qquad i\in\mathcal{L}\Rightarrow[\theta^*(\alpha,\beta)]_i=1. \tag{R}$$

Thus, we call the $j$-th feature inactive if $j\in\mathcal{F}$. The samples in $\mathcal{E}$ are the so-called support vectors, and we call the samples in $\mathcal{R}$ and $\mathcal{L}$ inactive samples.
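
To make the rule (R) concrete, the following is a minimal NumPy sketch of the soft-thresholding operator and the KKT-based partition, assuming the reconstructed forms of (KKT-1) and (KKT-2) given above; the function names and the $p\times n$ storage convention for $\bar{\mathbf{X}}$ are illustrative choices, not the authors' implementation.

```python
import numpy as np

def soft_threshold(u, alpha):
    """Soft-thresholding operator: [S_alpha(u)]_i = sign(u_i) * max(|u_i| - alpha, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - alpha, 0.0)

def kkt_index_sets(Xbar, w_opt, theta_opt, alpha, gamma):
    """Partition features and samples via the KKT conditions as reconstructed above.

    Xbar is the (p, n) matrix whose i-th column is y_i * x_i.
    Returns the inactive-feature set F and the sample sets R, E (support vectors), L.
    """
    n = Xbar.shape[1]
    margins = 1.0 - Xbar.T @ w_opt                              # 1 - <xbar_i, w*>
    F = np.where(np.abs(Xbar @ theta_opt) / n <= alpha)[0]      # -> [w*]_j = 0
    R = np.where(margins < 0.0)[0]                              # -> [theta*]_i = 0
    E = np.where((margins >= 0.0) & (margins <= gamma))[0]      # support vectors
    L = np.where(margins > gamma)[0]                            # -> [theta*]_i = 1
    return F, R, E, L
```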

Suppose that we are given subsets $\hat{\mathcal{F}}\subseteq\mathcal{F}$, $\hat{\mathcal{R}}\subseteq\mathcal{R}$, and $\hat{\mathcal{L}}\subseteq\mathcal{L}$. Then, by the rule (R), many coefficients of $\mathbf{w}^*(\alpha,\beta)$ and $\theta^*(\alpha,\beta)$ are already known. Thus, we may have far fewer unknowns to solve for, and the problem size can be dramatically reduced. We formalize this idea in Lemma 1.

Lemma 1.

Given index sets $\hat{\mathcal{F}}\subseteq\mathcal{F}$, $\hat{\mathcal{R}}\subseteq\mathcal{R}$, and $\hat{\mathcal{L}}\subseteq\mathcal{L}$, the following holds:
(i) $[\mathbf{w}^*(\alpha,\beta)]_{\hat{\mathcal{F}}}=\mathbf{0}$, $[\theta^*(\alpha,\beta)]_{\hat{\mathcal{R}}}=\mathbf{0}$, and $[\theta^*(\alpha,\beta)]_{\hat{\mathcal{L}}}=\mathbf{1}$.
(ii) Let $\hat{\mathcal{G}}=[n]\setminus(\hat{\mathcal{R}}\cup\hat{\mathcal{L}})$ be the set of samples whose dual coordinates remain unknown. Then $[\theta^*(\alpha,\beta)]_{\hat{\mathcal{G}}}$ solves the following scaled dual problem, obtained from (D) by fixing the known dual coordinates and removing the features in $\hat{\mathcal{F}}$:

(scaled-D)

(iii) Suppose that $[\theta^*(\alpha,\beta)]_{\hat{\mathcal{G}}}$ is known. Then $\mathbf{w}^*(\alpha,\beta)$ can be recovered in closed form from (KKT-1).

Lemma 1 indicates that, if we can identify index sets $\hat{\mathcal{F}}$, $\hat{\mathcal{R}}$, and $\hat{\mathcal{L}}$ such that the numbers of remaining features and samples, $p-|\hat{\mathcal{F}}|$ and $n-|\hat{\mathcal{R}}\cup\hat{\mathcal{L}}|$, are much smaller than the feature dimension $p$ and the dataset size $n$, then we only need to solve a problem (scaled-D) that may be much smaller than problem (D) in order to exactly recover the optima $\mathbf{w}^*(\alpha,\beta)$ and $\theta^*(\alpha,\beta)$ without sacrificing any accuracy.
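
The payoff of Lemma 1 is purely computational: once some indices in $\mathcal{F}$, $\mathcal{R}$, and $\mathcal{L}$ are known, the corresponding rows and columns of $\bar{\mathbf{X}}$ can simply be dropped before any solver is called. Below is a minimal sketch of this reduction step; the helper name and interface are hypothetical and only illustrate the bookkeeping, not the exact form of (scaled-D).

```python
import numpy as np

def reduce_problem(Xbar, F_hat, R_hat, L_hat):
    """Drop identified inactive features (rows of Xbar) and samples whose dual
    variables are already known (columns indexed by R_hat and L_hat).

    Returns the reduced matrix plus the index maps needed to scatter the small
    solution back into the full-length (w*, theta*).  The fixed contribution of
    the samples in L_hat (theta = 1) to the scaled dual objective is left to the
    downstream solver.
    """
    p, n = Xbar.shape
    keep_feat = np.setdiff1d(np.arange(p), F_hat)
    keep_samp = np.setdiff1d(np.arange(n), np.union1d(R_hat, L_hat))
    return Xbar[np.ix_(keep_feat, keep_samp)], keep_feat, keep_samp
```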

However, we cannot directly apply the rules in (R) to identify subsets of $\mathcal{F}$, $\mathcal{R}$, and $\mathcal{L}$, as they require the knowledge of $\mathbf{w}^*(\alpha,\beta)$ and $\theta^*(\alpha,\beta)$, which is usually unavailable. Inspired by the idea in [5], we can first estimate regions $\mathcal{W}$ and $\Theta$ that contain $\mathbf{w}^*(\alpha,\beta)$ and $\theta^*(\alpha,\beta)$, respectively. Then, by denoting

$$s_j=\max_{\theta\in\Theta}\tfrac{1}{n}\big|\langle[\bar{\mathbf{X}}]^j,\theta\rangle\big|, \tag{1}$$
$$u_i=\max_{\mathbf{w}\in\mathcal{W}}\;1-\langle\bar{\mathbf{x}}_i,\mathbf{w}\rangle, \tag{2}$$
$$l_i=\min_{\mathbf{w}\in\mathcal{W}}\;1-\langle\bar{\mathbf{x}}_i,\mathbf{w}\rangle, \tag{3}$$

since $\mathbf{w}^*(\alpha,\beta)\in\mathcal{W}$ and $\theta^*(\alpha,\beta)\in\Theta$, the rules in (R) can be relaxed as follows:

$$s_j\le\alpha\;\Rightarrow\;[\mathbf{w}^*(\alpha,\beta)]_j=0, \tag{R1}$$
$$u_i<0\;\Rightarrow\;[\theta^*(\alpha,\beta)]_i=0,\qquad l_i>\gamma\;\Rightarrow\;[\theta^*(\alpha,\beta)]_i=1. \tag{R2}$$

In view of R1 and R2, we sketch the development of SIFS as follows.

  1. Derive estimations $\mathcal{W}$ and $\Theta$ such that $\mathbf{w}^*(\alpha,\beta)\in\mathcal{W}$ and $\theta^*(\alpha,\beta)\in\Theta$, respectively.

  2. Develop SIFS by deriving the relaxed screening rules (R1) and (R2), i.e., by solving the optimization problems in Eqs. (1), (2), and (3).

3 Estimate the Primal and Dual Optima

In this section, we first show that the primal and dual optima admit closed-form solutions for specific values of $\alpha$ and $\beta$ (see Section 3.1). Then, in Sections 3.2 and 3.3, we present accurate estimations of the primal and dual optima, respectively.

3.1 Effective Intervals of the Parameters $\alpha$ and $\beta$

We first show that, if the value of $\alpha$ is sufficiently large, then no matter what $\beta$ is, the primal solution is $\mathbf{0}$.

Theorem 2.

Let $\alpha_{\max}=\frac{1}{n}\|\bar{\mathbf{X}}\mathbf{1}\|_\infty$. Then, for $\alpha\ge\alpha_{\max}$ and $\beta>0$, we have $\mathbf{w}^*(\alpha,\beta)=\mathbf{0}$ and $\theta^*(\alpha,\beta)=\mathbf{1}$.

For any $\alpha\in(0,\alpha_{\max}]$, the next result shows that, if $\beta$ is large enough, the primal and dual optima admit closed-form solutions.

Theorem 3.

If we denote

$$\beta_{\max}(\alpha)=\frac{1}{1-\gamma}\max_{i\in[n]}\Big\langle\bar{\mathbf{x}}_i,\;\mathcal{S}_\alpha\Big(\frac{1}{n}\bar{\mathbf{X}}\mathbf{1}\Big)\Big\rangle,$$

then for all $\beta\ge\beta_{\max}(\alpha)$, we have

$$\mathbf{w}^*(\alpha,\beta)=\frac{1}{\beta}\,\mathcal{S}_\alpha\Big(\frac{1}{n}\bar{\mathbf{X}}\mathbf{1}\Big),\qquad \theta^*(\alpha,\beta)=\mathbf{1}. \tag{4}$$

By Theorems 2 and 3, we only need to consider the cases with $\alpha\in(0,\alpha_{\max})$ and $\beta\in(0,\beta_{\max}(\alpha))$.
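
These thresholds give SIFS a free, closed-form starting point. The following is a minimal NumPy sketch, assuming the reconstructed forms of $\alpha_{\max}$ and Eq. (4) stated above (the exact expressions in the original theorems may differ slightly), with $\bar{\mathbf{X}}$ stored as a $p\times n$ array:

```python
import numpy as np

def soft_threshold(u, alpha):
    return np.sign(u) * np.maximum(np.abs(u) - alpha, 0.0)

def alpha_max(Xbar):
    """Smallest alpha that forces w* = 0 for every beta (reconstructed Theorem 2)."""
    n = Xbar.shape[1]
    return np.abs(Xbar @ np.ones(n)).max() / n

def reference_solution(Xbar, alpha, beta):
    """Closed-form (w*, theta*) for beta >= beta_max(alpha) (reconstructed Eq. (4))."""
    n = Xbar.shape[1]
    w = soft_threshold(Xbar @ np.ones(n) / n, alpha) / beta
    theta = np.ones(n)
    return w, theta
```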

3.2 Primal Optimum Estimation

In Section 1, we mention that the proposed SIFS consists of IFS and ISS, and an alternating application of IFS and ISS can improve the estimation of the primal and dual optima, which can in turn make ISS and IFS more effective in identifying inactive samples and features, respectively. Lemma 2 shows that discarding inactive features by IFS leads to a more accurate estimation of the primal optimum.

Lemma 2.

Suppose that the reference solution pair $\big(\mathbf{w}^*(\alpha_0,\beta_0),\theta^*(\alpha_0,\beta_0)\big)$ at parameter values $(\alpha_0,\beta_0)$ is known. Consider problem (P) with parameters $\alpha$ and $\beta$. Let $\hat{\mathcal{F}}$ be the index set of the inactive features identified by the previous IFS steps, i.e., $[\mathbf{w}^*(\alpha,\beta)]_{\hat{\mathcal{F}}}=\mathbf{0}$. We define

(5)
(6)

Then, the following holds:

(7)

As $\hat{\mathcal{F}}$ is the index set of identified inactive features, we have $[\mathbf{w}^*(\alpha,\beta)]_{\hat{\mathcal{F}}}=\mathbf{0}$. Hence, we only need an accurate estimation of $[\mathbf{w}^*(\alpha,\beta)]_{\hat{\mathcal{F}}^c}$, the components outside $\hat{\mathcal{F}}$. Lemma 2 shows that $[\mathbf{w}^*(\alpha,\beta)]_{\hat{\mathcal{F}}^c}$ lies in a ball whose center and radius are given by Eqs. (5) and (6). Note that, before we perform IFS, the set $\hat{\mathcal{F}}$ is empty and thus the second term on the right-hand side (RHS) of Eq. (6) is zero. If we apply IFS multiple times (alternating with ISS), the set $\hat{\mathcal{F}}$ will be monotonically increasing. Thus, Eq. (6) implies that the radius will be monotonically decreasing, leading to a more accurate primal optimum estimation.

3.3 Dual Optimum Estimation

Similar to Lemma 2, the next result shows that ISS can improve the estimation of the dual optimum.

Lemma 3.

Suppose that the reference solution pair $\big(\mathbf{w}^*(\alpha_0,\beta_0),\theta^*(\alpha_0,\beta_0)\big)$ at parameter values $(\alpha_0,\beta_0)$ is known. Consider problem (D) with parameters $\alpha$ and $\beta$. Let $\hat{\mathcal{R}}$ and $\hat{\mathcal{L}}$ be the index sets of inactive samples identified by the previous ISS steps, i.e., $[\theta^*(\alpha,\beta)]_{\hat{\mathcal{R}}}=\mathbf{0}$ and $[\theta^*(\alpha,\beta)]_{\hat{\mathcal{L}}}=\mathbf{1}$. We define

(8)
(9)

Then, the following holds:

(10)

Similar to Lemma 2, Lemma 3 also bounds the unidentified components of $\theta^*(\alpha,\beta)$ by a ball. In view of Eq. (9), a discussion similar to that of Lemma 2—that is, the index sets $\hat{\mathcal{R}}$ and $\hat{\mathcal{L}}$ monotonically increase, and thus the last two terms on the RHS of Eq. (9) monotonically increase, when we perform ISS multiple times (alternating with IFS)—implies that the ISS steps can reduce the radius and thus improve the dual optimum estimation.

Remark 1.

To estimate $\mathbf{w}^*(\alpha,\beta)$ and $\theta^*(\alpha,\beta)$ by Lemmas 2 and 3, we always have a free reference solution pair $\mathbf{w}^*(\alpha_0,\beta_0)$ and $\theta^*(\alpha_0,\beta_0)$ with $\beta_0\ge\beta_{\max}(\alpha_0)$. From Theorems 2 and 3, we know that in this setting $\mathbf{w}^*(\alpha_0,\beta_0)$ and $\theta^*(\alpha_0,\beta_0)$ admit closed-form solutions.

4 The Proposed SIFS Screening Rule

We first present the IFS and ISS rules in Sections 4.1 and 4.2, respectively. Then, in Section 4.3, we develop the SIFS screening rule by an alternating application of IFS and ISS.

4.1 Inactive Feature Screening (IFS)

Suppose that $\mathbf{w}^*(\alpha_0,\beta_0)$ and $\theta^*(\alpha_0,\beta_0)$ are known. We derive IFS to identify the inactive features for problem (P) at $(\alpha,\beta)$ by solving the optimization problem in Eq. (1) (see Section E in the supplementary material):

(11)

where $\Theta$ is the dual estimation given by Eq. (10), and $\hat{\mathcal{F}}$ and $(\hat{\mathcal{R}},\hat{\mathcal{L}})$ are the index sets of inactive features and samples that have been identified in previous screening steps, respectively. The next result gives the closed-form solution of problem (11).

Lemma 4.

Consider problem (11). Let the center and radius of the dual estimation be given by Eq. (8) and Eq. (9). Then, for all $j\notin\hat{\mathcal{F}}$, the optimal value of problem (11) admits a closed form in terms of this center and radius.

We are now ready to present the IFS rule.

Theorem 4.

Consider problem (P). We suppose that $\mathbf{w}^*(\alpha_0,\beta_0)$ and $\theta^*(\alpha_0,\beta_0)$ are known. Then,

  1. The feature screening rule IFS takes the form of

    (IFS)
  2. We update the index set $\hat{\mathcal{F}}$ by

    (12)

Recall from Lemma 3 that previous sample screening results give us a tighter dual estimation, i.e., a smaller feasible region for problem (11), which results in a smaller optimal value of problem (11). This finally leads to a more powerful feature screening rule IFS. This is the so-called synergy effect.
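
In the same spirit, here is a minimal NumPy sketch of a ball-based feature test. It assumes the dual estimation of Lemma 3 is summarized by a Euclidean ball with center `theta_center` and radius `theta_radius` and uses a Cauchy–Schwarz bound; under the reconstruction of (R1) above this test is still safe, but it is looser than the exact (IFS) rule of Theorem 4, and the function name and interface are illustrative.

```python
import numpy as np

def screen_features(Xbar, theta_center, theta_radius, alpha, F_hat):
    """Ball-based feature test in the spirit of rule (IFS).

    For each not-yet-screened feature j, the quantity in Eq. (1) is bounded by
        |<[Xbar]^j, c>| / n + r * ||[Xbar]^j|| / n      (Cauchy-Schwarz),
    where (c, r) is the dual estimation ball.  If the bound is <= alpha, rule
    (R1) guarantees [w*]_j = 0.  The exact IFS bound of Lemma 4 is tighter,
    e.g., it also exploits the constraint theta in [0, 1]^n.
    """
    p, n = Xbar.shape
    candidates = np.setdiff1d(np.arange(p), F_hat)
    rows = Xbar[candidates]                      # shape (|candidates|, n)
    bound = (np.abs(rows @ theta_center) + theta_radius * np.linalg.norm(rows, axis=1)) / n
    return np.union1d(F_hat, candidates[bound <= alpha])
```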

4.2 Inactive Sample Screening (ISS)

Similar to IFS, we derive ISS to identify inactive samples by solving the optimization problems in Eq. (2) and Eq. (3) (see Section G in the supplementary material for details):

(13)
(14)

where $\mathcal{W}$ is the primal estimation given by Eq. (7), and $\hat{\mathcal{F}}$ and $(\hat{\mathcal{R}},\hat{\mathcal{L}})$ are the index sets of inactive features and samples that have been identified in previous screening steps, respectively. We show that problems (13) and (14) admit closed-form solutions.

Lemma 5.

Consider problems (13) and (14). Let the center and radius of the primal estimation be given by Eq. (5) and Eq. (6). Then, the optimal values of problems (13) and (14) admit closed forms in terms of this center and radius.

We are now ready to present the ISS rule.

Theorem 5.

Consider problem (D). We suppose that $\mathbf{w}^*(\alpha_0,\beta_0)$ and $\theta^*(\alpha_0,\beta_0)$ are known. Then,

  1. The sample screening rule ISS takes the form of

    (ISS)
  2. We update the index sets $\hat{\mathcal{R}}$ and $\hat{\mathcal{L}}$ by

    (15)
    (16)

The synergy effect also exists here. Recall from Lemma 2 that previous feature screening results lead to a smaller feasible region for problems (13) and (14), which results in a smaller optimal value of problem (13) and a larger optimal value of problem (14). This finally leads to a more accurate sample screening rule ISS.
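
As a counterpart to the feature test above, here is a minimal NumPy sketch of a ball-based sample test, assuming the primal estimation of Lemma 2 is summarized by a Euclidean ball with center `w_center` and radius `w_radius`; it is a looser but still safe variant under the reconstruction of (R2) above, not the exact (ISS) rule of Theorem 5, and the interface is illustrative.

```python
import numpy as np

def screen_samples(Xbar, w_center, w_radius, gamma, R_hat, L_hat):
    """Ball-based sample test in the spirit of rule (ISS).

    For each not-yet-screened sample i, the margin 1 - <xbar_i, w*> is bracketed
    over the primal ball (center c, radius r) by
        1 - <xbar_i, c> - r*||xbar_i||  <=  1 - <xbar_i, w*>  <=  1 - <xbar_i, c> + r*||xbar_i||.
    If the upper bound is < 0, rule (R2) gives theta*_i = 0 (add i to R_hat);
    if the lower bound is > gamma, it gives theta*_i = 1 (add i to L_hat).
    The exact ISS rule uses the tighter bounds of Lemma 5.
    """
    n = Xbar.shape[1]
    candidates = np.setdiff1d(np.arange(n), np.union1d(R_hat, L_hat))
    cols = Xbar[:, candidates]                       # shape (p, |candidates|)
    center_margin = 1.0 - cols.T @ w_center
    slack = w_radius * np.linalg.norm(cols, axis=0)
    new_R = candidates[center_margin + slack < 0.0]
    new_L = candidates[center_margin - slack > gamma]
    return np.union1d(R_hat, new_R), np.union1d(L_hat, new_L)
```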

4.3 The Proposed SIFS Rule by An Alternating Application of IFS and ISS

In real applications, the optimal values of the parameters $\alpha$ and $\beta$ are usually unknown. To determine appropriate parameter values, common approaches, such as cross validation and stability selection, need to solve the model over a grid of parameter values $\{\alpha_i\}_{i=1}^{A}$ and $\{\beta_j\}_{j=1}^{B}$. This can be very time-consuming. Inspired by Strong Rule [19] and SAFE [5], we develop a sequential version of SIFS in Algorithm 1.

1:  Input: $\{\alpha_i\}_{i=1}^{A}$ and, for each $\alpha_i$, a decreasing sequence $\beta_0^{(i)}=\beta_{\max}(\alpha_i)>\beta_1^{(i)}>\cdots>\beta_B^{(i)}$.
2:  for $i=1$ to $A$ do
3:     Compute the first reference solution $\mathbf{w}^*(\alpha_i,\beta_0^{(i)})$ and $\theta^*(\alpha_i,\beta_0^{(i)})$ using the closed-form formula (4).
4:     for $j=1$ to $B$ do
5:        Initialization: $\hat{\mathcal{F}}=\hat{\mathcal{R}}=\hat{\mathcal{L}}=\emptyset$.
6:        repeat
7:           Run sample screening using rule ISS based on $\big(\mathbf{w}^*(\alpha_i,\beta_{j-1}^{(i)}),\theta^*(\alpha_i,\beta_{j-1}^{(i)})\big)$.
8:           Update $\hat{\mathcal{R}}$ and $\hat{\mathcal{L}}$ by Eq. (15) and Eq. (16), respectively.
9:           Run feature screening using rule IFS based on $\big(\mathbf{w}^*(\alpha_i,\beta_{j-1}^{(i)}),\theta^*(\alpha_i,\beta_{j-1}^{(i)})\big)$.
10:           Update $\hat{\mathcal{F}}$ by Eq. (12).
11:        until No new inactive features or samples are identified
12:        Compute $\mathbf{w}^*(\alpha_i,\beta_j^{(i)})$ and $\theta^*(\alpha_i,\beta_j^{(i)})$ by solving the scaled problem (scaled-D).
13:     end for
14:  end for
15:  Output: $\{\mathbf{w}^*(\alpha_i,\beta_j^{(i)})\}$ and $\{\theta^*(\alpha_i,\beta_j^{(i)})\}$ for $i\in[A]$ and $j\in[B]$.
Algorithm 1 SIFS

Specifically, given the primal and dual optima $\mathbf{w}^*(\alpha_i,\beta_{j-1}^{(i)})$ and $\theta^*(\alpha_i,\beta_{j-1}^{(i)})$ at $(\alpha_i,\beta_{j-1}^{(i)})$, we apply SIFS to identify the inactive features and samples for problem (P) at $(\alpha_i,\beta_j^{(i)})$. Then, we perform optimization on the reduced dataset and obtain the primal and dual optima at $(\alpha_i,\beta_j^{(i)})$. We repeat this process until we have solved problem (P) at all pairs of parameter values.

Note that we insert $\beta_{\max}(\alpha_i)$ into every sequence of $\beta$ values (see line 1 in Algorithm 1) so that a closed-form solution is available as the first reference solution. In this way, we avoid solving the problem at the first parameter pair directly (without screening), which is time-consuming. At last, we would like to point out that the parameter values in SIFS can be specified by users arbitrarily.
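
The following Python sketch mirrors the control flow of Algorithm 1. It relies on the hypothetical helpers sketched in Sections 3 and 4 (`reference_solution`, `screen_samples`, `screen_features`) and on caller-supplied hooks for the estimation balls of Lemmas 2 and 3 and for the reduced solver; it illustrates the alternation and bookkeeping only, not the authors' C++ implementation.

```python
import numpy as np

def sifs_path(Xbar, alphas, betas_for, primal_ball, dual_ball, solve_reduced, gamma):
    """Driver mirroring the structure of Algorithm 1 (ISS applied first).

    betas_for(alpha) must start at a beta large enough that the closed-form
    reference of Theorem 3 applies.  primal_ball / dual_ball are caller-supplied
    hooks returning (center, radius) per Lemmas 2 and 3 (their formulas are not
    reproduced here); solve_reduced solves the scaled problem of Lemma 1 and
    scatters the known entries back into full-length (w, theta).
    """
    solutions = {}
    for alpha in alphas:
        betas = list(betas_for(alpha))
        w_ref, theta_ref = reference_solution(Xbar, alpha, betas[0])   # closed form, Eq. (4)
        for beta in betas[1:]:
            F = np.array([], dtype=int)
            R = np.array([], dtype=int)
            L = np.array([], dtype=int)
            while True:                                  # alternate ISS and IFS
                sizes = (F.size, R.size, L.size)
                w_c, w_r = primal_ball(w_ref, theta_ref, alpha, beta, F)
                R, L = screen_samples(Xbar, w_c, w_r, gamma, R, L)          # ISS
                t_c, t_r = dual_ball(w_ref, theta_ref, alpha, beta, R, L)
                F = screen_features(Xbar, t_c, t_r, alpha, F)               # IFS
                if (F.size, R.size, L.size) == sizes:    # nothing new identified
                    break
            w_ref, theta_ref = solve_reduced(Xbar, F, R, L, alpha, beta)
            solutions[(alpha, beta)] = (w_ref, theta_ref)
    return solutions
```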

SIFS applies ISS and IFS in an alternating manner to reinforce their capability in identifying inactive samples and features. In Algorithm 1, we apply ISS first. Of course, we can also apply IFS first. The theorem below demonstrates that the order has no impact on the performance of SIFS.

Theorem 6.

Given the optimal solutions $\mathbf{w}^*(\alpha_0,\beta_0)$ and $\theta^*(\alpha_0,\beta_0)$ at $(\alpha_0,\beta_0)$ as the reference solution pair at $(\alpha,\beta)$ for SIFS, suppose that SIFS with ISS applied first stops after triggering ISS and IFS $k_1$ times and denote the identified inactive features and samples by $\hat{\mathcal{F}}_1$, $\hat{\mathcal{R}}_1$, and $\hat{\mathcal{L}}_1$. Similarly, when we apply IFS first, denote the corresponding quantities by $k_2$, $\hat{\mathcal{F}}_2$, $\hat{\mathcal{R}}_2$, and $\hat{\mathcal{L}}_2$. Then, the following holds:
(1) $\hat{\mathcal{F}}_1=\hat{\mathcal{F}}_2$, $\hat{\mathcal{R}}_1=\hat{\mathcal{R}}_2$, and $\hat{\mathcal{L}}_1=\hat{\mathcal{L}}_2$.
(2) With different orders of applying ISS and IFS, the numbers of times ISS and IFS need to be applied in SIFS can never differ by more than 1, that is, $|k_1-k_2|\le 1$.

Remark 2.

From Remark 1, we can see that our SIFS can also be applied to solve a single problem, due to the existence of the free reference solution pair.

5 Experiments

We evaluate SIFS on both synthetic and real datasets in terms of three measures. The first one is the scaling ratio, defined as $1-\frac{(n-\tilde{n})(p-\tilde{p})}{np}$, where $\tilde{n}$, $\tilde{p}$, $n$, and $p$ are the number of inactive samples identified by SIFS, the number of inactive features identified by SIFS, the sample size, and the feature dimension of the dataset, respectively. The second measure is the rejection ratio of each triggering of ISS and IFS in SIFS, i.e., $\tilde{n}_k/\tilde{n}^*$ and $\tilde{p}_k/\tilde{p}^*$, where $\tilde{n}_k$ and $\tilde{p}_k$ are the numbers of inactive samples and features identified in the $k$-th triggering of ISS and IFS, and $\tilde{n}^*$ and $\tilde{p}^*$ are the numbers of inactive samples and features in the solution. The third measure is speedup, i.e., the ratio of the running time of the solver without screening to that with screening.

Recall that we can integrate SIFS with any solver for problem (P). In this experiment, we use Accelerated Proximal Stochastic Dual Coordinate Ascent (Accelerated-Prox-SDCA) [17], as it is one of the state-of-the-art solvers. As mentioned in the introduction, screening differs greatly from feature selection, so it is not appropriate to compare against feature selection methods. We therefore choose the state-of-the-art screening method for sparse SVMs in [18] as the baseline in our experiments.

For each dataset, we solve problem (P) over a grid of tuning parameter values. Specifically, we first compute $\alpha_{\max}$ by Theorem 2 and then select 10 values of $\alpha$ that are equally spaced on the logarithmic scale of $\alpha/\alpha_{\max}$. Then, for each value of $\alpha$, we first compute $\beta_{\max}(\alpha)$ by Theorem 3 and then select a sequence of values of $\beta$ that are equally spaced on the logarithmic scale of $\beta/\beta_{\max}(\alpha)$. Thus, for each dataset, we solve problem (P) over the resulting grid of $(\alpha,\beta)$ pairs. We write the code in C++ along with the Eigen library for the numerical computations. We perform all the computations on a single core of an Intel(R) Core(TM) i7-5930K 3.50GHz machine with 128GB of memory.
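
For concreteness, a grid of the kind just described can be generated as in the sketch below; the value 10 for the number of $\alpha$'s comes from the text, while the number of $\beta$'s and the lower endpoint of the log scale are placeholders, since the exact values are not recoverable here.

```python
import numpy as np

def parameter_grid(alpha_max_val, beta_max_fn, n_alpha=10, n_beta=100, lo=1e-2):
    """Log-spaced (alpha, beta) grid; n_beta and lo are illustrative placeholders."""
    grid = []
    for a in alpha_max_val * np.logspace(np.log10(lo), 0.0, n_alpha):
        for b in beta_max_fn(a) * np.logspace(np.log10(lo), 0.0, n_beta):
            grid.append((a, b))
    return grid
```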

5.1 Simulation Studies

We evaluate SIFS on three synthetic datasets, named syn1, syn2, and syn3, with varying sample sizes and feature dimensions. Each data point is written as a concatenation of two feature blocks. We use two Gaussian distributions with identity covariance to generate the data points. To be precise, the first feature block for positive and negative points is sampled from the first and the second Gaussian, respectively, while each entry in the second block has a fixed chance to be sampled from a Gaussian distribution and is 0 otherwise.
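
The following is a sketch of a generator with this two-block structure; all sizes, means, and the sparsity level are hypothetical, since the exact values used for syn1–syn3 are not recoverable from the text above.

```python
import numpy as np

def make_synthetic(n=1000, p_dense=50, p_sparse=950, shift=1.0, nz_prob=0.02, seed=0):
    """Two Gaussian classes plus a sparse feature block (all sizes illustrative)."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=n)
    X1 = rng.normal(loc=shift * y[:, None], scale=1.0, size=(n, p_dense))  # class-dependent block
    mask = rng.random((n, p_sparse)) < nz_prob                             # mostly-zero block
    X2 = rng.normal(size=(n, p_sparse)) * mask
    return np.hstack([X1, X2]), y
```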

Figure 1: Scaling ratios of ISS, IFS, and SIFS (from left to right) on syn1 (a), syn2 (b), and syn3 (c).

Fig. 1 shows the scaling ratios of ISS, IFS, and SIFS on the synthetic datasets over the whole grid of parameter values. We can see that IFS is more effective than ISS in reducing the problem size, with consistently higher scaling ratios. Moreover, SIFS, which applies IFS and ISS alternately, significantly outperforms both ISS and IFS. Such high scaling ratios imply that SIFS can lead to a significant speedup.

Figure 2: Rejection ratios of SIFS on syn2 (first row: feature screening, second row: sample screening); the four columns correspond to the values 0.05, 0.1, 0.5, and 0.9 of the regularization-parameter ratio.
Data | Solver | ISS+Solver              | IFS+Solver              | SIFS+Solver
     |        | ISS    Solver   Speedup | IFS    Solver   Speedup | SIFS   Solver   Speedup
syn1 | 499.1  | 4.9    27.8     15.3    | 2.3    42.6     11.1    | 8.6    6.0      34.2
syn2 | 8749.9 | 24.9   1496.6   5.8     | 23.0   288.1    28.1    | 92.6   70.3     53.7
syn3 | 1279.7 | 2.0    257.1    4.9     | 2.2    33.4     36.0    | 7.2    9.5      76.8
Table 1: Running time (in seconds) for solving problem (P) at all pairs of parameter values on three synthetic datasets.

Due to space limitations, we only report the rejection ratios of SIFS on syn2; other results can be found in the supplementary material. Fig. 2 shows that SIFS can identify most of the inactive features and samples. However, few features and samples are identified in the second and later triggerings of ISS and IFS. The reason may be that the task here is so simple that one triggering is enough.

Table 1 reports the running time of the solver without and with IFS, ISS, and SIFS for solving problem (P) at all pairs of parameter values. We can see that SIFS leads to significant speedups of up to roughly 77 times. Taking syn2 as an example, without SIFS, the solver takes more than two hours to solve problem (P) at all pairs of parameter values; combined with SIFS, the solver needs less than three minutes for the same set of problems. From the theoretical analysis in [17] for Accelerated-Prox-SDCA, its computational complexity grows proportionally with the sample size $n$ and the feature dimension $p$. In view of this, the scaling ratios in Figure 1 are roughly consistent with the speedups shown in Table 1.

5.2 Experiments on Real Datasets

In this experiment, we evaluate the performance of SIFS on 5 large-scale real datasets: real-sim, rcv1-train, rcv1-test, url, and kddb, which are all collected from the project page of LibSVM [4]. See Table 2 for a brief summary. We note that the kddb dataset has about 20 million samples with 30 million features.

Dataset    | Feature size $p$ | Sample size $n$
real-sim   | 20,958           | 72,309
rcv1-train | 47,236           | 20,242
rcv1-test  | 47,236           | 677,399
url        | 3,231,961        | 2,396,130
kddb       | 29,890,095       | 19,264,097
Table 2: Statistics of the real datasets.

Recall that SIFS detects the inactive features and samples in a static manner, i.e., we perform SIFS only once before the optimization, so the size of the problem on which we perform optimization is fixed. In contrast, the method in [18] detects inactive features and samples in a dynamic manner [2], i.e., it is performed along with the optimization, so the problem size keeps decreasing during the iterative process. Thus, comparing SIFS with the method in [18] in terms of rejection ratios is not applicable; we instead compare their performance in terms of speedup. Specifically, we compare the speedup gained by SIFS and by the method in [18] for solving problem (P) at all pairs of parameter values. The code of the method in [18] is obtained from https://github.com/husk214/s3fs.

Figure 3: Rejection ratios of SIFS on the real-sim dataset (first row: feature screening, second row: sample screening); the four columns correspond to the values 0.05, 0.1, 0.5, and 0.9 of the regularization-parameter ratio.
Data Set   | Solver   | Method in [18]+Solver           | SIFS+Solver
           |          | Screen    Solver    Speedup     | Screen    Solver    Speedup
real-sim   | 3.93E+04 | 24.10     4.94E+03  7.91        | 60.01     140.25    195.00
rcv1-train | 2.98E+04 | 10.00     3.73E+03  7.90        | 27.11     80.11     277.10
rcv1-test  | 1.10E+06 | 398.00    1.35E+05  8.10        | 1.17E+03  2.55E+03  295.11
url        | —        | 3.18E+04  8.60E+05  —           | 7.66E+03  2.91E+04  —
kddb       | —        | 4.31E+04  1.16E+06  —           | 1.10E+04  3.6E+04   —
Table 3: Running time (in seconds) for solving problem (P) at all pairs of parameter values on five real datasets (entries marked “—” are not reported; see the text).

Fig. 3 shows the rejection ratios of SIFS on the real-sim dataset (other results are in the supplementary material). In Fig. 3, we can see that some inactive features and samples are identified in the 2nd and 3rd triggerings of ISS and IFS, which verifies the necessity of the alternating application of ISS and IFS. SIFS is efficient, since it always stops within three triggerings. In addition, most of the inactive features can be identified in the 1st triggering of IFS, while identifying the inactive samples requires applying ISS two or more times. This may result from two reasons: 1) we run ISS first, which reinforces the capability of IFS due to the synergy effect (see Sections 4.1 and 4.2); see Section A.12.1 in the supplementary material for further verification; 2) feature screening here may be easier than sample screening.

Table 3 reports the running time of the solver without and with the method in [18] and SIFS for solving problem (P) at all pairs of parameter values on the real datasets. The speedup gained by SIFS is up to about 300 times on real-sim, rcv1-train, and rcv1-test. Moreover, SIFS significantly outperforms the method in [18] in terms of speedup, by roughly 25 to 36 times on these three datasets. For the datasets url and kddb, we do not report the results of the solver alone, as the sizes of the datasets are huge and the computational cost is prohibitive. Instead, we can see that the solver with SIFS is about 25 times faster than the solver with the method in [18] on both url and kddb. Take the dataset kddb as an example: the solver with SIFS takes about 13 hours to solve problem (P) for all pairs of parameter values, while the solver with the method in [18] needs about two weeks to finish the same task.

6 Conclusion

In this paper, we develop a novel data reduction method, SIFS, that simultaneously identifies inactive features and samples for sparse SVMs. Our major contribution is a novel framework for accurately estimating the primal and dual optima based on strong convexity. To the best of our knowledge, the proposed SIFS is the first static screening method that is able to simultaneously identify inactive features and samples for sparse SVMs. An appealing feature of SIFS is that all detected features and samples are guaranteed to be irrelevant to the outputs; thus, the model learned on the reduced data is identical to the one learned on the full data. Experiments on both synthetic and real datasets demonstrate that SIFS can dramatically reduce the problem size and that the resulting speedup can be orders of magnitude. We plan to generalize SIFS to more complicated models, e.g., SVMs with structured sparsity-inducing penalties.

Acknowledgements

This work was supported by the National Basic Research Program of China (973 Program) under Grant 2013CB336500, National Natural Science Foundation of China under Grant 61233011 and National Youth Top-notch Talent Support Program.

References

  • [1] Jinbo Bi, Kristin Bennett, Mark Embrechts, Curt Breneman, and Minghu Song. Dimensionality reduction via sparse support vector machines. The Journal of Machine Learning Research, 3:1229–1243, 2003.
  • [2] Antoine Bonnefoy, Valentin Emiya, Liva Ralaivola, and Rémi Gribonval. A dynamic screening principle for the lasso. In Signal Processing Conference (EUSIPCO), 2014 Proceedings of the 22nd European, pages 6–10. IEEE, 2014.
  • [3] Bryan Catanzaro, Narayanan Sundaram, and Kurt Keutzer. Fast support vector machine training and classification on graphics processors. In Proceedings of the 25th international conference on Machine learning, pages 104–111. ACM, 2008.
  • [4] Chih-Chung Chang and Chih-Jen Lin. Libsvm: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
  • [5] Laurent El Ghaoui, Vivian Viallon, and Tarek Rabbani. Safe feature elimination in sparse supervised learning. Pacific Journal of Optimization, 8:667–698, 2012.
  • [6] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. Liblinear: A library for large linear classification. The Journal of Machine Learning Research, 9:1871–1874, 2008.
  • [7] Trevor Hastie, Saharon Rosset, Robert Tibshirani, and Ji Zhu. The entire regularization path for the support vector machine. The Journal of Machine Learning Research, 5:1391–1415, 2004.
  • [8] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical learning with sparsity: the lasso and generalizations. CRC Press, 2015.
  • [9] Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S Sathiya Keerthi, and Sellamanickam Sundararajan. A dual coordinate descent method for large-scale linear svm. In Proceedings of the 25th international conference on Machine learning, pages 408–415. ACM, 2008.
  • [10] Thorsten Joachims. Text categorization with support vector machines: Learning with many relevant features. Springer, 1998.
  • [11] Irene Kotsia and Ioannis Pitas. Facial expression recognition in image sequences using geometric deformation features and support vector machines. Image Processing, IEEE Transactions on, 16(1):172–187, 2007.
  • [12] Johannes Mohr and Klaus Obermayer. A topographic support vector machine: Classification using local label configurations. In Advances in Neural Information Processing Systems, pages 929–936, 2004.
  • [13] Harikrishna Narasimhan and Shivani Agarwal. Svm pauc tight: a new support vector method for optimizing partial auc based on a tight convex upper bound. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 167–175. ACM, 2013.
  • [14] Eugene Ndiaye, Olivier Fercoq, Alexandre Gramfort, and Joseph Salmon. Gap safe screening rules for sparse-group lasso. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 388–396. Curran Associates, Inc., 2016.
  • [15] Kohei Ogawa, Yoshiki Suzuki, and Ichiro Takeuchi. Safe screening of non-support vectors in pathwise svm computation. In Proceedings of the 30th International Conference on Machine Learning, pages 1382–1390, 2013.
  • [16] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for svm. Mathematical programming, 127(1):3–30, 2011.
  • [17] Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1-2):105–145, 2016.
  • [18] Atsushi Shibagaki, Masayuki Karasuyama, Kohei Hatano, and Ichiro Takeuchi. Simultaneous safe screening of features and samples in doubly sparse modeling. In Proceedings of The 33rd International Conference on Machine Learning, 2016.
  • [19] Robert Tibshirani, Jacob Bien, Jerome Friedman, Trevor Hastie, Noah Simon, Jonathan Taylor, and Ryan J Tibshirani. Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(2):245–266, 2012.
  • [20] Jie Wang, Jiayu Zhou, Jun Liu, Peter Wonka, and Jieping Ye. A safe screening rule for sparse logistic regression. In Advances in Neural Information Processing Systems, pages 1053–1061, 2014.
  • [21] Jie Wang, Jiayu Zhou, Peter Wonka, and Jieping Ye. Lasso screening rules via dual polytope projection. In Advances in Neural Information Processing Systems, pages 1070–1078, 2013.
  • [22] Li Wang, Ji Zhu, and Hui Zou. The doubly regularized support vector machine. Statistica Sinica, pages 589–615, 2006.
  • [23] Zhen James Xiang and Peter J Ramadge. Fast lasso screening tests based on correlations. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 2137–2140. IEEE, 2012.
  • [24] Yuya Yoshikawa, Tomoharu Iwata, and Hiroshi Sawada. Latent support measure machines for bag-of-words data classification. In Advances in Neural Information Processing Systems, pages 1961–1969, 2014.

Appendix A Appendix

In this appendix, we first present the detailed proofs of all the theorems in the main text and then report the remaining experimental results, which are omitted from the experiment section due to space limitations.

A.1 Proof of Theorem 1

Proof of Theorem 1. Let $z_i=1-\langle\bar{\mathbf{x}}_i,\mathbf{w}\rangle$ for $i\in[n]$; the primal problem (P) is then equivalent to a constrained problem in $(\mathbf{w},\mathbf{z})$. The Lagrangian then becomes

(17)