1 Introduction
Popular machine learning algorithms are computationally expensive, or worse yet, intractable to train on massive data sets, where the input data set is so large that it may not be possible to process all the data at one time. A natural approach to achieving scalability when faced with Big Data is to first conduct a preprocessing step that summarizes the input data points by a significantly smaller, representative set. Off-the-shelf training algorithms can then be run efficiently on this compressed set of data points. The premise of this two-step learning procedure is that the model trained on the compressed set will be provably competitive with the model trained on the original set – as long as the data summary, i.e., the
coreset, can be generated efficiently and is sufficiently representative. Coresets are small weighted subsets of the training points such that models trained on the coreset are approximately as good as those trained on the original (massive) data set. Coreset constructions were originally introduced in the context of computational geometry [1]
and subsequently generalized for applications to other problems, such as logistic regression, neural network compression, and mixture model training
[6, 7, 10, 18, 21] (see [11] for a survey). A popular coreset construction technique – and the one that we leverage in this paper – is importance sampling with respect to the points’ sensitivities. The sensitivity of a point is defined to be the worst-case relative impact of that point on the objective function. Points with high sensitivities have a large impact on the objective value and are sampled with correspondingly high probability, and vice versa. The main challenge in generating small coresets often lies in evaluating the importance of each point in an accurate and computationally efficient way.
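To make the scheme concrete, here is a hedged sketch of generic sensitivity-based importance sampling; the cost function and the sensitivity scores below are toy placeholders, not the bounds derived in this paper.

```python
import numpy as np

def sensitivity_sample(points, sensitivities, m, seed=None):
    """Sample m points with probability proportional to sensitivity, and
    weight each sample by 1/(m * p_i) so that weighted sums stay unbiased."""
    rng = np.random.default_rng(seed)
    probs = sensitivities / sensitivities.sum()
    idx = rng.choice(len(points), size=m, replace=True, p=probs)
    weights = 1.0 / (m * probs[idx])
    return points[idx], weights

# Toy data: points far from the origin get larger (placeholder) scores.
P = np.random.default_rng(0).normal(size=(1000, 2))
s = 1.0 + np.linalg.norm(P, axis=1)      # hypothetical sensitivity bound
C, w = sensitivity_sample(P, s, m=50, seed=1)

# The weighted sample approximates additive costs of the full set:
full_cost = (np.linalg.norm(P, axis=1) ** 2).sum()
core_cost = (w * np.linalg.norm(C, axis=1) ** 2).sum()
```

Points with small sensitivity receive small sampling probability but large weights, which keeps the estimator unbiased while concentrating samples on influential points.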
1.1 Our Contributions
In this paper, we propose an efficient coreset construction algorithm that generates compact representations of large data sets to accelerate SVM training. Our approach hinges on bridging the SVM problem with that of k-means clustering. As a corollary to our theoretical analysis, we obtain theoretical justification for the widely reported empirical success of using k-means clustering to generate data summaries for large-scale SVM training. In contrast to prior approaches, ours is both (i) provably efficient and (ii) naturally extensible to streaming or dynamic data settings. Above all, our approach enables the applicability of any off-the-shelf SVM solver – including gradient-based and/or approximate ones, e.g., Pegasos [27] – to streaming and distributed data settings by exploiting the composability and reducibility properties of coresets [11].
In particular, this paper contributes the following:

A coreset construction algorithm for accelerating SVM training based on an efficient importance sampling scheme.

An analysis proving lower bounds on the number of samples required by any coreset construction algorithm to approximate the input data set.

Theoretical guarantees on the efficiency and accuracy of our coreset construction algorithm.

Evaluations on synthetic and realworld data sets that demonstrate the effectiveness of our algorithm in both streaming and offline settings.
2 Related Work
Training SVMs in the offline setting requires time and space that are polynomial in the number of training points, n. Towards the goal of accelerating SVM training in the offline setting, [28, 29] introduced the Core Vector Machine (CVM) and Ball Vector Machine (BVM) algorithms, which are based on reformulating the SVM problem as the Minimum Enclosing Ball (MEB) problem and the Enclosing Ball (EB) problem, respectively, and leveraging existing coreset constructions for each; see [5]. However, CVM’s accuracy and convergence properties have been noted to be at times inferior to those of existing SVM implementations [22]; moreover, unlike the algorithm presented in this paper, neither the CVM nor the BVM algorithm extends naturally to streaming or dynamic settings where data points are continuously inserted or deleted. Similar geometric approaches, including extensions of the MEB formulation, those based on convex hulls and extreme points, among others, were investigated by [2, 12, 14, 16, 24, 26]. Another class of related work includes the use of canonical optimization algorithms such as the Frank-Wolfe algorithm [9], Gilbert’s algorithm [8, 9]
, and a primal-dual approach combined with Stochastic Gradient Descent (SGD) [15]. SGD-based approaches, such as Pegasos [27], have been a popular tool of choice for approximately optimal SVM training. Pegasos is a stochastic subgradient algorithm for obtaining an approximate solution to the SVM problem with a linear kernel, with a runtime that depends on the regularization parameter and the dimensionality of the input data points. In contrast to our method, these approaches and their corresponding theoretical guarantees do not feasibly extend to dynamic data sets and/or streaming settings. In particular, gradient-based approaches cannot be trivially extended to streaming settings, since the arrival of each input point in the stream results in a change of the gradient.
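For reference, a minimal sketch of the Pegasos-style update (our simplification of the cited algorithm: one uniformly random point per iteration, step size 1/(λt), linear kernel, no projection step):

```python
import numpy as np

def pegasos(X, y, lam=0.1, T=1000, seed=None):
    """Minimal Pegasos-style stochastic subgradient SVM (linear kernel).
    At step t, pick one random point and take a subgradient step with
    learning rate eta_t = 1/(lam*t)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)
        eta = 1.0 / (lam * t)
        w = (1 - eta * lam) * w                 # shrink from the regularizer
        if y[i] * X[i].dot(w) < 1:              # point violates the margin
            w = w + eta * y[i] * X[i]           # hinge-loss subgradient step
    return w

# Linearly separable toy problem: label = sign of the first coordinate.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0])
w = pegasos(X, y, lam=0.01, T=2000, seed=1)
acc = np.mean(np.sign(X.dot(w)) == y)
```

As the paragraph above notes, every arriving point changes the gradient, which is why such iterative schemes do not transfer directly to streams; coresets sidestep this by summarizing the data first.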
There has been prior work on streaming algorithms for SVMs, such as those of [2, 14, 25, 26]. However, these works generally suffer from poor practical performance in comparison to approximately optimal SVM algorithms in the offline (batch) setting, difficulty of implementation and application to practical settings, and/or a lack of strong theoretical guarantees. Unlike prior algorithms, our method is simultaneously simple to implement, exhibits theoretical guarantees, and naturally extends to streaming and dynamic data settings, where the input data set is so large that it may not be possible to store or process all the data at one time.
3 Problem Definition
Let the input be a set of points. Note that for each point, the last entry accounts for the bias term embedded into the feature space (we perform this embedding for ease of presentation later on in our analysis). To present our results in full generality, we consider the setting where the input points may have weights associated with them: given the points and a weight function, we consider the corresponding weighted set. The canonical unweighted case is represented by the weight function that assigns a uniform weight of 1 to each point. We consider the scenario where the number of points n is much larger than the dimension d of the data points, i.e., n ≫ d.
For a normal to a separating hyperplane, the first entries contain the hyperplane direction, and the last entry encodes the bias term. Under this setting, the hinge loss of any point with respect to a normal to a separating hyperplane is defined as the positive part of one minus the point’s signed margin. As a prelude to our subsequent analysis of sensitivity-based sampling, we quantify the contribution of each point to the SVM objective function as (1),
where the first term involves the SVM regularization parameter and the second is the hinge loss with respect to the query and the point. Putting it all together, we formalize the regularized SVM problem as follows.
Definition 1 (regularized SVM Problem)
For a given weighted set of points and a regularization parameter , the regularized SVM problem with respect to is given by
where
(2) 
We let denote the optimal solution to the SVM problem with respect to , i.e., . A solution is an approximation to the SVM problem if . Next, we formalize the coreset guarantee that we will strive for when constructing our data summaries.
Coresets.
A coreset is a compact representation of the full data set that provably approximates the SVM cost function (2) for every query – including that of the optimal solution. We formalize this notion for the SVM problem below.
Definition 2 (coreset)
Let and let be the weighted set of training points as before. A weighted subset , where and is an coreset for if
(3) 
This strong guarantee implies that models trained on the coreset with any off-the-shelf SVM solver will be approximately (and provably) as good as the optimal solution obtained by training on the entire data set. It also implies that, if the size of the coreset is provably small, e.g., logarithmic in the number of points (see Sec. 5), then an approximately optimal solution can be obtained much more quickly by training on the coreset rather than on the full data set, leading to computational gains in practice for both offline and streaming data settings (see Sec. 6).
The difficulty in constructing coresets lies in constructing them (i) efficiently, so that the preprocessthentrain pipeline takes less time than training on the full data set and (ii) accurately, so that important data points – i.e., those that are imperative to obtaining accurate models – are not left out of the coreset, and redundant points are eliminated so that the coreset size is small. In the following sections, we introduce and analyze our coreset algorithm for the SVM problem.
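Definition 2 quantifies over every query; empirically, the guarantee can be spot-checked over a finite set of candidate queries (a proxy only, with a toy unregularized hinge cost standing in for (2)):

```python
import numpy as np

def hinge_cost(w, X, y, weights):
    """Weighted sum of hinge losses for a query w."""
    return float(np.sum(weights * np.maximum(0.0, 1.0 - y * (X @ w))))

def max_relative_error(X, y, idx, core_weights, queries):
    """Largest relative deviation between the coreset cost and the
    full-set cost over a finite list of candidate queries."""
    full_weights = np.ones(len(X))
    errs = []
    for w in queries:
        full = hinge_cost(w, X, y, full_weights)
        core = hinge_cost(w, X[idx], y[idx], core_weights)
        if full > 0:
            errs.append(abs(core - full) / full)
    return max(errs)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.sign(rng.normal(size=100))
queries = rng.normal(size=(20, 3))
# Sanity check: the full set is trivially a zero-error 'coreset' of itself.
err = max_relative_error(X, y, np.arange(100), np.ones(100), queries)
```

A genuine verification would need the maximum over all queries, which is exactly why the sensitivity framework bounds the worst case analytically instead.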
4 Method
Our coreset construction scheme is based on the unified framework of [10, 18] and is shown in Alg. 1. The crux of our algorithm lies in generating the importance sampling distribution via efficiently computable upper bounds (proved in Sec. 5) on the importance of each point. Sufficiently many points are then sampled from this distribution, and each point is given a weight that is inversely proportional to its sampling probability. The number of points required to generate a coreset with the desired probability is a function of the desired accuracy, the failure probability, and the complexity of the data set (from Theorem 5.3). Under mild assumptions on the problem at hand (see Sec. 0.A.4), the required sample size is polylogarithmic in the number of points.
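The variance benefit of importance-proportional sampling (formalized below via sensitivities) can be seen on a toy cost with one dominant point; the per-point costs here are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
costs = np.ones(1000)          # toy per-point costs
costs[0] = 1000.0              # one dominant point
total = costs.sum()

def estimator_std(probs, m=20, reps=200):
    """Std. dev. of the importance-sampling estimate of the total cost."""
    ests = []
    for _ in range(reps):
        idx = rng.choice(len(costs), size=m, p=probs)
        ests.append(np.mean(costs[idx] / probs[idx]))
    return float(np.std(ests))

uniform = np.full(1000, 1.0 / 1000)
importance = costs / total     # 'sensitivity'-style distribution

std_uniform = estimator_std(uniform)
std_importance = estimator_std(importance)
# Sampling proportional to cost makes each term costs[i]/probs[i] equal
# to the total, so the estimator variance collapses to (nearly) zero.
```

Uniform sampling usually misses the dominant point and occasionally hits it with a huge weight, producing a high-variance estimate; importance sampling removes this failure mode.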
Our algorithm is an importance sampling procedure that first generates a judicious sampling distribution based on the structure of the input points and then samples sufficiently many points from the original data set. The resulting weighted set of points serves as an unbiased estimator of the cost for any query. Although sampling points uniformly with appropriate weights can also yield an unbiased estimator, it turns out that the variance of the estimate is minimized if the points are sampled according to the distribution defined by the ratio between each point’s sensitivity and the sum of sensitivities, i.e.,
on Line 1 [4].
4.1 Computational Complexity
Coresets are intended to provide efficient and provable approximations to the optimal SVM solution. However, the very first line of our algorithm entails computing an (approximately) optimal solution to the SVM problem. This seemingly circular requirement is explained by the merge-and-reduce technique [13], which ensures that our coreset algorithm is only run on small partitions of the original data set [7, 13, 23]. The merge-and-reduce approach (depicted in Alg. 2 in Sec. 0.B of the appendix) leverages the fact that coresets are composable, and reduces the coreset construction problem for a large set of points to that of computing coresets for small subsets, where the subset size is the minimum size of an input set that can be reduced to half using Algorithm 1 [7]. Assuming that the sufficient conditions for obtaining polylogarithmic-size coresets implied by Theorem 5.3 hold, the overall time required is approximately linear in the number of points.
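The merge-and-reduce tree can be sketched generically; `reduce_half` is a placeholder for any halving coreset construction (Alg. 1 in our case), and composability is what makes concatenating two coresets valid:

```python
def merge_and_reduce(stream, reduce_half, leaf_size):
    """Streaming composable-coreset scheme: buffer leaf_size points,
    reduce each full buffer to half its size, and repeatedly merge
    same-height buckets (like binary counter carries), so only
    O(log n) small buckets are kept in memory."""
    buckets = {}                      # height -> reduced bucket
    buffer = []
    for x in stream:
        buffer.append(x)
        if len(buffer) == leaf_size:
            bucket, height = reduce_half(buffer), 0
            buffer = []
            while height in buckets:  # merge equal-height buckets
                bucket = reduce_half(buckets.pop(height) + bucket)
                height += 1
            buckets[height] = bucket
    summary = list(buffer)
    for b in buckets.values():
        summary += b
    return summary

# Placeholder reduction: keep every other point in sorted order (a real
# implementation would call the sensitivity-based construction instead).
halve = lambda pts: sorted(pts)[::2]
summary = merge_and_reduce(range(100), halve, leaf_size=8)
```

Since each reduction is applied only to buffers of a fixed small size, the expensive first line of Alg. 1 is never run on the full data set.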
5 Analysis
In this section, we analyze the sample efficiency and computational complexity of our algorithm. The outline of this section is as follows: we first formalize the importance (i.e., sensitivity) of each point and summarize the necessary conditions for the existence of small coresets. We then present the negative result that, in general, sublinear-size coresets do not exist for every data set (Lem. 5.2). Despite this, we show that we can obtain accurate approximations of the sensitivity of each point via approximate k-means clustering (Lems. 5.3 and 5.3), and present non-vacuous, data-dependent bounds on the sample complexity (Thm. 5.3). Our technical results in full, with corresponding proofs, can be found in the Appendix.
5.1 Preliminaries
We will henceforth state all of our results with respect to the weighted set of training points , , and SVM cost function (as in Sec. 3). The definition below rigorously quantifies the relative contribution of each point.
Definition 3 (Sensitivity [7])
The sensitivity of each point is given by
(4) 
Note that in practice, exact computation of the sensitivity is intractable, so we usually settle for (sharp) upper bounds on the sensitivity (e.g., as in Alg. 1). Sensitivity-based importance sampling then boils down to normalizing the sensitivities by the normalization constant – the sum of sensitivities – to obtain an importance sampling distribution. It turns out that the required size of the coreset is at least linear in the sum of sensitivities [7], which implies that one immediate necessary condition for sublinear-size coresets is that the sum of sensitivities be sublinear in the number of points.
5.2 Lower bound for Sensitivity
The next lemma shows that a sublinear-size coreset cannot be constructed for every SVM problem instance. The proof of this result is based on demonstrating a hard point set for which the sum of sensitivities is linear in the number of points, ignoring logarithmic factors, which implies that sensitivity-based importance sampling roughly boils down to uniform sampling for this data set. This in turn implies that if the regularization parameter is too large, and the number of points is large (as in Big Data applications), then the number of samples required for property (3) to hold is at least linear in the number of points.
Lemma (Sensitivity Lower Bound). For an even integer, there exists a set of weighted points such that
We next provide upper bounds on the sensitivity of each data point with respect to the complexity of the input data. Despite the non-existence results established above, our upper bounds shed light on the class of problems for which small coresets are guaranteed to exist.
5.3 Sensitivity Upper Bound
In this subsection, we present sharp, data-dependent upper bounds on the sensitivity of each point. Our approach is based on an approximate solution to the k-means clustering problem and to the SVM problem itself (as in Alg. 1). To this end, we will henceforth consider a positive integer number of clusters and the error of the (coarse) SVM approximation, with the remaining quantities defined as in Algorithm 1.
Lemma (Sensitivity Upper Bound). Let the number of clusters be a positive integer, and let the input be a weighted set. Then for every point and query,
Lemma (Sum of Sensitivities). In the context of Lemma 5.3, the sum of sensitivities is bounded by
where for all and .
Theorem (Coreset Guarantee). For any desired accuracy and failure probability, let the sample size be an integer satisfying
where the bound is as in Lem. 5.3. Invoking Coreset with the inputs defined in this context yields a coreset with the desired probability, in time dominated by the computational complexity of obtaining an approximate solution to the SVM problem and applying k-means++ to the input.
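To convey the flavor of the k-means-based bounds, here is a hypothetical sketch in which a point's importance score combines its share of its cluster with its distance to the cluster centroid; the actual bound in the lemma above additionally depends on the coarse SVM solution and the point weights.

```python
import numpy as np

def lloyd_kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd iterations (k-means++ seeding omitted for brevity)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = X[assign == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, assign

def importance_scores(X, k=5, seed=0):
    """Hypothetical sensitivity proxy: 1/|cluster| plus the normalized
    distance to the assigned centroid (isolated, far-out points score high)."""
    centers, assign = lloyd_kmeans(X, k, seed=seed)
    sizes = np.bincount(assign, minlength=k)
    dist = np.linalg.norm(X - centers[assign], axis=1)
    return 1.0 / sizes[assign] + dist / (dist.sum() + 1e-12)
```

The intuition matches the lemma: points well represented by a large, tight cluster have low worst-case influence, while outlying points must be retained with high probability.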
6 Results
In this section, we present experimental results that demonstrate and compare the effectiveness of our algorithm on a variety of synthetic and realworld data sets in offline and streaming data settings [20]. Our empirical evaluations demonstrate the practicality and widespread effectiveness of our approach: our algorithm consistently generated more compact and representative data summaries, and yet incurred a negligible increase in computational complexity when compared to uniform sampling. Additional results and details of our experimental setup and evaluations can be found in Sec. 0.D of the Appendix.
Table 1: For each data set – HTRU, Credit, Pathological, Skin, Cod, W1 – the number of data points, the sum of sensitivities, and the sum of sensitivities as a percentage of the number of points.
Evaluation
We considered real-world data sets of varying size and complexity, as depicted in Table 1 (also see Sec. 0.D of the Appendix). For each data set, we selected a set of geometrically spaced subsample sizes. For each sample size, we ran each algorithm (Alg. 1 or uniform sampling) to construct a subset of that size. We then trained the SVM model as usual on this subset to obtain an optimal solution with respect to the coreset. We then computed the relative error incurred by the solution computed on the coreset with respect to the ground-truth optimal solution computed on the entire data set; see Corollary 1 in Sec. 0.A.4 of the Appendix. The results were averaged across trials.
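The evaluation loop can be sketched generically; `train`, `cost`, and `sampler` are placeholder callables standing in for the SVM solver, the full-data objective, and either Alg. 1 or uniform sampling.

```python
import numpy as np

def relative_error(cost_core, cost_opt):
    """Relative suboptimality, measured on the full data set's objective."""
    return abs(cost_core - cost_opt) / cost_opt

def evaluate(train, cost, X, y, sizes, sampler, trials=10, seed=0):
    """Mean relative error of training on subsamples of each size."""
    rng = np.random.default_rng(seed)
    w_opt = train(X, y, np.ones(len(X)))      # ground truth on the full set
    c_opt = cost(w_opt, X, y)
    results = {}
    for m in sizes:
        errs = []
        for _ in range(trials):
            idx, wts = sampler(X, y, m, rng)  # draw a weighted subsample
            w_m = train(X[idx], y[idx], wts)  # train on the subsample only
            errs.append(relative_error(cost(w_m, X, y), c_opt))
        results[m] = float(np.mean(errs))
    return results
```

In our experiments, the SVM solver plays the role of `train`, the objective (2) the role of `cost`, and Alg. 1 or uniform sampling the role of `sampler`.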
Figures 1 and 2 depict the results of our comparisons against uniform sampling in the offline setting. In Fig. 1, we see that the coresets generated by our algorithm are much more representative and compact than those constructed by uniform sampling: across all data sets and sample sizes, training on our coreset yields significantly better solutions to the SVM problem than training on a uniform sample. For certain data sets, such as HTRU, Pathological, and W1, this relative improvement over uniform sampling is at least an order of magnitude, especially for small sample sizes. Fig. 1 also shows that, as a consequence of a more informed sampling scheme, the variance of each model’s performance trained on our coreset is much lower than that of uniform sampling for all data sets.
Fig. 2 shows the total computational time required for constructing the subsample (i.e., coreset) and training the SVM on it. We observe that our approach takes significantly less time than training on the original data set for non-trivial data set sizes, which underscores the efficiency of our method: we incur a negligible cost in the overall SVM training time due to a more involved coreset construction procedure, but benefit heavily in terms of the accuracy of the models generated (Fig. 1).
Next, we evaluate our approach on the same data sets in the streaming setting, where data points arrive one by one and the entire data set cannot be kept in memory. The results are shown in Fig. 3; the corresponding figure for the total computational time is shown as Fig. 5 in Sec. 0.E of the Appendix. Figs. 3 and 5 portray a trend similar to the one we observed in our offline evaluations: our approach significantly outperforms uniform sampling for all of the evaluated data sets and sample sizes, with negligible computational overhead.
In sum, our empirical evaluations demonstrate the practical efficiency of our algorithm and reaffirm the favorable theoretical guarantees of our approach: the additional computational complexity of constructing the coreset is negligible relative to that of uniform sampling, and the entire preprocessthentrain pipeline is significantly more efficient than training on the original massive data set.
7 Conclusion
We presented an efficient coreset construction algorithm for generating compact representations of the input data points that are provably competitive with the original data set for training Support Vector Machine models. Unlike prior approaches, our method and its theoretical guarantees naturally extend to streaming settings and scenarios involving dynamic data sets, where points are continuously inserted and deleted. We established instance-dependent bounds on the number of samples required to obtain accurate approximations to the SVM problem as a function of the input data complexity, and established data-set-dependent conditions for the existence of compact representations. Our experimental results on real-world data sets validate our theoretical results and demonstrate the practical efficacy of our approach in speeding up SVM training. We conjecture that our coreset construction can be extended to accelerate SVM training for other classes of kernels, and can be applied to a variety of Big Data scenarios.
References
 [1] (2005) Geometric approximation via coresets. Combinatorial and Computational Geometry 52, pp. 1–30. Cited by: §1.
 [2] (2010) Streaming algorithms for extent problems in high dimensions. In Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms, pp. 1481–1489. Cited by: §2, §2.
 [3] (2007) k-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete Algorithms, pp. 1027–1035. Cited by: Appendix 0.D.
 [4] (2017) Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476. Cited by: §4.
 [5] (2003) Smaller coresets for balls. In Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete Algorithms, pp. 801–802. Cited by: §2.
 [6] (2018) Data-dependent coresets for compressing neural networks with applications to generalization bounds. arXiv preprint arXiv:1804.05345. Cited by: §1.
 [7] (2016) New frameworks for offline and streaming coreset constructions. arXiv preprint arXiv:1612.00889. Cited by: §1, §4.1, §5.1, Definition 3, §0.A.4.
 [8] (2012) Sublinear optimization for machine learning. Journal of the ACM (JACM) 59 (5), pp. 23. Cited by: §2.
 [9] (2010) Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms (TALG) 6 (4), pp. 63. Cited by: §2.
 [10] (2011) A unified framework for approximating and clustering data. In Proceedings of the forty-third annual ACM symposium on Theory of Computing, pp. 569–578. Cited by: Appendix 0.B, §1, §4.
 [11] (2019) Coresets: an updated survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, pp. e1335. Cited by: §1.1, §1.
 [12] (2009) Coresets for polytope distance. In Proceedings of the twenty-fifth annual Symposium on Computational Geometry, pp. 33–42. Cited by: §2.
 [13] (2004) On coresets for k-means and k-median clustering. In Proceedings of the thirty-sixth annual ACM symposium on Theory of Computing, pp. 291–300. Cited by: §4.1.
 [14] (2007) Maximum margin coresets for active and noise tolerant learning. In IJCAI, pp. 836–841. Cited by: §2, §2.
 [15] (2011) Beating SGD: learning SVMs in sublinear time. In Advances in Neural Information Processing Systems, pp. 1233–1241. Cited by: §2.
 [16] (2006) Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge Discovery and Data Mining, pp. 217–226. Cited by: §2.
 [17] (2013) Review on determining the number of clusters in k-means clustering. International Journal 1 (6), pp. 90–95. Cited by: §0.C.1.
 [18] (2010) Universal approximators for integrals. In Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms, pp. 598–607. Cited by: Appendix 0.B, §1, §4.
 [19] (2001) Improved bounds on the sample complexity of learning. Journal of Computer and System Sciences 62 (3), pp. 516–527. Cited by: §0.A.4.
 [20] (2013) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: §6.
 [21] (2019) Provable filter pruning for efficient neural networks. arXiv preprint arXiv:1911.07412. Cited by: §1.
 [22] (2007) Comments on the “Core Vector Machines: Fast SVM Training on Very Large Data Sets”. Journal of Machine Learning Research 8 (Feb), pp. 291–301. Cited by: §2.
 [23] (2017) Training mixture models at scale via coresets. arXiv preprint arXiv:1703.08110. Cited by: §4.1.
 [24] (2014) Fast SVM training using approximate extreme points. Journal of Machine Learning Research 15 (1), pp. 59–98. Cited by: §2.
 [25] (2014) Accurate streaming support vector machines. arXiv preprint arXiv:1412.2485. Cited by: §2.
 [26] (2009) Streamed learning: one-pass SVMs. arXiv preprint arXiv:0908.0572. Cited by: §2, §2.
 [27] (2011) Pegasos: primal estimated sub-gradient solver for SVM. Mathematical Programming 127 (1), pp. 3–30. Cited by: §1.1, §2.
 [28] (2007) Simpler core vector machines with enclosing balls. In Proceedings of the 24th international conference on Machine Learning, pp. 911–918. Cited by: §2.
 [29] (2005) Core vector machines: fast SVM training on very large data sets. Journal of Machine Learning Research 6 (Apr), pp. 363–392. Cited by: §2.
 [30] (1998) Statistical learning theory. Vol. 1, Wiley, New York. Cited by: §0.A.4.
 [31] (2017) Weighted SGD for ℓp regression with randomized preconditioning. arXiv preprint arXiv:1502.03571. Cited by: §0.A.1.
Appendix 0.A Proofs of the Analytical Results in Section 5
This section includes the full proofs of the technical results given in Sec. 5.
0.A.1 Proof of Lemma 5.2
*
Proof
Following [31], let and let , where is the set of labeled points, and . For every , where and , among the first entries of , exactly entries are equal to
where the remaining entries among the first are set to . Hence, for our proof to hold, we assume that contains all such combinations and at least one point of each label. For every , define the set of nonzero entries of as the set
Put and note that for bounding the sensitivity of point , consider with entries defined as
Note that . We also have that since . To bound the sum of hinge losses contributed by other points , note that . Then for every ,
which implies that . Thus,
Putting it all together,
Since the above holds for every , summing the above inequality over every , yields that
0.A.2 Proof of Lemma 5.3
*
Proof
Let denote the set of points with the same label as as in Line 1 of Algorithm 1. Consider an optimal clustering of the points in into clusters with centroids being their mean as in Line 1, and let be as defined in Line 1 for every and . In addition, let denote the weighted set for every and .
Put and let be the index of the cluster which belongs to, i.e., .
We first observe that for any scalars , . This implies that, by definition of the hinge loss, we have for every
where as before. Hence, in the context of the definitions above
(5)  
(6)  
(7) 
Now let the total weight of the points in be denoted by . Note that since is the centroid of (as described in Line 1 of Algorithm 1), we have Observing that the hinge loss is convex, we invoke Jensen’s inequality to obtain
Applying the two inequalities established above to yields that
(8)  
(9)  
(10)  
(11)  
(12) 
By definition of , we have
Continuing from above and dividing both sides by yields
where
(13) 
Let
be the expression on the right hand side of the sensitivity inequality above, and let . The rest of the proof will focus on bounding , since an upper bound on the sensitivity of a point as a whole would follow directly from an upper bound on .
Note that by definition of and the embedding of to the entry of the original dimensional point (with respect to ),
where the equality holds since the th entry of is zero.
We know that , since otherwise , which contradicts the fact that is the maximizer of . This implies that for each entry of the subgradient of evaluated at , denoted by , is given by
(14) 
and that since the bias term does not appear in the numerator of .
Letting and setting each entry of the gradient to , we solve for to obtain
This implies that
Rearranging and using the definition of , we obtain
(15) 
where Lemma 5.3 holds by taking outside the max term.
By using the same equivalency for from above, we also obtain that
but , and so continuing from above, we have
where . Solving for from the above equation yields for
(16) 
Now we subdivide the rest of the proof into two cases. The first is the trivial case, in which the sensitivity of the point is small enough to be negligible; the second is the involved case, in which the point has a high influence on the SVM cost function and its contribution cannot be captured by the optimal solution or anything close to it.

Case 1: the bound on the sensitivity follows trivially from the analysis above.
Case 2: note that the assumption of this case implies that cannot be the maximizer of . This follows from the convexity of the SVM loss function, which implies that the norm of the gradient evaluated at the maximizer is 0. Thus by (15): since , we obtain
Hence, we know that for this case we have , , and so we obtain .
This implies that we can use Eq. (16) to upper bound the numerator of the sensitivity. Note that the bound from (16) is decreasing as a function of , so it suffices to obtain a lower bound on . To do so, we focus on Eq. (15) and divide both sides by , to obtain
By rearranging the above equality, we have that
(17) Recall that since the last entry of is zero, it follows from Eq. (14) that is also zero, which implies that
(18) where the inequality is by Cauchy-Schwarz.
Combining Eq.(17) with Eq. (18) yields
where the second inequality holds by the assumption of the case, and the third inequality follows from the fact that .
This implies that
Hence by definition of , we have that
(19) Plugging Eq.(19) into Eq.(