On Coresets for Support Vector Machines

We present an efficient coreset construction algorithm for large-scale Support Vector Machine (SVM) training in Big Data and streaming applications. A coreset is a small, representative subset of the original data points such that models trained on the coreset are provably competitive with those trained on the original data set. Since the size of the coreset is generally much smaller than the original set, our preprocess-then-train scheme has the potential to lead to significant speedups when training SVM models. We prove lower and upper bounds on the size of the coreset required to obtain small data summaries for the SVM problem. As a corollary, we show that our algorithm can be used to extend the applicability of any off-the-shelf SVM solver to streaming, distributed, and dynamic data settings. We evaluate the performance of our algorithm on real-world and synthetic data sets. Our experimental results reaffirm the favorable theoretical properties of our algorithm and demonstrate its practical effectiveness in accelerating SVM training.


1 Introduction

Popular machine learning algorithms are computationally expensive, or worse yet, intractable to train on massive data sets, where the input data set is so large that it may not be possible to process all the data at one time. A natural approach to achieve scalability when faced with Big Data is to first conduct a preprocessing step to summarize the input data points by a significantly smaller, representative set. Off-the-shelf training algorithms can then be run efficiently on this compressed set of data points. The premise of this two-step learning procedure is that the model trained on the compressed set will be provably competitive with the model trained on the original set – as long as the data summary, i.e., the coreset, can be generated efficiently and is sufficiently representative.

Coresets are small weighted subsets of the training points such that models trained on the coreset are approximately as good as the ones trained on the original (massive) data set. Coreset constructions were originally introduced in the context of computational geometry [1] and subsequently generalized for applications to other problems, such as logistic regression, neural network compression, and mixture model training [6, 7, 10, 18, 21] (see [11] for a survey).

A popular coreset construction technique – and the one that we leverage in this paper – is to use importance sampling with respect to the points' sensitivities. The sensitivity of a point is defined as its worst-case relative impact on the objective function. Points with high sensitivities have a large impact on the objective value and are sampled with correspondingly high probability, and vice-versa. The main challenge in generating small-sized coresets often lies in evaluating the importance of each point in an accurate and computationally-efficient way.
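To make the scheme concrete, the following minimal Python sketch (the function name and toy data are ours, not from the paper) illustrates generic sensitivity-based importance sampling: points are drawn i.i.d. proportionally to given sensitivity scores, and each draw receives an inverse-probability weight so that weighted sums remain unbiased.

```python
import numpy as np

def sensitivity_sample(points, sensitivities, m, seed=None):
    """Sample m points i.i.d. with probability proportional to their
    sensitivities, and attach inverse-probability weights so that
    sum_i w_i * cost(p_i) is an unbiased estimate of the total cost."""
    rng = np.random.default_rng(seed)
    t = sensitivities.sum()              # total sensitivity
    q = sensitivities / t                # importance sampling distribution
    idx = rng.choice(len(points), size=m, p=q)
    weights = 1.0 / (m * q[idx])         # inverse-probability weights
    return idx, weights

# Toy example: per-point costs double as (exact) sensitivity scores.
costs = np.array([10.0, 1.0, 1.0, 1.0])
idx, w = sensitivity_sample(costs, costs, m=1000, seed=0)
estimate = (w * costs[idx]).sum()        # estimates costs.sum()
```

When the scores are exactly proportional to the per-point costs, every term w_i · cost_i equals t/m, so the estimate is exact with zero variance; this is the sense in which sensitivity-based sampling minimizes estimator variance.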

1.1 Our Contributions

In this paper, we propose an efficient coreset construction algorithm to generate compact representations of large data sets to accelerate SVM training. Our approach hinges on bridging the SVM problem with that of k-means clustering. As a corollary to our theoretical analysis, we obtain theoretical justification for the widely reported empirical success of using k-means clustering as a way to generate data summaries for large-scale SVM training. In contrast to prior approaches, our approach is both (i) provably efficient and (ii) naturally extends to streaming or dynamic data settings. Above all, our approach can be used to extend the applicability of any off-the-shelf SVM solver – including gradient-based and/or approximate ones, e.g., Pegasos [27] – to streaming and distributed data settings by exploiting the composability and reducibility properties of coresets [11].

In particular, this paper contributes the following:

1. A coreset construction algorithm for accelerating SVM training based on an efficient importance sampling scheme.

2. An analysis proving lower bounds on the number of samples required by any coreset construction algorithm to approximate the input data set.

3. Theoretical guarantees on the efficiency and accuracy of our coreset construction algorithm.

4. Evaluations on synthetic and real-world data sets that demonstrate the effectiveness of our algorithm in both streaming and offline settings.

2 Related Work

Training SVMs requires O(n^3) time and O(n^2) space in the offline setting, where n is the number of training points. Towards the goal of accelerating SVM training in the offline setting, [28, 29] introduced the Core Vector Machine (CVM) and Ball Vector Machine (BVM) algorithms, which are based on reformulating the SVM problem as the Minimum Enclosing Ball (MEB) problem and the Enclosing Ball (EB) problem, respectively, and on leveraging existing coreset constructions for each; see [5]. However, CVM's accuracy and convergence properties have been noted to be at times inferior relative to those of existing SVM implementations [22]; moreover, unlike the algorithm presented in this paper, neither the CVM nor the BVM algorithm extends naturally to streaming or dynamic settings where data points are continuously inserted or deleted. Similar geometric approaches, including extensions of the MEB formulation, those based on convex hulls and extreme points, among others, were investigated by [2, 12, 14, 16, 24, 26]. Another class of related work includes the use of canonical optimization algorithms such as the Frank-Wolfe algorithm [9], Gilbert's algorithm [9, 8], and a primal-dual approach combined with Stochastic Gradient Descent (SGD) [15].

SGD-based approaches, such as Pegasos [27], have been a popular tool of choice in approximately-optimal SVM training. Pegasos is a stochastic sub-gradient algorithm for obtaining an ε-approximate solution to the SVM problem in Õ(d/(λε)) time for a linear kernel, where λ is the regularization parameter and d is the dimensionality of the input data points. In contrast to our method, these approaches and their corresponding theoretical guarantees do not feasibly extend to dynamic data sets and/or streaming settings. In particular, gradient-based approaches cannot be trivially extended to streaming settings since the arrival of each input point in the stream results in a change of the gradient.
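For reference, here is a compact sketch of a Pegasos-style update loop (following the description in [27]; the toy data, λ value, and iteration count are our own illustrative choices):

```python
import numpy as np

def pegasos(X, y, lam=0.1, T=1000, seed=None):
    """Stochastic sub-gradient descent on the regularized hinge objective
    (lam/2)||w||^2 + (1/n) sum_i [1 - y_i <w, x_i>]_+, using the
    Pegasos step-size schedule eta_t = 1 / (lam * t)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)
        eta = 1.0 / (lam * t)
        w *= (1.0 - eta * lam)           # sub-gradient of the regularizer
        if y[i] * (X[i] @ w) < 1.0:      # hinge term is active at (x_i, y_i)
            w += eta * y[i] * X[i]
    return w

# Linearly separable toy data: the sign of the first coordinate is the label.
X = np.array([[2.0, 1.0], [1.5, -1.0], [-2.0, 0.5], [-1.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = pegasos(X, y, lam=0.01, T=2000, seed=0)
```

Note how each stochastic update depends on a freshly sampled training point, which is why such iterates cannot simply be replayed when the underlying data set changes in a stream.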

There has been prior work in streaming algorithms for SVMs, such as those of [2, 14, 25, 26]. However, these works generally suffer from poor practical performance in comparison to that of approximately optimal SVM algorithms in the offline (batch) setting, high difficulty of implementation and application to practical settings, and/or lack of strong theoretical guarantees. Unlike the algorithms of prior work, our method is simultaneously simple-to-implement, exhibits theoretical guarantees, and naturally extends to streaming and dynamic data settings, where the input data set is so large that it may not be possible to store or process all the data at one time.

3 Problem Definition

Let P ⊆ R^{d+1} × {−1, +1} denote a set of n input points, where each point p = (x, y) ∈ P consists of a feature vector x ∈ R^{d+1} and a label y. Note that for each point p ∈ P, the last entry of x accounts for the bias term embedded into the feature space (we perform this embedding for ease of presentation later on in our analysis). To present our results with full generality, we consider the setting where the input points may have weights associated with them. Hence, given P and a weight function u : P → R_{≥0}, we let (P, u) denote the weighted set with respect to P and u. The canonical unweighted case can be represented by the weight function that assigns a uniform weight of 1 to each point, i.e., u(p) = 1 for every point p ∈ P. For every subset T ⊆ P, let U(T) = Σ_{q∈T} u(q). We consider the scenario where n is much larger than the dimension of the data points, i.e., n ≫ d.

For a normal to a separating hyperplane w ∈ R^{d+1}, let w_{1:d} denote the vector which contains the first d entries of w; the last entry of w (i.e., w_{d+1}) encodes the bias term b. Under this setting, the hinge loss of any point p = (x, y) ∈ P with respect to a normal to a separating hyperplane, w, is defined as h(p, w) = [1 − y⟨x, w⟩]_+, where [·]_+ = max{0, ·}. As a prelude to our subsequent analysis of sensitivity-based sampling, we quantify the contribution of each point to the SVM objective function as

 f_\lambda(p,w) = \frac{1}{2U(P)}\|w_{1:d}\|_2^2 + \lambda\, h(p,w), \qquad (1)

where λ ∈ [0, 1] is the SVM regularization parameter, and h(p, w) is the hinge loss with respect to the query w ∈ R^{d+1} and point p. Putting it all together, we formalize the λ-regularized SVM problem as follows.

Definition 1 (λ-regularized SVM Problem)

For a given weighted set of points (P, u) and a regularization parameter λ ∈ [0, 1], the λ-regularized SVM problem with respect to (P, u) is given by

 \min_{w \in \mathbb{R}^{d+1}} F_\lambda(P, w),

where

 F_\lambda(P,w) = \sum_{p \in P} u(p)\, f_\lambda(p,w). \qquad (2)

We let w^* ∈ argmin_{w ∈ R^{d+1}} F_λ(P, w) denote the optimal solution to the SVM problem with respect to (P, u). A solution ŵ is a ξ-approximation to the SVM problem if F_λ(P, ŵ) ≤ F_λ(P, w^*) + ξ. Next, we formalize the coreset guarantee that we will strive for when constructing our data summaries.
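The objective of Definition 1 can be written down directly; the sketch below (helper names are our own) implements f_λ and F_λ under the bias-embedding convention above, i.e., a trailing 1-entry in each feature vector and no regularization of the bias coordinate:

```python
import numpy as np

def hinge(X, y, w):
    """h(p, w) = [1 - y<x, w>]_+ per point; X carries a trailing 1-column,
    so the last entry of w plays the role of the bias term b."""
    return np.maximum(0.0, 1.0 - y * (X @ w))

def svm_objective(X, y, u, w, lam):
    """F_lam(P, w) = sum_p u(p) f_lam(p, w), with
    f_lam(p, w) = ||w_{1:d}||^2 / (2 U(P)) + lam * h(p, w);
    the bias entry w[-1] is excluded from the regularizer."""
    U = u.sum()
    reg = (w[:-1] @ w[:-1]) / (2.0 * U)
    return float(np.sum(u * (reg + lam * hinge(X, y, w))))

# Two 1-d points embedded with a trailing bias coordinate of 1.
X = np.array([[1.0, 1.0], [-1.0, 1.0]])
y = np.array([1.0, -1.0])
u = np.ones(2)
```

For the separating direction w = (1, 0), both hinge terms vanish and the objective reduces to the regularizer ‖w_{1:d}‖²/2; at w = 0, it reduces to λ times the total weight, since every hinge term equals 1.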

Coresets.

A coreset is a compact representation of the full data set that provably approximates the SVM cost function (2) for every query w ∈ R^{d+1} – including that of the optimal solution w^*. We formalize this notion below for the SVM problem with objective function F_λ as in (2).

Definition 2 (ε-coreset)

Let ε ∈ (0, 1/2), and let (P, u) be the weighted set of training points as before. A weighted subset (S, v), where S ⊆ P and v : S → R_{≥0}, is an ε-coreset for (P, u) if, for every query w ∈ R^{d+1},

 |F_\lambda(P,w) - F_\lambda(S,w)| \le \varepsilon\, F_\lambda(P,w). \qquad (3)

This strong guarantee implies that the models trained on the coreset (S, v) with any off-the-shelf SVM solver will be approximately (and provably) as good as the optimal solution obtained by training on the entire data set P. This also implies that, if the size of the coreset is provably small, e.g., logarithmic in n (see Sec. 5), then an approximately optimal solution can be obtained much more quickly by training on S rather than P, leading to computational gains in practice for both offline and streaming data settings (see Sec. 6).

The difficulty lies in constructing coresets (i) efficiently, so that the preprocess-then-train pipeline takes less time than training on the full data set, and (ii) accurately, so that important data points – i.e., those that are imperative to obtaining accurate models – are not left out of the coreset, and redundant points are eliminated so that the coreset size is small. In the following sections, we introduce and analyze our coreset algorithm for the SVM problem.
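Definition 2 suggests a direct empirical probe: evaluate both costs on a collection of queries and inspect the relative discrepancy. The self-contained sketch below uses a simplified weighted hinge sum as a stand-in for F_λ, and an every-other-point subsample (illustrative only, not Alg. 1) as the summary:

```python
import numpy as np

def relative_errors(F_full, F_sub, queries):
    """Empirical probe of the coreset property (3): the relative
    discrepancy |F(P, w) - F(S, w)| / F(P, w) over a set of queries w."""
    return np.array([abs(F_full(w) - F_sub(w)) / F_full(w) for w in queries])

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
cost = lambda Xs, ys, u, w: np.sum(u * np.maximum(0.0, 1.0 - ys * (Xs @ w)))

F_P = lambda w: cost(X, y, np.ones(200), w)                   # full cost
F_S = lambda w: cost(X[::2], y[::2], 2.0 * np.ones(100), w)   # reweighted subsample
queries = rng.normal(size=(8, 3))
errs = relative_errors(F_P, F_S, queries)
```

The key property of (3) is the universal quantification over queries: the identity summary (S, v) = (P, u) achieves zero error for every w, while any strict subsample only approximates the full cost.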

4 Method

Our coreset construction scheme is based on the unified framework of [10, 18] and is shown in Alg. 1. The crux of our algorithm lies in generating the importance sampling distribution via efficiently computable upper bounds (proved in Sec. 5) on the importance of each point. Sufficiently many points are then sampled from this distribution, and each sampled point is given a weight that is inversely proportional to its sampling probability. The number of points required to generate an ε-coreset with probability at least 1 − δ is a function of the desired accuracy ε, the failure probability δ, and the complexity t of the data set (as in Theorem 5.3). Under mild assumptions on the problem at hand (see Sec. 0.A.4), the required sample size is polylogarithmic in n.

Our algorithm is an importance sampling procedure that first generates a judicious sampling distribution based on the structure of the input points, and then samples sufficiently many points from the original data set. The resulting weighted set of points (S, v) serves as an unbiased estimator of F_λ(P, w) for any query w ∈ R^{d+1}, i.e., E[F_λ(S, w)] = F_λ(P, w). Although sampling points uniformly with appropriate weights can also generate such an unbiased estimator, it turns out that the variance of this estimation is minimized if the points are sampled according to the distribution defined by the ratio between each point's sensitivity and the sum of sensitivities, i.e., s(p)/t (as in Alg. 1) [4].

4.1 Computational Complexity

Coresets are intended to provide efficient and provable approximations to the optimal SVM solution. However, the very first line of our algorithm entails computing an (approximately) optimal solution to the SVM problem. This seemingly eerie phenomenon is explained by the merge-and-reduce technique [13], which ensures that our coreset algorithm is only run against small partitions of the original data set [7, 13, 23]. The merge-and-reduce approach (depicted in Alg. 2 in Sec. 0.B of the appendix) leverages the fact that coresets are composable and reduces the coreset construction problem for a (large) set of n points into the problem of computing coresets for 2m points, where 2m is the minimum size of an input set that can be reduced to half using Algorithm 1 [7]. Assuming that the sufficient conditions for obtaining polylogarithmic size coresets implied by Theorem 5.3 hold, the overall time required is approximately linear in n.
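The merge-and-reduce tree can be sketched as follows (a simplified version of ours; the reducer below is a placeholder standing in for Alg. 1, which would be invoked only on small buffers):

```python
def merge_and_reduce(stream, reduce_fn, leaf_size=64):
    """Streaming summarization via the merge-and-reduce tree of [13]:
    buffer points into leaves; whenever two summaries of the same height
    exist, merge them (a union, which preserves the coreset property)
    and reduce the union back down with the offline construction."""
    buckets = {}                           # height -> summary awaiting a sibling

    def push(summary, h):
        while h in buckets:                # cascade merges up the tree
            summary = reduce_fn(buckets.pop(h) + summary)
            h += 1
        buckets[h] = summary

    batch = []
    for p in stream:
        batch.append(p)
        if len(batch) == leaf_size:
            push(reduce_fn(batch), 0)
            batch = []
    if batch:
        push(reduce_fn(batch), 0)

    out = []                               # union of the O(log n) open buckets
    for s in buckets.values():
        out.extend(s)
    return out

# Placeholder reducer: keep every other element (stands in for Alg. 1).
halve = lambda pts: pts[::2]
summary = merge_and_reduce(range(1000), halve, leaf_size=64)
```

Only O(log n) summaries are held in memory at any time, and the expensive offline construction is only ever applied to buffers of bounded size, which is why an (approximate) SVM solve inside the construction remains affordable.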

5 Analysis

In this section, we analyze the sample-efficiency and computational complexity of our algorithm. The outline of this section is as follows: we first formalize the importance (i.e., sensitivity) of each point and summarize the necessary conditions for the existence of small coresets. We then present the negative result that, in general, sublinear coresets do not exist for every data set (Lem. 5.2). Despite this, we show that we can obtain accurate approximations for the sensitivity of each point via an approximate k-means clustering (Sec. 5.3), and present non-vacuous, data-dependent bounds on the sample complexity (Thm. 5.3). Our technical results in full, with corresponding proofs, can be found in the Appendix.

5.1 Preliminaries

We will henceforth state all of our results with respect to the weighted set of training points (P, u), the regularization parameter λ ∈ [0, 1], and the SVM cost function F_λ (as in Sec. 3). The definition below rigorously quantifies the relative contribution of each point.

Definition 3 (Sensitivity [7])

The sensitivity of each point p ∈ P is given by

 s(p) = \sup_{w \in \mathbb{R}^{d+1}} \frac{u(p)\, f_\lambda(p,w)}{F_\lambda(P,w)}. \qquad (4)

Note that in practice, exact computation of the sensitivity is intractable, so we usually settle for (sharp) upper bounds on the sensitivity (e.g., as in Alg. 1). Sensitivity-based importance sampling then boils down to normalizing the sensitivities by the normalization constant – to obtain an importance sampling distribution – which in this case is the sum of sensitivities t = Σ_{p∈P} s(p). It turns out that the required size of the coreset is at least linear in t [7], which implies that one immediate necessary condition for sublinear coresets is t ∈ o(n).

5.2 Lower bound for Sensitivity

The next lemma shows that a sublinear-sized coreset cannot be constructed for every SVM problem instance. The proof of this result is based on demonstrating a hard point set for which the sum of sensitivities is Ω(n), ignoring λ and d factors, which implies that sensitivity-based importance sampling roughly boils down to uniform sampling for this data set. This in turn implies that if the regularization parameter λ is too large, e.g., λ ∈ Ω(d^2), and if n is sufficiently large (as in Big Data applications), then the required number of samples for property (3) to hold is Ω(n).

Lemma (Sensitivity Lower Bound). For an even integer d ≥ 2, there exists a set P of n weighted points such that

 \sum_{p \in P} s(p) \in \Omega\!\left(\frac{d^2 + n\lambda}{d^2 + \lambda}\right).

We next provide upper bounds on the sensitivity of each data point with respect to the complexity of the input data. Despite the non-existence result established above, our upper bounds shed light on the class of problems for which small-sized coresets are guaranteed to exist.

5.3 Sensitivity Upper Bound

In this subsection we present sharp, data-dependent upper bounds on the sensitivity of each point. Our approach is based on an approximate solution to the k-means clustering problem and to the SVM problem itself (as in Alg. 1). To this end, we will henceforth let k be a positive integer, ξ be the error of the (coarse) SVM approximation, and, for every i ∈ [k] and y ∈ {−1, +1}, let the clusters P_y^{(i)}, the quantities α_y^{(i)} and p_Δ, and the approximate optimum õpt_ξ be as in Algorithm 1.

Lemma (Sensitivity Upper Bound). Let k be a positive integer, ξ ≥ 0, and let (P, u) be a weighted set. Then for every i ∈ [k], y ∈ {−1, +1}, and p ∈ P_y^{(i)},

 s(p) \le \frac{u(p)}{U(P_y^{(i)})} + \lambda\, u(p)\, \frac{9}{2} \max\left\{ \frac{4}{9}\alpha_y^{(i)},\; \sqrt{4\big(\alpha_y^{(i)}\big)^2 + \frac{2\|p_\Delta\|_2^2}{9\,\widetilde{\mathrm{opt}}_\xi}} - 2\alpha_y^{(i)} \right\} = \gamma(p).

Lemma (Sum of Sensitivities). In the context of the lemma above, the sum of sensitivities is bounded by

 \sum_{p \in P} s(p) \le t = 4k + \sum_{i=1}^{k} \frac{3\lambda\big(\mathrm{Var}^{(i)}_{+} + \mathrm{Var}^{(i)}_{-}\big)}{\sqrt{2\,\widetilde{\mathrm{opt}}_\xi}},

where \mathrm{Var}^{(i)}_{y} = \sum_{p \in P_y^{(i)}} u(p)\, \|p_\Delta\|_2 for all i ∈ [k] and y ∈ {−1, +1}.
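The general recipe behind these bounds – cluster each label class, then charge each point a term for the weight of its cluster plus a term growing with its distance to the centroid – can be sketched as follows. The constants and the exact form of γ(p) are deliberately simplified here (and plain Lloyd iterations stand in for k-means++); only the lemma above is authoritative:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=None):
    """Plain Lloyd iterations (k-means++ seeding [3] omitted for brevity)."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - C[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):        # keep old centroid if cluster empties
                C[j] = X[labels == j].mean(axis=0)
    return C, labels

def sensitivity_bounds(X, u, lam, k=2, seed=None):
    """Simplified stand-in for gamma(p): a cluster-mass term
    u(p) / U(cluster of p), plus a term growing with ||p - c(p)||_2."""
    C, labels = kmeans(X, k, seed=seed)
    U_cluster = np.array([u[labels == j].sum() for j in range(k)])
    dist = np.linalg.norm(X - C[labels], axis=1)
    return u / U_cluster[labels] + lam * u * dist

# A tight cluster plus one far outlier: the outlier should dominate the bound.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)), [[10.0, 10.0]]])
u = np.ones(21)
bounds = sensitivity_bounds(X, u, lam=0.5, k=2, seed=1)
```

Points far from their centroid (or alone in a light cluster) receive large bounds and are therefore retained with high probability, which matches the intuition that such points are hard to represent by their neighbors.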

Theorem (ε-Coreset). For any ε ∈ (0, 1/2) and δ ∈ (0, 1), let m be an integer satisfying

 m \in \Omega\!\left(\frac{t}{\varepsilon^2}\left(d \log t + \log(1/\delta)\right)\right),

where t is as in the lemma above. Invoking Coreset (Alg. 1) with the inputs defined in this context yields an ε-coreset (S, v) with probability at least 1 − δ in O(T) time, where T represents the computational complexity of obtaining a ξ-approximate solution to the SVM problem and applying k-means++ to the positively and negatively labeled points.

We refer the reader to Sec. 0.A.4 of the Appendix for the sufficient conditions required for obtaining poly-logarithmic sized coresets, and to Sec. 0.C of the Appendix for additional details on the effect of the k-means clustering on the sensitivity bounds.

6 Results

In this section, we present experimental results that demonstrate and compare the effectiveness of our algorithm on a variety of synthetic and real-world data sets in offline and streaming data settings [20]. Our empirical evaluations demonstrate the practicality and widespread effectiveness of our approach: our algorithm consistently generated more compact and representative data summaries, and yet incurred a negligible increase in computational complexity when compared to uniform sampling. Additional results and details of our experimental setup and evaluations can be found in Sec. 0.D of the Appendix.

Evaluation

We considered real-world data sets of varying size and complexity as depicted in Table 1 (also see Sec. 0.D of the Appendix). For each data set of size n, we selected a set M of geometrically-spaced subsample sizes. For each sample size m ∈ M, we ran each algorithm (Alg. 1 or uniform sampling) to construct a subset S of size m. We then trained the SVM model as per usual on this subset to obtain an optimal solution w^*_S with respect to the coreset S, i.e., w^*_S = argmin_w F_λ(S, w). We then computed the relative error incurred by the solution computed on the coreset (w^*_S) with respect to the ground-truth optimal solution computed on the entire data set (w^*); see Corollary 1 in Sec. 0.A.4 of the Appendix. The results were averaged across trials.
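The evaluation protocol above can be sketched as follows (the geometric spacing and the relative-error convention shown here are one plausible reading; the paper's exact definition is in its appendix):

```python
import numpy as np

def geometric_sizes(n, num=5, start=32):
    """Geometrically spaced subsample sizes between `start` and n,
    mirroring the geometrically-spaced sweep described above."""
    return np.unique(np.round(np.geomspace(start, n, num=num)).astype(int))

def relative_error(cost_coreset_sol, cost_full_sol):
    """Relative error of the coreset solution w*_S, with both costs
    evaluated on the FULL objective: F(P, w*_S) / F(P, w*) - 1."""
    return cost_coreset_sol / cost_full_sol - 1.0

sizes = geometric_sizes(1024, num=4, start=32)
```

Evaluating both solutions on the full objective F_λ(P, ·) (rather than on the coreset objective) is what makes the comparison against uniform sampling meaningful: it measures how well a model trained on the summary generalizes back to the original data set.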

Figures 1 and 2 depict the results of our comparisons against uniform sampling in the offline setting. In Fig. 1, we see that the coresets generated by our algorithm are much more representative and compact than the ones constructed by uniform sampling: across all data sets and sample sizes, training on our coreset yields significantly better solutions to the SVM problem when compared to those generated by training on a uniform sample. For certain data sets, such as HTRU, Pathological, and W1, this relative improvement over uniform sampling is at least an order of magnitude, especially for small sample sizes. Fig. 1 also shows that, as a consequence of a more informed sampling scheme, the variance of each model's performance trained on our coreset is much lower than that of uniform sampling for all data sets.

Fig. 2 shows the total computational time required for constructing the sub-sample (i.e., coreset) and training the SVM on the subset to obtain w^*_S. We observe that our approach takes significantly less time than training on the original data set when considering non-trivial data sets, which underscores the efficiency of our method: we incur a negligible cost in the overall SVM training time due to a more involved coreset construction procedure, but benefit heavily in terms of the accuracy of the models generated (Fig. 1).

Next, we evaluate our approach in the streaming setting, where data points arrive one-by-one and the entire data set cannot be kept in memory, for the same data sets. The results of the streaming setting are shown in Fig. 3. The corresponding figure for the total computational time is shown as Fig. 5 in Sec. 0.E of the Appendix. Figs. 3 and 5 (in the Appendix) portray a similar trend as the one we observed in our offline evaluations: our approach significantly outperforms uniform sampling for all of the evaluated data sets and sample sizes, with negligible computational overhead.

In sum, our empirical evaluations demonstrate the practical efficiency of our algorithm and reaffirm the favorable theoretical guarantees of our approach: the additional computational complexity of constructing the coreset is negligible relative to that of uniform sampling, and the entire preprocess-then-train pipeline is significantly more efficient than training on the original massive data set.

7 Conclusion

We presented an efficient coreset construction algorithm for generating compact representations of the input data points that are provably competitive with the original data set in training Support Vector Machine models. Unlike prior approaches, our method and its theoretical guarantees naturally extend to streaming settings and scenarios involving dynamic data sets, where points are continuously inserted and deleted. We established instance-dependent bounds on the number of samples required to obtain accurate approximations to the SVM problem as a function of input data complexity and established dataset dependent conditions for the existence of compact representations. Our experimental results on real-world data sets validate our theoretical results and demonstrate the practical efficacy of our approach in speeding up SVM training. We conjecture that our coreset construction can be extended to accelerate SVM training for other classes of kernels and can be applied to a variety of Big Data scenarios.

References

• [1] P. K. Agarwal, S. Har-Peled, and K. R. Varadarajan (2005) Geometric approximation via coresets. Combinatorial and computational geometry 52, pp. 1–30. Cited by: §1.
• [2] P. K. Agarwal and R. Sharathkumar (2010) Streaming algorithms for extent problems in high dimensions. In Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms, pp. 1481–1489. Cited by: §2, §2.
• [3] D. Arthur and S. Vassilvitskii (2007) K-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 1027–1035. Cited by: Appendix 0.D.
• [4] O. Bachem, M. Lucic, and A. Krause (2017) Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476. Cited by: §4.
• [5] M. Badoiu and K. L. Clarkson (2003) Smaller core-sets for balls. In Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 801–802. Cited by: §2.
• [6] C. Baykal, L. Liebenwein, I. Gilitschenski, D. Feldman, and D. Rus (2018) Data-dependent coresets for compressing neural networks with applications to generalization bounds. arXiv preprint arXiv:1804.05345. Cited by: §1.
• [7] V. Braverman, D. Feldman, and H. Lang (2016) New frameworks for offline and streaming coreset constructions. arXiv preprint arXiv:1612.00889. Cited by: §1, §4.1, §5.1, Definition 3, §0.A.4.
• [8] K. L. Clarkson, E. Hazan, and D. P. Woodruff (2012) Sublinear optimization for machine learning. Journal of the ACM (JACM) 59 (5), pp. 23. Cited by: §2.
• [9] K. L. Clarkson (2010) Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms (TALG) 6 (4), pp. 63. Cited by: §2.
• [10] D. Feldman and M. Langberg (2011) A unified framework for approximating and clustering data. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pp. 569–578. Cited by: Appendix 0.B, §1, §4.
• [11] D. Feldman (2019) Core-sets: an updated survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, pp. e1335. Cited by: §1.1, §1.
• [12] B. Gärtner and M. Jaggi (2009) Coresets for polytope distance. In Proceedings of the twenty-fifth annual symposium on Computational geometry, pp. 33–42. Cited by: §2.
• [13] S. Har-Peled and S. Mazumdar (2004) On coresets for k-means and k-median clustering. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pp. 291–300. Cited by: §4.1.
• [14] S. Har-Peled, D. Roth, and D. Zimak (2007) Maximum margin coresets for active and noise tolerant learning.. In IJCAI, pp. 836–841. Cited by: §2, §2.
• [15] E. Hazan, T. Koren, and N. Srebro (2011) Beating sgd: learning svms in sublinear time. In Advances in Neural Information Processing Systems, pp. 1233–1241. Cited by: §2.
• [16] T. Joachims (2006) Training linear svms in linear time. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 217–226. Cited by: §2.
• [17] T. M. Kodinariya and P. R. Makwana (2013) Review on determining number of cluster in k-means clustering. International Journal 1 (6), pp. 90–95. Cited by: §0.C.1.
• [18] M. Langberg and L. J. Schulman (2010) Universal ε-approximators for integrals. In Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms, pp. 598–607. Cited by: Appendix 0.B, §1, §4.
• [19] Y. Li, P. M. Long, and A. Srinivasan (2001) Improved bounds on the sample complexity of learning. Journal of Computer and System Sciences 62 (3), pp. 516–527. Cited by: §0.A.4.
• [20] M. Lichman (2013) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: §6.
• [21] L. Liebenwein, C. Baykal, H. Lang, D. Feldman, and D. Rus (2019) Provable filter pruning for efficient neural networks. arXiv preprint arXiv:1911.07412. Cited by: §1.
• [22] G. Loosli and S. Canu (2007) Comments on the “Core Vector Machines: Fast SVM Training on Very Large Data Sets”. Journal of Machine Learning Research 8 (Feb), pp. 291–301. Cited by: §2.
• [23] M. Lucic, M. Faulkner, A. Krause, and D. Feldman (2017) Training mixture models at scale via coresets. arXiv preprint arXiv:1703.08110. Cited by: §4.1.
• [24] M. Nandan, P. P. Khargonekar, and S. S. Talathi (2014) Fast svm training using approximate extreme points.. Journal of Machine Learning Research 15 (1), pp. 59–98. Cited by: §2.
• [25] V. Nathan and S. Raghvendra (2014) Accurate streaming support vector machines. arXiv preprint arXiv:1412.2485. Cited by: §2.
• [26] P. Rai, H. Daumé III, and S. Venkatasubramanian (2009) Streamed learning: one-pass svms. arXiv preprint arXiv:0908.0572. Cited by: §2, §2.
• [27] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter (2011) Pegasos: primal estimated sub-gradient solver for svm. Mathematical programming 127 (1), pp. 3–30. Cited by: §1.1, §2.
• [28] I. W. Tsang, A. Kocsor, and J. T. Kwok (2007) Simpler core vector machines with enclosing balls. In Proceedings of the 24th international conference on Machine learning, pp. 911–918. Cited by: §2.
• [29] I. W. Tsang, J. T. Kwok, and P. Cheung (2005) Core vector machines: fast svm training on very large data sets. Journal of Machine Learning Research 6 (Apr), pp. 363–392. Cited by: §2.
• [30] V. N. Vapnik and V. Vapnik (1998) Statistical learning theory. Vol. 1, Wiley New York. Cited by: §0.A.4.
• [31] J. Yang, Y. Chow, C. Ré, and M. W. Mahoney (2017) Weighted SGD for ℓp regression with randomized preconditioning. arXiv preprint arXiv:1502.03571. Cited by: §0.A.1.

Appendix 0.A Proofs of the Analytical Results in Section 5

This section includes the full proofs of the technical results given in Sec. 5.

0.A.1 Proof of Lemma 5.2


Proof

Following [31], let P be a set of n labeled points in R^{d+1} × {−1, +1} with unit weights. For every p = (x, y) ∈ P, among the first d entries of x, a fixed number of entries are equal to

 \frac{y}{\sqrt{2d}},

where the remaining entries among the first d are set to 0. Hence, for our proof to hold, we assume that P contains all such combinations and at least one point of each label. For every p = (x, y) ∈ P, define the set of non-zero entries of x as the set

 B_p = \{\, i \in [d+1] : x_i \neq 0 \,\}.

For bounding the sensitivity of a point p ∈ P, consider the vector w ∈ R^{d+1} with entries defined as

 \forall i \in [d+1]: \quad w_i = \begin{cases} 0 & \text{if } i \in B_p, \\ \frac{1}{\sqrt{2d}} & \text{otherwise}. \end{cases}

Note that . We also have that since . To bound the sum of hinge losses contributed by other points , note that . Then for every ,

 y'\langle x', w\rangle = \sum_{i \in B_q \setminus B_p} y' x'_i w_i \;\ge\; \frac{1}{\sqrt{2d}} \cdot \sqrt{2d} = 1,

which implies that h(q, w) = 0 for every q ∈ P with q ≠ p, while h(p, w) = 1 since w vanishes on B_p. Thus,

 \sum_{q \in P} h(q, w) = 1.

Putting it all together,

 s(p) = \sup_{\substack{w' \in \mathbb{R}^{d+1} \\ F_\lambda(P,w') \neq 0}} \frac{f_\lambda(p,w')}{F_\lambda(P,w')} \;\ge\; \frac{\frac{d^2}{8n} + \lambda\, h(p,w)}{\frac{\|w\|_2^2}{2} + \lambda} = \frac{\frac{d^2}{8n} + \lambda}{\frac{d^2}{8} + \lambda}.

Since the above holds for every p ∈ P, summing the above inequality over every p ∈ P yields that

 \sum_{p \in P} s(p) \ge \frac{\frac{d^2}{8} + n\lambda}{\frac{d^2}{8} + \lambda} \in \Omega\!\left(\frac{d^2 + n\lambda}{d^2 + \lambda}\right).

0.A.2 Proof of Lemma 5.3


Proof

Let P_y ⊆ P denote the set of points with the same label y as p, as in Algorithm 1. Consider an optimal clustering of the points in P_y into k clusters with centroids being their means, and let P_y^{(i)} denote the i-th cluster, for every i ∈ [k] and y ∈ {−1, +1}. In addition, let (P_y^{(i)}, u) denote the corresponding weighted set for every i ∈ [k] and y ∈ {−1, +1}.

Fix p ∈ P and let i ∈ [k] be the index of the cluster to which p belongs, i.e., p ∈ P_y^{(i)}.

We first observe that [a + b]_+ ≤ [a]_+ + [b]_+ for any scalars a and b. This implies that, by definition of the hinge loss, we have for every q, q̂, and w

 h(q, w) \le h(\hat{q}, w) + [\langle \hat{q} - q,\, w \rangle]_+,

where [·]_+ = max{0, ·} as before. Hence, in the context of the definitions above,

 h(p,w) = h\big(p - c(p) + c(p),\, w\big) \qquad (5)
 \le h\big(c(p),\, w\big) + [\langle c(p) - yx,\, w\rangle]_+ \qquad (6)
 = h\big(c(p),\, w\big) + [\langle p_\Delta,\, w\rangle]_+. \qquad (7)

Now let the total weight of the points in P_y^{(i)} be denoted by U(P_y^{(i)}). Since c_y^{(i)} is the centroid of P_y^{(i)} (as described in Algorithm 1), we have c_y^{(i)} = \frac{1}{U(P_y^{(i)})} \sum_{q \in P_y^{(i)}} u(q)\, q. Observing that the hinge loss is convex, we invoke Jensen's inequality to obtain

 f_\lambda\big(c_y^{(i)},\, w\big) \le \frac{1}{U(P_y^{(i)})} \sum_{q \in P_y^{(i)}} u(q)\, f_\lambda(q, w) = \frac{F_\lambda(P,w) - F_\lambda\big(P \setminus P_y^{(i)},\, w\big)}{U(P_y^{(i)})}.

Applying the two inequalities established above to s(p) yields that

 \frac{s(p)}{u(p)} = \sup_w \frac{f_\lambda(p,w)}{F_\lambda(P,w)} \qquad (8)
 \le \sup_w \frac{f_\lambda\big(c_y^{(i)},\, w\big) + \lambda\,[\langle w, p_\Delta\rangle]_+}{F_\lambda(P,w)} \qquad (9)
 \le \sup_w \left( \frac{F_\lambda(P,w) - F_\lambda\big(P \setminus P_y^{(i)},\, w\big)}{U(P_y^{(i)})\, F_\lambda(P,w)} + \frac{\lambda\,[\langle w, p_\Delta\rangle]_+}{F_\lambda(P,w)} \right). \qquad (10)

By definition of F_\lambda, we have

 F_\lambda\big(P \setminus P_y^{(i)},\, w\big) \ge \frac{\|w_{1:d}\|_2^2\; U\big(P \setminus P_y^{(i)}\big)}{2\, U(P)}.

Continuing from above and dividing both sides by \lambda yields

 \frac{s(p)}{\lambda\, u(p)} \le \frac{1}{\lambda\, U(P_y^{(i)})} + \sup_w \frac{[\langle w, p_\Delta\rangle]_+ - \frac{\|w_{1:d}\|_2^2\; U(P \setminus P_y^{(i)})}{2\lambda\, U(P)\, U(P_y^{(i)})}}{F_\lambda(P,w)}
 \le \frac{1}{\lambda\, U(P_y^{(i)})} + \sup_w \frac{[\langle w, p_\Delta\rangle]_+ - \alpha_y^{(i)}\, \|w_{1:d}\|_2^2}{F_\lambda(P,w)},

where

 \alpha_y^{(i)} = \frac{U\big(P \setminus P_y^{(i)}\big)}{2\lambda\, U(P)\, U(P_y^{(i)})}. \qquad (13)

Let

 g(w) = \frac{[\langle w, p_\Delta\rangle]_+ - \alpha_y^{(i)}\, \|w_{1:d}\|_2^2}{F_\lambda(P,w)}

be the expression on the right-hand side of the sensitivity inequality above, and let \hat{w} \in \arg\max_w g(w). The rest of the proof will focus on bounding g(\hat{w}), since an upper bound on the sensitivity of a point as a whole follows directly from an upper bound on g(\hat{w}).

Note that by definition of p_\Delta and the embedding of the bias term into the (d+1)-st entry of the original d-dimensional points,

 \langle \hat{w}, p_\Delta \rangle = \langle \hat{w}_{1:d},\, (p_\Delta)_{1:d} \rangle,

where the equality holds since the (d+1)-st entry of p_\Delta is zero.

We know that \langle \hat{w}, p_\Delta \rangle - \alpha_y^{(i)}\, \|\hat{w}_{1:d}\|_2^2 > 0, since otherwise g(\hat{w}) \le 0, which contradicts the fact that \hat{w} is the maximizer of g. This implies that each entry j of the sub-gradient of g evaluated at \hat{w}, denoted by \nabla g(\hat{w})_j, is given by

 \nabla g(\hat{w})_j = \frac{\big((p_\Delta)_j - 2\alpha_y^{(i)} \hat{w}_j\big)\, F_\lambda(P,\hat{w}) - \nabla F_\lambda(P,\hat{w})_j\, \big(\langle \hat{w}, p_\Delta\rangle - \alpha_y^{(i)}\, \|\hat{w}_{1:d}\|_2^2\big)}{F_\lambda(P,\hat{w})^2}, \qquad (14)

and that \nabla g(\hat{w})_{d+1} = 0, since the bias term does not appear in the numerator of g.

Letting \gamma = \langle \hat{w}, p_\Delta\rangle - \alpha_y^{(i)}\, \|\hat{w}_{1:d}\|_2^2 and setting each entry of the gradient to 0, we solve for (p_\Delta)_{1:d} to obtain

 (p_\Delta)_{1:d} = \gamma\, \frac{\nabla F_\lambda(P,\hat{w})_{1:d}}{F_\lambda(P,\hat{w})} + 2\alpha_y^{(i)}\, \hat{w}_{1:d}.

This implies that

 \langle \hat{w}, p_\Delta \rangle = \gamma\, \frac{\langle \hat{w},\, \nabla F_\lambda(P,\hat{w})\rangle}{F_\lambda(P,\hat{w})} + 2\alpha_y^{(i)}\, \|\hat{w}_{1:d}\|_2^2.

Rearranging and using the definition of \gamma, we obtain

 \gamma = \gamma\, \frac{\langle \hat{w},\, \nabla F_\lambda(P,\hat{w})\rangle}{F_\lambda(P,\hat{w})} + \alpha_y^{(i)}\, \|\hat{w}_{1:d}\|_2^2, \qquad (15)

where Lemma 5.3 holds by taking \lambda\, u(p) outside the max term.

By using the same expression for (p_\Delta)_{1:d} from above, we also obtain that

 \|p_\Delta\|_2^2 = \langle p_\Delta, p_\Delta\rangle = \left\| \gamma\, \frac{\nabla F_\lambda(P,\hat{w})_{1:d}}{F_\lambda(P,\hat{w})} + 2\alpha_y^{(i)}\, \hat{w}_{1:d} \right\|_2^2 = \frac{\gamma^2}{F_\lambda(P,\hat{w})^2}\, \|\nabla F_\lambda(P,\hat{w})\|_2^2 + 4\big(\alpha_y^{(i)}\big)^2 \|\hat{w}_{1:d}\|_2^2 + 4\alpha_y^{(i)} \gamma\, \frac{\langle \hat{w},\, \nabla F_\lambda(P,\hat{w})\rangle}{F_\lambda(P,\hat{w})},

but \gamma\, \langle \hat{w},\, \nabla F_\lambda(P,\hat{w})\rangle / F_\lambda(P,\hat{w}) = \gamma - \alpha_y^{(i)}\, \|\hat{w}_{1:d}\|_2^2 by (15), and so continuing from above, we have

 \|p_\Delta\|_2^2 = \frac{\gamma^2}{F_\lambda(P,\hat{w})^2}\, \|\nabla F_\lambda(P,\hat{w})\|_2^2 + 4\big(\alpha_y^{(i)}\big)^2 \|\hat{w}_{1:d}\|_2^2 + 4\alpha_y^{(i)} \big(\gamma - \alpha_y^{(i)}\, \|\hat{w}_{1:d}\|_2^2\big)
 = \frac{\gamma^2}{F_\lambda(P,\hat{w})^2}\, \|\nabla F_\lambda(P,\hat{w})_{1:d}\|_2^2 + 4\alpha_y^{(i)} \gamma
 = \gamma^2 \tilde{x} + 4\alpha_y^{(i)} \gamma,

where \tilde{x} = \|\nabla F_\lambda(P,\hat{w})_{1:d}\|_2^2 / F_\lambda(P,\hat{w})^2. Solving the resulting quadratic equation for \gamma yields, for \tilde{x} \neq 0,

 \gamma = \frac{\sqrt{4\big(\alpha_y^{(i)}\big)^2 + \tilde{x}\, \|p_\Delta\|_2^2} - 2\alpha_y^{(i)}}{\tilde{x}}. \qquad (16)

Now we subdivide the rest of the proof into two cases. The first is the trivial case in which the sensitivity of the point is sufficiently small to be negligible, and the second is the involved case in which the point has a high influence on the SVM cost function and its contribution cannot be captured by the optimal solution or anything close to it.


Case

the bound on the sensitivity follows trivially from the analysis above.

Case

note that the assumption of this case implies that w^* cannot be the maximizer of g, i.e., \hat{w} \neq w^*. This follows from the convexity of the SVM loss function, which implies that the norm of the gradient of F_\lambda evaluated at w^* is 0. Thus, by (15):

 \gamma = \alpha_y^{(i)}\, \|w^*_{1:d}\|_2^2.

Since \|w^*_{1:d}\|_2^2 \le 2\, F_\lambda(P, w^*), we obtain

 s(p) \le \frac{\alpha_y^{(i)}\, \|w^*_{1:d}\|_2^2}{F_\lambda(P, w^*)} \le 2\alpha_y^{(i)}.

Hence, we know that for this case we have , , and so we obtain .

This implies that we can use Eq. (16) to upper bound the numerator of the sensitivity. Note that \gamma from (16) is decreasing as a function of \tilde{x}, and so it suffices to obtain a lower bound on \tilde{x}. To do so, let us focus on Eq. (15) and divide both sides of it by \gamma, to obtain

 1 = \frac{\langle \hat{w},\, \nabla F_\lambda(P,\hat{w})\rangle}{F_\lambda(P,\hat{w})} + \frac{\alpha_y^{(i)}\, \|\hat{w}_{1:d}\|_2^2}{\gamma}.

By rearranging the above equality, we have that

 \frac{\langle \hat{w},\, \nabla F_\lambda(P,\hat{w})\rangle}{F_\lambda(P,\hat{w})} = 1 - \frac{\alpha_y^{(i)}\, \|\hat{w}_{1:d}\|_2^2}{\gamma}. \qquad (17)

Recall that the last entry of p_\Delta is 0, so it follows from Eq. (14) that \nabla F_\lambda(P,\hat{w})_{d+1} is also zero, which implies that

 \langle \hat{w},\, \nabla F_\lambda(P,\hat{w})\rangle = \langle \hat{w}_{1:d},\, \nabla F_\lambda(P,\hat{w})_{1:d}\rangle \le \|\hat{w}_{1:d}\|_2\, \|\nabla F_\lambda(P,\hat{w})_{1:d}\|_2 = \|\hat{w}_{1:d}\|_2\, \|\nabla F_\lambda(P,\hat{w})\|_2, \qquad (18)

where the inequality is by Cauchy-Schwarz.

Combining Eq. (17) with Eq. (18) yields

 \frac{\|\hat{w}_{1:d}\|_2\, \|\nabla F_\lambda(P,\hat{w})\|_2}{F_\lambda(P,\hat{w})} \ge 1 - \frac{\alpha_y^{(i)}\, \|\hat{w}_{1:d}\|_2^2}{\gamma} \ge 1 - \frac{\alpha_y^{(i)}\, \|\hat{w}_{1:d}\|_2^2}{3\alpha_y^{(i)}\, F_\lambda(P,\hat{w})} \ge 1 - \frac{2\alpha_y^{(i)}\, F_\lambda(P,\hat{w})}{3\alpha_y^{(i)}\, F_\lambda(P,\hat{w})} = \frac{1}{3},

where the second inequality holds by the assumption of the case, and the third inequality follows from the fact that \|\hat{w}_{1:d}\|_2^2 \le 2\, F_\lambda(P,\hat{w}).

This implies that

 \frac{\|\nabla F_\lambda(P,\hat{w})\|_2}{F_\lambda(P,\hat{w})} \ge \frac{1}{3\, \|\hat{w}_{1:d}\|_2} \ge \frac{\sqrt{2}}{3\sqrt{F_\lambda(P,\hat{w})}}.

Hence, by definition of \tilde{x}, we have that

 \tilde{x} \ge \frac{2}{9\, F_\lambda(P,\hat{w})}. \qquad (19)

Plugging Eq.(19) into Eq.(