Coresets for Classification – Simplified and Strengthened

06/08/2021 ∙ by Tung Mai, et al. ∙ Adobe ∙ University of Massachusetts Amherst

We give relative error coresets for training linear classifiers with a broad class of loss functions, including the logistic loss and hinge loss. Our construction achieves (1±ϵ) relative error with Õ(d·μ_y(X)²/ϵ²) points, where μ_y(X) is a natural complexity measure of the data matrix X ∈ ℝ^{n×d} and label vector y ∈ {-1,1}^n, introduced by Munteanu et al. 2018. Our result is based on subsampling data points with probabilities proportional to their ℓ_1 Lewis weights. It significantly improves on existing theoretical bounds and performs well in practice, outperforming uniform subsampling along with other importance sampling methods. Our sampling distribution does not depend on the labels, so it can be used for active learning. It also does not depend on the specific loss function, so a single coreset can be used in multiple training scenarios.

1 Introduction

Coresets are an important tool in scalable machine learning. Given n data points and some objective function, we seek to select a subset of the points such that minimizing the objective function on that subset (possibly with the selected points weighted non-uniformly) yields a near minimizer of the objective over the full dataset. Coresets have been applied to problems ranging from clustering [HPM04, FL11], to principal component analysis [CEM15, FSS20], to linear regression [DMM06, DDH09, CWW19], to kernel density estimation [PT20], and beyond [AHPV05, BLK17, SS18].

We study coresets for linear classification. Given a data matrix X ∈ ℝ^{n×d} with rows x_1, …, x_n and a label vector y ∈ {−1,1}^n, the goal is to compute θ* = argmin_{θ ∈ ℝ^d} L(θ), where L(θ) = ∑_{i=1}^n f(y_i·⟨x_i, θ⟩) for a classification loss function f, such as the logistic loss f(z) = ln(1 + e^{−z}) used in logistic regression or the hinge loss f(z) = max(0, 1 − z) used in soft-margin SVMs.

We seek to select a subset S ⊆ [n] of points along with a corresponding set of weights {w_i}_{i∈S} such that, for some small ϵ > 0 and all θ ∈ ℝ^d,

(1 − ϵ)·∑_{i=1}^n f(y_i·⟨x_i, θ⟩) ≤ ∑_{i∈S} w_i·f(y_i·⟨x_i, θ⟩) ≤ (1 + ϵ)·∑_{i=1}^n f(y_i·⟨x_i, θ⟩).   (1)

This relative error coreset guarantee ensures that if θ̃ is computed to be the minimizer of the weighted loss over our selected points, then L(θ̃) ≤ (1+ϵ)/(1−ϵ)·L(θ*).
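To make the guarantee concrete, the following minimal sketch (our own illustration, not code from the paper) shows how a coreset would be used downstream: the weighted loss over the selected points stands in for the full loss during training. The subset here is just a uniform placeholder; the Lewis weight construction we analyze is sketched in Section 1.1.

```python
import numpy as np
from scipy.optimize import minimize

def full_loss(theta, X, y):
    # Full logistic loss: sum_i log(1 + exp(-y_i <x_i, theta>)), computed stably.
    return np.sum(np.logaddexp(0.0, -y * (X @ theta)))

def coreset_loss(theta, X, y, idx, w):
    # Weighted loss over the selected points only, as in guarantee (1).
    return np.sum(w * np.logaddexp(0.0, -y[idx] * (X[idx] @ theta)))

rng = np.random.default_rng(0)
n, d = 5000, 10
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d) + rng.normal(size=n))

# Placeholder subset: uniform sampling with unbiased weights n/m. A coreset
# satisfying (1) would guarantee that theta_tilde below is a near minimizer
# of the full loss, i.e. L(theta_tilde) <= (1+eps)/(1-eps) * L(theta_star).
m = 500
idx = rng.choice(n, size=m, replace=True)
w = np.full(m, n / m)

theta_tilde = minimize(coreset_loss, np.zeros(d), args=(X, y, idx, w)).x
theta_star = minimize(full_loss, np.zeros(d), args=(X, y)).x
print(full_loss(theta_tilde, X, y), full_loss(theta_star, X, y))
```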

It is well known that common classification loss functions, such as the log and hinge losses, do not admit relative error coresets with o(n) points in general. To address this issue, Munteanu et al. [MSSW18] introduce a natural notion of the complexity of the matrix X and label vector y, which we also use to parameterize our results.

Definition 1 (Classification Complexity Measure [MSSW18]).

For any X ∈ ℝ^{n×d} and y ∈ {−1,1}^n, let μ_y(X) = sup_{θ ∈ ℝ^d, θ ≠ 0} ‖(D_y X θ)^+‖_1 / ‖(D_y X θ)^−‖_1, where D_y is a diagonal matrix with y as its diagonal, and (D_y X θ)^+ and (D_y X θ)^− denote the positive and negative entries of D_y X θ, respectively.

Roughly, μ_y(X) is large when there is some parameter vector θ that produces a significant imbalance between correctly classified and misclassified points. This can occur, e.g., when the data is exactly separable. However, as argued in [MSSW18], we typically expect μ_y(X) to be small.
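Since μ_y(X) is defined by a supremum over all parameter vectors, it is not straightforward to evaluate exactly. The sketch below (our own illustration, not a procedure from the paper or from [MSSW18]) simply evaluates the ratio in Def. 1 at random directions, which certifies a lower bound on μ_y(X); a very large value, or an infinite one on separable data, signals a hard instance.

```python
import numpy as np

def mu_ratio(X, y, theta):
    # Ratio of positive to negative mass of D_y X theta, as in Def. 1.
    z = y * (X @ theta)
    pos, neg = z[z > 0].sum(), -z[z < 0].sum()
    return np.inf if neg == 0 else pos / neg

def mu_lower_bound(X, y, trials=10000, seed=0):
    # Heuristic: the true mu_y(X) is a supremum over all theta, so evaluating
    # the ratio at random directions only certifies a lower bound.
    rng = np.random.default_rng(seed)
    return max(mu_ratio(X, y, rng.normal(size=X.shape[1])) for _ in range(trials))
```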

1.1 Our Results

Our main result, formally stated in Corollary 9, is that sampling Õ(d·μ_y(X)²/ϵ²) points according to the ℓ_1 Lewis weights of the data matrix X, and reweighting appropriately, yields a relative error coreset satisfying (1) for the logistic loss, the hinge loss, and more generally a broad class of 'hinge-like' losses. This significantly improves on the previous state of the art under the same parameterization, due to [MSSW18]. See Table 1 for a detailed comparison with prior work.

Samples | Error | Loss | Assumptions | Distribution | Ref.
Õ(d·μ_y(X)²/ϵ²) | relative | log, hinge, ReLU | Def. 1 | ℓ_1 Lewis | Cors. 6, 9
 | relative | log | Def. 1 | sqrt lev. scores | [MSSW18]
 | relative | log | Def. 1 | sqrt lev. scores | [MSSW18]
 | relative | log, hinge | regularization | uniform | [CIM19]
 | additive | log | | deterministic | [KL19]

Table 1: Comparison to prior work. Õ(·) hides logarithmic factors in the problem parameters. We note that the bounded norm assumption of [CIM19] can be removed by simply scaling X, giving a dependence on the maximum row norm of X in the sample complexity. The importance sampling distributions of our work and of [MSSW18] are both in fact mixtures with uniform sampling. Our work and [KL19, CIM19] generalize to broader classes of loss functions – for simplicity, here we focus just on the important logistic loss, hinge loss, and ReLU.

Theoretical Approach. The ℓ_1 Lewis weights are a measure of the importance of the rows of a matrix A ∈ ℝ^{n×d}, originally designed for sampling rows so as to preserve ‖Aθ‖_1 for any θ ∈ ℝ^d [CP15]. They can be viewed as an ℓ_1 generalization of the leverage scores, which are used in applications where one seeks to preserve ‖Aθ‖_2 [CLM15]. Like the leverage scores, the ℓ_1 Lewis weights can be approximated very efficiently, in Õ(nnz(A) + d^ω) time, where ω is the exponent of fast matrix multiplication. They can also be approximated in streaming and online settings [BDM20]. Our coreset constructions directly inherit these computational properties.
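For concreteness, here is a simple dense fixed-point iteration for approximating the ℓ_1 Lewis weights, in the spirit of the iterative scheme analyzed in [CP15]. This unoptimized sketch is our own illustration; the fast input-sparsity-time and streaming algorithms cited above are more involved.

```python
import numpy as np

def l1_lewis_weights(A, num_iters=30):
    """Approximate the l1 Lewis weights of A via the fixed-point iteration
    w_i <- sqrt(a_i^T (A^T W^{-1} A)^+ a_i), with W = diag(w), which the true
    weights satisfy with equality (see Definition 2 in Section 2)."""
    n, d = A.shape
    w = np.full(n, d / n)                           # any positive initialization
    for _ in range(num_iters):
        M = np.linalg.pinv(A.T @ (A / w[:, None]))  # (A^T W^{-1} A)^+
        q = np.einsum("ij,jk,ik->i", A, M, A)       # quadratic forms a_i^T M a_i
        w = np.sqrt(np.maximum(q, 1e-32))
    return w
```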

The ℓ_1 Lewis weights are a natural sampling distribution for hinge-like loss functions, including the logistic loss, hinge loss, and the ReLU. These functions grow approximately linearly for positive z, but asymptote at 0 for negative z. Thus, ignoring some technical details, it can be shown that ∑_i f(y_i⟨x_i, θ⟩) concentrates only better under sampling than ‖D_y X θ‖_1 itself.

As shown by Cohen and Peng [CP15], taking Õ(d/ϵ²) samples according to the ℓ_1 Lewis weights of D_y X (which are the same as those of X) suffices to approximate ‖D_y X θ‖_1 for all θ up to (1±ϵ) relative error. We show in Thm. 8, using contraction bounds for Rademacher averages, that such a sample in turn suffices to approximate the full classification loss up to additive error roughly ϵ·(‖D_y X θ‖_1 + n). We then simply show in Corollaries 6 and 9 that, by setting ϵ appropriately (roughly ϵ/μ_y(X)) and applying Def. 1, this result yields a relative error coreset for a broad class of hinge-like loss functions, including the ReLU, the log loss, and the hinge loss.
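Putting these pieces together, a minimal sketch of the resulting coreset construction (our own illustration of the approach, omitting the mixture with uniform sampling noted in Table 1) is:

```python
import numpy as np

def lewis_coreset(X, y, m, lewis_weight_fn, seed=0):
    """Sample m points with probability proportional to (approximate) l1 Lewis
    weights and return indices plus unbiased weights for the weighted loss in (1).
    `lewis_weight_fn` is, e.g., the l1_lewis_weights sketch above. Flipping row
    signs by the labels does not change the Lewis weights, so the sampling
    distribution is label independent."""
    rng = np.random.default_rng(seed)
    p = lewis_weight_fn(X)
    p = p / p.sum()                               # sampling probabilities
    idx = rng.choice(len(p), size=m, replace=True, p=p)
    weights = 1.0 / (m * p[idx])                  # makes the weighted sum unbiased
    return idx, weights
```

In the notation of (1), `idx` plays the role of the subset S and `weights` the role of the w_i; these can be plugged directly into a weighted training routine such as the sketch shown after (1).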

Experimental Evaluation. In Section 5, we compare our Lewis weight-based method to the square root of leverage score method of [MSSW18], to uniform sampling as studied in [CIM19], and to an oblivious sketching algorithm of [MOW21]. We study performance in minimizing both the log and hinge losses, with and without regularization. We observe that our method typically far outperforms uniform sampling, even in some cases when regularization is used. It performs comparably to the method of [MSSW18], seeming to outperform it when the complexity parameter μ_y(X) is large.

1.2 Related Work

Our work is closely related to [MSSW18], which introduces the complexity measure μ_y(X). They give relative error coresets with worse polynomial dependences on the problem parameters through a mixture of uniform sampling and sampling by the square roots of the leverage scores. This approach has the same intuition as ours – the square root leverage score sampling preserves the 'linear part' of the hinge-like loss function and the uniform sampling preserves the part where the loss asymptotes. However, like many other works on coresets for logistic regression and other problems [HCB16, TF18, CIM19], the analysis of Munteanu et al. centers on the sensitivity framework. At best, this framework can achieve sample complexity quadratic in the dimension d – one factor of d comes from the total sensitivity of the problem, and the other from a VC dimension bound on the set of linear classifiers. To the best of our knowledge, our work is the first that avoids this sensitivity framework – Lewis weight sampling results are based on matrix concentration results and give an optimal linear dependence on the dimension d.

Regularized Classification Losses. Rather than using the parameterization of Def. 1, several other works [TF18, CIM19] achieve relative error coresets for the log and hinge losses by assuming that the loss function is regularized by an additive term λ·R(θ), where λ is some parameter and R is some norm or its square – e.g., ‖θ‖_1, ‖θ‖_2, or, in the important case of soft-margin SVM, ‖θ‖_2².

Curtin et al. show that simple uniform sampling gives a relative error coreset with points in this setting [CIM19]. They also show that no coreset with points exists. In Appendix A, we tighten this lower bound, showing via a reduction from the INDEX problem in communication complexity that the bound achieved by uniform sampling is in fact optimal.

Our theoretical results are incomparable to those of [CIM19]. Empirically, though, Lewis weight sampling often far outperforms uniform sampling – see Sec. 5. Note that our results do directly apply in the regularized setting – our relative error can only improve. However, our theoretical bounds do not actually improve with regularization, still depending on μ_y(X), which [CIM19] avoids.
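Concretely, and continuing the earlier hypothetical sketch, applying a coreset in the regularized setting just means adding the regularizer (here ℓ_2², chosen for illustration) to the weighted objective, since the regularizer does not depend on the data points:

```python
import numpy as np

def regularized_coreset_loss(theta, X, y, idx, w, lam):
    # Weighted coreset loss plus the exact regularization term; the relative
    # error guarantee on the data-dependent sum is unaffected by adding it.
    data_term = np.sum(w * np.logaddexp(0.0, -y[idx] * (X[idx] @ theta)))
    return data_term + lam * np.dot(theta, theta)
```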

Other Related Work. Wang, Zhu, and Ma [WZM18] take a statistical perspective on subsampling for logistic regression, studying optimal subsampling strategies in the limit as the number of data points goes to infinity. Their strategies do not yield finite sample coresets and cannot be implemented without fully solving the original logistic regression problem; however, they suggest a heuristic approximation approach. Ting and Brochu also study this asymptotic regime, suggesting sampling by the data point influence functions, which are related to the leverage scores [TB18]. Less directly, our work is connected to sampling and sketching algorithms for linear regression under different loss functions, often using variants of the leverage scores or Lewis weights [DDH09, CW14, ALS18, CWW19, CD21]. It is also related to work on sketching methods that preserve the norms of vectors under nonlinear transformations, like the ReLU, often with applications to coresets or compressed sensing for neural networks [BJPD17, BOB20, GM21].

2 Preliminaries

Notation. Throughout, for f: ℝ → ℝ and a vector v ∈ ℝ^n, we let f(v) denote the entrywise application of f to v. For a vector v we let v_i denote its i-th entry. So, e.g., for non-negative f, ∑_{i=1}^n f((Xθ)_i) = ‖f(Xθ)‖_1.

For a data matrix X ∈ ℝ^{n×d} with rows x_1, …, x_n and label vector y ∈ {−1,1}^n, we consider classification loss functions of the form L(θ) = ∑_{i=1}^n f((D_y X θ)_i), where D_y is the diagonal matrix with y on its diagonal. For simplicity, we write X instead of D_y X throughout, since we can think of the labels as just being incorporated into X by flipping the signs of its rows. Similarly, we write the complexity parameter of Def. 1 as μ(X).

Throughout, we will call f(z) = ln(1 + e^z) the logistic loss and f(z) = max(0, 1 + z) the hinge loss. Note that these functions have the sign of z flipped from the typical convention. This is just notational – we can always negate X or θ and obtain an identical loss function. We use these versions as they are both close to the ReLU function max(0, z), a fact that we will leverage in our analysis.
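As a quick numerical check of this closeness (our own illustration, using the flipped-sign conventions above), both losses stay uniformly within a constant of the ReLU:

```python
import numpy as np

z = np.linspace(-30, 30, 10001)
relu = np.maximum(0.0, z)
hinge = np.maximum(0.0, 1.0 + z)         # flipped-sign hinge loss
logistic = np.logaddexp(0.0, z)          # flipped-sign logistic loss ln(1 + e^z)

# The hinge loss is within 1 of the ReLU, the logistic loss within ln(2).
assert np.max(np.abs(hinge - relu)) <= 1.0 + 1e-9
assert np.max(np.abs(logistic - relu)) <= np.log(2.0) + 1e-9
```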

Basic sampling results. Our coreset construction is based on sampling with the ℓ_1 Lewis weights. We define these weights and state fundamental results on Lewis weight sampling and Rademacher complexity below.

Definition 2 (ℓ_1 Lewis Weights [CP15]).

For any A ∈ ℝ^{n×d}, the ℓ_1 Lewis weights are the unique values w_1(A), …, w_n(A) such that, letting W be the diagonal matrix with these weights as its diagonal, for all i ∈ [n],

w_i(A) = ( a_i^T (A^T W^{−1} A)^+ a_i )^{1/2},

where for any matrix M, M^+ is the pseudoinverse, with M^+ = M^{−1} when M is square and full-rank.

Theorem 3 (ℓ_1 Lewis Weight Sampling).

Consider any A ∈ ℝ^{n×d} and ϵ ∈ (0, 1), and a set of sampling values p_1, …, p_n with N = ∑_{i=1}^n p_i and p_i ≥ c·w_i(A)·log(d/ϵ)/ϵ² for all i, where c is a universal constant. If we generate a matrix S ∈ ℝ^{N×n} with each row chosen independently as the i-th standard basis vector times 1/p_i with probability p_i/N, then there exists an ℓ ≥ 1 such that, if σ ∈ {−1, 1}^N is chosen with independent Rademacher entries,

E_{S,σ}[ sup_{θ: ‖Aθ‖_1 ≤ 1} | ∑_{j=1}^N σ_j·(SAθ)_j |^ℓ ]^{1/ℓ} ≤ ϵ.

In particular, if each p_i is a scaling of a constant factor approximation to the ℓ_1 Lewis weight w_i(A), S has N = Õ(d/ϵ²) rows.

Theorem 3 is implicit in [CP15], following from the proof of Lemma 7.4, which shows a high probability bound on sup_{θ: ‖Aθ‖_1 ≤ 1} | ‖SAθ‖_1 − ‖Aθ‖_1 | via the moment bound stated above. This moment bound is proven on page 29 of the arXiv version. We will translate the above moment bound into approximation bounds for classification loss functions like the ReLU, logistic loss, and hinge loss, using the following standard result on Rademacher complexities:

Theorem 4 (Ledoux-Talagrand contraction, c.f. [Duc]).

Consider T ⊆ ℝ^N, along with 1-Lipschitz functions φ_1, …, φ_N satisfying φ_i(0) = 0. Then for any ℓ ≥ 1, if σ ∈ {−1, 1}^N is chosen with independent Rademacher entries,

E_σ[ sup_{t ∈ T} | ∑_{i=1}^N σ_i·φ_i(t_i) |^ℓ ]^{1/ℓ} ≤ 2·E_σ[ sup_{t ∈ T} | ∑_{i=1}^N σ_i·t_i |^ℓ ]^{1/ℓ}.

3 Warm Up: Coresets for ReLU Regression

We start by showing that ℓ_1 Lewis weight sampling yields a (1 ± ϵ)-relative error coreset for ReLU regression under the complexity assumption of Def. 1. Our proofs for the log loss, hinge loss, and other hinge-like loss functions will follow a similar structure, with some added complexities.

We first show that Lewis weight sampling gives a coreset with additive error ϵ·‖Xθ‖_1. By instead sampling with error parameter ϵ/(1 + μ(X)), we then easily obtain a relative error coreset under the assumption of Def. 1.

Theorem 5 (ReLU Regression – Additive Error Coreset).

Consider X ∈ ℝ^{n×d} and let f(z) = max(0, z) for all z. For a set of sampling values p_1, …, p_n with N = ∑_{i=1}^n p_i and p_i ≥ c·w_i(X)·log(d/ϵ)/ϵ² for all i, where c is a universal constant, if we generate S ∈ ℝ^{N×n} with each row chosen independently as the i-th standard basis vector times 1/p_i with probability p_i/N, then with probability at least 99/100, for all θ ∈ ℝ^d,

| ‖f(SXθ)‖_1 − ‖f(Xθ)‖_1 | ≤ ϵ·‖Xθ‖_1.

If each p_i is a scaling of a constant factor approximation to the ℓ_1 Lewis weight w_i(X), S has N = Õ(d/ϵ²) rows.

Corollary 6 (ReLU Regression – Relative Error Coreset).

Consider the setting of Theorem 5, where now N = ∑_i p_i and p_i ≥ c·w_i(X)·(1 + μ(X))²·log(d/ϵ)/ϵ² for all i. With probability at least 99/100, for all θ ∈ ℝ^d,

(1 − ϵ)·∑_{i=1}^n f((Xθ)_i) ≤ ‖f(SXθ)‖_1 ≤ (1 + ϵ)·∑_{i=1}^n f((Xθ)_i).

If each p_i is a scaling of a constant factor approximation to the ℓ_1 Lewis weight w_i(X), S has N = Õ(d·μ(X)²/ϵ²) rows.

Proof of Corollary 6.

We have

∑_{i=1}^n f((Xθ)_i) = ‖(Xθ)^+‖_1.   (2)

Additionally, since by Def. 1 (applied to −θ) we have ‖(Xθ)^−‖_1 ≤ μ(X)·‖(Xθ)^+‖_1,

‖Xθ‖_1 = ‖(Xθ)^+‖_1 + ‖(Xθ)^−‖_1 ≤ (1 + μ(X))·‖(Xθ)^+‖_1.   (3)

Combining (2) with (3) gives that ∑_{i=1}^n f((Xθ)_i) ≥ ‖Xθ‖_1/(1 + μ(X)), which then completes the corollary after applying Theorem 5 with ϵ' = ϵ/(1 + μ(X)). ∎

Proof of Theorem 5.

We prove the theorem restricted to θ such that ‖Xθ‖_1 = 1. Since the ReLU function is linear, in the sense that f(cz) = c·f(z) for any c ≥ 0, this yields the complete theorem via scaling. It suffices to prove that there exists some ℓ ≥ 1 such that

The theorem then follows via Markov's inequality and the monotonicity of z^ℓ for z ≥ 0. Via a standard symmetrization argument (c.f. the proof of Theorem 7.4 in [CP15]) we have

where σ has independent Rademacher random entries. We can then apply, for each fixed value of S, the Ledoux-Talagrand contraction theorem (Theorem 4) with T = {SXθ : ‖Xθ‖_1 ≤ 1} and φ_i = f for all i. f is 1-Lipschitz with f(0) = 0. This gives

for some ℓ by Theorem 3. This completes the theorem after adjusting ϵ by a constant factor, which only affects the sample complexity by a constant factor. ∎
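As an empirical sanity check of this warm-up (our own illustration; `probs` should be normalized approximate ℓ_1 Lewis weights, e.g. from the sketch in Section 1.1), one can compare the subsampled and full ReLU objectives over random directions:

```python
import numpy as np

def relu_additive_error(X, probs, m, trials=100, seed=0):
    """Return the largest observed value of
    |sum_j w_j ReLU((X theta)_{i_j}) - sum_i ReLU((X theta)_i)| / ||X theta||_1
    over random theta; Theorem 5 predicts this is small for m = O~(d / eps^2)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    idx = rng.choice(n, size=m, replace=True, p=probs)
    w = 1.0 / (m * probs[idx])
    worst = 0.0
    for _ in range(trials):
        theta = rng.normal(size=d)
        full = np.sum(np.maximum(0.0, X @ theta))
        sub = np.sum(w * np.maximum(0.0, X[idx] @ theta))
        worst = max(worst, abs(sub - full) / np.sum(np.abs(X @ theta)))
    return worst
```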

4 Extension to Hinge-Like Loss Functions

We next extend Theorem 5 to a family of 'nice hinge functions', which includes the hinge loss max(0, 1 + z) and the log loss ln(1 + e^z). These functions present two additional challenges: 1) they are generally not linear, in the sense that f(cz) ≠ c·f(z), a property which is used in the proof of Theorem 5 to restrict to considering θ with ‖Xθ‖_1 = 1, and 2) they are not contractions with f(0) = 0, a property which was used to apply the Ledoux-Talagrand contraction theorem.

Definition 7 (Nice Hinge Function).

We call f: ℝ → ℝ an (L, a, b)-nice hinge function if, for fixed constants L, a, and b,

(1) f is L-Lipschitz     (2) |f(z) − max(0, z)| ≤ a for all z     (3) f(z) ≥ b for all z ≥ 0.

We start with an additive error coreset result for nice hinge functions. We then show that, under the additional assumption that μ(X) is bounded (Def. 1), the additive error achieved is small compared to ∑_i f((Xθ)_i), yielding a relative error coreset. This gives our main results for both the hinge loss and the log loss, which are (1, 1, 1)-nice and (1, ln 2, ln 2)-nice hinge functions respectively.

Theorem 8 (Nice Hinge Function – Additive Error Coreset).

Consider X ∈ ℝ^{n×d} and let f be an (L, a, b)-nice hinge function (Def. 7). For a set of sampling values p_1, …, p_n with N = ∑_{i=1}^n p_i and p_i ≥ β·w_i(X) for all i, where β = c·log(d/ϵ)/ϵ² and c is a fixed constant, if we generate S ∈ ℝ^{N×n} with each row chosen independently as the i-th standard basis vector times 1/p_i with probability p_i/N, then with probability at least 99/100, the weighted sample defined by S approximates ∑_{i=1}^n f((Xθ)_i) up to additive error roughly ϵ·(‖Xθ‖_1 + n), simultaneously for all θ ∈ ℝ^d.

Observe that for a fixed function f, the constants L, a, and b are constant, and so, if each p_i is a scaling of a constant factor approximation to w_i(X), S has N = Õ(d/ϵ²) rows.

Proof.

Let for some constant . We will show that for each integer , with probability at least ,

(4)

Via a union bound this gives the theorem for all with . We then just need to handle the case of θ with norm outside this range – i.e., when the norm is polynomially small or polynomially large in n and the other problem parameters. We will take a union bound over the failure probabilities for these cases and, after adjusting by a constant, we have the complete theorem. We make the argument for θ outside this range first.

Small Norm. For θ with , . Thus, for all . Then, by the triangle inequality and the fact that :

(5)

where is the value of the single nonzero entry in the corresponding row of S, which samples index i from [n]. Let be i.i.d., each taking value with probability for all i. Then

For all we have , so applying a Bernstein bound, if the constant is chosen large enough we have:

(6)

Combining (6) with (5), with probability at least , we have

Adjusting constants on , this gives the theorem for with .

Large Norm. We next consider θ with . Since by assumption for all , we can apply the triangle inequality to give, for any ,

Applying Theorem 5 and the bound on given in (6), we thus have, with probability at least , for all with ,

where the final bound uses that for a large enough constant . This gives the theorem for with .

Bounded Norm. We now return to proving that (4) holds for any with probability at least . Let . Then for any we have:

We again apply the bound on given in (6) and the fact that . This gives that with probability at least , for all ,

(7)

Now, for , by a standard symmetrization argument (c.f. the proof of Theorem 7.4 in [CP15]),

where σ has independent Rademacher random entries. We can then apply, for each fixed value of S, the Ledoux-Talagrand contraction theorem (Thm. 4) with and . Note that since . Additionally, is -Lipschitz since by assumption f is L-Lipschitz, so is -Lipschitz. We have,

So, applying Theorem 3, for some we have, since for ,

Adjusting by a constant, this gives via Markov’s inequality that with probability at least ,

(8)

In combination with (7), we then have that with probability at least ,

This gives (4) and completes the theorem. ∎

4.1 Relative Error Coresets

Our relative error coreset result for nice hinge functions follows as a simple corollary of Theorem 8.

Corollary 9 (Nice Hinge Function – Relative Error Coreset).

Consider the setting of Theorem 8, under the additional assumption that μ(X) is bounded as in Def. 1. If N = ∑_i p_i and p_i ≥ β·w_i(X) for all i, where β = c·μ(X)²·log(d/ϵ)/ϵ² and c is a fixed constant, then with probability at least 99/100, for all θ ∈ ℝ^d, the weighted sample defined by S achieves the (1 ± ϵ) relative error guarantee of (1) for ∑_{i=1}^n f((Xθ)_i).

Proof.

By (3), proven in Corollary 6, and using the fact that f is (L, a, b)-nice,

(9)

Let . Now we claim that . If then this holds immediately since , and