# Robust Coreset for Continuous-and-Bounded Learning (with Outliers)

In this big data era, we often confront large-scale data in many machine learning tasks. A common approach for dealing with large-scale data is to build a small summary, e.g., coreset, that can efficiently represent the original input. However, real-world datasets usually contain outliers and most existing coreset construction methods are not resilient against outliers (in particular, the outliers can be located arbitrarily in the space by an adversarial attacker). In this paper, we propose a novel robust coreset method for the continuous-and-bounded learning problem (with outliers) which includes a broad range of popular optimization objectives in machine learning, like logistic regression and k-means clustering. Moreover, our robust coreset can be efficiently maintained in fully-dynamic environment. To the best of our knowledge, this is the first robust and fully-dynamic coreset construction method for these optimization problems. We also conduct the experiments to evaluate the effectiveness of our robust coreset in practice.

## 1 Introduction

With the rapid increase of data volume in this big data era, we often need to develop low-complexity (e.g., linear or even sublinear) algorithms for machine learning tasks. Moreover, our dataset is often maintained in a dynamic environment, so we have to consider more complicated issues like data insertion and deletion. For example, in the recent article [GGV+19], Ginart et al. discussed the scenario where some sensitive training data has to be deleted for the purpose of privacy preservation. Obviously, it is prohibitive to re-train our model whenever the training data changes dynamically, if the data size is extremely large. To remedy these issues, a natural way is to construct a small-sized summary of the training data so that we can run existing algorithms on the summary rather than on the whole data. The coreset [FEL20], which was originally studied in the community of computational geometry [AHV04], has become a widely used data summary for large-scale machine learning. As a succinct data compression technique, the coreset also enjoys a number of other nice properties. For instance, coresets are usually composable and thus can be applied in environments like distributed computing [IMM+14]. Also, small coresets can be maintained by streaming algorithms [AHV04, BS80] and by fully-dynamic algorithms that support both data insertion and deletion [HK20].

However, the existing coreset construction methods are still far from satisfactory in practice. A major bottleneck is that most of them are sensitive to outliers. Real-world datasets are usually noisy and may contain outliers; note that the outliers can be located arbitrarily in the space, and even one outlier can significantly destroy the final machine learning result. A typical example is the poisoning attack, where an adversarial attacker may inject several specially crafted samples into the training data, making the decision boundary severely deviate and causing unexpected misclassification [BR18]. To see why the existing coreset methods are sensitive to outliers, we can take the popular sampling based coreset framework [FL11] as an example. The framework needs to compute a “sensitivity” for each data item, which measures the importance of the data item to the whole data set; however, it tends to assign high sensitivities to points that are far from the majority of the data, that is, an outlier is likely to have a high sensitivity and thus a high chance of being selected into the coreset. Obviously, a coreset obtained this way is not appropriate, since we expect the coreset to contain inliers rather than outliers. It is even more challenging to construct a fully-dynamic robust coreset. The existing robust coreset construction methods [FL11, HJL+18] often rely on simple uniform sampling and are efficient only when the number of outliers is a constant factor of the input size (we will discuss this issue in Section 3.1). Other outlier-resistant data summary methods like [GKL+17, CAZ18] usually yield large approximation factors and, to our knowledge, are unlikely to be maintainable in a fully-dynamic scenario.

### 1.1 Our Contributions

In this paper, we propose a unified fully-dynamic robust coreset framework for a class of optimization problems termed continuous-and-bounded learning. This type of learning problem actually covers a broad range of optimization objectives in machine learning [SB14, Chapter 12.2.2]. Roughly speaking, “continuous-and-bounded learning” requires that the optimization objective is a continuous function (e.g., smooth or Lipschitz), and meanwhile the solution is restricted within a bounded region. We emphasize that this “bounded” assumption is actually quite natural in real machine learning scenarios. Consider the unsupervised optimization problem of facility location as an example. Suppose we are going to build a supermarket in a city. Though there are many available candidate locations in the city, we often restrict our choice to some specific districts due to other practical factors. Moreover, it is also reasonable to bound the solution range in a dynamic environment, because one single update (insertion or deletion) is unlikely to dramatically change the solution.

Our coreset construction is a novel hybrid framework. First, we suppose that there exists an ordinary coreset construction method for the given continuous-and-bounded optimization objective (without considering outliers). We partition the input data into two parts: the “suspected” inliers and the “suspected” outliers, where the ratio of the sizes of these two parts is a carefully designed parameter. For the “suspected” inliers, we run the ordinary method (as a black box); for the “suspected” outliers, we directly take a small sample uniformly at random; finally, we prove that these two parts together yield a robust coreset. Our framework can also be efficiently implemented under the merge-and-reduce framework in the dynamic setting (even though the original merge-and-reduce framework is not designed for the case with outliers) [BS80, HM04]. A cute feature of our framework is that we can easily tune the partition parameter to update our coreset dynamically if the fraction of outliers changes in the dynamic environment.

The other contribution of this paper is that we propose two different coreset construction methods for continuous-and-bounded optimization objectives (i.e., the aforementioned black box). The first method is based on the importance sampling framework [FL11], and the second one is based on a spatial partition idea. Our coreset sizes depend on the doubling dimension of the solution space rather than on the VC (shattering) dimension (this property is particularly useful if the VC dimension is too high or not easy to obtain). Our methods can be applied to a broad range of widely studied optimization objectives, such as logistic regression [MSS+18], Bregman clustering [BMD+05] and truth discovery [LGM+15]. It is worth noting that although coreset construction methods have been proposed for some of these continuous-and-bounded objectives (like [MSS+18, TF18, LBK16]), they are all problem-dependent, and we are the first, to the best of our knowledge, to study them from a unified perspective.

## 2 Preliminaries

Suppose P is the parameter space. In this paper, we consider the learning problem whose objective function is the weighted sum of the costs over the training data, i.e.,

 f(θ,X):=∑x∈Xw(x)f(θ,x), (1)

where X is the input data, w(x) indicates the weight of each data item x, and f(θ,x) is the cost contributed by x with the parameter vector θ. The goal is to find an appropriate θ so that the objective function is minimized. Assuming that there are z outliers in X, we then define the objective function with outliers:

 fz(θ,X):=minO⊂X,⟦O⟧=z f(θ,X∖O).

We use (X, z) to denote a given instance X with z outliers. Actually, the above definition comes from the popular “trimming” method [RL87] that has been widely used for robust optimization problems. Several other notations used throughout this paper are shown in Table 1. We always assume |X| = n if the data items are unit-weighted; otherwise, we assume the total weight ⟦X⟧ = n.
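For a fixed θ, the trimmed objective above can be evaluated directly by discarding the z largest costs. The following is a minimal sketch in Python (the function and variable names are ours; unit weights and precomputed cost values f(θ,x) are assumed):

```python
def trimmed_cost(costs, z):
    """f_z(theta, X): total cost after discarding the z largest costs.

    `costs` is a list holding f(theta, x) for every x in X (unit weights).
    """
    assert 0 <= z <= len(costs)
    # "trimming": sort once and keep only the |X| - z smallest costs
    return sum(sorted(costs)[: len(costs) - z])

# toy example: two far-away points act as outliers and are trimmed away
costs = [1.0, 2.0, 1.5, 100.0, 0.5, 250.0]
print(trimmed_cost(costs, z=2))  # -> 5.0
```

Note that the minimization over θ is still left to the learner; for a given θ, the trimming only fixes which points are excluded.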

We further impose some reasonable assumptions on the function f. Let (P, d) be a metric space. A function g is L-Lipschitz continuous if ‖g(θ1)−g(θ2)‖ ≤ L·d(θ1, θ2) for any θ1, θ2 ∈ P, where L > 0 and ‖⋅‖ is some specified norm. A differentiable (resp., twice-differentiable) function g is L-Lipschitz continuous gradient (resp., L-Lipschitz continuous Hessian) if its gradient ∇g (resp., Hessian matrix ∇²g) is L-Lipschitz continuous. An L-Lipschitz continuous gradient function is also called L-smooth. For any θ1, θ2 ∈ P, the L-Lipschitz continuous gradient and L-Lipschitz continuous Hessian conditions also imply bounds on the difference between g(θ1) and g(θ2), which are

 |g(θ1)−g(θ2)−⟨∇g(θ2), θ1−θ2⟩| ≤ (L/2)‖θ1−θ2‖², (2)
 |g(θ1)−g(θ2)−⟨∇g(θ2), θ1−θ2⟩−(1/2)(θ1−θ2)ᵀ∇²g(θ2)(θ1−θ2)| ≤ (L/6)‖θ1−θ2‖³. (3)

Below we show the formal definition of the continuous-and-bounded learning problem.

###### Definition 1 (Continuous-and-Bounded Learning [Zin03, SSS+09, SB14]).

Let ℓ, L > 0. A learning problem is called Lipschitz-and-bounded with the parameters (ℓ, L) if the following two conditions hold:

1. There exists a fixed parameter vector ~θ such that the solution θ is always restricted within B(~θ, ℓ), the ball centered at ~θ with radius ℓ in the parameter space P.

2. For each x ∈ X, the loss function f(⋅, x) is L-Lipschitz continuous.

Similarly, we can define the smooth-and-bounded and Lipschitz Hessian-and-bounded learning problems. In general, we call all three of these problems “continuous-and-bounded learning”.

As mentioned in Section 1.1, many practical optimization objectives fall under the umbrella of continuous-and-bounded learning problems.

We also define the coreset for continuous-and-bounded learning problems below. We assume each x ∈ X has unit weight (i.e., w(x) = 1); it is not hard to extend our method to the weighted case.

###### Definition 2 (ε-coreset and (ε,ν)-coreset).

Let ε ∈ (0, 1). Given a dataset X and the objective function f, we say that a weighted set C is an ε-coreset of X if for any θ ∈ P, we have

 |f(θ,C)−f(θ,X)|≤εf(θ,X). (4)

The set C is called an (ε, ν)-coreset if for any θ ∈ P,

 |f(θ,C)−f(θ,X)|≤ε(f(θ,X)+ν). (5)
###### Remark 1.

The purpose of proposing the (ε, ν)-coreset is that we will use it as a bridge to achieve an ε-coreset in our analysis. Actually, it is not difficult to see the relation between the ε-coreset and the (ε, ν)-coreset. Suppose ν ≤ ε′·min_{θ∈P} f(θ,X), where ε′ > 0. Then an (ε, ν)-coreset should be an (1+ε′)ε-coreset as well, because ν is no larger than ε′·f(θ,X) for any θ ∈ P.

Following Definition 2, we define the corresponding robust coreset, which was introduced before in [FL11, HJL+18].

###### Definition 3 (robust coreset).

Let ε ∈ (0, 1) and β ∈ [0, 1). Given a dataset X with z outliers and the objective function f, we say that a weighted dataset C is a (β, ε)-robust coreset of X if for any θ ∈ P, we have

 (1−ε)f(1+β)z(θ,X)|X|−z≤fz(θ,C)⟦C⟧−z≤(1+ε)f(1−β)z(θ,X)|X|−z. (6)

The set C is called a (β, ε, ν)-robust coreset if for any θ ∈ P,

 (1−ε)f(1+β)z(θ,X)|X|−z−εν≤fz(θ,C)⟦C⟧−z≤(1+ε)f(1−β)z(θ,X)|X|−z+εν (7)

For the case without outliers, it is not difficult to see that the optimal solution over an ε-coreset is a (1+3ε)-approximate solution for the full data. Specifically, let θ∗C be the optimal solution over an ε-coreset C and θ∗X be the optimal solution over the original dataset X; then for ε ∈ (0, 1/3], we have

 f(θ∗C,X)≤(1+3ε)f(θ∗X,X). (8)

But when considering outliers, this result only holds for a (0, ε)-robust coreset (i.e., β = 0). If β > 0, we can derive a weaker result.

###### Lemma 2.

Given two parameters β ∈ (0, 1/2] and ε ∈ (0, 1/3], suppose C is a (β, ε)-robust coreset of X with z outliers. Let θ∗C be the optimal solution over C and θ∗X be the optimal solution over X, respectively. Then we have

 f(1+4β)z(θ∗C,X)≤(1+3ε)fz(θ∗X,X) (9)
###### Proof.

For β ∈ (0, 1/2] and ε ∈ (0, 1/3], we have (1+2β)(1+β) ≤ 1+4β, (1+2β)(1−β) ≥ 1, and (1+ε)/(1−ε) ≤ 1+3ε. Thus we can obtain the following bound:

 f(1+4β)z(θ∗C,X) ≤ f(1+2β)(1+β)z(θ∗C,X) ≤ (1/(1−ε))⋅f(1+2β)z(θ∗C,C)
 ≤ (1/(1−ε))⋅f(1+2β)z(θ∗X,C) ≤ ((1+ε)/(1−ε))⋅f(1+2β)(1−β)z(θ∗X,X)
 ≤ (1+3ε)⋅fz(θ∗X,X). ∎

The rest of this paper is organized as follows. In Section 3, we introduce our robust coreset framework and show how to realize it in a fully-dynamic environment. In Section 4, we propose two different ordinary coreset construction methods (without outliers) for continuous-and-bounded learning problems, which can be used as the black box in the robust coreset framework of Section 3.

## 3 Our Robust Coreset Framework

As a warm-up, we first consider simple uniform sampling as the robust coreset in Section 3.1. Then, we introduce our major contribution, the hybrid framework for robust coreset construction, and its fully-dynamic realization in Sections 3.2 and 3.3, respectively.

### 3.1 Warm-Up: Uniform Sampling

As mentioned before, the existing robust coreset construction methods [FL11, HJL+18] are based on uniform sampling. Note that their methods are only for clustering problems (e.g., k-means/median clustering). Thus a natural question is whether the uniform sampling idea also works for the general continuous-and-bounded optimization problems studied in this paper. Below, we answer this question in the affirmative.

###### Definition 4 (f-induced range space).

Suppose (X, d) is an arbitrary metric space. Given the cost function f as in (1) over X, we let

 R={{x∈X:f(θ,x)≤ℓ}∣∀ℓ≥0,∀θ∈P}, (10)

then (X, R) is called the f-induced range space.

The following “δ-sample” concept comes from the theory of VC dimension [LLS01]. Given a range space (X, R), let A and B be two finite subsets of X. Suppose δ ∈ (0, 1). We say B is a δ-sample of A if, for any R ∈ R, ∣|A∩R|/|A| − |B∩R|/|B|∣ ≤ δ. Denote by dvc the VC dimension of the range space of Definition 4; then we can achieve a δ-sample with probability 1−η by uniformly sampling O((1/δ²)(dvc + log(1/η))) points from A [LLS01]. The value of dvc depends on the function “f”. For example, if “f” is the loss function of logistic regression, then dvc can be quite large [MSS+18]. The following theorem shows that a δ-sample can serve as a robust coreset when z is a constant factor of n.

###### Theorem 1.

Let (X, z) be an instance of the continuous-and-bounded learning problem of Definition 1, and suppose C is a δ-sample of X in the f-induced range space. Then we have

 fz+δn(θ,X)≤fz(θ,C)≤fz−δn(θ,X) (11)

for any θ ∈ P and any z ≥ δn. In particular, if δ = βz/n, C is a (β, 0)-robust coreset of (X, z).

###### Proof.

Let A and B be two subsets of X, and assume that they are δ-approximate (i.e., each is a δ-sample of the other) with respect to the range space induced by f. Let a = |A| and b = |B|. We create a new weighted set A′ by setting the weight of each element of A to be b; it can simply be thought of as making b copies of each element of A. Similarly, we create B′ where each point has weight a. Then we have ⟦A′⟧ = ⟦B′⟧ = ab, and we write O = ab for short.

We arrange the elements of A′ (resp., B′) in non-decreasing order of the cost f(θ,⋅) and use a′i (resp., b′i) to denote the i-th element. First we claim that for all i ≤ (1−δ)O, we have

 f(θ,a′i+δO)≥f(θ,b′i) (12)

If not, there must be some j with f(θ,a′j+δO) < f(θ,b′j). Consider the range R = {x ∈ X : f(θ,x) ≤ f(θ,a′j+δO)} ∈ R. We have ⟦A′∩R⟧ ≥ j+δO and ⟦B′∩R⟧ < j. (We define the intersection of a weighted set and a unit-weighted set to be a weighted set such that each element belongs to both sets and keeps its weight from the weighted set.) Then we have

 |A∩R|/|A| = ⟦A′∩R⟧/(ab) ≥ (j+δO)/(ab) = j/O+δ, |B∩R|/|B| = ⟦B′∩R⟧/(ab) < j/O.

We see that A and B are not δ-approximate, which contradicts the assumption. Thus we have proved (12).
We denote by γ the proportion of inliers; then we have

 f(1−γ)|A|(θ,A) = ∑i=1..γa f(θ,ai) = (1/b)⋅∑i=1..γO f(θ,a′i)
 ≥ (1/b)⋅∑i=1+δO..γO f(θ,a′i)   (because f(θ,⋅) ≥ 0)
 = (1/b)⋅∑i=1..(γ−δ)O f(θ,a′i+δO) ≥ (1/b)⋅∑i=1..(γ−δ)O f(θ,b′i)   (by the claim (12))
 = (a/b)⋅∑i=1..(γ−δ)b f(θ,bi) = f(1−(γ−δ))|B|(θ,B).

For simplicity we assume that δO and γO are integers; it is easy to see that this result still holds in the non-integral case.

C is a δ-sample of X, so they are δ-approximate. Letting A be X and B be C with (1−γ)n = z−δn, we have fz(θ,C) ≤ fz−δn(θ,X). Letting A be C and B be X with 1−γ = z/n, we have fz+δn(θ,X) ≤ fz(θ,C). ∎

###### Remark 3.

Though uniform sampling is simple and easy to implement, it has two major drawbacks. First, it always involves an error on the number of outliers (otherwise, if δ = 0, the sample has to be the whole X). Also, it is useful only when z is a constant factor of n. For example, if z = O(√n), the required sample size can be as large as Ω(n). Our hybrid robust framework in Section 3.2 avoids both issues.
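Despite these drawbacks, the warm-up construction itself is just uniform sampling with rescaled weights. A minimal sketch in Python (the names are ours; the sample size m is left as a parameter, since the required value depends on the VC dimension of the f-induced range space as discussed above):

```python
import random

def uniform_robust_coreset(X, m, seed=0):
    """Draw m points i.i.d. uniformly from X, each weighted by |X|/m.

    Per Theorem 1, a large enough uniform sample is a delta-sample of X
    and hence serves as a robust coreset, with an additive error of
    delta * |X| on the number of outliers.
    """
    rng = random.Random(seed)
    sample = rng.choices(X, k=m)   # i.i.d. uniform sampling
    weight = len(X) / m            # rescale so the total weight stays |X|
    return [(x, weight) for x in sample]

coreset = uniform_robust_coreset(list(range(10000)), m=100)
print(len(coreset), coreset[0][1])  # 100 weighted points, weight 100.0
```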

### 3.2 The Hybrid Framework for (β,ε)-Robust Coreset

Our idea for building the robust coreset is inspired by the following intuition. In an ideal scenario, if we knew which points are the inliers and which are the outliers, we could simply construct the coresets for them separately. In reality, though we cannot obtain such a clean classification, the properties of the continuous-and-bounded objective function (from Definition 1) can guide us to a “coarse” classification. Furthermore, together with some novel geometric insights, we prove that such a hybrid framework yields a (β, ε)-robust coreset.

Suppose the cost function f is continuous-and-bounded as in Definition 1. Specifically, the parameter vector θ is always restricted within the ball B(~θ, ℓ) centered at ~θ with radius ℓ. First, we partition X into two parts according to the value of f(~θ, ⋅). Let Z = (1+1/ε)z, and let xZ be the point that has the Z-th largest cost f(~θ, ⋅) among X. We let τ = f(~θ, xZ), and thus we have

 ∣∣{x∈X:f(~θ,x)≥τ}∣∣=Z. (13)

We call these Z points the “suspected outliers” and the remaining points the “suspected inliers”. If we fix θ = ~θ, the set of the “suspected outliers” contains at least Z−z real inliers (since at most z of them can be real outliers), each of cost at least τ. By our choice of Z, this immediately implies the following inequality:

 τz≤εfz(~θ,X). (14)

Our robust coreset construction is as follows. Suppose we have an ordinary coreset construction method as the black box (we will discuss it in Section 4). We build an ε-coreset for the suspected inliers by the black box, and build a δ-sample for the suspected outliers, where δ is chosen so that the error on the number of outliers is at most βz (as in Theorem 1). If we set β = 0, we directly take all the suspected outliers as the δ-sample. We denote these two parts by Cin and Cout, respectively. Finally, we return C = Cin ∪ Cout as the robust coreset.
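The construction just described can be sketched as follows (Python; `ordinary_coreset` stands for the black-box method of Section 4 and `cost_at_theta0` for the map x ↦ f(~θ, x); both names, and the trivial “keep everything” black box in the toy run, are ours):

```python
import random

def hybrid_robust_coreset(X, cost_at_theta0, Z, ordinary_coreset,
                          m_out=None, seed=0):
    """Split X by the cost at ~theta and compress the two parts separately.

    The Z points with the largest costs f(~theta, x) are the "suspected
    outliers"; the rest are the "suspected inliers".  The inlier part goes
    through the black-box coreset method; the outlier part is uniformly
    sampled, or kept whole when m_out is None (the beta = 0 case).
    Returns a list of (point, weight) pairs.
    """
    order = sorted(X, key=cost_at_theta0, reverse=True)
    suspected_out, suspected_in = order[:Z], order[Z:]

    C_in = ordinary_coreset(suspected_in)            # weighted pairs
    if m_out is None:                                # beta = 0: keep all
        C_out = [(x, 1.0) for x in suspected_out]
    else:                                            # beta > 0: sample
        rng = random.Random(seed)
        C_out = [(x, Z / m_out)
                 for x in rng.choices(suspected_out, k=m_out)]
    return C_in + C_out

# toy run: 1-d points, cost = squared distance to ~theta = 0,
# with a trivial black box that keeps every suspected inlier
X = [0.1, -0.2, 0.3, 9.0, -8.5, 0.05]
C = hybrid_robust_coreset(X, lambda x: x * x, Z=2,
                          ordinary_coreset=lambda S: [(x, 1.0) for x in S])
print(sorted(x for x, _ in C))
```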

To prove the correctness of this construction, we imagine the following partition of X. For any parameter vector θ in the parameter space, X is also partitioned into the real inliers (i.e., the n−z points that have the smallest costs f(θ,⋅)) and the real outliers (i.e., the z points that have the largest costs f(θ,⋅)). Therefore, together with the suspected inliers and outliers, X is partitioned into four parts:

• XI: the points belonging to the real inliers and also the suspected inliers. XI is not empty because we assume that the suspected outliers are fewer than the real inliers (Z < n−z).

• XII: the points belonging to the real inliers and the suspected outliers. XII is not empty because the number of suspected outliers is larger than the number of real outliers (Z > z).

• XIII: the points belonging to the real outliers and also the suspected outliers. XIII can be empty, which happens iff all the real outliers are suspected inliers.

• XIV: the points belonging to the real outliers and the suspected inliers. XIV can be empty, which happens iff all the real outliers are suspected outliers.

Similarly, C is partitioned into four parts CI, CII, CIII, CIV in the same way.

For continuous-and-bounded learning problems, we can bound |f(θ,x) − f(~θ,x)| by a low-degree polynomial of ℓ. For example, if f is L-Lipschitz, the polynomial is Lℓ; if f is L-smooth, the polynomial is Gℓ + (L/2)ℓ², where G = maxx∈X ‖∇f(~θ,x)‖. For conciseness, we denote this polynomial by ξ(ℓ) in the following statement.

###### Theorem 2.

For the continuous-and-bounded learning problems defined in Definition 1, given an ε-coreset method of size S(ε) for the “suspected inliers”, we can construct a (β, ε)-robust coreset with probability at least 1−η. In particular, when β = 0, our coreset has no error on the number of outliers and its size is S(ε) + O(z/ε).

###### Proof (sketch).

In this proof we omit the technical details; the entire proof is given in the supplement.

Our aim is to prove inequality (17) below. We have fz(θ,C) ≤ f(θ,CI) + f(θ,CII), since the real inliers of C are contained in CI ∪ CII; hence we need to bound f(θ,CI) and f(θ,CII) respectively. The upper bound of f(θ,CI) comes from the definition of the ε-coreset, since Cin is an ε-coreset of the suspected inliers.

The upper bound of f(θ,CII) is a little more complex. We give two different upper bounds depending on whether CIV is empty or not:

 f(θ,CII)≤{f(1−β)z(θ,XII+XIII)if CIV=∅f(θ,XII)+2z(τ+ξ(ℓ))if CIV≠∅ (15)

By adding the upper bounds of f(θ,CI) and f(θ,CII), we obtain an upper bound of fz(θ,C):

 fz(θ,C)≤{(1+ε)f(1−β)z(θ,X)+4zξ(ℓ)if CIV=∅(1+ε)fz(θ,X)+4zτ+4zξ(ℓ)if CIV≠∅ (16)

Merging the two cases together, we have

 fz(θ,C)≤(1+ε)f(1−β)z(θ,X)+4zτ+4zξ(ℓ) (17)

We have τz ≤ εfz(~θ,X) by (14), and |fz(θ,X) − fz(~θ,X)| ≤ (n−z)⋅ξ(ℓ) by the continuity of f. Substituting these into (17) and dividing both sides by n−z, we have

 fz(θ,C)n−z≤(1+5ε)f(1−β)z(θ,X)n−z+10ε⋅ξ(ℓ) (18)

Similarly, we can derive

 fz(θ,C)n−z≥(1−5ε)f(1+β)z(θ,X)n−z−10ε⋅ξ(ℓ) (19)

We conclude that C is a (β, 5ε, 2ξ(ℓ))-robust coreset of X.

Finally, we can obtain a (β, ε)-robust coreset by the rescaling procedure stated in Remark 1. ∎

### 3.3 The Fully-Dynamic Implementation

In this section, we show that our robust coreset of Section 3.2 can be efficiently implemented in a fully-dynamic environment, even if the number of outliers is dynamically changed.

The standard ε-coreset usually has two important properties. If C1 and C2 are respectively the ε-coresets of two disjoint sets X1 and X2, their union C1 ∪ C2 is an ε-coreset of X1 ∪ X2. Also, if C1 is an ε1-coreset of C2 and C2 is an ε2-coreset of X, then C1 is an ((1+ε1)(1+ε2)−1)-coreset of X. Based on these two properties, one can build a coreset for an incremental data stream by using the merge-and-reduce technique [BS80, HM04]. Very recently, Henzinger and Kale [HK20] extended it to the more general fully-dynamic setting, where data items can be deleted and updated as well.

Roughly speaking, the merge-and-reduce technique uses a sequence of “buckets” to maintain the coreset for the input streaming data, and the buckets are merged in a bottom-up manner. However, it is challenging to directly adapt this strategy to the case with outliers, because we cannot determine the number of outliers in each bucket. A cute aspect of our hybrid robust coreset framework is that we can easily resolve this obstacle by using a linear-size auxiliary table together with the merge-and-reduce technique (note that even for the case without outliers, maintaining a fully-dynamic coreset needs linear space [HK20]). Due to the space limit, we briefly introduce our idea below and leave the full details to the supplement.

Recall that we partition the input data X into two parts: the “suspected inliers” and the “suspected outliers”, where the latter has size Z. We follow the same notations used in Section 3.2. For the first part, we just apply the vanilla merge-and-reduce technique to obtain a fully-dynamic coreset Cin; for the other part, we can take a δ-sample, or take the whole set if we require β to be 0, and denote it by Cout. Moreover, we maintain a table T that records, for each x ∈ X, the key value f(~θ, x) and its position in the merge-and-reduce tree; the entries are sorted by the key values in the table. To handle the dynamic updates (e.g., deletion and insertion), we also maintain a critical pointer pointing to the data item xZ (recall xZ has the Z-th largest cost f(~θ, ⋅) among X, as defined in Section 3.2).

When a new data item arrives or an existing data item is to be deleted, we just need to compare it with τ = f(~θ, xZ) to decide whether to update Cin or Cout accordingly; after the update, we also need to update xZ and the pointer in T. If the number of outliers z is changed, we just need to update Z and τ first, and then update Cin and Cout (for example, if z is increased, we just need to delete some items from Cin and insert some items into Cout). To assist these updating operations, we also set one bucket as the “hot bucket”, which serves as a shuttle to execute all the data shifts. See Figure 1 for the illustration. Let s be the size of each leaf bucket; then our time complexity for insertion and deletion depends only on s and the height of the merge-and-reduce tree, and for updating z to a new value z′ it further depends on |z′−z| and on ε, the error bound for the robust coreset in Definition 3.
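For intuition, the insertion-only core of the merge-and-reduce technique can be sketched as follows (Python; `reduce_fn` stands for the black-box coreset routine applied whenever two buckets of the same level are merged; the fully-dynamic version with deletions, the table T and the hot bucket is more involved):

```python
def stream_coreset(stream, bucket_size, reduce_fn):
    """Bottom-up merge-and-reduce over an insertion-only stream.

    One bucket is kept per level; merging two level-i buckets yields a
    level-(i+1) bucket that is immediately reduced, so only O(log n)
    buckets are alive at any time (like carries in binary addition).
    """
    levels = {}   # level -> bucket (a plain list of points)

    def push(level, bucket):
        while level in levels:
            bucket = reduce_fn(levels.pop(level) + bucket)
            level += 1
        levels[level] = bucket

    buf = []
    for x in stream:
        buf.append(x)
        if len(buf) == bucket_size:
            push(0, buf)
            buf = []

    coreset = buf
    for bucket in levels.values():
        coreset = coreset + bucket
    return coreset

# toy reduce: keep an evenly spaced half of a merged bucket
halve = lambda bucket: sorted(bucket)[::2]
print(len(stream_coreset(range(64), bucket_size=8, reduce_fn=halve)))  # -> 8
```

With 64 items and leaf buckets of size 8, all buckets cascade into a single level-3 bucket of 8 points, so the live summary never grows with the stream length.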

## 4 Coreset for Continuous-and-Bounded Learning Problems

As mentioned in Section 3.2, we need a black-box ordinary coreset construction method (without considering outliers) in the hybrid robust coreset framework. In this section, we provide two general coreset construction methods for the continuous-and-bounded learning problems.

### 4.1 Importance Sampling Based Coreset Construction

We follow the importance sampling based approach [LS10]. Each data point x ∈ X has a sensitivity σ(x) that measures its importance to the whole input data X. Computing the sensitivities exactly is often challenging, but an upper bound on the sensitivity is actually often sufficient for the coreset construction. Assume s(x) is an upper bound of σ(x) and let S = ∑x∈X s(x). The coreset construction is as follows. We sample a subset Q from X, where each element of Q is sampled i.i.d. with probability s(x)/S; we assign a weight S/(|Q|⋅s(x)) to each sampled data item x ∈ Q.

###### Theorem 3 ([Bfl16]).

Let dvc be the VC dimension (or shattering dimension) of the range space induced by f. If the size of Q is Ω((S/ε²)(dvc log S + log(1/η))), then Q is an ε-coreset with probability 1−η.

Therefore the only remaining issue is how to compute the upper bounds s(x). Recall that we assume our cost function to be L-Lipschitz (or L-smooth, or L-Lipschitz continuous Hessian) as in Definition 1; that is, we can bound the difference between f(θ,x) and f(~θ,x) for any θ ∈ B(~θ, ℓ). For example, if we assume the cost function to be L-smooth, we have f(θ,x) ≤ f(~θ,x) + ⟨∇f(~θ,x), θ−~θ⟩ + (L/2)‖θ−~θ‖² and f(θ,X) ≥ f(~θ,X) + ⟨∇f(~θ,X), θ−~θ⟩ − (nL/2)‖θ−~θ‖², where ∇f(~θ,X) = ∑x∈X ∇f(~θ,x). Consequently, we obtain an upper bound of σ(x):

 s(x) := maxθ∈B(~θ,ℓ) [f(~θ,x)+⟨∇f(~θ,x), θ−~θ⟩+(L/2)‖θ−~θ‖²] / [f(~θ,X)+⟨∇f(~θ,X), θ−~θ⟩−(nL/2)‖θ−~θ‖²] ≥ σ(x). (20)

In our supplement, we further show that computing such an upper bound is equivalent to solving a quadratic fractional programming problem. This program can be reduced to a semi-definite program [BT09], which can be solved in polynomial time; the total running time of the coreset construction is therefore n times the cost of solving one such program.
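Once the upper bounds s(x) are available, the sampling step itself is simple. A minimal sketch in Python (the names are ours; the s-values are taken as given, however they were computed):

```python
import random

def importance_sampling_coreset(X, s, m, seed=0):
    """Sample m points i.i.d. with probability proportional to s(x).

    A sampled point x gets weight S / (m * s(x)), where S is the total
    sensitivity bound; this makes the weighted coreset cost an unbiased
    estimator of f(theta, X) for every theta.
    """
    S = sum(s)
    rng = random.Random(seed)
    idx = rng.choices(range(len(X)), weights=s, k=m)  # relative weights
    return [(X[i], S / (m * s[i])) for i in idx]

# toy run: the last point is deemed far more "important" by its s-value
X = [1.0, 2.0, 3.0, 50.0]
s = [1.0, 1.0, 1.0, 10.0]
C = importance_sampling_coreset(X, s, m=3)
print(len(C))  # -> 3
```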

A drawback of Theorem 3 is that the coreset size depends on the VC dimension dvc induced by f. For some objectives, dvc can be very large or difficult to obtain. Here, we prove that for a continuous-and-bounded cost function, the coreset size can be independent of dvc; instead, it depends on the doubling dimension [CGM+16] of the parameter space P. The doubling dimension is a widely used measure of the growth rate of a metric space, which can also be viewed as a generalization of the Euclidean dimension; for example, the doubling dimension of a d-dimensional Euclidean space is Θ(d). The proof of Theorem 4 is placed in the supplement.

###### Theorem 4.

Suppose the cost function is continuous-and-bounded as described in Definition 1, and let ρ be the doubling dimension of the parameter space. Then, if we run the importance sampling based coreset construction method with a sample size that depends on ρ (rather than on the VC dimension dvc), the obtained Q is an ε-coreset with probability 1−η. The hidden constant of the sample size depends on L and ℓ.

###### Remark 4.

The major advantage of Theorem 4 over Theorem 3 is that we do not need to know the VC dimension induced by the cost function. On the other hand, the doubling dimension is often much easier to know (or estimate): e.g., the doubling dimension of a given instance in the d-dimensional Euclidean space is just Θ(d), no matter how complicated the cost function is. The reader is also referred to [HJL+18] for a more detailed discussion of the relation between the VC (shattering) dimension and the doubling dimension.

### 4.2 Spatial Partition Based Coreset Construction

The reader may have realized that the coreset size presented in Theorem 4 (and also Theorem 3) is data-dependent. That is, the coreset size depends on the total sensitivity bound S, which can differ across input instances (because the formula (20) depends on the input X). To achieve a data-independent coreset size, we introduce the following method based on spatial partition, which is partly inspired by the previous k-median/means clustering coreset construction idea of [CHE09]. We extend their method to the continuous-and-bounded learning problems and call it the Generalized Exponential Layer (GEL) method.

We set r = f(~θ,X)/n. Then, we partition all the data points into layers according to their costs with respect to ~θ. Specifically, we assign a point x to the j-th layer (for j ≥ 1) if f(~θ,x) ∈ [2^{j−1}r, 2^{j}r); otherwise (i.e., if f(~θ,x) < r) we assign it to the 0-th layer. It is easy to see that the number of layers is O(log n). We denote the set of points falling in the j-th layer by Hj. From each Hj, we take a small sample Qj uniformly at random, where each point of Qj is assigned the weight |Hj|/|Qj|. Finally, the union ∪j Qj forms our final coreset.
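The GEL partition-and-sample step can be sketched as follows (Python; the names are ours, and the per-layer sample size is left as a parameter, since the required value comes from Theorem 5 below):

```python
import math
import random

def gel_coreset(X, cost_at_theta0, m_per_layer, seed=0):
    """Generalized Exponential Layer (GEL) sketch.

    Points are bucketed by their cost at ~theta into exponentially
    growing layers [2^(j-1) * r, 2^j * r) with r = f(~theta, X) / |X|;
    each layer H_j is then sampled uniformly, and a sampled point gets
    weight |H_j| / (sample size) so every layer keeps its total weight.
    """
    n = len(X)
    r = sum(cost_at_theta0(x) for x in X) / n
    layers = {}
    for x in X:
        c = cost_at_theta0(x)
        j = 0 if c < r else int(math.floor(math.log2(c / r))) + 1
        layers.setdefault(j, []).append(x)

    rng = random.Random(seed)
    coreset = []
    for H in layers.values():
        k = min(m_per_layer, len(H))
        coreset += [(x, len(H) / k) for x in rng.sample(H, k)]
    return coreset

C = gel_coreset([0.1, 0.2, 0.4, 0.8, 1.6, 3.2], lambda x: x, m_per_layer=2)
print(sum(w for _, w in C))  # total weight equals |X| = 6
```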

###### Theorem 5.

Suppose the cost function is continuous-and-bounded as described in Definition 1, and let ρ be the doubling dimension of the parameter space. The above coreset construction method GEL can achieve an ε-coreset of data-independent size in linear time. The hidden constant of the coreset size depends on L and ℓ.

To prove Theorem 5, the key is to show that each Qj can well represent its layer Hj with respect to any θ in the bounded region (i.e., the ball B(~θ, ℓ) centered at ~θ with radius ℓ in the parameter space, as described in Definition 1). First, we use the continuity property to bound the difference between f(θ, Qj) and f(θ, Hj) for each j with a fixed θ; then, together with the doubling dimension, we generalize this bound to any θ in the bounded region. The full proof is shown in the supplement.

## 5 Conclusion and Future Work

In this paper, we propose a robust coreset framework for continuous-and-bounded learning problems (with outliers). Our framework can be efficiently implemented in the dynamic setting. We put our experimental results in the supplement due to the space limit. In the future, it would be interesting to construct (dynamic) robust coresets for other types of optimization problems in machine learning.

## References

• [AHV04] P. K. Agarwal, S. Har-Peled, and K. R. Varadarajan (2004) Approximating extent measures of points. J. ACM 51 (4), pp. 606–635. Cited by: §1.
• [BMD+05] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh (2005) Clustering with bregman divergences. J. Mach. Learn. Res. 6, pp. 1705–1749. Cited by: Appendix A, §1.1.
• [BT09] A. Beck and M. Teboulle (2009) A convex optimization approach for minimizing the ratio of indefinite quadratic functions over an ellipsoid. Mathematical Programming 118, pp. 13–35. External Links: Document, ISSN 0025-5610,1436-4646 Cited by: §4.1.
• [BS80] J. L. Bentley and J. B. Saxe (1980) Decomposable searching problems I: static-to-dynamic transformation. J. Algorithms 1 (4), pp. 301–358. Cited by: §1.1, §1, §3.3.
• [BR18] B. Biggio and F. Roli (2018) Wild patterns: ten years after the rise of adversarial machine learning. Pattern Recognition 84, pp. 317–331. Cited by: §1.
• [BD99] J. A. Blackard and D. J. Dean (1999) Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture 24 (3), pp. 131–151. External Links: ISSN 0168-1699, Document Cited by: Appendix D.
• [BFL16] V. Braverman, D. Feldman, and H. Lang (2016) New frameworks for offline and streaming coreset constructions. CoRR abs/1612.00889. External Links: 1612.00889 Cited by: Theorem 3.
• [CGM+16] T.-H. H. Chan, A. Gupta, B. M. Maggs, and S. Zhou (2016) On hierarchical routing in doubling metrics. ACM Trans. Algorithms 12 (4), pp. 55:1–55:22. Cited by: §4.1.
• [CG13] S. Chawla and A. Gionis (2013) K-means–: a unified approach to clustering and outlier detection. In SDM, Cited by: Appendix D.
• [CAZ18] J. Chen, E. S. Azer, and Q. Zhang (2018) A practical algorithm for distributed clustering and outlier detection. See DBLP:conf/nips/2018, pp. 2253–2262. Cited by: §1.
• [CHE09] K. Chen (2009) On coresets for k-median and k-means clustering in metric and euclidean spaces and their applications. SIAM J. Comput. 39 (3), pp. 923–947. Cited by: §4.2.
• [FL11] D. Feldman and M. Langberg (2011) A unified framework for approximating and clustering data. In Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011, San Jose, CA, USA, 6-8 June 2011, L. Fortnow and S. P. Vadhan (Eds.), pp. 569–578. Cited by: §1.1, §1, §2, §3.1.
• [FEL20] D. Feldman (2020) Introduction to core-sets: an updated survey. CoRR abs/2011.09384. External Links: 2011.09384 Cited by: §1.
• [GGV+19] A. Ginart, M. Y. Guan, G. Valiant, and J. Zou (2019) Making AI forget you: data deletion in machine learning. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 3513–3526. Cited by: §1.
• [GKL+17] S. Gupta, R. Kumar, K. Lu, B. Moseley, and S. Vassilvitskii (2017-03) Local search methods for k-means with outliers. Proc. VLDB Endow. 10 (7), pp. 757–768. External Links: ISSN 2150-8097 Cited by: Appendix D, §1.
• [HM04] S. Har-Peled and S. Mazumdar (2004) On coresets for k-means and k-median clustering. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, June 13-16, 2004, L. Babai (Ed.), pp. 291–300. Cited by: §1.1, §3.3.
• [HAU92] D. Haussler (1992) Decision theoretic generalizations of the PAC model for neural net and other learning applications. Inf. Comput. 100 (1), pp. 78–150. Cited by: Lemma 10.
• [HK20] M. Henzinger and S. Kale (2020) Fully-dynamic coresets. In 28th Annual European Symposium on Algorithms, ESA 2020, September 7-9, 2020, Pisa, Italy (Virtual Conference), F. Grandoni, G. Herman, and P. Sanders (Eds.), LIPIcs, Vol. 173, pp. 57:1–57:21. Cited by: §1, §3.3, §3.3.
• [HJL+18] L. Huang, S. H.-C. Jiang, J. Li, and X. Wu (2018) Epsilon-coresets for clustering (with outliers) in doubling metrics. In 59th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2018, Paris, France, October 7-9, 2018, M. Thorup (Ed.), pp. 814–825. Cited by: §1, §2, §3.1, Remark 4.
• [IMM+14] P. Indyk, S. Mahabadi, M. Mahdian, and V. S. Mirrokni (2014) Composable core-sets for diversity and coverage maximization. In Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS’14, Snowbird, UT, USA, June 22-27, 2014, R. Hull and M. Grohe (Eds.), pp. 100–108. Cited by: §1.
• [KYJ13] M. Kaul, B. Yang, and C. S. Jensen (2013) Building accurate 3d spatial networks to enable next generation intelligent transportation systems. See DBLP:conf/mdm/2013-1, pp. 137–146. External Links: Document Cited by: Appendix D.
• [LS10] M. Langberg and L. J. Schulman (2010) Universal epsilon-approximators for integrals. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010, Austin, Texas, USA, January 17-19, 2010, M. Charikar (Ed.), pp. 598–607. Cited by: §4.1.
• [LXY20] S. Li, J. Xu, and M. Ye (2020) Approximating global optimum for probabilistic truth discovery. Algorithmica 82 (10), pp. 3091–3116. External Links: Document Cited by: Appendix A.
• [LGM+15] Y. Li, J. Gao, C. Meng, Q. Li, L. Su, B. Zhao, W. Fan, and J. Han (2015) A survey on truth discovery. SIGKDD Explor. 17 (2), pp. 1–16. Cited by: §1.1.
• [LLS01] Y. Li, P. M. Long, and A. Srinivasan (2001) Improved bounds on the sample complexity of learning. Journal of Computer and System Sciences 62 (3), pp. 516–527. Cited by: §3.1.
• [LBK16] M. Lucic, O. Bachem, and A. Krause (2016) Strong coresets for hard and soft bregman clustering with applications to exponential family mixtures. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, Cadiz, Spain, May 9-11, 2016, A. Gretton and C. C. Robert (Eds.), JMLR Workshop and Conference Proceedings, Vol. 51, pp. 1–9. Cited by: §1.1.
• [MSS+18] A. Munteanu, C. Schwiegelshohn, C. Sohler, and D. P. Woodruff (2018) On coresets for logistic regression. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 6562–6571. Cited by: Appendix D, §1.1, §3.1.
• [RL87] P. J. Rousseeuw and A. Leroy (1987) Robust regression and outlier detection. Wiley. Cited by: §2.
• [SB14] S. Shalev-Shwartz and S. Ben-David (2014) Understanding machine learning - from theory to algorithms.. Cambridge University Press. External Links: ISBN 978-1-10-705713-5 Cited by: §1.1, Definition 1.
• [SSS+09] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan (2009) Stochastic convex optimization. In COLT 2009 - The 22nd Conference on Learning Theory, Montreal, Quebec, Canada, June 18-21, 2009, Cited by: Definition 1.
• [TF18] E. Tolochinsky and D. Feldman (2018) Generic coreset for scalable learning of monotonic kernels: logistic regression, sigmoid and more. External Links: 1802.07382 Cited by: §1.1.
• [ZIN03] M. Zinkevich (2003) Online convex programming and generalized infinitesimal gradient ascent. In Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, pp. 928–936. Cited by: Definition 1.

## Appendix A Instances of Continuous-and-Bounded Learning Problems

#### Logistic Regression

For x ∈ ℝᵈ and y ∈ {−1,+1}, the loss function of logistic regression is

 f(θ,(x,y)) = log(1 + exp(−y⟨θ,x⟩)) (21)

This function is ∥x∥-Lipschitz and (∥x∥²/4)-smooth with respect to θ.
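As a numerical sanity check (illustrative only, not part of the paper's construction; the function names are our own), the snippet below evaluates the logistic loss and verifies that the gradient norm never exceeds ∥x∥, which is exactly what the ∥x∥-Lipschitz claim asserts.

```python
import numpy as np

def logistic_loss(theta, x, y):
    """Logistic-regression loss log(1 + exp(-y <theta, x>))."""
    return np.log1p(np.exp(-y * np.dot(theta, x)))

def logistic_grad(theta, x, y):
    """Gradient of the loss with respect to theta: (-y x) * sigmoid(-y <theta, x>)."""
    t = -y * np.dot(theta, x)
    return (-y * x) / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), 1.0
# The gradient norm is at most ||x|| * sup|sigmoid| = ||x||,
# so the loss is ||x||-Lipschitz in theta.
max_grad = max(np.linalg.norm(logistic_grad(rng.normal(size=5), x, y))
               for _ in range(1000))
assert max_grad <= np.linalg.norm(x) + 1e-9
```

At θ = 0 the loss equals log 2 for any labeled point, which gives a quick correctness check of the implementation.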

#### Bregman Divergence [BMD+05]

Let the function ϕ be strictly convex and differentiable; then the Bregman divergence between y and x with respect to ϕ is

 dϕ(y,x) = ϕ(y) − ϕ(x) − ⟨∇ϕ(x), y−x⟩ (22)

If ∥∇ϕ(u)∥ ≤ L for all u, then we have

 dϕ(y,x) ≤ L∥y−x∥ + ∥∇ϕ(x)∥∥y−x∥ ≤ 2L∥y−x∥.

In this case, the Bregman divergence is 2L-Lipschitz.
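A small numerical illustration of definition (22) (our own sketch, not from [BMD+05]): for ϕ(u) = ∥u∥², the Bregman divergence reduces exactly to the squared Euclidean distance, which the snippet below verifies.

```python
import numpy as np

def bregman(phi, grad_phi, y, x):
    """Bregman divergence d_phi(y, x) = phi(y) - phi(x) - <grad phi(x), y - x>."""
    return phi(y) - phi(x) - np.dot(grad_phi(x), y - x)

phi = lambda u: np.dot(u, u)     # phi(u) = ||u||^2
grad_phi = lambda u: 2.0 * u     # gradient of ||u||^2

rng = np.random.default_rng(1)
x, y = rng.normal(size=3), rng.normal(size=3)
d = bregman(phi, grad_phi, y, x)
# For phi = ||.||^2 the Bregman divergence equals the squared distance ||y - x||^2.
assert abs(d - np.dot(y - x, y - x)) < 1e-12
```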

#### Truth Discovery [LXY20]

For θ, x ∈ ℝᵈ, the loss function of truth discovery is f(θ,x) = f_truth(∥θ−x∥), where

 f_truth(t) = { t², 0 ≤ t < 1; 1 + log t², t ≥ 1 } (23)

We can prove that f_truth is 2-Lipschitz.
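The 2-Lipschitz claim can be checked numerically (an illustrative sketch of ours): the derivative of f_truth is 2t on [0,1) and 2/t on [1,∞), both bounded by 2, so all secant slopes stay below 2.

```python
import numpy as np

def f_truth(t):
    """Truth-discovery penalty: t^2 for t in [0,1), 1 + log(t^2) for t >= 1."""
    t = np.asarray(t, dtype=float)
    out = np.empty_like(t)
    small = t < 1.0
    out[small] = t[small] ** 2
    out[~small] = 1.0 + np.log(t[~small] ** 2)
    return out

# Check 2-Lipschitzness via secant slopes on a fine grid containing t = 1.
ts = np.linspace(0.0, 5.0, 2001)
slopes = np.abs(np.diff(f_truth(ts))) / np.diff(ts)
assert slopes.max() <= 2.0
```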

#### k-median

The loss function of 1-median, f(θ,x) = ∥θ−x∥, is 1-Lipschitz. However, if k ≥ 2, we cannot directly conclude the Lipschitzness of k-median, whose loss function is f(θ,x) = min_{1≤j≤k} ∥θ_j−x∥, where θ = (θ_1, …, θ_k) is a set of k points. But we can still treat it as a 1-Lipschitz function if we slightly modify the definition of continuous-and-bounded learning problems. In the previous discussion, we implicitly assumed that the parameter θ is an atom element. For problems like k-median, the parameter is a collection of several atom parameters, i.e., θ = (θ_1, …, θ_k). In this case, we require the parameter space to be the direct product of the atom parameter spaces, where each atom space is a ball of radius ℓ centered at the corresponding atom of θ̃. If f is Lipschitz with respect to each atom parameter θ_j, the difference |f(θ,x) − f(θ̃,x)| is still bounded by ξ(ℓ). Therefore, for problems like k-clustering, we only need to consider the Lipschitzness of the 1-clustering case.
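The per-atom Lipschitz argument above can be illustrated with a quick check (our own sketch; the names are hypothetical): perturbing a single center θ_j by a vector δ changes the k-median loss of any point by at most ∥δ∥, because the minimum over centers moves by no more than any single distance does.

```python
import numpy as np

def k_median_loss(centers, x):
    """k-median loss of a point x under theta = (theta_1, ..., theta_k)."""
    return min(np.linalg.norm(x - c) for c in centers)

rng = np.random.default_rng(2)
x = rng.normal(size=2)
centers = [rng.normal(size=2) for _ in range(3)]

# Perturb one atom parameter theta_1 by delta: the loss moves by at most ||delta||,
# which is the 1-Lipschitz property with respect to a single center.
delta = 0.05 * rng.normal(size=2)
perturbed = [centers[0] + delta] + centers[1:]
change = abs(k_median_loss(perturbed, x) - k_median_loss(centers, x))
assert change <= np.linalg.norm(delta) + 1e-12
```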

## Appendix B Proofs

### b.1 Proof of Theorem 2

Recall that C_I is an ϵ-coreset of X_I and C_II is a sample taken from X_II.

We have X = X_I + X_II and C = C_I + C_II. So our aim is to prove

 f(θ,CI+CII)=f(θ,CI)+f(θ,CII)≈f(θ,XI+XII).

We will first bound the two error terms respectively, and then bound their sum. We present the following claim.

###### Claim 5.

We have the following results for the partition made by τ and θ̃.

1. These follow directly from the definitions of the regions X_I, X_II, X_III, X_IV.

2. These follow immediately from the construction method:

 CI+CIV⊆XI+XIV,CII+CIII⊆XII+XIII
3. Let x_S denote an arbitrary element of the set X_S; then

 f(θ,xI+II)≤f(θ,xIII+IV),f(~θ,xI+IV)≤f(~θ,xII+III)
4. By the definition of τ and the continuity of f, we have

 f(~θ,xI)≤τ, f(θ,xI)≤τ+ξ(ℓ) f(~θ,xII)≥τ, f(θ,xII)≥τ−ξ(ℓ)
5. If X_IV (resp. X_II) is not empty, then

 τ−2ξ(ℓ)≤f(~θ,xIV)≤τ,τ−ξ(ℓ)≤f(θ,xIV)≤τ+ξ(ℓ) τ≤f(~θ,xII)≤τ+2ξ(ℓ),τ−ξ(ℓ)≤f(θ,xII)≤τ+ξ(ℓ)
###### Proof.

The proofs of 1–4 are straightforward. As for 5, if X_IV (resp. X_II) is not empty, then combining the results of 3 and 4, we have

 f(θ,xIV)≥f(θ,xII)≥τ−ξ(ℓ) f(θ,xIV)≤f(~θ,xIV)+ξ(ℓ)≤τ+ξ(ℓ) f(~θ,xIV)≤f(~θ,xII)≤τ f(~θ,xIV)≥f(θ,xIV)−ξ(ℓ)≥τ−2ξ(ℓ)

Similarly, we can derive

 f(θ,xII)≤f(θ,xIV)≤τ+ξ(ℓ) f(θ,xII)≥f(~θ,xII)−ξ(ℓ)≥τ−ξ(ℓ) f(~θ,xII)≤f(θ,xII)+ξ(ℓ)≤τ+2ξ(ℓ)

To prove Theorem 2, we first need to prove the following key lemmas.

We have

 ∣∣fz