A Unified Approach to Coreset Learning

11/04/2021
by   Alaa Maalouf, et al.

A coreset for a given dataset and loss function is usually a small weighted set that approximates this loss for every query from a given set of queries. Coresets have been shown to be very useful in many applications. However, coreset construction is done in a problem-dependent manner, and it can take years to design and prove the correctness of a coreset for a specific family of queries. This limits the use of coresets in practical applications. Moreover, small coresets provably do not exist for many problems. To address these limitations, we propose a generic, learning-based algorithm for the construction of coresets. Our approach offers a new definition of a coreset, which is a natural relaxation of the standard definition and aims at approximating the average loss of the original data over the queries. This allows us to use a learning paradigm to compute a small coreset of a given set of inputs with respect to a given loss function, using a training set of queries. We derive formal guarantees for the proposed approach. Experimental evaluation on deep networks and classic machine learning problems shows that our learned coresets yield comparable or even better results than the existing algorithms with worst-case theoretical guarantees (which may be too pessimistic in practice). Furthermore, our approach applied to deep network pruning provides the first coreset for a full deep network, i.e., it compresses the entire network at once, rather than layer by layer or via similar divide-and-conquer methods.



1 Introduction

A coreset is usually defined as a small weighted subset of the original input set that provably approximates the given loss (objective) function for every query in a given set of queries. Coresets are useful in machine learning applications as they offer significant efficiency improvements. Namely, traditional (possibly inefficient, but provably optimal) algorithms can be applied to coresets to obtain an approximation of the optimal solution on the full dataset using time and memory that are smaller by orders of magnitude. Moreover, existing heuristics that already run fast can be improved in terms of accuracy by running them many times on the coreset in the time it takes for a single run on the original (big) dataset. Finally, coresets can be maintained for distributed and streaming data, where the stream is distributed in parallel from a server to several machines (e.g., a cloud), and the goal is to maintain the optimal solution (or an approximation of it) for the whole input seen so far in the stream using small update time, memory, and communication to the server.

In the recent decade, coresets, under different formal definitions, were applied to many machine learning algorithms, e.g., logistic regression [22, 47], SVM [18, 59, 57, 58, 60], clustering problems [11, 16, 25, 39, 53], matrix approximation [13, 42, 45, 52, 43], $\ell_p$-regression [8, 9, 55], decision trees [24], and others; see surveys [14, 50, 23].

Some attempts at using coresets have recently been made in application to deep networks. Apart from the standard use of coresets for reducing the amount of computation in training, e.g., by replacing the full data [46] or a batch [54] with a coreset, there are other applications that motivate the use of summarization methods in deep networks, e.g., model compression, continual learning, domain adaptation, federated learning, and neural architecture search. We discuss some of them below.

Model Compression. Deep networks are highly over-parametrized, resulting in high memory requirements and slower inference. While many methods have been developed for reducing the size of a previously trained network with no (or small) accuracy loss [36, 33, 5, 20, 10, 26, 65, 64, 44], most of them rely on heuristics, which perform well on known benchmarks but diverge considerably from the behavior of the original network on specific subsets of the input distribution [21]. A few previous works [49, 48, 2, 35] tried to resolve this problem by deriving a coreset for a fully connected or a convolutional layer with a provable trade-off between the compression rate and the approximation error for any future input. However, since these works construct a coreset for a single layer, the full network compression is performed in a layer-by-layer fashion.

Limited Data Access. Problems such as continual / incremental learning [34, 51, 63, 66, 38, 3], domain adaptation [32, 1], and federated learning [15] do not have access to the full data (due to memory limitations or privacy issues), and only a small summary of it can be used. Coresets offer a natural solution for these problems.

NAS. Another important application that could benefit from coresets is neural architecture search (NAS). Evaluating different architectures or choices of parameters using a large dataset is extremely time-consuming. A representative, small summary of the training set could be used for a reliable approximation of the full training, while greatly speeding up the search. A few recent works [67, 56] (inspired by the work of [62]) tried to learn a small synthetic set that summarizes the full set for NAS.

Previous attempts at summarizing a full training set with a small subset or a synthetic set showed the merit of using coresets in modern AI (e.g., for training deep networks). However, the summarization methods that they suggested were based on heuristics with no guarantees on the approximation error. Hence, it is not clear that existing heuristics for data summarization can scale up to real-life problems. On the other hand, theoretical coresets, which provably quantify the trade-off between data reduction and information loss for an objective function of interest, are mostly limited to simple, shallow models due to the challenges discussed below in Section 1.1. From the theoretical perspective, it seems that we cannot have coresets for a reasonable neural network under the classic definition involving the worst-case query (e.g., see Theorem 6 in [49]). In this paper, we try to find a middle ground between these two paradigms.

1.1 Coreset challenges

In many modern machine learning problems, obtaining non-trivial theoretical worst-case guarantees is usually impossible, either due to the high complexity of the target model (e.g., deep networks) or because every point in the input set is important in the sense of having high sensitivity [61]. Even for simple problems, it may take years to design a coreset and prove its correctness for the specific problem at hand.

Another problem with the existing theoretical frameworks is the lack of generality. Even the most generic frameworks among them [12, 29] replace the problem of computing a coreset for the $n$ input points with $n$ new optimization problems (known as sensitivity bounds), one for each of the input points. Solving these, however, might be harder than solving the original problem. Hence, different approximation techniques are usually tailored for each and every problem.
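For context, a standard formulation of the per-point quantity used by these frameworks (this reminder is ours, following [12, 29]; $P$, $w$, $Q$, and $f$ are the input set, weights, query set, and loss formally defined in Section 2) is the sensitivity
$$ s(p) \;:=\; \sup_{q \in Q} \frac{w(p)\, f(p, q)}{\sum_{p' \in P} w(p')\, f(p', q)}, $$
and the coreset size obtained by such frameworks typically grows with the total sensitivity $\sum_{p \in P} s(p)$, so bounding each $s(p)$ is itself a per-point optimization problem.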

1.2 Our Contribution

The above observations suggest that there is a need for a more generic approach that can compute a coreset automatically for a given pair of dataset and loss function, and that can be applied to hard problems, such as deep networks. It seems that this would require some relaxation of the standard coreset definition. Would such a relaxed coreset, produced by a generic algorithm, yield empirical results comparable to those of traditional coresets with provable guarantees? We answer this question affirmatively by providing:

  1. A new definition of a coreset, which is a relaxation of the traditional definition of the strong coreset.

  2. AutoCL: a generic and simple algorithm designed to compute a coreset (under the new definition) for almost any given input dataset and loss function.

  3. Example applications with highly competitive empirical results for: (a) problems with known coreset construction algorithms, namely, linear regression and logistic regression, where the goal is to summarize the input training set, and (b) model compression, i.e., learning a coreset of all training parameters of a deep neural network at once (useful for model pruning). To our knowledge, this is the first algorithm that aims to compute a coreset for the whole network at once, rather than layer by layer or via similar divide-and-conquer methods. It is also the first approach that suggests representing the coreset itself as a small (trainable) network that keeps improving on each iteration. In this sense, we suggest "coresets for deep learning using deep learning".

  4. Open code for reproducing our results [6]. We expect it to serve as a baseline for producing "empirical" coresets for many problems in the future, mainly because it requires very little familiarity with the existing theoretical research on coresets.

2 Preliminaries

Notations. For a finite set $P$ of items, we use $|P|$ to denote the number of items in $P$ (i.e., its cardinality). For an event $E$ we use $\Pr(E)$ to denote the probability that the event $E$ occurs, and for a random variable $X$ with a probability measure $\mu$, we use $\mathbb{E}_\mu[X]$ to denote its mean (expected value). Finally, for a loss function $\ell$ and an input set of variables $C$ (of any form), we use $\nabla_C \ell$ to denote a standard gradient computation of $\ell$ with respect to the set of variables $C$, and $C \leftarrow C - \eta \nabla_C \ell$ to denote a standard variable update using a gradient step, where $\eta$ is the learning rate.

The following (generic) definition of a query space encapsulates all the ingredients required to formally define an optimization problem.

Definition 1 (Query space; see Definition 4.2 in [4]).

Let $Z$ be a (possibly infinite) set called the ground set, $Q$ be a (possibly infinite) set called the query set, and let $f : Z \times Q \to [0,\infty)$ be a loss (or cost) function. Let $P \subseteq Z$ be a finite set called the input set, and let $w : P \to [0,\infty)$ be a weight function. The tuple $(P, w, Q, f)$ is called a query space over $Z$.

Typically, in the training step of a machine learning model, we solve the optimization problem $\min_{q \in Q} \sum_{p \in P} w(p)\, f(p, q)$, i.e., we aim at finding the query $q \in Q$ that minimizes the sum of fitting errors over every $p \in P$.

Definition 2 (Query cost).

Let $(P, w, Q, f)$ be a query space over $Z$. Then, for a query $q \in Q$ we define the total cost of $q$ as
$$f(P, w, q) := \sum_{p \in P} w(p)\, f(p, q).$$

In the next definition, we describe formally a (strong) coreset for a given optimization problem.

Definition 3 (Traditional Coresets).

For a query space $(P, w, Q, f)$ over $Z$ and an error parameter $\varepsilon \in (0,1)$, an $\varepsilon$-coreset is a pair $(C, u)$ such that $C \subseteq Z$, $u : C \to [0,\infty)$ is a weight function, and for every $q \in Q$, $f(C, u, q)$ is a $(1 \pm \varepsilon)$ multiplicative approximation of $f(P, w, q)$, i.e.,

$$\bigl| f(P, w, q) - f(C, u, q) \bigr| \le \varepsilon\, f(P, w, q). \tag{1}$$
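To make Definitions 2 and 3 concrete, here is a small, self-contained numerical sketch (ours, not part of the paper's material): it evaluates the query cost $f(P, w, q)$ and empirically checks the multiplicative condition (1) for a toy one-dimensional least-squares problem; all names are illustrative, and the candidate coreset is simply a uniform sample with rescaled weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy query space: P is a set of 1-D points, f(p, q) = (p - q)^2 (1-D least squares).
P = rng.normal(size=1000)            # input set
w = np.full(len(P), 1.0 / len(P))    # weights summing to 1
Q = np.linspace(-2.0, 2.0, 101)      # a finite set of queries

def cost(points, weights, q):
    """Total query cost: sum_i weights[i] * (points[i] - q)^2."""
    return np.sum(weights * (points - q) ** 2)

# A hypothetical candidate coreset: a uniform sample of P with rescaled weights.
m = 50
idx = rng.choice(len(P), size=m, replace=False)
C, u = P[idx], np.full(m, 1.0 / m)   # keeps the sum of weights equal to the sum of w

# Empirical worst-case relative error over the query set (condition (1)).
rel_err = max(abs(cost(P, w, q) - cost(C, u, q)) / cost(P, w, q) for q in Q)
print(f"(C, u) is an eps-coreset on this query set for eps ~= {rel_err:.3f}")
```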

3 Method

In this section we first explain our approach in general, emphasizing its novelty, and then we present our suggested framework in full detail.

3.1 Novel Framework

We propose a practical and generic framework for coreset construction for a wide family of problems via the following steps:

  1. Make the problem simpler by relaxing the definition of a coreset. Namely, we propose a new $\varepsilon$-coreset for the Average Loss (in Definition 5) that is a relaxation of the standard definition (in Definition 3) and is better suited to the learning formalism.

  2. Define coreset construction as a learning problem. Here, the coreset (under the new definition in Definition 5) is the training variable.

  3. Find the coreset that optimizes the empirical risk over a training set of queries. We assume that we are given a set of queries, chosen i.i.d. from an unknown distribution and we find a coreset that approximates the average loss of the original input data over the training set of queries.

  4. Show that the optimized coreset generalizes to all members in the query set. Namely, the expected loss on the coreset over all queries approximates the expected loss on the original input data.

3.2 $\varepsilon$-Coreset for the Average Loss

We relax the definition of a coreset by observing that in data mining and machine learning problems we are usually interested in approximating the average loss over the whole set of queries rather than approximating the loss of a specific query. To this end, we define a distribution over the set of queries in Definition 4, and then focus on approximating the expected loss in Definition 5.

Definition 4 (Measurable query space).

Let $(P, w, Q, f)$ be a query space over the ground set $Z$, and let $\mu$ be a probability measure on a probability space defined over the query set $Q$. Then, the tuple $(P, w, Q, \mu, f)$ is called a measurable query space over $Z$.

Definition 5 ($\varepsilon$-coreset for the Average Loss).

Let $(P, w, Q, \mu, f)$ be a measurable query space over $Z$. Let $\varepsilon > 0$ be an error parameter, $C \subseteq Z$ be a set, and $u : C \to [0,\infty)$ be a weight function such that
$$\left| \mathbb{E}_{q\sim\mu}\bigl[f(P, w, q)\bigr] - \mathbb{E}_{q\sim\mu}\bigl[f(C, u, q)\bigr] \right| \le \varepsilon,$$
i.e., the expected loss of the original set $P$, over the randomness of sampling a query $q$ from the distribution $\mu$, is approximated by the expected loss on $(C, u)$.

Then, the pair $(C, u)$ is called an $\varepsilon$-coreset for the measurable query space $(P, w, Q, \mu, f)$.

While $(P, w)$ is trivially an $\varepsilon$-coreset of itself, a coreset is efficient only if the cardinality of $C$ is significantly smaller than that of $P$, i.e., $|C| \ll |P|$, hopefully by orders of magnitude.

Remark: Throughout the literature, the term “coreset” usually refers to a small weighted subset of the input set (data). However, in other works (and in ours), this requirement is relaxed [7, 50]. In many applications this relaxation gives a significant benefit as it supports a much larger family of instances as coreset candidates.

3.3 Coreset Learning

Input: A finite input set $P$ and its weight function $w$, a finite set of queries $Q'$, a loss function $f$, and an integer (coreset size) $m \geq 1$.

1:  $C :=$ an arbitrary set of $m$ vectors from the ground set $Z$.
2:  $u(c) := 1/m$ for every $c \in C$.
3:  for $i := 1, \ldots, T$ do
4:      $\ell_P := \frac{1}{|Q'|} \sum_{q \in Q'} f(P, w, q)$.   The average loss on $P$.
5:      $\ell_C := \frac{1}{|Q'|} \sum_{q \in Q'} f(C, u, q)$.   The average loss on $C$.
6:      $\ell := |\ell_P - \ell_C| + \lambda \left| \sum_{p\in P} w(p) - \sum_{c\in C} u(c) \right|$.   The approximation error that we wish to minimize; $\lambda$ is a hyper-parameter to balance the two losses.
7:      $C := C - \eta \nabla_C \ell$.   Update $C$, where $\eta$ is the learning rate.
8:      $u := u - \eta \nabla_u \ell$.   Update $u$.
9:  end for
10:  return $(C, u)$
Algorithm 1: AutoCL$(P, w, Q', f, m)$

We propose to learn a coreset (and its weights) as in Definition 5 using gradient-based methods. We assume that we are given a set $P$ and its weights $w$ such that $\sum_{p\in P} w(p) = 1$ (we use this assumption for simplicity of the writing; practically, we can implement it by scaling the input weights to sum to $1$, and formally, all that is needed is scaling the sample size of the queries according to the sum of weights), and a set of queries $Q'$ sampled i.i.d. from $Q$ (according to the measure $\mu$). First, we aim to compute an $\varepsilon$-coreset $(C, u)$ of $P$ with respect to the finite set of queries $Q'$. Formally speaking, $(C, u)$ should satisfy:

$$\left| \frac{1}{|Q'|}\sum_{q\in Q'} f(P, w, q) - \frac{1}{|Q'|}\sum_{q\in Q'} f(C, u, q) \right| \le \varepsilon. \tag{2}$$

To do so, we can treat $Q'$ as our training data and learn a coreset $C$ of $P$ with respect to the objective $f$ by minimizing the following loss:

$$\left| \frac{1}{|Q'|}\sum_{q\in Q'} f(P, w, q) - \frac{1}{|Q'|}\sum_{q\in Q'} f(C, u, q) \right|.$$

This will guarantee that $(C, u)$ is an $\varepsilon$-coreset for the measurable query space $(P, w, Q', \mu', f)$, where $\mu'$ is the uniform distribution over the finite set $Q'$, i.e., $\mu'(q) = 1/|Q'|$ for every $q \in Q'$.

However, we wish the constraint in Eq. (2) to hold for the whole set of queries $Q$, in order to obtain an $\varepsilon$-coreset for our desired (original) measurable query space $(P, w, Q, \mu, f)$. To obtain a generalized solution (as we show in Section 3.4), we need to bound $\sup_{q\in Q} f(C, u, q)$. To do so, we should guarantee that the sum of the coreset weights approximates the original sum of weights, i.e.:

$$\left| \sum_{p\in P} w(p) - \sum_{c\in C} u(c) \right| \le \varepsilon. \tag{3}$$

The motivation behind bounding Eq. (3) is as follows. Recall that $Z$ is the ground set, i.e., $P, C \subseteq Z$. Let $M := \sup_{z\in Z,\, q\in Q} f(z, q)$, so that enforcing Eq. (3) yields, for every $q \in Q$,
$$f(C, u, q) = \sum_{c\in C} u(c)\, f(c, q) \le M \sum_{c\in C} u(c) \le M\left( \sum_{p\in P} w(p) + \varepsilon \right) = M(1 + \varepsilon).$$
Hence, we "force" our coreset to have a bounded loss over the whole query space $Q$; furthermore, this bound is proportional to the bound $M$ on the loss over the original input and to the approximation error $\varepsilon$.

To summarize, we learn an $\varepsilon$-coreset $(C, u)$ of $P$ with respect to the objective $f$, given a training set of queries $Q'$. To enforce the conditions in Eqs. (2) and (3) with a small $\varepsilon$, we minimize the following loss:

$$\left| \frac{1}{|Q'|}\sum_{q\in Q'} f(P, w, q) - \frac{1}{|Q'|}\sum_{q\in Q'} f(C, u, q) \right| + \lambda \left| \sum_{p\in P} w(p) - \sum_{c\in C} u(c) \right|. \tag{4}$$

Here, $\lambda$ is a hyper-parameter to balance the two losses. The algorithm for coreset learning is summarized in Algorithm 1.
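As a concrete illustration of Algorithm 1, here is a minimal PyTorch sketch (ours; the released code [6] may differ) for the linear-regression loss $f((x, y), q) = (x^\top q - y)^2$: the coreset points, their labels, and their weights are trainable tensors, and the loss of Eq. (4) is minimized with Adam. All names, such as learn_coreset, are illustrative.

```python
import torch

def learn_coreset(X, y, w, queries, m, steps=2000, lam=1.0, lr=1e-2):
    """Algorithm 1 sketch for f((x, y), q) = (x.q - y)^2.

    X: (n, d) inputs, y: (n,) targets, w: (n,) weights summing to 1,
    queries: (k, d) training queries, m: coreset size.
    """
    d = X.shape[1]
    # Trainable coreset: m points (with their own labels) and m weights.
    C_x = torch.randn(m, d, requires_grad=True)
    C_y = torch.randn(m, requires_grad=True)
    u = torch.full((m,), 1.0 / m, requires_grad=True)
    opt = torch.optim.Adam([C_x, C_y, u], lr=lr)

    def avg_loss(points, labels, weights):
        # Average over queries of f(., ., q) = sum_p weights[p] * (points[p].q - labels[p])^2.
        residuals = points @ queries.T - labels[:, None]          # (num_points, k)
        per_query = (weights[:, None] * residuals ** 2).sum(0)    # f(., ., q) for each q
        return per_query.mean()

    full_avg = avg_loss(X, y, w).detach()            # fixed: average loss on (P, w)
    for _ in range(steps):
        opt.zero_grad()
        loss = (avg_loss(C_x, C_y, u) - full_avg).abs() \
               + lam * (u.sum() - w.sum()).abs()     # the loss of Eq. (4)
        loss.backward()
        opt.step()
    return C_x.detach(), C_y.detach(), u.detach()

# Illustrative usage with random data and random queries.
torch.manual_seed(0)
X, y = torch.randn(500, 3), torch.randn(500)
w = torch.full((500,), 1.0 / 500)
queries = torch.randn(64, 3)
C_x, C_y, u = learn_coreset(X, y, w, queries, m=20)
```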

3.4 Generalization

We start by stating sufficient conditions for the $\varepsilon$-coreset (i.e., sufficient conditions for obtaining a generalized solution):

  1. With high probability, the expected loss on the set $P$ over all queries in $Q$ (i.e., $\mathbb{E}_{q\sim\mu}[f(P,w,q)]$) is approximated by the average loss on the same set over the sampled set of queries, i.e., with high probability
$$\left| \mathbb{E}_{q\sim\mu}\bigl[f(P,w,q)\bigr] - \frac{1}{|Q'|}\sum_{q\in Q'} f(P,w,q) \right| \le \varepsilon.$$

  2. The same should hold for $(C, u)$, i.e., with high probability
$$\left| \mathbb{E}_{q\sim\mu}\bigl[f(C,u,q)\bigr] - \frac{1}{|Q'|}\sum_{q\in Q'} f(C,u,q) \right| \le \varepsilon.$$

Then, by Eq. (2), $\frac{1}{|Q'|}\sum_{q\in Q'} f(C,u,q)$ approximates $\frac{1}{|Q'|}\sum_{q\in Q'} f(P,w,q)$; hence, combining conditions 1 and 2 with Eq. (2) yields that $\mathbb{E}_{q\sim\mu}[f(C,u,q)]$ approximates $\mathbb{E}_{q\sim\mu}[f(P,w,q)]$.

To show that condition 1 holds, we rely on Hoeffding's inequality as follows.

Claim 1 (Mean of Losses).

Let $(P, w, Q, \mu, f)$ be a measurable query space such that $\sum_{p\in P} w(p) = 1$, and let $M := \sup_{q\in Q} f(P, w, q)$. Let $\varepsilon \in (0,1)$ be an approximation error, and let $\delta \in (0,1)$ be a probability of failure. Let $Q'$ be a sufficiently large sample of queries from $Q$, chosen i.i.d., where each $q \in Q$ is sampled with probability $\mu(q)$. Then, with probability at least $1 - \delta$,
$$\left| \mathbb{E}_{q\sim\mu}\bigl[f(P,w,q)\bigr] - \frac{1}{|Q'|}\sum_{q\in Q'} f(P,w,q) \right| \le \varepsilon.$$

This claim states that, with high probability, the average loss on the set $P$ over the i.i.d. sampled set of queries $Q'$ approximates the expected loss on $P$ over all queries in $Q$ (i.e., $\mathbb{E}_{q\sim\mu}[f(P,w,q)]$). However, the size of $Q'$ should be large enough, proportional to the approximation error $\varepsilon$, the probability of failure $\delta$, and the maximum loss over every $q \in Q$, i.e., $M = \sup_{q\in Q} f(P,w,q)$ (see Claim 1). Now, recall that $Z$ is the ground set, i.e., $P \subseteq Z$, and $\sum_{p\in P} w(p) = 1$. As we formally show in Section B, since $P$ and $w$ are fixed, and since $\sum_{p\in P} w(p) = 1$, all that is needed for Claim 1 to hold is to sample enough queries (based on Hoeffding's inequality).
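For intuition, a standard Hoeffding calculation (ours; stated under the assumption that every loss $f(P,w,q)$ lies in the interval $[0, M]$) gives the order of the required number of sampled queries:
$$\Pr\left[\,\left|\frac{1}{|Q'|}\sum_{q\in Q'} f(P,w,q) - \mathbb{E}_{q\sim\mu}\bigl[f(P,w,q)\bigr]\right| \ge \varepsilon\,\right] \le 2\exp\!\left(-\frac{2\,|Q'|\,\varepsilon^2}{M^2}\right) \le \delta
\quad\Longleftrightarrow\quad
|Q'| \ge \frac{M^2 \ln(2/\delta)}{2\varepsilon^2}.$$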

To show that condition 2 holds, we can also use Hoeffding's inequality, but we additionally need to bound $\sup_{q\in Q} f(C, u, q)$. This was the reason for adding the constraint on the sum of weights: it yields $\sum_{c\in C} u(c) \le \sum_{p\in P} w(p) + \varepsilon = 1 + \varepsilon$. Formally,

Claim 2.

Let $(P, w, Q, \mu, f)$ be a measurable query space over $Z$, where $\sum_{p\in P} w(p) = 1$, and let $M := \sup_{z\in Z,\, q\in Q} f(z, q)$. Let $\varepsilon \in (0,1)$ be an approximation error, $\delta \in (0,1)$ be a probability of failure, and let $m \ge 1$ be an integer. Let $Q'$ be a sufficiently large sample of queries from $Q$, chosen i.i.d., where each $q \in Q$ is sampled with probability $\mu(q)$. Let $(C, u)$ be the output of a call to AutoCL$(P, w, Q', f, m)$; see Algorithm 1. If

  1. $\left| \sum_{p\in P} w(p) - \sum_{c\in C} u(c) \right| \le \varepsilon$, and

  2. $(C, u)$ satisfies Eq. (2), i.e., $\left| \frac{1}{|Q'|}\sum_{q\in Q'} f(P,w,q) - \frac{1}{|Q'|}\sum_{q\in Q'} f(C,u,q) \right| \le \varepsilon$,

then, with probability at least $1 - 2\delta$,
$$\left| \mathbb{E}_{q\sim\mu}\bigl[f(P,w,q)\bigr] - \mathbb{E}_{q\sim\mu}\bigl[f(C,u,q)\bigr] \right| \le 3\varepsilon.$$

Proof.

See proof in Section C in the appendix. ∎

3.5 Bridging the Gap Between Theory and Practice

We take one more step towards deriving effective, practical coresets and replace the loss in Eq. (4) (and Line 6 in Algorithm 1) with a formulation that is more similar to the standard coreset definition, namely the relative approximation error $\frac{|f(P,w,q) - f(C,u,q)|}{f(P,w,q)}$, which we minimize on average over the training set of queries $Q'$; see Algorithm 2 in the appendix.

A solution obtained by Algorithm 2 aims to minimize the average approximation error over every query in the sampled set and thus is very similar to Definition 3, with the modification of the average instead of the worst case. This enables us to obtain a better coreset in practice, one that approximates the loss of every query (as the minimization is over the average approximation error across all queries, and not only over the difference between the average losses of the coreset and the original data over all queries). Our empirical evaluation in Section 4 verifies that the coreset obtained by running Algorithm 2 generalizes to unseen queries, i.e., the average approximation error of the coreset over all queries is small compared to other coreset construction algorithms. Moreover, we show below that the solution obtained by Algorithm 2 satisfies Definition 5.

Let $(C, u)$ be a solution that minimizes the average loss in Algorithm 2. Then we can find a constant $\varepsilon' > 0$ such that

$$\frac{1}{|Q'|}\sum_{q\in Q'} \frac{\bigl| f(P,w,q) - f(C,u,q) \bigr|}{f(P,w,q)} \le \varepsilon'. \tag{5}$$

For the constant $\varepsilon$ from Definition 5, let $M' := \max_{q\in Q'} f(P,w,q)$, and let $\varepsilon' := \varepsilon / M'$. By simple derivations (see Section D.1 in the appendix) we can show that

$$\left| \frac{1}{|Q'|}\sum_{q\in Q'} f(P,w,q) - \frac{1}{|Q'|}\sum_{q\in Q'} f(C,u,q) \right| \le \varepsilon. \tag{6}$$

Hence, by Claim 2, the solution obtained by Algorithm 2 generalizes to the whole measurable query space $(P, w, Q, \mu, f)$ and thus satisfies the definition of an $\varepsilon$-coreset, while simultaneously satisfying Eq. (5), which is closely related to the original definition of coresets in Definition 3.

4 Experimental Results

We proposed a unified framework for coreset construction that allows us to use the same algorithm for different problems. We demonstrate this on examples of training-set reduction for linear and logistic regression in Section 4.1 and on examples of model size reduction, a.k.a. model compression, of an MLP and a CNN in Section 4.2. We show that in both cases our unified framework yields comparable or even better results than previous coresets, which are specifically fitted to the problem at hand.

Fig. 1: Linear regression: a – Approximation error for the optimal solution as a function of the coreset’s size; b – average approximation error on the unseen test data as a function of the coreset’s size.
Fig. 2: Logistic regression: a – Approximation error for the optimal solution as a function of the coreset’s size; b – average approximation error on the unseen test data as a function of the coreset’s size.

4.1 Training Data Coresets

We demonstrate the practical strength of our coreset construction scheme in the context of data reduction for linear and logistic regression.

4.1.1 Setup

For linear regression, we ran our experiments on the 3D Road Network dataset (https://archive.ics.uci.edu/ml/datasets/3D+Road+Network+(North+Jutland,+Denmark)) (North Jutland, Denmark) [27], which contains 434,874 records. We used two attributes, "Longitude" [Double] and "Latitude" [Double], to predict the third attribute, "Height in meters" [Double]. We created a set of queries by sampling models from training trajectories of linear regression computed on the full dataset from 20 random starting points. We split the sampled models into training, validation, and test sets of sizes 20,000 (i.e., $|Q'| = 20{,}000$), 2,000, and 2,000, respectively. We computed weighted coresets of different sizes, from 50 to 140. For each coreset size, we invoked Algorithm 2 with the Adam optimizer [28] for 10 epochs with a batch size of 25 and a learning rate of 0.01. The results were averaged across several trials, with the remaining hyper-parameters kept fixed.

For logistic regression, we performed the experiments on the HTRU dataset (https://archive.ics.uci.edu/ml/datasets/HTRU2), comprising 17,898 radio emissions of pulsar-star candidates, each represented by 8 features and a binary label [41]. We created a set of queries similarly to the linear-regression experiment, and we sampled from this set training, validation, and test sets of sizes 8,000, 1,600, and 800, respectively. The results were averaged across several trials.

To make the optimization simpler, we removed the weight-fitting term from the loss in Algorithm 2 and assumed that all members of the coreset have the same weight. We ran the optimization with the Adam optimizer; the number of epochs, the batch size, and the learning rate were set separately for this experiment. Using this modification, we computed coresets of different sizes ranging from 100 to 500.

The differences in hyper-parameters and coreset sizes between the logistic- and linear-regression experiments are due to the higher complexity of the logistic-regression problem. First, computing a coreset for logistic regression is known to be a hard problem, for which (high) lower bounds on the coreset size exist [47]. The second (and probably less significant) reason is the dimension of the input data: the input used in the logistic-regression experiment has a higher dimension.
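For illustration, here is a minimal sketch (ours; the data, step counts, and learning rate are arbitrary stand-ins) of how such a query set can be generated by recording intermediate models along gradient-descent trajectories of linear regression started from random initializations:

```python
import numpy as np

def sample_query_models(X, y, num_starts=20, steps=1000, lr=1e-6, record_every=1):
    """Collect intermediate linear-regression models (queries) from several GD runs."""
    rng = np.random.default_rng(0)
    queries = []
    for _ in range(num_starts):
        q = rng.normal(size=X.shape[1])              # random starting point
        for t in range(steps):
            grad = 2 * X.T @ (X @ q - y) / len(X)    # gradient of the mean squared error
            q = q - lr * grad
            if t % record_every == 0:
                queries.append(q.copy())             # each intermediate model is a query
    return np.array(queries)

# Illustrative usage on random data standing in for the 3D Road Network attributes.
X = np.random.randn(1000, 2)                         # e.g., longitude and latitude
y = np.random.randn(1000)                            # e.g., height in meters
Q_train = sample_query_models(X, y)
```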

4.1.2 Results

We refer to a weighted, labeled input dataset by $(P, y, w)$, where $P$ is the dataset and $y$ and $w$ are the labeling function and the weight function, respectively; i.e., each point $p \in P$ is a sample in the dataset, $y(p)$ is its corresponding label, and $w(p)$ is its weight. Similarly, we refer to the compressed labeled dataset (coreset) by $(C, y_C, u)$. We report the results using two measures, as explained below.

  1. Approximation error for the optimal solution. Let $q^*$ be the query that minimizes the corresponding objective loss function on the full data, e.g., in linear regression $q^* \in \arg\min_{q} \sum_{p\in P} w(p)\bigl(p^\top q - y(p)\bigr)^2$, where $f(P, w, q) = \sum_{p\in P} w(p)\bigl(p^\top q - y(p)\bigr)^2$. For each coreset $(C, y_C, u)$, we compute the query $q^*_C$ that minimizes the same objective on the coreset; we then calculate the approximation error for the optimal solution as $\frac{\bigl| f(P, w, q^*_C) - f(P, w, q^*) \bigr|}{f(P, w, q^*)}$.

  2. Average approximation error. For every coreset $(C, y_C, u)$, we report the average approximation error over every query in the test set $Q_{\text{test}}$, i.e., $\frac{1}{|Q_{\text{test}}|}\sum_{q\in Q_{\text{test}}} \frac{\bigl| f(P, w, q) - f(C, u, q) \bigr|}{f(P, w, q)}$ (a small computational sketch follows below).
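A small sketch (ours, with illustrative names and the linear-regression loss) of how these two measures can be computed, assuming NumPy arrays for the data, labels, and weights:

```python
import numpy as np

def total_cost(P, y, w, q):
    """f(P, w, q) for the linear-regression loss (p.q - y)^2."""
    return np.sum(w * (P @ q - y) ** 2)

def optimal_solution_error(P, y, w, C, yc, u):
    """Relative error of the optimum computed on the coreset, evaluated on the full data."""
    q_full = np.linalg.lstsq(P * np.sqrt(w)[:, None], y * np.sqrt(w), rcond=None)[0]
    q_core = np.linalg.lstsq(C * np.sqrt(u)[:, None], yc * np.sqrt(u), rcond=None)[0]
    f_opt = total_cost(P, y, w, q_full)
    return abs(total_cost(P, y, w, q_core) - f_opt) / f_opt

def average_approximation_error(P, y, w, C, yc, u, test_queries):
    """Mean relative error |f(P,w,q) - f(C,u,q)| / f(P,w,q) over the test queries."""
    errs = [abs(total_cost(P, y, w, q) - total_cost(C, yc, u, q)) / total_cost(P, y, w, q)
            for q in test_queries]
    return float(np.mean(errs))
```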

We compare our coresets for linear regression with uniform sampling and with the coreset from [45]; the approximation error for the optimal solution of the three methods is shown in Figure 1(a), and the average approximation error in Figure 1(b). We compare our coreset for logistic regression with uniform sampling and with the coreset from [60]; the corresponding errors of the compared methods are shown in Figure 2(a) and Figure 2(b).

In both experiments, we observe that our learned coresets outperform both uniform sampling and the theoretical counterparts. Our method yields a very low average approximation error because it was explicitly trained to derive a coreset that minimizes the average approximation error on the training set of queries, and the learned coreset succeeded in generalizing to unseen queries.

4.2 Model Coreset for Structured Pruning

The goal of model compression is to reduce the run time and the memory requirements during inference with no or little accuracy loss compared to the original model. Structured pruning reduces the size of a large trained deep network by reducing the width of its layers (pruning neurons in fully connected layers and filters in convolutional layers). An alternative approach is sparsification, which zeros out unimportant parameters in a deep network. The main drawback of sparsification is that it leads to an irregular network structure, which needs special treatment of sparse representations, making it hard to achieve actual computational savings. Structured pruning simply reduces the size of the tensors, which allows running the resulting network without any amendment. Due to this advantage of structured pruning over sparsification, we perform structured pruning of deep networks in our experiments.

We assume that the target small architecture is given, and our task is to compute the training parameters of the small architecture that best approximate the original large network. We view filters in a CNN, or neurons in a fully connected network, as items in the full set $P$, and the training data as the query set $Q$. We use the small architecture to define the coreset size in each layer, and we learn an equally weighted coreset (the small network) using Algorithm 2 with fixed, equal weights. We report the experiments for structured pruning of a fully connected network in Section 4.2.1 and for channel pruning in Section 4.2.2.
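A minimal PyTorch sketch (ours; the architectures and the output-matching loss are simplified stand-ins rather than the paper's exact training recipe) of this idea: the small network's parameters play the role of the coreset and are trained so that its outputs approximate those of the frozen large network over the query inputs (training data).

```python
import torch
import torch.nn as nn

# Stand-ins for the frozen large (trained) network and the narrower target architecture.
big = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 100), nn.ReLU(), nn.Linear(100, 10))
small = nn.Sequential(nn.Linear(784, 50), nn.ReLU(), nn.Linear(50, 20), nn.ReLU(), nn.Linear(20, 10))
for p in big.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(small.parameters(), lr=1e-3, weight_decay=1e-4)

def train_step(x):
    """One step of Algorithm 2-style training; the queries are input batches x."""
    opt.zero_grad()
    target = big(x)                        # output of the original (full) network on the query
    approx = small(x)                      # output of the learned small network on the query
    err = (approx - target).abs().mean()   # average approximation error over the batch
    err.backward()
    opt.step()
    return err.item()

# Illustrative usage with random inputs standing in for MNIST images.
for _ in range(100):
    train_step(torch.randn(64, 784))
```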

4.2.1 Neuron Pruning

Setup. We used the LeNet-300-100 model with 266,610 parameters, trained on MNIST [30], as our baseline fully connected model. It comprises two fully connected hidden layers with 300 and 100 neurons, respectively, each followed by a ReLU activation. We trained the baseline model with the Adam optimizer; it achieved a test error of 2.07% (see Table II). The target small architecture had a reduced number of neurons in the first and second hidden layers, resulting in a compression ratio of 89.63% (see Table II). We applied the training procedure of Algorithm 2 to learn the weights of this network, using the Adam optimizer with regularization.

Results. The coreset (compressed) model improved on the baseline in both test accuracy and test loss; its test error decreased by 0.04% (see Table II). Next, we compare our results to a pair of other coreset-based compression methods in Table II, and to the non-coreset methods Filter Thresholding (FT) [31], SoftNet [19], and ThiNet [40], as implemented in [35]. We observe that the learned coreset performs better than most compared methods and comparably to the algorithm derived from the theoretical coreset framework. Note that previous coreset methods [49, 35] are designed for a single layer, while our algorithm does not have this limitation and can be applied to compress all layers of the network in a single run. Moreover, applied to DNN compression, our framework can work on individual weights (sparsification), neurons (as shown above), and channels (as we show next).

4.2.2 Channel Pruning

Setup.

We used a PyTorch implementation of the VGGNet-19 network (VGG-code-link) for CIFAR-10 from [37], with about 20M parameters, as our baseline CNN model (see Table I for more details). The baseline error in our experiments was 6.75% (see Table III). The target architecture of the small network (https://github.com/foolwood/pytorch-slimming; see Table I) corresponds to a 70% compression ratio and to a reduction of the number of parameters by roughly 88%. We ran Algorithm 2, using the small architecture to define the size of each layer, with the Adam optimizer and regularization.

Results. Our compressed model improved on the baseline network, reducing the test error by 0.26% (see Table III). Table III compares the small-network accuracy of the learned coreset with the channel-pruning coreset from [48] and several non-coreset methods. While the results are comparable, our algorithm is much simpler and is not tailored to the problem at hand. The coreset reported in [48] was constructed by applying a channel-pruning coreset in a layer-by-layer fashion, while our learned coreset is computed in one shot for the entire network. Finally, we remind the reader that our framework is generic and could be applied to many other applications beyond compressing DNNs.

Layer Width (original) Width (compressed)
1 64 49
2 64 64
3 128 128
4 128 128
5 256 256
6 256 254
7 256 234
8 256 198
9 512 114
10 512 41
11 512 24
12 512 11
13 512 14
14 512 13
15 512 19
16 512 104
TABLE I: VGG-19 original and compressed architectures.

Pruning Method | Baseline Error (%) | Small Model Error (%) | Compression Ratio
FT [31] | 1.59 | +0.35 | 81.68%
SoftNet [19] | 1.59 | +0.41 | 81.69%
ThiNet [40] | 1.59 | +10.58 | 75.01%
Sample-based Coreset [35] | 1.59 | +0.41 | 84.32%
Pruning via Coresets [48] | 2.16 | -0.13 | —
Learned Coreset (ours) | 2.07 | -0.04 | 89.63%

TABLE II: Neural Pruning of LeNet-300-100 for MNIST. The results of FT, SoftNet, ThiNet and Sample-Based Coreset are reported in [35]. ‘+’ and ‘-’ correspond to increase and decrease in error, respectively.

Pruning Method | Baseline Error (%) | Small Model Error (%) | Compression Ratio
Unstructured Pruning [17] | 6.5 | -0.02 | 80%
Structured Pruning [37] | 6.33 | -0.13 | 70%
Pruning via Coresets [48] | 6.33 | -0.29 | 70%
Learned Coreset (ours) | 6.75 | -0.26 | 70%

TABLE III: Channel pruning of VGG-19 for CIFAR-10.

5 Conclusions

We proposed a novel unified framework for coreset learning that is theoretically motivated and can address problems for which obtaining theoretical worst-case guarantees is impossible. Following this framework, we suggested a relaxation of the coreset definition from the worst-case to the average-loss approximation. We proposed a learning algorithm that receives a sample set of queries and a loss function associated with the problem at hand, and outputs an average-loss coreset that holds for the training set of queries and generalizes to unseen queries. We showed that if the sample set of queries is sufficiently large, then the average loss over the coreset closely approximates the average loss over the full set for the entire query space. We then showed empirically that our learned coresets are capable of generalizing to unseen queries even for arbitrary sample sizes. Our experiments demonstrated that coresets learned by our new approach yield comparable and even better approximations of the optimal solution's loss and of the average loss over unseen queries than coresets that have worst-case guarantees. Moreover, our method applied to the problem of deep network pruning provides the first full-network coreset, with excellent performance. In future work, we will try to reduce the sampling bound and will apply the proposed framework to derive new coresets.

References

  • [1] T. Asami, R. Masumura, Y. Yamaguchi, H. Masataki, and Y. Aono (2017) Domain adaptation of dnn acoustic models using knowledge distillation. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. , pp. 5185–5189. Cited by: §1.
  • [2] C. Baykal, L. Liebenwein, I. Gilitschenski, D. Feldman, and D. Rus (2019) Data-dependent coresets for compressing neural networks with applications to generalization bounds. In International Conference on Learning Representations, External Links: Link Cited by: §1.
  • [3] Z. Borsos, M. Mutny, and A. Krause (2020) Coresets via bilevel optimization for continual learning and streaming. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Cited by: §1.
  • [4] V. Braverman, D. Feldman, and H. Lang (2016) New frameworks for offline and streaming coreset constructions. arXiv preprint arXiv:1612.00889. Cited by: Definition 1.
  • [5] J. Chen, S. Chen, and S. J. Pan (2020) Storage efficient and dynamic flexible runtime channel pruning via deep reinforcement learning. Advances in Neural Information Processing Systems 33. Cited by: §1.
  • [6] Code (2021) Open source code for all the algorithms presented in this paper. Note: the authors commit to publish upon acceptance of this paper or reviewer request. Cited by: item iv.
  • [7] M. B. Cohen, S. Elder, C. Musco, C. Musco, and M. Persu (2015) Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pp. 163–172. Cited by: §3.2.
  • [8] M. B. Cohen and R. Peng (2015) Lp row sampling by lewis weights. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pp. 183–192. Cited by: §1.
  • [9] A. Dasgupta, P. Drineas, B. Harb, R. Kumar, and M. W. Mahoney (2009) Sampling algorithms and coresets for ell_p regression. SIAM Journal on Computing 38 (5), pp. 2060–2078. Cited by: §1.
  • [10] X. Dong, J. Huang, Y. Yang, and S. Yan (2017) More is less: a more complicated network with less inference complexity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5840–5848. Cited by: §1.
  • [11] D. Feldman, M. Faulkner, and A. Krause (2011) Scalable training of mixture models via coresets. In Advances in neural information processing systems, pp. 2142–2150. Cited by: §1.
  • [12] D. Feldman and M. Langberg (2011) A unified framework for approximating and clustering data. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pp. 569–578. Cited by: §1.1.
  • [13] D. Feldman, M. Schmidt, and C. Sohler (2013) Turning big data into tiny data: constant-size coresets for k-means, pca and projective clustering. In Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms, pp. 1434–1453. Cited by: §1.
  • [14] D. Feldman (2020) Core-sets: an updated survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, https://arxiv.org/abs/2011.09384 10 (1), pp. e1335. Cited by: §1.
  • [15] J. Goetz and A. Tewari (2020) Federated learning via synthetic data. In CoRR, Vol. abs/2008.04489. External Links: 2008.04489 Cited by: §1.
  • [16] L. Gu (2012) A coreset-based semi-supervised clustering using one-class support vector machines. In Control Engineering and Communication Technology (ICCECT), 2012 International Conference on, pp. 52–55. Cited by: §1.
  • [17] S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both weights and connections for efficient neural network. Advances in neural information processing systems 28, pp. 1135–1143. Cited by: TABLE III.
  • [18] S. Har-Peled, D. Roth, and D. Zimak (2007) Maximum margin coresets for active and noise tolerant learning.. In IJCAI, pp. 836–841. Cited by: §1.
  • [19] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang (2018) Soft filter pruning for accelerating deep convolutional neural networks. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 2234–2240. Cited by: §4.2.1, TABLE II.
  • [20] Y. He, P. Liu, Z. Wang, Z. Hu, and Y. Yang (2019) Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4340–4349. Cited by: §1.
  • [21] S. Hooker, A. Courville, Y. Dauphin, and A. Frome (2020) What does a pruned deep neural network forgets?. Bridging AI and Cognitive Science ICLR Workshop. Cited by: §1.
  • [22] J. Huggins, T. Campbell, and T. Broderick (2016) Coresets for scalable bayesian logistic regression. In Advances In Neural Information Processing Systems, pp. 4080–4088. Cited by: §1.
  • [23] I. Jubran, A. Maalouf, and D. Feldman (2019) Introduction to coresets: accurate coresets. arXiv preprint arXiv:1910.08707. Cited by: §1.
  • [24] I. Jubran, E. E. S. Shayda, I. Newman, and D. Feldman (2021) Coresets for decision trees of signals. arXiv preprint arXiv:2110.03195. Cited by: §1.
  • [25] I. Jubran, M. Tukan, A. Maalouf, and D. Feldman (2020) Sets clustering. In International Conference on Machine Learning, pp. 4994–5005. Cited by: §1.
  • [26] M. Kang and B. Han (2020) Operation-aware soft channel pruning using differentiable masks. In International Conference on Machine Learning, pp. 5122–5131. Cited by: §1.
  • [27] M. Kaul, B. Yang, and C. S. Jensen (2013) Building accurate 3d spatial networks to enable next generation intelligent transportation systems. In 2013 IEEE 14th International Conference on Mobile Data Management, Vol. 1, pp. 137–146. Cited by: §4.1.1.
  • [28] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.1.
  • [29] M. Langberg and L. J. Schulman (2010) Universal ε-approximators for integrals. In Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms, pp. 598–607. Cited by: §1.1.
  • [30] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §4.2.1.
  • [31] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf (2016) Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710. Cited by: §4.2.1, TABLE II.
  • [32] J. Li, M. L. Seltzer, X. Wang, R. Zhao, and Y. Gong (2017) Large-scale domain adaptation via teacher-student learning. In Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 20-24, 2017, F. Lacerda (Ed.), pp. 2386–2390. Cited by: §1.
  • [33] Y. Li, S. Gu, L. V. Gool, and R. Timofte (2019) Learning filter basis for convolutional neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5623–5632. Cited by: §1.
  • [34] Z. Li and D. Hoiem (2018) Learning without forgetting. IEEE Trans. Pattern Anal. Mach. Intell. 40 (12), pp. 2935–2947. Cited by: §1.
  • [35] L. Liebenwein, C. Baykal, H. Lang, D. Feldman, and D. Rus (2020) Provable filter pruning for efficient neural networks. In ICLR, Cited by: §1, §4.2.1, TABLE II.
  • [36] Z. Liu, H. Mu, X. Zhang, Z. Guo, X. Yang, K. Cheng, and J. Sun (2019) Metapruning: meta learning for automatic neural network channel pruning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3296–3305. Cited by: §1.
  • [37] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang (2017) Learning efficient convolutional networks through network slimming. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2755–2763. Cited by: §4.2.2, TABLE III.
  • [38] D. Lopez-Paz and M. Ranzato (2017) Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pp. 6467–6476. Cited by: §1.
  • [39] M. Lucic, O. Bachem, and A. Krause (2016-09–11 May) Strong coresets for hard and soft bregman clustering with applications to exponential family mixtures. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, A. Gretton and C. C. Robert (Eds.), Proceedings of Machine Learning Research, Vol. 51, Cadiz, Spain, pp. 1–9. External Links: Link Cited by: §1.
  • [40] J. Luo, J. Wu, and W. Lin (2017) Thinet: a filter level pruning method for deep neural network compression. In Proceedings of the IEEE international conference on computer vision, pp. 5058–5066. Cited by: §4.2.1, TABLE II.
  • [41] R. J. Lyon, B. Stappers, S. Cooper, J. Brooke, and J. Knowles (2016) Fifty years of pulsar candidate selection: from simple filters to a new principled real-time classification approach. Monthly Notices of the Royal Astronomical Society 459 (1), pp. 1104–1123. Cited by: §4.1.1.
  • [42] A. Maalouf, I. Jubran, and D. Feldman (2019) Fast and accurate least-mean-squares solvers. In Advances in Neural Information Processing Systems, pp. 8305–8316. Cited by: §1.
  • [43] A. Maalouf, I. Jubran, M. Tukan, and D. Feldman (2021) Coresets for the average case error for finite query sets. Sensors 21 (19), pp. 6689. Cited by: §1.
  • [44] A. Maalouf, H. Lang, D. Rus, and D. Feldman (2021) Deep learning meets projective clustering. In International Conference on Learning Representations, External Links: Link Cited by: §1.
  • [45] A. Maalouf, A. Statman, and D. Feldman (2020) Tight sensitivity bounds for smaller coresets. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2051–2061. Cited by: §1, §4.1.2.
  • [46] B. Mirzasoleiman, J. Bilmes, and J. Leskovec (2020) Coresets for data-efficient training of machine learning models. In International Conference on Machine Learning (ICML), Cited by: §1.
  • [47] A. Munteanu, C. Schwiegelshohn, C. Sohler, and D. Woodruff (2018) On coresets for logistic regression. In Advances in Neural Information Processing Systems, pp. 6561–6570. Cited by: §1, §4.1.1.
  • [48] B. Mussay, D. Feldman, S. Zhou, V. Braverman, and M. Osadchy (2020) Data-independent structured pruning of neural networks via coresets. In CoRR, Vol. abs/2008.08316. External Links: 2008.08316 Cited by: §1, §4.2.2, TABLE II, TABLE III.
  • [49] B. Mussay, M. Osadchy, V. Braverman, S. Zhou, and D. Feldman (2020) Data-independent neural pruning via coresets. In 8th International Conference on Learning Representations, ICLR 2020, Cited by: §1, §1, §4.2.1.
  • [50] J. M. Phillips (2016) Coresets and sketches. arXiv preprint arXiv:1601.00617. Cited by: §1, §3.2.
  • [51] S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert (2017) iCaRL: incremental classifier and representation learning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 5533–5542. Cited by: §1.
  • [52] T. Sarlos (2006) Improved approximation algorithms for large matrices via random projections. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), pp. 143–152. Cited by: §1.
  • [53] M. Schmidt, C. Schwiegelshohn, and C. Sohler (2019) Fair coresets and streaming algorithms for fair k-means. In International Workshop on Approximation and Online Algorithms, pp. 232–251. Cited by: §1.
  • [54] S. Sinha, H. Zhang, A. Goyal, Y. Bengio, H. Larochelle, and A. Odena (2019) Small-gan: speeding up GAN training using core-sets. CoRR abs/1910.13540. External Links: Link, 1910.13540 Cited by: §1.
  • [55] C. Sohler and D. P. Woodruff (2011) Subspace embeddings for the l1-norm with applications. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pp. 755–764. Cited by: §1.
  • [56] F. P. Such, A. Rawal, J. Lehman, K. O. Stanley, and J. Clune (2019) Generative teaching networks: accelerating neural architecture search by learning to generate synthetic training data. CoRR abs/1912.07768. External Links: 1912.07768 Cited by: §1.
  • [57] I. W. Tsang, J. T. Kwok, and P. Cheung (2005) Core vector machines: fast svm training on very large data sets. Journal of Machine Learning Research 6 (Apr), pp. 363–392. Cited by: §1.
  • [58] I. W. Tsang, J. T. Kwok, and P. Cheung (2005) Very large svm training using core vector machines.. In AISTATS, Cited by: §1.
  • [59] I. Tsang, J. Kwok, and J. M. Zurada (2006) Generalized core vector machines. IEEE Transactions on Neural Networks 17 (5), pp. 1126–1140. Cited by: §1.
  • [60] M. Tukan, A. Maalouf, and D. Feldman (2020) Coresets for near-convex functions. In Advances in Neural Information Processing Systems, Cited by: §1, §4.1.2.
  • [61] M. Tukan, C. Baykal, D. Feldman, and D. Rus (2020) On coresets for support vector machines. In International Conference on Theory and Applications of Models of Computation, pp. 287–299. Cited by: §1.1.
  • [62] T. Wang, J. Zhu, A. Torralba, and A. A. Efros (2018) Dataset distillation. External Links: 1811.10959 Cited by: §1.
  • [63] J. Wen, Y. Cao, and R. Huang (2018) Few-shot self reminder to overcome catastrophic forgetting. CoRR abs/1812.00543. Cited by: §1.
  • [64] J. Ye, X. Lu, Z. Lin, and J. Z. Wang (2018) Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. In International Conference on Learning Representations, External Links: Link Cited by: §1.
  • [65] M. Ye, C. Gong, L. Nie, D. Zhou, A. Klivans, and Q. Liu (2020) Good subnetworks provably exist: pruning via greedy forward selection. In International Conference on Machine Learning, pp. 10820–10830. Cited by: §1.
  • [66] M. Zhang, T. Wang, J. H. Lim, and J. Feng (2019) Prototype reminding for continual learning. CoRR abs/1905.09447. Cited by: §1.
  • [67] B. Zhao, K. R. Mopuri, and H. Bilen (2020) Dataset condensation with gradient matching. CoRR abs/2006.05929. Cited by: §1.

Appendix A Hoeffding Theorem

Theorem 3 (Hoeffding).

Let $X_1, \ldots, X_k$ be independent random variables, where it is known that for every $i$, $X_i$ is strictly bounded by the interval $[a_i, b_i]$. Define the empirical mean of these variables by $\bar{X} := \frac{1}{k}\sum_{i=1}^{k} X_i$. Then
$$\Pr\Bigl( \bigl| \bar{X} - \mathbb{E}[\bar{X}] \bigr| \ge t \Bigr) \le 2\exp\left( - \frac{2 k^2 t^2}{\sum_{i=1}^{k} (b_i - a_i)^2} \right).$$

Appendix B Proof of Claim 1

Proof.

First, observe that: (i) the probability distribution $\mu$ is defined over the query set $Q$, and (ii) the loss $f(P, w, q)$ in our case is a function of $q$ alone, since $P$ and $w$ are fixed (given). Thus, we can define the corresponding probability distribution over the (multi)-set of losses $\{ f(P, w, q) \mid q \in Q \}$ as follows: for every $q \in Q$, the loss value $v = f(P, w, q)$ is drawn with probability $\mu(q)$.

Moreover, the sampled set $Q' = \{q_1, \ldots, q_k\}$ has its corresponding sampled losses set $\{ f(P, w, q_1), \ldots, f(P, w, q_k) \}$. Hence, we have that

$$\mathbb{E}_{q\sim\mu}\bigl[ f(P, w, q) \bigr] = \mathbb{E}\bigl[ v \bigr], \tag{7}$$
$$\frac{1}{|Q'|}\sum_{q\in Q'} f(P, w, q) = \frac{1}{k}\sum_{i=1}^{k} v_i, \tag{8}$$

where $v_i := f(P, w, q_i)$ denotes the $i$th sampled loss. By applying Hoeffding's inequality (see Theorem 3 in the appendix) we have:

$$\Pr\left( \left| \frac{1}{k}\sum_{i=1}^{k} v_i - \mathbb{E}[v] \right| \ge \varepsilon \right) \le 2\exp\left( - \frac{2 k^2 \varepsilon^2}{\sum_{i=1}^{k} (b_i - a_i)^2} \right), \tag{9}$$

where $a_i$ and $b_i$ are the lower and upper bounds on the loss of the $i$th sampled query, respectively.

Since, by the definition of $M$, we have that for every $q \in Q$, $0 \le f(P, w, q) \le M$, we obtain that

$$\sum_{i=1}^{k} (b_i - a_i)^2 \le k M^2. \tag{10}$$

Plugging (10) into (9) yields

$$\Pr\left( \left| \frac{1}{k}\sum_{i=1}^{k} v_i - \mathbb{E}[v] \right| \ge \varepsilon \right) \le 2\exp\left( - \frac{2 k \varepsilon^2}{M^2} \right). \tag{11}$$

Finally, combining (8), (9), and (11) proves the claim, as the right-hand side of (11) is at most $\delta$ for a sufficiently large sample size $k$.

Appendix C Proof of Claim 2

Proof.

Let $M := \sup_{z\in Z,\, q\in Q} f(z, q)$, and let $q \in Q$. First we observe that

$$f(P, w, q) = \sum_{p\in P} w(p)\, f(p, q) \le M \sum_{p\in P} w(p) = M, \tag{12}$$

where the inequality holds by the definition of $M$, and the last equality holds since $\sum_{p\in P} w(p) = 1$. We also have

$$f(C, u, q) = \sum_{c\in C} u(c)\, f(c, q) \le M \sum_{c\in C} u(c) \le M\left( \sum_{p\in P} w(p) + \varepsilon \right) = M(1 + \varepsilon), \tag{13}$$

where the first inequality holds by the definition of $M$, the second by Assumption 1 of Claim 2, and the last equality holds since $\sum_{p\in P} w(p) = 1$. By combining (12) and (13) we get that both $f(P, w, \cdot)$ and $f(C, u, \cdot)$ are bounded over $Q$.

Now, we note that Claim 1 holds for any measurable query space. Hence, for the pair of measurable query spaces $(P, w, Q, \mu, f)$ and $(C, u, Q, \mu, f)$, if the sampled set $Q'$ is sufficiently large, then by Claim 1 we get that:

$$\left| \mathbb{E}_{q\sim\mu}\bigl[f(P,w,q)\bigr] - \frac{1}{|Q'|}\sum_{q\in Q'} f(P,w,q) \right| \le \varepsilon, \tag{14}$$

and

$$\left| \mathbb{E}_{q\sim\mu}\bigl[f(C,u,q)\bigr] - \frac{1}{|Q'|}\sum_{q\in Q'} f(C,u,q) \right| \le \varepsilon. \tag{15}$$

By the triangle inequality we have that

$$\left| \mathbb{E}_{q\sim\mu}\bigl[f(P,w,q)\bigr] - \mathbb{E}_{q\sim\mu}\bigl[f(C,u,q)\bigr] \right| \le \left| \mathbb{E}_{q\sim\mu}\bigl[f(P,w,q)\bigr] - \frac{1}{|Q'|}\sum_{q\in Q'} f(P,w,q) \right| \tag{16}$$
$$+ \left| \frac{1}{|Q'|}\sum_{q\in Q'} f(P,w,q) - \frac{1}{|Q'|}\sum_{q\in Q'} f(C,u,q) \right| \tag{17}$$
$$+ \left| \frac{1}{|Q'|}\sum_{q\in Q'} f(C,u,q) - \mathbb{E}_{q\sim\mu}\bigl[f(C,u,q)\bigr] \right|. \tag{18}$$

By (14) and (15), and by Assumption 2 (Eq. (2)) on the output $(C, u)$, we have that (16), (17), and (18) are each bounded by $\varepsilon$. Hence,
$$\left| \mathbb{E}_{q\sim\mu}\bigl[f(P,w,q)\bigr] - \mathbb{E}_{q\sim\mu}\bigl[f(C,u,q)\bigr] \right| \le 3\varepsilon. \qquad\blacksquare$$

Appendix D Practical implementation

Input: A finite input set $P$ and its weight function $w$, a finite set of queries $Q'$, a loss function $f$, and an integer (coreset size) $m \geq 1$.

1:  $C :=$ an arbitrary set of $m$ vectors from the ground set $Z$.
2:  $u(c) := 1/m$ for every $c \in C$.
3:  for $i := 1, \ldots, T$ do
4:      for every $q \in Q'$ do
5:          $\ell_P := f(P, w, q)$.   The cost of the query $q$ on $P$.
6:          $\ell_C := f(C, u, q)$.   The cost of the query $q$ on $C$.
7:          $\ell := \frac{|\ell_P - \ell_C|}{\ell_P} + \lambda \left| \sum_{p\in P} w(p) - \sum_{c\in C} u(c) \right|$.   The approximation error that we wish to minimize; $\eta$ below is the learning rate.
8:          $C := C - \eta \nabla_C \ell$.   Update $C$.
9:          $u := u - \eta \nabla_u \ell$.   Update $u$.
10:     end for
11:  end for
12:  return $(C, u)$
Algorithm 2

While the training in Algorithm 2 is formalized as a stochastic process, i.e., sequentially, for every $q \in Q'$ we compute the approximation error for this one query $q$ and then update the learned variables based on this error, it can also be implemented using a minibatch $B \subseteq Q'$ of several queries. Here, the approximation error with respect to the current batch is $\frac{1}{|B|}\sum_{q\in B}\frac{|f(P,w,q)-f(C,u,q)|}{f(P,w,q)} + \lambda\left|\sum_{p\in P}w(p)-\sum_{c\in C}u(c)\right|$, and the learned variables are updated based on this error.
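A minimal PyTorch sketch (ours; the names and the linear-regression loss are illustrative) of this minibatch variant of Algorithm 2:

```python
import torch

def learn_coreset_avg_error(X, y, w, queries, m, steps=2000, lam=1.0, lr=1e-2, batch=32):
    """Minibatch Algorithm 2 sketch for f((x, y), q) = (x.q - y)^2.

    X: (n, d) inputs, y: (n,) targets, w: (n,) weights, queries: (k, d) tensor of queries.
    """
    d = X.shape[1]
    C_x = torch.randn(m, d, requires_grad=True)
    C_y = torch.randn(m, requires_grad=True)
    u = torch.full((m,), 1.0 / m, requires_grad=True)
    opt = torch.optim.Adam([C_x, C_y, u], lr=lr)

    def costs(points, labels, weights, qs):
        # f(., ., q) for every query q in the minibatch qs.
        residuals = points @ qs.T - labels[:, None]
        return (weights[:, None] * residuals ** 2).sum(0)

    for _ in range(steps):
        opt.zero_grad()
        qs = queries[torch.randint(len(queries), (batch,))]   # sample a query minibatch
        full = costs(X, y, w, qs).detach()
        core = costs(C_x, C_y, u, qs)
        rel_err = ((full - core).abs() / full).mean()          # average relative error
        loss = rel_err + lam * (u.sum() - w.sum()).abs()       # plus the weight-fitting term
        loss.backward()
        opt.step()
    return C_x.detach(), C_y.detach(), u.detach()
```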

D.1 Proof of Equation 6

For the constant $\varepsilon$ from Definition 5, let $M' := \max_{q\in Q'} f(P, w, q)$, and let $\varepsilon' := \varepsilon / M'$. We show that

$$\left| \frac{1}{|Q'|}\sum_{q\in Q'} f(P,w,q) - \frac{1}{|Q'|}\sum_{q\in Q'} f(C,u,q) \right| \le \varepsilon. \tag{19}$$

Proof.

$$\left| \frac{1}{|Q'|}\sum_{q\in Q'} f(P,w,q) - \frac{1}{|Q'|}\sum_{q\in Q'} f(C,u,q) \right| \le \frac{1}{|Q'|}\sum_{q\in Q'} \bigl| f(P,w,q) - f(C,u,q) \bigr| = \frac{1}{|Q'|}\sum_{q\in Q'} f(P,w,q)\, \frac{\bigl| f(P,w,q) - f(C,u,q) \bigr|}{f(P,w,q)} \le \frac{M'}{|Q'|}\sum_{q\in Q'} \frac{\bigl| f(P,w,q) - f(C,u,q) \bigr|}{f(P,w,q)} \le M' \varepsilon' = \varepsilon, \tag{20}$$

where the first derivation holds since $\bigl|\sum_i a_i\bigr| \le \sum_i |a_i|$ for any set of numbers $a_1, a_2, \ldots$; the following derivation holds since $f(P,w,q) > 0$ for every $q \in Q'$; the one after holds by the definition of $M'$, i.e., since $f(P,w,q) \le M'$ for every $q \in Q'$; and the last derivation holds by (5). ∎

Alaa Maalouf received his B.Sc. and M.Sc. in Computer Science at the University of Haifa, Israel, in 2016 and 2019 respectively, and is now a Ph.D. student under the supervision of Prof. Dan Feldman. His main research interests focus on Machine/Deep Learning, Robotics, Computational Geometry and Coresets (data summarization) for Big Data.

Gilad Eini received his B.Sc. in Computer Science at the University of Haifa, Israel, in 2017, and is on the verge of finishing his M.Sc. under the supervision of Prof. Dan Feldman. His main research interests focus on Machine/Deep Learning, Computer vision and Coresets (data summarization) for Big Data.

Ben Mussay received the B.Sc. and M.Sc. degrees in computer science from the University of Haifa, Israel, in 2019 and 2020, respectively. His research interests are sublinear algorithms and deep learning.

Dan Feldman is an associate professor and the head of the Robotics and Big Data Lab at the University of Haifa, after returning from a three-year post-doc at Caltech and MIT. During his PhD at Tel-Aviv University he developed data reduction techniques known as core-sets, based on computational geometry. Since his post-docs, Dan's coresets have been applied to central problems in machine learning, big data, computer vision, EEG, and robotics. His group in Haifa continues to design and implement core-sets with provable guarantees for real-time systems.

Margarita Osadchy is an Associate Professor in the Department of Computer Science at the University of Haifa. She is a member of the Data Science Research Center and a member of the scientific committee of the Center for Cyber Law and Policy at the University of Haifa. She received the PhD degree with honors in computer science from the University of Haifa, Israel. She was a visiting research scientist at the NEC Research Institute and then a postdoctoral fellow in the Department of Computer Science at the Technion. Her main research interests are deep learning, machine learning, computer vision, and computer security and privacy.