Wasserstein Coresets for Lipschitz Costs

05/18/2018
by   Sebastian Claici, et al.
MIT

Sparsification is becoming more and more relevant with the proliferation of huge data sets. Coresets are a principled way to construct representative weighted subsets of a data set that have matching performance with the full data set for specific problems. However, coreset language neglects the nature of the underlying data distribution, which is often continuous. In this paper, we address this oversight by introducing a notion of measure coresets that generalizes coreset language to arbitrary probability measures. Our definition reveals a surprising connection to optimal transport theory which we leverage to design a coreset for problems with Lipschitz costs. We validate our construction on support vector machine (SVM) training, k-means clustering, k-median clustering, and linear regression and show that we are competitive with previous coreset constructions.


1 Introduction

Data sets with hundreds of millions of examples are becoming the norm in machine learning, whether for Bayesian inference, clustering, or regression. The complexity of algorithms for these tasks typically scales with the size of the data set, making it difficult to employ the entire input effectively. Several techniques attempt to overcome this challenge for scalable machine learning, from streaming optimization to subsampling; these often reduce computational complexity on large data sets but are accompanied by weak theoretical guarantees.

Originally proposed for computational geometry (Agarwal et al., 2005), coresets have recently been introduced as a principled means of reducing input sizes for machine learning algorithms. Intuitively, a coreset of a data set is a “representative” subsample on which a given machine learning algorithm is guaranteed to produce similar output. Coresets have been applied successfully to learning tasks including clustering (Bachem et al., 2018), classification via SVMs (Tsang et al., 2005), neural network compression (Baykal et al., 2018), and Bayesian inference (Huggins et al., 2016).

Coreset computation is typically posed as a discrete problem: Given a fixed data set and learning algorithm, how can we construct a smaller data set on which the learning algorithm will have similar performance? This formulation is compatible with descriptions of coresets in computational geometry but neglects a key theme in machine learning: a data set is nothing more than an empirical sample from an underlying data distribution, and the latter is the key object describing a learning task. That is, we typically do not need a coreset for a specific data set but rather for the distribution from which it was drawn.

To address this gap, in this paper we extend the definition of a coreset to (possibly smooth) data distributions, where the classical notion of a coreset is recovered when the distribution is composed of a finite set of deltas (an empirical distribution). Beyond broadening the definition, we show that a sufficient condition for extracting a coreset can be written in the language of optimal transport. For Lipschitz cost functions, our approach defines a new framework for coreset analysis and construction.

We apply our constructions to extract coresets for classification and clustering, with performance competitive with that of previous techniques and broader generality. We rely on a recently proposed stochastic algorithm for approximating distributions in the Wasserstein metric (Claici et al., 2018). In brief, we construct a discrete distribution supported on a small number of points that minimizes distance to the original distribution in the Wasserstein metric. We can show that several methods for classification and clustering have costs that are upper-bounded by the Wasserstein distance between the learned and true data distributions. This leads to fairly compact coresets that perform well in practice.

Contributions.

We give a practical, parallelizable algorithm for coreset construction for a variety of problems, backed by an interpretation of coresets in continuous probabilistic language. Our construction reveals a surprising tie-in with optimal transport theory. Our coresets are among the few in machine learning that do not rely on importance sampling. We prove bounds on the size of the coreset that are independent of data set size and generalize to any machine learning problem whose cost is separable and Lipschitz. Finally, we compare with the state of the art on SVM binary classification, linear regression, k-means clustering, and k-median clustering, and show competitive performance.

1.1 Related work

Our work lies at the intersection of two seemingly unrelated lines of research, joining the probabilistic language of optimal transport research with the discrete setting of data compression via coresets.

Coresets.

Initially introduced for problems in computational geometry (Agarwal et al., 2005), coresets have found their way to machine learning research via importance sampling (Langberg & Schulman, 2010). Coreset applications are varied, and generic frameworks exist for constructing them for almost any problem (Feldman & Langberg, 2011). Among the most relevant and recent applications are k-means and k-median clustering (Har-Peled & Mazumdar, 2004; Arthur & Vassilvitskii, 2007; Feldman et al., 2013; Bachem et al., 2018), Bayesian inference (Campbell & Broderick, 2018; Huggins et al., 2016), support vector machine training (Tsang et al., 2005), and neural network compression (Baykal et al., 2018). While coreset language is discrete, the sensitivity-based approach currently used by importance sampling coresets was originally introduced in a continuous setting, for approximating expectations of functions under measures that are absolutely continuous with respect to the Lebesgue measure (Langberg & Schulman, 2010). For more information, see the introduction by Bachem et al. (2018) and the survey paper by Munteanu & Schwiegelshohn (2018).

Optimal Transport is a relative newcomer to machine learning. Sparked by advances in entropically-regularized transport (Cuturi, 2013), an influx of applications of optimal transport to machine learning has appeared, from supervised learning (Schmitz et al., 2017; Carrière et al., 2017), to Bayesian inference (Staib et al., 2017; Srivastava et al., 2015) and neural network training (Arjovsky et al., 2017; Montavon et al., 2016; Genevay et al., 2018). Details can be found in several recent surveys (Solomon, 2018; Peyré & Cuturi, 2018; Lévy & Schwindt, 2017).

Our approach is inspired by semi-discrete methods that compute transport from a continuous measure to a discrete one by leveraging power diagrams (Aurenhammer, 1987). Efficient algorithms that use computational geometry tools to perform gradient iterations to solve the Kantorovich dual problem have been introduced for 2D (Mérigot, 2011) and 3D (Lévy, 2015). Closer to our method are the algorithms by De Goes et al. (2012) and Claici et al. (2018) that solve a non-convex problem for the support of a discrete uniform measure that minimizes transport cost to an input image (De Goes et al., 2012) or the barycenter of the input distributions (Claici et al., 2018).

2 Optimal Transport and Coresets

2.1 Optimal Transport

Optimal transport measures distances between probability distributions in a geometric fashion. Given a metric space (𝒳, d) and probability measures μ, ν on 𝒳, we define the p-Wasserstein (transport) cost

W_p^p(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{X}} d(x, y)^p \, \mathrm{d}\pi(x, y),

where Π(μ, ν) is the set of measure couplings between μ and ν:

\Pi(\mu, \nu) = \left\{ \pi \in \mathcal{P}(\mathcal{X} \times \mathcal{X}) \ :\ \pi(A \times \mathcal{X}) = \mu(A),\ \pi(\mathcal{X} \times B) = \nu(B) \right\}.
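To make this concrete, when both measures are discrete the coupling problem can be solved directly as a small linear program. The sketch below is our own illustration rather than code from the paper; it assumes scipy is available and that the two measures are given as weighted point clouds.

import numpy as np
from scipy.optimize import linprog

def wasserstein_p(x, y, mu_w, nu_w, p=2):
    # p-Wasserstein cost between sum_i mu_w[i] * delta_{x[i]} and
    # sum_j nu_w[j] * delta_{y[j]}, via the coupling linear program over pi >= 0.
    n, m = len(x), len(y)
    C = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1) ** p   # d(x_i, y_j)^p
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0     # row marginals: pi(A x X) = mu(A)
    for j in range(m):
        A_eq[n + j, j::m] = 1.0              # column marginals: pi(X x B) = nu(B)
    b_eq = np.concatenate([mu_w, nu_w])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun ** (1.0 / p)              # W_p = (optimal transport cost)^(1/p)

# two small point clouds in the plane with uniform weights
x = np.random.randn(5, 2)
y = np.random.randn(4, 2) + 1.0
print(wasserstein_p(x, y, np.full(5, 0.2), np.full(4, 0.25), p=2))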

2.2 Coresets

Let F be a family of functions from the data set X to ℝ≥0, and let C be a subset of X with weighting function w : C → ℝ≥0. We define the cost of C for a given f ∈ F as

\mathrm{cost}(C, f) = \sum_{x \in C} w(x)\, f(x), \qquad (1)

where cost(X, f) denotes the same quantity computed with unit weights.

A coreset is a weighted subset of the initial data that yields similar costs to the full data set at a fraction of the size. Formally, the pair (C, w) is an ε-coreset for the cost function (1) and the function family F if C ⊆ X and

|\mathrm{cost}(X, f) - \mathrm{cost}(C, f)| \le \varepsilon \, \mathrm{cost}(X, f) \quad \text{for all } f \in \mathcal{F}.

A coreset always exists, as we can take C = X and w ≡ 1 to satisfy the inequality.

Typically, coresets are problem-specific, for example applied to k-means clustering or SVM classification. In such cases, the function family (or query set) F reflects the cost of the original problem. For example, for k-means clustering the family is the set of functions of the form f_Q(x) = min_{q ∈ Q} ‖x − q‖², parameterized by a set Q of k points in ℝ^d.
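As a small worked example of the discrete definition (our own illustration, with all variable names ours), the snippet below evaluates the cost (1) for a k-means-style query on a full data set and on a candidate weighted subset, and prints the relative error appearing in the coreset condition.

import numpy as np

def kmeans_query_cost(points, weights, centers):
    # cost(C, f_Q) = sum_x w(x) * min_q ||x - q||^2, i.e. eq. (1) with the k-means query family
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return float(weights @ d2.min(axis=1))

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))                      # full data set, unit weights
idx = rng.choice(len(X), size=50, replace=False)    # a candidate coreset
C, w = X[idx], np.full(50, len(X) / 50.0)           # weights rescale to the full mass

Q = rng.normal(size=(3, 2))                         # a random k-means query
full = kmeans_query_cost(X, np.ones(len(X)), Q)
core = kmeans_query_cost(C, w, Q)
print(abs(full - core) / full)                      # relative coreset error for this query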

3 Wasserstein Coresets

3.1 Measure Coresets

Instead of using discrete language, we define a measure coreset as a (not necessarily discrete) measure that yields a good approximation to the data distribution under a given family of functions. Given an input measure μ and a family of functions F, we define a cost in analogy to the discrete case (1) as

\mathrm{cost}(\nu, f) = \int_{\mathcal{X}} f(x) \, \mathrm{d}\nu(x), \qquad (2)

where ν is no longer required to be supported on a discrete set. With this in mind, we define measure coresets: [Measure Coreset] We call ν an ε-measure coreset for μ if ν is absolutely continuous with respect to μ and, for all f ∈ F,

|\mathrm{cost}(\mu, f) - \mathrm{cost}(\nu, f)| \le \varepsilon \, \mathrm{cost}(\mu, f). \qquad (3)

Note that such a ν always exists, since ν = μ satisfies the inequality. Definition 2 generalizes Definition 1; the latter can be recovered from the former by taking μ to be the empirical distribution of the data set.

Proof.

When μ is the empirical distribution of a data set X, the absolute continuity constraint ν ≪ μ restricts the support of ν to a subset C ⊆ X. Taking w(x) proportional to ν({x}) for each x ∈ C recovers the discrete cost (1), and (3) becomes the discrete coreset condition, which completes the proof. ∎

For measure coresets to be practical, we restrict ν to discrete measures of the form ν = (1/n) Σᵢ δ_{x_i}. We will henceforth use the absolute error form of (3):

|\mathrm{cost}(\mu, f) - \mathrm{cost}(\nu, f)| \le \varepsilon.

Coreset constructions typically use the relative error. However, our construction below yields an absolute error ε-coreset. While not strictly equivalent, absolute error constructions provide a stronger guarantee when the cost of the full measure is at least 1, and a weaker guarantee otherwise. To be consistent with prior work, we report relative error in our empirical results. From here on we will use the term ε-coreset to mean absolute error ε-measure coreset.

3.2 Connection to Optimal Transport

To unravel the connection to optimal transport, we start from the Kantorovich dual problem for the transport cost:

W_p^p(\mu, \nu) = \sup_{\varphi \in C_b(\mathcal{X})} \int_{\mathcal{X}} \varphi \, \mathrm{d}\mu + \int_{\mathcal{X}} \varphi^c \, \mathrm{d}\nu, \qquad (4)

where C_b(𝒳) is the space of all bounded continuous functions on 𝒳, and φ^c(y) = inf_x d(x, y)^p − φ(x) is the c-transform of φ.

It is not hard to show that in the case p = 1 the constraint restricts the set of admissible functions to the set of 1-Lipschitz functions on 𝒳 and that the c-transform of φ is −φ (Santambrogio, 2015; Villani, 2009). We can rewrite (4) as

W_1(\mu, \nu) = \sup_{\varphi \in \mathrm{Lip}_1(\mathcal{X})} \int_{\mathcal{X}} \varphi \, \mathrm{d}(\mu - \nu). \qquad (5)

From definition (2), we can rewrite the coreset condition as

\left| \int_{\mathcal{X}} f \, \mathrm{d}\mu - \int_{\mathcal{X}} f \, \mathrm{d}\nu \right| \le \varepsilon

for all functions f ∈ F. If F ⊆ Lip_1(𝒳), by (5) we obtain a strict upper bound on the coreset error via the Wasserstein cost:

\sup_{f \in \mathcal{F}} \left| \int_{\mathcal{X}} f \, \mathrm{d}(\mu - \nu) \right| \le W_1(\mu, \nu).

Hence, we have [Wasserstein Coresets] When F ⊆ Lip_1(𝒳), a sufficient condition for ν to be an ε-coreset for μ and F is W_1(μ, ν) ≤ ε.

Thus, a strategy for constructing an n-point coreset for a measure μ and a given family of 1-Lipschitz functions is to solve for ν in

\min_{\nu} \; W_1(\mu, \nu) \quad \text{subject to} \quad \nu = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}, \quad x_i \in \mathcal{X}. \qquad (6)

Relatively few algorithms have been proposed to solve (6). Instead, we replace W_1 in (6) with the 2-Wasserstein cost W_2, justified by the inequality W_1(μ, ν) ≤ W_2(μ, ν).
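The duality bound above is easy to check numerically in one dimension, where scipy evaluates W_1 between empirical measures exactly. The sketch below is our own sanity check, not part of the paper's experiments.

import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
mu_support = rng.normal(size=5000)                 # samples of mu (uniform weights)
nu_support = rng.normal(size=20)                   # a small candidate coreset

f = lambda t: np.abs(t - 0.3)                      # an arbitrary 1-Lipschitz function
err = abs(f(mu_support).mean() - f(nu_support).mean())
w1 = wasserstein_distance(mu_support, nu_support)  # W_1 between the two empirical measures
print(err, "<=", w1)                               # the integration error is bounded by W_1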

3.3 Coreset Construction

We use a recent algorithm for constructing the Wasserstein barycenter of a set of input measures proposed by Claici et al. (2018). We start from the semi-discrete dual problem of W_2²(μ, ν). The objective is a function of the support points {x_i} and the dual potentials (weights) {w_i}:

\mathcal{F}(\{x_i\}, \{w_i\}) = \sum_{i=1}^{n} \int_{\mathrm{Pow}_w(x_i)} \left( \|x - x_i\|^2 - w_i \right) \mathrm{d}\mu(x) + \frac{1}{n} \sum_{i=1}^{n} w_i, \qquad (7)

where Pow_w(x_i) = { x ∈ 𝒳 : ‖x − x_i‖² − w_i ≤ ‖x − x_j‖² − w_j for all j } is the power cell of point x_i. Maximizing (7) over the weight vector yields the transport cost for fixed point positions (Mérigot, 2011).

Problem (7) is concave in the weights and strictly concave modulo constant shifts. However, it is highly non-convex with respect to the point positions. Nonetheless, taking derivatives of (7) with respect to the weights and points reveals a simple optimization strategy based on alternating gradient ascent on a concave problem with a fixed point iteration for the point positions.

Explicitly, we compute

\frac{\partial \mathcal{F}}{\partial w_i} = \frac{1}{n} - \mu\!\left(\mathrm{Pow}_w(x_i)\right), \qquad x_i \leftarrow \frac{1}{\mu\!\left(\mathrm{Pow}_w(x_i)\right)} \int_{\mathrm{Pow}_w(x_i)} x \, \mathrm{d}\mu(x), \qquad (8)

where the first expression is the gradient used in the ascent step and the second is the fixed point update for the point positions.

The complete algorithm is given as Algorithm 1. We use a backtracking search for the step size selection in line 4. As opposed to Claici et al. (2018), we can compute gradients exactly for empirical measures, and thus we do not typically have to rely on an accelerated gradient method.

1: {x_i}_{i=1}^n ← random draw of n samples from μ
2: for t = 1, …, T do
3:     while ‖∇_w F‖ > η do
4:         γ ← StepSize(w, ∇_w F)
5:         w ← w + γ ∇_w F
6:     end while
7:     for i = 1, …, n do
8:         x_i ← (1 / μ(Pow_w(x_i))) ∫_{Pow_w(x_i)} x dμ(x)
9:     end for
10: end for
Algorithm 1 Construct a coreset for μ with n points.

The crux of the algorithm is in computing the derivatives in (8). We detail how to compute the derivatives for specific cases in section 4.2.
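For an empirical input measure, the power-cell integrals in (8) reduce to sums over the data points assigned to each cell, so Algorithm 1 can be prototyped in a few lines. The sketch below is our own simplification of the procedure: it uses a fixed step size in place of the backtracking search of line 4, assumes the input is an (N, d) array with uniform weights, and is not the authors' reference implementation.

import numpy as np

def wasserstein_coreset(data, n, outer_iters=50, inner_iters=200, lr=0.5, tol=1e-4):
    # Approximate Algorithm 1 for mu = (1/N) sum_i delta_{data[i]}.
    # Returns n coreset points, each carrying uniform weight 1/n.
    N = len(data)
    rng = np.random.default_rng(0)
    x = data[rng.choice(N, size=n, replace=False)].astype(float)   # line 1
    w = np.zeros(n)                                                # dual potentials
    for _ in range(outer_iters):                                   # line 2
        for _ in range(inner_iters):                               # lines 3-6: ascent in w
            d2 = ((data[:, None, :] - x[None, :, :]) ** 2).sum(-1) - w[None, :]
            cell = d2.argmin(axis=1)                               # power-cell assignment
            mass = np.bincount(cell, minlength=n) / N              # mu(Pow_w(x_i))
            grad = 1.0 / n - mass                                  # eq. (8), first part
            if np.abs(grad).max() < tol:
                break
            w += lr * grad
        for i in range(n):                                         # lines 7-9: fixed point update
            members = data[cell == i]
            if len(members) > 0:
                x[i] = members.mean(axis=0)                        # barycenter of the power cell
    return x, np.full(n, 1.0 / n)

A call such as points, weights = wasserstein_coreset(np.asarray(X), 100) then returns a 100-point, uniform-weight approximation of the empirical distribution of X.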

4 Analysis

In what follows, we give bounds on the size and construction time of our coreset. These bounds incorporate recent results in approximations of measures by discrete distributions under the Wasserstein metric. In several cases, we match the best known results for deterministic discrete coresets, and empirical results show that our coreset frequently outperforms even randomized constructions.

4.1 Coreset Size

We use the following theorem: [Metric convergence, (Kloeckner, 2012; Brancolini et al., 2009; Weed & Bach, 2017)] Suppose μ is a compactly supported measure in ℝ^d and ν_n is a uniform measure supported on n points that minimizes W_2(μ, ν_n). Then W_2(μ, ν_n) = O(n^{-1/d}). Moreover, if μ is supported on a lower dimensional subspace of ℝ^d (that is, the support of μ has Hausdorff dimension d' < d), then W_2(μ, ν_n) = O(n^{-1/d'}). Our coresets are based on a W_2 approximation. Let ν_n minimize W_2(μ, ν_n). By Theorem 4.1 and the inequality between the metrics, we have the following:

W_1(\mu, \nu_n) \le W_2(\mu, \nu_n) = O\!\left(n^{-1/d}\right).

If we choose n large enough that W_2(μ, ν_n) ≤ ε, then it also holds that W_1(μ, ν_n) ≤ ε, and thus ν_n is an ε-coreset for μ.

Since W_2(μ, ν_n) = O(n^{-1/d}), if we have a globally optimal solution for (6), then the resulting coreset has size O(ε^{-d}). While we cannot guarantee this bound in practice, since the approach of Claici et al. (2018) does not guarantee global optimality, empirically we observe that this bound holds and in fact is an overestimate of the size of the coreset.
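Spelled out (a short calculation we add for clarity; c denotes the constant hidden in the O(·) bound above):

W_2(\mu, \nu_n) \le c\, n^{-1/d} \le \varepsilon
\quad\Longleftrightarrow\quad
n \ge \left( \frac{c}{\varepsilon} \right)^{d},
\qquad \text{i.e. } n = O\!\left(\varepsilon^{-d}\right).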

The size of the coreset is predictably independent of the size of the initial data set, since our analysis applies to arbitrary measures. Furthermore, the size is independent of additional variables in the underlying problem, e.g. the number of medians in k-median clustering.

This improves over the best known deterministic coreset sizes for k-means and k-median (Har-Peled & Mazumdar, 2004); however, we must be careful, as our coreset bounds are given in absolute error. For k-means and k-median we are typically in the regime where the full data set has large cost, but if that does not hold, the coresets are no longer comparable. Better randomized construction algorithms exist for both k-means/k-median and SVM, with sizes that do not have such a strong dependence on dimension. Empirically, our coresets are competitive even with randomized construction algorithms.

The previous bounds hold only if we can solve (6) to optimality. The objective is non-convex with respect to the point positions {x_i}, and problem (7) generalizes k-means when μ is an empirical measure. Hence, solving it to optimality is NP-hard. The following weak guarantee, however, follows by definition of the Wasserstein distance:

[Weak Guarantee on Coreset Size] If for fixed point positions {x_i} and optimal weights w it holds that W_2(μ, ν) ≤ ε, then ν is an absolute error ε-coreset for μ.

4.2 Construction Time

Construction time depends strongly on the characteristics of the measure we are approximating. The large majority of the time is spent evaluating the expectations in (8). Since we run the gradient ascent until the gradient norm falls below a tolerance η and perform T fixed point iterations, the construction requires one call per iteration to an oracle that computes the integrals in (8) for each power cell region. Since line 8 implements a medoid update step similar to the k-means algorithm, we cannot rely on convergence criteria for the outer loop of Algorithm 1; instead we set a maximum number of outer iterations T. In practice, we observe convergence to a fixed point within a small number of iterations.

If μ is a discrete distribution, then the two expectations can be computed in closed form using a nearest neighbor data structure over the support points. To our knowledge, the only other scenario where we can compute (8) in closed form is when μ is piecewise uniform on finitely many closed and connected regions. The computation then reduces to a convex body volume computation, and the time is dominated by the construction of the power cell decomposition (Aurenhammer, 1987).

Figure 1: Classification and regression. (a) Evaluation of the SVM coreset against a uniform sample and the Core Vector Machine approach of Tsang et al. (2005); our coreset performs well even with only a fraction of the data. (b) Comparison with SVD and a uniform sample on a synthetic data set; our coreset accurately recovers the principal components with only a fraction of the data.

Figure 2: Evaluation of the k-median coreset against the construction from Barger & Feldman (2016), a sensitivity approach, and uniform sampling. (a) Synthetic data set drawn from 10 multivariate normal distributions. (b) Pendigit data set.

Figure 3: Evaluation of the k-means coreset against the construction of Barger & Feldman (2016), a sensitivity approach, and uniform sampling. (a, b) As in Figure 2. Despite the cost not being Lipschitz, our performance is comparable to that of Barger & Feldman (2016).

5 Experiments

Several machine learning problems have Lipschitz cost functions. We test our construction on a representative set of them: k-median and k-means clustering, SVM binary classification, and linear regression.

5.1 Support Vector Machine Training

The hinge loss for a support vector machine is given by

L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \max\!\left(0,\ 1 - y_i\, \theta^\top x_i\right), \qquad (9)

where (x_i, y_i) are the data points and labels, and θ is a separating hyperplane. The goal is to find a hyperplane θ that linearly separates positive (y_i = +1) from negative (y_i = −1) examples.

To translate this problem into probabilistic language, define μ_+ = (1/Z_+) Σ_{i : y_i = +1} δ_{x_i} and μ_− = (1/Z_−) Σ_{i : y_i = −1} δ_{x_i}, where Z_+ and Z_− are appropriate normalization constants. We can parameterize the loss by the hyperplane θ, and define a measure theoretic version of (9):

\mathrm{cost}(\theta) = \int \max\!\left(0,\ 1 - \theta^\top x\right) \mathrm{d}\mu_+(x) + \int \max\!\left(0,\ 1 + \theta^\top x\right) \mathrm{d}\mu_-(x), \qquad (10)

where we have split the cost over the positive and negative measures. To move from a parameterization over vectors to a parameterization over functions, define the function families F_± consisting of the functions

f_\theta^{\pm}(x) = \max\!\left(0,\ 1 \mp \theta^\top x\right).

The functions in F_± are Lipschitz with constant ‖θ‖. We thus incur an error of ε‖θ‖, which has to be accounted for in the coreset construction. Since the function class is Lipschitz, we are guaranteed the coreset properties in section 4.1, and can construct a coreset by running Algorithm 1 for μ_+ and μ_− separately.

To verify our algorithm, we test on the credit card data set of Yeh & Lien (2009), consisting of integer and categorical features. We compare with uniform samples from the empirical distribution, and the Core Vector Machine approach described in (Tsang et al., 2005). Our coreset significantly outperforms both approaches, especially in the regime where the coreset size is small (Figure 1a).
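Concretely, the evaluation in this experiment amounts to training on the weighted coreset and comparing the hinge loss (9) on the full data against the loss estimated from the coreset. The sketch below is our own illustration of that protocol using scikit-learn; a uniform random subsample stands in for the coreset points that Algorithm 1 would produce, and the data is synthetic.

import numpy as np
from sklearn.svm import LinearSVC

def hinge_loss(theta, b, X, y, weights):
    # weighted, normalized version of eq. (9)
    margins = np.maximum(0.0, 1.0 - y * (X @ theta + b))
    return float(weights @ margins) / weights.sum()

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, (2000, 5)), rng.normal(+1, 1, (2000, 5))])
y = np.concatenate([-np.ones(2000), np.ones(2000)])

idx = rng.choice(len(X), size=100, replace=False)   # stand-in for a coreset
Xc, yc, wc = X[idx], y[idx], np.full(100, len(X) / 100.0)

clf = LinearSVC().fit(Xc, yc, sample_weight=wc)     # train on the weighted coreset
theta, b = clf.coef_.ravel(), clf.intercept_[0]
full = hinge_loss(theta, b, X, y, np.ones(len(X)))  # loss on the full data
core = hinge_loss(theta, b, Xc, yc, wc)             # loss estimated from the coreset
print(abs(full - core) / max(full, 1e-12))          # relative coreset error for this theta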

5.2 Linear Regression

The least squares cost for a linear estimator θ given an origin-centered data set {(x_i, y_i)}_{i=1}^N is

L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \theta^\top x_i \right)^2. \qquad (11)

The least squares error is not Lipschitz. Instead, we use the absolute error form of (11). Applying a similar argument to that for k-medoids, we can define a cost for measures,

\mathrm{cost}(\mu, \theta) = \int \left| y - \theta^\top x \right| \mathrm{d}\mu(x, y),

with function family F = { (x, y) ↦ |y − θᵀx| : θ ∈ ℝ^d }. Since linear regression is efficient even for large data sets in high dimensions, we compare just with the SVD decomposition and a uniform sample of the data (Figure 1b). Coresets for linear regression do exist (Boutsidis et al., 2013), but they essentially compute a low-rank SVD approximation and hence would closely track the performance of SVD.
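In the same spirit, the regression comparison can be sketched as follows (our own illustration; a uniform subsample stands in for the coreset, and the row scaling is a no-op when the weights are equal but is kept for generality).

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(5000, 3))
theta_true = np.array([1.5, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=5000)

idx = rng.choice(len(X), size=60, replace=False)     # stand-in for a coreset
Xc, yc, wc = X[idx], y[idx], np.full(60, len(X) / 60.0)

theta_full = np.linalg.lstsq(X, y, rcond=None)[0]
# weighted least squares on the coreset: scale rows by the square root of the weights
sw = np.sqrt(wc)
theta_core = np.linalg.lstsq(Xc * sw[:, None], yc * sw, rcond=None)[0]
print(np.linalg.norm(theta_full - theta_core))       # the two estimators should be close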

5.3 k-Means and k-Median

The cost of k-medoids for a given set of centers Q = {q_1, …, q_k} is given by

\mathrm{cost}(X, Q) = \sum_{x \in X} \min_{q \in Q} \|x - q\|^p, \qquad (12)

where p = 1 for k-median and p = 2 for k-means. We convert this to measure theoretic language by replacing the discrete set X with a measure μ:

\mathrm{cost}(\mu, Q) = \int_{\mathcal{X}} \min_{q \in Q} \|x - q\|^p \, \mathrm{d}\mu(x).

To translate into the measure coreset language of (3), we define the function families

\mathcal{F}_p = \left\{ x \mapsto \min_{q \in Q} \|x - q\|^p \ :\ Q \subset \mathbb{R}^d,\ |Q| = k \right\}.

The cost for k-median is 1-Lipschitz, while the one for k-means is not Lipschitz. Recall from (7), however, that we are solving for a minimizer with respect to the W_2 transport cost. By setting the weights to zero and taking μ to be a discrete distribution in (7), we recover

\sum_{i=1}^{n} \int_{\mathrm{Vor}(x_i)} \|x - x_i\|^2 \, \mathrm{d}\mu(x),

which is the k-means cost for centers {x_i}. As (7) generalizes the k-means problem, we expect our approximation to yield an ε-coreset even though the cost is not Lipschitz.

We test our algorithm on two data sets and report results for both k-means and k-median. The first, shown in Figures 2(a) and 3(a), is a synthetic data set drawn from 10 multivariate normal distributions. The second, shown in Figures 2(b) and 3(b), is the Pendigit data set of handwritten digits (Alimoglu & Alpaydin, 1997).

We compare against a uniform sample, a sensitivity approach that uses a bicriteria approximation and k-means++ seeding to bound the sensitivity (Feldman & Langberg, 2011), and a deterministic algorithm designed for the streaming setting (Barger & Feldman, 2016). Our algorithm outperforms competing methods for both k-means and k-median clustering (Figures 2 and 3).
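The clustering evaluation follows the same pattern as the SVM case: fit centers on the weighted coreset, then compare the k-means cost (12) on the full data set with the cost estimated from the coreset. The sketch below is our own illustration using scikit-learn, with a uniform subsample standing in for the coreset produced by Algorithm 1 and synthetic data mimicking the setup above.

import numpy as np
from sklearn.cluster import KMeans

def kmeans_cost(points, weights, centers, p=2):
    # eq. (12) with weights: sum_x w(x) * min_q ||x - q||^p
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
    return float(weights @ (d.min(axis=1) ** p))

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(c, 0.5, (500, 2)) for c in rng.uniform(-5, 5, (10, 2))])

idx = rng.choice(len(X), size=80, replace=False)      # stand-in for a coreset
C, w = X[idx], np.full(80, len(X) / 80.0)

centers = KMeans(n_clusters=10, n_init=10).fit(C, sample_weight=w).cluster_centers_
full = kmeans_cost(X, np.ones(len(X)), centers)       # cost on the full data
core = kmeans_cost(C, w, centers)                     # cost estimated from the coreset
print(abs(full - core) / full)                        # relative error, as reported in Figures 2-3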

6 Discussion

We have introduced measure coresets, an extension of discrete coresets to probability measures. Our construction reveals a surprising connection to optimal transport theory which leads to a simple construction algorithm that applies to any problem with Lipschitz cost.

The behavior of our algorithm reveals several avenues for future research. We highlight specifically the discrepancy between the theoretical guarantees of our algorithm and its empirical performance. This gap is large, and its existence indicates that stronger bounds can likely be proved about our coreset construction. Specifically, we conjecture that the class of 1-Lipschitz functions is a much too conservative function class, as we solve for a W_2 approximation instead of a W_1 approximation.

More generally, while we have used the Wasserstein metric to find a coreset, our measure coresets are defined independently of the construction presented here. We expect that more efficient algorithms exist for constructing such coresets using either other metrics or different approximation building blocks (Gaussian components instead of discrete point masses, for example).

References