# Determinantal Point Processes for Coresets

When one is faced with a dataset too large to be used all at once, an obvious solution is to retain only part of it. In practice this takes a wide variety of different forms, but among them "coresets" are especially appealing. A coreset is a (small) weighted sample of the original data that comes with a guarantee: that a cost function can be evaluated on the smaller set instead of the larger one, with low relative error. For some classes of problems, and via a careful choice of sampling distribution, iid random sampling has turned to be one of the most successful methods to build coresets efficiently. However, independent samples are sometimes overly redundant, and one could hope that enforcing diversity would lead to better performance. The difficulty lies in proving coreset properties in non-iid samples. We show that the coreset property holds for samples formed with determinantal point processes (DPP). DPPs are interesting because they are a rare example of repulsive point processes with tractable theoretical properties, enabling us to construct general coreset theorems. We apply our results to the k-means problem, and give empirical evidence of the superior performance of DPP samples over state of the art methods.

## Authors

• 20 publications
• 11 publications
• 11 publications
• ### Rejection Sampling for Tempered Levy Processes

We extend the idea of tempering stable Levy processes to tempering more ...
06/02/2018 ∙ by Michael Grabchak, et al. ∙ 0

• ### Data Amplification: A Unified and Competitive Approach to Property Estimation

Estimating properties of discrete distributions is a fundamental problem...
03/29/2019 ∙ by Yi Hao, et al. ∙ 0

• ### Exact sampling of determinantal point processes with sublinear time preprocessing

We study the complexity of sampling from a distribution over all index s...
05/31/2019 ∙ by Michał Dereziński, et al. ∙ 0

• ### Ensemble Kernel Methods, Implicit Regularization and Determinental Point Processes

By using the framework of Determinantal Point Processes (DPPs), some the...
06/24/2020 ∙ by Joachim Schreurs, et al. ∙ 0

• ### Learning the Parameters of Determinantal Point Process Kernels

Determinantal point processes (DPPs) are well-suited for modeling repuls...
02/20/2014 ∙ by Raja Hafiz Affandi, et al. ∙ 0

• ### Diversity Promoting Online Sampling for Streaming Video Summarization

Many applications benefit from sampling algorithms where a small number ...
10/29/2016 ∙ by Rushil Anirudh, et al. ∙ 0

• ### Investigating the locality of neural network training dynamics

A fundamental quest in the theory of deep-learning is to understand the ...
11/01/2021 ∙ by Soham Dan, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1. Introduction

Given a learning task, if an algorithm is too slow on large datasets, one can either speed up the algorithm or reduce the amount of data. The theory of “coresets” gives theoretical guarantees on the latter option. A coreset is a weighted sub-sample of the original data, with the guarantee that for any learning parameter, the task’s cost function estimated on the coreset is equal to the cost computed on the entire dataset up to a controlled relative error.

An elegant consequence of such a property is that one may run learning algorithms solely on the coreset, allowing for a significant decrease in the computational cost while guaranteeing a controlled error on the learning performance. There are many algorithms that produce coresets (see for instance [agarwal_geometric_2005, har-peled_geometric_2011, clarkson_coresets_nodate, langberg_universal_2010]), with some tailored for a specific task (such as -means,

-medians, logistic regression, etc.), and others more generic

[bachem_one-shot_2017]. Also, note that there are coreset results both in the streaming setting and the offline setting: we choose here to focus on the offline setting. The state of the art for many problems consists in tailoring a sampling distribution for the dataset at hand, and then sampling iid from that distribution [chen_coresets_2009, langberg_universal_2010, braverman_new_2016]. However, iid samples are generally inefficient, as an iid process is ignorant of the past, and thus liable to sample similar points repeatedly. An avenue for improvement is to produce samples that are less redundant than what iid sampling produces.

DPPs are known to produce diverse samples, and their theoretical properties are well-known [kulesza_determinantal_2012]. We show here that DPPs can be used to produce diverse samples with the coreset property. Our theorems are quite generic, and assume mostly that the cost functions under study are Lipschitz. We have two main lines of argument: the first is that DPPs do indeed produce coresets, and the second is that DPPs should produce better coresets (than iid methods) if one uses the right marginal kernel to define the DPP. We apply our results to the -means problem, for which the optimal marginal kernel is unfortunately out of reach. We nevertheless argue that a tractable approximation based on a Gaussian kernel and Random Fourier Features can work well in practice and show that it improves over the state of the art on various artificial and real-world datasets.

### 1.1. Related work

Various coreset construction techniques have been proposed in the past. We follow the recent review [munteanu_coresets-methods_2017]

to classify them in four categories:

1. Geometric decompositions [har-peled_coresets_2004, har-peled_smaller_2005, agarwal_geometric_2005, har-peled_geometric_2011]. These methods propose to first discretize the ambient space of the data in a set of cells, snap each data point to its nearest cell in the discretization, and then use these weighted cells to approximate the target tasks. In all these constructions, the minimum number of samples required to guarantee the coreset property depends exponentially in the number of dimensions of the ambient space, making them less useful in high-dimensional problems.

2. Gradient descent [badoiu_optimal_2008, de_la_vega_approximation_2003, kumar_linear-time_2010, clarkson_coresets_nodate]. These methods have been originally designed for the smallest enclosing ball problem (i.e., finding the ball of minimum radius enclosing all datapoints), and have been later generalized to other problems. One of the main drawback of these algorithms in the -means setting for instance is their running time exponentially depending on the number of classes  [kumar_linear-time_2010]. Also, these algorithms provide only so-called weak coresets.

3. Random sampling [chen_coresets_2009, langberg_universal_2010, feldman_unified_2011, braverman_new_2016, bachem_practical_2017]. The state of the art for many different tasks such as -means or

-median is currently via iid random non-uniform sampling. The optimal probability distribution with which to sample datapoints should be set proportional to a quantity known as sensitivity and introduced by Langberg et al.

[langberg_universal_2010], also known as statistical leverage scores in the related field of random linear algebra literature [drineas_lectures_2017]. In practice, it is unpractical to compute all sensitivities: state of the art algorithms rely on bi-criteria approximations to find upper bounds of the sensitivity, and set the probability distribution proportional to this upper bound. More details on these results are provided in Section 2.4.

4. Sketching and projections [phillips_coresets_2016, woodruff_sketching_2014, mahoney_randomized_2011, boutsidis_randomized_2015, boutsidis_improved_2012, cohen_dimensionality_2014, keriven_compressive_2017, clarkson_low_2012]. Another direction of research regarding data reduction that provably keeps the relevant information for a given learning task is via sketches [woodruff_sketching_2014]: compressed mappings (obtained via projections) of the original dataset that are in general easy to update with new or changed data. Sketches are not strictly speaking coresets, and the difference resides in the fact that coresets are subsets of the data, whereas sketches are projections of the data. Note finally that the frontier between the two is permeable and some data summaries may combine both.

### 1.2. Contributions

Our main contribution falls into the random sampling category within which we propose to improve over iid sampling by considering negatively correlated point processes, i.e., point processes for which sampling jointly two similar datapoints is less probable than sampling two very different datapoints. We decide to concentrate on a special type of negative correlation sampling: determinantal point processes, known to provide samples representing the “diversity” of the dataset [kulesza_determinantal_2012]. To the best of our knowledge, we provide the first coreset guarantee using non-iid random sampling. DPPs are parametrized by a matrix called marginal kernel and denoted by , whose diagonal elements encode for the inclusion probabilities of each sample, and non-diagonal elements encode the correlation between samples. We first show that whatever the choice of the non-diagonal elements of

, if the inclusion probabilities of the DPP are set proportional to the sensitivity, then the results are at least as good as the iid case. We further show with a variance argument that DPP sampling, due to the negative correlations encoded by the non-diagonal elements of

, necessarily provides better performances than its iid counterpart with same inclusion probabilities. Technical difficulties in controlling the concentration properties of correlated samples currently keeps us from exactly deriving the minimum coreset size one may hope for using DPPs.

We then apply our theorems to the -means problem where the initial data consists in points in . We discuss the ideal choice of marginal kernel

for DPP sampling in this case. This ideal kernel being prohibitive to compute in practice, we provide a heuristics based on random Fourier features of the Gaussian kernel. This heuristics outputs a coreset sample in

time where is the number of samples of the coreset, to compare to the cost of the current state of the art iid sampling algorithm via bi-criteria approximation. being necessarily larger than to obtain the coreset guarantee, our proposition is computationally heavier, especially as increases. We provide nonetheless empirical evidence showing that this additional cost comes with enhanced performances.

We also provide a side contribution that may be of independent interest: an explicit formula for the sensitivity in the -means case.

Finally, a Python toolbox will soon be available on the authors’ website111Toolbox DPP4Coreset soon available at http://www.gipsa-lab.fr/~nicolas.tremblay/index.php?page=downloads.

### 1.3. Organization of the paper

The paper is organized as follows. Section 2 recalls the background: the considered types of learning problems under consideration, the formal definition of coresets, sensitivities and DPPs. Section 3 presents our main theorems on the performance of DPPs to sample coresets. In Section 4, we show how these theorems are applicable to the -means problem. We provide in Section 5 a discussion on the choice of marginal kernel adapted to the -means problem, and detail our sampling algorithm. Finally, Section 6 presents experiments on artificial as well as real-world datasets comparing the performances of DPP sampling vs. iid sampling. Section 7 concludes the paper. Note that for readability’s sake, we pushed many proofs and some implementation details in the Appendix.

## 2. Background

Let be a set of datapoints. Let be a metric space of parameters, and an element of . We consider cost functions of the form:

 (1) L(X,θ)=∑x∈Xf(x,θ),

where is a non-negative -Lipschitz function () with respect to , i.e., :

 (2) ∀θ∈Θ f(x,θ)⩾0, (3) ∀(θ,θ′)∈Θ2|f(x,θ)−f(x,θ′)|⩽γ dΘ(θ,θ′).

Many classical machine learning cost functions fall under this model:

-means (as will be shown in Section 4),

-median, logistic or linear regression, support-vector machines, etc.

### 2.1. Problem considered

A standard learning task is to minimize the cost over all . We write:

 (4) θopt=argminθ∈ΘL(X,θ), Lopt=L(X,θ% opt) and ⟨f⟩opt=Lopt% N.

In some instances of this problem, e.g., if is very large and/or if is expensive to evaluate and should be computed as rarely as possible, one may rely on sampling strategies to efficiently perform this optimization task.

### 2.2. Coresets

Let be a subset of . To each element , associate a weight . Define the estimated cost as:

 (5) ^L(S,θ)=∑s∈Sω(s)f(s,θ).
###### Definition 2.1 (Coreset).

Let . The weighted subset is a -coreset for if, for any parameter , the estimated cost is equal to the exact cost up to a relative error:

 (6) ∀θ∈Θ∣∣∣^LL−1∣∣∣⩽ϵ.

This is the so-called “strong” coreset definition, as the -approximation is required for all . A weaker version of this definition exists in the literature where the -approximation is only required for . In the following, we focus on theorems providing the strong coreset property.

Let us write the optimal solution computed on the weighted subset . An important consequence of the coreset property is the following:

 (1−ϵ)L(X,θopt)⩽(1−ϵ)L(X,^θopt)⩽^L(S,^θopt)⩽^L(S,θopt)⩽(1+ϵ)L(X,θopt),

i.e.: running an optimization algorithm on the weighted sample will result in a minimal learning cost that is a controlled -approximation of the learning cost one would have obtained by running the same algorithm on the entire dataset . Note that the guarantee is over costs only: the estimated optimal parameters and may be different. Nevertheless, if the cost function is well suited to the problem: either there is one clear global minimum and the estimated parameters will almost coincide; or there are multiple solutions for which the learning cost is similar and selecting one over the other is not an issue.

In terms of computation cost, if the sampling scheme is efficient, is very large and/or is expensive to compute for each datapoint, coresets thus enable a significant gain in computing time.

### 2.3. Sensitivity

Langberg and Schulman [langberg_universal_2010] introduce the notion of sensitivity:

###### Definition 2.2 (Sensitivity).

The sensitivity of a datapoint with respect to is:

 (7) σi=maxθ∈Θf(xi,θ)L(X,θ)∈[0,1].

Also, the total sensitivity is defined as :

 (8) S=N∑i=1σi.

The sensitivity is sometimes called statistical leverage score [drineas_lectures_2017]. It plays a crucial role in the iid random sampling theorems in the coreset literature as well as in the randomized numerical linear algebra literature [mahoney_randomized_2011]. In words, the sensitivity is the worse case contribution of datapoint

in the total cost. Intuitively, the larger it is, the larger its “outlierness”

[lucic_linear-time_2016].

### 2.4. iid importance sampling and state of the art results

In the iid sampling paradigm, the importance sampling estimator of is the following. Say the sample set consists in samples drawn iid with replacement from a probability distribution . Denote by

the random variable counting the number of occurences of

in . One may define , the so-called importance sampling estimator of , as :

 (9) ^L(S,θ)=∑if(xi,θ)ϵimpi.

One can show that , such that

is an unbiased estimator of

:

 (10) E(^L(S,θ))=L(X,θ).

The concentration of around its expected value is controlled by the following state of the art theorem:

###### Theorem 2.3 (Coresets with iid random sampling).

Let be a probability distribution over all datapoints with the probability of sampling and . Draw iid samples with replacement according to . Associate to each sample a weight . The obtained weighted subset is a -coreset with probability provided that:

 (11) m⩾m∗

with

 (12) m∗=O(1ϵ2(maxiσipi)2(d′+log(1/δ))),

where is the pseudo-dimension of (a generalization of the Vapnik-Chervonenkis dimension). The optimal probability distribution minimizing is . In this case, the obtained weighted subset is a -coreset with probability provided that:

 (13) m⩾O(S2ϵ2(d′+log(1/δ))).

For instance, in the -means setting222In the literature [feldman_unified_2011, balcan_distributed_2013], is often taken to be equal to . We nevertheless agree with [bachem_practical_2017] and their discussion in Section 2.6 regarding -means’ pseudo-dimension and thus write , and such that the coreset property is guaranteed with probability provided that:

 (14) m⩾O(k2ϵ2(dklogk+log(1/δ))).

This theorem is taken from [bachem_practical_2017] but its original form goes back to [langberg_universal_2010]. Note that sensitivities cannot be computed rapidly, such that, as it is, this theorem is unpractical. Thankfully, bi-criteria approximation schemes (such as Alg. 2 of [bachem_practical_2017], or other propositions such as in [feldman_unified_2011, makarychev_bi-criteria_2015]) may be used to efficiently find an upper bound of the sensitivity for all : . Noting , and setting , one shows that the coreset property may be guaranteed in the iid framework provided that . This idea of using bi-criteria approximations to upper bound the sensitivity also goes back to [langberg_universal_2010] and has been used in many works on coresets [feldman_unified_2011, makarychev_bi-criteria_2015, braverman_new_2016, bachem_one-shot_2017].

Note that if one authorizes coresets with negative weights (that is, authorizes negative weights in the estimated cost equation 5), then the stated result may be further improved [feldman_unified_2011]. Nevertheless, we prefer to restrict ourselves to positive weights as optimization algorithms such as Lloyd’s -means heuristics [lloyd_least_1982] are in practice more straightforward to implement on positively weighted sets rather than on sets with possibly negative weights.

Finally, Braverman et al. (Thm 5.5 of [braverman_new_2016]) improve the previous theorem by showing that under the same non-uniform iid framework, the coreset property is guaranteed provided that , thus reducing the term in to . In this paper, we present results proportional to the squared total sensitivity , and we thus prefer to focus on the results of Thm. 2.3 in order to ease comparison.

### 2.5. Correlated importance sampling

Eq. (9) is not suited to correlated sampling and, in the following, we will use a slightly different importance sampling estimator, more adapted to this case. Consider a point process defined on that outputs a random sample . For each data point , denote by its inclusion (or marginal) probability:

 (15) πi=P(xi∈S).

Moreover, denote by the random Boolean variable such that if , and otherwise. In the following, we will focus on the following definition of the importance sampling cost estimator :

 (16) ^L(S,θ)=∑if(xi,θ)ϵiπi.

By construction, , such that is an unbiased estimator of :

 (17) E(^L(S,θ))=L(X,θ).

Studying the coreset property in this setting boils down to studying the concentration properties of around its expected value.

### 2.6. Determinantal Point Processes

In order to induce negative correlations within the samples, we choose to focus on Determinantal Point Processes (DPP), point processes that have recently gained attention due to their ability to output “diverse” subsets within a tractable probabilistic framework (for instance with explicit formulas for marginal probabilities). In the following, denotes the set of first integers .

###### Definition 2.4 (Determinantal Point Process [kulesza_determinantal_2012]).

Consider a point process, i.e., a process that randomly draws an element . It is determinantal if there exists a semi-definite positive matrix verifying such that, for every ,

 P(A⊆S)=det(KA),

where is the restriction of to the rows and columns indexed by the elements of . is called the marginal kernel of the DPP.

By definition, the probability of inclusion of , denoted by , is equal to and the expected number333in fact, the number of samples of a DPP is itself random: it is a sum of Bernoullis parametrized by

’s eigenvalues (see

[kulesza_determinantal_2012] for details) of samples is . Moreover, to gain insight in the repulsive nature of DPPs, one may readily see that the joint marginal probability of sampling and reads: and is necessarily smaller than , the joint probability in the case of uncorrelated sampling. The stronger the “interaction” between and (encoded by the absolute value of element ), the smaller the probability of sampling both jointly: this determinantal nature thus favors diverse sets of samples.

Our goal will be to design the best possible such that sampling a DPP with marginal kernel guarantees the coreset property with high probability.

## 3. Coreset theorems

We now detail our main contributions. In Sections 3.1 to 3.2, we present our main theorems providing sufficient conditions on the marginal probabilities (i.e., the diagonal elements of ) to guarantee the coreset property. We will see that, similar to the iid case (theorem 2.3), the optimal marginal probability should be set proportional to the sensitivity. These theorems are valid for any choice of non-diagonal elements of the matrix . We further discuss in Section 3.3

how one may take advantage of these additional degrees of freedom to improve the coreset performance over iid sampling.

### 3.1. Determinantal Point Processes for Coresets

###### Theorem 3.1 (DPP for coresets).

Consider a sample from a DPP with marginal kernel . Let , . Denote by the minimum number of balls of radius necessary to cover . With probability higher than , is a -coreset provided that

 (18) μ⩾μ∗=max(μ∗1,μ∗2)

with:

 (19) μ∗1 =32ϵ2(ϵmaxiσi¯πi+4(maxiσi¯πi)2)log10nδ, (20) μ∗2 =32ϵ2⎛⎝ϵN¯π% min+4N2¯π2min⎞⎠log10δ,

and .

The proof is provided in Appendix A. Note that and are not independent of : they are in fact dependent via . While this formulation may be surprising at first, this is due to the fact that in non-iid settings, separating from is not as straightforward as in the iid case (in Thm. 2.3, and are independent) . Also, we decide upon this particular formulation of the theorem to mimic classical concentration results obtained with iid sampling.

In order to simplify further analysis, we suppose from now on that . As shown in the second lemma of Appendix B, this is in fact verified in the -means case on which we will focus in Section 4. Nevertheless, the following results may be generalized to cases with unconstrained if needed.

###### Lemma 3.2.

If , then and the coreset property of Theorem 3.1 is verified if:

 (21) μ⩾μ∗=32ϵ2(ϵmaxiσi¯πi+4(maxiσi¯πi)2)log10nδ

with .

###### Proof.

Denote by the index for which is minimal and, provided that , one has:

 maxiσi¯πiN¯πmin⩾Nσj⩾Nσmin⩾1.

This implies that:

 (22) (maxiσi¯πiN¯πmin)2⩾log10δ / (log10δ+logn),

as is necessarily larger than 1. One can show that Eq. (22) is equivalent to , such that . ∎

One would like to have the coreset guarantee for a minimal number of samples, that is: to find the marginal probabilities minimizing .

###### Corollary 3.3.

If there exists and such that:

 (23) ∀i ασi⩽πi⩽αβσi, (24) and αβ⩾32ϵ2(ϵ+4S)log10nδ,

then is a -coreset with probability at least . In this case, the expected number of samples verifies:

 μ⩾32ϵ2βS(ϵ+4S)log10nδ.
###### Proof.

Let us suppose that there exists and such that:

 ∀i,ασi⩽πi⩽αβσi.

Note that:

 ϵmaxiσiπi+4(maxiσiπi)2μ⩽ϵα+4μα2⩽ϵα+4βSα=βα(ϵβ+4S)⩽βα(ϵ+4S).

Thus, the inequality

 αβ⩾32ϵ2(ϵ+4S)log10nδ

implies:

 1⩾32ϵ2(ϵmaxiσiπi+4(maxiσiπi)2μ)log10nδ,

that we recognize as the coreset condition (21) by multiplying on both sides by : is indeed a -coreset with probability larger than . Moreover, in this case:

 μ=∑iπi⩾α∑iσi=αS⩾32ϵ2βS(ϵ+4S)log10nδ.

Corollary 3.3 is applicable to cases where is not too large. In fact, in order for to be smaller than , and thus smaller than as is a probability, should always be set inferior to . Now, if is so large that , then, even by setting to its minimum value , there is no admissible verifying both conditions (26) and (27). We refer to App. C for a simple workaround if this issue arises. We will further see in the experimental section (Section 6) that outliers are not an issue in practice.

### 3.2. m-Determinantal Point Processes for coresets

In some cases, we would like to specify deterministically the number of samples, instead of having a random number of them (with a given mean). This leads to -DPPs: DPPs conditioned to output samples.

###### Definition 3.4 (m-Dpp [kulesza_determinantal_2012]).

Consider a point process that randomly draws an element . This process is an -DPP with kernel if:

1. , where is a normalization constant.

Note that , the marginal probability of inclusion, is not necessarily equal to anymore in the case of an -DPP. In fact:

 πi=1Z∑S s.t |S|=m and i∈Sdet(KS)

We deliver the following result assuming that .

###### Theorem 3.5 (m-DPP for coresets).

Let be a sample from an -DPP with kernel , , and the minimal number of balls of radius necessary to cover . We assume for simplicity that . is a -coreset with probability larger than provided that:

 (25) m⩾32ϵ2(maxiσi¯πi)2log4nδ

with . Also, if there exists and such that:

 (26) ∀i ασi⩽πi⩽αβσi, (27) and αβ⩾32ϵ2Slog4nδ,

then is a -coreset with probability larger than . In this case, the number of samples verifies:

 m⩾32ϵ2βS2log4nδ.
###### Proof.

According to [pemantle_concentration_2014], replace Eq. (44) by:

 (28) m⩾8ϵ2C2log2δ,

with , where is a shorthand for ; and Eq. (46) by:

 (29) m⩾8ϵ2N2¯π2minlog2δ,

and change accordingly the rest of the proof. ∎

### 3.3. Links with the iid case and the variance argument

Let us first compare our results with Thm. 2.3 obtained in the iid setting. A few remarks are in order:

1. setting to 1 and to in Thm 3.5, the minimum number of required samples is , to compare to of Thm 2.3, where is the pseudo-dimension of . being the number of balls of radius necessary to cover , it will typically be to the power of the ambient dimension of (similar to ). Both forms are very similar, up to the dependency in and in of the log term. This difference is due to the fact that coreset theorems in the iid case (see for instance [bachem_practical_2017]) take advantage of powerful results from the Vapnik-Chervonenkis (VC) theory such as the ones detailed in [li_improved_2001]. Unfortunately, these fundamental results are valid in the iid case, and are not easily generalized to the correlated case. Further work should enable to reduce this small gap.

2. Outliers are not naturally dealt with using our proof techniques, mainly due to our multiple use of the union bound that necessarily englobes the worse-case scenario. In fact, in the importance sampling estimator used in the iid case (Eq. 9), outliers are not problematic as they can be sampled several times. In our setting, outliers are constrained to be sampled only once, which in itself makes sense, but complicates the analysis. Empirically, we will see in Section 6 that outliers are not an issue.

3. Finally, our results take only into account the inclusion probability of the DPP, that is: the diagonal elements of . These theorems are thus valid for any choice of non-diagonal elements (provided stays semi-definite positive with eigenvalues between 0 and 1). As we will see with the following variance argument: a smart choice of ’s non-diagonal elements will necessarily improve our results, thus outperforming the iid setting.

###### Theorem 3.6 (The variance argument).

Given , and writing the variance of the importance sampling estimator of Eq. 16 in the iid case, we have:

 (30) \emph{Var}(^L)=\emph{Var}iid−∑i≠jK2ijπiπjf(xi,θ)f(xj,θ).

As the function is positive, the variance of via DPP sampling is thus necessarily smaller than its iid counterpart with same probability of inclusion.

###### Proof.

We have:

 (31) Var(^L) =E(^L2)−E(^L)2 (32) =∑i,jE(ϵiϵj)πiπjf(xi,θ)f(xj,θ)−L2.

As is sampled from a DPP, we have: , i.e.:

 (33) Var(^L) =Variid−∑i≠jK2ijπiπjf(xi,θ)f(xj,θ),

where is the variance one would obtain with a Poisson-type uncorrelated sampling strategy with same marginal probability, i.e., for processes such that . ∎

As a consequence, adding negative correlations (i.e., non-zero non-diagonal elements of ) necessarily decreases the estimation’s variance. To conclude this Section: we provided theorems that explain how the diagonal elements of should be set in order to match the iid performance. And we show here that any choice of off-diagonal elements of (provided stays SDP with eigenvalues between and ) will necessarily improve the coreset performance of DPP sampling versus its iid couterpart. In addition, this variance equation will provide useful indications for our choice of marginal kernel in Section 5.

## 4. Application to k-means

The above results are valid for any learning problem of the form detailed in Section 2.1. We now specifically consider the -means problem on a set comprised of datapoints in . This problem boils down to finding centroids , all in , such that the following cost is minimized:

 L(X,θ)=∑x∈Xf(x,θ)   with   f(x,θ)=minc∈θ∥x−c∥2.

Let be the diameter of the minimum eclosing ball of (the smallest ball enclosing all points in ).

Theorem 3.1 is applicable to the -means problem, such that:

###### Corollary 4.1 (DPP for k-means).

Let be a sample from a DPP with marginal kernel . Let . With probability at least , is a -coreset provided that:

 (34) μ⩾μ∗=32ϵ2(ϵmaxiσi¯πi+4(maxiσi¯πi)2)(kdlog(24ρ2ϵ⟨f⟩opt+1)+log10δ),

with .

If there exists and such that:

 (35) ∀i, ασi⩽πi⩽αβσi, (36) and αβ⩾32ϵ2(ϵ+4S)(kdlog(24ρ2ϵ⟨f⟩opt+1)+log10δ),

then is a -coreset with probability larger than . In this case, the expected number of samples verifies:

 μ⩾32ϵ2βS(ϵ+4S)(kdlog(24ρ2ϵ⟨f⟩opt+1)+log10δ).
###### Proof.

Let us write the minimum enclosing ball of , of diameter . The potentially interesting centroids are necessarily included in such that the space of parameters in the -means setting is the set of all possible centroids in : . The metric we consider is the Hausdorff metric associated to the Euclidean distance:

 ∀θ,θ′,dΘ(θ,θ′)=max

- An -net of . Consider an -net of : it consists in at least small balls of radius (see, e.g., Lemma in [geer_empirical_2000]). Consider of cardinality . Let us show that is an -net of , that is:

 ∀θ∈Bk,∃θ∗∈Γ s.t. dΘ(θ,θ∗)⩽ϵ′.

In fact, consider . By construction, as is an -net of , we have:

 ∀i=1,…,k∃c∗i∈ΓB s.t. ∥∥ci−c∗i∥∥⩽ϵ′.

Writing , one has:

 dΘ(θ,θ∗)⩽ϵ′,

which proves that is an -net of . The number of balls of radius necessary to cover is thus at least .

- is -Lipschitz with . Consider any , and . We want to show that:

 −γ dΘ(θ,θ′)⩽f(x, θ)−f(x,θ′)⩽γ dΘ(θ,θ′).

Let us write the centroid in closest to and the centroid in closest to . Moreover, let us write the centroid in closest to . Note that and are not necessarily equal. By definition of , one has:

 ∥∥x−c′∥∥⩽∥∥x−~c′∥∥,

such that:

 ∥∥x−c′∥∥2−∥x−c∥2⩽∥∥x−~c′∥∥2−∥x−c∥2⩽∥∥~c′−c∥∥2⩽dΘ(θ,θ′).

Thus:

 f(x,θ′)−f(x,θ)=∥∥x−c′(x)∥∥2−∥x−c∥2 =(∥∥x−c′∥∥−∥x−c∥)(∥∥x−c′∥∥+∥x−c∥) ⩽(∥∥x−c′∥∥+∥x−c∥) dΘ(θ,θ′)⩽2ρ dΘ(θ,θ′).

- Finally, , as shown by the second Lemma of Appendix B.

Given all these elements, Thm. 3.1 is thus applicable to the -means setting and one obtains the desired result. ∎

Similarly, in the case of fixed-size DPP, Thm. 3.5 is applicable to the -means problem, such that:

###### Corollary 4.2 (m-DPP for k-means).

Let be a sample from an -DPP with marginal kernel . Let . With probability at least , is a -coreset provided that:

 (37) m⩾m∗=32ϵ2(maxiσi¯πi)2(kdlog(24ρ2ϵ⟨f⟩opt+1)+log4δ),

with .

If there exists and such that:

 (38) ∀i, ασi⩽πi⩽αβσi, (39) and αβ⩾32ϵ2S(kdlog(24ρ2ϵ⟨f⟩opt+1)+log4δ),

then is a -coreset with probability larger than . In this case, the number of samples verifies:

 m⩾32ϵ2βS2(kdlog(24ρ2ϵ⟨f⟩opt+1)+log4δ).
###### Remark 4.3.

Remember from Appendix C that should not be too large in order to be able to find admissible values of verifying Eqs. (35) and (36): it should in fact verify . Thankfully, this constraint is often very loose. For instance, in the case of -means (-means with ) and supposing without loss of generality that the data is centered (i.e., ), we show in the first lemma of Appendix B that , where . Thus, , and the constraint boils down to:

 (40) maxi∥xi∥2v⩽O(Nd)

Suppose for instance that the underlying data distribution is a Gaussian centered in the origin with variance . is an estimator of , such that is the normalized norm of the most extreme event of the drawing and is typically smaller than with very high probability, implying a very loose constraint indeed in our context of large . If for any reason there are true outliers for which is larger than , then, following the strategy outlined in Appendix C, one samples them beforehands, associates to each of them a weight of 1, and then applies the sampling theorems to the rest of the data.

## 5. Implementation for k-means

### 5.1. The DPP’s ideal marginal kernel

Following the theorems’ results, the ideal strategy (although unrealistic) to build the ideal marginal kernel would be as follows. 1/ Deal with outliers as explained in Appendix C until . 2/ Compute all . 3/ Set and . 4/ Set all to . 4/ Find all non-diagonal elements of in order to minimize for all the estimator’s variance, as derived in Eq. (33):

 (41) Var(^L) =Variid−∑i≠jK2ijπiπjf(xi,θ)f(xj,θ)

while constraining to be a valid marginal kernel, i.e.: SDP with , 5/ sample a DPP with kernel . On our way to derive a practical algorithm with a linear complexity in , many obstacles stand before us: there is no known polynomial algorithm to compute all in the general setting, solving exactly the minimization problem of step 4 under eigenvalue constraint is involved, and sampling from this engineered ideal costs number of operations. Designing a linear-time algorithm that provably verifies under a controlled error the conditions of our previous theorems is out-of-scope of this paper. In the following, we prefer to first recall the intuitions behind the construction of a good kernel, and then discuss the choice of kernel we advocate.

### 5.2. In practice: a marginal kernel based on the Gaussian kernel

In order for to be a good candidate for coresets, it needs to verify the following two properties:

• As indicated by the theorems, the diagonal entries should increase as the associated increases.

• As indicated by the variance equation, off-diagonal elements should be as large as possible (in absolute value) given the eigenvalue constraints. In fact, we cannot set all non-diagonal entries of to large values as the matrix’s 2-norm would rapidly be larger than 1. We thus need to choose the best pairs for which it is worth setting a large value of . A first glance at the variance equation indicates that the larger is, the larger should be, in order to decrease the variance as much as possible. Recall nevertheless that in the coreset setting, all sampling parameters should be independent of . The off-diagonal elements should thus verify the following property: the larger is the correlation between and (the more similar are and for all ), the larger should be.

We show in the following in what ways the choice of marginal kernel

 K=L(I+L)−1

with the Gaussian kernel matrix with parameter :

 ∀(i,j)Lij=exp−∥∥xi−xj∥∥2s2,

is a good candidate to build coresets for -means. Note that is called the -ensemble associated to  [kulesza_determinantal_2012]. Let us write

the orthonormal eigenvector basis of

and its diagonal matrix of sorted eigenvalues, . and naturally depend on . One shows for instance that, with respect to , is a monotonically increasing function between and .

Concerning the off-diagonal elements of , let us first note that if and are correlated (that is, in the -means setting, if they are close to each other), then

 Kij=∑kλk1+λkuk(i)uk(j)

is large in absolute value. In fact, in the limit where , then and . The determinant of the submatrix of indexed by and is therefore null: sampling both will never occur. Thus, the closer are and , the lower is the chance of sampling both jointly. Moreover, if and are far from each other (for instance, in different clusters), then the entries and of eigenvectors will be very different. Fo instance, say the dataset contains two well separated clusters of similar size. If the Gaussian parameter is set to the size of these clusters, then the kernel matrix will be quasi-block diagonal, with each block corresponding to the entries of each cluster. Also, each eigenvector will have energy either in one cluster or the other such that is necessarily small if and belong to different clusters, and the event of sampling both jointly is probable.

Concerning the probability of inclusion of , we have:

 Kii=∑kλk1+λkqi(k)2,

where is the vector of size verifying . For all , . The probability of inclusion is thus directly linked to the values of that contain the energy of : the more the energy of is contained on high values of , the larger is the probability of inclusion. Say we are again in a situation where the clusters and the choice of Gaussian parameter are such that is quasi block diagonal. Within each block, the eigenvector associated with the highest eigenvalue corresponds approximately to the constant vector. These eigenvectors being normalized, the associated entry of is thus approximately equal to where is the size of the cluster containing data . Typically, if the cluster is small, that is, if tends to 1, the associated entry tends to 1 as well, such that all the energy of is drawn towards high values of , thus increasing the probability of inclusion of . In other words, the more isolated, the higher the chance of being sampled. This corresponds to the intuition one may obtain for the sensitivity . It has indeed been shown that the sensitivity may be interpreted as a measure of outlierness [lucic_linear-time_2016].

In the context of coresets for -means, we thus advocate to sample DPPs via a Gaussian kernel -ensemble. We now move on to detailing an efficient sampling implementation.

### 5.3. Efficient implementation

Sampling a DPP from the Gaussian -ensemble verifying

 ∀(i,j)Lij=exp−∥∥xi−xj∥∥2s2

consists in the following steps:

1. Compute .

2. Diagonalize in its set of eigenvectors and eigenvalues .

3. Sample a DPP given and via Alg. 1 of [kulesza_determinantal_2012].

Step 1 costs , step 2 costs , step 3 costs . This naive approach is thus not practical. We detail in Appendix D how to reduce the overall complexity to