Simple and sharp analysis of k-means||

We present a truly simple analysis of k-means|| (Bahmani et al., PVLDB 2012), a distributed variant of the k-means++ algorithm (Arthur and Vassilvitskii, SODA 2007), and improve its round complexity from $O(\log \mathrm{Var}\,X)$, where $\mathrm{Var}\,X$ is the variance of the input data set, to $O(\log \mathrm{Var}\,X / \log\log \mathrm{Var}\,X)$, which we show to be tight.

1 Introduction

Clustering is one of the classical machine learning problems. Arguably the simplest and most basic formalization of clustering is the k-means formulation: we are given a (large) set of points in the Euclidean space and are asked to find a (small) set of $k$ centers so as to minimize the sum of the squared distances between each point and the closest center. Due to its simplicity, k-means is considered as the problem that tests our understanding of clustering.

The classical, yet still state-of-the-art algorithm k-means++ (Arthur & Vassilvitskii, 2007a) combines two ideas to approach the problem. First, a fast randomized procedure finds a set of $k$ centers that by itself is known to be $O(\log k)$-competitive in expectation with respect to the optimal solution. Then, the classical Lloyd’s algorithm (Lloyd, 1982) is run to improve the found solution until a local minimum is achieved.

A significant disadvantage of k-means++ is the inherently sequential nature of the first, seeding step: one needs to pass through the whole data $k$ times, each time to sample a single center. To overcome this problem, (Bahmani et al., 2012) devised k-means||: a distributed version of the k-means++ algorithm.

In their algorithm one passes through the dataset only a few times to extract a set of candidate centers, from which one later chooses the final $k$ centers by means of the classical k-means++ algorithm.

Our contribution

In this work we first provide a new, simple analysis of k-means||, thus simplifying known proofs (Bahmani et al., 2012) and (Bachem et al., 2017). In particular, if we denote by $\mathrm{Var}\,X$ the sum of squared distances of the data points of $X$ to their mean (we call this quantity the variance of the data) and by $\varphi^*$ the cost of the optimal solution, we prove that $O(\log(\mathrm{Var}\,X/\varphi^*))$ rounds of the k-means|| algorithm suffice to get an expected constant approximation guarantee for the set of oversampled centers. Then we proceed by refining the analysis to provide a better bound on the number of sampling rounds needed by the algorithm: instead of $O(\log(\mathrm{Var}\,X/\varphi^*))$ rounds, we prove that $O(\log(\mathrm{Var}\,X/\varphi^*)/\log\log(\mathrm{Var}\,X/\varphi^*))$ rounds suffice. This bound even holds with a high probability guarantee, whereas the bounds proved in previous works hold only in expectation. Finally, we prove the second bound to be best possible for a wide range of values of $k$, building on a lower bound of (Bachem et al., 2017).

The first analysis of k-means|| from (Bahmani et al., 2012) is a remarkable display of skill, as it invokes linear programming duality as a part of the argument. The second analysis from (Bachem et al., 2017) is more similar to ours, as it only relies on basic lemmas known from the analysis of k-means++ from (Arthur & Vassilvitskii, 2007b). Our one-page analysis is considerably shorter and, we believe, also simpler. It can be summed up as “view the algorithm as a balls into bins process”. We explain this in more detail in Section 3.

2 Background and notation

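We mostly adopt the notation of the paper (Bahmani et al., 2012). Let $X \subseteq \mathbb{R}^d$ be a point set in the $d$-dimensional Euclidean space. We denote the standard Euclidean distance between two points $x, y \in \mathbb{R}^d$ by $\|x - y\|$ and for a subset $Y \subseteq \mathbb{R}^d$ we define the distance between $x$ and $Y$ as $d(x, Y) = \min_{y \in Y} \|x - y\|$.

For a subset $Y \subseteq X$ we denote by $\mu(Y)$ its centroid, i.e., $\mu(Y) = \frac{1}{|Y|}\sum_{y \in Y} y$. For a set of points $C$ and $Y \subseteq X$ we define the cost of $Y$ with respect to $C$ as $\varphi_Y(C) = \sum_{y \in Y} d^2(y, C)$ and use the shorthand $\varphi_x(C)$ for $\varphi_{\{x\}}(C)$ and $\varphi_Y(c)$ for $\varphi_Y(\{c\})$. It is easy to check that for a given point set $A$, the center $c$ that minimizes its cost $\varphi_A(c)$ is its mean $\mu(A)$. We follow (Bachem et al., 2017) to denote $\mathrm{Var}\,X = \varphi_X(\mu(X))$ and call $\mathrm{Var}\,X$ the variance of the data.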
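The definitions of cost, centroid and variance translate directly into a few lines of code. The following is a minimal sketch, assuming points are given as tuples of floats; the helper names `sq_dist`, `cost`, `centroid` and `variance` are illustrative and do not come from the cited papers.

```python
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def cost(points: Sequence[Point], centers: Sequence[Point]) -> float:
    """Sum over all points of the squared distance to the closest center."""
    return sum(min(sq_dist(x, c) for c in centers) for x in points)


def centroid(points: Sequence[Point]) -> Point:
    """Coordinate-wise mean of a non-empty point set."""
    n = len(points)
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / n for i in range(dim))


def variance(points: Sequence[Point]) -> float:
    """Cost of the point set with respect to its own centroid."""
    return cost(points, [centroid(points)])


# Four corners of a square with side 2: the centroid is the middle of the square.
square = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
print(centroid(square), variance(square))
```

Each corner is at squared distance 2 from the middle, so the variance of this toy set is 8; using two opposite corners as centers gives the same cost with two centers.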

The goal of the k-means problem is to find a set of centers $C$, $|C| = k$, that minimizes the cost $\varphi_X(C)$ for a given set of points $X$. The k-means problem is known to be NP-hard (Aloise et al., 2009; Mahajan et al., 2009) and even hard to approximate up to arbitrary precision (Awasthi et al., 2015; Lee et al., 2017).

From now on we fix an optimal solution $C^*$ and denote by $\varphi^*$ its cost.

2.1 k-means++ algorithm

The classical k-means++ algorithm (Arthur & Vassilvitskii, 2007b; Ostrovsky et al., 2013) computes the centers in $k$ sampling steps. After the first step, where the first center is taken from the uniform distribution, each subsequent step samples a new point from the $D^2$-distribution: if $C$ is the current set of centers, $x \in X$ is sampled with probability $\varphi_x(C)/\varphi_X(C)$, i.e., we sample the points proportionally to their current cost.
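The seeding procedure just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `d2_seeding` and the early exit when all remaining points coincide with a chosen center are choices made for this example.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def d2_seeding(points: Sequence[Point], k: int, rng: random.Random) -> List[Point]:
    """Pick one center uniformly, then sample centers proportionally to current cost."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Current cost of every point: squared distance to the closest chosen center.
        weights = [min(sq_dist(x, c) for c in centers) for x in points]
        if sum(weights) == 0:
            # Every point already coincides with a center; nothing left to sample.
            break
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
print(d2_seeding(pts, 2, random.Random(0)))
```

Since a point that is already a center has cost zero, it is never sampled again, so asking for as many centers as there are distinct points returns every point exactly once.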

The k-means++ algorithm is known to provide an $O(\log k)$ approximation guarantee, in expectation. The analysis crucially makes use of the following two lemmas that we will also use.

The first lemma tells us that if we sample uniformly a random point $c$ from some point set $A$, we expect the cost $\varphi_A(c)$ to be comparable with the cost $\varphi_A(\mu(A))$, i.e., the smallest cost achievable with one center. One can think of $A$ as being a cluster of the optimal solution or the whole point set $X$.

Lemma 1 (Lemma 3.1 in (Arthur & Vassilvitskii, 2007a)).

Let $A$ be an arbitrary set of points. If we sample a random point $c \in A$ according to the uniform distribution, we have $\mathbb{E}[\varphi_A(c)] = 2\varphi_A(\mu(A))$.
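For a finite point set, the expectation in Lemma 1 is just an average over all choices of the center, so the statement (expected cost of a uniformly sampled center equals twice the cost of the centroid) can be checked exactly on a toy example. The sketch below assumes points are tuples of floats; all function names are illustrative.

```python
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def cost_single(points: Sequence[Point], center: Point) -> float:
    """Cost of the point set when a single center is used."""
    return sum(sq_dist(x, center) for x in points)


def centroid(points: Sequence[Point]) -> Point:
    """Coordinate-wise mean of a non-empty point set."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def expected_cost_uniform_center(points: Sequence[Point]) -> float:
    """Average cost over all ways of picking the single center uniformly from the set."""
    return sum(cost_single(points, c) for c in points) / len(points)


pts = [(0.0,), (1.0,), (5.0,)]
lhs = expected_cost_uniform_center(pts)          # (26 + 17 + 41) / 3
rhs = 2 * cost_single(pts, centroid(pts))        # 2 * (4 + 1 + 9)
print(lhs, rhs)
```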

The second lemma ensures that up to a constant factor the same guarantee holds even for the $D^2$ distribution.

Lemma 2 (Lemma 3.2 in (Arthur & Vassilvitskii, 2007a)).

Let $A$ be an arbitrary set of points, $C$ be an arbitrary set of centers and $c \in A$ be a point chosen by $D^2$ weighting. Then, $\mathbb{E}[\varphi_A(C \cup \{c\})] \le 8\varphi_A(\mu(A))$.

The analysis of k-means++ uses the above two lemmas together with the important observation that the $D^2$ sampling samples from an optimal cluster with probability proportional to its current cost, hence we preferably sample from costly clusters.

2.2 k-means|| algorithm

The distributed variant of the k-means++ algorithm, k-means||, was introduced in (Bahmani et al., 2012).

The algorithm consists of two parts. In the first, overseeding part (see Algorithm 1), we proceed in $T$ sequential rounds after sampling uniformly a single center as in the first step of k-means++. In each of the $T$ sampling rounds we sample each point $x$ of $X$ with probability $\min(1, \ell\varphi_x(C)/\varphi_X(C))$, i.e., $\ell$ times bigger than the probability of taking the point in k-means++, independently of the other points.
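The overseeding part can be sketched as below. This is a hedged, single-machine illustration, not the pseudocode of Algorithm 1: the name `overseed`, the parameter names `ell` and `rounds`, and the explicit cap of the sampling probability at 1 are assumptions of this example. Within a round, all probabilities are computed with respect to the centers available at the start of the round, which is what makes the round easy to distribute.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def overseed(points: Sequence[Point], ell: float, rounds: int,
             rng: random.Random) -> List[Point]:
    """Oversampling: one uniform center, then `rounds` independent sampling rounds."""
    centers = [rng.choice(points)]
    for _ in range(rounds):
        # Cost of each point with respect to the centers fixed at the start of the round.
        d2 = [min(sq_dist(x, c) for c in centers) for x in points]
        total = sum(d2)
        if total == 0:
            break  # all points coincide with some center
        # Each point is kept independently with probability min(1, ell * cost / total).
        sampled = [x for x, w in zip(points, d2)
                   if rng.random() < min(1.0, ell * w / total)]
        centers.extend(sampled)
    return centers


grid = [(float(i), float(j)) for i in range(5) for j in range(4)]
print(len(overseed(grid, ell=3, rounds=4, rng=random.Random(1))))
```

In expectation each round adds at most `ell` new centers, because the uncapped sampling probabilities sum to `ell`.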

In the second part of the algorithm we collect the set of sampled centers and create a new, weighted, instance of k-means in which the weight of every center is equal to the number of points of $X$ to which the given center is the closest. The new instance is solved, e.g., with k-means++ as in Algorithm 2. One can prove that finding a set $C$ with $\varphi_X(C) = O(\varphi^*)$ in Algorithm 1 implies that the overall approximation guarantee is, up to a constant, the same as the approximation guarantee of the algorithm used in the second part of the algorithm (see (Bachem et al., 2017), proof of Theorem 1), which is, in this case, $O(\log k)$ in expectation.
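The weighting step of the second part can be sketched as follows. This is an illustrative helper (the name `center_weights` is an assumption); ties between equally close centers are broken in favor of the center with the smaller index, which the text does not specify.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def center_weights(points: Sequence[Point], centers: Sequence[Point]) -> List[int]:
    """Weight of a center = number of points for which it is the closest center."""
    weights = [0] * len(centers)
    for x in points:
        closest = min(range(len(centers)), key=lambda i: sq_dist(x, centers[i]))
        weights[closest] += 1
    return weights


points = [(0.0,), (1.0,), (10.0,)]
candidates = [(0.0,), (10.0,)]
print(center_weights(points, candidates))
```

The weights always sum to the number of input points, so the weighted instance has the same total mass as the original data set.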

Hence, the analysis of Algorithm 2 boils down to bounding the number of steps $T$ needed by Algorithm 1 to achieve a constant approximation guarantee for a given sampling factor $\ell$. The authors of (Bahmani et al., 2012) prove the following.

Theorem 1 (roughly Theorem 1 in (Bahmani et al., 2012)).

If we choose $\ell = k$ and $T = O(\log(\mathrm{Var}\,X/\varphi^*))$, Algorithm 1 gives a set $C$ with $\mathbb{E}[\varphi_X(C)] = O(\varphi^*)$.

Their result was later reproved in (Bachem et al., 2017). We provide a new, simple proof in Section 3.

2.3 Other related work

k-means++ was introduced in (Arthur & Vassilvitskii, 2007a) and a similar method was studied by (Ostrovsky et al., 2013). This direction led to approximation schemes (Jaiswal et al., 2012, 2015), constant approximation results based on additional local search (Lattanzi & Sohler, 2019; Choo et al., 2020), constant approximation bi-criteria results based on sampling more centers (Aggarwal et al., 2009; Ailon et al., 2009; Wei, 2016), approximate k-means++ based on Markov chains (Bachem et al., 2016b, a) or coresets (Bachem et al., 2018), analysis of hard instances (Arthur & Vassilvitskii, 2007a; Brunsch & Röglin, 2013) or under adversarial noise (Bhattacharya et al., 2019). Consult (Celebi et al., 2013) for an overview of different seeding methods for k-means.

There is a long line of work on the related $k$-median problem. After writing the paper, we found out that an algorithm of (Mettu & Plaxton, 2012) and its analysis are surprisingly similar to k-means|| and our analysis.

We discuss some related work on k-means more thoroughly at the end of Section 3.

3 Warm-up: simple analysis

In this section we provide a simple analysis of Algorithm 1 based on viewing the process as a variant of the balls into bins problem. Recall that in the most basic version of the balls into bins problem, one throws $k$ balls into $k$ bins, each ball into a uniformly randomly chosen bin, and asks, e.g., what is the probability that a certain bin is hit by a ball. This is equal to $1 - (1 - 1/k)^k \ge 1 - 1/e$; hence, we expect a constant proportion of the bins to be hit in a single step.
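As a quick numerical illustration of the basic balls into bins process, assume $k$ balls are thrown independently and uniformly into $k$ bins. A fixed bin is missed by all balls with probability $(1 - 1/k)^k$, which is at most $1/e$. The function name below is illustrative.

```python
import math


def hit_probability(k: int) -> float:
    """Probability that a fixed bin receives at least one of k balls thrown into k bins."""
    return 1.0 - (1.0 - 1.0 / k) ** k


for k in (1, 2, 10, 1000):
    print(k, hit_probability(k))
print("limit:", 1.0 - 1.0 / math.e)
```

The hit probability decreases with $k$ but never drops below $1 - 1/e \approx 0.63$, which is the constant proportion referred to in the text.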

To see the connection to our problem, we first define the notion of settled clusters that is similar to notions used e.g. in (Aggarwal et al., 2009; Lattanzi & Sohler, 2019).

Definition 1 (Settled clusters).

We call a cluster $A$ of the optimum solution $C^*$ settled with respect to the current solution $C$ if $\varphi_A(C) \le 10\varphi_A(C^*)$. Otherwise, we call $A$ unsettled with respect to $C$.

We view the clusters of $C^*$ as bins and each sampling round of Algorithm 1 as shooting at each bin and hitting it (i.e., making the cluster settled) with some probability. Intuitively, this probability is proportional to the cost of the cluster, since this is how we defined the probability of sampling any point of $X$. So, we view the process as a more general and repeated variant of the balls into bins process, where the costs of the clusters act like “weights” of the bins and we sample with probability (roughly) proportional to these weights. We prove now that clusters are being settled with probability roughly proportional to their cost (unless they are very costly).

Proposition 1.

Let $C$ be the current set of sampled centers and let $A$ be an unsettled cluster of the optimum solution. The cluster $A$ is not made settled in the next iteration of Algorithm 1 with probability at most

$$\exp\left(-\frac{\ell\varphi_A(C)}{5\varphi_X(C)}\right).$$

Intuitively, for clusters with $\varphi_A(C) \le \varphi_X(C)/\ell$ the probability of hitting them in one step is of order $\ell\varphi_A(C)/\varphi_X(C)$ (using that $e^{-x} \approx 1 - x$ for small positive $x$), while for more costly clusters the probability of hitting them is lower bounded by some constant.

Proof.

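If we sample a point $c$ from $A$ according to $D^2$ weights, we have $\mathbb{E}[\varphi_A(C \cup \{c\})] \le 8\varphi_A(C^*)$ by Lemma 2, where $C^*$ is the fixed optimal solution. Hence, by Markov's inequality, $A$ is made settled with probability at least $1/5$. In other words, there is a subset of points $A' \subseteq A$ with $\varphi_{A'}(C) \ge \varphi_A(C)/5$ such that sampling a point from $A'$ makes $A$ settled. If $A'$ contains a point $x$ with $\ell\varphi_x(C)/\varphi_X(C) \ge 1$, we sample $x$ and make $A$ settled with probability $1$. Otherwise, we have

$$P(A \text{ does not get settled}) \le \prod_{x \in A'}\left(1 - \frac{\ell\varphi_x(C)}{\varphi_X(C)}\right) \le \exp\left(-\sum_{x \in A'}\frac{\ell\varphi_x(C)}{\varphi_X(C)}\right) \le \exp\left(-\frac{\ell\varphi_A(C)}{5\varphi_X(C)}\right),$$

where we used $1 - x \le e^{-x}$ and $\varphi_{A'}(C) \ge \varphi_A(C)/5$. ∎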
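The middle step of the proof, bounding a product of miss probabilities by an exponential of the negated sum, follows from $1 - x \le e^{-x}$ and can be checked numerically. The sketch below takes an arbitrary list of per-point sampling probabilities in $[0, 1]$; the concrete numbers and function names are illustrative only.

```python
import math
from typing import Sequence


def prob_all_missed(probs: Sequence[float]) -> float:
    """Probability that none of the independently sampled points is picked."""
    return math.prod(1.0 - p for p in probs)


def exponential_bound(probs: Sequence[float]) -> float:
    """Upper bound exp(-sum of probabilities) obtained from 1 - x <= exp(-x)."""
    return math.exp(-sum(probs))


sampling_probs = [0.1, 0.25, 0.5]
print(prob_all_missed(sampling_probs), exponential_bound(sampling_probs))
```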

Similarly to the classical balls into bins problem, we can now observe that the total cost of unsettled clusters drops by a constant factor in each step. This is Theorem 2 in (Bahmani et al., 2012), the crux of their analysis.

From now on we simplify the notation and write $\varphi_X^t$ for the cost of the point set $X$ after $t$ sampling rounds of Algorithm 1. Moreover, by $\varphi_U^t$ we denote the total cost of yet unsettled clusters after $t$ sampling rounds.

Proposition 2 (roughly Theorem 2 in (Bahmani et al., 2012)).

Suppose that $\varphi_X^t \ge 20\varphi^*$. For $\ell = k$ we have

$$\mathbb{E}[\varphi_U^{t+1}] \le \left(1 - \frac{1}{30}\right)\varphi_U^t.$$

In other words, while we did not achieve constant approximation, the expected cost of yet unsettled clusters drops by a constant factor in each iteration.

Proof.

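We split the unsettled clusters into two groups: a cluster $A$ whose cost $\varphi_A^t$ is at least a threshold of order $\varphi_U^t/k$ we call heavy, and otherwise we call it light. Note that the probability that a heavy cluster is not settled in the $(t+1)$-th iteration is by Proposition 1 bounded by

$$\exp\left(-\frac{k\varphi_A^t}{5\varphi_X^t}\right) \le \exp\left(-\frac{k\varphi_U^t}{5k\varphi_X^t}\right) \le \exp\left(-\frac{1}{10}\right) \le \frac{14}{15},$$

where we used that $\varphi_X^t \le 2\varphi_U^t$: this holds since otherwise more than half the cost of $X$ is formed by settled clusters, hence $\varphi_X^t < 20\varphi^*$, contradicting our assumption. Hence, after the sampling step, a heavy cluster does not contribute to the overall cost of unsettled clusters with probability at least $1/15$. This implies that the expected drop in the cost of unsettled clusters is at least

$$\varphi_U^t - \mathbb{E}[\varphi_U^{t+1}] \ge \sum_{A \text{ heavy}} \frac{\varphi_A^t}{15} = \frac{1}{15}\left(\varphi_U^t - \sum_{A \text{ light}} \varphi_A^t\right) \ge \frac{\varphi_U^t}{30},$$

where we used that the light clusters have total cost of at most $\varphi_U^t/2$. ∎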

Theorem 1 now follows directly (Bahmani et al., 2012; Bachem et al., 2017) and we prove it here for completeness.

Proof of Theorem 1.

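From Lemma 1 it follows that after we sample a uniformly random point, we have $\mathbb{E}[\varphi_U^0] \le \mathbb{E}[\varphi_X^0] = 2\,\mathrm{Var}\,X$. From Proposition 2 it follows that $\mathbb{E}[\varphi_U^{t+1}] \le \frac{29}{30}\mathbb{E}[\varphi_U^t] + 20\varphi^*$, where the additive term covers the case $\varphi_X^t < 20\varphi^*$. Applying this result $T$ times, we get

$$\mathbb{E}[\varphi_U^T] \le 2\left(\frac{29}{30}\right)^T \mathrm{Var}\,X + 20\varphi^* \cdot \sum_{t=0}^{T-1}\left(\frac{29}{30}\right)^t \le 2\left(\frac{29}{30}\right)^T \mathrm{Var}\,X + 600\varphi^*.$$

Choosing $T = O(\log(\mathrm{Var}\,X/\varphi^*))$ and recalling that $\varphi_X^T \le \varphi_U^T + 10\varphi^*$ yields the desired claim. ∎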
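To get a feeling for the constants in this proof, the following sketch computes a number of rounds $T$ after which the first term $2(29/30)^T\,\mathrm{Var}\,X$ has dropped below $\varphi^*$, and evaluates the right-hand side of the recurrence bound. The function names and the sample values of the variance and the optimal cost are illustrative assumptions.

```python
import math


def rounds_needed(var_x: float, phi_opt: float) -> int:
    """A number of rounds T with 2 * (29/30)**T * var_x <= phi_opt (phi_opt > 0)."""
    if phi_opt <= 0:
        raise ValueError("phi_opt must be positive")
    ratio = 2.0 * var_x / phi_opt
    if ratio <= 1.0:
        return 0
    return math.ceil(math.log(ratio) / math.log(30.0 / 29.0))


def unsettled_cost_bound(T: int, var_x: float, phi_opt: float) -> float:
    """Right-hand side of the bound on the expected unsettled cost after T rounds."""
    geometric = sum((29.0 / 30.0) ** t for t in range(T))
    return 2.0 * (29.0 / 30.0) ** T * var_x + 20.0 * phi_opt * geometric


T = rounds_needed(1e6, 1.0)
print(T, unsettled_cost_bound(T, 1e6, 1.0))
```

The geometric sum is always below 30, so the second term never exceeds $600\varphi^*$, while the number of rounds grows only logarithmically in the ratio of the variance to the optimal cost.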

For the sake of simplicity, we did not optimize constants and analysed Algorithm 1 meaningfully only for the case $\ell = k$. In the following remarks we note how one can extend this (or some previous) analysis and then use it to compare k-means|| more carefully to a recent line of work.

Remark 1.

With more care, the approximation factor in Theorem 1 can be made arbitrarily close to . We omit the proof.

Remark 2.

With more care, for general one can prove that the number of steps of Algorithm 1 needed to sample a set of points that induce a cost of is for and for . We omit the proof.

Remark 2 allows us to make a closer comparison of k-means|| with a recent line of work of (Bachem et al., 2016a, 2018) that aims for very fast algorithms that allow for an additive error of $\varepsilon\,\mathrm{Var}\,X$.

According to Remark 2, to obtain such a guarantee for the oversampled set of centers, Algorithm 1 needs to set and sample for steps (this was observed by (Bachem et al., 2017)) or points and sample just once (i.e., ). The approximation factor of Algorithm 2 is then multiplied by an additional factor of $O(\log k)$, as this is the approximation guarantee of k-means++.

A quite close approach to k-means|| with is the one of (Bachem et al., 2018), whose authors propose a coreset algorithm that samples points from roughly the same distribution as the one used in the first sampling step of Algorithm 1. If we use their algorithm by running k-means++ on the provided coreset, we get an algorithm with essentially the same guarantees as Algorithm 2 with number of rounds and . The main difference is that in Algorithm 2, the weight of each sampled center used by the k-means++ subroutine is computed as the number of points for which the center is the closest, whereas in the coreset algorithm, each center is simply given a weight inversely proportional to the probability that the center is sampled. This allows the coreset algorithm to be faster than Algorithm 2 with and , whose time complexity is . This comes at the expense of a higher number of sampled points.

A beautiful paper of (Bachem et al., 2016a) uses the Metropolis algorithm on top of the classical k-means++ algorithm to again achieve additive $\varepsilon\,\mathrm{Var}\,X$ (and multiplicative $O(\log k)$) error, while sampling only points from the same distribution as Algorithm 2 with and . While the number of taken samples is only slightly higher than the one of Algorithm 2 with , their running time is much better .

We see that the main advantage of k-means|| lies in the possibility of running multiple, easily distributed, sampling steps that allow us to achieve strong guarantees.

3.2 Submodular context

One can observe why $O(\log(\mathrm{Var}\,X/\varphi^*))$ is a natural bound by considering a different way of achieving the same round complexity (Choo, ). First, note that after sampling the first point $c_1$, we have $\mathbb{E}[\varphi_X(\{c_1\})] = 2\,\mathrm{Var}\,X$ via Lemma 1 with the set $A$ chosen as the whole set $X$. The process of adding new points to the solution now satisfies a natural law of diminishing returns: for any $C_1 \subseteq C_2 \subseteq X$ and $c \in X$, adding $c$ to the smaller set reduces the cost at least as much as adding it to the larger one, that is,

$$\varphi_X(\{c_1\} \cup C_1) - \varphi_X(\{c_1\} \cup C_1 \cup \{c\}) \ge \varphi_X(\{c_1\} \cup C_2) - \varphi_X(\{c_1\} \cup C_2 \cup \{c\}).$$

In other words, the cost reduction $C \mapsto \varphi_X(\{c_1\}) - \varphi_X(\{c_1\} \cup C)$ is a submodular function (see e.g. (Krause & Golovin, ) for a collection of uses of submodularity in machine learning). Then one can use recent results about distributed algorithms for maximizing submodular functions (see e.g. (Mirzasoleiman et al., 2013; Barbosa et al., 2015a, b; Liu & Vondrak, 2018)) to get that in $O(1)$ distributed rounds, one can find a set $C$ of points such that

$$\varphi_X(\{c_1\} \cup C) - \varphi^* \le \left(\varphi_X(\{c_1\}) - \varphi^*\right)/2,$$

i.e., the distance to the best solution drops by a constant factor. Continuing the same process for $O(\log(\mathrm{Var}\,X/\varphi^*))$ rounds, one gets the same theoretical guarantees as with running Algorithm 2. However, the advantage of k-means|| is its extreme simplicity and speed. Moreover, rather surprisingly, we prove in the next section that the asymptotic round complexity of k-means|| is actually slightly better than logarithmic.
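The law of diminishing returns can be verified by brute force on a small instance: for every pair of nested center sets, the cost reduction from adding one more center to the smaller set should be at least the reduction from adding it to the larger set. The sketch below uses integer coordinates so that all arithmetic is exact; the function names are illustrative.

```python
from itertools import combinations
from typing import Sequence, Tuple

Point = Tuple[int, ...]


def sq_dist(x: Point, y: Point) -> int:
    """Squared Euclidean distance between two integer points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def cost(points: Sequence[Point], centers: Sequence[Point]) -> int:
    """Sum over all points of the squared distance to the closest center."""
    return sum(min(sq_dist(x, c) for c in centers) for x in points)


def marginal_gain(points: Sequence[Point], centers: Sequence[Point], c: Point) -> int:
    """Cost reduction obtained by adding the center c to the given centers."""
    return cost(points, centers) - cost(points, list(centers) + [c])


def has_diminishing_returns(points: Sequence[Point], c1: Point) -> bool:
    """Check gain(small, c) >= gain(big, c) for all nested sets small <= big."""
    others = [p for p in points if p != c1]
    subsets = [frozenset(s) for r in range(len(others) + 1)
               for s in combinations(others, r)]
    for small in subsets:
        for big in subsets:
            if not small <= big:
                continue
            for c in points:
                if marginal_gain(points, [c1, *small], c) < marginal_gain(points, [c1, *big], c):
                    return False
    return True


pts = [(0, 0), (1, 3), (4, 1), (7, 7), (2, 9)]
print(has_diminishing_returns(pts, pts[0]))
```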

4 Sharp analysis of k-means|| : upper bound

It may seem surprising that the proof of Theorem 1 can be strengthened, since even for the classical balls into bins problem, where we hit each bin with constant probability, we need $\Theta(\log k)$ rounds to hit all the bins with high probability. However, during our process we can disregard already settled clusters since they are not contributing substantially to the overall cost. If we go back to the classical balls into bins problem and let that process repeat on the same set of bins, with the additional property that in each round we throw each one of the $k$ balls to a random bin out of those that are still empty, we expect to hit all the bins in a mere $O(\log^* k)$ steps (Lenzen & Wattenhofer, 2011). Roughly speaking, this is because of the rapid decrease in the number of bins: after the first round, the probability of a bin remaining empty is roughly $1/e$, but after the second round it is only roughly $1/e^{e}$ since the number of bins decreased, in the next iteration it is even roughly $1/e^{e^{e}}$, and so on. In our, weighted, case we cannot hope for such a rapid decrease in the number of bins, since the costs of clusters can form a geometric series, in which case we get rid of only a small number of clusters in each step (cf. Section 5). In this section we show a more careful analysis that bounds the number of necessary steps by $O(\log(\mathrm{Var}\,X/\varphi^*)/\log\log(\mathrm{Var}\,X/\varphi^*))$. Moreover, this holds with high probability in $k$.

Footnote 1: We believe it is an interesting problem to analyze whether there are reasonable assumptions on the data under which the round complexity indeed follows the $\log^*$ behaviour. This could explain why in practice Algorithm 1 is run only for a few rounds (Bahmani et al., 2012).
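The repeated, unweighted balls into bins process described above is easy to simulate. The sketch below is an illustration under the assumption that in every round all $k$ balls are thrown uniformly into the bins that are still empty; the function name is hypothetical.

```python
import random


def rounds_until_all_hit(k: int, rng: random.Random) -> int:
    """Rounds needed when each round throws k balls into the still-empty bins."""
    empty = k
    rounds = 0
    while empty > 0:
        # Indices of the distinct empty bins hit by the k balls of this round.
        hit = {rng.randrange(empty) for _ in range(k)}
        empty -= len(hit)
        rounds += 1
    return rounds


rng = random.Random(0)
for k in (10, 1000, 100000):
    print(k, rounds_until_all_hit(k, rng))
```

Every round hits at least one empty bin, so the process always terminates; in typical runs the number of rounds stays very small even for large $k$, in line with the rapid decrease described in the text.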

In the rest of the paper we use the notation (and for completeness, if , let ) as a courtesy to the reader.

Concentration

Since we prove a high probability result, we recall the classical Chernoff bounds that are used to argue about concentration around mean.

Theorem 2 (Chernoff bounds).

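Suppose $X_1, \dots, X_n$ are independent random variables taking values in $\{0, 1\}$. Let $X$ denote their sum. Then for any $0 \le \delta \le 1$ we have

$$P(X \le (1 - \delta)\mathbb{E}[X]) \le e^{-\mathbb{E}[X]\delta^2/2}$$

and

$$P(X \ge (1 + \delta)\mathbb{E}[X]) \le e^{-\mathbb{E}[X]\delta^2/3},$$

and for $\delta \ge 1$ we have

$$P(X \ge (1 + \delta)\mathbb{E}[X]) \le e^{-\mathbb{E}[X]\delta/3}.$$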
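The three bounds can be compared against exact tail probabilities in the simplest setting, a sum of $n$ independent indicator variables with the same success probability $p$ (a binomial distribution). This is only a sanity check on one family of examples, and the function names are illustrative.

```python
import math


def binom_pmf(n: int, p: float, j: int) -> float:
    """Probability that a Binomial(n, p) variable equals j."""
    return math.comb(n, j) * p ** j * (1.0 - p) ** (n - j)


def lower_tail(n: int, p: float, delta: float) -> float:
    """Exact P(X <= (1 - delta) * E[X]) for X ~ Binomial(n, p)."""
    threshold = (1.0 - delta) * n * p
    return sum(binom_pmf(n, p, j) for j in range(n + 1) if j <= threshold)


def upper_tail(n: int, p: float, delta: float) -> float:
    """Exact P(X >= (1 + delta) * E[X]) for X ~ Binomial(n, p)."""
    threshold = (1.0 + delta) * n * p
    return sum(binom_pmf(n, p, j) for j in range(n + 1) if j >= threshold)


def chernoff_lower(mean: float, delta: float) -> float:
    """First bound: lower tail, valid for 0 <= delta <= 1."""
    return math.exp(-mean * delta ** 2 / 2.0)


def chernoff_upper(mean: float, delta: float) -> float:
    """Second bound for delta <= 1, third bound for delta >= 1."""
    if delta <= 1.0:
        return math.exp(-mean * delta ** 2 / 3.0)
    return math.exp(-mean * delta / 3.0)


n, p = 100, 0.3
for delta in (0.25, 0.5, 1.0, 2.0):
    print(delta, upper_tail(n, p, delta), chernoff_upper(n * p, delta))
```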

Intuition

In the following Proposition 3, a refined version of Proposition 2, we argue similarly, but more carefully, about one sampling step of Algorithm 1. The difference is that we analyze not only the drop in the cost of unsettled clusters, but also the drop in the number of unsettled clusters. Here is the intuition.

Let us go back to the proof of Proposition 2, where we distinguished heavy and light clusters. Heavy clusters formed at least constant proportion of the cost of all clusters and every heavy cluster was hit with probability that we lower-bounded by . For a light cluster we cannot give that good a bound for the probability of hitting it, but since it is not very probable that one light cluster is hit by more than one point, if we denote , i.e., is the proportional cost of the light clusters, we expect roughly clusters to become settled. On the other hand, we may, optimistically, hope that after one iteration the cost of unsettled clusters drops from to , since the heavy clusters are hit with high probability. This is not exactly the case, since for a heavy cluster we have only a constant probability of hitting it. But we can consider two cases: either there are lot of heavy clusters and we again make clusters settled, or their cost is dominated by few, massive clusters, each of which is not settled only with exponentially small probability and, hence, we expect the cost to drop by factor.

The tradeoff between the behaviour of light and heavy clusters then yields a threshold that balances the drop of the cost and the number of unsettled clusters.

Proposition 3.

Suppose that after steps of Algorithm 1 for there are unsettled clusters and their total cost is . Assume that . After the next sampling step, with probability at least , we have that either the number of unsettled clusters decreased by at least or the total cost of unsettled clusters decreased from to at most .

Proof.

Note that $\varphi_X^t \ge 20\varphi^*$ implies $\varphi_X^t \le 2\varphi_U^t$, since otherwise more than half the cost of $X$ would be formed by settled clusters and, hence, $\varphi_X^t < 20\varphi^*$, a contradiction.

We will say that an unsettled cluster is heavy if its cost is at least and light otherwise. Let , i.e., the proportional cost of the light clusters. We will distinguish three possible cases.

1. ,

2. and there are at least heavy clusters,

3. and there are less than heavy clusters.

For each case we now prove that with probability we either settle at least clusters or the total cost of unsettled clusters drops from to .

1. By Proposition 1, each light cluster gets settled with probability at least , using and for . If we define to be an indicator of whether a light cluster got settled in this iteration and , we have

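$$\mathbb{E}[X] = \sum_{A \text{ light}} \mathbb{E}[X_A] \ge \sum_{A \text{ light}} \frac{k\varphi_A^t}{20\varphi_U^t} = \frac{\alpha k}{20} \ge \frac{k}{20\sqrt{\log\gamma}}$$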

where we used our assumption on .

Invoking the first bound of Theorem 2, we get

$$P(X \le \mathbb{E}[X]/2) \le e^{-\mathbb{E}[X]/8} \le e^{-\Theta(k^{0.1})},$$

using that .

2. We proceed analogously to the previous case. By Proposition 1 we get that each heavy cluster gets settled with probability at least , using and the definition of heavy cluster.

We define to be an indicator of whether a heavy cluster got settled in this iteration and . We have

$$\mathbb{E}[X] = \sum_{A \text{ heavy}} \mathbb{E}[X_A] \ge \frac{k}{\sqrt{\log\gamma}} \cdot \frac{1}{20} = \frac{k}{20\sqrt{\log\gamma}}$$

Invoking the first bound of Theorem 2, we get

$$P(X \le \mathbb{E}[X]/2) \le e^{-\mathbb{E}[X]/8} \le e^{-\Theta(k^{0.1})},$$

using that .

3. Let . We call a heavy cluster massive if its cost is at least . Since we know that there are at most heavy clusters, the total cost of clusters that are heavy but not massive is at most

 k√logγ⋅10√logγφtUk≤φtU3√logγ

Hence, the total contribution of not massive clusters is at most

 φtU3√logγ+αφtU≤φtU(13√logγ+1√logγ)≤2φtU3√logγ.

By Proposition 1 each massive cluster is not settled with probability at most . Define the random variable to be equal to if a massive cluster gets settled in this iteration and otherwise. Let . Note that expected cost of massive clusters that are not settled in this iteration is bounded by .

The value of is stochastically dominated by the value of a variable defined as follows. We first replace each by some variables , each new variable being equal to with probability and zero otherwise, independently on the other variables . Note that the sum stochastically dominates the value of , since it attends the value with probability

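$$\prod_{i=1}^{\lceil \varphi_A^t/\zeta \rceil} e^{-k\zeta/(10\varphi_X^t)} \ge \left(e^{-k\zeta/(10\varphi_X^t)}\right)^{2\varphi_A^t/\zeta} = e^{-k\varphi_A^t/(5\varphi_X^t)},$$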

and otherwise is nonnegative. Hence, the value stochastically dominates the value .

Since the number of variables is we have

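$$\mathbb{E}[X] \le \mathbb{E}[X'] \le \frac{2\varphi_X^t}{\zeta} \cdot \exp\left(-\frac{k\zeta}{10\varphi_X^t}\right) \le \frac{4\varphi_U^t}{\zeta} \cdot \exp\left(-\frac{k \cdot 10\sqrt{\log\gamma}\,\varphi_U^t}{20k\varphi_U^t}\right) = \frac{4\varphi_U^t}{10\sqrt{\log\gamma}\,\varphi_U^t/k} \cdot \exp\left(-\frac{10\sqrt{\log\gamma}}{20}\right) \le \frac{k}{\sqrt{\log\gamma}}$$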

where the last step assumes large enough.

Finally, the second bound of Theorem 2 gives

$$P(X \ge 2\mathbb{E}[X']) \le P(X' \ge 2\mathbb{E}[X']) \le e^{-\mathbb{E}[X']/3} = e^{-\Omega(k^{0.1})}$$

using .

Hence, with probability the total cost of clusters that remain unsettled after this iteration is bounded by

 2φtU3√logγ+2k√logγ⋅ζ≤4φtU3√logγ

Theorem 3.

For , Algorithm 1 achieves a constant approximation ratio for the number of sampling steps steps with probability .

Proof.

To see that the number of steps is bounded by with high probability, note that by Proposition 1 each unsettled cluster is made settled with probability at least . So, unless , we have and, hence, with probability at least

$$1 - \prod_{A \text{ unsettled}} \exp\left(-\frac{k\varphi_A^t}{10\varphi_U^t}\right) = 1 - \exp(-k/10)$$

we make at least one cluster settled. Union bounding over first steps of the algorithm, we conclude that the algorithm finishes in at most steps with probability .

Whatever point is taken uniformly at the beginning of Algorithm 1, for the next iteration we invoke Proposition 1 with (in its formulation we say that is unsettled cluster, but it is only for the sake of clarity) to conclude that with probability we have .

We invoke Proposition 3 and union bound over subsequent iterations of the algorithm to conclude that with probability , in each sampling step of Algorithm 1 we either make at least clusters settled or the cost of unsettled clusters decreases by a factor of . The first case can happen at most times , whereas the second case can happen at most times, until we have . Hence, the algorithm achieves constant approximation ratio after steps. ∎

Remark 3.

The high probability guarantee in Theorem 3 can be made . We omit the proof.

5 Sharp analysis of k-means|| : lower bound

In this section we show that the upper bound of steps is best possible. Note that (Bachem et al., 2017) proved that for there is a dataset such that for . Hence, for we conclude that for the number of steps necessary to achieve constant approximation we have , implying . We complement their result by showing that the same lower bound also holds for .

Theorem 4.

For any function with , there is a dataset with and such that and with probability arbitrarily close to Algorithm 1 needs iterations to achieve cost zero.

Note that since scales with the size of , we cannot have , unless the instance is trivial.

Proof.

First we describe the dataset . We place points for to the origin, i.e., . We choose to be of such size that we know that with probability at least , for given constant , the first uniformly chosen center is 0.

For each one of the remaining points we consider a new axis orthogonal to the remaining axes and place the point on this axis. For , , , we set , for some large enough and with the multiplicative constant chosen in such a way that for the variance of the data we have . Note that we need to choose to achieve . We define .

Conditioning on the first uniformly taken point being for some , we prove by induction for that with probability at least , after sampling steps of the algorithm we have for each that out of points , for some , at most

$$\frac{k}{T} \cdot \left(\frac{2}{3}\right)^{i-t}$$

of them have been sampled as centers. This will prove the theorem, since it implies that with probability at least , after steps the cost is still nonzero; since we have , we have

 Θ(L/logL)=Θ(log{Var}Xk/loglog{Var}Xk) =Θ(log{Var}X/loglog{Var}X).

For the claim we are proving is clearly true. For note that by induction with probability at least we have that at least

$$\frac{k}{T} - \frac{k}{T} \cdot \frac{2}{3} = \frac{k}{3T}$$

of the points , for some , were not sampled as centers. Hence, and the probability that each point , , is being sampled is bounded by

$$\frac{k \cdot L^{2(T-i+1)}}{kL^{2(T-t+1)}/(3T)} = \frac{3T}{L^{2(i-t)}} \le \frac{1}{L^{i-t}}$$

for large enough. Hence, for any , the expected number of points that are being hit is bounded by . To get concentration around this value for given , consider two cases.

1. . Then, by the third bound of Theorem 2 we can bound the probability of taking more than clusters by .

2. . Using the assumption , hence , we have . For sufficiently large we then have for any fixed polynomial, hence, the expected number of hits is at most , hence, with probability at least there is no hit.

In both cases, conditioning on an event of probability , for any the number of points that were sampled as centers is with probability at least bounded by

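$$\frac{k}{T}\left(\left(\frac{2}{3}\right)^{i-t+1} + \left(\frac{1}{3}\right)^{i-t}\right) \le \frac{k}{T} \cdot \left(\frac{2}{3}\right)^{i-t}$$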

as needed. ∎

6 Acknowledgement

I would like to thank my advisor Mohsen Ghaffari for his very useful insights and suggestions regarding the paper. I also thank Davin Choo, Christoph Grunau, Andreas Krause, and Julian Portmann for useful discussions.

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 853109)

References

• Aggarwal et al. (2009) Aggarwal, A., Deshpande, A., and Kannan, R. Adaptive sampling for k-means clustering. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pp. 15–28. Springer, 2009.
• Ailon et al. (2009) Ailon, N., Jaiswal, R., and Monteleoni, C. Streaming k-means approximation. In Advances in neural information processing systems, pp. 10–18, 2009.
• Aloise et al. (2009) Aloise, D., Deshpande, A., Hansen, P., and Popat, P. Np-hardness of euclidean sum-of-squares clustering. Machine Learning, 75(2):245–248, May 2009. ISSN 1573-0565.
• Arthur & Vassilvitskii (2007a) Arthur, D. and Vassilvitskii, S. K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’07, pp. 1027–1035, Philadelphia, PA, USA, 2007a. Society for Industrial and Applied Mathematics. ISBN 978-0-898716-24-5.
• Arthur & Vassilvitskii (2007b) Arthur, D. and Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 1027–1035. Society for Industrial and Applied Mathematics, 2007b.
• Awasthi et al. (2015) Awasthi, P., Charikar, M., Krishnaswamy, R., and Sinop, A. K. The hardness of approximation of euclidean k-means. arXiv preprint arXiv:1502.03316, 2015.
• Bachem et al. (2016a) Bachem, O., Lucic, M., Hassani, H., and Krause, A. Fast and provably good seedings for k-means. In Advances in neural information processing systems, pp. 55–63, 2016a.
• Bachem et al. (2016b) Bachem, O., Lucic, M., Hassani, S. H., and Krause, A. Approximate k-means++ in sublinear time. In Thirtieth AAAI Conference on Artificial Intelligence, 2016b.
• Bachem et al. (2017) Bachem, O., Lucic, M., and Krause, A. Distributed and provably good seedings for k-means in constant rounds. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 292–300. JMLR. org, 2017.
• Bachem et al. (2018) Bachem, O., Lucic, M., and Krause, A. Scalable k-means clustering via lightweight coresets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1119–1127, 2018.
• Bahmani et al. (2012) Bahmani, B., Moseley, B., Vattani, A., Kumar, R., and Vassilvitskii, S. Scalable k-means++. Proceedings of the VLDB Endowment, 5(7):622–633, 2012.
• Barbosa et al. (2015a) Barbosa, R., Ene, A., Nguyen, H., and Ward, J. The power of randomization: Distributed submodular maximization on massive datasets. In International Conference on Machine Learning, pp. 1236–1244, 2015a.
• Barbosa et al. (2015b) Barbosa, R., Ene, A., Nguyen, H. L., and Ward, J. A new framework for distributed submodular maximization, 2015b.
• Bhattacharya et al. (2019) Bhattacharya, A., Eube, J., Röglin, H., and Schmidt, M. Noisy, greedy and not so greedy k-means++, 2019.
• Brunsch & Röglin (2013) Brunsch, T. and Röglin, H. A bad instance for k-means++. Theoretical Computer Science, 505:19–26, 2013.
• Celebi et al. (2013) Celebi, M. E., Kingravi, H. A., and Vela, P. A. A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert systems with applications, 40(1):200–210, 2013.
• (17) Choo, D. personal communication.
• Choo et al. (2020) Choo, D., Grunau, C., Portmann, J., and Rozhoň, V. k-means++: few more steps yield constant approximation, 2020.
• Jaiswal et al. (2012) Jaiswal, R., Kumar, A., and Sen, S. A simple $D^2$-sampling based ptas for k-means and other clustering problems, 2012.
• Jaiswal et al. (2015) Jaiswal, R., Kumar, M., and Yadav, P. Improved analysis of $D^2$-sampling based ptas for k-means and other clustering problems. Information Processing Letters, 115(2):100–103, 2015.
• (21) Krause, A. and Golovin, D. Submodular function maximization.
• Lattanzi & Sohler (2019) Lattanzi, S. and Sohler, C. A better k-means++ algorithm via local search. In International Conference on Machine Learning, pp. 3662–3671, 2019.
• Lee et al. (2017) Lee, E., Schmidt, M., and Wright, J. Improved and simplified inapproximability for k-means. Information Processing Letters, 120:40–43, 2017.
• Lenzen & Wattenhofer (2011) Lenzen, C. and Wattenhofer, R. Tight bounds for parallel randomized load balancing, 2011.
• Liu & Vondrak (2018) Liu, P. and Vondrak, J. Submodular optimization in the mapreduce model. arXiv preprint arXiv:1810.01489, 2018.
• Lloyd (1982) Lloyd, S. Least squares quantization in pcm. IEEE Transactions on Information Theory, 28(2):129–137, March 1982. ISSN 1557-9654.
• Mahajan et al. (2009) Mahajan, M., Nimbhorkar, P., and Varadarajan, K. The planar k-means problem is np-hard. In International Workshop on Algorithms and Computation, pp. 274–285. Springer, 2009.
• Mettu & Plaxton (2012) Mettu, R. and Plaxton, G. Optimal time bounds for approximate clustering, 2012.
• Mirzasoleiman et al. (2013) Mirzasoleiman, B., Karbasi, A., Sarkar, R., and Krause, A. Distributed submodular maximization: Identifying representative elements in massive data. In Advances in Neural Information Processing Systems, pp. 2049–2057, 2013.
• Ostrovsky et al. (2013) Ostrovsky, R., Rabani, Y., Schulman, L. J., and Swamy, C. The effectiveness of lloyd-type methods for the k-means problem. Journal of the ACM (JACM), 59(6):1–22, 2013.
• Wei (2016) Wei, D. A constant-factor bi-criteria approximation guarantee for k-means++. In Advances in Neural Information Processing Systems, pp. 604–612, 2016.