# A Refined Approximation for Euclidean k-Means

In the Euclidean k-Means problem we are given a collection D of n points in a Euclidean space and a positive integer k. Our goal is to identify a collection of k points in the same space (centers) so as to minimize the sum of the squared Euclidean distances between each point in D and the closest center. This problem is known to be APX-hard and the current best approximation ratio is a primal-dual 6.357 approximation based on a standard LP for the problem [Ahmadian et al. FOCS'17, SICOMP'20]. In this note we show how a minor modification of Ahmadian et al.'s analysis leads to a slightly improved 6.12903 approximation. As a related result, we also show that the mentioned LP has integrality gap at least $\frac{16+\sqrt{5}}{15}>1.2157$.


## 1 Introduction

Clustering is a central problem in Computer Science, with many applications in data science, machine learning, etc. One of the most famous and best-studied problems in this area is Euclidean $k$-Means: given a set $D$ of $n$ points (or demands) in $\mathbb{R}^\ell$ and an integer $k$, select a set $S$ of $k$ points (centers) in $\mathbb{R}^\ell$ so as to minimize $\sum_{j\in D} d^2(j,S)$. Here $d(j,i)$ is the Euclidean distance between points $j$ and $i$ and, for a set of points $I$, $d(j,I)=\min_{i\in I} d(j,i)$. In other words, we wish to select $k$ centers so as to minimize the sum of the squared Euclidean distances between each demand and the closest center. Equivalently, a feasible solution is given by a partition of the demands into $k$ subsets $C_1,\ldots,C_k$ (clusters). The cost $w_C$ of a cluster $C$ is $\sum_{j\in C} d^2(j,\mu(C))$, where $\mu(C)$ is the center of mass of $C$. We recall that $w_C$ can also be expressed as $\frac{1}{2|C|}\sum_{j\in C}\sum_{j'\in C} d^2(j,j')$. Our goal is to minimize the total cost of these clusters.
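The two expressions for the cost of a cluster (sum of squared distances to the center of mass, and the normalized sum of pairwise squared distances) can be checked numerically. The following is a minimal sketch on random data; the function names and the test cluster are illustrative assumptions, not part of the paper.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Center of mass of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[t] for p in points) / n for t in range(len(points[0])))

def cluster_cost_centroid(points):
    """Cluster cost as the sum of squared distances to the center of mass."""
    mu = centroid(points)
    return sum(sq_dist(p, mu) for p in points)

def cluster_cost_pairwise(points):
    """Cluster cost as (1 / (2|C|)) times the sum over ordered pairs."""
    n = len(points)
    return sum(sq_dist(p, q) for p in points for q in points) / (2 * n)

def kmeans_cost(points, centers):
    """k-Means objective: each point pays its squared distance to the closest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

# Hypothetical test cluster: 7 random points in 3 dimensions.
random.seed(0)
cluster = [tuple(random.random() for _ in range(3)) for _ in range(7)]
w_centroid = cluster_cost_centroid(cluster)
w_pairwise = cluster_cost_pairwise(cluster)
single_center_cost = kmeans_cost(cluster, [centroid(cluster)])
```

With a single center placed at the center of mass, `kmeans_cost` coincides with the cluster cost, which is the link between the center-based and the partition-based views of the problem.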

Euclidean $k$-Means is well-studied in terms of approximation algorithms. It is known to be APX-hard. More precisely, it is hard to approximate $k$-Means below a factor $1.0013$ in polynomial time unless $P=NP$ [6, 16]. A stronger hardness bound holds under the Unique Games Conjecture [9]. Some heuristics are known to perform very well in practice; however, their approximation factor is logarithmic in $k$ or worse on general instances [3, 4, 17, 21]. Constant approximation algorithms are known. A local-search algorithm by Kanungo et al. [15] provides a $9+\varepsilon$ approximation (throughout this paper, $\varepsilon$ denotes an arbitrarily small positive constant). The authors also show that natural local-search based algorithms cannot perform better than this. This ratio was improved to $6.357+\varepsilon$ by Ahmadian et al. [1, 2] using a primal-dual approach. They also prove a $9+\varepsilon$ approximation for general (possibly non-Euclidean) metrics. Better approximation factors are known under reasonable restrictions on the input [5, 7, 10, 20]. A PTAS is known for constant $k$ [19] or for constant dimension $\ell$ [10, 12]. Notice that $\ell$ can always be assumed to be $O(\log n)$ by a standard application of the Johnson-Lindenstrauss transform [14]. This was recently improved to $O(\log k+\log\log n)$ [8] and finally to $O(\log k)$ [18].

In this paper we describe a simple modification of the analysis of Ahmadian et al. [2] which leads to a slightly improved approximation for Euclidean $k$-Means (see Section 2).

###### Theorem 1.

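There exists a deterministic polynomial-time algorithm for Euclidean $k$-Means with approximation ratio $\rho+\varepsilon$ for any positive constant $\varepsilon$, where

$$\rho:=\left(1+\sqrt{\frac{1}{2}\left(2+\sqrt[3]{3-2\sqrt{2}}+\sqrt[3]{3+2\sqrt{2}}\right)}\right)^{2}<6.12903.$$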

The above approximation ratio is w.r.t. the optimal fractional solution to a standard LP relaxation for the problem (defined later). As a side result (see Section 3), we prove a lower bound on the integrality gap of this relaxation (we are not aware of any explicit such lower bound in the literature).
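As a quick sanity check, the closed form in Theorem 1 can be evaluated numerically. The sketch below assumes the reading $\rho=(1+\sqrt{\delta})^2$ with $\delta=\frac12\big(2+\sqrt[3]{3-2\sqrt2}+\sqrt[3]{3+2\sqrt2}\big)$, and also checks that this $\delta$ balances the two bounds $(1+\sqrt\delta)^2$ and $\frac{\delta/4}{\delta/2-1}$ derived in Section 2.3; the variable names are illustrative.

```python
import math

def cbrt(x):
    """Real cube root of a positive number."""
    return x ** (1.0 / 3.0)

# delta as it appears under the square root in the closed form for rho.
delta = 0.5 * (2 + cbrt(3 - 2 * math.sqrt(2)) + cbrt(3 + 2 * math.sqrt(2)))

# Bound from Case C of the analysis.
rho = (1 + math.sqrt(delta)) ** 2

# Bound from the refined Case B; it should coincide with rho at this delta.
rho_case_b = (delta / 4) / (delta / 2 - 1)
```

Equivalently, $t=\delta-1$ is the real root of $4t^3-3t-3=0$, which is where the cube roots come from.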

###### Theorem 2.

The integrality gap of $LP_{k\text{-Means}}$, even in the Euclidean plane (i.e., for dimension $\ell=2$), is at least $\frac{16+\sqrt{5}}{15}>1.2157$.

### 1.1 Preliminaries

As mentioned earlier, one can formulate Euclidean $k$-Means in terms of the selection of $k$ centers. In this case, it is convenient to discretize the possible choices for the centers, hence obtaining a polynomial-size set $F$ of candidate centers, at the cost of an extra factor $1+\varepsilon$ in the approximation ratio (we will neglect this factor in the approximation ratios since it is absorbed by analogous factors in the rest of the analysis). In particular we will use the construction in [11] (Lemma 24), which chooses as $F$ the centers of mass of any collection, with repetitions, of up to a number of points that depends only on $\varepsilon$. In particular $|F|$ is polynomial in $n$ in this case.

Let $c(j,i)$ be an abbreviation for $d^2(j,i)$. Then a standard LP-relaxation for $k$-Means is as follows:

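$$
\begin{aligned}
\min\quad & \sum_{i\in F,\,j\in D} x_{ij}\cdot c(j,i) && && (LP_{k\text{-Means}})\\
\text{s.t.}\quad & \sum_{i\in F} x_{ij}\ge 1 && \forall j\in D\\
& x_{ij}\le y_i && \forall j\in D,\ \forall i\in F\\
& \sum_{i\in F} y_i\le k\\
& x_{ij},\,y_i\ge 0 && \forall j\in D,\ \forall i\in F
\end{aligned}
$$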

In an integral solution, we interpret $y_i=1$ as $i$ being a selected center in the solution ($i$ is open), and $x_{ij}=1$ as demand $j$ being assigned to center $i$. (Technically each demand is automatically assigned to the closest open center. However, it is convenient to also allow sub-optimal assignments in the LP relaxation.) The first family of constraints states that each demand has to be assigned to some center, the second one that a demand can only be assigned to an open center, and the third one that we can open at most $k$ centers.

For any parameter $\lambda>0$ (Lagrangian multiplier), the Lagrangian relaxation $LP(\lambda)$ of $LP_{k\text{-Means}}$ (w.r.t. the last matrix constraint) and its dual $DP(\lambda)$ are as follows:

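$$
\begin{aligned}
\min\quad & \sum_{i\in F,\,j\in D} x_{ij}\cdot c(j,i)+\lambda\cdot\sum_{i\in F} y_i-\lambda k && && (LP(\lambda))\\
\text{s.t.}\quad & \sum_{i\in F} x_{ij}\ge 1 && \forall j\in D\\
& x_{ij}\le y_i && \forall j\in D,\ \forall i\in F\\
& x_{ij},\,y_i\ge 0 && \forall j\in D,\ \forall i\in F
\end{aligned}
$$

$$
\begin{aligned}
\max\quad & \sum_{j\in D}\alpha_j-\lambda k && && (DP(\lambda))\\
\text{s.t.}\quad & \sum_{j\in D}\max\{0,\alpha_j-c(j,i)\}\le\lambda && \forall i\in F && (1)\\
& \alpha_j\ge 0 && \forall j\in D
\end{aligned}
$$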

Above, $\max\{0,\alpha_j-c(j,i)\}$ replaces the dual variable $\beta_{ij}$ corresponding to the second constraint in the primal in the standard formulation of the dual LP. Notice that, by removing the fixed term $-\lambda k$ in the objective functions of $LP(\lambda)$ and $DP(\lambda)$, one obtains the standard LP relaxation $LP_{FL}(\lambda)$ for the Facility Location problem (FL) with uniform facility cost $\lambda$ and its dual $DP_{FL}(\lambda)$.

We say that a $\rho$-approximation algorithm for an FL instance of the above type is Lagrangian Multiplier Preserving (LMP) if it returns a set of facilities $S$ that satisfies:

$$\sum_{j\in D} c(j,S)\le\rho\,\big(OPT(\lambda)-\lambda|S|\big),$$

where $c(j,S)=\min_{i\in S}c(j,i)$ and $OPT(\lambda)$ is the value of the optimal solution to $LP_{FL}(\lambda)$.

## 2 A Refined Approximation for Euclidean k-Means

In this section we present our refined approximation for Euclidean $k$-Means. We start by presenting, in Section 2.1, the LMP approximation algorithm for the FL instances arising from $k$-Means described in [2]. We then present the analysis of that algorithm as in [2] in Section 2.2. In Section 2.3 we describe our refined analysis of the same algorithm. Finally, in Section 2.4 we sketch how to use this to approximate $k$-Means.

### 2.1 A Primal-Dual LMP Algorithm for Euclidean Facility Location

We consider an instance of Euclidean FL induced by a $k$-Means instance in the mentioned way, for a given Lagrangian multiplier $\lambda$.

We consider exactly the same Lagrangian Multiplier Preserving (LMP) primal-dual algorithm $JV(\delta)$ as in [2]. In more detail, let $\delta\ge 2$ be a parameter to be fixed later. The algorithm consists of a dual-growth phase and a pruning phase. The dual-growth phase is exactly as in the classical primal-dual algorithm $JV$ by Jain and Vazirani [13]. We start with all the dual variables $\alpha_j$ set to $0$ and an empty set $O_t$ of tentatively open facilities. The clients $j$ such that $\alpha_j\ge c(j,i)$ for some $i\in O_t$ are frozen, and the other clients are active. We grow the dual variables of active clients at uniform rate until one of the following two events happens. The first event is that some constraint of type (1) becomes tight. At that point the corresponding facility $i$ is added to $O_t$ and all clients $j$ with $\alpha_j\ge c(j,i)$ are set to frozen. The second event is that $\alpha_j=c(j,i)$ for some active client $j$ and some $i\in O_t$. In that case $j$ is set to frozen. In any case, the facility $w(j)$ that causes $j$ to become frozen is called the witness of $j$. The phase halts when all clients are frozen.

In the pruning phase we will close some facilities in $O_t$, hence obtaining the final set of open facilities $IS$. Here $JV(\delta)$ deviates from $JV$. For each client $j$, let $N(j)$ be the set of facilities $i$ such that $j$ contributed with a positive amount to the opening of $i$, i.e. $\alpha_j-c(j,i)>0$. Symmetrically, for $i\in F$, let $N(i)$ be the clients that contributed with a positive amount to the opening of $i$. For $i\in O_t$, we let $t_i=\max_{j\in N(i)}\alpha_j$, where the values $\alpha_j$ are considered at the end of the dual-growth phase. We set conventionally $t_i=0$ for $N(i)=\emptyset$. Intuitively, $t_i$ is the “time” when facility $i$ is tentatively open (at which point all the dual variables of contributing clients stop growing). We define a conflict graph $H$ over tentatively open facilities as follows. The node set of $H$ is $O_t$. We place an edge between $i,i'\in O_t$ iff the following two conditions hold: (1) for some client $j$, $j\in N(i)\cap N(i')$ (in words, $j$ contributes to the opening of both $i$ and $i'$) and (2) one has $c(i,i')\le\delta\cdot\min\{t_i,t_{i'}\}$. In this graph we compute a maximal independent set $IS$, which provides the desired solution to the facility location problem (where each client is assigned to the closest facility in $IS$).

We remark that the pruning phase of $JV(\delta)$ differs from the one of $JV$ only in the definition of $H$, where condition (2) is not required to hold (or, equivalently, $JV$ behaves like $JV(\delta)$ for $\delta=+\infty$).
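The pruning phase can be sketched in a few lines. This is a minimal illustration, assuming the dual values and the tentatively open facilities are already available from the dual-growth phase (which is not simulated here); the function name `prune`, the greedy order used to build the maximal independent set, and the toy input are all hypothetical choices.

```python
import math

def sq_dist(p, q):
    """Squared Euclidean distance, i.e. the cost c(p, q)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def prune(clients, alpha, tentative, delta):
    """Return the indices of the facilities kept, plus the opening times t_i."""
    # N(i): clients with a strictly positive contribution alpha_j - c(j, i).
    contributors = [
        {j for j, p in enumerate(clients) if alpha[j] > sq_dist(p, f)}
        for f in tentative
    ]
    # t_i: largest dual among the contributors, 0 by convention if there are none.
    opening_time = [max((alpha[j] for j in n_i), default=0.0) for n_i in contributors]

    def in_conflict(a, b):
        # Condition (1): some client contributes to both facilities.
        if not (contributors[a] & contributors[b]):
            return False
        # Condition (2): the facilities are close relative to their opening times.
        gap = sq_dist(tentative[a], tentative[b])
        return gap <= delta * min(opening_time[a], opening_time[b])

    # Greedy maximal independent set in the conflict graph (any maximal set works).
    kept = []
    for a in range(len(tentative)):
        if not any(in_conflict(a, b) for b in kept):
            kept.append(a)
    return kept, opening_time

# Toy input on a line; the alpha values are made up, not the output of dual growth.
clients = [(1.5,), (10.5,)]
alpha = [3.0, 0.5]
tentative = [(0.0,), (3.0,), (10.0,)]
kept_delta, times = prune(clients, alpha, tentative, delta=2.0)
kept_jv, _ = prune(clients, alpha, tentative, delta=math.inf)
```

In this toy input the first two facilities share a contributing client but are far apart relative to their opening times, so they conflict only when condition (2) is dropped ($\delta=+\infty$).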

### 2.2 The Analysis in [2]

The general goal is to show that

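$$\sum_{j\in D} c(j,IS)\le\rho\Big(\sum_{j\in D}\alpha_j-\lambda|IS|\Big),$$

for some $\rho\ge 1$ as small as possible. This shows that the algorithm is an LMP $\rho$-approximation for the problem. It is sufficient to prove that, for each client $j$, one has

$$\frac{c(j,IS)}{\rho}\le\alpha_j-\sum_{i\in N(j)\cap IS}\big(\alpha_j-c(j,i)\big)=\alpha_j-\sum_{i\in IS}\max\{0,\alpha_j-c(j,i)\}.$$

Let $S=N(j)\cap IS$ and $s=|S|$. We distinguish cases depending on the value of $s$: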

#### Case A: s=1.

Let $S=\{i^*\}$. Then for any $\rho\ge 1$,

$$\frac{c(j,IS)}{\rho}\le c(j,IS)=c(j,i^*)=\alpha_j-\big(\alpha_j-c(j,i^*)\big).$$

#### Case B: s>1.

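Here we use the properties of Euclidean metrics. The sum $\sum_{i\in S}c(j,i)$ is the sum of the squared distances from $j$ to the facilities in $S$. This quantity is lower bounded by the sum of the squared distances from the facilities in $S$ to the centroid $\mu$ of $S$. Recall that $\sum_{i\in S}c(\mu,i)=\frac{1}{2s}\sum_{i\in S}\sum_{i'\in S}c(i,i')$. We also observe that, by construction, for any two distinct $i,i'\in S$ one has

$$c(i,i')>\delta\cdot\min\{t_i,t_{i'}\}\ge\delta\cdot\alpha_j,$$

where the last inequality follows from the fact that $j$ is contributing to the opening of both $i$ and $i'$. Altogether one obtains

$$\sum_{i\in S}c(j,i)\ge\sum_{i\in S}c(\mu,i)=\frac{1}{2s}\sum_{i\in S}\sum_{i'\in S}c(i,i')\ge\frac{(s-1)\,\delta\,\alpha_j}{2}.$$

Thus

$$\sum_{i\in S}\big(\alpha_j-c(j,i)\big)\le\Big(s-\frac{\delta(s-1)}{2}\Big)\alpha_j=\Big(s\Big(1-\frac{\delta}{2}\Big)+\frac{\delta}{2}\Big)\alpha_j\overset{\delta\ge 2,\ s\ge 2}{\le}\Big(2-\frac{\delta}{2}\Big)\alpha_j.$$

Using the fact that $\alpha_j>c(j,i)$ for all $i\in S$, hence $\alpha_j>c(j,IS)$, one gets

$$\Big(\frac{\delta}{2}-1\Big)c(j,IS)\overset{\delta\ge 2}{\le}\Big(\frac{\delta}{2}-1\Big)\alpha_j.$$

We conclude that

$$\sum_{i\in S}\big(\alpha_j-c(j,i)\big)+\Big(\frac{\delta}{2}-1\Big)c(j,IS)\le\Big(2-\frac{\delta}{2}\Big)\alpha_j+\Big(\frac{\delta}{2}-1\Big)\alpha_j=\alpha_j.$$

This gives the desired inequality assuming that $\rho\ge\frac{1}{\delta/2-1}$.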

#### Case C: s=0.

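Consider the witness $i=w(j)$ of $j$. Notice that $\alpha_j\ge t_i$ and $\alpha_j\ge c(j,i)=d^2(j,i)$. Hence

$$d(j,i)+\sqrt{\delta\,t_i}\le(1+\sqrt{\delta})\sqrt{\alpha_j}.$$

If $i\in IS$, then $d(j,IS)\le d(j,i)$. Otherwise, by the maximality of $IS$, there exists $i'\in IS$ such that $d^2(i,i')\le\delta\cdot\min\{t_i,t_{i'}\}\le\delta\,t_i$. Thus $d(j,IS)\le d(j,i)+d(i,i')\le d(j,i)+\sqrt{\delta\,t_i}$. In both cases one has $d(j,IS)\le(1+\sqrt{\delta})\sqrt{\alpha_j}$, hence

$$c(j,IS)\le(1+\sqrt{\delta})^2\alpha_j.$$

This gives the desired inequality for $\rho\ge(1+\sqrt{\delta})^2$.

#### Fixing δ.

Altogether we can set $\rho=\max\big\{\frac{1}{\delta/2-1},\,(1+\sqrt{\delta})^2\big\}$. The best choice for $\delta$ (namely, the one that minimizes $\rho$) is the solution of $\frac{1}{\delta/2-1}=(1+\sqrt{\delta})^2$. This is achieved for $\delta\simeq 2.3146$ and gives $\rho\simeq 6.357$.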

### 2.3 A Refined Analysis

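We refine the analysis in Case B as follows. Let $\Delta=\sum_{i\in S}c(j,i)$. We already proved that, for $\delta>2$, $\alpha_j-\sum_{i\in S}(\alpha_j-c(j,i))>0$. Since $c(j,IS)\le c(j,S)$, it is sufficient to upper bound

$$\frac{c(j,S)}{\alpha_j-\sum_{i\in S}\big(\alpha_j-c(j,i)\big)}=\frac{c(j,S)}{\sum_{i\in S}c(j,i)-(s-1)\alpha_j}=\frac{c(j,S)}{\Delta-(s-1)\alpha_j}.$$

Instead of using the upper bound $c(j,S)\le\alpha_j$ we use the average

$$c(j,S)\le\frac{1}{s}\sum_{i\in S}c(j,i)=\frac{\Delta}{s}.$$

Then it is sufficient to upper bound

$$\frac{1}{s}\cdot\frac{\Delta}{\Delta-(s-1)\alpha_j}.$$

The derivative in $\Delta$ of the above function is $-\frac{(s-1)\alpha_j}{s\,(\Delta-(s-1)\alpha_j)^2}<0$. Hence the maximum is achieved for the smallest possible value of $\Delta$. Recall that we already showed that $\Delta\ge\frac{(s-1)\delta\alpha_j}{2}$. Hence a valid upper bound is

$$\frac{1}{s}\cdot\frac{\frac{(s-1)\delta\alpha_j}{2}}{\frac{(s-1)\delta\alpha_j}{2}-(s-1)\alpha_j}=\frac{1}{s}\cdot\frac{\delta/2}{\delta/2-1}\overset{s\ge 2}{\le}\frac{\delta/4}{\delta/2-1}.$$

This imposes $\rho\ge\frac{\delta/4}{\delta/2-1}$ rather than $\rho\ge\frac{1}{\delta/2-1}$ in Case B. Notice that this is an improvement for $\delta<4$. The best choice of $\delta$ is now obtained by imposing $\frac{\delta/4}{\delta/2-1}=(1+\sqrt{\delta})^2$. This gives $\delta\simeq 2.1777$ and $\rho=(1+\sqrt{\delta})^2<6.12903$.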
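The two choices of $\delta$ (original and refined analysis) can be reproduced numerically by balancing the Case B bound against the Case C bound $(1+\sqrt\delta)^2$. The following is a minimal sketch using bisection on the interval $(2,4]$, where the Case B bounds are decreasing and the Case C bound is increasing; the helper names are illustrative.

```python
import math

def case_c(delta):
    """Bound on rho imposed by Case C."""
    return (1 + math.sqrt(delta)) ** 2

def case_b_original(delta):
    """Bound on rho imposed by Case B in the original analysis."""
    return 1 / (delta / 2 - 1)

def case_b_refined(delta):
    """Bound on rho imposed by Case B in the refined analysis."""
    return (delta / 4) / (delta / 2 - 1)

def balance(case_b, lo=2.0 + 1e-9, hi=4.0, iterations=200):
    """Bisection for the delta at which case_b(delta) equals case_c(delta)."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if case_b(mid) > case_c(mid):
            lo = mid  # Case B still dominates: the crossing is to the right.
        else:
            hi = mid
    return (lo + hi) / 2

delta_old = balance(case_b_original)
rho_old = case_c(delta_old)
delta_new = balance(case_b_refined)
rho_new = case_c(delta_new)
```

Since $\rho$ is the maximum of a decreasing and an increasing function of $\delta$, the crossing point is indeed the minimizer.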

### 2.4 From Facility Location to k-Means

We can use the refined approximation for Euclidean Facility Location from the previous section to derive a $(\rho+\varepsilon)$ approximation for Euclidean $k$-Means, for any constant $\varepsilon>0$. Here we follow the approach of [2] with only minor changes. In more detail, the authors consider a variant of the FL algorithm described before, whose approximation factor is $\rho+\varepsilon$ rather than $\rho$. A careful use of this algorithm leads to a solution opening precisely $k$ facilities, which leads to the desired approximation factor. In their analysis the authors use slight modifications of the inequality $c(j,IS)\le(1+\sqrt{\delta})^2\alpha_j$ (coming from Case C, which is the same in their and our analysis). The goal is to prove that the modified algorithm is $(\rho+\varepsilon)$ approximate. Here $\delta$ and $\rho$ are used as parameters. Therefore it is sufficient to replace their values of these parameters with the ones coming from our refined analysis. The rest is identical.

## 3 Lower Bound on the Integrality Gap

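In this section we describe our lower bound instance for the integrality gap of $LP_{k\text{-Means}}$. It is convenient to consider first the following slightly different relaxation, based on clusters (with the cluster cost $w_C$ as defined in Section 1):

$$
\begin{aligned}
\min\quad & \sum_{C\in\mathcal{C}} w_C\,x_C && && (LP'_{k\text{-Means}})\\
\text{s.t.}\quad & \sum_{C\in\mathcal{C}:\,j\in C} x_C\ge 1 && \forall j\in D\\
& \sum_{C\in\mathcal{C}} x_C\le k\\
& x_C\ge 0 && \forall C\in\mathcal{C}
\end{aligned}
$$

Here $\mathcal{C}$ denotes the set of possible clusters, i.e. the possible subsets of points. In an integral solution $x_C=1$ means that cluster $C$ is part of our solution.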

Our instance is on the Euclidean plane, and its points are the 10 vertices of two regular pentagons of side length $1$. These pentagons are placed so that any two vertices of distinct pentagons are at distance at least $M$, for a large enough $M$ to be fixed later. Here $k=5$. We remark that our argument can be easily extended to an arbitrary number of points by taking $2h$ such pentagons for any integer $h\ge 1$, so that the pairwise distance between vertices of distinct pentagons is at least $M$, and setting $k=5h$.

A feasible fractional solution is obtained by setting $x_C=\frac12$ for every $C$ consisting of a pair of consecutive vertices in the same pentagon (so we are considering 10 fractional clusters in total). Obviously this solution is feasible: each vertex belongs to exactly two such pairs, and the values $x_C$ sum to $5=k$. The cost of each such cluster is $w_C=\frac12$. Hence the cost of this fractional solution is $10\cdot\frac12\cdot\frac12=\frac52$.

Next consider the optimal integral solution, consisting of $5$ clusters. Recall that the radius of each pentagon (i.e. the distance from a vertex to its center) is $r=\sqrt{\frac{5+\sqrt5}{10}}$ and the distance between two non-consecutive vertices in the same pentagon is $\frac{1+\sqrt5}{2}$. A solution with two clusters consisting of the vertices of each pentagon costs $2\cdot 5r^2=5+\sqrt5$. Any cluster involving vertices of distinct pentagons costs at least $\frac{M^2}{2}$, hence for $M$ large enough the optimal solution forms clusters only with vertices of the same pentagon. In more detail, the optimal solution consists of $k'\in\{1,\ldots,4\}$ clusters containing the vertices of one pentagon and $5-k'$ clusters containing the vertices of the remaining pentagon.

Let $cost(k')$ be the minimum cost associated with one pentagon assuming that we form $k'$ clusters with its vertices. Clearly $cost(5)=0$ and $cost(1)=5r^2=\frac{5+\sqrt5}{2}$. Regarding $cost(4)$, it is obviously convenient to choose two consecutive vertices in the unique cluster of size $2$. Thus $cost(4)=\frac12$. For $k'\in\{2,3\}$, we note, as is easy to verify, that clusters with consecutive vertices are less expensive than the alternatives. For $k'=2$, one might form one cluster of size $4$ and one of size $1$. This would cost $\frac{15+3\sqrt5}{8}$. Alternatively, one might form one cluster of size $3$ and one of size $2$, at smaller cost $\frac{7+\sqrt5}{6}+\frac12=\frac{10+\sqrt5}{6}$. Thus $cost(2)=\frac{10+\sqrt5}{6}$. For $k'=3$, one might form two clusters of size $2$ and one of size $1$, or two clusters of size $1$ and one of size $3$. The associated cost in the two cases is $1$ and $\frac{7+\sqrt5}{6}$, resp. Hence $cost(3)=1$. So the overall cost of the optimal integral solution is $\min\{cost(1)+cost(4),\,cost(2)+cost(3)\}=\frac{16+\sqrt5}{6}$. Thus the integrality gap of $LP'_{k\text{-Means}}$ is at least $\frac{(16+\sqrt5)/6}{5/2}=\frac{16+\sqrt5}{15}$.
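The per-pentagon costs and the resulting integral optimum can be verified by brute force. The sketch below assumes side length 1 and that no cluster mixes vertices of the two pentagons (which the argument above guarantees once they are far enough apart); it enumerates all 52 set partitions of the five vertices of one pentagon. The name `cost` for the table of minimum costs per number of clusters is illustrative.

```python
import math

def pentagon(side=1.0):
    """Vertices of a regular pentagon with the given side length, centered at the origin."""
    r = side / (2 * math.sin(math.pi / 5))
    return [(r * math.cos(2 * math.pi * t / 5), r * math.sin(2 * math.pi * t / 5))
            for t in range(5)]

def cluster_cost(points):
    """Sum of squared distances from the points to their center of mass."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    return sum((p[0] - mx) ** 2 + (p[1] - my) ** 2 for p in points)

def partitions(items):
    """Yield every set partition of a list as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        yield [[first]] + part  # first element in its own block
        for idx in range(len(part)):  # or added to an existing block
            yield part[:idx] + [[first] + part[idx]] + part[idx + 1:]

def best_cost_per_cluster_count(points):
    """Map each number of clusters to the cheapest partition with that many blocks."""
    best = {}
    for part in partitions(list(range(len(points)))):
        total = sum(cluster_cost([points[v] for v in block]) for block in part)
        best[len(part)] = min(best.get(len(part), math.inf), total)
    return best

cost = best_cost_per_cluster_count(pentagon())
# Split the 5 clusters between the two pentagons, at least one cluster each.
integral_opt = min(cost[a] + cost[5 - a] for a in range(1, 5))
fractional = 10 * 0.5 * 0.5
gap = integral_opt / fractional
```

The minimum is attained by giving two clusters to one pentagon and three to the other.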

Consider next $LP_{k\text{-Means}}$. Here a technical complication comes from the definition of $F$, which is not part of the input instance of $k$-Means. The same construction as above works if we let $F$ contain the centers of mass of any set of at most $3$ points. Notice that this is automatically guaranteed by the construction in [11] for sufficiently small $\varepsilon$. In this case the optimal integral solutions to $LP_{k\text{-Means}}$ and $LP'_{k\text{-Means}}$ are the same in the considered example. Furthermore one obtains a feasible fractional solution to $LP_{k\text{-Means}}$ of cost $\frac52$ by setting $y_i=\frac12$ for the centers of mass $i$ of any two consecutive vertices of the same pentagon, and setting $x_{ij}=\frac12$ for each point $j$ and the two closest centers $i$ with positive $y_i$. This concludes the proof of Theorem 2.
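The fractional solution just described can be checked directly against the constraints of the center-based LP. This is a minimal sketch assuming side length 1, $k=5$, and an arbitrary separation of 100 between the two pentagons (a hypothetical value; any large separation would do); only the ten edge midpoints are listed as facilities, since all other candidate centers get value zero.

```python
import math

def sq_dist(p, q):
    """Squared Euclidean distance c(p, q)."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def pentagon(center_x, side=1.0):
    """Vertices of a regular pentagon of the given side length centered at (center_x, 0)."""
    r = side / (2 * math.sin(math.pi / 5))
    return [(center_x + r * math.cos(2 * math.pi * t / 5), r * math.sin(2 * math.pi * t / 5))
            for t in range(5)]

k = 5
points = pentagon(0.0) + pentagon(100.0)

# Candidate centers with positive y: midpoints of consecutive vertices of each pentagon.
facilities = []
for base in (0, 5):
    for t in range(5):
        p, q = points[base + t], points[base + (t + 1) % 5]
        facilities.append(((p[0] + q[0]) / 2, (p[1] + q[1]) / 2))
y = [0.5] * len(facilities)

# Each point is assigned half to each of its two closest fractionally open centers.
x = {}
for j, p in enumerate(points):
    closest = sorted(range(len(facilities)), key=lambda i: sq_dist(p, facilities[i]))[:2]
    for i in closest:
        x[(i, j)] = 0.5

coverage = [sum(v for (i, jj), v in x.items() if jj == j) for j in range(len(points))]
assignment_ok = all(v <= y[i] for (i, j), v in x.items())
lp_cost = sum(v * sq_dist(points[j], facilities[i]) for (i, j), v in x.items())
```

Each vertex is at squared distance $\frac14$ from the midpoints of its two incident edges, which gives the total cost $10\cdot 2\cdot\frac12\cdot\frac14=\frac52$.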

## Acknowledgments

Work supported in part by the NSF grant 1909972 and the SNF Excellence Grant 200020B_182865/1.

## References

• [1] S. Ahmadian, A. Norouzi-Fard, O. Svensson, and J. Ward (2017) Better guarantees for k-means and Euclidean k-median by primal-dual algorithms. In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, Berkeley, CA, USA, October 15-17, 2017, C. Umans (Ed.), pp. 61–72.
• [2] S. Ahmadian, A. Norouzi-Fard, O. Svensson, and J. Ward (2020) Better guarantees for k-means and Euclidean k-median by primal-dual algorithms. SIAM J. Comput. 49 (4).
• [3] D. Arthur and S. Vassilvitskii (2006) How slow is the k-means method?. In Proceedings of the 22nd ACM Symposium on Computational Geometry, Sedona, Arizona, USA, June 5-7, 2006, N. Amenta and O. Cheong (Eds.), pp. 144–153.
• [4] D. Arthur and S. Vassilvitskii (2007) K-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, Louisiana, USA, January 7-9, 2007, N. Bansal, K. Pruhs, and C. Stein (Eds.), pp. 1027–1035.
• [5] P. Awasthi, A. Blum, and O. Sheffet (2010) Stability yields a PTAS for k-median and k-means clustering. In 51st Annual IEEE Symposium on Foundations of Computer Science, FOCS 2010, October 23-26, 2010, Las Vegas, Nevada, USA, pp. 309–318.
• [6] P. Awasthi, M. Charikar, R. Krishnaswamy, and A. K. Sinop (2015) The hardness of approximation of Euclidean k-means. In 31st International Symposium on Computational Geometry, SoCG 2015, June 22-25, 2015, Eindhoven, The Netherlands, L. Arge and J. Pach (Eds.), LIPIcs, Vol. 34, pp. 754–767.
• [7] M. Balcan, A. Blum, and A. Gupta (2009) Approximate clustering without the approximation. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2009, New York, NY, USA, January 4-6, 2009, C. Mathieu (Ed.), pp. 1068–1077.
• [8] L. Becchetti, M. Bury, V. Cohen-Addad, F. Grandoni, and C. Schwiegelshohn (2019) Oblivious dimension reduction for k-means: beyond subspaces and the Johnson-Lindenstrauss lemma. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, M. Charikar and E. Cohen (Eds.), pp. 1039–1050.
• [9] V. Cohen-Addad and Karthik C. S. (2019) Inapproximability of clustering in metrics. In 60th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2019, Baltimore, Maryland, USA, November 9-12, 2019, D. Zuckerman (Ed.), pp. 519–539.
• [10] V. Cohen-Addad, P. N. Klein, and C. Mathieu (2019) Local search yields approximation schemes for k-means and k-median in Euclidean and minor-free metrics. SIAM J. Comput. 48 (2), pp. 644–667.
• [11] W. F. de la Vega, M. Karpinski, C. Kenyon, and Y. Rabani (2003) Approximation schemes for clustering problems. In Proceedings of the 35th Annual ACM Symposium on Theory of Computing, June 9-11, 2003, San Diego, CA, USA, L. L. Larmore and M. X. Goemans (Eds.), pp. 50–58.
• [12] Z. Friggstad, M. Rezapour, and M. R. Salavatipour (2019) Local search yields a PTAS for k-means in doubling metrics. SIAM J. Comput. 48 (2), pp. 452–480.
• [13] K. Jain and V. V. Vazirani (2001) Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation. J. ACM 48 (2), pp. 274–296.
• [14] W. B. Johnson and J. Lindenstrauss (1984) Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics 26, pp. 189–206.
• [15] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu (2004) A local search approximation algorithm for k-means clustering. Comput. Geom. 28 (2-3), pp. 89–112.
• [16] E. Lee, M. Schmidt, and J. Wright (2017) Improved and simplified inapproximability for k-means. Inf. Process. Lett. 120, pp. 40–43.
• [17] S. P. Lloyd (1982) Least squares quantization in PCM. IEEE Trans. Inf. Theory 28 (2), pp. 129–136.
• [18] K. Makarychev, Y. Makarychev, and I. P. Razenshteyn (2019) Performance of Johnson-Lindenstrauss transform for k-means and k-medians clustering. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, M. Charikar and E. Cohen (Eds.), pp. 1027–1038.
• [19] J. Matousek (2000) On approximate geometric k-clustering. Discrete Comput. Geom. 24 (1), pp. 61–84.
• [20] R. Ostrovsky, Y. Rabani, L. J. Schulman, and C. Swamy (2012) The effectiveness of Lloyd-type methods for the k-means problem. J. ACM 59 (6), pp. 28:1–28:22.
• [21] A. Vattani (2011) k-means requires exponentially many iterations even in the plane. Discrete Comput. Geom. 45 (4), pp. 596–616.