The k-means problem is one of the oldest and most fundamental questions in computational geometry and combinatorial optimization. Given an n-point data set in d-dimensional real space and an integer k, the goal is to partition the data set into k disjoint subsets so as to minimize the total within-cluster variance or, equivalently, to choose k centers so as to minimize the total squared distance between each point and its closest center. This problem is NP-hard in general dimension even when k = 2, as proved by a reduction from the densest cut problem adhp2009 . More recent work shows that it is APX-hard acks2015 and that the inapproximability bound is at least 1.0013 lsw2016 . Nevertheless, Lloyd l1982 proposed a local search heuristic for this problem that performs very well in practice and is still widely used today. Berkhin b2006 states that "it is by far the most popular clustering algorithm used in scientific and industrial applications".
The capacity constraint is a straightforward variant as well as a natural requirement for almost all combinatorial models, such as capacitated supply chains, capacitated lot-sizing, capacitated facility location, and capacitated vehicle routing. However, this constraint usually raises the difficulty of the problem dramatically. For clustering problems such as facility location and k-median, the capacitated version breaks the natural structure of an optimal solution, namely that every data point is assigned to its nearest open center. Moreover, the widely used integer-linear programs for these problems have an unbounded integrality gap once the capacity constraints are added. The state-of-the-art approximation algorithms also suffer from this limitation, and some researchers have had to switch to techniques other than linear-programming-based ones. Note that linear-programming-based techniques perform very well for uncapacitated facility location and k-median.
As stressed above, the k-means problem is one of the classical and fundamental questions in the literature. Most existing work takes a practical or data mining perspective and is therefore mainly concerned with running time. Heuristics are consequently very popular in these areas, for example Lloyd's algorithm, k-means++ av2007 and k-means|| bmvkv2012 . Few works take a theoretical computer science perspective, which is more concerned with performance guarantees. Although k-means++ is proved to be O(log k)-approximate, we believe this is not the limit. The first constant approximation for k-means in general dimension is a (9 + ε)-approximation based on local search, which is proved to be almost tight by showing that any local search with a fixed number of swap operations achieves at least a (9 - ε)-approximation kmn2002 . Recent results show that local search yields a PTAS for k-means in Euclidean space of fixed dimension frs2016 ; ckm2016 .
However, due to its hardness, few approximation algorithms have been proposed for capacitated k-means. Fortunately, this problem is strongly connected with other capacitated clustering problems such as capacitated facility location and capacitated k-median. For capacitated facility location, the state-of-the-art approximation algorithm achieves a (3 + 2√2 + ε)-approximation ratio based on local search zcy2005 , whereas for uncapacitated facility location the corresponding result is 1.488 l2013 , which is very close to the inapproximability lower bound of 1.463. For capacitated k-median, most results in the literature are pseudo-approximations that violate either the cardinality constraint or the capacity constraints. Based on an exploration of extended relaxations for capacitated k-median, Li l2017 presents an approximation algorithm that violates the cardinality constraint. Byrka et al. br2016 present a constant approximation that violates the capacity constraints. Both are state-of-the-art bi-factor approximation results for uniform capacitated k-median. Recently, combinations of approximation algorithms and FPT algorithms have been developed for problems for which no polynomial-time constant approximation is known. In particular, Adamczyk et al. abm2018 propose a constant FPT(k) approximation for both uniform and nonuniform capacitated k-median, which inspired our work.
The uniform capacity constraint is easier to handle than the nonuniform one. For k-means, there is not even an appropriate mathematical model for nonuniform capacity constraints, since it makes no sense to define nonuniform capacities on infinitely many candidate centers. In this paper, we present the first constant-factor FPT approximation algorithm for uniform capacitated k-means that violates neither the cardinality constraint nor the capacity constraints. The rest of the paper is organized as follows: we first propose the algorithm and then analyze its running time and performance ratio. The proposed algorithm is presented briefly in the Algorithm section, and more details as well as the high-level idea are given in the Analysis section.
The k-means (KM) problem can be formally described as follows. We are given a data set of n points, each of which is a d-dimensional real vector. The objective is to partition the data set into k disjoint subsets so as to minimize the total within-cluster sum of squared distances, or variance. One can easily prove that, for a fixed finite cluster, the center minimizing the sum of squared distances to the points of the cluster is the centroid of the cluster, that is, the arithmetic mean of its points. This is how and why Lloyd's algorithm works. Thus, the goal of k-means can be formally stated as finding a partition of the data set that minimizes the sum, over all clusters, of the squared distances from each point to the centroid of its cluster.
The capacity constraint is natural and simple: every cluster may contain at most a prescribed number of points, the capacity. For simplicity, we assume the capacity is a positive integer.
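Written out in symbols, the centroid, the objective and the capacity constraint described above take the following form. The symbols $X_1,\dots,X_k$ (clusters), $c(\cdot)$ (centroid) and $U$ (uniform capacity) are notation assumed here for illustration and may differ from the notation used in the rest of the paper.

```latex
% Assumed notation: X_1..X_k = clusters, c(A) = centroid of A, U = uniform capacity.
\[
  c(A) = \frac{1}{|A|} \sum_{x \in A} x ,
  \qquad
  \sum_{x \in A} \|x - y\|^2
    = \sum_{x \in A} \|x - c(A)\|^2 + |A| \, \|y - c(A)\|^2
  \quad \text{for all } y \in \mathbb{R}^d ,
\]
% The identity shows that y = c(A) minimizes the left-hand side.
\[
  \min_{X_1,\dots,X_k} \; \sum_{i=1}^{k} \sum_{x \in X_i} \|x - c(X_i)\|^2
  \qquad \text{s.t.} \qquad |X_i| \le U , \quad i = 1,\dots,k .
\]
```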
An easy observation is that, for the uncapacitated k-means problem, once a center set of cardinality k is given, the rest of the problem is easy. Since the objective is to minimize the total squared distance from each data point to the center set, one can assign each data point to its nearest center in the given center set, which yields the so-called Voronoi partition of the data set. However, this is not the case for capacitated k-means, where the nearest center of a point may already be full. The following lemma shows that the problem nevertheless remains tractable.
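To make the observation concrete, the following minimal sketch assigns each point to its nearest center and counts the resulting cluster sizes. The function names and the toy instance are hypothetical and chosen only for illustration; with a capacity of 2 per center, the Voronoi assignment below would be infeasible because one cluster receives three points.

```python
from collections import Counter


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def voronoi_assignment(points, centers):
    """Index of the nearest center for every point (ties go to the lowest index)."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def assignment_cost(points, centers, assign):
    """k-means cost of a given assignment of points to centers."""
    return sum(sq_dist(p, centers[j]) for p, j in zip(points, assign))


# Toy instance (hypothetical): three points crowd around the first center.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]

assign = voronoi_assignment(points, centers)
cost = assignment_cost(points, centers, assign)
sizes = Counter(assign)  # cluster sizes; compare against the capacity
```

The uncapacitated optimum for these centers is cheap to compute, but any capacity below 3 forces at least one point away from its nearest center.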
Given the center set, the k-means problem with capacity constraints is polynomial-time tractable.
Suppose we are given an n-point data set and an integer k as inputs for the capacitated k-means problem, together with a k-point center set. Then the optimal (w.r.t. the k-means objective) partition of the data set according to the center set, subject to the upper bound on cluster sizes, can be computed in time polynomial in n and k. In fact, the above partition problem can be described as the following integer-linear program.
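One way to write such a program is sketched below, in notation assumed here for illustration: points $x_1,\dots,x_n$, given centers $c_1,\dots,c_k$, uniform capacity $U$, squared distance $\Delta$, and assignment variables $y_{ij}$. The numbering is chosen to agree with the later references to program (1-4); the exact form in the original may differ.

```latex
% Assumed notation: y_{ij} = 1 iff point x_i is assigned to center c_j.
\begin{align}
  \min \quad & \sum_{i=1}^{n} \sum_{j=1}^{k} \Delta(x_i, c_j)\, y_{ij} \tag{1} \\
  \text{s.t.} \quad & \sum_{j=1}^{k} y_{ij} = 1, && i = 1,\dots,n, \tag{2} \\
  & \sum_{i=1}^{n} y_{ij} \le U, && j = 1,\dots,k, \tag{3} \\
  & y_{ij} \in \{0, 1\}, && i = 1,\dots,n,\; j = 1,\dots,k. \tag{4}
\end{align}
```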
The binary decision variable equals 1 if the data point is assigned to the center and 0 otherwise. In HCKM, once the locations of any two points are known, the distance between them can be computed immediately. Thus the objective is linear with respect to the decision variables. The constraints of the above program are straightforward. Solving the program exactly may seem intractable, but both the constraints and the objective are linear apart from the Boolean constraints. From integer-linear programming theory we know that an optimal basic solution to the natural relaxation of the above program, which differs only in relaxing the Boolean constraints to the interval [0, 1], is integral. This property holds whenever the coefficient matrix of the program is totally unimodular and the capacity is an integer, both of which are satisfied in our program. See sa2003 for details.
That is to say, we only need to solve the linear relaxation (essentially a transportation problem) in order to solve the integer-linear program (1-4), which implies the claim.
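Because the relaxation is a transportation problem, it can also be solved combinatorially. The sketch below is an illustration under stated assumptions, not the authors' implementation: it swaps the linear program for a successive-shortest-path min-cost flow on the network source → points → centers → sink. The function name `capacitated_assignment` and the toy instance are hypothetical.

```python
from collections import deque


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def capacitated_assignment(points, centers, capacity):
    """Min-cost assignment of points to fixed centers, at most `capacity` per center."""
    n, k = len(points), len(centers)
    if n > k * capacity:
        raise ValueError("infeasible: total capacity is smaller than the number of points")
    source, sink = n + k, n + k + 1
    graph = [[] for _ in range(n + k + 2)]  # adjacency lists of edge indices
    to, cap, cost = [], [], []

    def add_edge(u, v, c, w):
        # Forward edge gets an even index, its residual twin the next odd index.
        graph[u].append(len(to)); to.append(v); cap.append(c); cost.append(w)
        graph[v].append(len(to)); to.append(u); cap.append(0); cost.append(-w)

    for i, p in enumerate(points):
        add_edge(source, i, 1, 0.0)
        for j, c in enumerate(centers):
            add_edge(i, n + j, 1, sq_dist(p, c))
    for j in range(k):
        add_edge(n + j, sink, capacity, 0.0)

    for _ in range(n):  # each augmentation routes one more point
        dist = [float("inf")] * (n + k + 2)
        prev_edge = [-1] * (n + k + 2)
        in_queue = [False] * (n + k + 2)
        dist[source] = 0.0
        queue = deque([source])
        while queue:  # Bellman-Ford with a queue on the residual graph
            u = queue.popleft()
            in_queue[u] = False
            for e in graph[u]:
                v = to[e]
                if cap[e] > 0 and dist[u] + cost[e] < dist[v] - 1e-12:
                    dist[v] = dist[u] + cost[e]
                    prev_edge[v] = e
                    if not in_queue[v]:
                        in_queue[v] = True
                        queue.append(v)
        v = sink
        while v != source:  # push one unit of flow along the shortest path
            e = prev_edge[v]
            cap[e] -= 1
            cap[e ^ 1] += 1
            v = to[e ^ 1]

    assign = [-1] * n
    for i in range(n):
        for e in graph[i]:
            if e % 2 == 0 and cap[e] == 0:  # saturated point -> center edge
                assign[i] = to[e] - n
    total = sum(sq_dist(points[i], centers[assign[i]]) for i in range(n))
    return assign, total


# Toy instance (hypothetical): capacity 2 forces one point to the far center.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]
assign, total = capacitated_assignment(points, centers, capacity=2)
```

All capacities are integers, so every augmentation moves exactly one unit of flow and the returned assignment is integral, mirroring the total unimodularity argument.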
However, our algorithm needs a more general result that allows the distance to be replaced by a more complicated metric. Fortunately, as long as the distance between any two points is well defined, we can regard it as a constant coefficient, and thus the above result still holds.
Later on, when we talk about the the HCKM, we are meaning to discuss the HCKM with given instance and . And let illustrate the optimal partition for the HCKM instance when the center set is known. For uncapacitated KM, we use instead which means the Voronoi partition where each data point is assigned to its nearest center in . Moreover, let and be the mapping from to in the partition and respectively. Thus and if and only if is assigned to in and similarly. Therefore, once we have the center set for HCKM, we can compute the objective value by the following, denoting by
2.2 FPT Algorithm
First, let us consider the optimal centroid set of KM and HCKM. Given , there are many candidates for the centroid set of KM, and this number does not decrease much in order of magnitude for HCKM. From practice and experience, we know that k is typically much smaller than n. It would thus be interesting to reduce the number of candidates to . This idea reminds us of the algorithm for KM that partitions the data set into clusters, which will be called as a subroutine in our algorithm. Since we do not restrict the cardinality of but only care about a constant approximation here, we employ Hsu's algorithm ht2016 as an example, which opens at most centers achieving approximation. Note that the algorithm of adk2009 , which opens at most centers achieving approximation, would also work. To deal with the cardinality constraint, the algorithm guesses how the centers are distributed among the clusters. The remaining question is how to choose the centers within a fixed cluster once the number of centers in that cluster is given. The answer is simple: greedily. In the next section we show how to bound the cost by a constant times the optimum when the centers are chosen greedily.
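The guessing step can be pictured as enumerating every way of distributing k centers over the regions returned by the subroutine. The sketch below is a minimal illustration of that enumeration only; the function name `center_distributions` and the parameter `m` (number of regions) are assumptions, and the greedy choice of centers inside each region is not shown.

```python
from itertools import combinations
from math import comb


def center_distributions(k, m):
    """Yield every tuple (t_1, ..., t_m) of non-negative integers summing to k.

    t_i is the guessed number of centers opened in region i.
    Uses the stars-and-bars correspondence: place m - 1 bars among
    k + m - 1 slots; the gaps between bars are the counts.
    """
    for bars in combinations(range(k + m - 1), m - 1):
        counts = []
        prev = -1
        for b in bars:
            counts.append(b - prev - 1)
            prev = b
        counts.append(k + m - 1 - prev - 1)
        yield tuple(counts)


# Toy example (hypothetical sizes): k = 2 centers over m = 3 regions.
configs = list(center_distributions(2, 3))
num_configs = comb(2 + 3 - 1, 3 - 1)  # closed-form count of configurations
```

The number of configurations depends only on k and m, not on the number of data points, which is what makes the enumeration affordable when k is treated as a parameter.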
We evaluate the proposed algorithm from both running-time and performance ratio perspective. For ease of the notation, we use to denote the squared Euclidean distance between any two locations , i.e. . And the squared Euclidean distance between location and set is given by .
3.1 Running-time Analysis
The following lemma shows that the running time of our algorithm is exponential only in k.
Algorithm 1 terminates in time .
We begin with the subroutine that employs Hsu's algorithm for KM, which terminates in time . It is polynomial w.r.t. inputs and for any fixed constant . Thus the running time of Algorithm 1 is dominated by the loop from step 4 to step 10.
In step 3 we divide the whole space into a Voronoi partition induced by the center set output by the KM subroutine. What we do next in the "for" loop is to enumerate all possible configurations of how the centers of the centroid set are distributed over the regions. Ignoring the effect of , it scans at most possibilities. The inner loop in step 5 takes enumerations. In each enumeration we pick a known number of copies of centers in the corresponding region, which is exactly what step 6 does. After the inner loop, we do the more time-consuming work of solving a large number (possibilities of ) of linear programs when computing . The linear program is the relaxation of the integer-linear program (1-4), in which there are variables as well as constraints. Given , we bound the solving time by . Therefore, this dominating step takes in total, implying the claim.
What we learn from this subsection is that, when k is a fixed constant, our algorithm runs in polynomial time. Thus it is an FPT approximation algorithm in general.
3.2 Performance Analysis
First, let us take a look at the Voronoi partition at the end of step 3 (see Fig. 1). What we get from the KM subroutine are centers as well as regions. Among them, k centers are supposed to be opened in later steps. Some dense regions may open more than one center while some may open none. Since only k centers are opened in total, there must be regions in which no center is opened whenever the subroutine returns more than k regions. The points in such a region must therefore be connected to an open center in another region and form a new cluster. We focus on these regions and try to bound the connection cost. Next we introduce an inequality that will be used frequently in our analysis. The triangle inequality holds in Euclidean space; we prove that a similar property, which we call the extended triangle inequality, holds for squared Euclidean distances.
Lemma 3 (Extended triangle inequality)
Let metric be the -dimensional Euclidean space equipped with distance for any two point where denotes the standard Euclidean distance or the 2-norm distance. Then for any , we have
Based on basic algebra facts, we have
where the first inequality holds because is actually the Euclidean distance between and and thus satisfies the triangle inequality. This implies the lemma.
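For reference, a standard form of this inequality and its two-step derivation is sketched below, in assumed notation with $\Delta(a,b) = \|a-b\|^2$ for the squared Euclidean distance. The constant 2 is the usual one for this argument; the exact statement in the original may be phrased differently.

```latex
% Assumed notation: \Delta(a,b) = \|a - b\|^2 (squared Euclidean distance).
\begin{align*}
  \Delta(a, c) = \|a - c\|^2
    &\le \bigl( \|a - b\| + \|b - c\| \bigr)^2
       && \text{(triangle inequality for } \|\cdot\| \text{)} \\
    &\le 2\,\|a - b\|^2 + 2\,\|b - c\|^2
       && \text{(since } (p + q)^2 \le 2p^2 + 2q^2 \iff (p - q)^2 \ge 0 \text{)} \\
    &= 2 \bigl( \Delta(a, b) + \Delta(b, c) \bigr).
\end{align*}
```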
Moreover, the extended triangle inequality can be extended to include more complicated cases.
If are inserted in two points and in metric , it must be the case that
where the first inequality comes from the triangle inequality and second from basic algebra facts.
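One natural generalization of this kind, assumed here for illustration, is that inserting m intermediate points between two endpoints bounds the squared distance between the endpoints by (m + 1) times the sum of the squared hop lengths. The sketch below checks that bound numerically on random chains; it is a sanity check rather than a proof, and the function names are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chain_bound(chain):
    """(number of hops) times the sum of squared hop lengths along the chain."""
    hops = len(chain) - 1
    return hops * sum(sq_dist(chain[i], chain[i + 1]) for i in range(hops))


def check_extended_triangle(trials=1000, dim=3, max_inserted=4, seed=0):
    """Return True if the bound holds on every randomly generated chain."""
    rng = random.Random(seed)
    for _ in range(trials):
        m = rng.randint(1, max_inserted)  # number of inserted points
        chain = [tuple(rng.uniform(-1.0, 1.0) for _ in range(dim))
                 for _ in range(m + 2)]
        # Small tolerance guards against floating-point rounding.
        if sq_dist(chain[0], chain[-1]) > chain_bound(chain) + 1e-12:
            return False
    return True


holds = check_extended_triangle()
# Equally spaced collinear points make the bound tight.
tight_lhs = sq_dist((0,), (2,))
tight_rhs = chain_bound([(0,), (1,), (2,)])
```

The tight example shows that the factor cannot be improved in general: with one inserted midpoint, both sides equal 4.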
For the ease of the rest part of analysis, we introduce some notations here. Remembering the definition of metric , we will build another metric .
For any two points , we define the distance in metric be
where and are the corresponding nearest centers in (same notation with Algorithm 1) w.r.t. the distance in metric .
Next we will use notation representing the objective value in metric and in metric for HCKM. And let together with be the corresponding optimal solution for HCKM in metric and metric . Recall is the output solution of the (uncapacitated) KM subroutine in metric and is its cost in metric . Note there is no difference between the objective value of KM and that of HCKM. The optimal solution for KM in metric is denoted by . Again, all this notation is based on the same fixed instance. We now analyze the relation between the optimum value for HCKM in metric and that in metric . By the following, we show can be bounded in terms of and from both sides.
Suppose is assigned to in . Let () be the nearest center from () to which is obtained through the KM subroutine, i.e., , . (See Fig. 2) The lower bound is straightforward because from Corollary 4 we have,
Because of the assumption, summation over all obtaining
For the upper bound, an observation of Voronoi partition gives and thus
where the second inequality follows from the extended triangle inequality. On the other hand, we can bound using Corollary 4 and thus,
So in total,
Again, summation over all and remember , ,
completing the whole proof.
In fact, for any feasible solution to HCKM, we always have . By the above lemma we only prove the special case where is an optimal one. Blending this observation with Lemma 1, we state the following.
Algorithm 1 outputs a feasible solution to HCKM with cost satisfying
The feasibility is straightforward. For the cardinality constraint, from step 4 we have . Thus all w.r.t. vector have cardinality
is one of thus satisfying the cardinality constraint. The capacity constraints hold naturally because is obtained by solving program (1-4).
We have already claimed that , which follows directly from Corollary 4. To complete the proof, we only need to prove . In fact, is an exact optimal solution for HCKM in metric . That is, is a partition of ( is the corresponding mapping) that satisfies both the cardinality and capacity constraints and at the same time minimizes the following,
First, we prove that the optimal solution for HCKM in metric must be a subset of (possibly a multi-set). By contradiction, suppose there exists an optimal solution with a center not located in . W.l.o.g., assume the optimal assignment mapping w.r.t. is . Considering the cluster , we can reduce the cost of this cluster by moving to , because from the definition of metric we have,
Knowing that the optimal solution must be a subset of , we enumerate all possible subsets, for each of which we compute the optimal assignment by program (1-4) with . Thus is one of the optimal solutions in metric and naturally . Combining this with , the lemma follows.
For the sake of portability and completeness, we propose a more general framework showing that once we have a -approximation KM subroutine as well as a -approximation subroutine for HCKM in metric , we can combine them to obtain a -approximation algorithm for HCKM in metric .
Suppose we have a -approximation KM subroutine for the uncapacitated -means problem that outputs a solution with opened centers, as well as a -approximation subroutine for HCKM in metric that outputs a solution with opened centers. Then, we can construct an -approximation algorithm for any general HCKM instances.
Suppose we have two subroutines that output for KM and for HCKM satisfying
Now we need to modify Lemma 6 and 7 according to the above conditions. First, observe that any feasible solution to HCKM is also feasible to KM and thus . Combining this with the -approximation condition, we know , and substituting this into Lemma 6 we obtain,
Considering Lemma 7, we replace the inequality with in the proof and get
completing the proof.
Given any small constant , HCKM can be solved in time with approximation ratio .
4 Conclusion and future work
Note that the performance ratio and the running time can be balanced by embedding different KM subroutines. In this paper, we employ Hsu's KM subroutine in order to obtain a smaller performance ratio at the cost of a longer running time. One may find a better balance within this framework. Moreover, our analysis is a rough one that is concerned only with obtaining a constant performance ratio; both the ratio and the running-time bound can be tightened, since we ignore many low-order terms. This is a theoretical work whose algorithm satisfies both the cardinality and capacity constraints and has a constant approximation ratio. A faster algorithm, beyond an FPT one, is an interesting direction for future work.
- (1) Aloise D, Deshpande A, Hansen P, Popat P. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75:245-249, 2009.
- (2) Awasthi P, Charikar M, Krishnaswamy R, Sinop A K. The hardness of approximation of Euclidean k-means. In Proceedings of the 31st International Symposium on Computational Geometry, pages 754-767, 2015.
- (3) Lee E, Schmidt M, Wright J. Improved and simplified inapproximability for k-means. Information Processing Letters, 120:40-43, 2016.
- (4) Lloyd S. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129-137, 1982.
- (5) Berkhin P. A survey of clustering data mining techniques. In: Kogan J., Nicholas C., Teboulle M. (eds) Grouping Multidimensional Data. Springer, Berlin, Heidelberg, 2006.
- (6) Arthur D, Vassilvitskii S. k-means++: the advantages of careful seeding. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027-1035, 2007.
- (7) Bahmani B, Moseley B, Vattani A, Kumar R, Vassilvitskii S. Scalable k-means++. In Proceedings of the 38th International Conference on Very Large Data Bases, pages 622-633, 2012.
- (8) Kanungo T, Mount D M, Netanyahu N S, Piatko C D, Silverman R, Wu A Y. A local search approximation algorithm for k-means clustering. In Proceedings of the 18th Annual Symposium on Computational Geometry, pages 10-18, 2002.
- (9) Friggstad Z, Rezapour M, Salavatipour M R. Local search yields a PTAS for k-means in doubling metrics. In Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science, pages 365-374, 2016.
- (10) Cohen-Addad V, Klein P N, Mathieu C. Local search yields approximation schemes for k-means and k-median in Euclidean and minor-free metrics. In Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science, pages 353-364, 2016.
- (11) Zhang J, Chen B, Ye Y. A multiexchange local search algorithm for the capacitated facility location problem. Mathematics of Operations Research, 30(2):389-403, 2005.
- (12) Li S. A 1.488 approximation algorithm for the uncapacitated facility location problem. Information and Computation, 222:45-58, 2013.
- (13) Li S. On uniform capacitated k-median beyond the natural LP relaxation. ACM Transactions on Algorithms, 13(2):22, 2017.
- (14) Byrka J, Rybicki B, Uniyal S. An approximation algorithm for uniform capacitated k-median problem with capacity violation. In Proceedings of the 18th International Conference on Integer Programming and Combinatorial Optimization, pages 262-274, 2016.
- (15) Adamczyk M, Byrka J, Marcinkowski J, Meesum S M, Wlodarczyk M. Constant factor FPT approximation for capacitated k-median. arXiv preprint arXiv:1809.05791, 2018.
- (16) Schrijver A. Combinatorial optimization: polyhedra and efficiency. Springer Science & Business Media, 2003.
- (17) Hsu D, Telgarsky M. Greedy bi-criteria approximations for k-medians and k-means. arXiv preprint arXiv:1607.06203, 2016.
- (18) Aggarwal A, Deshpande A, Kannan R. Adaptive sampling for k-means clustering. In Proceedings of the Approximation, Randomization, and Combinatorial Optimization: Algorithms and Techniques, pages 15-28, 2009.