The -means (KM) is one of the most fundamental clustering tasks in combinatorial optimization and data mining. Given an -point data set in and an integer , the goal is to partition into disjoint subsets so as to minimize the total within-cluster variances or, equivalently speaking, to choose centers so as to minimize the total squared distances between each point and its closest center. This problem is NP-hard in general dimension even when , proved by reduction from the densest cut problem adhp2009 . And the latest work shows that it is APX-hard acks2015 in general dimension and the inapproximability is at least 1.0013 lsw2016 . Despite its hardness, Lloyd l1982 proposes a local search heuristic for this problem that performs very well in practical and is still widely used today. Berkhin b2006 states in public that ”it is by far the most popular clustering algorithm used in scientific and industrial applications”.
It turns out that in the literature, most of the work for -means has been done from a practical perspective and is thus more time-concerned. Among these results, meta heuristics are very popular like the well-known Lloyd’s algorithm, -means++ av2007 and -means bmvkv2012 . Although -means++ is proved to be -approximate, we believe it is far from the limit. The first constant approximation for -means in general dimension is a -approximation kmn2002 based on local search, which is proved to be almost tight by showing that any local search with a fixed number of swap operations achieves at least a -approximation. Ahmadian et al. ans2017 improve this factor by presenting a new primal-dual approach and achieve an approximation ratio of 6.357. Recent results frs2016 ; ckm2016 show local search allowing up to many swaps in a single neighbor search move yields a PTAS for -means in fixed dimensional Euclidean space.
Capacity constraints form a straightforward variant as well as a natural requirement for almost all combinatorial models, such as the capacitated supply chain, capacitated lot-sizing, capacitated facility location, capacitated vehicle routing, etc. However, these constraints always raise the difficulties of the problem dramatically. For clustering problems like facility location and
-median, the capacitated version breaks the nice structure of an optimal solution that every data point is assigned to its nearest open center. Moreover, the widely used integer-linear programs for these problem have unbounded integral gap when the capacity constraints are added. The state-of-the-art approximation algorithms also suffer from this limitation. Researchers have to change techniques instead of insisting on linear program based ones. Note that the linear program based techniques perform very well in the uncapacitated facility location and-median.
As stressed before, approximation algorithms are seldom proposed for capacitated -means due to its hardness. Fortunately, this problem has strong connections with other capacitated clusterings like capacitated facility location and capacitated -median. For capacitated facility location, the state-of-the-art approximation algorithm achieves a -approximation ratio based on local search zcy2005 , while for uncapacitated facility location the corresponding result is l2013 which is very close to the inapproximability lower bound of under the assumption of P NP. For the capacitated -median, most research in the literature concerns pseudo-approximation that violates either the cardinality constraint or capacity constraints. Based on an exploring of extended relaxations for the capacitated -median, Li S. l2017 presents an -approximation that violates the cardinality constraint by a factor of . Byrka et al. br2016 present a constant approximation with a violation of capacity constraint. They are both the state-of-the-art bi-factor approximation results for uniform capacitated -median. Recently, a combination of approximation algorithms and FPT algorithms has been developed for those problems, for which we fail to give polynomial-time constant approximations. Particularly, Adamczyk et al. abm2018 propose a constant FPT approximation for both uniform and nonuniform capacitated -median, which inspires our work.
Our contribution We present a constant FPT approximation algorithm without breaking any cardinality constraint or capacity constraint for any general HCKM instance. The running time of the proposed algorithm is polynomial w.r.t. the inputs except for . Also, our algorithm has the following advantages and potential: (1) It is embedded into two main subroutines which allow to be replaced by others so that one can find a better balance between performance ratio and running time under this framework; (2) The algorithm as well as the analysis are very likely to extend to solving a large family of capacitated clustering with objective to minimize the sum of arbitrary powers of Euclidean distances; (3) The technique may inspire further work of nonuniform capacitated -means, for which, to the best of our knowledge, even no appropriate mathematical model was known so far.
The remainder of the paper is organized in a natural way that first proposes the algorithm, and continues with the running time and performance analysis.
2 FPT Algorithm
The KM can be formally described as: We are given a data set , where are
-dimensional real vectors. The object is to partition the data set intodisjoint subsets so as to minimize the total within-cluster sum of squared distances (or variances). It is well known that for a fixed finite cluster , the variance minimizer must be the centroid of , which we denote by Thus the goal of the -means can be formally stated as to find a partition of so as to minimize the following objective:
In the HCKM, we force the size of each cluster to be smaller than a given input , i.e. for . For the sake of convenience, we assume w.l.o.g. that is a positive integer.
Here is an easy observation for an arbitrary optimal solution to KM. Suppose we are given the optimal center set of cardinality , then the rest of the problem would be easy. Since the object is to minimize the total squared distances from each data point to the center set, it must be the case that every data point is assigned to its nearest center in the given set, which is also known as Voronoi partition. However, this partition may violate the capacity constraints and thus be infeasible to HCKM. For this consideration, we claim the following.
When given the center set, the HCKM is polynomial-time tractable.
Suppose we have an -point data set and an integer as input. And we are given -point center set . Then the optimal partition of according to satisfying the cluster upper bound constraints can be computed in polynomial time w.r.t. and . In fact, the above partition problem can be described by the following integer-linear program.
The decision variable (out of an arbitrary solution ) equals 1 representing that is assigned to in that solution, and 0 otherwise. In HCKM, once we have the location information of any two points , the distance between them can be computed immediately by . Thus the objective is essentially linear w.r.t. . The constraints of the above program are quite straightforward.
It seems to be intractable to solve the program exactly, but one can find that the constraints as well as the object are linear except the Boolean constraints. From the theory of integer-linear program we know that at least one of the optimal solutions to the natural relaxation of the above program must be integral (Detail see sa2003 as a reference). Note that the relaxation of program (1-4) differs only by relaxing the constraints to . The integrality property of the relaxed program holds whenever the coefficient matrix of the initial integer-linear program is totally unimodular and the right-hand side terms are integers, which both are satisfied in program (1-4). That is to say, we only need to compute an optimal solution of a linear program in order to obtain the optimal assignment, implying the lemma.
For any HCKM instance , and , let illustrate the optimal partition w.r.t. an arbitrary center set . According to the above lemma, can be computed in polynomial time. For uncapacitated -means instances, we use to denote the Voronoi partition w.r.t. . Moreover, let and be the corresponding mapping from to in partition and respectively. Thus and if and only if is assigned to in and similarly. Therefore, once we have the center set (not necessary optimal) for HCKM, we can compute the objective value w.r.t. by
The rest of the HCKM, also the critical part, it to find the optimal center set . Let , there are many candidates of the centroid set for KM and this amount do not reduce too much for that of HCKM. We know from practical experience that the number of clusters is literally much smaller than the total number of data . Thus it is interesting if we are able to reduce the number of candidates to . This main idea reminds us of the bi-criteria approximation algorithm for KM that partitions the data set into clusters, which will be considered as a subroutine in our algorithm. We call the outputs of the KM subroutine the representing set, denoted by . At the end of this subroutine, we actually obtain regions in according to the Voronoi partition of the representing set, see Fig.1. By introducing a novel metric , we are able to solve the optimal solution to HCKM efficiently. And by carefully analysis, we prove that we only lose a constant factor of the optimum w.r.t. the original metric.
Let metric be the -dimensional Euclidean space equipped with distance for any two point , where denotes the standard Euclidean distance or 2-norm distance. Under this definition, the object for HCKM is simply the sum of distance between each data point with its centroid under the distance in metric . However, it is hard to compute the optimal centroid set under this metric. But for the following metric (see Fig. 2 as an example), it is not the case.
Define metric be equipped with the following distance
where and are the corresponding nearest points (w.r.t. the distance in metric ) in .
Under this novel metric, think about the case that an arbitrary data point that is assigned to a certain region among the regions. It is must be the case for any optimal solution that it is assigned to the representing point in that region. Suppose data point is assigned to instead of the representing point in which region is located, then one can reduce the cost by move to because . Next we will use notation representing the objective value in metric and in metric for HCKM. And let together with be the corresponding optimal solution for HCKM in metric and metric . Recall is the output solution of the (uncapacitated) KM subroutine in metric and is its cost in metric . Note there is no difference between the objective value of KM and that of HCKM. The optimal solution for KM in metric is denoted by . Again, all these notations are based on the same fixed instance.
Therefore, the optimal centroid set under metric must be located in the representing set. Note once we have the optimal centroid set, we can in polynomial time compute the optimal partition by solving a linear program. The rest of the problem is easy—select the optimal centroid out of candidates. This is another subroutine in our algorithm, which we take a simple but efficient one, the brute force search, i.e. to enumerate all possible distributions.
Since we do not require the cardinality of but care about the constant approximation here, we employ Hsu’s Algorithm ht2016 as an example that opens at most centers achieving approximation. Note in low-dimensional space, Bandyapadhyay and Varadarajan’s algorithm bv2015 that opens at most centers with approximation is also a good choice for our FPT algorithm. To deal with the cardinality constraint, we make a guess over the distribution of the centers among the clusters which is obtained in the KM subroutine. The optimal centroid set must be the one with minimum partition cost. For practical consideration, we add an update step that replace the current centroid set by the real centroid set of current solutions. This step definitely reduce the cost but does not change the partition.
By the above, we actually compute an optimal solution to HCKM in metric . In next section we prove that by doing this, we only lose a constant factor of the minimum cost. For ease of the analysis, we present the pseudocode of our FPT algorithm by Algorithm 1.
We evaluate the proposed algorithm from both running-time and performance ratio perspectives. For ease of the notation, we define the squared Euclidean distance between location and set by . And remember for any .
3.1 Running-time Analysis
By the following lemma we show the running-time of our algorithm is only exponential w.r.t. and polynomial w.r.t. all other inputs.
Algorithm 1 terminates in time .
We begin with the subroutine that employs the Hsu’s Algorithm for KM which terminates in time . It is polynomial w.r.t. inputs and for any fixed constant . Thus the running-time of Algorithm 1 is dominated by the loop from step 4 to 10.
In step 3 we divide the whole space into a Voronoi partition by output by the KM subroutine. What we do next in the ”for” loop is to numerate all possible configurations of the number distribution for the centroid set over all regions. Ignoring the effect of , it scans at most probabilities. And the inner loop in step 5 takes numerations. In each numeration we pick a known number of copies of centers in the correlative region which exactly is what step 6 all about. After the inner loop, we actually do a more time-consuming work of solving a large number (probabilities of ) of linear programs when computing . The linear program is the relaxation of integer-linear program (1-4) in which there are variables as well as constraints. Given , we bound the solving time by . Therefore, this dominating step takes in total, implying the theorem.
3.2 Performance Analysis
Let us take a glance at the Voronoi partition at the end of step 3. We get representing points as well as regions from the KM subroutine. Among them there are points supposed to be open in later steps. Some dense regions may open more than once while some may open none. But the proposed algorithm only open out of regions, thus in average there must be some regions where no representing point will be open. Therefore the data points in that region must be assigned to the open representing point in other region and form a new cluster. We will focus on these data points and try to bound their assignment cost. Next we introduce an inequality that will be used frequently in our analysis. We all know that the triangle inequality holds in Euclidean space. We prove a similar property holding in squared Euclidean space which we call the extended triangle inequality.
Lemma 4 (Extended triangle inequality)
For any , we have
Based on basic algebra facts, we have
where the first inequality holds because is actually the Euclidean distance between and and thus satisfying the triangle inequality. Implies the lemma.
Moreover, the extended triangle inequality can be extended to include more complicated cases.
If are inserted in two points l and k in metric , it must be the case that
where the first inequality comes from the triangle inequality and second from basic algebra facts.
Now think about the relation between the optimum value for HCKM in metric and that in metric . By the following lemma, we show can be bounded in terms of and from both sides.
Suppose is assigned to in . Let () be the nearest center from () to which is obtained through the KM subroutine, i.e., , . (See Fig. 2) The lower bound is straightforward because from Corollary 5 we have,
Because of the assumption, summation over all obtaining
For the upper bound, an observation of Voronoi partition gives and thus
where the second inequality follows from the extended triangle inequality. On the other hand, we can bound using Corollary 5 and thus,
So in total,
Again, summation over all and remember , ,
completing the whole proof.
In fact, for any feasible solution to HCKM, we always have . By the above lemma we only prove the special case where is an optimal one. Blending this observation with Lemma 1, we state the following.
Algorithm 1 outputs a feasible solution to HCKM with cost
The feasibility is straightforward. For the cardinality constraint, from step 4 we have . Thus all w.r.t. vector have cardinality
is one of thus satisfying the cardinality constraint. The capacity constraints hold naturally because is obtained by solving program (1-4).
We already claim that which is derived directly from Corollary 5. To complete the rest of the proof, we only need to prove . In fact, is an exact optimal solution for HCKM in metric . That is, is a such a partition of ( is the correlative mapping) satisfying both cardinality and capacity constraints and at the same time minimize the following,
First, we prove that the optimal solution for HCKM in metric must be a subset of (probably a multi-set). By contradiction, suppose there exist an optimal solution with a center not locating in . And w.l.o.g. assume the optimal assignment mapping w.r.t. is . Considering the cluster , we can reduce the cost of this cluster by moving to . Because from the definition of metric we have,
Known the optimal solution must be a subset of , we enumerate all possible subsets for which we compute the optimal assignment by program (1-4) with . Thus is one of the optimal solutions in metric and naturally . Combining with , implies the lemma.
For the sake of portability and completeness, we propose a more general framework showing that once we have a -approximation KM subroutine as well as a -approximation subroutine for HCKM in metric , we can combine them to obtain a -approximation algorithm for HCKM in metric .
Suppose we have a -approximation KM subroutine for the uncapacitated -means problem that outputs a solution with opened centers, as well as a -approximation subroutine for HCKM in metric that outputs a solution with opened centers. Then, we can construct an -approximation algorithm (not necessary in polynomial time) for any general HCKM instances.
Suppose we have two subroutines that output for KM and for HCKM satisfying
Now we need to modify Lemma 6 and 7 according to the above conditions. First, observe that any feasible solution to KM is also feasible to HCKM and thus . Combing this with -approximation condition we know, Substitute into Lemma 6 we obtain,
Considering Lemma 7, we replace the inequality with in the proof and get
completing the proof.
Note Algorithm 1 employs a -approximation KM subroutine and an exact subroutine for HCKM in metric , where . Substitute into Lemma 8 implying a -approximation for general HCKM. We obtain the following analysis result for our algorithm.
Algorithm 1 outputs a feasible solution to HCKM with approximation ratio in time .
4 Conclusion and future work
The proposed algorithm is a theoretical work that does not violate any cardinality and capacity constraints, and at the same time achieves a constant approximation ratio. The algorithm allows one to balance the performance ratio and running-time by embedding different subroutines. Concerning the performance analysis, we employ the Hsu’s KM subroutine and an exact subroutine for computing the optimal partition in metric at the cost of time consumption. One can find a better balance in this framework. Besides, our analysis is a rough one that only concerns to achieve a constant performance ratio. One can easily improve the ratio as well as the time analysis as we ignore many low order terms. A faster approximation algorithm other than an FPT one is also an interesting direction.
- (1) Adamczyk M, Byrka J, Marcinkowski J, Meesum S M, Wlodarczyk M. Constant factor FPT approximation for capacitated -median. arXiv preprint arXiv:1809.05791, 2018.
- (2) Ahmadian S, Norouzi-Fard A, Svensson O, Ward J. Better guarantees for -means and euclidean -median by primal-dual algorithms. In Proceedings of the 58th annual IEEE Symposium on Foundations of Computer Science, pages 61-72, 2017.
- (3) Aloise D, Deshpande A, Hansen P, Popat P. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75:245-249, 2009.
- (4) Arthur D, Vassilvitskii S. -means++: the advantages of careful seeding. In Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027-1035, 2007.
- (5) Awasthi P, Charikar M, Krishnaswamy R, Sinop A K. The hardness of approximation of Euclidean -means. In Proceedings of the 31st International Symposium on Computational Geometry, pages 754-767, 2015.
- (6) Bahmani B, Moseley B, Vattani A, Kumar R, Vassilvitskii S. Scalable -means++. In Proceedings of the 38th International Conference on Very Large Data Bases, pages 622-633, 2012.
- (7) Bandyapadhyay S, Varadarajan K. On variants of -means clustering. arXiv preprint arXiv:1512.02985, 2015.
- (8) Berkhin P. A survey of clustering data mining techniques. In: Kogan J., Nicholas C., Teboulle M. (eds) Grouping Multidimensional Data. Springer, Berlin, Heidelberg, 2006.
- (9) Byrka J, Rybicki B, Uniyal S. An Approximation Algorithm for Uniform Capacitated -Median Problem with Capacity Violation. In Proceedings of the 18th International Conference on Integer Programming and Combinatorial Optimization, pages 262-274, 2016.
- (10) Cohen-Addad V, Klein P N, Mathieu C. Local search yields approximation schemes for -means and -median in Euclidean and minor-free metrics. In Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science, pages 353-364, 2016.
- (11) Friggstad Z, Rezapour M, Salavatipour M R. Local search yields a PTAS for -means in doubling metrics. In Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science, pages 365-374, 2016.
- (12) Hsu D, Telgarsky M. Greedy bi-criteria approximations for -medians and -means. arXiv preprint arXiv:1607.06203, 2016.
- (13) Kanungo T, Mount D M, Netanyahu N S, Piatko C D, Silverman R, Wu A Y. A local search approximation algorithm for -means clustering. In Proceedings of the 18th annual symposium on Computational geometry, pages 10-18, 2002.
- (14) Lee E, Schmidt M, Wright J. Improved and simplified inapproximability for -means. Information Processing Letters, 120:40-43, 2016.
- (15) Lloyd S. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129-137, 1982.
- (16) Li S. A 1.488 approximation algorithm for the uncapacitated facility location problem. Information and Computation, 222:45-58, 2013.
- (17) Li S. On uniform capacitated -median beyond the natural LP relaxation. ACM Transactions on Algorithms, 13(2):22, 2017.
- (18) Schrijver A. Combinatorial optimization: polyhedra and efficiency. Springer Science & Business Media, 2003.
- (19) Zhang J, Chen B, Ye Y. A multiexchange local search algorithm for the capacitated facility location problem. Mathematics of Operations Research, 30(2):389-403, 2005.