Clustering is the task that consists in grouping data into homogeneous clusters with the goal that intra-cluster data should be more similar than inter-cluster data. Let be a set of points111For the sake of clarity and without loss of generality, we do not consider weighted points. in . Let be the non-empty clusters partitioning and denote by the set of cluster centers, the cluster prototypes. -Means is one of the oldest and yet prevalent clustering technique that consists in minimizing:
where denotes the squared Euclidean distance, and the index (or label) of the center of that is the closest nearest neighbor to (say, in case of ties, choose the minimum integer). Finding an optimal clustering minimizing globally is NP-hard when and [21, 8], and polynomial when using dynamic programming  or when setting to the center of mass. Note that there is an exponential number of optimal -means clustering yielding the same optimal objective function: Indeed, consider an equilateral triangle with and , we thus get equivalent optimal clustering related by rotational symmetries. Then make far away separated copies so that and consider , we end up with optimal -means clustering. Minimizing the -means function of Eq. 1 is equivalent to minimizing the sum of intra-cluster squared distances or maximizing the sum of inter-cluster squared distances:
Many heuristics have been proposed to overcome the NP-hardness of
-means. They can be classified into two main groups: Thelocal search heuristics and the global heuristics that can be used to initialize the local heuristics. For example, the following four heuristics are classically222See for example the R language for statistical computing, http://www.r-project.org/ implemented:
Forgy  (random): Draw uniformly at random points from to set the cluster prototypes inducing the partition. It can be proved that this best discrete -means (with ) yields a -approximation factor compared to the ordinary -means using a proof by contradiction based on the variance-bias decomposition: , where
denotes the variance andthe centroid. In fact, , the sum of intra-cluster variances (and ).
MacQueen  (online): From a given initialization of the centers defining singleton clusters (say, for the clusters), we add a new point at a time to the cluster that contains the closest center, update that cluster centroid, and reiterate until convergence. This heuristic is also called the online or single-point -means .
Lloyd  (batched): From a given initialization of cluster prototypes, (1) assign points to their closest cluster, (2) relocate cluster centers to their cluster centroids, and reiterate those two steps until convergence.
Hartigan [12, 13] (single-point relocation): From a given initialization, find how to move a point from a cluster to another cluster so that the -means cost of Eq. 1 strictly decreases and reiterate those single-point relocations until convergence is reached. Note that a point maybe assigned to a cluster which center is not its closest center .
In general, a -means clustering technique partitions the data into pairwise non-overlapping convex hulls : The Voronoi partition. A partition is said stable when a local improvement of the heuristic cannot improve its -means score. Let denotes the maximum number of stable -means partitions obtained by Forgy’s, MacQueen’s, Lloyd’s and Hartigan’s schemes, respectively.
Fact 1 (Voronoi partitions)
We have and , where denotes the number of partitions of elements into non-empty subsets (that is, the Stirling numbers of the second kind) and denotes the number of partitions with non-overlapping (and non-empty) convex hulls (that is, the number of -Voronoi partitions).
Hartigan’s single-point relocation heuristic may improved Lloyd’s clustering but not the converse . Note that Lloyd’s heuristic may require an exponential number of iterations to converge . It is an open question  to bound the maximum number of Hartigan’s iterations.
On one hand, for those local heuristics performing pivots on Voronoi partitions using primitives, initialization (i.e., the initial Voronoi partition) is crucial  to obtain a good clustering, and several restarts, denoted by , are performed in practice to choose the best clustering. In practice, Forgy’s initialization has been replaced by -means++  which provides an expected
competitive initialization. However, it was shown that there exits point sets (even in 2D) for which the probability to get such a good initialization is exponentially low (and thus requiring exponentially many initialization restarts to reach a good Voronoi partition with high probability).
On the other hand, the global -means [18, 26] builds incrementally the clustering by adding one seed at a time. Given a current -clustering it chooses the point in that minimizes the -means objective function. Thus initialization is limited to choosing the first point, and all points can be considered as this first starting point. However, Global -means requires more computation.
In this paper, we do not address the problem of choosing the most appropriate number, , of clusters: This model selection problem has been investigated in [22, 17]. We also consider the squared Euclidean distance although the results apply to any other Bregman divergence [3, 23].
The paper is organized as follows: We investigate the blessing of empty-cluster exceptions in Lloyd’s heuristic in Section 2, and of single-point-cluster exceptions in Hartigan’s scheme in Section 3. In Section 4, we describe our novel heuristic merge-split-cluster -means and report on its performances with respect to Hartigan’s heuristic. In Section 5, we present a generalization of the -means objective function where each point is associated to its closest clusters: the -means clustering. We show how to directl convert or iteratively relax a sequence of -means to a -means and compare experimentally those solutions with a direct -means. Finally, Section 6 wrap ups the contributions and discusses further perspectives.
2 The blessing of empty-cluster exceptions in Lloyd’s batched -means
Lloyd’s -means  starts by initializing the seeds of the cluster centers , and then iterates by assigning the data to their closest cluster center with respect to the squared Euclidean distance, and then relocates the cluster centers to their centroids. Those batched assignment/relocation iterations are repeated until convergence is reached: The -means cost monotonically decreases with guaranteed convergence after a finite number of iterations . The complexity of Lloyd’s -means is where denotes the number of iterations. It has been proved that Lloyd’s -means performs a maximum number of iterations exponential  or polynomial in , and the spread333The spread is the ratio of the maximum point inter-distance over the minimum point inter-distance. of the point set . Some 1D point set are reported to take iterations even for , see . We first, report a lower bound on the number of Lloyd’s stable optima :
Fact 2 (Exponentially many Lloyd’s -means minima)
Lloyd’s -means may have stable local minima.
The proof follows from the gadget illustrated in Figure 1.
The Hartigan’s heuristic [12, 13] proceeds by relocating a single point between two clusters provided that the -means cost function decreases. It can thus further decrease the -means score when Lloyd’s batched algorithm is stuck into a local minimum (but not the converse). Recently, Hartigan’s heuristic  was suggested to replace Lloyd’s heuristic on the basis that Hartigan’s local minima is a subset of Lloyd’s optima (Theorem 2.2 of ). We argue that this is true only when no Empty Cluster Exceptions (ECEs) are met by Lloyd’s iterations. Figure 1 illustrates a toy data set where Lloyd’s -means meets such an empty-cluster exception. In general at the end of the relocation stage, when points are assigned to their closest current centroids, we may have some empty clusters.
Fact 3 (empty-cluster exceptions)
Lloyd’s batched -means may produce empty cluster exceptions in a round.
Proof follows from Figure 1 by creating far apart (non-interacting) copies of the gadget and setting .
However, those empty-cluster exceptions are a blessing because we may add new seeds that will further decrease significantly the cost of -means: This is a partial re-seeding. Thus the extended Lloyd’s heuristic is: (a) assignment, (b) relocation, and (c) partial reseeding to keep exactly non-empty clusters for the next stage. We may use various heuristics for partially re-seeding like the incremental global -means  starting from to , etc.
To evaluate the frequency at which those empty-cluster exceptions occur and their number , let us take the Iris data set from the UCI repository : It consists of samples with features (classified into
labels) that we first renormalize the data-set so that coordinates on each dimension have zero mean and unit standard deviation. Let us run Lloyd’s-means with (Forgy’s) random seed initialization (with a maximum number of iterations) for . We count the number of empty cluster exceptions and report their frequency in the graph of Figure 2. We observe that the larger the , and the more frequent the exceptions. This phenomenon was also noticed in . Furthermore, increases with the dimension too . However, note that this is a tendency and the number of empty-cluster exceptions vary a lot from a data set to another one (given an initialization heuristic).
Let us now run -means and report the empirical frequency of having simultaneous empty-cluster exceptions. (Note that our replicated toy data-sets of Figure 1 may provide values). The empty-cluster frequency depends on the initialization scheme: It is higher when using Forgy’s heuristic and lower when using -means++ or global -means. Table 2 demonstrates empirically this observation. As noticed in , the number of cluster-empty exceptions rise with and and the authors  avoided this problem by setting minimum input size on clusters. They surprisingly show empirically that -means with constraints gave better clustering than -means without constraints in practice!
Finally, let us compare the best minimum -means score when performing Lloyd’s heuristic (and stopping when we meet an empty cluster exception), and the extended Lloyd’s heuristic that partially reseeds the current clustering when the algorithm meets empty-cluster exceptions. Partial reseeding can be done in many ways by starting from the current number of cluster centers the usual seeding methods (Forgy, -means++ or global -means). Table 1 presents the results for the proof of concept using Forgy’s re-seeding: We observe that partial reseeding at ECEs allows to reach (slightly) better local minima (see in Table 1).
3 The blessing of single-point cluster exceptions in Hartigan’s heuristic
Hartigan’s heuristic  consider relocating a single-point provided that it decreases the -means objective function. In , a synthetic noisy data-set is built so that with probability tending to (as the dimension tends to infinity) any initial random partition is stable wrt. Lloyd’s -means while Hartigan’s converges to the correct solution. We recall that Hartigan’s local minima are a subset of Lloyd’s minima  provided that Lloyd’s heuristic did not encounter empty-cluster exceptions. Note that a single-point cluster (with associated cluster having zero variance) cannot be relocated to other clusters since it necessarily increases the -means energy (sum of intra-cluster variances):
Table 2 provides statistics on the Hartigan’s -means score and the number of single-point-cluster exceptions (SPCEs) met when performing Hartigan’s heuristic.
Consider the case of Single-point-Cluster Exceptions (SCEs) in Hartigan’s scheme where we decide to merge this single-point cluster with another cluster and redraw another center from (that can thus decrease significantly the variance of the change cluster). We accept this relocation iff. this merge&re-seed operation decreases the -means loss. For example, when (and ), the classical Hartigan’s best clustering has -means score while the heuristic with partial reseeding (associating the single-point clusters to their closest other clusters), we obtain . We keep the experiments short here since the next Section improves Hartigan’s heuristic with detailed experiments.
4 A novel heuristic: The merge-and-split-cluster -means
This novel heuristic proceeds by considering pairs of clusters with corresponding centers and . The basic local search primitive (pivot) consists in computing the best -means score difference by merging and splitting again with two new centers and . Let and denote the Voronoi partition of induced by and . Since the clusters other than and are untouched, the difference of the -means score is written as:
where denotes the -means objective function: namely, the cluster variance of with respect to center . There are several ways (randomized or deterministic) to implement the merge-and-split operation: For example, the two new centers can be found by computing:
-means: A brute-force method computes all hyperplanes444We do not need to compute explicitly the equation of the hyperplane since clockwise/counterclockwise orientation predicates are used instead. Those predicates rely on computing the sign of a matrix determinant. passing through (extreme) points and the induced sum of variances of the below-above clusters in -time. Using topological sweep , it can be reduced to time. Note that for and unfixed , -means is NP-hard . We can also use coresets to get a -approximation of a -means  in linear time .
a discrete -means: We choose among the points of the two best centers (naively implemented in ). This yields a -approximation of -means.
When , we accept replacing and by and , respectively. Otherwise, we consider another pair of clusters and stop iterating when all pairs do not produce a lower -means score. This heuristic can be classified as a macro kind of Hartigan-type heuristic that is not based on local Voronoi assignment. Indeed, Hartigan’s heuristic moves a point from a cluster to a cluster and update the two centroids correspondingly. Our heuristic also change these two clusters but can accept further improvements with respect to a -means operation on . Thus at the last stage of a Hartigan’s heuristic, we can perform this merge-and-split heuristic to further improve the clustering. (This heuristic can further be generalized by simultaneous merging-and-splitting of clusters.)
The merge-and-split -means heuristic decreases monotonically the objective function and converges after a finite number of iterations.
Since each pivot step between Voronoi partitions strictly decreases the -means score by and that is lower bounded, it follows that the merge-and-split -means converges after a finite number of iterations. We compare our heuristic with both Hartigan’s ordinary and discrete variants that consists in moving a point to another cluster iff. the two recomputed medoids of the selected clusters yield a better -means score. Heuristic performances are compared with the same initialization (Forgy’s or -means++ seeding) and by averaging over a number of rounds: Observe in Table 3 that our heuristic (MSC for short) always outperforms discrete Hartigan’s method not suprisingly. Although the number of basic primitives (#ops) is lower for MSC, each such operation is more costly. Thus MSC -means is overall more time consuming but gets better local optima solutions. Note that the discrete -means medoid splitting procedure is very well suited for the -modes algorithm , a -means extension working on categorical data sets.
5 Clustering with the -means objective function
Let us generalize the -means objective function as follows: For each data , we associate to its nearest cluster centers (with denoting the cluster center indexes), and ask to minimize the following -means objective function (with ):
When , this is exactly the -means objective function of Eq. 1. Otherwise the clusters overlap and . Note that when , since all cluster centers coincide to the centroid (or barycenter), the center of mass. We observe that:
with equality reached when .
Both Lloyd’s and Hartigan’s heuristics can be adapted straightforwardly to this setting.
Lloyd’s -means decreases monotonically the objective function and converge after a finite number of steps.
Proof: Let and denote the cost at round , for the assignment () and relocation () stages. Let be the initial cost (say, from Forgy’s initialization of ). For , we have: In the assignment stage , each point is assigned to its nearest neighbor centers . Therefore, we have . In the relocation stage , each cluster is updated by taking its centroid . Thus we have . When (and thus ), we stop the batched iterations.
Figure 3 illustrates a -means on a toy data-set.
Since for all and the iterations strictly decreases the score function, the algorithm converges. Moreover, since the number of different cluster sets induced by means is upper bounded by , and that cluster sets cannot be repeated, it follows that -means converges after a finite number of iterations. The bound can further be improved by considering the -order weighted Voronoi diagrams, similarly to . Note that the basic Lloyd’s -means may also produce empty-cluster exceptions although those become rarer as increase (checked experimentally).
Although -means is interesting in its own (see Discussion in Section 6), it can also be used for -means. Indeed, instead of running a local search -means heuristic that may be trapped too soon into a “bad” local minimum, we prefer to run a means for a prescribed . We can then convert a -means by assigning to each point its closest neighbor (among the assigned at the end of the -means), and then compute the centroids and launch a regular Lloyd’s -means to finalize: Let -means denote this conversion. For example, for and , the converted -means beats the -means of the time for using Forgy’s initialization on Iris. Table 4 shows experimentally that converted -means beats on average the regular -means (for the Iris data-set) and this phenomenon increases not surprisingly with . However the best minimum score is often obtained by classical -means. Thus it suggests that performs better when the number of restarts is limited. In fact, -means tend to smooth the -means optimization landscape and produce less local minima but also smooth the best minima.
We can also perform a cascading conversion of -means to -means: Once a local minimum is reached for -means, we initialize a means by dropping for each point its farthest cluster, perform a Lloyd’s -means, and we reiterate this scheme until we get a -means: An ordinary -means. Let -means denote this scheme. Table 4 (right) presents the performance comparisons of a regular Lloyd’s -means with a Lloyd’s -means for various values of with the initialization of both algorithms performed by the same seeding for fair comparisons.
We have extended the classical Lloyd’s and Hartigan’s heuristics with partial re-seeding and proposed new local heuristics for -means. We summarize our contributions as follows: First, we showed the blessing of empty-cluster events in Lloyd’s heuristic and of single-point-cluster events in Hartigan’s heuristic. These events happen increasingly when the number of cluster or the dimension increase, or when running those heuristics a given number of trials to choose the best solution. Second, we proposed a novel merge-and-split-cluster -means heuristic that improves over Hartigan’s heuristic that is currently the de facto method of choice . We showed experimentally that this method brings better -means result at the expense of computational cost. Third, we generalized the -means objective function to the -means objective function and show how to directly convert or iteratively relax a -means heuristic to a -means avoiding potentially being trapped into too many local optima. -Means is yet another exploratory clustering technique for browsing the space of hard clustering partitions. For example, when -means is trapped, we may consider a -means to get out of the local minimum and then convert the -means to a -means to explore a new (local) minimum.
D.J. Newman A. Asuncion.
UCI machine learning repository, 2007.
-  D. Arthur and S. Vassilvitskii. -means++ : the advantages of careful seeding. In SODA, pages 1027 – 1035, 2007.
-  A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749, 2005.
A note on cluster analysis and dynamic programming.Mathematical Biosciences, 18(3-4):311 – 312, 1973.
-  K. Bennett, Paul S. Bradley, and Ayhan Demiriz. Constrained -means clustering. MSR-TR-2000-65, 2000.
-  A. Bhattacharya, R. Jaiswal, and N. Ailon. A tight lower bound instance for -means++ in constant dimension. In Theory and Applications of Models of Computation, LNCS 8402, pages 7–22, 2014.
-  S. Bubeck, M. Meila, and U. von Luxburg. How the initialization affects the stability of the -means algorithm. ESAIM: Probability and Statistics, 16:436–452, 1 2012.
-  S. Dasgupta. The hardness of -means clustering. CS2007-0890, University of California, USA, 2007.
-  D. Feldman, M. Monemizadeh, and C. Sohler. A PTAS for -means clustering based on weak coresets. In SoCG, pages 11–18. 2007.
-  E. W. Forgy. Cluster analysis of multivariate data: efficiency vs interpretability of classifications. Biometrics, 1965.
-  S. Har-Peled and B. Sadri. How fast is the -means method? In SODA, pages 877–885. SIAM, 2005.
-  J. A. Hartigan. Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA, 99th edition, 1975.
-  J. A. Hartigan and M. A. Wong. Algorithm AS 136: A -means clustering algorithm. Journal of the Royal Statistical Society. Series C, 28(1):100–108, 1979.
-  Z. Huang. Extensions to the -means algorithm for clustering large data sets with categorical values. Data Min. Knowl. Discov., 2(3):283–304, September 1998.
-  M. Inaba, N. Katoh, and H. Imai. Applications of weighted Voronoi diagrams and randomization to variance-based -clustering. In SoCG, pages 332–339, 1994.
-  T. Kanungo, D. M. Mount, N. S. Netanyahu, C. Piatko, R. Silverman, and A. Y Wu. The analysis of a simple -means clustering algorithm. In SoCG, pages 100–109. 2000.
-  B. Kulis and M. I. Jordan. Revisiting -means: New algorithms via Bayesian nonparametrics. In ICML, 2012.
-  A. Likas, N. Vlassis, and J. J Verbeek. The global -means clustering algorithm. Pattern recognition, 36(2):451–461, 2003.
-  S. P. Lloyd. Least squares quantization in PCM. Technical report, Bell Laboratories, 1957. reprinted in IEEE Transactions on Information Theory, March 1982.
-  J. B. MacQueen. Some methods of classification and analysis of multivariate observations. Proceedings Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967.
-  M. Mahajan, P. Nimbhorkar, and K. Varadarajan. The planar -means problem is NP-hard. In WALCOM: Algorithms and Computation, pages 274–285. Springer, 2009.
D. Pelleg and A. W. Moore.
-means with efficient estimation of the number of clusters.In Proceedings of the Seventeenth International Conference on Machine Learning, pages 727–734, 2000.
-  N. Slonim, E. Aharoni, and . Crammer. Hartigan’s -means versus Lloyd’s -means: Is it time for a change? In IJCAI, pages 1677–1684, 2013.
M. Telgarsky and A. Vattani.
Hartigan’s method: -means clustering without Voronoi.
International Conference on Artificial Intelligence and Statistics, pages 820–827, 2010.
-  A. Vattani. -means requires exponentially many iterations even in the plane. Discrete & Computational Geometry, 45(4):596–616, 2011.
-  J. Xie, S. Jiang, W. Xie, and X. Gao. An efficient global -means clustering algorithm. Journal of computers, 6(2), 2011.