1 Introduction
Clustering is the task of grouping data into homogeneous clusters, with the goal that intracluster data should be more similar than intercluster data. Let $X=\{x_1,\dots,x_n\}$ be a set of $n$ points in $\mathbb{R}^d$ (for the sake of clarity and without loss of generality, we do not consider weighted points). Let $C_1,\dots,C_k$ be the $k$ nonempty clusters partitioning $X$, and denote by $C=\{c_1,\dots,c_k\}$ the set of cluster centers, the cluster prototypes. $k$-means is one of the oldest and yet most prevalent clustering techniques; it consists in minimizing:

(1)  $e_k(X;C)=\sum_{i=1}^{n} D(x_i, c_{l(x_i)})$,

where $D(x,c)=\|x-c\|^2$ denotes the squared Euclidean distance, and $l(x_i)=\arg\min_{j\in\{1,\dots,k\}} D(x_i,c_j)$ the index (or label) of the center of $C$ that is the nearest neighbor of $x_i$ (in case of ties, say, choose the minimum index). Finding an optimal clustering minimizing $e_k$ globally is NP-hard when $d\geq 2$ and $k\geq 2$ [21, 8], and polynomial when $d=1$ using dynamic programming [4] or when $k=1$ by setting $c_1$ to the center of mass. Note that there can be an exponential number of optimal $k$-means clusterings yielding the same optimal objective value: indeed, consider an equilateral triangle with $n=3$ and $k=2$; we get $3$ equivalent optimal clusterings related by rotational symmetries. Then make $m$ far-away, well-separated copies, so that $n=3m$, and consider $k=2m$: we end up with $3^m$ optimal $k$-means clusterings. Minimizing the $k$-means function of Eq. 1 is equivalent to minimizing the sum of intracluster squared distances, or to maximizing the sum of intercluster squared distances (their total being constant):

(2)  $\min_C e_k(X;C) = \min \sum_{j=1}^{k} \frac{1}{2|C_j|} \sum_{x,y\in C_j} \|x-y\|^2$.
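The identity behind Eq. 2 (the within-cluster sum of squared deviations to the centroid equals the normalized sum of pairwise squared distances) is easy to check numerically; here is a minimal NumPy sketch (function names are ours, not from the paper):

```python
import numpy as np

def kmeans_cost(X, C):
    """k-means objective of Eq. 1: sum over points of the squared
    Euclidean distance to the nearest center."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # (n, k) squared distances
    return d2.min(axis=1).sum()

def within_cluster_ss(X):
    """Within-cluster sum of squares computed via the pairwise identity of Eq. 2:
    sum_x ||x - mean||^2 == (1 / (2 n)) * sum_{x,y} ||x - y||^2."""
    n = len(X)
    pair = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # all pairwise squared distances
    return pair.sum() / (2 * n)
```

For a single cluster, `within_cluster_ss(X)` coincides with `kmeans_cost(X, C)` when `C` holds only the centroid of `X`.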
Many heuristics have been proposed to overcome the NP-hardness of $k$-means. They can be classified into two main groups: the local search heuristics, and the global heuristics that can be used to initialize the local ones. For example, the following four heuristics are classically implemented (see, e.g., the R language for statistical computing, http://www.r-project.org/):
Forgy [10] (random): Draw uniformly at random $k$ points from $X$ to set the cluster prototypes, inducing the partition. It can be proved that the best discrete $k$-means (with $C\subset X$) yields a $2$-approximation factor compared to the ordinary $k$-means, using a proof by contradiction based on the variance-bias decomposition $\frac{1}{n}\sum_{i=1}^{n}\|x_i-c\|^2 = v(X) + \|\bar{x}-c\|^2$, where $v(X)=\frac{1}{n}\sum_{i=1}^{n}\|x_i-\bar{x}\|^2$ denotes the variance and $\bar{x}=\frac{1}{n}\sum_{i=1}^{n} x_i$ the centroid. In fact, $e_k(X;C)=\sum_{j=1}^{k}|C_j|\,v(C_j)$, the sum of the (unnormalized) intracluster variances.
MacQueen [20] (online): From a given initialization of the $k$ centers defining singleton clusters, we add one new point at a time to the cluster with the closest center, update that cluster's centroid, and reiterate until convergence. This heuristic is also called the online or single-point $k$-means [11].

Lloyd [19] (batched): From a given initialization of the $k$ cluster prototypes, (1) assign points to their closest cluster center, (2) relocate cluster centers to their cluster centroids, and reiterate those two steps until convergence.

Hartigan [12, 13] (single-point relocation): From a given initialization, find how to move a single point from one cluster to another so that the $k$-means cost of Eq. 1 strictly decreases, and reiterate those single-point relocations until convergence is reached. Note that a point may end up assigned to a cluster whose center is not its closest center [24].
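Lloyd's batched scheme above can be sketched in a few lines (a plain NumPy sketch of the textbook scheme; the function name and the policy of keeping the old center when a cluster empties are our own choices):

```python
import numpy as np

def lloyd(X, C, max_iter=100):
    """Batched Lloyd iterations: assign each point to its nearest center,
    relocate each center to the centroid of its cluster, and repeat until
    the assignment no longer changes (or max_iter is hit)."""
    C = C.copy()
    labels = None
    for _ in range(max_iter):
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # (n, k)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # stable assignment: converged
        labels = new_labels
        for j in range(len(C)):
            members = X[labels == j]
            if len(members) > 0:          # empty cluster: keep old center (one policy)
                C[j] = members.mean(axis=0)
    return C, labels
```

Each round can only lower the cost of Eq. 1, which is the monotonicity argument used throughout the paper.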
In general, a $k$-means clustering technique partitions the data into $k$ pairwise nonoverlapping convex hulls: the Voronoi partition. A partition is said to be stable when a local improvement step of the heuristic cannot improve its $k$-means score. Let $F_k(n)$, $M_k(n)$, $L_k(n)$ and $H_k(n)$ denote the maximum numbers of stable $k$-means partitions obtained by Forgy's, MacQueen's, Lloyd's and Hartigan's schemes, respectively.
Fact 1 (Voronoi partitions)
We have $H_k(n)\leq L_k(n)\leq V_k(n)\leq S(n,k)$, where $S(n,k)$ denotes the number of partitions of $n$ elements into $k$ nonempty subsets (that is, the Stirling numbers of the second kind), and $V_k(n)$ denotes the number of partitions with nonoverlapping (and nonempty) convex hulls (that is, the number of Voronoi partitions).
Hartigan's single-point relocation heuristic may improve over a Lloyd clustering, but not the converse [23]. Note that Lloyd's heuristic may require an exponential number of iterations to converge [25]. Bounding the maximum number of Hartigan's iterations is an open question [24].
On the one hand, for those local heuristics performing pivots on Voronoi partitions, initialization (i.e., the initial Voronoi partition) is crucial [7] to obtain a good clustering, and several restarts are performed in practice to choose the best clustering. In practice, Forgy's initialization has been replaced by $k$-means++ [2], which provides an expected $O(\log k)$-competitive initialization. However, it was shown that there exist point sets (even in 2D) for which the probability of getting such a good initialization is exponentially low [6] (thus requiring exponentially many initialization restarts to reach a good Voronoi partition with high probability). On the other hand, the global $k$-means [18, 26] builds the clustering incrementally by adding one seed at a time: given a current clustering, it chooses as the next seed the point of $X$ that minimizes the $k$-means objective function. Initialization is thus limited to choosing the first point, and all points can be considered as this first starting point. However, global $k$-means requires more computation.
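As a reference point for the seeding procedures just discussed, here is a compact sketch of $k$-means++ seeding (our own minimal implementation of the scheme of [2], not the paper's code):

```python
import numpy as np

def kmeanspp_seed(X, k, rng=None):
    """k-means++ seeding: the first center is drawn uniformly; each further
    center is drawn with probability proportional to the squared distance
    to the nearest already-chosen center."""
    rng = np.random.default_rng(rng)
    C = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = ((X[:, None, :] - np.array(C)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        C.append(X[rng.choice(len(X), p=d2 / d2.sum())])  # D^2-weighted draw
    return np.array(C)
```

The sketch assumes the points are not all identical (otherwise the $D^2$ weights degenerate); production code would guard that case.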
In this paper, we do not address the problem of choosing the most appropriate number $k$ of clusters: this model selection problem has been investigated in [22, 17]. We also consider the squared Euclidean distance, although the results apply to any other Bregman divergence [3, 23].
The paper is organized as follows: We investigate the blessing of empty-cluster exceptions in Lloyd's heuristic in Section 2, and of single-point-cluster exceptions in Hartigan's scheme in Section 3. In Section 4, we describe our novel merge-and-split-cluster $k$-means heuristic and report on its performance with respect to Hartigan's heuristic. In Section 5, we present a generalization of the $k$-means objective function where each point is associated to its $l$ closest clusters: the $(k,l)$-means clustering. We show how to directly convert or iteratively relax a sequence of $(k,l)$-means to a $k$-means, and compare experimentally those solutions with a direct $k$-means. Finally, Section 6 wraps up the contributions and discusses further perspectives.
2 The blessing of empty-cluster exceptions in Lloyd's batched $k$-means
Lloyd's $k$-means [19] starts by initializing the $k$ seeds of the cluster centers, then iterates by assigning the data to their closest cluster center with respect to the squared Euclidean distance, and relocating the cluster centers to their centroids. Those batched assignment/relocation iterations are repeated until convergence is reached: the $k$-means cost monotonically decreases, with guaranteed convergence after a finite number of iterations [15]. The complexity of Lloyd's $k$-means is $O(t\,k\,n\,d)$, where $t$ denotes the number of iterations. It has been proved that Lloyd's $k$-means performs a maximum number of iterations exponential in $n$ [25], or polynomial in $n$, $k$, $d$ and the spread (the ratio of the maximum point interdistance over the minimum point interdistance) of the point set [16]. Some 1D point sets are reported to take $\Omega(n)$ iterations, see [11]. We first report a lower bound on the number of Lloyd's stable optima:
Fact 2 (Exponentially many Lloyd's $k$-means minima)
Lloyd's $k$-means may have exponentially many stable local minima.
The proof follows from the gadget illustrated in Figure 1.
Hartigan's heuristic [12, 13] proceeds by relocating a single point between two clusters provided that the $k$-means cost function decreases. It can thus further decrease the $k$-means score when Lloyd's batched algorithm is stuck in a local minimum (but not the converse). Recently, it was suggested [23] to replace Lloyd's heuristic by Hartigan's, on the basis that Hartigan's local minima are a subset of Lloyd's minima (Theorem 2.2 of [24]). We argue that this is true only when no empty-cluster exceptions (ECEs) are met by Lloyd's iterations. Figure 1 illustrates a toy data set where Lloyd's $k$-means meets such an empty-cluster exception. In general, at the end of the assignment stage, when points are assigned to their closest current centroids, we may have some empty clusters.
Fact 3 (empty-cluster exceptions)
Lloyd's batched $k$-means may produce empty-cluster exceptions in a round.
The proof follows from Figure 1, by creating far-apart (non-interacting) copies of the gadget and setting $k$ accordingly.
However, those empty-cluster exceptions are a blessing, because we may add new seeds that further decrease significantly the $k$-means cost: this is a partial reseeding. The extended Lloyd's heuristic is thus: (a) assignment, (b) relocation, and (c) partial reseeding, to keep exactly $k$ nonempty clusters for the next stage. We may use various heuristics for partial reseeding, like the incremental global $k$-means [26] starting from the current nonempty clusters, etc.
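A minimal sketch of the extended heuristic with Forgy-style partial reseeding, the simplest of the reseeding options above (function name and reseeding policy are our illustration):

```python
import numpy as np

def lloyd_with_reseeding(X, C, max_iter=100, rng=None):
    """Extended Lloyd: when a cluster loses all its points (an empty-cluster
    exception), redraw its center uniformly from X (Forgy-style partial
    reseeding) so that exactly k nonempty clusters enter the next round."""
    rng = np.random.default_rng(rng)
    C = C.copy()
    for _ in range(max_iter):
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        newC = C.copy()
        for j in range(len(C)):
            members = X[labels == j]
            if len(members) == 0:                 # ECE: partial reseed
                newC[j] = X[rng.integers(len(X))]
            else:
                newC[j] = members.mean(axis=0)    # usual relocation
        if np.allclose(newC, C):
            break
        C = newC
    return C, labels
```

Note that a reseed may momentarily raise the cost before the next assignment round; the point of the heuristic is that the reseeded run typically ends at a better local minimum.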
To evaluate the frequency at which those empty-cluster exceptions occur, let us take the Iris data set from the UCI repository [1]: it consists of $n=150$ samples with $d=4$ features (classified into $3$ labels). We first renormalize the dataset so that the coordinates on each dimension have zero mean and unit standard deviation. Let us run Lloyd's $k$-means with (Forgy's) random seed initialization (with a maximum number of iterations) for increasing values of $k$. We count the number of empty-cluster exceptions and report their frequency in the graph of Figure 2. We observe that the larger $k$ is, the more frequent the exceptions. This phenomenon was also noticed in [5]. Furthermore, the frequency increases with the dimension $d$ too [5]. However, note that this is a tendency, and the number of empty-cluster exceptions varies a lot from one data set to another (given an initialization heuristic). Let us now run $k$-means and report the empirical frequency of having several simultaneous empty-cluster exceptions. (Note that our replicated toy datasets of Figure 1 may provide extreme values.) The empty-cluster frequency depends on the initialization scheme: it is higher when using Forgy's heuristic, and lower when using $k$-means++ or global $k$-means. Table 2 demonstrates this observation empirically. As noticed in [5], the number of empty-cluster exceptions rises with $k$ and $d$, and the authors of [5] avoided this problem by setting a minimum input size on clusters. They surprisingly showed empirically that $k$-means with such constraints gave better clusterings in practice than $k$-means without constraints!
Finally, let us compare the best minimum $k$-means score obtained when performing Lloyd's heuristic (stopping when we meet an empty-cluster exception) with that of the extended Lloyd's heuristic, which partially reseeds the current clustering when the algorithm meets empty-cluster exceptions. Partial reseeding can be done in many ways, by restarting from the current number of cluster centers with the usual seeding methods (Forgy, $k$-means++ or global $k$-means). Table 1 presents the results of the proof of concept using Forgy's reseeding: we observe that partial reseeding at ECEs allows one to reach (slightly) better local minima (see Table 1).
3 The blessing of single-point-cluster exceptions in Hartigan's heuristic
Hartigan's heuristic [24] considers relocating a single point provided that the relocation decreases the $k$-means objective function. In [23], a synthetic noisy dataset is built so that, with probability tending to $1$ as the dimension tends to infinity, any initial random partition is stable wrt. Lloyd's $k$-means, while Hartigan's heuristic converges to the correct solution. We recall that Hartigan's local minima are a subset of Lloyd's minima [24], provided that Lloyd's heuristic did not encounter empty-cluster exceptions. Note that a single-point cluster $\{x\}$ (an associated cluster with zero variance) cannot be relocated to another cluster $C_j$, since this necessarily increases the $k$-means energy (the sum of intracluster variances):

(3)  $\Delta(x; C_j) = \frac{|C_j|}{|C_j|+1}\,\|x-c_j\|^2 \geq 0$.
Table 2 provides statistics on Hartigan's $k$-means score and on the number of single-point-cluster exceptions (SPCEs) met when performing Hartigan's heuristic.
Consider the case of single-point-cluster exceptions in Hartigan's scheme, where we decide to merge the single-point cluster with another cluster and redraw another center from $X$ (which can thus decrease significantly the variance of the changed cluster). We accept this relocation iff this merge-and-reseed operation decreases the $k$-means loss. For example, in one of our settings, the classical Hartigan's best clustering obtains a worse $k$-means score than the heuristic with partial reseeding (associating the single-point clusters with their closest other clusters). We keep the experiments short here, since the next section improves over Hartigan's heuristic with detailed experiments.
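The cost change of a single-point relocation has a closed form (the standard Hartigan update rule; the function name is ours), which makes the point of Eq. 3 explicit: emptying a singleton cluster can never pay off.

```python
import numpy as np

def hartigan_delta(x, cA, nA, cB, nB):
    """Exact k-means cost change for moving point x from cluster A (centroid cA,
    size nA) to cluster B (centroid cB, size nB); negative means improvement.
    Removing x from a singleton (nA == 1) saves nothing (its variance is zero),
    so the delta is nonnegative: the single-point-cluster exception."""
    gain = nB / (nB + 1) * np.sum((x - cB) ** 2)                       # cost added to B
    loss = nA / (nA - 1) * np.sum((x - cA) ** 2) if nA > 1 else 0.0    # cost removed from A
    return gain - loss
```

A relocation is accepted in Hartigan's scheme exactly when this delta is strictly negative.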
4 A novel heuristic: the merge-and-split-cluster $k$-means
This novel heuristic proceeds by considering pairs of clusters $C_i$ and $C_j$, with corresponding centers $c_i$ and $c_j$. The basic local search primitive (pivot) consists in computing the best $k$-means score difference obtained by merging $C_i$ and $C_j$ into $C_{ij}=C_i\cup C_j$ and splitting $C_{ij}$ again with two new centers $c'_i$ and $c'_j$. Let $C'_i$ and $C'_j$ denote the Voronoi partition of $C_{ij}$ induced by $c'_i$ and $c'_j$. Since the clusters other than $C_i$ and $C_j$ are untouched, the difference of the $k$-means score is written as:

(4)  $\Delta_{ij} = e(C'_i;c'_i) + e(C'_j;c'_j) - e(C_i;c_i) - e(C_j;c_j)$,

where $e(C;c)=\sum_{x\in C}\|x-c\|^2$ denotes the $k$-means objective function restricted to one cluster: namely, the (unnormalized) variance of $C$ with respect to center $c$. There are several ways (randomized or deterministic) to implement the merge-and-split operation: for example, the two new centers can be found by computing:

an exact $2$-means: A brute-force method computes all hyperplanes passing through $d$ (extreme) points (we do not need to compute explicitly the equation of a hyperplane, since clockwise/counterclockwise orientation predicates are used instead; those predicates rely on computing the sign of a matrix determinant), together with the induced sum of variances of the below/above clusters, in $O(n^{d+1})$ time. Using topological sweep [15], this can be reduced to $O(n^d)$ time. Note that for $k=2$ and unfixed $d$, $k$-means is NP-hard [8]. We can also use coresets to get a $(1+\varepsilon)$-approximation of a $2$-means [9] in linear time.
a discrete $2$-means: We choose among the points of $C_{ij}$ the two best centers (naively implemented in $O(n^3)$ time). This yields a $2$-approximation of the $2$-means.

a $2$-means++ heuristic: We pick the first center at random, then pick the second one randomly according to the normalized distribution of the squared distances of the points of $C_{ij}$ to the first center; see $k$-means++ [2]. We repeat this initialization a given number of rounds and keep the best one.
When $\Delta_{ij}<0$, we accept replacing $c_i$ and $c_j$ by $c'_i$ and $c'_j$, respectively. Otherwise, we consider another pair of clusters, and we stop iterating when no pair produces a lower $k$-means score. This heuristic can be classified as a macro kind of Hartigan-type heuristic that is not based on local Voronoi assignment. Indeed, Hartigan's heuristic moves a point from one cluster to another and updates the two centroids correspondingly. Our heuristic also changes two clusters, but can accept further improvements with respect to a $2$-means operation on their union. Thus, at the last stage of a Hartigan's heuristic, we can perform this merge-and-split heuristic to further improve the clustering. (This heuristic can further be generalized by simultaneously merging and splitting more than two clusters.)
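A sketch of the pivot using the third ($2$-means++) implementation above, with one polishing Lloyd step on the union (function name, trial count, and the polishing step are our choices, not the paper's exact procedure):

```python
import numpy as np

def merge_and_split(Xi, Xj, ci, cj, rng=None, trials=10):
    """Merge-and-split pivot: pool clusters Xi and Xj, try to re-split the union
    with 2-means++ seeding plus one Lloyd polish, and return the best pair of
    centers found (the current pair is kept when no trial improves it)."""
    rng = np.random.default_rng(rng)
    U = np.vstack([Xi, Xj])

    def local_cost(a, b):
        # k-means cost restricted to the union, i.e. e(C'_i;a) + e(C'_j;b)
        return np.minimum(((U - a) ** 2).sum(1), ((U - b) ** 2).sum(1)).sum()

    best = (local_cost(ci, cj), ci, cj)
    for _ in range(trials):
        a = U[rng.integers(len(U))]                      # first seed: uniform
        d2 = ((U - a) ** 2).sum(1)
        b = U[rng.choice(len(U), p=d2 / d2.sum())]       # second seed: D^2-weighted
        lab = ((U - a) ** 2).sum(1) > ((U - b) ** 2).sum(1)
        if lab.any() and (~lab).any():                   # one Lloyd step to polish
            a, b = U[~lab].mean(0), U[lab].mean(0)
        c = local_cost(a, b)
        if c < best[0]:
            best = (c, a, b)
    return best  # (local cost, new center i, new center j)
```

The pivot is accepted exactly when the returned local cost is strictly below the current one, i.e. when $\Delta_{ij}<0$.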
Theorem 1
The merge-and-split $k$-means heuristic monotonically decreases the objective function and converges after a finite number of iterations.
Since each pivot step moves between Voronoi partitions, of which there are finitely many, and strictly decreases the $k$-means score (which is bounded below), no partition can be repeated; it follows that the merge-and-split $k$-means converges after a finite number of iterations. We compare our heuristic with both Hartigan's ordinary and discrete variants; the discrete variant consists in moving a point to another cluster iff the two recomputed medoids of the selected clusters yield a better $k$-means score. Heuristic performances are compared under the same initialization (Forgy's or $k$-means++ seeding) and by averaging over a number of rounds. Observe in Table 3 that our heuristic (MSC for short) always outperforms the discrete Hartigan's method, not surprisingly. Although the number of basic primitives (#ops) is lower for MSC, each such operation is more costly. Thus MSC $k$-means is overall more time-consuming, but reaches better local optima. Note that the discrete $2$-means medoid splitting procedure is very well suited for the $k$-modes algorithm [14], a $k$-means extension working on categorical data sets.
5 Clustering with the $(k,l)$-means objective function
Let us generalize the $k$-means objective function as follows: to each datum $x_i$, we associate its $l$ nearest cluster centers, with $\mathrm{NN}_l(x_i)$ denoting the corresponding set of $l$ cluster center indexes, and we ask to minimize the following $(k,l)$-means objective function (with $1\leq l\leq k$):

(5)  $e_{k,l}(X;C)=\sum_{i=1}^{n} \sum_{j\in \mathrm{NN}_l(x_i)} \|x_i-c_j\|^2$.

When $l=1$, this is exactly the $k$-means objective function of Eq. 1. Otherwise, the clusters overlap, and each point belongs to $l$ clusters. Note that when $l=k$, all cluster centers coincide at the centroid (or barycenter), the center of mass. We observe that:
Fact 4
$e_{k,l}(X;C)\geq e_k(X;C)$, with equality reached when $l=1$.
Both Lloyd’s and Hartigan’s heuristics can be adapted straightforwardly to this setting.
Theorem 2
Lloyd's $(k,l)$-means monotonically decreases the objective function and converges after a finite number of steps.
Proof: Let $a_t$ and $r_t$ denote the cost at round $t$ after the assignment ($a$) and relocation ($r$) stages, respectively. Let $r_0$ be the initial cost (say, from Forgy's initialization of the $k$ centers). For $t\geq 1$, we have: in the assignment stage, each point is assigned to its $l$ nearest-neighbor centers, so that $a_t\leq r_{t-1}$; in the relocation stage, each cluster center is updated by taking the centroid of its cluster, so that $r_t\leq a_t$. When $r_t=r_{t-1}$ (and thus the assignment no longer changes), we stop the batched iterations.
Figure 3 illustrates a $(k,l)$-means on a toy dataset.
Since the cost is bounded below by $0$ for all $C$ and the iterations strictly decrease the score function, the algorithm converges. Moreover, since the number of different cluster sets induced by a $(k,l)$-means is finite (each point has at most $\binom{k}{l}$ possible sets of assigned centers), and cluster sets cannot be repeated, it follows that the $(k,l)$-means converges after a finite number of iterations. The bound can further be improved by considering order-$l$ weighted Voronoi diagrams, similarly to [15]. Note that the basic Lloyd's $(k,l)$-means may also produce empty-cluster exceptions, although those become rarer as $l$ increases (checked experimentally).
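One batched round of the adapted Lloyd scheme can be sketched as follows (our own minimal implementation; an empty cluster simply keeps its center):

```python
import numpy as np

def kl_means_step(X, C, l):
    """One batched (k, l)-means round: each point is assigned to its l nearest
    centers; each center then relocates to the centroid of all points that
    have it among their l nearest. Returns the new centers, the (n, l)
    assignment array, and the (k, l)-means cost of the assignment."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # (n, k)
    nearest = np.argsort(d2, axis=1)[:, :l]                   # (n, l) center indexes
    newC = C.copy()
    for j in range(len(C)):
        members = X[(nearest == j).any(axis=1)]
        if len(members) > 0:
            newC[j] = members.mean(axis=0)
    cost = np.take_along_axis(d2, nearest, axis=1).sum()      # Eq. 5 for this assignment
    return newC, nearest, cost
```

For $l=k$, every point is assigned to every center, so a single round already sends all centers to the global centroid, matching the remark above.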
Although the $(k,l)$-means is interesting in its own right (see the discussion in Section 6), it can also be used for $k$-means. Indeed, instead of running a local search $k$-means heuristic that may be trapped too soon in a "bad" local minimum, we prefer to run a $(k,l)$-means for a prescribed $l$. We can then convert to a $k$-means by assigning to each point its closest center (among the $l$ assigned at the end of the $(k,l)$-means), computing the centroids, and launching a regular Lloyd's $k$-means to finalize: let $(k,l)\to k$-means denote this conversion. For example, the converted $(k,l)\to k$-means beats the $k$-means a majority of the time using Forgy's initialization on Iris, for suitable $k$ and $l$. Table 4 shows experimentally that the converted $(k,l)\to k$-means beats on average the regular $k$-means (for the Iris dataset), and this phenomenon increases, not surprisingly, with $l$. However, the best minimum score is often obtained by the classical $k$-means. This suggests that the conversion performs better when the number of restarts is limited. In fact, the $(k,l)$-means tends to smooth the $k$-means optimization landscape and to produce fewer local minima, but it also smooths the best minima.
k    win     min     avg     min     avg
3    20.8    78.94   92.39   78.94   78.94
4    24.29   57.31   63.15   57.31   70.33
5    57.76   46.53   52.88   49.74   51.10
6    80.55   38.93   45.60   38.93   41.63
7    76.67   34.18   40.00   34.29   36.85
8    80.36   29.87   36.05   29.87   32.52
9    78.85   27.76   32.91   27.91   30.15
10   79.88   25.81   30.24   25.97   28.02

k    l    win    min     avg     min     avg
5    2    58.3   46.53   52.72   49.74   51.24
5    4    62.4   46.53   52.55   49.74   49.74
8    2    80.8   29.87   36.40   29.87   32.54
8    3    61.1   29.87   36.19   32.76   34.04
8    6    55.5   29.88   36.18   32.75   35.26
10   2    78.8   25.81   30.61   25.97   28.23
10   3    82.5   25.95   30.23   26.47   27.76
10   5    64.7   25.90   30.32   26.99   28.61
We can also perform a cascading conversion of the $(k,l)$-means to a $k$-means: once a local minimum is reached for the $(k,l)$-means, we initialize a $(k,l-1)$-means by dropping, for each point, its farthest assigned cluster, perform a Lloyd's $(k,l-1)$-means, and reiterate this scheme until we reach a $(k,1)$-means: an ordinary $k$-means. Table 4 (right) presents the performance comparison of a regular Lloyd's $k$-means with this cascading scheme for various values of $k$ and $l$, the initialization of both algorithms being performed with the same seeding for fair comparison.
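The direct conversion step (keep, for each point, only the closest of its assigned centers, then recompute the centroids before a final Lloyd run) can be sketched as follows (names are ours; `nearest` stands for the $(n, l)$ assignment array of a converged $(k,l)$-means):

```python
import numpy as np

def convert_kl_to_k(X, C, nearest):
    """(k, l)-means -> k-means conversion: restrict each point's label to its
    currently assigned centers (rows of `nearest`), keep the closest one,
    and relocate centers to the resulting cluster centroids. A regular
    Lloyd's k-means would then finalize the clustering."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # (n, k)
    mask = np.full(d2.shape, np.inf)
    np.put_along_axis(mask, nearest, 0.0, axis=1)             # 0 on assigned centers only
    labels = (d2 + mask).argmin(axis=1)                       # closest assigned center
    newC = C.copy()
    for j in range(len(C)):
        members = X[labels == j]
        if len(members) > 0:
            newC[j] = members.mean(axis=0)
    return newC, labels
```

The cascading variant would instead drop one column of `nearest` (the farthest assigned center) per round, running a $(k,l-1)$-means in between.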
6 Discussion
We have extended the classical Lloyd's and Hartigan's heuristics with partial reseeding, and proposed new local heuristics for $k$-means. We summarize our contributions as follows: First, we showed the blessing of empty-cluster events in Lloyd's heuristic and of single-point-cluster events in Hartigan's heuristic. These events happen increasingly often when the number of clusters or the dimension increases, or when running those heuristics for a given number of trials to choose the best solution. Second, we proposed a novel merge-and-split-cluster $k$-means heuristic that improves over Hartigan's heuristic, currently the de facto method of choice [23]. We showed experimentally that this method brings better $k$-means results at the expense of computational cost. Third, we generalized the $k$-means objective function to the $(k,l)$-means objective function, and showed how to directly convert or iteratively relax a $(k,l)$-means heuristic to a $k$-means, potentially avoiding being trapped in too many local optima. The $(k,l)$-means is yet another exploratory clustering technique for browsing the space of hard clustering partitions. For example, when a $k$-means is trapped, we may consider a $(k,l)$-means to get out of the local minimum, and then convert the $(k,l)$-means back to a $k$-means to explore a new (local) minimum.
References

[1] A. Asuncion and D. J. Newman. UCI machine learning repository, 2007.
[2] D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful seeding. In SODA, pages 1027–1035, 2007.
[3] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749, 2005.
[4] R. Bellman. A note on cluster analysis and dynamic programming. Mathematical Biosciences, 18(3–4):311–312, 1973.
[5] K. Bennett, P. S. Bradley, and A. Demiriz. Constrained k-means clustering. MSR-TR-2000-65, 2000.
[6] A. Bhattacharya, R. Jaiswal, and N. Ailon. A tight lower bound instance for k-means++ in constant dimension. In Theory and Applications of Models of Computation, LNCS 8402, pages 7–22, 2014.
[7] S. Bubeck, M. Meila, and U. von Luxburg. How the initialization affects the stability of the k-means algorithm. ESAIM: Probability and Statistics, 16:436–452, 2012.
[8] S. Dasgupta. The hardness of k-means clustering. CS2007-0890, University of California, USA, 2007.
[9] D. Feldman, M. Monemizadeh, and C. Sohler. A PTAS for k-means clustering based on weak coresets. In SoCG, pages 11–18, 2007.
[10] E. W. Forgy. Cluster analysis of multivariate data: efficiency vs interpretability of classifications. Biometrics, 1965.
[11] S. Har-Peled and B. Sadri. How fast is the k-means method? In SODA, pages 877–885. SIAM, 2005.
[12] J. A. Hartigan. Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA, 1975.
[13] J. A. Hartigan and M. A. Wong. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society, Series C, 28(1):100–108, 1979.
[14] Z. Huang. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3):283–304, 1998.
[15] M. Inaba, N. Katoh, and H. Imai. Applications of weighted Voronoi diagrams and randomization to variance-based clustering. In SoCG, pages 332–339, 1994.
[16] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. Piatko, R. Silverman, and A. Y. Wu. The analysis of a simple k-means clustering algorithm. In SoCG, pages 100–109, 2000.
[17] B. Kulis and M. I. Jordan. Revisiting k-means: New algorithms via Bayesian nonparametrics. In ICML, 2012.
[18] A. Likas, N. Vlassis, and J. J. Verbeek. The global k-means clustering algorithm. Pattern Recognition, 36(2):451–461, 2003.
[19] S. P. Lloyd. Least squares quantization in PCM. Technical report, Bell Laboratories, 1957. Reprinted in IEEE Transactions on Information Theory, March 1982.
[20] J. B. MacQueen. Some methods of classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967.
[21] M. Mahajan, P. Nimbhorkar, and K. Varadarajan. The planar k-means problem is NP-hard. In WALCOM: Algorithms and Computation, pages 274–285. Springer, 2009.
[22] D. Pelleg and A. W. Moore. X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 727–734, 2000.
[23] N. Slonim, E. Aharoni, and K. Crammer. Hartigan's k-means versus Lloyd's k-means: Is it time for a change? In IJCAI, pages 1677–1684, 2013.
[24] M. Telgarsky and A. Vattani. Hartigan's method: k-means clustering without Voronoi. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 820–827, 2010.
[25] A. Vattani. k-means requires exponentially many iterations even in the plane. Discrete & Computational Geometry, 45(4):596–616, 2011.
[26] J. Xie, S. Jiang, W. Xie, and X. Gao. An efficient global k-means clustering algorithm. Journal of Computers, 6(2), 2011.