One of the major challenges in the theory of clustering is to bridge the large disconnect between our theoretical and practical understanding of the complexity of clustering. While theory tells us that most common clustering objectives, such as k-means or k-median, are intractable in the worst case, many heuristics like Lloyd’s algorithm or k-means++ seem to be effective in practice. In fact, this has led to the “CDNM” thesis [12, 10]: “Clustering is difficult only when it does not matter”.
We try to address the following natural questions in this paper: Why are real-world instances of clustering easy? Can we identify properties of real-world instances that make them tractable?
We focus on the Euclidean k-means clustering problem, where we are given points and need to find k centers minimizing the k-means objective. The k-means objective is the most widely studied objective for clustering points in Euclidean space. The problem is NP-hard in the worst case even for k = 2, and a constant-factor hardness of approximation is known for larger k.
One way to model real-world instances of clustering problems is through instance stability, which is an implicit structural assumption about the instance. Practically interesting instances of the k-means clustering problem often have a clear optimal clustering solution (usually the ground-truth clustering) that is stable, i.e., it remains optimal even under small perturbations. As argued in prior work, clustering objectives like k-means are often just a proxy for recovering a ground-truth clustering that is close to the optimal solution. Instances in practice always have measurement errors, and optimizing the k-means objective is meaningful only when the optimal solution is stable to these perturbations.
This notion of stability was formalized independently in a pair of influential works [12, 8]. The predominant strand of work on instance stability assumes that the optimal solution is resilient to multiplicative perturbations of the distances . For any , a metric clustering instance on point set and metric is said to be -factor stable iff the (unique) optimal clustering of remains the optimal solution for any instance where any subset of the distances is increased by up to a factor i.e., for any . In a series of recent works [5, 9] culminating in , it was shown that -factor perturbation stable (i.e., ) instances of k-means can be solved in polynomial time.
Multiplicative perturbation stability represents an elegant, well-motivated formalism that captures robustness to measurement errors for clustering problems in general metric spaces ( captures relative errors of 10% in the distances). However, multiplicative perturbation stability has the following drawbacks in the case of Euclidean clustering problems:
Measurement errors in Euclidean instances are better captured using additive perturbations. Uncertainty of in the position of leads to an additive error of in , irrespective of how large or small is.
The amount of stability, , needed to enable efficient algorithms (i.e., ) often implies strong structural conditions that are unlikely to be satisfied by many real-world datasets. For instance, -factor perturbation stability implies that every point is a multiplicative factor of closer to its own center than to any other cluster center.
Algorithms that are known to have provable guarantees under multiplicative perturbation stability are based on single-linkage or MST algorithms that are very non-robust by nature. In the presence of a few outliers or noise, any incorrect decision in the lower layers gets propagated up to the higher levels.
In this work, we consider a natural additive notion of stability for Euclidean instances: the optimal clustering should not change even when each point is moved a Euclidean distance of at most . This corresponds to a small additive perturbation to the pairwise distances between the points. (Note that not all additive perturbations to the distances can be captured by an appropriate movement of the points in the cluster; hence the notion we consider in our paper is a weaker assumption on the instance.) Unlike multiplicative notions of perturbation stability [12, 5], this notion of additive perturbation is not scale invariant. Hence the normalization or scale of the perturbation is important.
Ackerman and Ben-David  initiated the study of additive perturbation stability when the distance between any pair of points can be changed by at most with being the diameter of the whole dataset. The algorithms take time and correspond to polynomial time algorithms when are constants. However, this dependence of in the exponent is not desirable, since the diameter is a very non-robust quantity: the presence of one outlier (even one that is far away from the decision boundary) can increase the diameter arbitrarily. Hence, these guarantees are useful mainly when the whole instance lies within a small ball and the number of clusters is small [1, 11]. Our notion of additive perturbation stability will use a different scale parameter that is closely related to the distance between the centers instead of the diameter . Our results for additive perturbation stability have no explicit dependence on the diameter, and allow instances to have potentially unbounded clusters (as in the case of far-away outliers). With some additional assumptions, we also obtain polynomial time algorithmic guarantees for large .
1.1 Additive perturbation stability and our contributions
We consider a notion of additive stability where the points in the instance can be moved by at most , where is a parameter, and is the maximum distance between pairs of means. Suppose is a -means clustering instance with optimal clustering . We say that is -additive perturbation stable (-APS) iff every -additive perturbation of has as an optimal clustering solution. Note that there is no restriction on the diameter of the instance, or even the diameters of the individual clusters. Hence, our notion of additive perturbation stability allows the instance to be unbounded.
Geometric properties of -APS instances.
Clusters in the optimal solution of an -APS instance satisfy a natural geometric condition — there is an “angular separation” between every pair of clusters.
Proposition 1.1 (Geometric Implication of -APS).
Let be an -APS instance and let be two clusters in its optimal solution. Any point lies in a cone whose axis is along the direction with half-angle .
Hence, if is the unit vector along , then the angular separation condition (1) holds.
The distance between and the apex of the cone is . We will call the scale parameter of the clustering. See Figure 1a for an illustration.
We believe that many clustering instances in practice satisfy the -APS condition for reasonable constants . In fact, our experiments in Section 7 suggest that the above geometric condition is satisfied for reasonable values e.g., .
While the points can be arbitrarily far away from their own means, the above angular separation (1) is crucial in proving the polynomial time guarantees for our algorithms. For instance, this implies that at least of the points in a cluster are within a Euclidean distance of at most from . This geometric condition (1) of the dataset enables the design of a tractable algorithm for k = 2 [14]. See Section 4 for details on the k = 2 case.
Informal Theorem 1.2.
For any fixed , there exists a time algorithm that correctly clusters all -APS -means instances.
For k-means clustering, similar techniques can be used to learn the separating halfspace for each pair of clusters. However, this incurs an exponential dependence on k, which renders this approach inefficient for large k. (We remark that the results of  also incur an exponential dependence on k.) We now consider a natural strengthening of this assumption that allows us to achieve guarantees for general k.
Angular Separation with additional margin separation.
We consider a natural strengthening of additive perturbation stability where there is an additional margin between any pair of clusters. This is reminiscent of margin assumptions in supervised learning of halfspaces and spectral clustering guarantees of Kumar and Kannan (see Section 1.2). Consider a -means clustering instance with optimal solution . We say this instance is iff for each , the subinstance induced by has parameter scale , and all points in the clusters lie inside cones of half-angle , which are separated by a margin of at least . This is implied by the stronger condition that the subinstance induced by is -additive perturbation stable with scale parameter even when and are moved towards each other by . See Figure 1b for an illustration. stable instances are defined formally in geometric terms in Section 3.
Informal Theorem 1.3 (Polytime algorithm for instances).
There is an -time algorithm that, given any instance that is with , recovers its optimal clustering . (The Õ notation hides logarithmic factors in .)
A formal statement of the theorem (with unequal sized clusters) and its proof are given in Section 5. We prove these polynomial time guarantees for a new, simple algorithm (Algorithm 5.1). The algorithm constructs a graph with one vertex for each point, and edges between points that are within a distance of at most (for an appropriate threshold ). The algorithm then finds the -largest connected components and uses the empirical means of these components to cluster all the points.
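The steps described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the paper's Algorithm 5.1: the threshold `r` is taken as an input here rather than searched for, the function names are hypothetical, and the quadratic-time neighbor search is chosen for brevity rather than efficiency.

```python
import math
from collections import deque


def threshold_components(points, r):
    """Connected components of the graph joining points at distance <= r."""
    n = len(points)
    seen = [False] * n
    components = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        comp = []
        while queue:
            i = queue.popleft()
            comp.append(i)
            for j in range(n):
                if not seen[j] and math.dist(points[i], points[j]) <= r:
                    seen[j] = True
                    queue.append(j)
        components.append(comp)
    return components


def mean(points, idx):
    """Empirical mean of the points whose indices are listed in idx."""
    d = len(points[0])
    return tuple(sum(points[i][c] for i in idx) / len(idx) for c in range(d))


def cluster_by_threshold_graph(points, k, r):
    """Label every point by the nearest mean of the k largest components.

    Assumption: r is supplied by the caller (hypothetical interface).
    """
    comps = sorted(threshold_components(points, r), key=len, reverse=True)[:k]
    centers = [mean(points, comp) for comp in comps]
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]


points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (3.0, 0.2),
          (10.0, 10.0), (10.5, 10.0), (10.0, 10.5)]
labels = cluster_by_threshold_graph(points, k=2, r=1.0)
print(labels)
```

In this toy input the point (3.0, 0.2) forms its own small component, so it does not contribute to either empirical mean, but it is still assigned to the nearer of the two means in the final step.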
In addition to having provable guarantees, the algorithm also seems efficient in practice and performs well on standard clustering datasets. Experiments that we conducted on some standard clustering datasets from the UCI repository suggest that our algorithm manages to almost recover the ground truth and achieves a k-means objective cost that is comparable to that of Lloyd’s algorithm and k-means++.
In fact, our algorithm can also be used to initialize Lloyd’s algorithm: our guarantees show that when the instance is , one iteration of Lloyd’s algorithm already finds the optimal clustering. Experiments suggest that our algorithm finds initializers of smaller k-means cost compared to the initializers of k-means++ and also recovers the ground truth to good accuracy.
Experimental results and analysis of real-world data sets can be found in Section 7.
Robustness to outliers.
Perturbation stability requires the optimal solution to remain completely unchanged under any valid perturbation. In practice, the stability of an instance may be dramatically reduced by a few outliers. We show provable guarantees for a slight modification of Algorithm 5.1 in the setting where an -fraction of the points can be arbitrary outliers, and do not lie in the stable regions. Formally, we assume that we are given an instance where there is an (unknown) set of points with such that is a instance. Here is assumed to be less than the size of the smallest cluster by a constant factor. This is similar to robust perturbation resilience considered in [9, 17]. Our experiments in Section 7 indicate that the stability or separation can increase a lot after ignoring a few points close to the margin.
In what follows, and are the maximum and minimum weight of clusters, and .
Informal Theorem 1.4.
Given where is for
and , there is a polynomial time algorithm running in time that returns a clustering consistent with on .
This robust algorithm is effectively the same as Algorithm 5.1 with one additional step that removes all low-degree vertices in the graph. This step removes bad outliers in without removing too many points from .
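As a rough illustration of the additional pruning step, the sketch below drops every point that has too few neighbors in the threshold graph. The function name, the parameter `min_degree`, and the toy data are assumptions made for this example; the paper's actual degree threshold is not reproduced here.

```python
import math


def prune_low_degree(points, r, min_degree):
    """Indices of points with at least min_degree neighbors within distance r.

    Assumption: min_degree is chosen by the caller (hypothetical parameter).
    """
    kept = []
    for i, p in enumerate(points):
        degree = sum(1 for j, q in enumerate(points)
                     if j != i and math.dist(p, q) <= r)
        if degree >= min_degree:
            kept.append(i)
    return kept


points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5),
          (10.0, 10.0), (10.5, 10.0), (10.0, 10.5),
          (5.0, 5.0)]  # an isolated point standing in for an outlier
kept = prune_low_degree(points, r=1.0, min_degree=2)
print(kept)
```

Only the isolated point is removed in this example; the remaining points would then be passed on to the component-finding step.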
1.2 Comparisons to other related work
Awasthi et al. showed that -multiplicative perturbation stable instances also satisfy the notion of -center based stability (every point is a -factor closer to its center than to any other center) . They showed that an algorithm based on the classic single linkage algorithm works under this weaker notion when . This was subsequently improved by , and the best result along these lines  gives a polynomial time algorithm that works for . A robust version of -perturbation resilience was explored for center-based clustering objectives . As such, the notions of additive perturbation stability and instances are incomparable to the various notions of multiplicative perturbation stability. Further, as argued in , we believe that additive perturbation stability is more realistic for Euclidean clustering problems.
Ackerman and Ben-David initiated the study of various deterministic assumptions for clustering instances. The measure of stability most related to this work is Center Perturbation (CP) clusterability (an instance is -CP-clusterable if perturbing the centers by a distance of does not increase the cost much). A subtle difference is their focus on obtaining solutions with small objective cost, while our goal is to recover the optimal clustering. However, the main qualitative difference is how the length scale is defined — this is crucial for additive perturbations. The run time of the algorithm in is , where the length scale of the perturbations is , the diameter of the whole instance. Our notion of additive perturbations uses a much smaller length-scale of (essentially the inter-mean distance; see Prop. 1.1 for a geometric interpretation), and Theorem 1.2 gives a run-time guarantee of for (Theorem 1.2 is stated in terms of ). By using the largest inter-mean distance instead of the diameter as the length scale, our algorithmic guarantees can also handle unbounded clusters with arbitrarily large diameters and outliers.
The exciting results of Kumar and Kannan  and Awasthi and Sheffet also gave a deterministic margin-separation condition under which spectral clustering (PCA followed by k-means; this requires appropriate initializers, which they can obtain in polynomial time) finds the optimum clusters. Suppose is the “spectral radius” of the dataset, where is the matrix given by the centers. In the case of equal-sized clusters, the improved results of  prove approximate recovery of the optimal clustering if the margin between the clusters along the line joining the centers satisfies . Our notion of margin in instances is analogous to the margin separation notion used by the above results on spectral clustering [16, 7]. In particular, we require a margin of where is our scale parameter, with no extra factor. However, we emphasize that the two margin conditions are incomparable, since the spectral radius is incomparable to the scale parameter .
We now illustrate the difference between these deterministic conditions by presenting a couple of examples. Consider an instance with points drawn from a mixture of Gaussians in dimensions with identical diagonal covariance matrices with variance in the first coordinates and roughly in the others, and all the means lying in the subspace spanned by these first coordinates. In this setting, the results of [16, 7] require a margin separation of at least between clusters. On the other hand, these instances satisfy our geometric conditions with , and therefore our algorithm only needs a margin separation of (hence saving a factor of ). (Further, while algorithms for learning GMM models may work here, adding some outliers far from the decision boundary will cause many of these algorithms to fail, while our algorithm is robust to such outliers.) However, if the points were drawn from a mixture of spherical Gaussians in high dimensions (with ), then the margin condition required for [16, 7] is weaker.
In the -means clustering problem, we are given points in and need to find centers minimizing
A given choice of centers determines an optimal clustering where . We can rewrite the objective as
On the other hand, a given choice for cluster determines its optimal center as , the mean of the points in the set. Thus, we can reformulate the problem as minimizing over clusters of the objective
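The two views of the objective can be checked numerically. The sketch below uses the standard k-means cost (the sum, over all points, of the squared Euclidean distance to the nearest center) and the equivalent cluster-based form in which each cluster is charged against its own mean. The function names and the toy data are illustrative assumptions, not notation from the paper.

```python
import math


def cost_from_centers(points, centers):
    """Sum over points of the squared distance to the nearest center."""
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)


def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    d = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(d))


def cost_from_clusters(clusters):
    """Sum over clusters of squared distances to the cluster's own mean."""
    return sum(math.dist(p, centroid(c)) ** 2 for c in clusters for p in c)


clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0), (10.0, 4.0)]]
points = [p for c in clusters for p in c]
centers = [centroid(c) for c in clusters]

print(cost_from_centers(points, centers), cost_from_clusters(clusters))
```

Both calls give the same value on this input, and replacing a center by any point other than the cluster mean (for example (1.5, 0.0) for the first cluster) can only increase the center-based cost.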
k-means clustering is NP-hard for general Euclidean space even in the case of k = 2.
3 Stability definitions and geometric properties
3.1 Balance parameter
We define an instance parameter, , capturing how balanced a given instance’s clusters are.
Definition 3.1 (Balance parameter).
Given an instance with optimal clustering , we say satisfies balance parameter if for all , .
3.2 Additive perturbation stability
Definition 3.2 (-additive perturbation).
Let be a -means clustering instance with unique optimal clustering whose means are given by . Let . We say that is an -additive perturbation of if for all , .
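A small sketch may help make the definition concrete. It assumes that the movement budget is the product of the stability parameter and the largest distance between two cluster means, which is how Section 1.1 describes the perturbation in words; the names `eps`, `is_additive_perturbation`, and the toy data are hypothetical.

```python
import math
from itertools import combinations


def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    d = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(d))


def max_intermean_distance(clusters):
    """Largest distance between the means of two clusters."""
    means = [centroid(c) for c in clusters]
    return max(math.dist(a, b) for a, b in combinations(means, 2))


def is_additive_perturbation(original, perturbed, clusters, eps):
    """True if every point moved by at most eps * D (assumed form of the bound)."""
    budget = eps * max_intermean_distance(clusters)
    return all(math.dist(x, y) <= budget for x, y in zip(original, perturbed))


clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0), (12.0, 0.0)]]
original = [p for c in clusters for p in c]
small_move = [(x + 0.5, y) for x, y in original]
large_move = [(original[0][0] + 1.5, 0.0)] + original[1:]

print(is_additive_perturbation(original, small_move, clusters, eps=0.1))
print(is_additive_perturbation(original, large_move, clusters, eps=0.1))
```

Note that the budget is computed from the means of the optimal clustering of the original instance, so the check needs that clustering as an input.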
Definition 3.3 (-additive perturbation stability).
Let be a -means clustering instance with unique optimal clustering . We say that is -additive perturbation stable (APS) if every -additive perturbation of has an optimal clustering given by .
Intuitively, the difficulty of the clustering task increases as the stability parameter decreases. For example, when the set of -APS instances contains any instance with a unique solution. In the following we will only consider .
3.3 Geometric implication of -APS
Let be an -APS -means clustering instance such that each cluster has at least points. Fix and consider clusters , with means , . We fix the following notation.
Let and let .
Let be the unit vector in the intermean direction. Let be the space orthogonal to . For , let and be the projections onto and .
Let be the midpoint between and .
We can establish geometric conditions that must satisfy by considering different perturbations. As an example, one could move all points in and towards each other in the intermean direction a distance of ; by assumption, no point has crossed the separating hyperplane, and thus we can conclude the existence of a margin of width .
A careful choice of a family of perturbations allows us to prove Proposition 1.1. Consider the perturbation which moves and in opposite directions orthogonal to while moving a single point towards the other cluster parallel to (see Figure 2). The following lemma establishes Proposition 1.1.
Lemma 3.4.
For any , .
Let be a unit vector perpendicular to . Without loss of generality, let (taking or does not change the inequality). Let such that are distinct. Let and consider the -additive perturbation given by the union of
and an unperturbed copy of .
By assumption, remain optimal clusters in . We have constructed such that the new means of , are and , and the midpoint between the means is . The halfspace containing given by the linear separator between and is . Hence, as is classified correctly by the -APS assumption,
Then noting that , we have that . ∎
This geometric property follows from perturbations which only affect two clusters at a time. Our results follow from this weaker notion.
Motivated by Lemma 3.4, we define a geometric condition where the angular separation and margin separation are parametrized separately. These separations are implied by a stronger stability assumption where any pair of clusters is -APS with scale parameter even after being moved towards each other a distance of .
We say that a pair of clusters is -separated if their points lie in cones with axes along the intermean direction, half-angle , and apexes at distance from their means and at least from each other (see Figure 1b). Formally, we require the following.
Definition 3.5 (Pairwise -separation).
Given a pair of clusters , with means , , let be the unit vector in the intermean direction and let . We say that and are -separated if and for all ,
Definition 3.6 (-separation).
We say that an instance is -separated if every pair of clusters in the optimal clustering is -separated.
4 -means clustering for
In this section, we give an algorithm that is able to cluster -means -APS instances correctly.
Theorem 4.1.
There exists a universal constant such that for any fixed , there exists an time algorithm that correctly clusters all -APS -means instances.
The algorithm is inspired by prior work showing that the perceptron algorithm runs in polynomial time with high probability in the smoothed analysis setting.
4.1 Review of perceptron algorithm
Suppose is a sequence of labeled -samples consistent with a linear threshold function, i.e., there exists vector such that the labeling function is consistent with . At time , the perceptron algorithm sets . At each subsequent time step, the algorithm sees sample , outputs as its guess for , sees the true label , and updates . On a correct guess, , and on a mistake .
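The update rule reviewed above can be sketched as follows. This is the textbook perceptron with labels in {-1, +1}, written as an assumption about the intended form: the weight vector starts at zero, stays unchanged on a correct guess, and has `y * x` added on a mistake. Cycling over the samples for several passes and guessing +1 on a zero score are choices made only for this example.

```python
def perceptron(samples, passes=100):
    """Run the perceptron on (x, y) pairs with y in {-1, +1}.

    Returns the final weight vector and the total number of mistakes.
    """
    w = [0.0] * len(samples[0][0])
    mistakes = 0
    for _ in range(passes):
        clean_pass = True
        for x, y in samples:
            score = sum(wi * xi for wi, xi in zip(w, x))
            guess = 1 if score >= 0 else -1  # tie-break at 0 is an assumption
            if guess != y:
                # Mistake: move w towards y * x; otherwise w is unchanged.
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
                clean_pass = False
        if clean_pass:
            break
    return w, mistakes


samples = [((2.0, 1.0), 1), ((1.0, 2.0), 1),
           ((-2.0, -1.0), -1), ((-1.0, -3.0), -1)]
w, mistakes = perceptron(samples)
print(w, mistakes)
```

On this linearly separable toy sequence a single update suffices, after which every sample is classified correctly by the sign of its inner product with `w`.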
The following well-known theorem  bounds the number of total mistakes the perceptron algorithm can make in terms of the sequence’s angular margin.
The number of mistakes made by the perceptron algorithm is bounded above by for
For a universe of elements and a function , we will denote by the multiset where appears in the multiset -many times. The size of a multiset is . The next lemma is an immediate consequence of the above theorem (see proof in Appendix A).
Lemma 4.3.
There exists a multiset of size at most such that correctly classifies all of .
4.2 A perceptron-based clustering algorithm
Fix the following notation: let be an -APS -means clustering instance with optimal clusters , such that each cluster has at least points. Let , , . Without loss of generality, assume that .
Lemma 3.4 gives a lower bound for in the correctly-centered set . Thus Lemma 4.3 might suggest a simple algorithm: for each multiset of bounded size and each of its possible labels, compute the cost of the associated clustering, then output the clustering of minimum cost. However, a difficulty arises as the clusters , may not be linearly separable (in particular the separating hyperplane may not pass through the origin). Note that the guarantees of the perceptron algorithm, and hence Lemma 4.3, do not hold in this case. Instead, we will apply the above idea to an instance , constructed from , in which , are linearly separable and we can efficiently lower bound .
Consider the following algorithm.
4.3 Overview of proof of Theorem 4.1
Each new instance constructed in the algorithm has labeling consistent with some linear threshold function: . Then taking , we have that .
We will lower bound for a particular instance in which have nice properties. The following lemma states that on one of the iterations of its outer for loop, Algorithm 4.4 will pick such points.
Lemma 4.5.
There exist points , such that and .
The geometric conditions implied by -APS allow us to bound in terms of . In particular, using this handle on , it is possible to prove the following lower bound on .
Lemma 4.6.
There exists a constant such that for any satisfying Lemma 4.5, the corresponding instance has
The correctness of Algorithm 4.4 for all -APS -means clustering instances in which each cluster has at least points then follows from Lemmas 4.3, 4.5, and 4.6. On the other hand, the optimal -means clustering where one of the clusters has at most points can be calculated in time. An algorithm that returns the better of these two solutions thus correctly clusters all -APS -means instances, completing the proof of Theorem 4.1. See Appendix A.2 for proofs of Lemmas 4.5 and 4.6.
5 -means clustering for general
For general , we will require the stronger -separation. Consider the following algorithm.
Algorithm 5.1 recovers for any instance with and can be implemented in time.
This running time can be achieved by inserting edges into a dynamic graph in order, maintaining connected components and their means using a union-find data structure, and noting that the number of connected components can change at most times.
In particular, note that this algorithm does not need any prior knowledge of the stability parameters and its running time has no dependence on , , or .
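The implementation idea mentioned above, inserting edges in order of length while a union-find structure tracks the components and their means, can be sketched as below. The class and function names are hypothetical, and the sketch materializes and sorts all pairwise edges, which is a simplification for illustration rather than a claim about the paper's running time.

```python
import math
from itertools import combinations


class UnionFind:
    """Union-find that tracks each component's size and coordinate sum."""

    def __init__(self, points):
        self.parent = list(range(len(points)))
        self.size = [1] * len(points)
        self.total = [list(p) for p in points]
        self.components = len(points)

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def union(self, i, j):
        """Merge the components of i and j; return False if already merged."""
        a, b = self.find(i), self.find(j)
        if a == b:
            return False
        if self.size[a] < self.size[b]:
            a, b = b, a
        self.parent[b] = a
        self.size[a] += self.size[b]
        self.total[a] = [s + t for s, t in zip(self.total[a], self.total[b])]
        self.components -= 1
        return True

    def mean(self, i):
        """Empirical mean of the component containing point i."""
        root = self.find(i)
        return [s / self.size[root] for s in self.total[root]]


def merge_history(points):
    """Insert edges by increasing length; record (length, components left)."""
    edges = sorted(combinations(range(len(points)), 2),
                   key=lambda e: math.dist(points[e[0]], points[e[1]]))
    uf = UnionFind(points)
    history = []
    for i, j in edges:
        if uf.union(i, j):
            history.append((math.dist(points[i], points[j]), uf.components))
    return uf, history


points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
uf, history = merge_history(points)
print(history)
```

Each successful union reduces the number of components by exactly one, so the set of components changes at most once per merge, and only the edge lengths at which a merge happens need to be considered as candidate thresholds.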
Define the following regions of for every pair . Given , let be the corresponding clusters with means . Let be the unit vector in the inter-mean direction.
See Figure 1b for an illustration.
It suffices to prove the following two lemmas. Lemma 5.4 states that the initialization returned by the INITIALIZE subroutine satisfies certain properties when we guess correctly. As is only used as a threshold on edge lengths, testing the distances between all pairs of data points i.e. suffices. Lemma 5.5 states that the ASSIGN subroutine correctly clusters all points given an initialization satisfying these properties.
Lemma 5.4.
For a instance with balance parameter and , the INITIALIZE subroutine finds a set where when .
Lemma 5.5.
For a instance with , the ASSIGN subroutine recovers correctly when initialized with points where .
5.1 Proof of Lemma 5.4.
Suppose and consider the graph constructed by Algorithm 5.1. We start by defining the core region of each cluster.
Definition 5.6 ().
The core regions are defined in such a way that for each cluster , all points in belong to a single connected component. Although may not contain too many points on its own, the connected component containing will contain most (at least fraction) of the points in . Hence, the largest components will be the connected components containing the different core regions. Finally, since the connected component containing contains most of the points in , the geometric conditions of -separation ensure that the empirical mean of the connected component lies in . The following lemma states some properties of the connected components in our graph. Its proof can be found in Appendix B.1.
Any connected component only contains points from a single cluster.
For all , . There is a point such that .
For all , let . Then, .
For all , is connected in .
For all , is connected in .
The largest component, , in each cluster contains for each . In particular, , and contains .
Lemma 5.8 states that the largest components (and hence ) must belong to different clusters, while Lemma 5.9 states that each lies inside a good region. Together, they imply Lemma 5.4, i.e., each comes from a different good region.
Lemma 5.8.
The set of largest components of contains the largest component of each cluster.
Let be the largest component in and let be a component in that is not the largest. Then by the parameter, . It follows that the largest connected components are . ∎
Lemma 5.9.
The mean of points in lies in .
Let be the mean of the points in . As is a convex set, . As , the points not contained in have . Noting that , it follows that . Hence, . As this holds for each , . ∎
5.2 Proof of Lemma 5.5.
We will show that for any , , and , is closer to than to . The following lemma states some properties of the perpendicular bisector between and . These statements follow from the definitions of the nice regions and the angular separation. Its proof can be found in Appendix B.2.
Suppose . Then, for and , we have
To prove Lemma 5.5, we rewrite the condition as . Then we write each vector in terms of their projection on and and use the above lemma to bound each of the terms.
6 Robust -means
A simple extension of Algorithm 5.1 does well even in the presence of adversarial noise for instances with -separation for large enough . Specifically, we consider the following model.
Let be a -means clustering instance with optimal clustering . We call the set of pure points. An additional set of at most -many impure points is added by an adversary. Our goal is to find a clustering of that agrees with on the pure points.
Let and let be the maximum and minimum weight of clusters. We will assume that .
7 Experimental results
We evaluate Algorithm 5.1 on multiple real-world datasets, compare its performance to that of k-means++, and check how well these datasets satisfy our geometric conditions.
Experiments were run on unnormalized and normalized versions of four labeled datasets from the UCI Machine Learning Repository: Wine (, , ), Iris (, , ), Banknote Authentication (, , ), and Letter Recognition (, , ). Normalization was used to scale each feature to unit range.
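For concreteness, scaling each feature to unit range can be read as min-max scaling of every column to the interval [0, 1]. The sketch below implements that reading; the function name, the handling of constant features (mapped to 0.0), and the toy rows are assumptions for illustration and are not taken from the paper's experimental code.

```python
def scale_to_unit_range(rows):
    """Rescale each feature (column) linearly so its values span [0, 1].

    Assumption: a constant feature is mapped to 0.0 to avoid division by zero.
    """
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [
        tuple((v - lo) / span if span > 0 else 0.0
              for v, lo, span in zip(row, lows, spans))
        for row in rows
    ]


rows = [(1.0, 10.0, 5.0), (2.0, 30.0, 5.0), (3.0, 20.0, 5.0)]
scaled = scale_to_unit_range(rows)
print(scaled)
```

Because the additive notion of stability is not scale invariant, this kind of per-feature rescaling can change the separation parameters that a dataset satisfies, which is one reason to report results for both versions of each dataset.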
The cost of the solution returned by Algorithm 5.1 for each of the normalized and unnormalized versions of the datasets is recorded in Table 1 column 2. Our guarantees show that under -separation for appropriate values of (see Section 5), the algorithm will find the optimal clustering after a single iteration of Lloyd’s algorithm. Even when does not satisfy our requirement, we can use our algorithm as an initialization heuristic for Lloyd’s algorithm. We compare our initialization with the k-means++ initialization heuristic ( weighting). In Table 1, this is compared to the smallest initialization cost of 1000 trials of k-means++ on each of the datasets, the solution found by Lloyd’s algorithm using our initialization, and the smallest k-means cost of 100 trials of Lloyd’s algorithm using a k-means++ initialization.
Separation in real data sets.
As the ground truth clusterings in our datasets are not in general linearly separable, we consider the clusters given by Lloyd’s algorithm initialized with the ground truth solutions.
Values of for Lemma 3.4. We calculate the maximum value of such that every pair of clusters satisfies the angular and margin separations implied by -APS (Lemma 3.4). The results are recorded in Table 2. We see that the average value of lies approximately in the range .
Values of -separation. We attempt to measure the values of , , and in the datasets. For , , and a pair of clusters , , we calculate as the maximum margin separation a pair of axis-aligned cones with half-angle can have while capturing a -fraction of all points. For some datasets and values for and , there may not be any such value of , in this case we leave the corresponding entry blank. These results are collected in Table 3.
Ground truth recovery.
The clustering returned by our algorithm recovers well () the solution returned by Lloyd’s algorithm initialized with the ground truth for Wine, Iris, and Banknote Authentication across normalized and unnormalized datasets.
|Dataset|Alg 5.1|k-means++|Alg 5.1 with Lloyd’s|k-means++ with Lloyd’s|
|Letter Rec. (norm.)|3367.8|4092.1|2767.5|2742.3|
|Letter Rec. (norm.)||8.49e-06||0.0564||0.247|