Sims/figs for "Recovery guarantees for exemplar-based clustering," by Abhinav Nellore and Rachel Ward.
For a certain class of distributions, we prove that the linear programming relaxation of k-medoids clustering---a variant of k-means clustering where means are replaced by exemplars from within the dataset---distinguishes points drawn from nonoverlapping balls with high probability once the number of points drawn and the separation distance between any two balls are sufficiently large. Our results hold in the nontrivial regime where the separation distance is small enough that points drawn from different balls may be closer to each other than points drawn from the same ball; in this case, clustering by thresholding pairwise distances between points can fail. We also exhibit numerical evidence of high-probability recovery in a substantially more permissive regime.
Consider a collection of points in Euclidean space that forms roughly isotropic clusters. The centroid of a given cluster is found by averaging the position vectors of its points, while the medoid, or exemplar, is the point from within the collection that best represents the cluster. To distinguish clusters, it is popular to pursue the k-means objective: partition the points into k clusters such that the average squared distance between a point and its cluster centroid is minimized. This problem is in general NP-hard [1, 2]. Further, it has no obvious convex relaxation, which could recover the global optimum while admitting efficient solution; practical algorithms like Lloyd's and Hartigan-Wong typically converge to local optima. k-medoids clustering (sometimes called k-medians clustering in the literature) is also in general NP-hard [5, 6], but it does admit a linear programming (LP) relaxation. The objective is to select k points as medoids such that the average squared distance (or other measure of dissimilarity) between a point and its medoid is minimized. This paper obtains guarantees for exact recovery of the unique globally optimal solution to the k-medoids integer program by its LP relaxation. Commonly used algorithms that may only converge to local optima include partitioning around medoids (PAM) [7, 8] and affinity propagation [9, 10].
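The distinction between a centroid and a medoid can be made concrete with a minimal sketch. It assumes NumPy and squared Euclidean distances as the dissimilarity; the function names and the four sample points are illustrative and do not come from the paper.

```python
import numpy as np

def centroid(points):
    """Average of the position vectors; generally not one of the data points."""
    return points.mean(axis=0)

def medoid_index(points):
    """Index of the data point with the smallest total squared distance to all points."""
    diffs = points[:, None, :] - points[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=-1)  # pairwise squared Euclidean distances
    return int(sq_dists.sum(axis=1).argmin())

# Illustrative data: three nearby points and one outlier.
pts = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.0], [6.0, 6.0]])
c = centroid(pts)          # (2.25, 1.75), which is not a data point
m = medoid_index(pts)      # 2, i.e. the data point (1, 1)
```

The outlier drags the centroid away from the three nearby points, while the medoid remains an actual member of the dataset.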
To illustrate the difference between a centroid and a medoid, let us put faces to points. The Yale Face Database has grayscale images of several faces, each captured wearing a range of expressions: normal, happy, sad, sleepy, surprised, and winking. Suppose every point encodes an image from this database as the vector of its pixel values. Intuitively, facial expressions represent perturbations of a background composed of distinguishing image features; it is thus natural to expect that the faces cluster by individual rather than expression. Both Lloyd's algorithm and affinity propagation are shown to recover this partitioning in Figure 1, which also displays centroids and medoids of clusters. (Randomly initialized repetitions of Lloyd's algorithm were run; the clustering that gave the smallest objective function value is shown. The package APCluster was used to perform affinity propagation.) The centroids are averaged faces, but the medoids are actual faces from the dataset. Indeed, applications of k-medoids clustering are numerous and diverse: besides finding representative faces from a gallery of images, it can group tumor samples by gene expression levels and pinpoint the influencers in a social network.
We formulate k-medoids clustering on a complete weighted undirected graph with vertices, although recovery guarantees are proved for the case where vertices correspond to points in Euclidean space and each edge weight is the squared distance between the points it connects. (We use squared distances rather than unsquared distances only because we were able to derive stronger theoretical guarantees using squared distances.) Let characters in boldface (“”) refer to matrices/vectors and italicized counterparts with subscripts (“”) refer to matrix/vector elements. Denote as the nonnegative weight of the edge connecting vertices and , and note that since is simple. k-medoids clustering (KMed) finds the minimum-weight bipartite subgraph of such that and every vertex in has unit degree. The vertices in are the medoids. Expressed as a binary integer program, KMed is
Above, means the set . When , vertex serves as vertex ’s medoid; that is, among all edges between medoids and , the edge between and has the smallest weight. Otherwise, . A cluster is identified as a maximal set of vertices that share a given medoid.
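For reference, one standard way to write a k-medoids integer program that matches the description above is sketched here. The symbols are assumptions made for this sketch: $N$ is the number of vertices, $w_{ij}$ the edge weight between vertices $i$ and $j$, $z_{ij}$ the indicator that vertex $j$ serves as the medoid of vertex $i$, and $[N] = \{1, \dots, N\}$. The ordering of the constraints is chosen so that the binary constraint comes last.

```latex
\begin{align}
\min_{\mathbf{z} \in \mathbb{R}^{N \times N}} \quad & \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij}\, z_{ij} \\
\text{s.t.} \quad & \sum_{j=1}^{N} z_{ij} = 1, && i \in [N], \\
& \sum_{j=1}^{N} z_{jj} \le k, \\
& z_{ij} \le z_{jj}, && i, j \in [N], \\
& z_{ij} \in \{0, 1\}, && i, j \in [N].
\end{align}
```

In this form, the first constraint assigns every vertex to exactly one medoid, the second caps the number of medoids at $k$, and the third allows assignment only to vertices that are themselves medoids.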
Like many clustering programs, KMed is in general NP-hard and thus computationally intractable for a large . Replacing the binary constraints (5) with nonnegativity constraints, we obtain the linear program relaxation LinKMed:
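The relaxation can be tried numerically on a toy instance. The following is a minimal sketch, assuming SciPy's `linprog` with the HiGHS backend and a standard formulation in which `z[i, j]` is the weight with which point `j` serves as the medoid of point `i`; the helper name `solve_lin_kmed` and the six sample points are illustrative, and the paper's own experiments used MATLAB with Gurobi instead.

```python
import numpy as np
from scipy.optimize import linprog

def solve_lin_kmed(w, k):
    """Solve an LP relaxation of k-medoids for a dissimilarity matrix w."""
    n = w.shape[0]
    # Equality constraints: each point distributes total weight one over medoids.
    a_eq = np.zeros((n, n * n))
    for i in range(n):
        a_eq[i, i * n:(i + 1) * n] = 1.0
    b_eq = np.ones(n)
    # First inequality: the diagonal entries (medoid weights) sum to at most k.
    rows = [np.zeros(n * n)]
    for j in range(n):
        rows[0][j * n + j] = 1.0
    # Remaining inequalities: z[i, j] <= z[j, j] for all i != j.
    for i in range(n):
        for j in range(n):
            if i != j:
                row = np.zeros(n * n)
                row[i * n + j] = 1.0
                row[j * n + j] = -1.0
                rows.append(row)
    a_ub = np.array(rows)
    b_ub = np.concatenate(([float(k)], np.zeros(len(rows) - 1)))
    res = linprog(w.ravel(), A_ub=a_ub, b_ub=b_ub, A_eq=a_eq, b_eq=b_eq,
                  bounds=(0, 1), method="highs")
    return res.x.reshape(n, n)

# Two tight, well-separated groups in the plane (illustrative data).
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
w = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)  # squared distances
z = solve_lin_kmed(w, k=2)
medoids = np.flatnonzero(np.diag(z) > 0.5)  # points with z[j, j] = 1
labels = z.argmax(axis=1)                   # medoid index assigned to each point
```

On this instance the LP optimum is integral, so reading off the diagonal of `z` gives the medoids directly and no rounding step is needed.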
For a vector (point) , let denote its norm. It is known that for any configuration of points in one-dimensional Euclidean space, the LP relaxation of k-medoids clustering invariably recovers clusters when unsquared distances are used to measure dissimilarities between points. Therefore, we confine our attention to . The following is our main recovery result, and its proof is obtained in the third section.
Consider unit balls in -dimensional Euclidean space (with ) for which the centers of any two balls are separated by a distance of at least . From each ball, draw points as independent samples from a spherically symmetric distribution supported in the ball satisfying
Suppose that squared distances are used to measure dissimilarities between points, . Then there exist values of and for which the following statement holds: with probability exceeding , the optimal solution to k-medoids clustering (KMed) is unique and agrees with the unique optimal solution to (LinKMed), and assigns the points in each ball to their own cluster.
The uniform distribution satisfies (11) in dimension , but for , a distribution satisfying (11) concentrates more probability mass towards the center of the ball. This means that the recovery results of Theorem 1 are stronger for smaller . However, by applying a random projection, points in dimensions can be projected into dimensions while preserving pairwise Euclidean distances up to a multiplicative factor . In this sense, clustering problems in high-dimensional Euclidean space can be reduced to problems in low-dimensional Euclidean space.
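The dimension reduction mentioned here can be illustrated with a minimal sketch. It assumes a Gaussian random projection in the style of the Johnson-Lindenstrauss lemma, which is one common choice and not necessarily the one the authors have in mind; the dimensions 2000 and 500 and the function names are arbitrary illustrations.

```python
import numpy as np

def random_project(points, target_dim, rng):
    """Project rows of `points` to `target_dim` dimensions with a scaled Gaussian matrix."""
    source_dim = points.shape[1]
    proj = rng.normal(size=(source_dim, target_dim)) / np.sqrt(target_dim)
    return points @ proj

def pairwise_dists(x):
    """Matrix of Euclidean distances between the rows of x."""
    diffs = x[:, None, :] - x[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
high = rng.normal(size=(20, 2000))       # 20 points in 2000 dimensions
low = random_project(high, 500, rng)     # the same points in 500 dimensions

# Ratio of projected to original distance for every pair of distinct points.
iu = np.triu_indices(20, k=1)
ratios = pairwise_dists(low)[iu] / pairwise_dists(high)[iu]
```

The entries of `ratios` cluster around one, which is the sense in which pairwise distances are preserved up to a multiplicative factor.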
Once the centers of any two unit balls are separated by a distance of 4, points from within the same ball are necessarily at closer distance than points from different balls. For the k-medoid problem, cluster recovery guarantees in this regime are given in . As far as the authors are aware, Theorem 1 provides the first recovery guarantees for k-medoids beyond this regime.
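The threshold of 4 follows from the triangle inequality. As a short worked step, write $R$ for the separation between the centers of two unit balls (a symbol assumed for this sketch), let $x, y$ lie in one ball and $y'$ in the other:

```latex
\[
\|x - y\| \le 2,
\qquad
\|x - y'\| \ge R - 2,
\]
\[
R - 2 \ge 2 \iff R \ge 4 .
\]
```

For $R < 4$ the two bounds overlap, so a pair of points from different balls can be closer together than a pair from the same ball.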
While the literature on clustering is extensive, three lines of inquiry are closely related to the results contained here.
Recovery guarantees for clustering by convex programming. Our work is aligned in spirit with the tradition of the compressed sensing community, which has sought probabilistic recovery guarantees for convex relaxations of nonconvex problems. Reference  presents such guarantees for the densest -clique problem : partition a complete weighted graph into disjoint cliques so that the sum of their average edge weights is minimized. Also notable are [20, 21, 22, 23], which find recovery guarantees for correlation clustering  and variants. Correlation clustering outputs a partitioning of the vertices of a complete graph whose edges are labeled either “” (agreement) or “” (disagreement); the partitioning minimizes the number of agreements within clusters plus the number of disagreements between clusters.
In all papers mentioned in the previous paragraph, the probabilistic recovery guarantees apply to the stochastic block model (also known as the planted partition model) and generalizations. Consider a graph with vertices, initially without edges. Partition the vertices into clusters. The stochastic block model [25, 26] is a random model that draws each edge of the graph independently: the probability of a “” (respectively, “”) edge between two vertices in the same cluster is (respectively, ), and the probability of a “” (respectively, “”) edge between two vertices in different clusters is (respectively, ). Unfortunately, any model in which edge weights are drawn independently does not include graphs that represent points drawn independently in a metric space. For these graphs, the edge weights are interdependent distances.
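For contrast with the geometric model studied in this paper, the stochastic block model can be sampled with a short sketch. It assumes NumPy; the parameter names `p_in` and `p_out` for the within-cluster and between-cluster probabilities of an agreement edge are hypothetical labels introduced here.

```python
import numpy as np

def sample_sbm(labels, p_in, p_out, rng):
    """Sample a symmetric boolean matrix; True marks an agreement edge.

    Each unordered pair of vertices is drawn independently, with probability
    p_in if both vertices share a cluster label and p_out otherwise.
    """
    labels = np.asarray(labels)
    n = labels.size
    same = labels[:, None] == labels[None, :]
    probs = np.where(same, p_in, p_out)
    upper = np.triu(rng.random((n, n)) < probs, k=1)  # one draw per unordered pair
    return upper | upper.T

rng = np.random.default_rng(0)
labels = [0, 0, 0, 1, 1]
edges = sample_sbm(labels, p_in=0.9, p_out=0.1, rng=rng)
```

Every entry above the diagonal is an independent draw, which is exactly the property that fails when edge weights are distances between points in a metric space.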
that lies closest to a set of points. This problem has only trivial overlap with ours; exemplars are “zero-dimensional hyperplanes” that lie close to clustered points, but there is only one zero-dimensional subspace of , namely the origin. Reference , on the other hand, introduces a tractable convex program that does find medoids. This program can be recast as a dualized form of k-medoids clustering. However, the deterministic guarantee of :
applies only to the case where the clusters are recoverable by thresholding pairwise distances; that is, two points in the same cluster must be closer than two points in different clusters. Our probabilistic guarantees include a regime where such thresholding may fail.
specifies that a regularization parameter in the objective function must be lower than some critical value for medoids to be recovered. is essentially a dual variable associated with the k of k-medoids, and it remains unchosen in the Karush-Kuhn-Tucker conditions used to derive the guarantee of . The number of medoids obtained is thus unspecified. By contrast, we guarantee recovery of a specific number of medoids.
Recovery guarantees for learning mixtures of Gaussians.
We derive recovery guarantees for a random model where points are drawn from isotropic distributions supported in nonoverlapping balls. This is a few steps removed from a Gaussian mixture model. Starting with the work of Dasgupta, several papers (a representative sample is [32, 33, 34, 35, 36, 37, 38, 39]) already report probabilistic recovery guarantees for learning the parameters of Gaussian mixture models using algorithms unrelated to convex programming. Hard clusters can be found after obtaining the parameters by associating each point with the Gaussian whose contribution to the mixture model is largest at that point. The questions here are different from ours: under what conditions does a given polynomial-time algorithm (not a convex program, which admits many algorithmic solution techniques) recover the global optimum? How close are the parameters obtained to their true values? The progression of this line of research has been toward reducing the separation distances between the centers of the Gaussians in the guarantees; in fact, the separation distances can be zero if the covariance matrices of the Gaussians differ [40, 41]. Our results are not intended to compete with these guarantees. Rather, we seek to provide complementary insights into how often clusters of points in Euclidean space are recovered by LP.
Approximation algorithms for k-medoids clustering and facility location. As mentioned above, for any configuration of points in one-dimensional Euclidean space, the LP relaxation of k-medoids clustering exactly recovers medoids for dissimilarities that are unsquared distances. In more than one dimension, nonintegral optima whose costs are lower than that of an optimal integral solution may be realized. There is a large literature on approximation algorithms for k-medoids clustering based on rounding the LP solution and other methods. This literature encompasses a family of related problems known as facility location. The only differences between the uncapacitated facility location problem (UFL) and k-medoids clustering are that in UFL 1) only certain points are allowed to be medoids, 2) there is no constraint on the number of clusters, and 3) there is a cost associated with choosing a given point as a medoid.
Constant-factor approximation algorithms have been obtained for metric flavors of UFL and k-medoids clustering, where the measures of distance between points used in the objective function must obey the triangle inequality. Reference  obtains the first polynomial-time approximation algorithm for metric UFL; it comes within a factor of 3.16 of the optimum. Several subsequent works give algorithms that improve this approximation ratio [43, 44, 45, 46, 47, 48, 49, 50, 51, 52]. It is established in  that unless , the lower bounds on approximation ratios for metric UFL and metric k-medoids clustering are, respectively, and . Here, is the solution to . In unpublished work, Sviridenko strengthens the complexity criterion for these lower bounds to . (See Theorem 4.13 of Vygen's notes for a proof. We thank Shi Li for drawing our attention to this result.) The best known approximation ratios for metric UFL and metric k-medoids clustering are, respectively,  and . Before the 2012 paper , only a -approximation algorithm had been available since 2001 . Because there is still a large gap between the current best approximation ratio for k-medoids clustering () and the theoretical limit (), finding novel approximation algorithms remains an active area of research. Along related lines, a recent paper  gives the first constant-factor approximation algorithm for a generalization of k-medoids clustering in which more than one medoid can be assigned to each point.
We emphasize that our results are recovery guarantees; instead of finding a novel rounding scheme for LP solutions, we give precise conditions for when solving an LP yields the k-medoids clustering. In addition, our proofs are for squared distances, which do not respect the triangle inequality.
The next section of this paper uses linear programming duality theory to derive sufficient conditions under which the optimal solution to the k-medoids integer program KMed coincides with the unique optimal solution of its linear programming relaxation LinKMed. The third section obtains probabilistic guarantees for exact recovery of an integer solution by the linear program, focusing on recovering clusters of points drawn from separated balls of equal radius. Numerical experiments demonstrating the efficacy of the linear programming approach for recovering clusters beyond our analytical results are reviewed in the fourth section. The final section discusses a few open questions, and an appendix contains one of our proofs.
Let be the index of vertex ’s medoid and be . For points in Euclidean space, is the index of the second-closest medoid to point . For simplicity of presentation, take when there is only one medoid. Denote as the set of points whose medoid is indexed by . Let refer to the positive part of the term enclosed in parentheses. Begin by writing a necessary and sufficient condition for a unique integral solution to LinKMed.
LinKMed has a unique optimal solution that coincides with the optimal solution to KMed if and only if there exist some and such that
Proposition 4 rewrites the Karush-Kuhn-Tucker (KKT) conditions corresponding to the linear program LinKMed in a convenient way; refer to the appendix for a derivation. Let be the number of points in the same cluster as point . Choose to obtain the following tractable sufficient condition for medoid recovery.
LinKMed has a unique optimal solution that coincides with the optimal solution to KMed if there exists a such that
for , , and
The choice of the KKT multipliers made here is democratic: each cluster has a total of “votes,” which it distributes proportionally among the for .
Now consider the dual certificates contained in the following two corollaries.
If KMed has a unique optimal solution , LinKMed also has a unique optimal solution when
Choose points from within balls in , each of radius , for which the centers of any two balls are separated by a distance of at least . Measure in units of a ball’s radius by setting . Take , where is the distance between points and and . The inequality (14) is satisfied for
by assigning the points chosen from each ball to their own cluster. Here, and are the maximum and minimum numbers of points drawn from any one of the balls, respectively. In the limit , (15) becomes .
Impose both when points and are in the same cluster and when points and are in different clusters. Combined with (13), the restrictions on are then
Condition (16) holds by definition of a medoid unless the optimal solution to KMed itself is not unique. In that event, it may be possible for a nonmedoid and a medoid in the same cluster to trade roles while maintaining solution optimality, making the LHS of (16) vanish for some . The phrasing of the corollary accommodates this edge case. ∎
The inequality (17) requires for in the same cluster as but a different cluster from . So any two points in the same cluster must be closer than any two points in different clusters.
Corollary 7 does not illustrate the utility of LP for solving KMed. Given the conditions of a recovery guarantee, clustering could be performed without LP using some distance threshold : place two points in the same cluster if the distance between them is smaller than , and ensure two points are in different clusters if the distance between them is greater than . In the separated balls model of Remark 8, guarantees that two points in the same ball are closer than two points in different balls. The next corollary is needed to obtain results for .
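The thresholding procedure described above can be sketched as follows, assuming NumPy. The name `tau` for the distance threshold and the sample points are hypothetical, and the sketch groups points by connected components of the graph that links pairs closer than `tau`, which coincides with the rule in the text whenever a valid threshold exists.

```python
import numpy as np

def threshold_clusters(points, tau):
    """Label points so that any two points closer than tau share a label."""
    n = len(points)
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    labels = -np.ones(n, dtype=int)  # -1 marks a point not yet visited
    current = 0
    for start in range(n):
        if labels[start] >= 0:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first search over the "closer than tau" graph
            u = stack.pop()
            for v in np.flatnonzero((dists[u] < tau) & (labels < 0)):
                labels[v] = current
                stack.append(int(v))
        current += 1
    return labels

# Unit balls centered at (0, 0) and (5, 0): the separation 5 exceeds 4,
# so any tau between 2 and 3 separates the balls.
pts = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 0.5],
                [4.0, 0.0], [6.0, 0.0], [5.0, -0.5]])
labels = threshold_clusters(pts, tau=2.5)
```

When the separation drops below 4, no single `tau` is guaranteed to work, which is the regime the LP-based guarantees are meant to address.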
LinKMed has a unique optimal solution that coincides with the optimal solution to KMed if
The inequality (22) requires for and in the same cluster.
Corollary 7 imposes both extra upper bounds and extra lower bounds on in its proof. When two points in different clusters are closer than two points in the same cluster, cannot simultaneously satisfy these upper and lower bounds. To break this “thresholding barrier,” Corollary 10 imposes only extra lower bounds on and permits large . Stronger recovery guarantees are obtained for large when medoids are sparsely distributed among the points. (Note that the optimal solution is -column sparse.) The next subsection obtains probabilistic guarantees using Corollary 10 for a variant of the separated balls model of Remark 8.
The theorem stated in the introduction is proved in this section. Consider nonoverlapping -dimensional unit balls in Euclidean space for which the centers of any two balls are separated by a distance of at least . Take to be the squared distance between points and . Under a mild assumption about how points are drawn within each ball, the exact recovery guarantee of Remark 8 is extended in this subsection to the regime , where two points in the same cluster are not necessarily closer to each other than two points in different clusters. In particular, let the points in each ball correspond to independent samples of an isotropic distribution supported in the ball and which obeys
Above, is the vector extending from the center of the ball to a given point, and refers to the norm of . In dimensions, the assumption (23) holds for the uniform distribution supported in the ball. For larger , (23) requires distributions that concentrate more probability mass closer to the ball’s center. For simplicity, we assume in the sequel that the number of points drawn from each ball is equal. Let denote an expectation and
a variance. We state a preliminary lemma.
Consider sampled independently from an isotropic distribution supported in a -dimensional unit ball which satisfies
Use squared Euclidean distances to measure dissimilarities between points. Let be the medoid of the set and . Assume that , , and . With probability exceeding , all of the following statements are true.
for all .
First prove statement 1. Note that for ,
Above, . Since the are drawn from an isotropic distribution, the direction of the unit vector is independent of and drawn uniformly at random. It follows that the for
are i.i.d. zero-mean random variables despite how depends on the . Indeed, for ,
where the last equality is obtained by integrating in generalized spherical coordinates. Bernstein’s inequality thus gives
with probability exceeding . For , the inequalities above clearly hold with unit probability. Take a union bound over the other to obtain that statement 1 holds with probability exceeding (for valid as specified in (27)).
Now observe that
It follows that statement 2 holds with probability exceeding . Moreover, statements 1 and 2 together hold with probability exceeding . Condition on them, and prove statement 3 by contradiction: suppose that exceeds . Then because ,
But from statement 1 for , this implies that
which violates the assumption that is a medoid. So all three statements hold with probability exceeding , which is the content of the lemma. ∎
We now write the main result of this section.
(Restatement of Theorem 1.) Consider unit balls in -dimensional Euclidean space (with ) for which the centers of any two balls are separated by a distance of , . From each ball, draw points as independent samples from an isotropic distribution supported in the ball which satisfies
Suppose that squared distances are used to measure dissimilarities between points, . For each , there exist values of and for which the following statement holds: with probability exceeding , the unique optimal solution to each of k-medoids clustering and its linear programming relaxation assigns the points in each ball to their own cluster.
Now simplify the sufficient condition of Corollary 10 with and every . Let
be the maximum distance of a medoid to the center of its respective ball from statement 3 of Lemma 12. Note that
where the upper bound is surmised by considering point collinear with and between points and , both on one ball’s boundary. So take to narrow the sufficient condition of Corollary 10. Also note the requirement (19):
Obtain a lower bound of on the RHS by considering a point on the boundary of one ball collinear with points and . Impose
to ensure the RHS exceeds the LHS. This is equivalent to
Given the stipulations of the previous paragraph and (20) of Corollary 10, the following holds: each of k-medoids clustering and its LP relaxation has a unique optimal solution that assigns the points in each ball to their own cluster if for ,
Denote as the center of the ball associated with point . (This becomes a slight abuse of notation because is not an index of any point drawn, but all its usages contained here should be clear.) Find conditions under which (39) holds by treating two complementary cases of the separately:
. First, bound the number of points in a given cluster for which the inequalities spanning the previous sentence hold. From the distribution (11),
Hoeffding’s inequality thus gives
Take to obtain
Condition on the event captured in the equality above holding for every cluster. This occurs with probability exceeding . Next, observe that for , considering (as for (40)) point collinear with and between points and gives the deterministic bound
where extends from the center of the ball corresponding to . The expression on the RHS has a minimum at . Because , provided
a lower bound on the LHS of (46) is given by
It is easily verified numerically that this inequality is satisfied when for —as are the other bounds (38) and (47) on . Further, for any dimension , there exists some finite large enough such that (35) is satisfied. Similar checks may be performed to obtain other valid combinations of the parameters; more such combinations are contained in Remark 14.
Consider nonoverlapping -dimensional unit balls in for which the separation distance between the centers of any two balls is exactly . Consider the two cases that follow, referenced later as Case 1 and Case 2.
Each ball is the support of a uniform distribution.
Each ball is the support of a distribution that satisfies
where is the Euclidean distance from the center of the ball, and is some vector in . For , this is a uniform distribution. Equation (50) saturates the inequality (23), which is the distributional assumption of our probabilistic recovery guarantees.
Given one of these cases, sample each of the distributions times so that points are drawn from each ball. Solve LinKMed for this configuration of points and record when
a solution to KMed is recovered. (Call this “cluster recovery.”)
a recovered solution to KMed places points drawn from distinct balls in distinct clusters, the situation for which our recovery guarantees apply. (Call this “ball recovery,” a sufficient condition for cluster recovery.)
Examples of ball recoveries and cluster recoveries that are not ball recoveries are displayed in Figure 2 for .
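The sampling step of these simulations can be sketched in Python with NumPy; the paper's own code was written in MATLAB. Everything named here is hypothetical: `ball_centers` handles only the two- and three-ball configurations, `radial_power` sets the radial law (radius equal to a uniform variate raised to that power), and the value 0.5 used for Case 2 is an assumption that the radial distribution function is the square of the distance from the center, chosen because that law coincides with the uniform distribution in the plane.

```python
import numpy as np

def ball_centers(k, d, separation):
    """Centers of k <= 3 balls in d >= 2 dimensions with equal pairwise distances."""
    base = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3.0) / 2.0]])
    centers = np.zeros((k, d))
    centers[:, :2] = separation * base[:k]
    return centers

def sample_ball(n, center, radial_power, rng):
    """Draw n points from an isotropic distribution in the unit ball around center."""
    directions = rng.normal(size=(n, center.size))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)  # uniform directions
    radii = rng.random(n) ** radial_power
    return center + radii[:, None] * directions

def is_ball_recovery(labels, ball_ids):
    """True when the clustering puts two points together exactly when they share a ball."""
    labels, ball_ids = np.asarray(labels), np.asarray(ball_ids)
    same_pred = labels[:, None] == labels[None, :]
    same_true = ball_ids[:, None] == ball_ids[None, :]
    return bool(np.array_equal(same_pred, same_true))

rng = np.random.default_rng(0)
d, k, n, separation = 3, 3, 10, 3.0
centers = ball_centers(k, d, separation)
ball_ids = np.repeat(np.arange(k), n)
# Case 1: uniform in the ball, obtained with radius = U ** (1 / d).
case1_pts = np.vstack([sample_ball(n, c, 1.0 / d, rng) for c in centers])
# Case 2 (assumed radial law): radius = U ** 0.5, more mass near the center when d > 2.
case2_pts = np.vstack([sample_ball(n, c, 0.5, rng) for c in centers])
```

Given cluster labels from any solver, `is_ball_recovery(labels, ball_ids)` compares the two partitions without requiring the label values themselves to match.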
We performed such simulations using MATLAB in conjunction with Gurobi Optimizer 5.5’s barrier method implementation for every combination of the choices in the table below.
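Number of points drawn from each ball: 5, 10, 15, 20, 25, 30
Number of balls: 2, 3
Separation distance between ball centers: 2, 2.2, 2.4, 2.6, 2.8, 3, 3.2, 3.4, 3.6, 3.8, 4, 4.2, 4.4, 4.6, 4.8, 5
Dimension: 2, 3, 4, 10
Distribution: Cases 1, 2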
Remarkably, cluster recovery failed no more than 12 (8) times out of 1000 across all sets of 1000 simulations for Case 1 (2). It therefore appears that high-probability cluster recovery is always realized when drawing samples from the distributions we consider. However, since the KKT conditions require some assumption about how points cluster, general cluster recovery guarantees are difficult to prove. In the previous section, we obtain guarantees assuming the points cluster into the balls from which they are drawn. The ball recovery results of our simulations for Cases 1 and 2 are depicted in, respectively, Figures 3 and 4. Note that the vertical axis of each plot measures the number of failed ball recoveries. We conclude this section with the following observations.
On the whole, Case 2 yields more ball recoveries than Case 1. This is not unexpected: with the exception of , Case 2 concentrates more probability mass towards the centers of the balls than does Case 1, typically making the points drawn from each ball cluster more tightly. For , the plots in both Figures 3 and 4 correspond to draws from uniform distributions supported in the balls; they are repetitions and thus look essentially the same.
In general, as the number of points drawn from each ball is increased, the number of ball recoveries increases for fixed number of balls, separation distance, and dimension. This is again not unexpected: if fewer points are drawn, clustering is more susceptible to outliers that can prevent ball recovery.
As the separation distance increases, the number of ball recoveries increases for fixed number of points per ball, number of balls, and dimension because points drawn from different balls tend to get further apart. For , high-probability ball recovery appears to be guaranteed for greater than somewhere between and even for the small values of considered here. This is considerably better than the guarantee of Theorem 13: it holds for and only if is at least , as shown toward the end of its proof.
With the number of points per ball, the separation distance, and the dimension fixed, there are more ball recoveries for two balls than there are for three balls. This suggests that as the number of balls increases, the probability of recovery decreases, which is consistent with intuition from Theorem 13.
With the number of points per ball, the number of balls, and the separation distance fixed, the number of ball recoveries increases as the dimension increases, even for the uniform distributions of Case 1. There is thus substantial room for improving our recovery guarantees, which require concentrating more probability mass towards the centers of the balls as the dimension increases.
We proved that with high probability, the k-medoids clustering problem and its LP relaxation share a unique globally optimal solution in a nontrivial regime, where two points in the same cluster may be further apart than two points in different clusters. However, our theoretical guarantees are preliminary; they fall far short of explaining the success of LP in distinguishing points drawn from different balls at small separation distance and with few points in each ball. More generally, in simulations we did not present here, the k-medoids LP relaxation appeared to recover integer solutions for very extreme configurations of points, both in the presence of extreme outliers and for nonisotropic clusters with vastly different numbers of points. We thus conclude with a few open questions that interest us.
How do recovery guarantees change for different choices of the dissimilarities between points—for example, for Euclidean distances rather than for the squared Euclidean distances used here? What about for Gaussian and exponential kernels?
Can exact recovery be used to better characterize outliers?
Is it possible to obtain cluster recovery guarantees instead of just ball recovery guarantees? (“Cluster recovery” and “ball recovery” are defined right after (50).)
We thank Shi Li and Chris White for helpful suggestions. We are extremely grateful to Sujay Sanghavi for offering his expertise on clustering and for pointing us in the right directions as we navigated the literature. A.N. is especially grateful to Jun Song for his constructive suggestions and for general support during the preparation of this work. R.W. was supported in part by a Research Fellowship from the Alfred P. Sloan Foundation, an ONR Grant N00014-12-1-0743, an NSF CAREER Award, and an AFOSR Young Investigator Program Award. A.N. was supported by Jun Song’s grant R01CA163336 from the National Institutes of Health.
Finding groups in data: an introduction to cluster analysis, vol. 344. Wiley, 2009.
I. E. Givoni and B. J. Frey, “A binary variable model for affinity propagation,” Neural computation, vol. 21, no. 6, pp. 1589–1600, 2009.
Proceedings of the thirty-third annual ACM symposium on Theory of computing, pp. 247–257, ACM, 2001.
M. R. Korupolu, C. G. Plaxton, and R. Rajaraman, “Analysis of a local search heuristic for facility location problems,” in Proceedings of the ninth annual ACM-SIAM symposium on Discrete algorithms, pp. 1–10, Society for Industrial and Applied Mathematics, 1998.
Approximation, Randomization, and Combinatorial Optimization: Algorithms and Techniques, pp. 127–137, Springer, 2001.
(Restatement of Proposition 4.) LinKMed has a unique solution that coincides with the solution to KMed if and only if there exist some and such that
Suppose the solution to KMed is known. Let be the index set of nonzero entries of , and let be its complement. For some matrix , denote as the vector of variables for which . Eliminating the from LinKMed using the constraints (7) yields the following equivalent program:
where . The only in the program (52)-(58) have . Associate the nonnegative dual variables , , , , , and with (53), (54), (55), (56), (57), and (58), respectively. Enforcing stationarity of the Lagrangian gives