## 1 Introduction

Clustering is perhaps the most central problem in unsupervised machine learning and has been studied for over 60 years [10]. The problem may be stated informally as follows. One is given $n$ points, $a_1,\ldots,a_n$, lying in $\mathbb{R}^d$. One seeks to partition $\{1,\ldots,n\}$ into $K$ sets $C_1,\ldots,C_K$ such that the $a_i$'s for $i\in C_m$ are closer to each other than to the $a_i$'s for $i\in C_{m'}$, $m'\ne m$.

Clustering is usually posed as a nonconvex optimization problem and is therefore prone to nonoptimal local minimizers, but Pelckmans et al. [8], Hocking et al. [5], and Lindsten et al. [6] proposed the following convex formulation for the clustering problem:

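$$\min_{x_1,\ldots,x_n\in\mathbb{R}^d}\ \frac{1}{2}\sum_{i=1}^n\|x_i-a_i\|^2+\lambda\sum_{i<j}\|x_i-x_j\|. \tag{1}$$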

This formulation is known in the literature as sum-of-norms clustering, convex clustering, or clusterpath clustering. Let $x_1^*,\ldots,x_n^*$ be the optimizer. (Note: the objective of (1) is strongly convex, hence the optimizer exists and is unique.) The assignment to clusters is given by the $x_i^*$'s: for $i\ne i'$, if $x_i^*=x_{i'}^*$ then $i,i'$ are assigned to the same cluster, else they are assigned to different clusters. It is apparent that for $\lambda=0$, each $a_i$ is assigned to a different cluster (unless $a_i=a_{i'}$ exactly), whereas for $\lambda$ sufficiently large, the second summation drives all the $x_i$'s to be equal (and hence there is one big cluster). Thus, the parameter $\lambda$ controls the number of clusters produced by the formulation.
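To make the role of the regularization parameter concrete, the following minimal Python sketch works out the two-point case in closed form. It assumes the objective is half the sum of squared distances to the data plus an equally weighted norm penalty with parameter `lam`; the function names `two_point_solution` and `same_cluster` are illustrative and do not come from the cited papers. For two points at distance $D$, each optimizer moves a distance `lam` toward the other, and the two merge at the midpoint once `lam >= D/2`.

```python
import math

def two_point_solution(a1, a2, lam):
    """Closed-form minimizer of
        0.5*||x1-a1||^2 + 0.5*||x2-a2||^2 + lam*||x1-x2||
    for two points a1, a2 (lists of floats) and lam >= 0.
    The objective form is an assumption stated in the lead-in."""
    diff = [q - p for p, q in zip(a1, a2)]            # a2 - a1
    dist = math.sqrt(sum(c * c for c in diff))
    if dist <= 2.0 * lam:
        # Penalty dominates: both optimizers collapse to the midpoint.
        mid = [(p + q) / 2.0 for p, q in zip(a1, a2)]
        return mid, list(mid)
    u = [c / dist for c in diff]                      # unit vector from a1 to a2
    x1 = [p + lam * c for p, c in zip(a1, u)]         # a1 moves toward a2
    x2 = [q - lam * c for q, c in zip(a2, u)]         # a2 moves toward a1
    return x1, x2

def same_cluster(x1, x2, tol=1e-12):
    """Two points share a cluster when their optimizers coincide."""
    return all(abs(p - q) <= tol for p, q in zip(x1, x2))

a1, a2 = [0.0, 0.0], [3.0, 4.0]                       # distance 5
for lam in (0.0, 1.0, 2.5, 10.0):
    x1, x2 = two_point_solution(a1, a2, lam)
    print(lam, x1, x2, same_cluster(x1, x2))
```

In this toy case the transition from two clusters to one happens exactly at `lam = D/2`, which illustrates how the parameter controls the number of clusters.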

Throughout this paper, we assume that all norms are Euclidean, although (1) has also been considered for other norms. In addition, some authors insert nonnegative weights in front of the terms in the above summations. Our results, however, require all weights to be identically 1.

Panahi et al. [7] developed several recovery theorems as well as a first-order optimization method for solving (1). Other authors, e.g., Sun et al. [11], have since extended these results. One of Panahi et al.'s results pertains to a mixture of Gaussians, which is the following generative model for producing the data $a_1,\ldots,a_n$. The parameters of the model are $K$ means $\mu_1,\ldots,\mu_K\in\mathbb{R}^d$, $K$ variances $\sigma_1^2,\ldots,\sigma_K^2$, and $K$ probabilities $w_1,\ldots,w_K$, all positive and summing to 1. One draws $n$ i.i.d. samples as follows. First, an index $m\in\{1,\ldots,K\}$ is selected at random according to probabilities $w_1,\ldots,w_K$. Next, a point $a$ is chosen according to the spherical Gaussian distribution $N(\mu_m,\sigma_m^2I)$.

Panahi et al. proved that for the appropriate choice of $\lambda$, sum-of-norms clustering formulation (1) will exactly recover a mixture of Gaussians (i.e., each point will be labeled with $m$ if it was selected from $N(\mu_m,\sigma_m^2I)$) provided that for all $m,m'$, $1\le m<m'\le K$,

(2) |

One issue with this bound is that as the number of samples $n$ tends to infinity, the bound seems to indicate that distinguishing the clusters becomes increasingly difficult (i.e., the $\mu_m$'s have to be more distantly separated as $n\to\infty$).

The reason for this aspect of their bound is that their proof technique requires a gap of positive width (i.e., a region of $\mathbb{R}^d$ containing no sample points) between $\{a_i:i\in C_m\}$ and $\{a_i:i\in C_{m'}\}$ whenever $m\ne m'$. Clearly, such a gap cannot exist in the mixture-of-Gaussians distribution as the number of samples tends to infinity.
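The mixture-of-Gaussians generative model described earlier in this section can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `sample_mixture`, the seed handling, and the example parameters are assumptions made for the sketch and are not taken from the cited papers.

```python
import random

def sample_mixture(n, means, sigmas, weights, seed=0):
    """Draw n i.i.d. samples from a mixture of spherical Gaussians.
    means: list of K mean vectors; sigmas: K standard deviations;
    weights: K positive probabilities summing to 1.
    Returns (points, labels), where labels[i] is the component index."""
    rng = random.Random(seed)
    k = len(means)
    points, labels = [], []
    for _ in range(n):
        # Step 1: select a component index according to the weights.
        m = rng.choices(range(k), weights=weights)[0]
        # Step 2: spherical Gaussian = independent coordinates, common sigma.
        point = [rng.gauss(mu, sigmas[m]) for mu in means[m]]
        points.append(point)
        labels.append(m)
    return points, labels

points, labels = sample_mixture(
    1000, means=[[0.0, 0.0], [10.0, 0.0]], sigmas=[1.0, 2.0], weights=[0.3, 0.7])
print(len(points), labels.count(0) / len(labels))
```

Because every component has unbounded support, increasing `n` eventually places samples of one component arbitrarily close to the mean of another, which is why a positive-width gap between components cannot persist as the sample size grows.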

The purpose of this note is to prove that (1) can recover a mixture of Gaussians even as $n\to\infty$. This is the content of Theorem 2 in Section 4 below. Naturally, under this hypothesis we cannot hope to correctly label all samples since, as $n\to\infty$, some of the samples associated with one mean will be placed arbitrarily close to another mean. Therefore, we are content to show that (1) can correctly cluster the points lying within some fixed number of standard deviations of each mean. Radchenko and Mukherjee [9] have previously analyzed the special case of mixture of Gaussians with , under slightly different hypotheses.

Our proof technique requires a cluster characterization theorem for sum-of-norms clustering derived by Chiquet et al. [3]. This theorem is not stated by these authors as a theorem, but instead appears as a sequence of steps inside a larger proof in a "supplementary material" appendix to their paper. Because we believe that this theorem is of independent interest, we restate it below and, for the sake of completeness, provide the proof (which is the same as the proof appearing in Chiquet et al.'s supplementary material). This material appears in Section 2.

## 2 Cluster characterization theorem

The following theorem is due to Chiquet et al. [3] but is not stated as a theorem by these authors; instead it appears as a sequence of steps in a proof of the agglomeration conjecture. Refer to the next section for a discussion of the agglomeration conjecture. We restate the theorem here because it is needed for our analysis and because we believe it is of independent interest.

###### Theorem 1.

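Let $x_1^*,\ldots,x_n^*$ denote the optimizer of (1). For notational ease, let $x^*$ denote the concatenation of these vectors into a single $nd$-vector. Suppose that $C$ is a nonempty subset of $\{1,\ldots,n\}$.

(a) Necessary condition: If for some $\hat x\in\mathbb{R}^d$, $x_i^*=\hat x$ for $i\in C$ and $x_i^*\ne\hat x$ for $i\notin C$ (i.e., $C$ is exactly one cluster determined by (1)), then there exist $z_{ij}^*$ for $i,j\in C$, $i\ne j$, which solve

$$\begin{aligned}
a_i-\frac{1}{|C|}\sum_{l\in C}a_l&=\lambda\sum_{j\in C\setminus\{i\}}z_{ij}^*, && i\in C,\\
\|z_{ij}^*\|&\le 1, && i,j\in C,\ i\ne j,\\
z_{ij}^*&=-z_{ji}^*, && i,j\in C,\ i\ne j.
\end{aligned} \tag{3}$$

(b) Sufficient condition: Suppose there exists a solution $z_{ij}^*$ for $i,j\in C$, $i\ne j$, to the conditions (3). Then there exists an $\hat x\in\mathbb{R}^d$ such that the minimizer $x^*$ of (1) satisfies $x_i^*=\hat x$ for $i\in C$.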

Note: This theorem is an almost exact characterization of clusters that are determined by formulation (1). The only gap between the necessary and sufficient conditions is that the necessary condition requires that $C$ be exactly all the points in a cluster, whereas the sufficient condition is sufficient for $C$ to be a subset of the points in a cluster. The sufficient condition is notable because it does not require any hypothesis about the other points occurring in the input.

###### Proof.

(Chiquet et al.)
Proof for Necessity (a)

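As $x^*$ is the minimizer of the problem (1), and this objective function, call it $f$, is convex, it follows that $0\in\partial f(x^*)$, where $\partial f(x^*)$ denotes the subdifferential, that is, the set of subgradients of $f$ at $x^*$. (See, e.g., [4] for background on convex analysis.) Written explicitly in terms of the derivative of the squared norm and the subdifferential of the norm, this means that $x^*$ satisfies the following condition:

$$x_i^*-a_i+\lambda\sum_{j\ne i}w_{ij}^*=0,\qquad i=1,\ldots,n, \tag{4}$$

where $w_{ij}^*$, $i,j=1,\ldots,n$, $i\ne j$, are subgradients of the Euclidean norm function satisfying

$$w_{ij}^*\begin{cases}=\dfrac{x_i^*-x_j^*}{\|x_i^*-x_j^*\|}, & x_i^*\ne x_j^*,\\[2mm] \in B(0,1), & x_i^*=x_j^*,\end{cases}$$

with the requirement that $w_{ij}^*=-w_{ji}^*$ in the second case. Here, $B(x,r)$ is notation for the closed Euclidean ball centered at $x$ of radius $r$. Since $x_i^*=\hat x$ for $i\in C$ and $x_j^*\ne\hat x$ for $j\notin C$, the KKT condition for $i\in C$ is rewritten as

$$\hat x-a_i+\lambda\sum_{j\notin C}\frac{\hat x-x_j^*}{\|\hat x-x_j^*\|}+\lambda\sum_{j\in C\setminus\{i\}}w_{ij}^*=0. \tag{5}$$

Define $z_{ij}^*=w_{ij}^*$ for $i,j\in C$, $i\ne j$. Then

$$\|z_{ij}^*\|\le 1,\qquad z_{ij}^*=-z_{ji}^*,\qquad i,j\in C,\ i\ne j.$$

Substitute $z_{ij}^*$ into the equation (5) to obtain

$$\hat x-a_i+\lambda\sum_{j\notin C}\frac{\hat x-x_j^*}{\|\hat x-x_j^*\|}+\lambda\sum_{j\in C\setminus\{i\}}z_{ij}^*=0. \tag{6}$$

Sum the preceding equation over $i\in C$, noticing that the last term cancels out because $z_{ij}^*=-z_{ji}^*$, leaving

$$|C|\,\hat x-\sum_{i\in C}a_i+\lambda|C|\sum_{j\notin C}\frac{\hat x-x_j^*}{\|\hat x-x_j^*\|}=0,$$

which is rearranged to (renaming $i$ to $l$):

$$\hat x+\lambda\sum_{j\notin C}\frac{\hat x-x_j^*}{\|\hat x-x_j^*\|}=\frac{1}{|C|}\sum_{l\in C}a_l. \tag{7}$$

Subtract (7) from (6), simplify and rearrange to obtain

$$a_i-\frac{1}{|C|}\sum_{l\in C}a_l=\lambda\sum_{j\in C\setminus\{i\}}z_{ij}^*,\qquad i\in C, \tag{8}$$

as desired.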

Proof for Sufficiency (b)

We will show that at the solution of (1), all the $x_i$'s for $i\in C$ have a common value under the hypothesis that $z^*$ is a solution to the equation (3) for $i,j\in C$, $i\ne j$.

First, define the following intermediate problem. Let $\bar a$ denote the centroid of $a_i$ for $i\in C$:

$$\bar a=\frac{1}{|C|}\sum_{l\in C}a_l.$$

Consider the weighted sum-of-norms clustering problem with unknowns as follows: one unknown is associated with $C$, and one unknown is associated with each $j\notin C$ (for a total of $n-|C|+1$ unknown vectors):

(9) |

This problem, being strongly convex, has a unique optimizer; denote the optimizing vectors and for .

First, let us consider the optimality conditions for (9), which are:

(10) | ||||

(11) |

with subgradients defined as follows:

and

with the proviso that in the second case, .

We claim that the solution for (1) given by defining for while keeping the for , where and are the optimizers for (9) as in the last few paragraphs, is optimal for (1), which proves the main result. To show that this solution is optimal for (1), we need to provide subgradients to establish the necessary condition. Define to be the subgradients of evaluated at as follows:

for , , |

Before confirming that the necessary condition is satisfied, we first need to confirm that these are all valid subgradients. In the case that , , we have constructed to be a valid subgradient of evaluated at , and we have taken , .

In the case that , we have constructed to be a valid subgradient of evaluated at , and we have taken , .

In the case that $i,j\in C$, by construction $x_i=x_j$, so any vector in $B(0,1)$ is a valid subgradient of $\|\cdot\|$ evaluated at $x_i-x_j=0$. Note that since $\|z_{ij}^*\|\le 1$, then $w_{ij}$ defined above also lies in $B(0,1)$.

## 3 Agglomeration Conjecture

Recall that when $\lambda=0$, each $a_i$ is in its own cluster in the solution to (1) (provided the $a_i$'s are distinct), whereas for sufficiently large $\lambda$, all the points are in one cluster. Hocking et al. [5] conjectured that sum-of-norms clustering with equal weights has the following agglomeration property: as $\lambda$ increases, clusters merge with each other but never break up. This means that the solutions to (1) as $\lambda$ ranges over $[0,\infty)$ induce a tree of hierarchical clusters on the data.

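This conjecture was proved by Chiquet et al. [3] using Theorem 1. Consider a $\lambda'\ge\lambda$ and its corresponding sum-of-norms cluster model:

$$\min_{x_1,\ldots,x_n\in\mathbb{R}^d}\ \frac{1}{2}\sum_{i=1}^n\|x_i-a_i\|^2+\lambda'\sum_{i<j}\|x_i-x_j\|. \tag{12}$$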

###### Corollary 1.1.

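Let $\lambda'\ge\lambda$. If $C$ is a cluster in the solution of (1) with parameter $\lambda$, then the points in $C$ are in the same cluster in the solution of (12), that is, of (1) with $\lambda$ replaced by $\lambda'$.

###### Proof.

The corollary follows from Theorem 1. If $C$ is a cluster in the solution of (1), then by the necessary condition, there exist multipliers $z_{ij}^*$ satisfying (3) for $i,j\in C$, $i\ne j$. If we scale each of these multipliers by $\lambda/\lambda'\le 1$, we now obtain a solution to (3) for $i,j\in C$ with $\lambda$ replaced by $\lambda'$, and the theorem states that this is sufficient for the points in $C$ to be in the same cluster in the solution to (12).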
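The scaling argument in the proof of the corollary can be checked numerically. The sketch below assumes that condition (3) of Theorem 1 reads $a_i-\frac{1}{|C|}\sum_{l\in C}a_l=\lambda\sum_{j\in C\setminus\{i\}}z_{ij}$ together with $\|z_{ij}\|\le 1$ and $z_{ij}=-z_{ji}$. The helper names `satisfies_condition` and `scale`, the three sample points, and the particular multipliers are hypothetical choices for illustration only.

```python
import math

def satisfies_condition(points, z, lam, tol=1e-9):
    """Check the assumed form of condition (3) for the cluster `points`:
    a_i - centroid = lam * sum_{j != i} z[i][j],
    ||z[i][j]|| <= 1, and z[i][j] = -z[j][i]."""
    c, d = len(points), len(points[0])
    centroid = [sum(p[k] for p in points) / c for k in range(d)]
    for i in range(c):
        for k in range(d):
            rhs = lam * sum(z[i][j][k] for j in range(c) if j != i)
            if abs(points[i][k] - centroid[k] - rhs) > tol:
                return False
        for j in range(c):
            if j == i:
                continue
            if math.sqrt(sum(v * v for v in z[i][j])) > 1.0 + tol:
                return False
            if any(abs(z[i][j][k] + z[j][i][k]) > tol for k in range(d)):
                return False
    return True

def scale(z, factor):
    """Multiply every multiplier vector by the same scalar."""
    return [[[factor * v for v in zij] for zij in row] for row in z]

points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
lam, lam_new = 0.5, 2.0
c = len(points)
# One feasible choice of multipliers at `lam` (illustrative construction).
z = [[[(pi - pj) / (lam * c) for pi, pj in zip(points[i], points[j])]
      for j in range(c)] for i in range(c)]
z_new = scale(z, lam / lam_new)                      # shrink by lam / lam_new
print(satisfies_condition(points, z, lam))           # feasibility at lam
print(satisfies_condition(points, z_new, lam_new))   # feasibility at lam_new
```

Scaling by a factor at most 1 can only shrink the norms, so the norm bound survives, while the product of the parameter and the multipliers is unchanged, so the equalities survive as well.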

It should be noted that Hocking et al. construct an example of unequally-weighted sum-of-norms clustering in which the agglomeration property fails. It is still mostly an open question to characterize for which norms and for which families of unequal weights the agglomeration property holds. Refer to Chi and Steinerberger [2] for some recent progress.

## 4 Mixture of Gaussians

In this section, we present our main result about recovery of a mixture of Gaussians. As noted in the introduction, a theorem stating that every point is labeled correctly is not possible in the setting of $n\to\infty$, so we settle for a theorem stating that points within a constant number of standard deviations from the means are correctly labeled.

###### Theorem 2.

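Let the vertices $a_1,\ldots,a_n\in\mathbb{R}^d$ be generated from a mixture of $K$ Gaussian distributions with parameters $\mu_1,\ldots,\mu_K$, $\sigma_1^2,\ldots,\sigma_K^2$, and $w_1,\ldots,w_K$. Let $\theta>0$ be given, and let

$$V_m=\{a_i:\|a_i-\mu_m\|\le\theta\sigma_m\},\qquad m=1,\ldots,K.$$

Let $\epsilon>0$ be arbitrary. Then for any $m=1,\ldots,K$, with probability exponentially close to $1$ (and depending on $\epsilon$) as $n\to\infty$, for the solution $x^*$ computed by (1), the points in $V_m$ are in the same cluster provided

$$\lambda\ge\frac{2\theta\sigma_m}{(F(\theta^2,d)\,w_m-\epsilon)\,n}. \tag{13}$$

Here, $F(\cdot,d)$ denotes the cumulative distribution function of the chi-squared distribution with $d$ degrees of freedom (so that $F(\theta^2,d)$ tends to $1$ rapidly as $\theta$ increases). Furthermore, the cluster associated with $V_m$ is distinct from the cluster associated with $V_{m'}$, $1\le m<m'\le K$, provided that

$$\lambda<\frac{\|\mu_m-\mu_{m'}\|}{2(n-1)}. \tag{14}$$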
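The chi-squared distribution enters the theorem because the squared distance of a spherical Gaussian sample from its mean, divided by the variance, is chi-squared distributed with as many degrees of freedom as the dimension. The following sketch checks this by Monte Carlo in the special case of dimension 2, where the chi-squared CDF has the closed form $1-e^{-x/2}$. The function names, sample size, and parameter values are assumptions chosen for illustration.

```python
import math
import random

def chi2_cdf_2dof(x):
    """CDF of the chi-squared distribution with 2 degrees of freedom."""
    return 1.0 - math.exp(-x / 2.0) if x > 0 else 0.0

def fraction_within(theta, sigma, n, seed=0):
    """Monte Carlo fraction of N(0, sigma^2 I) samples in the plane whose
    distance from the mean is at most theta * sigma."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x, y = rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)
        if math.hypot(x, y) <= theta * sigma:
            hits += 1
    return hits / n

theta = 2.0
estimate = fraction_within(theta, sigma=3.0, n=20000)
exact = chi2_cdf_2dof(theta ** 2)      # probability of landing within theta sigmas
print(estimate, exact)
```

The estimate does not depend on the value of `sigma`, only on `theta` and the dimension, which matches the way the radius of each region is measured in standard deviations.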

###### Proof.

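Let $\epsilon>0$ be fixed. Fix an $m\in\{1,\ldots,K\}$. First, we show that all the points in $V_m$ are in the same cluster. The usual technique for proving a recovery result is to find subgradients to satisfy the sufficient condition, which in this case is Theorem 1 taking $C$ in the theorem to be $V_m$. Observe that conditions (3) involve equalities and norm inequalities. A standard technique in the literature (see, e.g., Candès and Recht [1]) is to find the least-squares solution to the equalities and then prove that it satisfies the inequalities. This is the technique we adopt herein. The conditions (3) are in sufficiently simple form that we can write down the least-squares solution in closed form; it turns out to be:

$$z_{ij}^*=\frac{a_i-a_j}{\lambda|V_m|},\qquad a_i,a_j\in V_m,\ i\ne j.$$

It follows by construction (and is easy to check) that this formula satisfies the equalities in (3), so the remaining task is to show that the norm bound $\|z_{ij}^*\|\le 1$ is satisfied. By definition of $V_m$, $\|a_i-a_j\|\le 2\theta\sigma_m$ for all $a_i,a_j\in V_m$. The probability that an arbitrary sample $a_i$ is associated with mean $\mu_m$ is $w_m$. Furthermore, with probability $F(\theta^2,d)$, this sample satisfies $\|a_i-\mu_m\|\le\theta\sigma_m$, i.e., lands in $V_m$. Since the second choice in the mixture of Gaussians is conditionally independent from the first, the overall probability that $a_i$ is associated with $\mu_m$ and lands in $V_m$ is $w_mF(\theta^2,d)$. Therefore, $E[|V_m|]\ge n\,w_mF(\theta^2,d)$. By the Chernoff bound for the tail of a binomial distribution, it follows that the probability that

$$|V_m|\ge n\,(w_mF(\theta^2,d)-\epsilon)$$

is exponentially close to 1 for a fixed $\epsilon>0$. Thus, provided $\lambda\ge 2\theta\sigma_m/((F(\theta^2,d)\,w_m-\epsilon)\,n)$, we have constructed a solution to (3) with probability exponentially close to 1.

For the second part of the theorem, suppose $1\le m<m'\le K$. For each sample $a_i$ associated with $\mu_m$ satisfying $\|a_i-\mu_m\|\le\theta\sigma_m$ (i.e., lying in $V_m$), the probability is $1/2$ that

$$(a_i-\mu_m)^T(\mu_m-\mu_{m'})\ge 0$$

by the fact that the spherical Gaussian distribution has mirror-image symmetry about any hyperplane through its mean. Therefore, with probability exponentially close to 1 as $n\to\infty$, we can assume that at least one $a_i\in V_m$ satisfies the above inequality. Similarly, with probability exponentially close to 1, at least one sample $a_{i'}\in V_{m'}$ satisfies

$$(a_{i'}-\mu_{m'})^T(\mu_{m'}-\mu_m)\ge 0.$$

Then

$$\begin{aligned}
\|a_i-a_{i'}\|^2&=\|(a_i-\mu_m)-(a_{i'}-\mu_{m'})+(\mu_m-\mu_{m'})\|^2\\
&=\|(a_i-\mu_m)-(a_{i'}-\mu_{m'})\|^2+2(a_i-\mu_m)^T(\mu_m-\mu_{m'})\\
&\qquad+2(a_{i'}-\mu_{m'})^T(\mu_{m'}-\mu_m)+\|\mu_m-\mu_{m'}\|^2\\
&\ge\|\mu_m-\mu_{m'}\|^2,
\end{aligned} \tag{15}$$

where, in the final line, we used the two inequalities derived earlier in this paragraph.

Consider the first-order optimality conditions for equation (1), which are given by (4). Apply the triangle inequality to the summation in (4) to obtain

$$\|x_i^*-a_i\|\le\lambda(n-1), \tag{16}$$

$$\|x_{i'}^*-a_{i'}\|\le\lambda(n-1). \tag{17}$$

Therefore,

$$\|x_i^*-x_{i'}^*\|\ge\|a_i-a_{i'}\|-\|x_i^*-a_i\|-\|x_{i'}^*-a_{i'}\|\ge\|\mu_m-\mu_{m'}\|-2\lambda(n-1).$$

Therefore, we conclude that $x_i^*\ne x_{i'}^*$, i.e., that $a_i$ and $a_{i'}$ are not in the same cluster, provided that the right-hand side of the preceding inequality is positive, i.e.,

$$\lambda<\frac{\|\mu_m-\mu_{m'}\|}{2(n-1)}.$$

This concludes the proof of the second statement. ∎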
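The first part of the proof can be illustrated numerically. The sketch below assumes that the candidate multipliers have the closed form $z_{ij}=(a_i-a_j)/(\lambda|V|)$ for a set $V$ of points within $\theta$ standard deviations of a mean. Under that assumption the norm bound holds exactly when $\lambda$ is at least the diameter of $V$ divided by $|V|$, and the diameter is at most $2\theta\sigma$. The function name `min_lambda_for_cluster` and all parameter values are hypothetical.

```python
import math
import random

def min_lambda_for_cluster(points):
    """Smallest lam for which the assumed multipliers
    z_ij = (a_i - a_j) / (lam * |V|) all have norm <= 1,
    namely diameter(points) / |V|."""
    diameter = max(math.dist(p, q) for p in points for q in points)
    return diameter / len(points)

rng = random.Random(1)
mu, sigma, theta = [0.0, 0.0, 0.0], 1.5, 2.0
samples = [[rng.gauss(m, sigma) for m in mu] for _ in range(300)]
# V: samples within theta standard deviations of the mean.
V = [a for a in samples if math.dist(a, mu) <= theta * sigma]

lam_needed = min_lambda_for_cluster(V)
lam_bound = 2.0 * theta * sigma / len(V)   # worst-case bound from the radius
print(len(V), lam_needed, lam_bound)
```

Since the number of points in the region grows linearly with the sample size while the diameter stays bounded, the required parameter value shrinks like the reciprocal of the sample size, consistent with the scaling of the two thresholds in the theorem.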

Clearly, there exists a $\lambda$ so that the solution to (1) can simultaneously place all points in $V_m$ into the same cluster for each $m=1,\ldots,K$ while distinguishing the clusters provided that the right-hand side of (14) exceeds the right-hand side of (13). In order to obtain a compact inequality that guarantees this condition, let us fix some values. For example, let us take and let . The Chernoff bound implies that exponentially fast in $n$. Let $w_{\min}$ be the minimum weight in the mixture of Gaussians. Let $\sigma_{\max}$ denote the maximum standard deviation in the distribution. Finally, let us take . Then the above theorem states there is a $\lambda$ such that with probability tending to $1$ exponentially fast in $n$, the points in $V_m$, for any $m=1,\ldots,K$, are each in the same cluster, and these clusters are distinct, provided that

(18) |

Compared to the Panahi et al. bound (2), we have removed the dependence of the right-hand side on $n$ as well as the factor of $K$. (The dependence of the Panahi et al. bound on $d$ is not made explicit so we cannot compare the two bounds' dependence on $d$. Note that there is still an implicit dependence on $K$ in (18) since necessarily $w_{\min}\le 1/K$.)

## 5 Discussion

The analysis of the mixture of Gaussians in the preceding section used only standard bounds and simple properties of the normal distribution, so it should be apparent to the reader that many extensions of this result (e.g., Gaussians with a more general covariance matrix, uniform distributions, many kinds of deterministic distributions) are possible. The key technique is Theorem 1, which essentially decouples the clusters from each other so that each can be analyzed in isolation. Such a theorem does not apply to most other clustering algorithms, or even to sum-of-norms clustering in the case of unequal weights, so obtaining similar results for other algorithms remains a challenge.

## References

- [1] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, Apr 2009.
- [2] E. Chi and S. Steinerberger. Recovering trees with convex clustering. https://arxiv.org/abs/1806.11096, 2018.
- [3] J. Chiquet, P. Gutierrez, and G. Rigaill. Fast tree inference with weighted fusion penalties. Journal of Computational and Graphical Statistics, 26:205–216, 2017.
- [4] Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal. Fundamentals of convex analysis. Springer, 2012.
- [5] T. Hocking, A. Joulin, F. Bach, and J.-P. Vert. Clusterpath: An algorithm for clustering using convex fusion penalties. In International Conference on Machine Learning, 2011.
- [6] F. Lindsten, H. Ohlsson, and L. Ljung. Clustering using sum-of-norms regularization: With application to particle filter output computation. In IEEE Statistical Signal Processing Workshop (SSP), 2011.
- [7] A. Panahi, D. Dubhashi, F. Johansson, and C. Bhattacharyya. Clustering by sum of norms: Stochastic incremental algorithm, convergence and cluster recovery. Journal of Machine Learning Research, 70, 2017.
- [8] K. Pelckmans, J. De Brabanter, J. A. K. Suykens, and B. De Moor. Convex cluster shrinkage. Available on-line at ftp://ftp.esat.kuleuven.ac.be/sista/kpelckma/ccs_pelckmans2005.pdf, 2005.
- [9] Peter Radchenko and Gourab Mukherjee. Convex clustering via l1 fusion penalization. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(5):1527–1546, 2017.
- [10] S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
- [11] D. Sun, K.-C. Toh, and Y. Yuan. Convex clustering: model, theoretical guarantees and efficient algorithm. https://arxiv.org/abs/1810.02677, 2018.