Clustering Based on Pairwise Distances When the Data is of Mixed Dimensions

09/12/2009 · by Ery Arias-Castro et al., University of California, San Diego

In the context of clustering, we consider a generative model in a Euclidean ambient space with clusters of different shapes, dimensions, sizes and densities. In an asymptotic setting where the number of points becomes large, we obtain theoretical guarantees for a few emblematic methods based on pairwise distances: a simple algorithm based on the extraction of connected components in a neighborhood graph; the spectral clustering method of Ng, Jordan and Weiss; and hierarchical clustering with single linkage. The methods are shown to enjoy some near-optimal properties in terms of separation between clusters and robustness to outliers. The local scaling method of Zelnik-Manor and Perona is shown to lead to a near-optimal choice for the scale in the first two methods. We also provide a lower bound on the spectral gap to consistently choose the correct number of clusters in the spectral method.


1 Introduction

In the context of clustering points in a Euclidean space, traditional methods, such as k-means or Gaussian mixture models, assume that each cluster is generated by sampling points in the vicinity of a centroid. The resulting clusters are ellipsoidal, and in particular full-dimensional. Several papers obtain theoretical results in this setting; see e.g., [43, 19, 46, 20, 1], and references therein. In a number of modern applications, however, the data may contain structures of mixed dimensions. Even the apparently simple case of affine surfaces is a relevant model for a number of real-life situations [37]. Our focus here is a more general framework making minimal assumptions on the underlying clusters. Note that our framework is inclusive of the classical setting.

1.1 Mathematical framework

We set the ambient space to be the -dimensional unit hypercube , though our results may generalize to other settings such as Riemannian manifolds. In most of the paper, we assume that the ambient dimension is fixed, and discuss the case where it is large in Section 5.3. For a positive integer and a constant , let be the class of measurable, connected sets (surfaces) such that

(1)

The condition above not only implies that the surface has (e.g. Hausdorff) dimension with finite -volume, it also prevents the surface from being too narrow in some places. We also define as the set of points in . We let .

For readers more familiar with function spaces, note that the class contains, for example, the image of locally bi-Lipschitz functions satisfying:

(2)

for small enough.

For and , define

This is the -neighborhood of in relative to the Euclidean metric. Given surfaces, , we generate clusters, , by sampling points in , the -neighborhood of , according to a distribution with density with respect to the uniform measure on . We call the noise level or sampling imprecision. In the noiseless case, , the points are sampled exactly on the surface. We require that , so the cluster is somewhat uniformly sampled. Our results apply without major change for non-compactly supported sampling distributions with fast-decaying tails, such as Gaussian noise. The classical setting corresponds to either (centroids), or with (full-dimensional cluster). Let be the total number of data points, which we denote by . For later use, define indices .

We assume the clusters do not intersect, and in fact that the underlying surfaces are well-separated:

(3)

The actual clusters are therefore separated by a distance of at least .

Surface clustering task. Given data , recover the clusters .
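As a rough illustration of this generative model, the following sketch generates two clusters of different intrinsic dimensions in the unit square: one sampled near a curve (a one-dimensional surface) and one full-dimensional. The function names, the box-shaped noise used as a stand-in for the exact neighborhood sampling, and the parameter values are all illustrative choices, not the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_near_curve(n, tau, rng):
    """Sample n points within (roughly) distance tau of a curve inside the
    unit square: here, an arc of a circle.  Box noise stands in for the
    paper's exact tau-neighborhood sampling."""
    t = rng.uniform(0.0, np.pi, size=n)               # parameter along the arc
    curve = 0.5 + 0.35 * np.c_[np.cos(t), np.sin(t)]  # points on the arc
    noise = rng.uniform(-tau, tau, size=(n, 2))       # sampling imprecision
    return curve + noise

def sample_full_dimensional(n, center, radius, rng):
    """Sample n points uniformly from a disc: a full-dimensional cluster."""
    r = radius * np.sqrt(rng.uniform(size=n))
    theta = rng.uniform(0.0, 2 * np.pi, size=n)
    return center + np.c_[r * np.cos(theta), r * np.sin(theta)]

# Two clusters of different dimensions, sizes and densities in [0,1]^2.
X1 = sample_near_curve(n=400, tau=0.01, rng=rng)
X2 = sample_full_dimensional(n=200, center=np.array([0.5, 0.25]), radius=0.08, rng=rng)
X = np.vstack([X1, X2])
labels_true = np.r_[np.zeros(len(X1), dtype=int), np.ones(len(X2), dtype=int)]
```

The clustering task is then to recover `labels_true` from `X` alone.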

1.2 In this paper

We first consider a simple algorithm based on extracting the connected components of a -neighborhood graph built using a compactly supported kernel. We provide conditions guaranteeing that the algorithm perfectly recovers the underlying clusters in our theoretical setting; this is done in Section 2.1. This approach may be seen as a precursor to spectral methods, which extract ‘soft’ connected components based on an eigen-decomposition of the Laplacian of the neighborhood graph. In Section 2.2, we consider the method introduced by Ng, Jordan and Weiss [40], a standard spectral clustering algorithm. We show that, in our framework, the spectral method operates under conditions very similar to those of the method based on connected components. The last method we consider, in Section 2.3, is hierarchical clustering with single linkage, which is in some sense equivalent to the method based on connected components. Note that hierarchical clustering with average or complete linkage is not suitable in our context, which includes elongated clusters.

It turns out that the first two methods are near-optimal in terms of separation between clusters and robustness to outliers. In Section 3, we show that, under low sampling noise, no method can perfectly separate clusters that are closer together than what the first two methods require by more than a poly-logarithmic factor. For clusters of dimension one or two, we obtain stronger results, showing that all clustering methods have in fact a non-negligible error rate in that same situation. In Section 4, we address the situation where outliers, points sampled elsewhere in space, may be present in the data, and show that the first two methods, properly modified, are able to accurately cluster within logarithmic factors of the best known detection rates [5, 3], even though the task of detection is a priori much easier than the task of clustering.

In the discussion part of the paper, Section 5, we consider the choice of parameters, that is, the scale defining the neighborhood graph and, for the spectral method, the number of eigenvectors to extract. We show that the local scaling method of Zelnik-Manor and Perona [51], with a number of nearest neighbors of order slightly larger than , leads to a near-optimal choice of scale. As a consequence, computations may be restricted to small neighborhoods without compromising the clustering performance, so that a nearest-neighbor search becomes the computational bottleneck. We also provide a bound on the eigengap that allows consistent estimation of the number of clusters. Finally, we discuss how the results generalize to the case where the ambient dimension is very large or even infinite.

The various proofs are gathered at the end of the paper, in Section 6, with the proofs of auxiliary results collected in Appendices A and B. The careful reader will notice that our results could be made non-asymptotic without much change in the arguments. However, we chose to favor the statement of simple results with concise proofs.

1.3 Related work

Neighborhood graphs defined on a random set of points in Euclidean space are sometimes called random geometric graphs, and have been of interest in modeling networks. The book by Penrose [42] is a standard reference. The main difference in our case is that the support of the sampling density may be (close to) singular with respect to the Lebesgue measure. Extracting connected components from a neighborhood graph is a natural idea and has been proposed before; we comment on three publications that are particularly relevant to us [38, 10, 14]. Maier, Hein and von Luxburg [38] consider -nearest neighbor type graphs and analyze the performance of the resulting clustering algorithm within a slightly more restrictive model where both the clusters and the sampling densities are smooth, and the degree of imprecision is positive, . Within that framework, the results in that paper are non-asymptotic and more precise than our Theorem 1. Their emphasis is on choosing optimally in terms of maximizing the probability of correctly solving the clustering task and on the effect of using different kinds of graphs. We comment on their work in more detail in Section 5. In a similar model, Biau, Cadre and Pelletier [10] focus on estimating the correct number of clusters based on counting the number of connected components in a -neighborhood graph. Both [38, 10] consider the case where the space between clusters contains points; we call those points outliers and consider this situation in Section 4.1. Brito, Chávez, Quiroz and Yukich [14] consider a model similar to ours with all clusters full-dimensional. They also use a -nearest neighbor graph and show that, when the separation between clusters remains bounded away from zero, choosing of order makes the algorithm output the perfect clustering; this is similar to our Proposition 3. They also consider a test of non-uniformity, where the alternative is that of points clustered more closely together, as opposed to a cluster hidden in a background of uniform points as we consider in Section 4. However, there are no optimality considerations. In light of [38, 10, 14], our contribution is in considering a slightly more general framework, for which we provide short proofs, and in establishing optimality results in terms of separation between clusters and robustness to outliers.

Spectral clustering methods have been specifically developed to work in the kind of framework we consider here [23]. Though these methods are very popular, few theoretical results are available on their performance under this type of generative model. Ng, Jordan and Weiss, in their influential paper [40], introduce their method and outline a strategy to analyze it; however, no explicit probabilistic model is considered. The same comment holds for [31]. In [47, 41], spectral clustering is taken to its empirical process limit as the number of points increases; though this provides insight on what spectral clustering is estimating, there is no result on its performance. This is similar to the analysis in [39]. Other papers, such as [24], introduce variations on the spectral method and provide theoretical results on computational aspects, not on clustering performance. Closer in spirit to the present paper is the work of Chen and Lerman [16], where the authors analyze a multi-way spectral method specifically designed for the case of affine surfaces. Our contribution here is in providing theoretical guarantees for spectral clustering methods in a rigorous mathematical framework. In doing so, we provide a concise proof of the main result in [40], partly based on information that Andrew Ng shared with the author and on the proof of [16, Th 4.5] by Chen and Lerman.

To our knowledge, the minimax-type bounds on the separation between clusters obtained in Section 3 are the first of their kind in the context of clustering under a non-parametric model. In the classical setting, there is some existing literature, though it is very scarce; we will comment on a paper of Achlioptas and McSherry [1]. The literature is of course abundant in the context of estimation [50, 21, 32, 12] and classification [49, 44]. In our arguments, we use the popular approach of reducing the task to a hypothesis testing problem.

1.4 Additional notation

Except for and , the parameters such and vary with . This dependence is left implicit. An event holds with high probability if as . We use standard notation, such as: for ; for ; for ; for and ; for .

2 Some standard clustering methods based on pairwise distances

We describe some common approaches to clustering, all based on pairwise distances. Each time, we provide sufficient conditions for the method to output the perfect clustering. These conditions are seen to be necessary up to multiplicative logarithmic factors. We will see in later sections what these conditions imply in terms of comparative performance.

The first two methods build a neighborhood graph on the data points using an affinity based on pairwise distances:

(4)

We assume the kernel is non-negative, continuous at 0 with , non-increasing on , and fast-decaying, in the sense that for any .
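As a concrete (hypothetical) sketch of such an affinity, the snippet below builds the pairwise affinity matrix from either a compactly supported (indicator) kernel, which gives the neighborhood graph used in Section 2.1, or a Gaussian kernel as used by the spectral method. The exact form of the Gaussian kernel in the paper may differ by a constant in the exponent; the function and argument names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def affinity_matrix(X, scale, kernel="gaussian"):
    """Pairwise affinities W_ij = phi(||x_i - x_j|| / scale).

    'gaussian' uses phi(t) = exp(-t^2 / 2); 'uniform' is the compactly
    supported kernel phi(t) = 1{t <= 1}, which yields the adjacency of the
    scale-neighborhood graph used for connected-components clustering.
    """
    D = squareform(pdist(X))          # n x n Euclidean distance matrix
    T = D / scale
    if kernel == "gaussian":
        W = np.exp(-0.5 * T ** 2)
    elif kernel == "uniform":
        W = (T <= 1.0).astype(float)
    else:
        raise ValueError(f"unknown kernel: {kernel}")
    np.fill_diagonal(W, 0.0)          # no self-affinity
    return W
```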

2.1 Clustering based on extracting connected components

The first algorithm we introduce, Algorithm 1, extracts connected components of the neighborhood graph and therefore requires a compactly supported kernel; let be the support of .

Input:
        : the data set
        : affinity scale
Output:
        A partition of the data into disjoint clusters
Steps:
1: Compute the affinity matrix , with .
2: Extract the connected components of .
3: Accordingly group the original points into disjoint clusters.
Algorithm 1 Pairwise clustering based on extracting connected components
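The following is a minimal sketch of Algorithm 1 in Python, assuming a brute-force distance computation and the indicator kernel; the function name and the choice of SciPy routines are illustrative, not the paper's implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist, squareform

def cluster_by_connected_components(X, eps):
    """Algorithm 1 (sketch): build the eps-neighborhood graph and label each
    point by its connected component."""
    D = squareform(pdist(X))                        # pairwise distances
    A = (D <= eps) & ~np.eye(len(X), dtype=bool)    # adjacency (compact kernel)
    _, labels = connected_components(csr_matrix(A), directed=False)
    return labels
```

For instance, `cluster_by_connected_components(X, eps=0.05)` returns an integer label per point; with well-separated clusters and a suitable scale, the components coincide with the clusters.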
Theorem 1.

Consider the generative model of Section 1.1 with surfaces . Assume that , with

(5)

Then, Algorithm 1 is perfectly accurate with high probability.

The proof of Theorem 1 is in Section 6.1. The condition means that distinct clusters are separated by and therefore disjoint in the neighborhood graph. The term in brackets on the right hand side of (5) is actually the order of magnitude for the maximin distance between points sampled from .

Remark. In the classical setting where each is a centroid, i.e. , the algorithm is accurate when

2.2 Spectral clustering

When using kernels that are not compactly supported, extracting connected components makes little sense as the neighborhood graph is fully connected. Instead, spectral methods perform an eigen-decomposition of the graph Laplacian. The spectral method introduced in [40] uses the Gaussian kernel . Note that kernels of compact support are considered in [41] in the context of spectral clustering. We describe the method of Ng, Jordan and Weiss [40] for a general kernel in Algorithm 2. The k-means algorithm is initialized with centroids at nearly orthogonal angles, and then run with only one iteration. The initial centroids are chosen recursively, starting with any row vector of and then choosing, at each step, a row vector with the largest minimal absolute angle to the centroids previously chosen.

Input:
        : the data set
        : affinity scale
        : the number of clusters
Output:
        A partition of the data into disjoint clusters
Steps:
1: Compute the affinity matrix , with .
2: Compute the degree matrix , and .
3: Extract , orthogonal eigenvectors of for its largest eigenvalues.
4: Renormalize each row of to have unit norm and let denote the resulting matrix.
5: Apply -means to cluster the row vectors of in .
6: Accordingly group the original points into disjoint clusters.
Algorithm 2 Pairwise spectral clustering
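Below is a hedged sketch of the Ng–Jordan–Weiss procedure of Algorithm 2 with a Gaussian kernel. It substitutes scikit-learn's standard k-means for the one-iteration, near-orthogonal initialization described above, so it is an approximation of the method rather than a faithful implementation; names and defaults are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

def spectral_clustering_njw(X, scale, k):
    """Sketch of Algorithm 2: Gaussian affinity, symmetric normalization
    D^{-1/2} W D^{-1/2}, top-k eigenvectors, row normalization, k-means."""
    Dist = squareform(pdist(X))
    W = np.exp(-0.5 * (Dist / scale) ** 2)          # Gaussian kernel affinity
    np.fill_diagonal(W, 0.0)
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    Z = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]   # normalized affinity
    _, eigvecs = np.linalg.eigh(Z)                  # eigenvalues in ascending order
    U = eigvecs[:, -k:]                             # eigenvectors of the k largest
    U = U / np.linalg.norm(U, axis=1, keepdims=True)    # renormalize each row
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```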
Theorem 2.

Consider the generative model described in Section 1.1 with surfaces . Let be such that for any . Assume and that (5) holds. Then, Algorithm 2 is perfectly accurate with high probability.

The proof of Theorem 2 is in Section 6.2. We see that Theorem 2 is very similar to Theorem 1; for example, with the Gaussian kernel, the separation condition is . The respective proofs are essentially parallel as well, though for the latter we follow the outline provided in [40]. Thus in theory and under our model, Algorithms 1 and 2 operate under similar conditions. In practice, however, it is well-known that Algorithm 1 is substantially more sensitive to the specification of the scale parameter .

2.3 Single linkage clustering

In the setting of Section 1.1, there is no hope for hierarchical clustering methods using complete or average linkage unless the clusters are separated by a distance comparable to their diameter, or larger. This is the classical setting, where the goal is typically to form clusters with small diameter [20]. On the other hand, the “chaining” property of hierarchical clustering with single linkage is desirable in our context, especially if the cluster is truly lower-dimensional (e.g. generated by sampling near a curve). In fact, if we stop the procedure whenever the closest distance between clusters exceeds , the resulting algorithm is equivalent to Algorithm 1 with kernel . The procedure is described in Algorithm 3.

Input:
        : the data set
        : maximum merging distance
Output:
        A partition of the data into disjoint clusters
Steps:
0: Set each point to be a cluster.
1: Recursively merge the two closest clusters in terms of minimal distance.
2: Stop when the distance between any pair of clusters exceeds .
Algorithm 3 Single linkage clustering
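A short sketch of Algorithm 3: stopping the single-linkage merges once every remaining merge would exceed the maximum merging distance is the same as cutting the single-linkage dendrogram at that height, which is what SciPy's routines do below. The function name is illustrative.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def single_linkage_clusters(X, eps):
    """Algorithm 3 (sketch): merge clusters by minimum pairwise distance and
    stop when all inter-cluster distances exceed eps, i.e. cut the
    single-linkage dendrogram at height eps."""
    Z = linkage(pdist(X), method="single")
    return fcluster(Z, t=eps, criterion="distance")
```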
Corollary 3.

Under the conditions of Theorem 1, Algorithm 3 is perfectly accurate with high probability.

We mention the paper of Achlioptas and McSherry [1], which introduces an algorithm based on a combination of spectral clustering and single linkage clustering. Their analysis shows that their algorithm performs comparatively well in the classical setting.

3 Optimality in terms of separation between clusters

From Theorem 1 we see that Algorithm 1 is able to correctly identify clusters separated by a distance in the order of the term on the right hand side of (5). In the classical setting, Algorithm 1 is accurate when , with ; this is valid in any dimension, as explained in Section 5.3. The requirement is therefore comparable to, and actually weaker than, the lower bound achieved in [1, Th. 6]. This assumes we can select an appropriate scale, which we do in Section 5.1. Note that the algorithm of Achlioptas and McSherry [1] requires selecting the correct number of clusters.

In our framework, the degree of separation required by Algorithm 1 to be perfectly accurate is close to optimal when the noise level is small, specifically,

Theorem 4.

For any clustering method and any probability , there are surfaces of diameter at least and separated by , with , such that, in the context of the generative model of Section 1.1, the method makes at least one mistake with probability at least .

The proof of Theorem 4 is in Section 6.3.

Remark. We avoided the case of surfaces of mixed dimensions since the use of more sophisticated tools, such as local density or dimension estimation [28, 35], could possibly narrow the separation.

The conclusion of Theorem 4 is rather weak, though, as it does not give conditions under which any clustering method has a substantial error rate (in terms of labeling the points). In dimensions one and two, we are able to prove such a result. In fact, we show that Algorithm 1 achieves the optimal separation rate, up to a constant factor in dimension one and up to a poly-logarithmic factor in dimension two. We were not able to prove such a result in higher dimensions.

Theorem 5.

For any clustering method, there are surfaces of diameter at least and separated by , with , on which the method has an error rate exceeding with high probability.

The proof of Theorem 5 is in Section 6.4.

Theorem 6.

For any clustering method, there are surfaces of diameter at least and separated by , with , on which the method has an error rate exceeding with high probability.

The proof of Theorem 6 is in Section 6.5.

4 Optimality in terms of robustness

4.1 Dealing with outliers

So far we only considered the case where the data is devoid of outliers. We now assume that some outliers may be included in the data. The outliers are sampled from a distribution with density with respect to the uniform measure on , again with . We assume this region is of -volume bounded below by . We denote by the number of outliers. We highlight the fact that outliers lie at distance at least from the surfaces, the same lower bound imposed on the distance separating two distinct surfaces.

The algorithms considered here are based on pairwise distances, so we need to assume that the outliers are not as densely sampled as the actual clusters, for otherwise they would be indistinguishable from non-outlier points.

Proposition 1.

Assume the conditions of Theorem 1 hold, now in a setting that includes outliers, and, in addition, that . Then, Algorithm 3 is perfectly accurate with high probability if, when the algorithm stops, singletons are labeled as outliers.

The proof of Proposition 1 is in Section 6.6.

Algorithms 1 and 2 need to be modified in order to deal with outliers. We introduce an additional step which consists in discarding the data points with low connectivity in the neighborhood graph. This approach to removing outliers is very natural and was proposed in other works, such as [17, 38]. Specifically, fix a sequence such that ; then, between steps 1 and 2, compute the degree matrix and discard the points with degree .
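The sketch below illustrates this modification for Algorithm 1: compute the degree of each point in the neighborhood graph, discard low-degree points as outliers, and extract connected components from the remaining points. The threshold is left as an argument because the sequence it should follow depends on quantities elided above; the function name and the outlier label -1 are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist, squareform

def cluster_and_flag_outliers(X, eps, degree_threshold):
    """Algorithm 1 with the outlier-removal step (sketch): points whose
    degree in the eps-neighborhood graph falls below `degree_threshold`
    are labeled -1; the remaining points are clustered by connected
    components."""
    D = squareform(pdist(X))
    A = (D <= eps) & ~np.eye(len(X), dtype=bool)
    degree = A.sum(axis=1)
    keep = degree >= degree_threshold
    labels = np.full(len(X), -1, dtype=int)
    if keep.any():
        sub = csr_matrix(A[np.ix_(keep, keep)])
        _, labels[keep] = connected_components(sub, directed=False)
    return labels
```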

Lemma 1.

For and , and ,

where

The proof of Lemma 1 is in Section B.1. Let ; by Lemma 1, .

Proposition 2.

Assume the conditions of Theorem 1 hold, now in a setting that includes outliers. Suppose:

(6)

Then, Algorithms 1 and 2 (modified) are both perfectly accurate with high probability.

The proof of Proposition 2 is in Section 6.7. With enough separation, as assumed here, outliers are disconnected from non-outliers, and their degree is of order roughly . Therefore, they should be properly identified by the thresholding procedure. As for non-outliers, the term on the left hand side of (6) is the order of magnitude of the degree of points sampled from , so that (6) essentially guarantees that non-outliers survive the thresholding step.

In [38] outliers are sampled anywhere in space but away from clusters, which corresponds to having support . In that case, perfect accuracy is impossible, as the algorithms will confuse outliers within from with points belonging to . However, knowing that, with high probability, there are at most such outliers (in fact if , by Lemma 1), the algorithms make a mistake on a negligible fraction of outliers.

4.2 Clustering at the detection threshold

Assume each cluster is sufficiently sampled, which we rigorously define as:

(7)

Note that the related condition is equivalent to requiring that, within each cluster, the distance between a point and its nearest-neighbor is of order . With (7) holding, the choice implies both (5) and (6), so that Algorithms 1 and 2 (modified) are perfectly accurate with high probability, even in a setting including outliers.

Now, instead of clustering, consider the task of detecting the presence of a cluster hidden among a large number of outliers. We observe the data, , and want to decide between the following two hypotheses: under the null, the points are all outliers; under the alternative, there is a surface such that points are sampled from , while the rest of the points, of them, are sampled as outliers. Assuming that the parameters and are known, it is shown in [5, 3] that the scan statistic is able to separate the null from the alternative if

The author is not aware of a method that improves on those rates, and from translating recent results on detection in graphs [4], there is evidence that those rates are optimal up to a poly-logarithmic factor. This condition is essentially the same as (7), except for the factor. Hence, Algorithms 1 and 2 (modified) solve the clustering task perfectly within a poly-logarithmic factor of the best known signal-to-noise ratio required for the detection task.

5 Discussion

5.1 Selecting the scale parameter

Choosing the affinity scale is critical in all algorithms described here, and more generally in any method which uses a neighborhood graph. Assuming (7) holds, we already saw that the choice implies (5), so that, with enough separation, Algorithms 1, 2 and 3 are accurate. In terms of separation, this allows the clusters to be as close as . As seen in Section 3, this is not optimal. Though the choice of may be made more precise with more information on the clusters, such as the number of points sampled from them and their dimension, this information may not be available.

In practice, choosing the scale is still an ongoing line of research, with similarities to bandwidth selection in kernel smoothing. We focus on the local scaling method of Zelnik-Manor and Perona [51], where is defined as , with equal to the distance between and its th nearest neighbor. When is of compact support, this essentially means that if is not among the first nearest neighbors of and vice versa, then and are not connected in the neighborhood graph, corresponding to a mutual -nearest neighbor graph. The parameter replaces as the tuning parameter, effectively setting the number of neighbors (degree) instead of the neighborhood range. This allows the scaling to adapt to the local sampling density.
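Here is a sketch of a local-scaling affinity in the spirit of Zelnik-Manor and Perona: each point gets a local scale equal to the distance to its k-th nearest neighbor, and the pairwise scale is the geometric mean of the two local scales. The precise way the paper combines the local scales (and the constant in the exponent) may differ; this is the commonly used variant, with illustrative names.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def local_scaling_affinity(X, k):
    """Local-scaling affinity (sketch): W_ij = exp(-d_ij^2 / (2 s_i s_j)),
    where s_i is the distance from x_i to its k-th nearest neighbor."""
    D = squareform(pdist(X))
    # distance to the k-th nearest neighbor (column 0 is the point itself)
    s = np.sort(D, axis=1)[:, k]
    W = np.exp(-D ** 2 / (2.0 * np.outer(s, s)))
    np.fill_diagonal(W, 0.0)
    return W
```

Feeding this affinity (or its thresholded version) into Algorithms 1 or 2 replaces the global scale with the single tuning parameter k.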

Proposition 3.

Consider the generative model of Section 1.1 with surfaces . In terms of separation, for a sequence such that , assume that

Then, the local scaling version of Algorithm 1 with is perfectly accurate with high probability.

The proof of Proposition 3 is in Section 6.8. As a consequence of Proposition 3, with local scaling, Algorithm 1 essentially achieves the separation in (5). So in that sense local scaling offers a (near-)optimal way of building the neighborhood graph.

A weaker result, directly dealing with a -nearest neighbor graph and without the optimality implications on the amount of separation, appears in Brito, Chávez, Quiroz and Yukich [14]. They find that, when the separation between clusters remains fixed, choosing of order makes Algorithm 1 work. However, assuming the underlying surfaces have diameter of order 1 and the same dimension , Maier, Hein and von Luxburg [38] find that the optimal is of order , which is of order only when is of order . As they point out in their paper, it makes sense to use a larger if the separation between clusters is large. However, it is still not clear how to automatically choose an optimal without information on the separation between clusters.

5.2 Selecting the number of clusters

Algorithm 2 depends on choosing the number of clusters appropriately. Since the method relies on the few top eigenvectors of the matrix , a first approach consists in choosing by inspecting the eigenvalues of . We provide below an estimate for the gap between the and eigenvalues, which in theory may be used to select the correct number of clusters. Note that the bound we derive is very crude; for example, if the surfaces are affine subspaces and the sampling is exact (), a sharper bound of order holds [13].

Proposition 4.

Under the conditions of Theorem 2, with high probability,

The proof of Proposition 4 is in Section 6.9. In practice, this method is seen to work poorly; for example, in [36], choosing the number of clusters by cross-validation is observed to be more reliable. In [51], the authors suggest examining the few top eigenvectors instead of the eigenvalues. We do not study these methods here. In a slightly different context, Biau, Cadre and Pelletier [10] propose essentially to count the number of connected components found by Algorithm 1. The conditions stated in Theorem 1 of course guarantee this estimate is accurate with high probability. Their result is however more precise.
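For illustration, a common heuristic reading of an eigengap bound is to pick the number of clusters at the largest gap among the top eigenvalues of the normalized affinity matrix. The sketch below implements that heuristic; it is not the paper's exact rule, it assumes the affinity comes from a strictly positive kernel (so all degrees are positive), and the names and the cap `k_max` are illustrative.

```python
import numpy as np

def estimate_num_clusters(W, k_max=10):
    """Heuristic eigengap selection (sketch): return the index of the
    largest gap among the top eigenvalues of Z = D^{-1/2} W D^{-1/2}."""
    deg = W.sum(axis=1)                      # assumed positive (positive kernel)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    Z = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals = np.sort(np.linalg.eigvalsh(Z))[::-1]      # descending
    gaps = eigvals[:k_max] - eigvals[1:k_max + 1]       # consecutive gaps
    return int(np.argmax(gaps)) + 1
```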

5.3 When the ambient dimension is large

In a number of modern applications, such as clustering of gene expression data [48, 8], document retrieval [33, 7] or clustering 3D objects in computer vision [29], the ambient dimension is routinely several orders of magnitude larger than the number of points . Though we can always restrict ourselves to the subspace where the points live, which is of dimension or less, we consider here the situation where the ambient space is the unit ball in an infinite-dimensional space, for example a Hilbert space as considered in [11]. As defining a uniform distribution in such a space is a non-trivial endeavor [25], we modify the model slightly. We assume that the points are generated from the surface as follows: , where , a probability measure equivalent to the uniform measure on , and , a probability measure with support in the unit ball. Outliers are directly sampled from .

Under this setting, Theorems 1 and 2 remain valid in the case where , where the condition (5) does not involve the ambient dimension:

The arguments are essentially identical. The case is not as straightforward, since this is the regime where, in some sense, the effective dimension of is the ambient dimension, and the specifics of the distribution come into play. Also, our arguments involve using packings of , so that the actual structure of the ambient space is critical. The same comments apply for the case where outliers are present in the data.

5.4 Computational Issues

We consider the computational complexity of each of the methods described earlier in the paper. Below, is a large enough constant.

Building the neighborhood graph may be done by brute force in flops, where is the cost of computing the distance between two points; for example, without further structure, in dimension . This may be done more effectively using an algorithm for range search, or -nearest neighbor search for the local scaling version. In low dimensions, , this may be done with kd-trees in flops. In higher dimensions, other alternatives may work better [15].
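As an illustration of the range-search strategy in low dimensions, the sketch below builds the edge list of the neighborhood graph with a kd-tree instead of the brute-force distance matrix; the function name is illustrative and the k-nearest-neighbor query noted in the comment is the variant used by local scaling.

```python
from scipy.spatial import cKDTree

def neighborhood_graph_edges(X, eps):
    """Edges (i, j), i < j, of the eps-neighborhood graph via a kd-tree
    range search (sketch of the low-dimensional strategy discussed above)."""
    tree = cKDTree(X)
    return tree.query_pairs(r=eps, output_type="ndarray")

# For the local scaling / k-nearest-neighbor version, tree.query(X, k=k + 1)
# returns each point's k nearest neighbors (the first column is the point itself).
```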

Once the neighborhood graph is built, Algorithm 1 extracts the connected components of the graph, which may be done in flops if using the local scaling version with as suggested in Section 5.1, since in that case the maximum degree is not larger than . Algorithm 2 extracts the leading eigenvectors of , which may be done in flops, using Lanczos-type algorithms [18] since, using again local scaling with , has about non-zero coefficients per row. So in both Algorithms 1 and 2 with local scaling, the total computational complexity is flops in low dimensions; and at most flops in higher dimensions. Algorithm 3 runs in flops in any dimension [52].

6 Proofs

In all the proofs that follow, we assume for concreteness that the different sampling distributions are in fact uniform distributions over their respective support and that the kernel has support ; we assume that and that all underlying surfaces are of diameter of order 1. The remaining cases may be treated similarly. We use to denote a generic positive constant, whose actual value may change from place to place.

6.1 Proof of Theorem 1

First, two distinct clusters and are disjoint in the graph. Indeed, for and , and therefore since is supported in .

Now consider a single cluster of size generated from a surface . Let be an -packing of . Because , , so that by Lemma 1, ; on the other hand, implies , so that by Lemma 1 again, . Hence, .

By condition (5), , and so by a simple modification of Lemma 2 in [6], with high probability each ball contains at least one data point (in fact, of order data points). If were disconnected, we could group the data points into two groups in such a way that the minimum distance between the two groups would exceed . By the triangle inequality, this would imply a grouping of the ’s into two groups with a minimum pairwise distance of . The balls would then be divided into two disjoint groups, which contradicts the fact that they cover the connected set .

6.2 Proof of Theorem 2

We follow the strategy outlined in [40] based on verifying the following conditions (where (A4) has been simplified). For , let denote the submatrix of corresponding to the index set . For , define , which is the degree of within the cluster it belongs to. Let denote the row vectors of .

  • (A1) For all , the second largest eigenvalue of is bounded above by .

  • (A2) For all , with ,

  • (A3) For all and all ,

  • (A4) For all and all , .

We present below a slightly modified version of Theorem 2 in [40].

Theorem 7 (Based on Th. 2 in [40]).

Under (A1)-(A4), there is an orthonormal set such that,

The proof of Theorem 7 is in Section A. It is partly based on information that Andrew Ng shared with the author and the proof of [16, Th 4.5] by Chen and Lerman. Note that the latter deals with the special case where the clusters are of comparable sizes () and of same dimension (), and the result they obtain is somewhat different.

We show below that , that for any , and that . Hence, the right hand side in the expression above is of order for any , since is assumed fixed. Hence, , and therefore, since the ’s are themselves orthonormal, k-means with near-orthogonal initialization outputs the perfect clustering with high probability. We now turn to verifying (A1)-(A4), in reverse order.

(A4): We show that, with high probability and uniformly over ,

Consider a single cluster of size , generated from sampling near a surface of dimension . Assume that (5) holds, namely . Given , are i.i.d. random variables in , with mean . Note that by Lemma 1. Using Hoeffding’s Inequality, in the form of inequality (2.1) in [30], there is a constant , such that

By (5), so that, with Boole’s Inequality, we conclude that uniformly over all , with high probability.

(A3): Fix ; by applying the result in (A4) with as a kernel, we get the following order of magnitude, uniformly over ,

Therefore, by (A4) and then (5),

Now, take two points, and . Because , we get . With (A4) and (5), this implies that

Therefore, we can take , so that for any .

(A2): We apply the same arguments we just used to bound the sum on the left hand side of (A3). In particular, we can take , so that for any .

(A1): We prove that the spectral gap satisfies with high probability. As suggested in [40], we approach this through a lower bound on the Cheeger constant. Consider a single cluster of size , generated from sampling near a surface of dimension , and assume that (5) holds. That has eigenvalue 1 with multiplicity 1 results from the graph being fully connected. The Cheeger constant of is defined as:

where the minimum is over all subsets of size . The spectral gap of is then of order at least . Using (A4), we get the lower bound:

Let be an -packing of and define . Not only are the cells non-empty with high probability, they in fact all contain order points; see Lemma 2 in [6] or the proof of (A4). For a fixed such that , there are necessarily two cells and with such that and , for otherwise it would imply that is disconnected. Therefore, given that points in and are within distance , and that both and contain order points, we have

As this is true for any such that , we have that .

6.3 Proof of Theorem 4

We start with the one dimensional case, which is substantially simpler than the situation in higher dimensions, as the boundaries of one-dimensional sets are just points. We work with the sup-norm for convenience and clarity. For two probability distributions , let denote their Hellinger distance [34, Chap. 13].

6.3.1 The case

Consider the line segment within generated by the first canonical vector, which we identify with . For and , define

For , generate a cluster (resp. ) by sampling uniformly from (resp. ), where

The sampling is in proportion with the volume of these regions, i.e. . By sufficiency, we need only consider the first coordinate, effectively reducing the case to that of . From this perspective, the setting is that of points sampled from , the uniform distribution on .

Let and , and assume . Suppose we want to decide between and . From a clustering method, we obtain a test in the following way: after grouping the points, we reject the null hypothesis if separates the two clusters. Since the interval contains more than data points with high probability, the clustering method has an error rate of at least when, as a test, it makes an error. Fix a probability . As a consequence of [34, Th. 13.1.3], and any test makes an error with probability at least if is small enough.

6.3.2 The case

Consider the -dimensional affine surface within generated by the first canonical vectors, which we identify with . For a function and , define

For , generate a cluster (resp. ) by sampling uniformly from (resp. ), where

Again, the sampling is in proportion with the volume of these regions. By sufficiency, we need only consider the first coordinates, effectively reducing the case to that of . Henceforth, the setting is that of points sampled from , the uniform distribution on , where

For , consider the function if and otherwise. For large enough relative to , we have . Let and consider testing versus . From a clustering method, we obtain a test in the same way: after grouping the points, we reject the null hypothesis if the graph of separates the two clusters. The region is non-empty with non-negligible probability if is bounded away from zero. When this happens, the clustering method ‘misclassifies’ the points falling in that region when, as a test, it makes an error. Fix a probability . As a consequence of [34, Th. 13.1.3], and any test makes an error with probability at least if is small enough.

6.4 Proof of Theorem 5

In dimension one, we show that the logarithmic factor is needed. This seems quite intuitive, since the longest distance between any pair of consecutive points is of order [2]. We build on the proof of Theorem 4. Define , assumed to be an integer for simplicity. Consider for , and define and . Suppose we want to decide between , and . With , the clustering method has an error rate of at least when, as a test, it makes an error.

Lemma 2.

Consider testing versus . If with , then for any test, the sum of the probabilities of type I and type II errors tends to 1.

The proof of Lemma 2 is in Section B.2.

6.5 Proof of Theorem 6

We build on the proof of Theorem 4. Define , assumed to be an integer for simplicity. For a sequence , consider the function if , for . Similarly, define by replacing with . Suppose we want to decide between and