Spectral clustering based on local linear approximations

01/08/2010 · Ery Arias-Castro et al. · University of Minnesota, Duke University, University of California San Diego

In the context of clustering, we assume a generative model where each cluster is the result of sampling points in the neighborhood of an embedded smooth surface; the sample may be contaminated with outliers, which are modeled as points sampled in space away from the clusters. We consider a prototype for a higher-order spectral clustering method based on the residual from a local linear approximation. We obtain theoretical guarantees for this algorithm and show that, in terms of both separation and robustness to outliers, it outperforms the standard spectral clustering algorithm (based on pairwise distances) of Ng, Jordan and Weiss (NIPS '01). The optimal choice for some of the tuning parameters depends on the dimension and thickness of the clusters. We provide estimators that come close enough for our theoretical purposes. We also discuss the cases of clusters of mixed dimensions and of clusters that are generated from smoother surfaces. In our experiments, this algorithm is shown to outperform pairwise spectral clustering on both simulated and real data.


1 Introduction

In a number of modern applications, the data appear to cluster near some low-dimensional structures. In the particular setting of manifold learning [51, 47, 7, 22, 17], the data are assumed to lie near manifolds embedded in Euclidean space. When multiple manifolds are present, the foremost task is separating them, meaning the recovery of the different components of the data associated with the different manifolds. Manifold clustering naturally occurs in the human visual cortex, which excels at grouping points into clusters of various shapes [41, 21]. It is also relevant for a number of modern applications. For example, in cosmology, galaxies seem to cluster forming various geometric structures such as one-dimensional filaments and two-dimensional walls [52, 39]. In motion segmentation, feature vectors extracted from moving objects and tracked along different views cluster along affine or algebraic surfaces [35, 23, 53, 10]. In face recognition, images of faces in fixed pose under varying illumination conditions cluster near low-dimensional affine subspaces [31, 6, 19], or along low-dimensional manifolds when introducing additional poses and camera views.

In the last few years, several algorithms for multi-manifold clustering have been introduced; we discuss them individually in Section 1.3.3. We focus here on spectral clustering methods and, in particular, study a prototypical multiway method relying on local linear approximations, with precursors appearing in [12, 1, 48, 2, 27]. We refer to this method as Higher-Order Spectral Clustering (HOSC). We establish theoretical guarantees for this method within a standard mathematical framework for multi-manifold clustering. Compared with all other algorithms we are aware of, HOSC is able to separate clusters that are much closer together; equivalently, HOSC is accurate at a much lower sampling rate than any other algorithm we know of. Roughly speaking, a typical algorithm for multi-manifold clustering relies on local characteristics of the point cloud in a way that presupposes that all points, or at least the vast majority of points, in a (small enough) neighborhood come from a single cluster, except in places like intersections of clusters. In contrast, though HOSC is also a local method, it can work with neighborhoods where two or more clusters coexist.

1.1 Higher-Order Spectral Clustering (HOSC)

We introduce our higher-order spectral clustering algorithm in this section, tracing its origins to the spectral clustering algorithm of Ng et al. [42] and the spectral curvature clustering of Chen and Lerman [12, 11].

Spectral methods are based on building a neighborhood graph on the data points and partitioning the graph using its Laplacian [22, 34], which is closely related to the extraction of connected components. The version introduced by Ng et al. [42] is an emblematic example—we refer to this approach as SC. It uses an affinity based on pairwise distances. Given a scale parameter $\epsilon > 0$ and a kernel $\phi$, define

$$ W_{ij} = \phi\!\left( \|x_i - x_j\| / \epsilon \right), \qquad W_{ii} = 0. \tag{1} $$

($\|\cdot\|$ denotes the Euclidean norm.) Standard choices include the heat kernel (e.g., $\phi(t) = e^{-t^2/2}$) and the simple kernel $\phi(t) = 1\{t \le 1\}$. Let $x_1, \dots, x_n$ denote the data points. SC starts by computing all pairwise affinities $W_{ij}$, with $W_{ii} = 0$, for $i, j = 1, \dots, n$. It then computes the matrix $Z = \Delta^{-1/2} W \Delta^{-1/2}$, where $\Delta = \operatorname{diag}(\deg_1, \dots, \deg_n)$ and $\deg_i = \sum_j W_{ij}$ is the degree of the $i$th point in the graph with similarity matrix $W$. Note that $I - Z$ is the corresponding normalized Laplacian. Providing the algorithm with the number of clusters $K$, SC continues by extracting the top $K$ eigenvectors of $Z$, obtaining a matrix $U$, and after normalizing its rows, uses them to embed the data into $\mathbb{R}^K$. The algorithm concludes by applying $K$-means to the embedded points. See Algorithm 1 for a summary.

Input:
        $x_1, \dots, x_n$: the data points
        $\epsilon$: the affinity scale
        $K$: the number of clusters
Output:
        A partition of the data into $K$ disjoint clusters
Steps:
1: Compute the affinity matrix $W = (W_{ij})$, with $W_{ij}$ as in (1).
2: Compute the matrix $Z = \Delta^{-1/2} W \Delta^{-1/2}$, where $\Delta = \operatorname{diag}(\deg_1, \dots, \deg_n)$ and $\deg_i = \sum_j W_{ij}$.
3: Extract $U$, the top $K$ eigenvectors of $Z$.
4: Renormalize each row of $U$ to have unit norm, obtaining a matrix $\tilde{U}$.
5: Apply $K$-means to the row vectors of $\tilde{U}$ in $\mathbb{R}^K$ to find $K$ clusters.
6: Accordingly group the original points into $K$ disjoint clusters.
Algorithm 1 Spectral Clustering (SC) [42]
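To make the steps concrete, here is a minimal, hedged sketch of Algorithm 1 in Python/NumPy. The function name, the use of the heat kernel, and the plain Lloyd iteration for $K$-means are our own illustration choices, not the authors' implementation.

```python
import numpy as np

def sc(X, eps, K):
    """Sketch of Algorithm 1 (SC). X: (n, D) data, eps: affinity scale, K: number of clusters."""
    n = X.shape[0]
    # Step 1: pairwise affinities W_ij = exp(-||x_i - x_j||^2 / (2 eps^2)), zero diagonal.
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-D2 / (2 * eps ** 2))
    np.fill_diagonal(W, 0.0)
    # Step 2: symmetric normalization Z = Delta^{-1/2} W Delta^{-1/2}.
    deg = W.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    Z = inv_sqrt[:, None] * W * inv_sqrt[None, :]
    # Step 3: top-K eigenvectors of the symmetric matrix Z.
    _, vecs = np.linalg.eigh(Z)
    U = vecs[:, -K:]
    # Step 4: renormalize rows to unit length.
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    # Steps 5-6: K-means on the embedded rows (a few Lloyd iterations, random init).
    rng = np.random.default_rng(0)
    centers = U[rng.choice(n, K, replace=False)]
    for _ in range(50):
        labels = np.argmin(((U[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = U[labels == k].mean(axis=0)
    return labels
```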

Spectral methods utilizing multiway affinities were proposed to better exploit additional structure present in the data. The spectral curvature clustering (SCC) algorithm of Chen and Lerman [12, 11] was designed for the case of hybrid linear modeling, where the manifolds are assumed to be affine, a setting that arises in motion segmentation [35]. Assuming that the subspaces are all of dimension $d$—a parameter of the algorithm—SCC starts by computing the (polar) curvature of all $(d+2)$-tuples of data points, creating a $(d+2)$-way tensor. The tensor is then flattened into a matrix whose product with its transpose is used as an affinity matrix for the spectral algorithm SC. (In practice, the algorithm is randomized for computational tractability.) Kernel spectral curvature clustering (KSCC) [10] is a kernel version of SCC designed for the case of algebraic surfaces.

The SCC algorithm (and therefore KSCC) is not localized in space, as it fits a parametric model that is global in nature. The method we study here may be seen as a localization of SCC, which is appropriate in our nonparametric setting since the manifolds resemble affine surfaces locally. This type of approach is mentioned in publications on affinity tensors [1, 48, 2, 27] and is studied here for the first time, to our knowledge. As discussed in Section 4, all reasonable variants have similar theoretical properties, so we choose one of the simplest versions to ease the exposition. Concretely, we consider a multiway affinity that combines pairwise distances between nearest neighbors and the residual from the best $d$-dimensional local linear approximation. Formally, given a set of $m$ points $x_1, \dots, x_m$, define

$$ w(x_1, \dots, x_m) = \min_{L \in \mathcal{A}_d} \ \max_{1 \le i \le m} \operatorname{dist}(x_i, L), \tag{2} $$

where $\operatorname{dist}(x, L)$ denotes the distance from a point $x$ to a subset $L \subset \mathbb{R}^D$ and $\mathcal{A}_d$ denotes the set of $d$-dimensional affine subspaces in $\mathbb{R}^D$. In other words, $w(x_1, \dots, x_m)$ is the (half-)width of the thinnest tube (or band) around a $d$-dimensional affine subspace that contains $x_1, \dots, x_m$. (In our implementation, we use the mean-square error; see Section 3.) Given scale parameters $\epsilon, \eta > 0$ and a kernel function $\phi$, define the following affinity: $A(x_1, \dots, x_m) = 0$ if $x_1, \dots, x_m$ are not distinct; otherwise:

$$ A(x_1, \dots, x_m) = \phi\!\left( \frac{\operatorname{diam}(x_1, \dots, x_m)}{\epsilon} \right) \cdot \phi\!\left( \frac{w(x_1, \dots, x_m)}{\eta} \right), \tag{3} $$

where $\operatorname{diam}(x_1, \dots, x_m)$ is the diameter of $\{x_1, \dots, x_m\}$, i.e., the largest pairwise distance. See Figure 1 for an illustration.

Figure 1: The circle is of radius $\epsilon$ and the band is of half-width $\eta$. Assuming we use the simple kernel, the $m$-tuple on the left has affinity equal to one, while the other two $m$-tuples have affinity equal to zero, the first one for having a diameter exceeding $\epsilon$ and the second one for being ‘thicker’ than $\eta$.
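The following hedged sketch illustrates the ingredients of (2)-(3) for a single tuple: its diameter and its flatness, the latter approximated here by the orthogonal residual from the PCA (least-squares) fit rather than by the exact thinnest-band width; the paper's own implementation uses the mean-square version (see Section 3). Function names and the indicator-kernel affinity are illustration choices, and the exact normalization of (3) is not reproduced.

```python
import numpy as np

def tuple_diameter(P):
    """P: (m, D) array of points; returns the largest pairwise distance."""
    D2 = ((P[:, None, :] - P[None, :, :]) ** 2).sum(-1)
    return np.sqrt(D2.max())

def tuple_flatness(P, d, use_rms=False):
    """Orthogonal residual of P from its best-fit d-dimensional affine subspace (PCA)."""
    Pc = P - P.mean(axis=0)
    # The bottom right-singular vectors span the orthogonal complement of the fit.
    _, _, Vt = np.linalg.svd(Pc, full_matrices=False)
    resid = Pc @ Vt[d:].T                 # coordinates orthogonal to the affine fit
    dists = np.linalg.norm(resid, axis=1)
    return np.sqrt((dists ** 2).mean()) if use_rms else dists.max()

def simple_affinity(P, d, eps, eta):
    """Affinity in the spirit of (3) with the simple (indicator) kernel:
    1 if the tuple fits in a ball of radius ~eps and a band of half-width ~eta, else 0."""
    return float(tuple_diameter(P) <= eps and tuple_flatness(P, d) <= eta)
```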

Given data points $x_1, \dots, x_n$ and approximation dimension $d$, we compute all $m$-way affinities, and then obtain pairwise similarities by clique expansion [2] (note that several other options are possible [12, 48, 27]):

$$ W_{ij} = \sum_{i_3, \dots, i_m} A(x_i, x_j, x_{i_3}, \dots, x_{i_m}). \tag{4} $$

Though it is tempting to choose $m$ equal to $d + 2$, the smallest tuple size for which the flatness measure is informative, a larger $m$ allows for more tolerance to weak separation and a small sampling rate. The downside is what appears to be an impractical computational burden, since the mere computation of $W$ in (4) requires order $n^m$ flops. In Section 1.4, we discuss how to drastically reduce the computational complexity, essentially without compromising performance.

Once the affinity matrix is computed, the SC algorithm is applied. We call the resulting procedure higher-order spectral clustering (HOSC), summarized in Algorithm 2. Note that HOSC is (essentially) equivalent to SC when $m = 2$, and to SCC when the locality constraint is removed (i.e., $\epsilon = \infty$).

Input:
        $x_1, \dots, x_n$: the data points
        $d, m$: the approximation dimension and affinity order
        $\epsilon, \eta$: the affinity scales
        $K$: the number of clusters
Output:
        A partition of the data into disjoint clusters
Steps:
1: Compute the affinity matrix according to (4).
2: Apply SC (Algorithm 1).
Algorithm 2 Higher Order Spectral Clustering (HOSC)
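Below is a hedged sketch of how the pieces fit together: $m$-way affinities are accumulated over tuples drawn from $k$-nearest-neighbor neighborhoods (one of several tuple-selection options mentioned in Section 1.4) and clique-expanded into a pairwise matrix as in (4), which is then passed to the spectral steps of Algorithm 1. Here `simple_affinity` is the tuple affinity sketched above, and `sc_from_affinity` is an assumed helper running Steps 2-6 of Algorithm 1 on a precomputed affinity matrix.

```python
import itertools
import numpy as np

def hosc(X, d, m, eps, eta, K, sc_from_affinity, k_nn=8):
    """Sketch of Algorithm 2 (HOSC) restricted to k-nearest-neighbor tuples."""
    n = X.shape[0]
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    nbrs = np.argsort(D2, axis=1)[:, 1:k_nn + 1]     # k nearest neighbors of each point
    W = np.zeros((n, n))
    for i in range(n):
        for subset in itertools.combinations(nbrs[i], m - 1):
            idx = (i,) + subset
            a = simple_affinity(X[list(idx)], d, eps, eta)
            if a > 0:
                for p, q in itertools.combinations(idx, 2):   # clique expansion, as in (4)
                    W[p, q] += a
                    W[q, p] += a
    return sc_from_affinity(W, K)        # assumed: Steps 2-6 of Algorithm 1 on W
```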

1.2 Generative Model

It is time to introduce our framework. We assume a generative model where the clusters are the result of sampling points near surfaces embedded in an ambient Euclidean space, specifically, the $D$-dimensional unit hypercube $[0,1]^D$. For a surface $S$ and $r > 0$, define its $r$-neighborhood as

$$ B(S, r) = \{ x \in \mathbb{R}^D : \operatorname{dist}(x, S) < r \}. $$

The reach of $S$ is the supremum over $r > 0$ such that, for each $x \in B(S, r)$, there is a unique point of $S$ realizing $\operatorname{dist}(x, S)$ [20]. It is well known that, for submanifolds, the reach bounds the radius of curvature from below [20, Lem. 4.17]. For a connection to computational geometry, the reach coincides with the condition number introduced in [43] for submanifolds without boundary. Let $\mathcal{H}^d$ denote the $d$-dimensional Hausdorff measure. For an integer $d$ and a constant $\kappa \ge 1$, we consider the class of $d$-dimensional, connected submanifolds of $[0,1]^D$ with reach at least $1/\kappa$ and, when a surface has a boundary, whose boundary is itself a $(d-1)$-dimensional submanifold with reach at least $1/\kappa$. Given surfaces $S_1, \dots, S_K$ in this class, we generate clusters by sampling points uniformly at random in $B(S_k, \tau) \cap [0,1]^D$, the $\tau$-neighborhood of $S_k$, for all $k = 1, \dots, K$. We call $\tau$ the jitter level. Except for Section 2.3, where we allow for intersections, we assume that the surfaces are separated by a distance of at least $\delta > 0$, i.e.

(5)

In that case, by the triangle inequality, the actual clusters are separated by a distance of at least $\delta - 2\tau$.

We assume that the clusters are comparable in size, in the sense that their sample sizes are within a constant factor of one another. Let $x_1, \dots, x_n$ denote the data points thus generated. See Figure 2 for an illustration.

Figure 2: This figure illustrates the generative model. Left: Three surfaces (here curves) with their $\tau$-neighborhoods. The curves are separated by a distance of at least $\delta$. Right: Points sampled within the tubular neighborhoods of the surfaces.

Given the data, we aim at recovering the clusters. Formally, a clustering algorithm is a function that takes the data $x_1, \dots, x_n$, and possibly other tuning parameters, and outputs a partition of the data. We say that it is ‘perfectly accurate’ if the output partition coincides with the original partition of the data into the clusters. Our main focus is on relating the sample size and the separation requirement in (5) (in order for HOSC to cluster correctly), and in particular we let the separation $\delta$ and the jitter $\tau$ vary with $n$. This dependency is left implicit. In contrast, we assume that the ambient dimension $D$, the intrinsic dimension $d$, and the number of clusters $K$ are fixed. Also, we assume that the model characteristics such as $d$ and $\tau$ are known throughout the paper (except for Section 2.1, where we consider their estimation). Though our setting is already quite general, we discuss some important extensions in Section 4.

We will also consider the situation where outliers may be present in the data. By outliers we mean points that were not sampled near any of the underlying surfaces. We consider a simple model where outliers are points sampled uniformly in the unit hypercube, at a distance of at least $\tau_0$ from the surfaces, for some $\tau_0 > 0$ in general different from $\tau$. We let $n_0$ denote the number of outliers, while $n$ still denotes the total number of data points, including outliers. See Figure 3 for an illustration.

Figure 3: This figure illustrates the generative model with outliers included in the data.
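For concreteness, here is a hedged sketch of this generative model in the plane: points sampled (approximately) uniformly in the $\tau$-neighborhoods of three concentric circles, plus uniform outliers in the unit square. The radii, sample sizes, and jitter value are arbitrary illustration choices, not those used in the paper's figures.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_circle_cluster(center, radius, tau, n):
    """Sample n points in the tau-neighborhood of a circle (nearly uniform for small tau)."""
    theta = rng.uniform(0, 2 * np.pi, n)
    r = radius + rng.uniform(-tau, tau, n)          # jitter within the tube
    return np.c_[center[0] + r * np.cos(theta), center[1] + r * np.sin(theta)]

tau = 0.01
clusters = [sample_circle_cluster((0.5, 0.5), rad, tau, 200) for rad in (0.15, 0.3, 0.45)]
outliers = rng.uniform(0, 1, size=(100, 2))         # background outliers in the unit square
X = np.vstack(clusters + [outliers])
labels = np.repeat(np.arange(4), [200, 200, 200, 100])   # label 3 marks the outliers
```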

1.3 Performance in terms of Separation and Robustness

1.3.1 Performance of SC

A number of papers analyze SC under generative models similar to ours [3, 54, 44, 40], and the closely related method of extracting connected components of the neighborhood graph [3, 37, 9, 36]. The latter necessitates a compactly supported kernel and may be implemented via a union-of-balls estimator for the support of the density [16]. Under the weaker (essentially Lipschitz) regularity assumption

(6)

Arias-Castro [3] shows that SC with a compactly supported kernel is accurate if

(7)

($a \vee b$ denotes the maximum of $a$ and $b$, and $a_n \ll b_n$ means $a_n / b_n \to 0$ as $n \to \infty$.) With the heat kernel, the same result holds up to a multiplicative factor. See also [37, 36], which prove a similar result for the method of extracting connected components under stronger regularity assumptions. At the very least, (7) is necessary for the union-of-balls approach and for SC with a compactly supported kernel, because it is of the order of magnitude of the largest distance between a point and its closest neighbor from the same cluster [45]. Note that (6) is very natural in the context of clustering, as it prevents a cluster from being too narrow in some places and possibly being confused with two or more disconnected surfaces. And, when the constant in (6) is large enough and the jitter is small enough, (6) is satisfied by any surface in the class introduced in Section 1.2; indeed, such a surface resembles an affine subspace locally, and (6) is obviously satisfied for an affine surface.

When outliers may be present in the data, as a preprocessing step we identify as outliers data points with low connectivity in the graph with affinity matrix $W$, and remove these points from the data before proceeding with clustering. (This is done between Steps 1 and 2 in Algorithm 1.) In the context of spectral clustering, this is very natural; see, e.g., [12, 37, 3]. Using the pairwise affinity (1), outliers are properly identified if the separation between outliers and clusters satisfies the lower bound in (7) and if the sampling is dense enough, specifically [3],

(8)

When the surfaces are only required to be of Lipschitz regularity as in (6), we are not aware of any method that can even detect the presence of clusters among outliers if the sampling is substantially sparser.

1.3.2 Performance of HOSC

Methods using higher-order affinities are obviously more complex than methods based solely on pairwise affinities. Indeed, HOSC depends on more parameters and is computationally more demanding than SC. One therefore wonders whether this higher level of complexity is justified. We show that HOSC does improve on SC in clustering performance, both in the required separation between clusters and in robustness to outliers.

Our main contribution in this paper is to establish a separation requirement for HOSC which is substantially weaker than (7) when the jitter is small enough. Specifically, HOSC operates under the separation

(9)

where $a \wedge b$ denotes the minimum of $a$ and $b$, and the other term is the separation required for SC with a compactly supported kernel, defined in (7). This is proved in Theorem 1 of Section 2. In particular, in the jitterless case ($\tau = 0$), the magnitude of the separation required for HOSC is (roughly) the square of that for SC at the same sample size; equivalently, at a given separation, HOSC requires (roughly) the square root of the sample size needed by SC to correctly identify the clusters.

Figure 4: Left: data. Middle: output from SC. Right: output from HOSC. The sampling is much sparser than in the original paper of Ng et al. [42], which is why SC fails. This figure is part of Figure 13 in Section 3, which displays more numerical experiments.

That HOSC requires less separation than SC is also observed numerically. In Figure 4 we compare the outputs of SC and HOSC on the emblematic example of concentric circles given in [42] (here with three circles). While the former fails completely, the latter is perfectly accurate. Indeed, SC requires that the majority of points in a ball of radius comparable to the affinity scale around a given data point come from the cluster containing that point. In contrast, HOSC is able to operate properly in situations where the separation between clusters is so small, or the sampling rate so low, that any such neighborhood is empty of data points except for the one point at the center. To further illustrate this point, consider the simplest possible setting consisting of two parallel line segments separated by a distance $\delta$. Suppose $n$ points are sampled uniformly on each of these line segments. It is well known that the typical distance between a point and its nearest neighbor on the same segment is of order $1/n$; see [45]. Hence, a method computing local statistics requires neighborhoods of radius at least of that order, for otherwise some neighborhoods are empty. From (9), HOSC is perfectly accurate even when $\delta$ is much smaller than $1/n$. When the separation is that small, a typical ball of radius of order $1/n$ around a data point contains about as many points from one segment as from the other (thus SC cannot work). See Figure 5 for an illustration.

Figure 5: Clustering results obtained by SC (left) and HOSC (right) on a data set of two lines with small separation. 100 points are sampled from each line, equally spaced. Note that the inter-point separation on the same cluster is twice as large as the separation between clusters. In this case, SC cannot separate the two lines correctly, as we have argued. In contrast, HOSC clusters the data perfectly, which again agrees with the theory and our expectations. We have also tried increasing the separation, in which case both SC and HOSC perform correctly.

As a bonus, we also show that HOSC is able to resolve intersections in some (very) special cases, while SC is incapable of that. See Proposition 6 and also Figure 12.

To make HOSC robust to outliers, we do exactly as described above, identifying outliers as data points with low connectivity in the graph with affinity matrix $W$, this time computed using the multiway affinity (3). The separation and sampling requirements are substantially weaker than (8): specifically, the separation is required to satisfy the lower bound in (9), and the sampling must satisfy

(10)

This is established in Proposition 5, and again, we are not aware of any method for detection that is reliable when the sampling is substantially sparser. For example, when clustering curves ($d = 1$) in the plane ($D = 2$) with background outliers, the sampling requirement in (10) is substantially weaker than that in (8). In Figure 6 below we compare SC and HOSC on outlier detection, using the data in Figure 4 but further corrupted with outliers.

Figure 6: Left: data with outliers. Middle: outliers (black dots) detected by SC. Right: outliers (black dots) detected by HOSC. This figure is part of Figure 15 in Section 3, where more outliers-removal experiments are conducted.

1.3.3 Other Methods

We focus on comparing HOSC and SC to make a strong point that higher-order methods may be preferred to simple pairwise methods when the underlying clusters are smooth and the jitter level is small. In fact, we believe that no method suggested in the literature is able to compete with HOSC in terms of separation requirements. We quickly argue why.

The algorithm of Kushnir et al. [32] is multiscale in nature and is rather complex, incorporating local information (density, dimension and principal directions) within a soft spectral clustering approach. In the context of semi-supervised learning, Goldberg et al. [25] introduce a spectral clustering method based on local principal component analysis (PCA) to utilize the unlabeled points. Both methods rely on local PCA to estimate the local geometry of the data, and both operate by coarsening the data, eventually applying spectral clustering to a small subset of points acting as representative hubs for other points in their neighborhoods. They both implicitly require that, for the most part, the vast majority of data points in each neighborhood where the statistics are computed come from a single cluster. Souvenir and Pless [49] suggest an algorithm that starts with ISOMAP and then alternates in EM fashion between the cluster assignment and the computation of the distances between points and clusters—this is done in a lower-dimensional Euclidean space using an MDS embedding. Though this iterative method appears very challenging to analyze, it relies on pairwise distances computed as a preprocessing step to derive the geodesic distances, which implicitly assumes that the points in small enough neighborhoods are from the same manifold. Thus, like the SC algorithm, all these methods effectively rely on neighborhoods where only one cluster dominates. This is strong evidence that their separation requirements are at best similar to that of SC. The methods of Haro et al. [30] and Gionis et al. [24] are based solely on the local dimension and density, and are powerless when the underlying manifolds are of the same dimension and sampled more or less uniformly, which is the focus of this paper. The method of Guo et al. [29] relies on minimizing an energy that, just as HOSC does, incorporates the diameter and local curvature of tuples of points, with the minimization carried out combinatorially over the cluster assignment. In principle, this method could be analyzed with the arguments we deploy here; that said, it seems computationally intractable.

1.4 Computational Considerations

Thus it appears that HOSC is superior to SC and other methods in terms of separation between clusters and robustness to outliers, when the clusters are smooth and the jitter is small. But is HOSC even computationally tractable?

Assume $d$ and $m$ are fixed. The algorithm starts with building the neighborhood graph (i.e., computing the matrix $W$). This may be done by brute force in order $n^m$ flops. Clearly, this first step is prohibitive, in particular since we recommend using a (moderately) large $m$. However, we may restrict computations to points within distance $\epsilon$ of one another, which essentially corresponds to using a compactly supported kernel. Hence, we could apply a range search algorithm to reduce computations. Alternatively, at each point we may restrict computations to its $k$ nearest neighbors, or, in a slightly different fashion, adapt the local scaling method proposed in [56], replacing the global scale by a local one based on the distance between a point and its $k$th nearest neighbor. The reason is that the central condition (12) effectively requires a minimum degree at each point, which is guaranteed if enough nearest neighbors are included in the computations; see [3, 36] for rigorous arguments leading to that conclusion. In low ambient dimensions, a range search and a $k$-nearest-neighbor search may be computed effectively with kd-trees in order $n \log n$ flops. In higher dimensions, it is essential to use methods that adapt to the intrinsic dimensionality of the data. Assuming that the intrinsic dimension is small, the method suggested in [8] has a similar computational complexity. Hence, the (approximate) affinity matrix can be computed at a cost that is nearly linear in $n$ for fixed $m$, which is compatible with the parameter choices allowed in Theorem 1.

Assume we use the $k$-nearest-neighbor approximation to the neighborhood graph, with $k$ growing slowly with $n$. Then computing $Z$ may be done efficiently, since the affinity matrix has few non-zero coefficients per row, and extracting its leading eigenvectors may be done at a comparable cost using Lanczos-type algorithms [15]. Thus we may run the $k$-nearest-neighbor version of HOSC at a cost that is nearly linear in $n$, and it may be shown to perform comparably.
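A hedged sketch of this shortcut: sparsify a pairwise affinity by keeping only $k$-nearest-neighbor entries, normalize, and extract the top eigenvectors with a Lanczos-type solver. The dense input matrix and the specific sparsification rule are illustration choices.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags
from scipy.sparse.linalg import eigsh

def knn_spectral_embedding(W_dense, k_nn, K):
    """Keep each row's k_nn largest affinities (assumes zero diagonal), symmetrize,
    normalize, and return the top-K eigenvectors of the normalized matrix."""
    n = W_dense.shape[0]
    keep = np.argsort(W_dense, axis=1)[:, -k_nn:]
    rows = np.repeat(np.arange(n), k_nn)
    cols = keep.ravel()
    vals = W_dense[rows, cols]
    W = csr_matrix((vals, (rows, cols)), shape=(n, n))
    W = W.maximum(W.T)                                  # symmetrize
    deg = np.asarray(W.sum(axis=1)).ravel()
    Dinv = diags(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    Z = Dinv @ W @ Dinv
    _, vecs = eigsh(Z, k=K, which='LA')                 # Lanczos-type, largest eigenvalues
    return vecs
```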

We actually implemented the $k$-nearest-neighbor variant of HOSC and tried it on a number of simulated datasets and a real dataset from motion segmentation. The results are presented in Section 3. The code is publicly available online [13].

1.5 Content

The rest of the paper is organized as follows. The main theoretical results are in Section 2, where we provide theoretical guarantees for HOSC, including in contexts where outliers are present or the underlying clusters intersect. We emphasize that HOSC is only able to separate intersecting clusters under very stringent assumptions. In the same section we also address the issue of estimating the parameters that need to be provided to HOSC; in theory at least, they may be chosen automatically. In Section 3 we describe our implementation of HOSC and report on numerical experiments involving both simulated and real data. Section 4 discusses a number of important extensions, such as surfaces that self-intersect or have boundaries, which are excluded from the main discussion for simplicity. We also discuss the case of manifolds of different intrinsic dimensions, suggesting an approach that runs HOSC multiple times with different dimension parameters, and we describe a kernel version of HOSC that could take advantage of higher degrees of smoothness. Other extensions are also mentioned, including the use of different kernels. The proofs are postponed to the Appendix.

2 Theoretical Guarantees

Our main result provides conditions under which HOSC is perfectly accurate with probability tending to one in the framework introduced in Section 1.2. Throughout the paper, we state and prove our results when the surfaces have no boundary and for the simple kernel, for convenience and ease of exposition. We discuss the case of surfaces with boundaries in Section 4.2 and the use of other kernels in Section 4.5.

Theorem 1.

Consider the generative model of Section 1.2. For a sequence growing slowly with $n$, assume the parameters of HOSC satisfy

(11)
(12)

and

(13)

Assume that (5) holds with

(14)

Under these conditions, when $n$ is large enough, HOSC is perfectly accurate with high probability.

To relate this to the separation requirement stated in the Introduction, the condition (9) is obtained from (14) by choosing $\epsilon$ and $\eta$ equal to their respective lower bounds in (12) and (13).

We further comment on the theorem. First, the result holds if and is sufficiently large. We state and prove the result when as a matter of convenience. Also, by (11) and (14), the weakest separation requirement is achieved when is at least of order slightly less than so that is of order . However, as discussed in Section 1.4, the algorithm is not computationally tractable unless . This is another reason why we focus on the case where . Regarding the constraints (12)-(13) on and , they are there to guarantee that, with probability tending to one, each cluster is ‘strongly’ connected in the neighborhood graph. Note that the bound on is essentially the same as that required by the pairwise spectral method SC [3, 36]. In turn, once each cluster is ‘strongly’ connected in the graph, clusters are assumed to be separated enough that they are ‘weakly’ connected in the graph. The lower bound (14) quantifies the required separation for that to happen. Note that it is specific to the simple kernel. For example, the heat kernel would require a multiplicative factor proportional to .

So how does HOSC compare with SC? When the jitter is large enough, the local linear approximation contribution to (3) does not come into play. In that case, the two algorithms will output the same clustering (see Figure 7 for an example).

Figure 7: Clustering results obtained by SC (left) and HOSC (right) on the data set of Figure 5, but with separation and jitter . In this example, neither SC nor HOSC can successfully separate the two lines. This example supports our claim that when the jitter is large enough (relative to separation), HOSC does not improve over SC and the two algorithms will output the same clustering.

When the jitter is small enough, HOSC requires less separation, as demonstrated in Figure 5. Intuitively, in this regime the clusters are sampled densely enough relative to the thickness that the smoothness of the underlying surfaces comes into focus, and each cluster, as a point cloud, becomes locally well approximated by a thin band. We provide some numerical experiments in Section 3 showing HOSC outperforming SC in various settings.

Thus, HOSC improves on SC only when the jitter is small. This condition is quite severe, though again, we do not know of any other method that can accurately cluster under the weak separation requirement displayed here, even in the jitterless case. It is possible that some form of scan statistic (i.e., matched filters) may be able to operate under the same separation requirement without needing the jitter to be small; however, we do not know how to compute it in our nonparametric setting—even in the case of hybrid linear modeling, where the surfaces are affine, computing the scan statistic appears to be computationally intractable. At any rate, the separation required by HOSC is essentially optimal when the jitter is small enough. A quick argument in the one-dimensional case goes as follows. Consider a line segment of length one and sample $n$ points uniformly at random in its $\tau$-neighborhood. The claim is that this neighborhood contains an empty band of thickness of order slightly less than the separation required by HOSC, and therefore cannot be distinguished from two parallel line segments: such a band of sufficiently small half-width inside that neighborhood is empty of sample points with probability converging to one.

As regards the choice of parameters, the recommended choices depend solely on the model characteristics $d$ and $\tau$. These are sometimes unavailable, and we discuss their estimation in Section 2.1. Afterwards, we discuss issues such as outliers (Section 2.2) and intersections (Section 2.3).

2.1 Parameter Estimation

In this section, we propose some methods to estimate the intrinsic dimension $d$, the jitter level $\tau$, and the number of clusters $K$. Though we show that these methods are consistent in our setting, further numerical experiments are needed to determine their potential in practice.

Compared to SC, HOSC requires the specification of three additional parameters. This is no small issue in practice. In theory, however, the recommended choices are determined by the model characteristics. The true unknowns are therefore $d$, $\tau$, and the number of clusters $K$. We provide estimators for $d$ and $K$ that are consistent, and an estimator for $\tau$ that is accurate enough for our theoretical purposes. Specifically, we estimate $d$ and $\tau$ using the correlation dimension [28] and an adaptation of our own design. The number of clusters is estimated via the eigengap of the matrix $Z$.

2.1.1 The Intrinsic Dimension and the Jitter Level

A number of methods have been proposed to estimate the intrinsic dimensionality; we refer the reader to [33] and references therein. The correlation dimension, first introduced in [28], is perhaps the most relevant in our context, since surfaces may be close together. Define the pairwise correlation function

$$ N(r) = \frac{2}{n(n-1)} \, \#\big\{ (i, j) : i < j, \ \|x_i - x_j\| \le r \big\}. $$

The authors of [28] recommend plotting $\log N(r)$ versus $\log r$ and estimating the slope of the linear part. We use a slightly different estimator that allows us to estimate the jitter $\tau$ too, if it is not too small. The idea is to regress $\log N(r)$ on $\log r$ and identify a kink in the curve. See Figure 8 for an illustration.

Figure 8: A correlation curve for a simulated data set of 240 points sampled from the $\tau$-neighborhood of three disjoint one-dimensional curves ($d = 1$) in dimension ten ($D = 10$), crossing all dimensions. We see that the linear part of the curve has slope (near) 1, which coincides with the intrinsic dimension of the curves. The kink appears near the jitter level, providing a close approximation to $\tau$.

Though several (mostly ad hoc) methods have been proposed for finding kinks, we describe a simple method for which we can prove consistency. Fix , with . Define

Let . If there is such that

then let be the smallest such ; otherwise, let . Define ; and also , if , and the closest integer to , otherwise.
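The following hedged sketch conveys the idea behind this estimator: compute the correlation curve on a grid of radii, read the intrinsic dimension off the slope of the (near-)linear part of the log-log curve, and take the location of the most pronounced slope change as a rough estimate of the jitter. The kink rule below is a simple stand-in for the thresholded procedure defining Proposition 1, not the authors' exact construction.

```python
import numpy as np

def correlation_curve(X, radii):
    """Number of pairs within distance r, for each r in radii."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    iu = np.triu_indices(X.shape[0], k=1)
    dists = D[iu]
    return np.array([(dists <= r).sum() for r in radii], dtype=float)

def estimate_dim_and_jitter(X, r_min=1e-3, r_max=0.3, num=40):
    radii = np.geomspace(r_min, r_max, num)
    N = np.maximum(correlation_curve(X, radii), 1.0)
    logr, logN = np.log(radii), np.log(N)
    slopes = np.diff(logN) / np.diff(logr)             # local slopes of the log-log curve
    kink = int(np.argmax(np.abs(np.diff(slopes)))) + 1  # most pronounced slope change
    d_hat = slopes[kink:kink + 5].mean()                # slope just above the kink
    tau_hat = radii[kink]                               # kink location ~ jitter level
    return int(round(d_hat)), tau_hat
```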

Proposition 1.

Consider the generative model described in Section 1.2 with . Assume that and, if there are outliers, assume that . Then the following holds with probability at least : if , then ; if , then ; moreover, if , .

In the context of Proposition 1, the only time the dimension estimate is inconsistent is when the jitter is of order one, in which case the estimate equals $D$; this makes sense, since the sampled region is in fact $D$-dimensional when the jitter is of order 1. Also, the jitter estimate is within a constant factor of $\tau$ provided $\tau$ is not too small.

We now extend this method to deal with a smaller $\tau$. Consider what we just did. The quantity $N(r)$ is (proportional to) the total degree of the $r$-neighborhood graph built in SC. Fixing the other parameters, we now consider the total degree of the neighborhood graph built in HOSC, leading to a multiway correlation function

Similarly, we regress the logarithm of the multiway correlation function on the logarithm of its scale argument and identify a kink in the curve (Figure 9 displays such a curve).

Figure 9: Correlation curves corresponding to SC (left) and HOSC (right) for the data set of Figure 8, but with a much smaller jitter. We see that the pairwise correlation function works poorly in this case, while the multiway correlation curve has a kink near the true jitter level, within a small factor of it.

Using the multiway correlation function, we then propose an estimator as follows. We assume that the method of Proposition 1 returned , for otherwise we know that is accurate. Choose and . Note that this is the only time we require to be larger than . Let . If there is such that

then let be the smallest one; otherwise, let . We then redefine as .

Proposition 2.

In the context of Proposition 1, assume that . Then redefining as done above, the following holds with probability at least : if , then ; if , then .

Now, the resulting estimate comes close to $\tau$ provided $\tau$ is not too small. Whether this is the case or not, the statement of Theorem 1 applies with the estimate in place of $\tau$ in (13).

Though our method works in theory, it is definitely asymptotic. In practice, we recommend using other approaches for determining the location of the kink and the slope of the linear part of the pairwise correlation function (in log-log scale). Robust regression methods with high break-down points, like least median of squares and least trimmed squares, worked well in several examples. We do not provide details here, as this is fairly standard, but the figures are quite evocative.

2.1.2 The Number of Clusters

HOSC depends on choosing the number of clusters $K$ appropriately. A common approach consists in choosing $K$ by inspecting the eigenvalues of $Z$. We show that, properly tuned, this method is consistent within our model.

Proposition 3.

Compute the matrix $Z$ in HOSC with the same choice of parameters as in Theorem 1, except that knowledge of $K$ is not needed. Set the number of clusters equal to the number of eigenvalues of $Z$ (counting multiplicity) exceeding an appropriate threshold. Then with high probability, this method chooses the correct number of clusters.

We implicitly assumed that $d$ and $\tau$ are known, or have been estimated as described in the previous section. The proof of Proposition 3 is parallel to that of [3, Prop. 4], this time using the estimate provided in part (A1) of the proof of Theorem 1. Details are omitted.
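A hedged sketch of this selection rule: count the eigenvalues of the normalized affinity matrix exceeding a threshold close to one. The threshold value is an illustration, not the constant prescribed by the theory.

```python
import numpy as np

def estimate_num_clusters(Z, threshold=0.99):
    """Estimate K as the number of eigenvalues of the symmetric matrix Z above a threshold."""
    vals = np.linalg.eigvalsh(Z)
    return int((vals > threshold).sum())
```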

Figure 10 illustrates a situation where the number of clusters is correctly chosen by inspection of the eigenvalues, more specifically, by counting the number of eigenvalues equal to 1 in the spectrum of $Z$ (up to numerical error). This success is due to the fact that the clusters are well separated, and even then, the eigengap is quite small.

Figure 10: The top six eigenvalues of the weight matrix obtained by HOSC in Step 2 for the same data used in Figure 8. Though in this example the clusters are well-separated, the eigengap is still very small (about 0.005).

We apply this strategy to more data later in Section 3, and show that it can correctly identify the number of clusters in some cases (see Figure 14). In general we do not expect this method to work well when the data has large noise or intersecting clusters, though we do not know of any other method that works in theory under our very weak separation requirements.

2.2 When Outliers are Present

So far we have only considered the case where the data is devoid of outliers. We now assume that some outliers may be included in the data, as described at the end of Section 1.2. As stated there, we label as an outlier any data point with low degree in the neighborhood graph, as suggested in [12, 37, 3]. Specifically, we compute the degrees as in Step 2 of HOSC, and then label as outliers points with degree below some threshold. Let $\omega_n \to \infty$ more slowly than any power of $n$. We propose two thresholds:

  1. (O1) Identify as outliers the points whose degree falls below a first threshold.

  2. (O2) Identify as outliers the points whose degree falls below a second threshold.

For the task of identifying outliers, only the separation between outliers and non-outliers is relevant, so we do not require any separation between the actual clusters. We first analyze the performance of (O1), which requires about the same separation between outliers and non-outliers as HOSC requires between points from different clusters in (14).
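A hedged sketch of the degree-based outlier flagging: compute degrees in the (higher-order) neighborhood graph and flag the lowest-degree points. Here the cut is set from a known or assumed outlier fraction, as in the experiments of Section 3; the theoretical rules (O1)-(O2) compare degrees to explicit thresholds instead.

```python
import numpy as np

def flag_outliers_by_degree(W, outlier_fraction):
    """Return a boolean mask marking the lowest-degree points as outliers."""
    deg = W.sum(axis=1)
    cutoff = np.quantile(deg, outlier_fraction)
    return deg <= cutoff
```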

Proposition 4.

Consider the generative model described in Section 1.2, and assume that (11)-(13) hold. In terms of separation, assume that the outliers are separated from the clusters as in (14). Then with high probability, the procedure (O1) identifies outliers without error.

We now analyze the performance of (O2), which requires a stronger separation between outliers and non-outliers, but operates under very weak sampling requirements.

Proposition 5.

Assume that is as in (11), and

(15)

In terms of separation, assume that . In addition, suppose that

(16)

Then with high probability, the procedure (O2) identifies outliers without error.

If the outliers are sampled everywhere except within the tubular regions of the underlying surfaces themselves, then both (O1) and (O2) may miss some outliers lying a short distance from one of the surfaces. Using Weyl's tube formula [55], one sees that such outliers make up only a small fraction of all outliers.

The sampling requirement (16) is weaker than the corresponding requirement for pairwise methods displayed in (8). In fact, (16) is only slightly stronger than what is required to just detect the presence of a cluster hidden in noise. We briefly explain this point. Instead of clustering, consider the task of detecting the presence of a cluster hidden among a large number of outliers. Formally, we observe the data $x_1, \dots, x_n$, and want to decide between the following two hypotheses: under the null, the points are independent, uniformly distributed in the unit hypercube $[0,1]^D$; under the alternative, there is a surface $S$ such that some of the points are sampled from $B(S, \tau)$ as described in Section 1.2, while the rest of the points are sampled from the unit hypercube, again uniformly. Assuming that the model parameters are known, it is shown in [5, 4] that the scan statistic is able to separate the null from the alternative if

(17)

We are not aware of a method that is able to solve this detection task at a substantially lower sampling rate, and (16) comes within a logarithmic factor from (17). We thus obtain the remarkable result that accurate clustering is possible within a log factor of the best (known) sampling rate that allows for accurate detection in the same setting.

2.3 When Clusters Intersect

We now consider the setting where the underlying surfaces may intersect. The additional conditions we introduce are implicit constraints on the dimension of, and the incidence angle at, the intersections. We suppose there is an integer $d_\cap$ and a finite constant $C_\cap$ such that

(18)

(The subscript $\cap$ stands for ‘intersection’.) In addition, we assume that, for some constant,

(19)

(18) is slightly stronger than requiring that the intersections have finite $d_\cap$-dimensional volume; if the surfaces are affine, the two are equivalent. The condition (19), on the other hand, is a statement about the minimum angle at which any two surfaces intersect. For example, if the surfaces are affine within a neighborhood of their intersection, then (19) is equivalent to their maximum (principal) angle being bounded from below. See Figure 11 for an illustration.

Figure 11: Illustration of intersecting surfaces. Though the human eye easily distinguishes the two clusters, the clustering task is a lot harder for machine learning algorithms. The main issue is that there are too many data points at the intersection of the two tubular regions. However, in very special cases HOSC is able to separate intersecting clusters (see Figure 12 for such an example).
Proposition 6.

Consider the setting of Theorem 1, with (5) replaced by (19). In addition, assume that (18) holds. Define

Then there is a constant such that, with probability at least , HOSC is perfectly accurate.

The most favorable case is when the intersection has the smallest possible dimension and the incidence angle is bounded away from zero. Even then, with our choice of parameters in Theorem 1, the conclusion holds only under a stringent constraint on the intersection, and partial results suggest this cannot be improved substantially. Indeed, a typical intersection between two (smooth) surfaces of the same dimension $d$ is of dimension $d - 1$, and only curves satisfy the resulting condition. Figure 12 provides a numerical example showing the algorithm successfully separating two intersecting one-dimensional clusters. Thus, even with no jitter and the surfaces intersecting at a right angle, HOSC is only able to separate intersecting clusters under exceptional circumstances. Moreover, even when the conditions of Proposition 6 are fulfilled, the probability of failure is no longer exponentially small. That said, SC does not seem able to properly deal with intersections at all (see also Figure 12); it essentially corresponds to taking $m = 2$ in HOSC, in which case the relevant quantity in Proposition 6 never tends to zero.

Figure 12: Left: data. Middle: output from HOSC. Right: output from SC. This example shows that HOSC is able to separate intersecting curvilinear clusters when the curves intersect at a right angle and there is no jitter ($\tau = 0$). In particular, the conditions of Proposition 6 are satisfied. On the contrary, SC fails in this case.

Though the implications of Proposition 6 are rather limited, we do not know of any other clustering method which provably separates intersecting clusters under a similar generative model. This is a first small step towards finding such a method.

3 Software and Numerical Experiments

We include in this section a few experiments where a preliminary implementation of HOSC outperforms SC, to demonstrate that higher-order affinities can bring a significant improvement over pairwise affinities in the context of manifold clustering.

In our implementation of HOSC, we used the heat kernel. Following the discussion in Section 1.4, at each point we restrict the computations to its nearest neighbors, so that we practically remove the locality parameter $\epsilon$ from the affinity function of (3) and obtain

(20)

where the tuples are restricted to the set of nearest neighbors of each point. For computational ease, we used the mean-square version of the flatness measure,

(21)

which can easily be computed from the bottom singular values of the (centered) points. The results we obtained still apply, with the flatness scale $\eta$ changed by at most a bounded factor, which is negligible for the standard parameter choices in our theory. In practice, we always search an interval of candidate values for the best working $\eta$, based on the smallest variance of the corresponding clusters in the eigenspace (the row space of the matrix of top eigenvectors), as suggested in [42]. When the given data contain outliers, the optimal choice of $\eta$ is instead based on the largest gap between the means of the two sets of degrees (associated with the inliers and outliers), normalized by the maximum degree. The code is available online [13].
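A hedged sketch of the $\eta$-selection rule just described for data with outliers: for each candidate $\eta$, build the affinity matrix, split the degree sequence into the presumed outlier portion (lowest degrees, using the assumed outlier fraction) and the rest, and keep the $\eta$ giving the largest normalized gap between the two mean degrees. Here `build_affinity` is an assumed callback constructing the HOSC affinity for a given $\eta$, and the candidate grid is an illustration.

```python
import numpy as np

def select_eta(X, eta_grid, outlier_fraction, build_affinity):
    """Pick the flatness scale eta giving the largest normalized degree gap."""
    best_eta, best_gap = None, -np.inf
    for eta in eta_grid:
        W = build_affinity(X, eta)                  # assumed HOSC affinity builder
        deg = np.sort(W.sum(axis=1))                # degrees, smallest first
        n_out = max(1, int(round(outlier_fraction * len(deg))))
        gap = (deg[n_out:].mean() - deg[:n_out].mean()) / max(deg.max(), 1e-12)
        if gap > best_gap:
            best_eta, best_gap = eta, gap
    return best_eta
```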

3.1 Synthetic Data

We first generate five synthetic data sets in the unit cube ($D = 2$ or $3$), shown in Figure 13. In this experiment, the actual number of clusters ($K$) and the dimension of the underlying manifolds ($d$) are assumed known to all algorithms. For HOSC, we fix the remaining parameters as described above and search an interval of candidate values for $\eta$. For SC, we considered two ways of tuning the scale parameter: directly, by choosing a value in an interval of candidate values (SC-NJW), and by the local scaling method of [56] (SC-LS), with the number of nearest neighbors capped at 15. The final choices of these parameters were also based on the same criterion as used by HOSC.

Figure 13 exhibits the clusters found by each algorithm when applied to the five data sets, respectively. Observe that HOSC succeeded in a number of difficult situations for SC, e.g., when the sampling is sparse, or when the separation is small at some locations.

Figure 13: Left column: data. (The third example shows a sphere containing an ellipsoid inside.) Middle column: best output from SC, with the scale parameter chosen both by searching an interval of candidate values and by applying local scaling [56] with at most 15 nearest neighbors. Right column: output from HOSC, with the optimal value of $\eta$ selected from an interval of candidate values. We also tried the simple kernel instead of the heat kernel and obtained the same results, except on data set 3.

We also plot the leading eigenvalues of the matrix $Z$ obtained by HOSC on each data set; see Figure 14. We see that in data sets 1, 2, and 5, the number of eigenvalues equal to 1 coincides with the true number of clusters, while in data sets 3 and 4 there is some discrepancy between the $K$th eigenvalue and the number 1. Though we do not expect the eigengap method to work well in general, Figure 14 shows that it can be useful in some cases.

Figure 14: Top eigenvalues of the matrix $Z$ obtained by HOSC on each of the five data sets in Figure 13 (in the same order).

Figure 15 displays some experiments including outliers. We simply sampled points from the unit square uniformly at random and added them as outliers to the first three data sets in Figure 13, with percentages 33.3%, 60% and 60%, respectively. We applied SC and HOSC assuming knowledge of the proportion of outliers, and labeled the points with the smallest degrees as outliers. Choosing the threshold automatically remains a challenge; in particular, we did not test the theoretical thresholds here.

Figure 15: Outlier-removal experiments. Left column: data with outliers. The percentages of outliers are 33.3%, 60% and 60%, respectively. Middle: outliers (black dots) detected by pairwise spectral clustering (both SC-NJW and SC-LS, but only the better result is shown). Right: outliers (black dots) detected by HOSC. The use of the simple kernel (instead of the heat kernel) in HOSC gives very similar results.

We observe that HOSC could successfully remove most of the true outliers, leaving out smooth structures in the data; in contrast, SC tended to keep isolated high-density regions, being insensitive to sparse smooth structures. A hundred replications of this experiment (i.e., fixing the clusters and adding randomly generated outliers) show that the True Positive Rates (i.e., percentages of correctly identified outliers) for (SC, HOSC) are (58.1% vs 67.7%), (75.4% vs 86.8%) and (76.8% vs 88.0%), respectively.

3.2 Real Data

We next compare SC and HOSC using the two-view motion data studied in [10, 46]. This data set contains 13 motion sequences: (1) boxes, (2) carsnbus3, (3) deliveryvan, (4) desk, (5) lightbulb, (6) manycars, (7) man-in-office, (8) nrbooks3, (9) office, (10) parking-lot, (11) posters-checkerboard, (12) posters-keyboard, and (13) toys-on-table. Each sequence consists of two image frames of a 3-D dynamic scene taken by a perspective camera (see Figure 16 for a few such sequences). Suppose that several feature points have been extracted from the moving objects in the two camera views of the scene. The task is to separate the trajectories of the feature points according to the different motions. This application, which lies in the field of structure from motion, is one of the fundamental problems in computer vision.

Figure 16: Three exemplary two-view motion sequences (arranged in columns): (4) desk, (6) manycars and (7) man-in-office. The true clusters are displayed in different colors and markers (the black dots are outliers).

Given a physical point and its image correspondences in the two views, one can always form a joint image sample by stacking the two image coordinates. It is shown in [46] that, under perspective camera projection, all the joint image samples corresponding to different motions live on different manifolds, some having dimension 2 and others having dimension 4. Exploratory analysis applied to these data suggests that the manifolds in this dataset mostly have dimension 2 (see Figure 17). Therefore, we apply our algorithm (HOSC) with approximation dimension $d = 2$ to these data sets in order to compare with pairwise spectral clustering (SC-NJW, SC-LS).

Figure 17: The true clusters of the three sequences in Figure 16 (in the same order), shown in the top three principal dimensions. (The outliers have been removed from the data and thus are not displayed.) These plots clearly indicate that the underlying manifolds are two-dimensional.

We use the following parameter values for the two algorithms. In HOSC, we choose $d = 2$ and restrict computations to nearest neighbors as described above, while in SC we try both searching an interval of scale values (SC-NJW) and local scaling with at most 24 nearest neighbors (SC-LS).

The original data contain some outliers. In fact, 10 of the 13 sequences are corrupted with outliers, with the largest percentage being about 32%. We first manually remove the outliers from those sequences and focus solely on the clustering aspects of the two algorithms. Next, we add the outliers back and compare the algorithms on outlier removal. (Note that we need to provide both algorithms with the true percentage of outliers in each sequence.) By doing so we hope to evaluate the clustering and outlier-removal aspects of each algorithm separately and thus in the most accurate way.

Data Clustering Errors # True Outliers Detected
seq. #samples #out. SC-NJW SC-LS HOSC SC-NJW SC-LS HOSC
1 115,121 2 0.85% 0.85% 0.85% 1 1 1
2 85,45,89 28 0% 0% 0% 24 24 24
3 62,192 0 30.3% 23.6% 30.3% N/A N/A N/A
4 50,50,55 45 0.65% 2.58% 1.29% 35 30 37
5 51,121,33 0 0% 0% 0% N/A N/A N/A
6 54,24,23,43 0 18.8% 0% 0% N/A N/A N/A
7 16,57 34 19.2% 19.2% 0% 17 12 26
8 129,168,91 32 22.9% 17.8% 22.9% 12 17 23
9 76,109,74 48 0% 0% 0% 36 28 36
10 19,117 4 0% 47.8% 0% 0 0 1
11 100,99,81 99 0% 1.79% 0% 42 39 73
12 99,99,99 99 0.34% 0.34% 0% 80 43 91
13 49,42 35 33.0% 15.4% 2.20% 7 6 21
Table 1: The misclassification rates and the numbers of true outliers detected by HOSC, SC-NJW and SC-LS. In the clustering experiment, the outlier-free data is used; the outliers are then added back so that each of these algorithms can be applied to detect them. For SC-NJW, the tuning parameter is selected from an interval of candidate values; for SC-LS, a maximum of 24 nearest neighbors is used; for HOSC, a nearest-neighbor restriction is likewise used and the flatness parameter $\eta$ is selected from an interval of candidate values.

Table 1 presents the results of the experiments above. Observe that HOSC achieved excellent clustering results in all but two sequences, with zero error on eight sequences and only one or two mistakes on each of sequences (1), (4), and (13). We remark that HOSC also outperformed the algorithms in [10, Table 1] in terms of clustering accuracy, but given the main aim of this paper, we do not include those results in Table 1. In contrast, each of SC-NJW and SC-LS failed on at least five sequences (with over 15% misclassification rates), including the two sequences that were hard for HOSC. As a specific example, we display in Figure 18 the clusters obtained by both HOSC and SC on sequence (7), demonstrating again that higher-order affinities can significantly improve over pairwise affinities in the case of manifold data. Regarding outlier removal, HOSC is also consistently at least as good as SC-NJW and SC-LS, and often better.

Figure 18: Clustering results of both HOSC and SC (left to right) on sequence (7). (The truth is shown in Figure 17, rightmost plot). In this example, HOSC correctly found the two clusters, using geometric information; in contrast, SC failed because it solely relies on pairwise distances.

4 Extensions

4.1 When the Underlying Surfaces Self-Intersect

In our generative model described in Section 1.2 we assume that the surfaces are submanifolds, implying that they do not self-intersect. This is really for convenience, as there is essentially no additional difficulty arising from self-intersections. If we allow the surfaces to self-intersect, then we bound the maximum curvature (from above) rather than the reach. We could, for example, consider surfaces parameterized by locally bi-Lipschitz maps with bounded second derivatives. A similar model is considered in [38] in the context of set estimation. Proving that each cluster is connected in the neighborhood graph is the same in this case. The only issue is in situations where a surface comes close to another surface at a location where the latter intersects itself. The geometry involved in such a situation is indeed complex. If we postulate that no such situation arises, then our results generalize immediately to this setting.

4.2 When the Underlying Surfaces Have Boundaries

When the surfaces have boundaries, points near the boundary of a surface may be substantially connected with points on a nearby surface. See Figure 19 for an illustration. This is symptomatic of the fact that the algorithm is not able to resolve intersections in general, as discussed in Section 2.3, with the notable exception of one-dimensional clusters, as illustrated in the ‘two moons’ example of Figure 13.

Figure 19: An example of a surface with a boundary coming close to another surface. This is a potentially problematic situation for HOSC as the points near the boundary of one surface and close to the other surface may be strongly connected to points from both clusters. Numerically, we show in Figure 13 such an example where HOSC is successful.

If we require a stronger separation between the boundary of a surface and the other surfaces, specifically,

(22)

with a suitably enlarged constant, no point near the boundary of a cluster is close to a point from a different cluster. (A corresponding requirement in the context of outliers would be that outliers be separated from the boundary of a cluster by a similarly enlarged distance.)

4.3 When the Data is of Mixed Dimensions

In a number of situations, the surfaces may be of different intrinsic dimensions. An important instance of this is the study of the distribution of galaxies in space, where the galaxies are seen to cluster along filaments ($d = 1$) and walls ($d = 2$) [39]. We propose a top-down approach, implementing HOSC for each dimension, starting at the largest candidate dimension and ending at the smallest (or between any known upper and lower bounds for the intrinsic dimensions).

At each step, the algorithm is run on each cluster obtained from the previous step, including the set of points identified as outliers. Indeed, when the dimension parameter of the algorithm is set larger than the dimension of the underlying surfaces, HOSC may not be able to properly separate clusters. For example, two parallel line segments satisfying the separation requirement of Theorem 1 still belong to the same plane, and HOSC with dimension parameter 2 would not be able to separate them. Another reason for processing the outlier bin is the greater disparity in the degrees of the data points in the neighborhood graph often observed with clusters of different dimensions. At each step, the number of clusters is determined automatically according to the procedure described in Section 2.1, since such information is usually not available. The scale parameters are chosen according to (15). Partial results suggest that, under some additional sampling conditions, this top-down procedure is accurate under weaker separation requirements than those required by pairwise methods, which handle the case of mixed dimensions seamlessly [3]. The key is that an actual cluster, as defined in Section 1.2, is never cut into pieces. Indeed, properties (A1) and (A4) in the proof of Theorem 1, which guarantee the connectivity and regularity (in terms of comparable degrees) of the subgraph corresponding to a cluster, are easily seen to remain valid when the dimension parameter of the algorithm is set larger than the true intrinsic dimension. (This observation might explain the success of the SCC algorithm of [12] in some mixed settings when using an upper bound on the intrinsic dimensions.)
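A hedged sketch of this top-down strategy, with `hosc_with_outliers` an assumed helper that, for a given dimension parameter, returns a list of clusters plus an outlier bin (both as lists of local indices):

```python
def top_down_hosc(X, d_max, d_min, hosc_with_outliers):
    """Run HOSC with decreasing dimension parameter, re-clustering each group
    (including the outlier bin) at every level. X: (n, D) NumPy array."""
    groups = [list(range(len(X)))]                 # start with everything in one group
    for d in range(d_max, d_min - 1, -1):
        next_groups = []
        for idx in groups:
            clusters, outlier_bin = hosc_with_outliers(X[idx], d)
            next_groups.extend([[idx[i] for i in c] for c in clusters])
            if outlier_bin:
                next_groups.append([idx[i] for i in outlier_bin])
        groups = next_groups
    return groups
```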

4.4 Clustering Based on Local Polynomial Approximations

For a constant and an integer $r \ge 2$, consider the subclass of $d$-dimensional submanifolds such that, at every point, the orthogonal projection onto the tangent space is, locally, a diffeomorphism with all partial derivatives of order up to $r$ bounded in sup-norm. For example, this subclass includes surfaces parameterized by locally bi-Lipschitz maps with their first $r$ derivatives bounded. (We could also consider surfaces of intermediate, i.e., Hölder, smoothness, a popular model in function and set estimation [18, 38].)

Given that surfaces in this smoother class are well approximated locally by polynomial surfaces, it is natural to choose an affinity based on the residual of the best $d$-dimensional polynomial approximation of degree at most $r$ to a set of points. This may be implemented via the “kernel trick” with a polynomial kernel, as done in [10] for the special case of algebraic surfaces. The main difference with the case of surfaces considered in the rest of the paper is the degree of approximation of a surface by its osculating algebraic surface of order $r$: within a small ball, the approximation error decays as a higher power of the radius when the surface is smoother.
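A hedged sketch of such a higher-order flatness measure: obtain intrinsic coordinates from the top-$d$ principal directions of a local set of points, fit the orthogonal coordinates by a polynomial of total degree at most $r$ via ordinary least squares on monomial features, and report the root-mean-square residual. This is a plain least-squares stand-in for the polynomial-kernel ("kernel trick") formulation used in [10].

```python
import itertools
import numpy as np

def poly_flatness(P, d, r):
    """RMS residual of P (m, D) from a degree-r polynomial graph over its top-d PCA directions."""
    Pc = P - P.mean(axis=0)
    _, _, Vt = np.linalg.svd(Pc, full_matrices=False)
    T, R = Pc @ Vt[:d].T, Pc @ Vt[d:].T            # tangent and normal coordinates
    # Monomial features of T up to total degree r (constant term included).
    feats = [np.ones(len(P))]
    for deg in range(1, r + 1):
        for combo in itertools.combinations_with_replacement(range(d), deg):
            feats.append(np.prod(T[:, combo], axis=1))
    F = np.column_stack(feats)
    resid = R - F @ np.linalg.lstsq(F, R, rcond=None)[0]
    return np.sqrt((resid ** 2).sum(axis=1).mean())
```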

Partial results suggest that, under similar conditions, the kernel version of HOSC with known smoothness may be able to operate under a separation of the form (9) with a more favorable exponent and, in the presence of outliers, within a logarithmic factor of the best known sampling rate achieved by any detection method [5, 4]:

(23)

Regarding the estimation of the jitter, defining the correlation dimension using the affinity introduced here allows one to estimate it accurately down to an essentially smaller level, provided the surfaces all belong to the smoother class. The arguments are parallel and we omit the details.

Thus, using the affinity defined here may allow for higher accuracy, if the surfaces are smooth enough. However, this comes with a larger computational burden and at the expense of introducing a new parameter, the smoothness degree $r$, which would need to be estimated if unknown, and we do not know a good way to do that.

4.5 Other Extensions

The setting we considered in this paper, introduced in Section 1.2, was deliberately more constrained than necessary, for clarity of exposition. We list a few generalizations below, all straightforward extensions of our work.

  • Sampling. Instead of the uniform distribution, we could use any other distribution with a density bounded away from zero and infinity, or with fast-decaying tails such as the normal distribution.

  • Kernel. The rate of decay of the kernel dictates the range of the affinity (3). One may associate with the kernel a non-decreasing sequence quantifying its effective support: it is bounded for a compactly supported kernel and grows slowly for the heat kernel. With our choice of scales, the kernel is then essentially supported on a slightly inflated range, so that points farther apart than that have essentially zero affinity. The results are identical, except that the scales appearing in the statements are inflated by the corresponding factor.

  • Measure of flatness. As pointed out in the introduction, any reasonable measure of linear approximation could be used instead. Our choice was driven by convenience and simplicity.

Appendix A Preliminaries

We gather here some preliminary results. Recall that, for , ; ; . For