The task of multi-manifold clustering, where the data are assumed to be located near surfaces embedded in Euclidean space, is relevant in a variety of applications. In cosmology, it arises as the extraction of galaxy clusters in the form of filaments (curves) and walls (surfaces) (Valdarnini, 2001; Martínez and Saar, 2002); in motion segmentation, moving objects tracked along different views form affine or algebraic surfaces (Ma et al., 2008; Fu et al., 2005; Vidal and Ma, 2006; Chen et al., 2009)
; this is also true in face recognition, in the context of images of faces in fixed pose under varying illumination conditions(Ho et al., 2003; Basri and Jacobs, 2003; Epstein et al., 1995).
We consider a stylized setting where the underlying surfaces are nonparametric in nature, with a particular emphasis on situations where the surfaces intersect. Specifically, we assume the surfaces are smooth, for otherwise the notion of continuation is potentially ill-posed. For example, without smoothness assumptions, an L-shaped cluster is indistinguishable from the union of two line-segments meeting at right angle.
Spectral methods (Luxburg, 2007)
are particularly suited for nonparametric settings, where the underlying clusters are usually far from convex, making standard methods like K-means irrelevant. However, a drawback of standard spectral approaches such as the well-known variation ofNg, Jordan, and Weiss (2002) is their inability to separate intersecting clusters. Indeed, consider the simplest situation where two straight clusters intersect at right angle, pictured in Figure 1 below.
The algorithm of Ng et al. (2002) is based on pairwise affinities that are decreasing in the distances between data points, making it insensitive to smoothness and, therefore, intersections. And indeed, this algorithm typically fails to separate intersecting clusters, even in the easiest setting of Figure 1.
As argued in (Agarwal et al., 2005; Agarwal et al., 2006; Shashua et al., 2006), a multiway affinity is needed to capture complex structure in data (here, smoothness) beyond proximity attributes. For example, Chen and Lerman (2009b) use a flatness affinity in the context of hybrid linear modeling, where the surfaces are assumed to be affine subspaces, and subsequently extended to algebraic surfaces via the ‘kernel trick’ (Chen, Atev, and Lerman, 2009)
. Moving beyond parametric models,Arias-Castro, Chen, and Lerman (2011) consider a localized measure of flatness; see also Elhamifar and Vidal (2011)
. Continuing this line of work, we suggest a spectral clustering method based on the estimation of the local linear structure (tangent bundle) via local principal component analysis (PCA).
The idea of using local PCA combined with spectral clustering has precedents in the literature. In particular, our method is inspired by the work of Goldberg, Zhu, Singh, Xu, and Nowak (2009)
, where the authors develop a spectral clustering method within a semi-supervised learning framework. Local PCA is also used in the multiscale, spectral-flavored algorithm ofKushnir, Galun, and Brandt (2006). This approach is in the zeitgeist. While writing this paper, we became aware of two very recent publications, by Wang, Jiang, Wu, and Zhou (2011) and by Gong, Zhao, and Medioni (2012), both proposing approaches very similar to ours. We comment on these spectral methods in more detail later on.
The basic proposition of local PCA combined with spectral clustering has two main stages. The first one forms an affinity between a pair of data points that takes into account both their Euclidean distance and a measure of discrepancy between their tangent spaces. Each tangent space is estimated by PCA in a local neighborhood around each point. The second stage applies standard spectral clustering with this affinity. As a reality check, this relatively simple algorithm succeeds at separating the straight clusters in Figure 1. We tested our algorithm in more elaborate settings, some of them described in Section 4.
Besides spectral-type approaches to multi-manifold clustering, other methods appear in the literature. The methods we know of either assume that the different surfaces do not intersect (Polito and Perona, 2001), or that the intersecting surfaces have different intrinsic dimension or density (Gionis et al., 2005; Haro et al., 2007). The few exceptions tend to propose very complex methods that promise to be challenging to analyze (Souvenir and Pless, 2005; Guo et al., 2007).
Our contribution is the design and detailed study of a prototypical spectral clustering algorithm based on local PCA, tailored to settings where the underlying clusters come from sampling in the vicinity of smooth surfaces that may intersect. We endeavored to simplify the algorithm as much as possible without sacrificing performance. We provide theoretical results for simpler variants within a standard mathematical framework for multi-manifold clustering. To our knowledge, these are the first mathematically backed successes at the task of resolving intersections in the context of multi-manifold clustering, with the exception of (Arias-Castro et al., 2011), where the corresponding algorithm is shown to succeed at identifying intersecting curves. The salient features of that algorithm are illustrated via numerical experiments.
The rest of the paper is organized as follows. In Section 2, we introduce our methods. In Section 3, we analyze our methods in a standard mathematical framework for multi-manifold learning. In Section 4, we perform some numerical experiments illustrating several features of our algorithm. In Section 5, we discuss possible extensions.
2 The methodology
We introduce our algorithm and simpler variants that are later analyzed in a mathematical framework. We start with some review of the literature, zooming in on the most closely related publications.
2.1 Some precedents
Using local PCA within a spectral clustering algorithm was implemented in four other publications we know of (Goldberg et al., 2009; Kushnir et al., 2006; Gong et al., 2012; Wang et al., 2011). As a first stage in their semi-supervised learning method, Goldberg, Zhu, Singh, Xu, and Nowak (2009) design a spectral clustering algorithm. The method starts by subsampling the data points, obtaining ‘centers’ in the following way. Draw at random from the data and remove its -nearest neighbors from the data. Then repeat with the remaining data, obtaining centers . Let denote the sample covariance in the neighborhood of made of its -nearest neighbors. An -nearest-neighbor graph is then defined on the centers in terms of the Mahalanobis distances. Explicitly, the centers and are connected in the graph if is among the nearest neighbors of in Mahalanobis distance
or vice-versa. The parameters and are both chosen of order . An existing edge between and is then weighted by , where
denotes the Hellinger distance between the probability distributionsand . The spectral graph partitioning algorithm of Ng, Jordan, and Weiss (2002) — detailed in Algorithm 1
— is then applied to the resulting affinity matrix, with some form of constrained K-means. We note thatGoldberg et al. (2009) evaluate their method in the context of semi-supervised learning where the clustering routine is only required to return subclusters of actual clusters. In particular, the data points other than the centers are discarded. Note also that their evaluation is empirical.
The algorithm proposed by Kushnir, Galun, and Brandt (2006) is multiscale and works by coarsening the neighborhood graph and computing sampling density and geometric information inferred along the way such as obtained via PCA in local neighborhoods. This bottom-up flow is then followed by a top-down pass, and the two are iterated a few times. The algorithm is too complex to be described in detail here, and probably too complex to be analyzed mathematically. The clustering methods of Goldberg et al. (2009) and ours can be seen as simpler variants that only go bottom up and coarsen the graph only once.
In the last stages of writing this paper, we learned of the works of Wang, Jiang, Wu, and Zhou (2011) and Gong, Zhao, and Medioni (2012), who propose algorithms very similar to our Algorithm 3 detailed below. Note that these publications do not provide any theoretical guarantees for their methods, which is one of our main contributions here.
2.2 Our algorithms
We now describe our method and propose several variants. Our setting is standard: we observe data points that we assume were sampled in the vicinity of smooth surfaces embedded in . The setting is formalized later in Section 3.1.
2.2.1 Connected component extraction: comparing local covariances
We start with our simplest variant, which is also the most natural. The method depends on a neighborhood radius , a spatial scale parameter and a covariance (relative) scale . For a vector , denotes its Euclidean norm, and for a (square) matrix , denotes its spectral norm. For , we denote by the set . Given a data set , for any point and , define the neighborhood
Data points ; neighborhood radius ; spatial scale , covariance scale .
1: For each , compute the sample covariance matrix of .
2: Compute the following affinities between data points:
3: Remove when there is such that and .
4: Extract the connected components of the resulting graph.
5: Points removed in Step 3 are grouped with the closest point that survived Step 3.
In summary, the algorithm first creates an unweighted graph: the nodes of this graph are the data points and edges are formed between two nodes if both the distance between these nodes and the distance between the local covariance structures at these nodes are sufficiently small. After removing the points near the intersection at Step 3, the algorithm then extracts the connected components of the graph.
In principle, the neighborhood size is chosen just large enough that performing PCA in each neighborhood yields a reliable estimate of the local covariance structure. For this, the number of points inside the neighborhood needs to be large enough, which depends on the sample size , the sampling density, intrinsic dimension of the surfaces and their surface area (Hausdorff measure), how far the points are from the surfaces (i.e., noise level), and the regularity of the surfaces. The spatial scale parameter depends on the sampling density and . It needs to be large enough that a point has plenty of points within distance , including some across an intersection, so each cluster is strongly connected. At the same time, needs to be small enough that a local linear approximation to the surfaces is a relevant feature of proximity. Its choice is rather similar to the choice of the scale parameter in standard spectral clustering (Ng et al., 2002; Zelnik-Manor and Perona, 2004). The orientation scale needs to be large enough that centers from the same cluster and within distance of each other have local covariance matrices within distance , but small enough that points from different clusters near their intersection have local covariance matrices separated by a distance substantially larger than . This depends on the curvature of the surfaces and the incidence angle at the intersection of two (or more) surfaces. Note that a typical covariance matrix over a ball of radius has norm of order , which justifies using our choice of parametrization. In the mathematical framework we introduce later on, these parameters can be chosen automatically as done in (Arias-Castro et al., 2011), at least when the points are sampled exactly on the surfaces. We will not elaborate on that since in practice this does not inform our choice of parameters.
The rationale behind Step 3 is as follows. As we just discussed, the parameters need to be tuned so that points from the same cluster and within distance have local covariance matrices within distance . Hence, and in Step 3 are necessarily from different clusters. Since they are near each other, in our model this will imply that they are close to an intersection. Therefore, roughly speaking, Step 3 removes points near an intersection.
Although this method works in simple situations like that of two intersecting segments (Figure 1), it is not meant to be practical. Indeed, extracting connected components is known to be sensitive to spurious points and therefore unstable. Furthermore, we found that comparing local covariance matrices as in affinity (3) tends to be less stable than comparing local projections as in affinity (4), which brings us to our next variant.
2.2.2 Connected component extraction: comparing local projections
We present another variant also based on extracting the connected components of a neighborhood graph that compares orthogonal projections onto the largest principal directions.
Data points ; neighborhood radius , spatial scale , projection scale .
1: For each , compute the sample covariance matrix of .
2: Compute the projection onto the eigenvectors of
with eigenvalue exceeding.
3: Compute the following affinities between data points:
4: Extract the connected components of the resulting graph.
We note that the local intrinsic dimension is determined by thresholding the eigenvalues of the local covariance matrix, keeping the directions with eigenvalues within some range of the largest eigenvalue. The same strategy is used by Kushnir et al. (2006), but with a different threshold. The method is a hard version of what we implemented, which we describe next.
2.2.3 Covariances or projections?
In our numerical experiments, we tried working both directly with covariance matrices as in (3) and with projections as in (4). Note that in our experiments we used spectral graph partitioning with soft versions of these affinities, as described in Section 2.2.4. We found working with projections to be more reliable. The problem comes, in part, from boundaries. When a surface has a boundary, local covariances over neighborhoods that overlap with the boundary are quite different from local covariances over nearby neighborhoods that do not touch the boundary. Consider the example of two segments, and , intersecting at an angle of at their middle point, specifically
Assume there is no noise and that the sampling is uniform. Assume so that the disc centered at does not intersect , and the disc centered at does not intersect . Let . For , let denote the local covariance at over a ball of radius . Simple calculations yield:
When (roughly, ), the difference in Frobenius norm between the local covariances at is larger than that at and . As for projections, however,
While in theory the boundary points account for a small portion of the sample, in practice this is not the case and we find that spectral graph partitioning is challenged by having points near the boundary that are far (in affinity) from nearby points from the same cluster. This may explain why the (soft version of) affinity (4) yields better results than the (soft version of) affinity (3) in our experiments.
2.2.4 Spectral Clustering Based on Local PCA
The following variant is more robust in practice and is the algorithm we actually implemented. The method assumes that the surfaces are of same dimension and that they are surfaces, with both parameters and known.
Data points ; neighborhood radius ; spatial scale , projection scale ; intrinsic dimension ; number of clusters .
0: Pick one point at random from the data. Pick another point among the data points not included in , and repeat the process, selecting centers .
1: For each , compute the sample covariance matrix of . Let denote the orthogonal projection onto the space spanned by the top eigenvectors of .
2: Compute the following affinities between center pairs:
3: Apply spectral graph partitioning (Algorithm 1) to .
4: The data points are clustered according to the closest center in Euclidean distance.
We note that forms an -packing of the data. The underlying rationale for this coarsening is justified in (Goldberg et al., 2009) by the fact that the covariance matrices, and also the top principal directions, change smoothly with the location of the neighborhood, so that without subsampling these characteristics would not help detect the abrupt event of an intersection. The affinity (5) is of course a soft version of (4).
2.2.5 Comparison with closely related methods
We highlight some differences with the other proposals in the literature. We first compare our approach to that of Goldberg et al. (2009), which was our main inspiration.
Neighborhoods. Comparing with Goldberg et al. (2009), we define neighborhoods over -balls instead of -nearest neighbors, and connect points over -balls instead of -nearest neighbors. This choice is for convenience, as these ways are in fact essentially equivalent when the sampling density is fairly uniform. This is elaborated at length in (Maier et al., 2009; Brito et al., 1997; Arias-Castro, 2011).
Mahalanobis distances. Goldberg et al. (2009) use Mahalanobis distances (1) between centers. In our version, we could for example replace the Euclidean distance in the affinity (3) with the average Mahalanobis distance
We actually tried this and found that the algorithm was less stable, particularly under low noise. Introducing a regularization in this distance — which requires the introduction of another parameter — solves this problem partially.
That said, using Mahalanobis distances makes the procedure less sensitive to the choice of , in that neighborhoods may include points from different clusters. Think of two parallel line segments separated by a distance of , and assume there is no noise, so the points are sampled exactly from these segments. Assuming an infinite sample size, the local covariance is the same everywhere so that points within distance are connected by the affinity (3). Hence, Algorithm 2 requires that . In terms of Mahalanobis distances, points on different segments are infinitely separated, so a version based on these distances would work with any . In the case of curved surfaces and/or noise, the situation is similar, though not as evident. Even then, the gain in performance guarantees is not obvious, since we only require that be slightly larger in order of magnitude that .
Hellinger distances. As we mentioned earlier, Goldberg et al. (2009) use Hellinger distances of the probability distributions and to compare covariance matrices, specifically
are full-rank. While using these distances or the Frobenius distances makes little difference in practice, we find it easier to work with the latter when it comes to proving theoretical guarantees. Moreover, it seems more natural to assume a uniform sampling distribution in each neighborhood rather than a normal distribution, so that using the more sophisticated similarity (7) does not seem justified.
As we mentioned above, our work was developed in parallel to that of Wang et al. (2011) and Gong et al. (2012). We highlight some differences. They do not subsample, but estimate the local tangent space at each data point . Wang et al. (2011) fit a mixture of -dimensional affine subspaces to the data using MPPCA (Tipping and Bishop, 1999), which is then used to estimate the tangent subspaces at each data point. Gong et al. (2012) develop some sort of robust local PCA. While Wang et al. (2011) assume all surfaces are of same dimension known to the user, Gong et al. (2012) estimate that locally by looking at the largest gap in the spectrum of estimated local covariance matrix. This is similar in spirit to what is done in Step 2 of Algorithm 3, but we did not include this step in Algorithm 4 because we did not find it reliable in practice. We also tried estimating the local dimensionality using the method of Little et al. (2009), but this failed in the most complex cases.
Wang et al. (2011) use a nearest-neighbor graph and their affinity is defined as
where if is among the -nearest neighbors of , or vice versa, while otherwise; are the principal (aka, canonical) angles (Stewart and Sun, 1990) between the estimated tangent subspaces at and . and are parameters of the method. Gong et al. (2012) define an affinity that incorporates the self-tuning method of Zelnik-Manor and Perona (2004); in our notation, their affinity is
where is the distance from to its -nearest neighbor. is a parameter.
Although we do not analyze their respective ways of estimating the tangent subspaces, our analysis provides essential insights into their methods, and for that matter, any other method built on spectral clustering based on tangent subspace comparisons.
3 Mathematical Analysis
While the analysis of Algorithm 4 seems within reach, there are some complications due to the fact that points near the intersection may form a cluster of their own — we were not able to discard this possibility. Instead, we study the simpler variants described in Algorithm 2 and Algorithm 3. Even then, the arguments are rather complex and interestingly involved. The theoretical guarantees that we thus obtain for these variants are stated in Theorem 1 and proved in Section 6. We comment on the analysis of Algorithm 4 right after that. We note that there are very few theoretical results on resolving intersecting clusters. In fact, we are only aware of (Chen and Lerman, 2009a) in the context of affine surfaces, (Soltanolkotabi and Candès, 2011) in the context of affine surfaces without noise and (Arias-Castro et al., 2011) in the context of curves.
The generative model we assume is a natural mathematical framework for multi-manifold learning where points are sampled in the vicinity of smooth surfaces embedded in Euclidean space. For concreteness and ease of exposition, we focus on the situation where two surfaces (i.e., ) of same dimension intersect. This special situation already contains all the geometric intricacies of separating intersecting clusters. On the one hand, clusters of different intrinsic dimension may be separated with an accurate estimation of the local intrinsic dimension without further geometry involved (Haro et al., 2007). On the other hand, more complex intersections (3-way and higher) complicate the situation without offering truly new challenges. For simplicity of exposition, we assume that the surfaces are submanifolds without boundary, though it will be clear from the analysis (and the experiments) that the method can handle surfaces with (smooth) boundaries that may self-intersect. We discuss other possible extensions in Section 5.
Within that framework, we show that Algorithm 2 and Algorithm 3 are able to identify the clusters accurately except for points near the intersection. Specifically, with high probability with respect to the sampling distribution, Algorithm 2 divides the data points into two groups such that, except for points within distance of the intersection, all points from the first cluster are in one group and all points from the second cluster are in the other group. The constant depends on the surfaces, including their curvatures, separation between them and intersection angle. The situation for Algorithm 3 is more complex, as it may return more than two clusters, but the main feature is that most of two clusters (again, away from the intersection) are in separate connected components.
3.1 Generative model
Each surface we consider is a connected, and compact submanifold without boundary and of dimension embedded in . Any such surface has a positive reach, which is what we use to quantify smoothness. The notion of reach was introduced by Federer (1959). Intuitively, a surface has reach exceeding if, and only if, one can roll a ball of radius on the surface without obstruction (Walther, 1997). Formally, for and , let
which is often called the -tubular neighborhood (or -neighborhood) of . The reach of is the supremum over such that, for each , there is a unique point in nearest . It is well-known that, for submanifolds, the reach bounds the radius of curvature from below (Federer, 1959, Lem. 4.17). For submanifolds without boundaries, the reach coincides with the condition number introduced in (Niyogi et al., 2008).
When two surfaces and intersect, meaning , we define their incidence angle as
where denote the tangent subspace of submanifold at point , and is the smallest nonzero principal (aka, canonical) angle between subspaces and (Stewart and Sun, 1990).
The clusters are generated as follows. Each data point is drawn according to
is drawn from the uniform distribution overand is an additive noise term satisfying — thus represents the noise or jitter level, and means that the points are sampled on the surfaces. We assume the points are sampled independently of each other. We let
and the goal is to recover the groups and , up to some errors.
3.2 Performance guarantees
Consider two connected, compact, twice continuously differentiable submanifolds without boundary, of same dimension, intersecting at a strictly positive angle, with the intersection set having strictly positive reach. Assume the parameters are set so that
and is large enough. Then with probability at least :
Algorithm 2 returns exactly two groups such that two points from different clusters are not grouped together unless one of them is within distance from the intersection.
Algorithm 3 returns at least two groups, and such that two points from different clusters are not grouped together unless one of them is within distance from the intersection.
We note that the constant depends on what configuration the surfaces are in, in particular their reach and intersection angle, but also aspects that are harder to quantify, like their separation away from their intersection.
We now comment on the challenge of proving a similar result for Algorithm 4. This algorithm relies on knowledge of the intrinsic dimension of the surfaces and the number of clusters (here ), but these may be estimated as in (Arias-Castro et al., 2011), at least in theory, so we assume these parameters are known. The subsampling done in Step 0 does not pose any problem whatsoever, since the centers are well-spread when the points themselves are. The difficulty resides in the application of the spectral graph partitioning, Algorithm 1. If we were to include the intersection-removal step (Step 3 of Algorithm 2) before applying spectral graph partitioning, then a simple adaptation of arguments in (Arias-Castro, 2011) would suffice. The real difficulty, and potential pitfall of the method in this framework (without the intersection-removal step), is that the points near the intersection may form their own cluster. For example, in the simplest case of two affine surfaces intersecting at a positive angle and no sampling noise, the projection matrix at a point near the intersection — meaning a point whose -ball contains a substantial piece of both surfaces — would be the projection matrix onto seen as a linear subspace. We were not able to discard this possibility, although we do not observe this happening in practice. A possible remedy is to constrain the K-means part to only return large-enough clusters. However, a proper analysis of this would require a substantial amount of additional work and we did not engage seriously in this pursuit.
4 Numerical Experiments
We tried our code§§§The code is available online at http://www.ima.umn.edu/~zhang620/. on a few artificial examples. Very few algorithms were designed to work in the general situation we consider here and we did not compare our method with any other. As we argued earlier, the methods of Wang et al. (2011) and Gong et al. (2012) are quite similar to ours, and we encourage the reader to also look at the numerical experiments they performed. Our numerical experiments should be regarded as a proof of concept, only here to show that our method can be implemented and works on some toy examples.
In all experiments, the number of clusters and the dimension of the manifolds are assumed known. We choose spatial scale and the projection scale automatically as follows: we let
Here, we implicitly assume that the union of all the underlying surfaces forms a connected set. In that case, the idea behind choosing as in (14) is that we want the -graph on the centers to be connected. Then is chosen so that a center remains connected in the -graph to most of its neighbors in the -graph.
The neighborhood radius is chosen by hand for each situation. Although we do not know how to choose automatically, there are some general ad hoc guidelines. When is too large, the local linear approximation to the underlying surfaces may not hold in neighborhoods of radius , resulting in local PCA becoming inappropriate. When is too small, there might not be enough points in a neighborhood of radius to accurately estimate the local tangent subspace to a given surface at that location, resulting in local PCA becoming inaccurate. From a computational point of view, the smaller , the larger the number of neighborhoods and the heavier the computations, particularly at the level of spectral graph partitioning. In our numerical experiments, we find that our algorithm is more sensitive to the choice of
when the clustering problem is more difficult. We note that automatic choice of tuning parameters remains a challenge in clustering, and machine learning at large, especially when no labels are available whatsoever. See(Zelnik-Manor and Perona, 2004; Zhang et al., 2012; Little et al., 2009; Kaslovsky and Meyer, 2011).
Since the algorithm is randomized (see Step 0 in Algorithm 4) we repeat each simulation times and report the median misclustering rate and number of times where the misclustering rate is smaller than , , and .
We first run Algorithm 4 on several artificial data sets, which are demonstrated in the LHS of Figures 2 and 3. Table 1 reports the local radius used for each data set ( is the global radius of each data set), and the statistics for misclustering rates. Typical clustering results are demonstrated in the RHS of Figures 2 and 3. It is evident that Algorithm 4 performs well in these simulations.
|dataset||median misclustering rate||5%||10%||15%|
|Three curves||0.02 (0.034)||4.16%||76||89||89|
|Self-intersecting curves||0.1 (0.017)||1.16%||85||85||86|
|Two spheres||0.2 (0.059)||3.98%||100||100||100|
|Mobius strips||0.1 (0.028)||2.22%||85||86||88|
|Monkey saddle||0.1 (0.069)||9.73%||0||67||97|
In another simulation, we show the dependence of the success of our algorithm on the intersecting angle between curves in Table 2 and Figure 4. Here, we fix two curves intersecting at a point, and gradually decrease the intersection angle by rotating one of them while holding the other one fixed. The angles are , , and . From the table we can see that our algorithm performs well when the angle is , but the performance deteriorates as the angle becomes smaller, and the algorithm almost always fails when the angle is .
|Intersecting angle||median misclustering rate||5%||10%||15%|
We distilled the ideas of Goldberg et al. (2009) and of Kushnir et al. (2006) to cluster points sampled near smooth surfaces. The key ingredient is the use of local PCA to learn about the local spread and orientation of the data, so as to use that information in an affinity when building a neighborhood graph.
In a typical stylized setting for multi-manifold clustering, we established performance bounds for the simple variants described in Algorithm 2 and Algorithm 3, which essentially consist of connecting points that are close in space and orientation, and then extracting the connected components of the resulting graph. Both are shown to resolve general intersections as long as the incidence angle is strictly positive and the parameters are carefully chosen. As is commonly the case in such analyzes, our setting can be generalized to other sampling schemes, to multiple intersections, to some features of the surfaces changing with the sample size, and so on, in the spirit of (Arias-Castro et al., 2011; Arias-Castro, 2011; Chen and Lerman, 2009a). We chose to simplify the setup as much as possible while retaining the essential features that makes resolving intersecting clusters challenging. The resulting arguments are nevertheless rich enough to satisfy the mathematically thirsty reader.
We implemented a spectral version of Algorithm 3, described in Algorithm 4, that assumes the intrinsic dimensionality and the number of clusters are known. The resulting approach is very similar to what is offered by Wang et al. (2011) and Gong et al. (2012), although it was developed independently of these works. Algorithm 4 is shown to perform well in some simulated experiments, although it is somewhat sensitive to the choice of parameters. This is the case of all other methods for multi-manifold clustering we know of and choosing the parameters automatically remains an open challenge in the field.
We start with some additional notation. The ambient space is unless noted otherwise. For a vector , denotes its Euclidean norm and for a real matrix , denotes the corresponding operator norm. For a point and , denotes the open ball of center and radius , i.e., . For a set and a point , define . For two points in the same Euclidean space, denotes the vector moving to . For a point and a vector in the same Euclidean space, denotes the translate of by . We identify an affine subspace with its corresponding linear subspace, for example, when saying that a vector belongs to .
For two subspaces and , of possibly different dimensions, we denote by the largest and by the smallest nonzero principal angle between and (Stewart and Sun, 1990). When is a vector and is a subspace, this is the usual definition of the angle between and .
For a subset and positive integer , denotes the -dimensional Hausdorff measure of , and is defined as , where is the Hausdorff dimension of . For a Borel set , let denote the uniform distribution on .
For a set with reach at least , and with , let denote the metric projection of onto , that is, the point on closest to . Note that, if is an affine subspace, then is the usual orthogonal projection onto . Let denote the class of connected, and compact -dimensional submanifolds without boundary embedded in , with reach at least . For a submanifold , let denote the tangent space of at .
We will often identify a linear map with its matrix in the canonical basis. For a symmetric (real) matrix , let denote its eigenvalues in decreasing order.
We say that is -Lipschitz if
For two reals and , and . Additional notation will be introduced as needed.
This section gathers a number of general results from geometry and probability. We took time to package them into standalone lemmas that could be of potential independent interest, particularly to researchers working in machine learning and computational geometry.
6.1.1 Smooth surfaces and their tangent subspaces
The following result is on approximating a smooth surface near a point by the tangent subspace at that point. It is based on (Federer, 1959, Th. 4.18(2)).
For , and any two points ,
and when ,
Moreover, for such that ,
We need a bound on the angle between tangent subspaces on a smooth surface as a function of the distance between the corresponding points of contact. This could be deduced directly from (Niyogi et al., 2008, Prop. 6.2, 6.3), but the resulting bound is much looser — and the underlying proof much more complicated — than the following, which is again based on (Federer, 1959, Th. 4.18(2)).
For , and any ,
By (19) applied twice, we have
for any vector and any linear subspace , we get
Noting that the LHS never exceeds 1, and applying the arcsine function — which is increasing — on both sides, yields
We then use the triangle inequality
and conclude. ∎
Below we state some properties of a projection onto a tangent subspace. A result similar to the first part was proved in (Arias-Castro et al., 2011, Lem. 2) based on results in (Niyogi et al., 2008), but the arguments are simpler here and the constants are sharper.
Take , and , and let be short for . is injective on and its image contains , where . Moreover, has Lipschitz constant bounded by over , for any .
and by (20),
Now, by the triangle inequality,
When , the RHS is bounded from below by , which then implies that , that is, . This precludes the situation where , so that is injective on when .
The same arguments imply that is an open map on . In particular, contains an open ball in centered at and , with since . Now take any ray out of within , which is necessarily of the form , where is a unit vector in . Let for . Let be the infimum over all such that . Note that and , so that there is such that . Let , which is well-defined on by definition of and the fact that is injective on . We have that is the unique vector in such that . Elementary geometry shows that
by (20), and when . Since , we have and this holds for all . So we can extend to into a Lipschitz function with constant . Together with the fact that , this implies that
Hence, and therefore contains as stated.
For the last part, fix , so there is a unique such that , where is redefined as . Take and let and . We saw that is Lipschitz with constant on any ray from of length , so that . The differential of at is itself, seen as a linear map between and . Then for any vector , we have
as before. Hence, , and we proved this for all . Since that set is convex, we can apply Taylor’s theorem and get that is Lipschitz on that set with constant . We then have
because and . ∎
6.1.2 Volumes and uniform distributions
Below is a result that quantifies how much the volume of a set changes when applying a Lipschitz map. This is well-known in measure theory and we only provide a proof for completeness.
Suppose is a measurable subset of and is -Lipschitz. Then for any measurable set and real , .
where is the class of countable sequences of subsets of such that and for all . Since is -Lipschitz, for any . Hence, for any , . This implies that
Taking the infimum over , we get , and we conclude by taking the limit as , noticing that . ∎
We compare below two uniform distributions. For two Borel probability measures and on , denotes their total variation distance, meaning,
Remember that for a Borel set , denotes the uniform distribution on .
Suppose and are two Borel subsets of . Then
If and are not of same dimension, say , then since while . And we also have