This article studies the clustering problem of partitioning data points to disjoint (smooth) Riemannian submanifolds with .
1.1. Problem formulation
Let be compact and connected Riemannian manifolds of dimension . Suppose that can be embedded as a submanifold of an ambient Euclidean space equipped with the Euclidean metric (i.e., there is an immersion such that the differential is injective for all and is a homeomorphism onto ; cf. ). In our clustering setting, we work with disjoint submanifolds in and denote as their disjoint union. Each smooth submanifold is endowed with the Riemannian metric induced from , and we denote as the Borel -algebra on (i.e., the -algebra generated by the open balls in with respect to ). Let
be a sequence of independent random variables taking values in. Suppose that there exists a clustering structure (i.e., a partition on satisfying ) such that each of the data points belongs to one of the clusters: if , then
for some probability distributionsupported on . Given the observations , the task of this paper is to develop computationally tractable algorithms with strong theoretical guarantees for recovering the true clustering structure .
Classical clustering methods such as -means  and mixture models  assume that data points from each cluster are sampled in the neighborhood (with the same dimension) of a centroid, where contains only one point in
. Such methods are effective for partitioning data with ellipsoidal contours, which implicitly implies that the similarity (or affinity) criteria of centroid-based clustering methods target on some notions of “compactness”. In modern applications such as image processing and computer vision[25, 27, 11], structured data with geometric features are commonly seen as clusters without necessarily being close together and having the same dimension. Figure 1 is an illustration for such observation on a synthetic data sampled from a noisy version of three clusters with one disk and two annuli. In this example, it is visually clear to distinguish the three clusters, however the -means method fails to correctly cluster the data points. There are two main reasons for the failure of -means. First, the north pole and south pole in the outer annulus have the largest Euclidean distance among all data points, even though they belong to the same cluster. Second, the annuli and the disk live in different dimensions. In particular, the annulus is a one-dimensional circle in
that is locally isometric to the real line and the disk has dimension two. Thus these geometric concerns motivate us to seek a more natural and flexible notion of closeness for clustering analysis. In this paper, we shall focus on the clustering criterion based on theconnectedness, which is suitable for simultaneously addressing the two issues. First, connectedness is a graph property that does not rely on the physical distance: two vertices are connected if there is a path joining them. This extends the closeness from the local neighborhood to the global sense. Second, connectivity is a viable notion for clustering components of mixed dimensions, as long as all clusters live in the same ambient space where the graph connectivity weights can be computed.
In the population version, a clustering component can be viewed as a smooth submanifold, embedded in . In Riemannian geometry, a Riemannian submanifold in is said to be connected if for any , there is a parameterized regular curve joining and . Thus an appealing notion of for being a cluster is that is a compact and connected component in . In our setting, a clustering model is the union of disjoint submanifolds , and each is equipped with a probability distribution . Thus for the underlying true clustering model, there are connected graphs that do not overlap. In the sample version, data points are randomly generated from with a clustering structure . Typically, weighted graphs computed from the observed data points are fully connected (e.g., based on the Gaussian kernel). Thus a fundamental challenge of clustering analysis is to recover the true clustering structure from a noisy and fully connected weighted graph on the data.
1.2. Our contributions
In this paper, we propose a new clustering method, termed as the diffusion -means, for manifold clustering. The diffusion
-means contains two key ingredients. First, it constructs a random walk (i.e., a Markov chain) on the weighted random graph with data points
as vertices and edge weights computed from a kernel representing the similarity of data in a local neighborhood. By running the Markov chain forward in time, the local geometry (specified by the kernel bandwidth) will be integrated at multiple time scales to reveal global (topological) structures such as the connectedness properties of the graph. In the limiting case as the sample size tends to infinity and the bandwidth tends to zero, the random walk becomes a diffusion process over the the manifold. By looking at the spectral decomposition of this limiting diffusion process, the evaluations of the eigenfunctions at verticescan be viewed as a continuous embedding of the data, called diffusion map, into a higher-dimensional Euclidean space. Second, once the diffusion map is obtained, we can compute the diffusion distance  and the -means algorithm (with the Euclidean metric) can be naturally extended with the diffusion affinity as the similarity measure. Since the diffusion distance/affinity captures the connectedness among vertices on the weighted random graph, the diffusion -means aims to maximize the within-cluster connectedness, which can be recast as an assignment problem via a 0-1 integer program.
Because 0-1 integer programming problems with a non-linear objective function is generally -hard, solving the diffusion -means is computationally intractable, i.e., polynomial-time algorithms with exact solutions only exists in special cases. This motivates us to consider semidefinite programming (SDP) relaxations. We propose two versions of SDP relaxations of the diffusion -means. The first one requires the knowledge of the number of clusters, and it can be viewed as an extension from Peng and Wei’s SDP relaxation  for the -means (as well as Chen and Yang’s SDP relaxation  for the generalized -means for non-Euclidean data in an inner product space) to the manifold clustering setting with diffusion distances. Figure 1 (on the right) shows that the SDP relaxed diffusion -means can correctly identify the three clusters in the previous example. The second SDP relaxation does not require the number of clusters as an input. The idea is to drop the constraint on the trace of the clustering membership matrix (which involves number of clusters ), and to add a penalization term on the diffusion -means objective function. Thus it can be seen as an nuclear norm regularized version of the SDP for diffusion -means that is adaptive to the number of clusters. For both SDP relaxations of the diffusion -means, we show that exact recovery can be achieved when the underlying submanifolds are well separated and subsamples within each submanifold are well connected.
Since the diffusion -means and its regularized version have only one (non-adaptive) bandwidth parameter to control the local geometry, they may fail for clustering problems with unbalanced sizes, mixed dimensions, and different densities. In such situations, a random walk on the vertices sampled from regions of low density mixes slower than that from regions of high density. This motivates us to consider a variant of diffusion -means, termed as the localized diffusion -means, by using data-dependent local bandwidth. We adopt the self-tuning procedure from  where local adaptive bandwidth is estimated from the nearest neighbors and we show that the localized diffusion -means is adaptive to the local geometry and the local sampling density for the purpose of exact recovery of the true clustering structure.
To summarize, our contributions are listed as below.
We introduce the diffusion -means clustering method for manifold clustering, which integrates the nonlinear embedding via the diffusion maps and the -means clustering.
We propose two versions of the SDP relaxations of the diffusion -means: one requires to know the number of clusters, and the other one does not require such knowledge as an input (and thus it is adaptive to the unknown number of clusters).
We derive the exact recovery property of the SDP relaxed diffusion -means in terms of two hardness parameters of the clustering problem: one reflects the separation of the submanifolds, and the other one quantifies the degree of connectedness of the submanifolds.
We combine the local scaling procedures with the diffusion -means and its regularized version, and derive their adaptivity when the clustering problems have unbalanced sizes, mixed dimensions, and different densities.
1.3. Related work
There is a large collection of clustering methods and algorithms in literature, which can be broadly classified into two categories: hierarchical clustering and partition-based clustering. Hierarchical clustering recursively divides data points into groups in either a top-down or bottom-up way. Such algorithms are greedy and they often get stuck into local optimal solutions.
Partition-based clustering methods such as -means clustering 30] directly assign each data point with a group membership. Perhaps one of the most widely used clustering methods is the -means method, due to the existence of algorithms with linear sample complexity (such as Lloyd’s algorithm ). However, the -means clustering converges locally to a stationary point that depends on the initial partition. Recent theoretical studies in  show that, given a proper initialization (such as spectral clustering), Lloyd’s algorithm for optimizing the -means objective function can consistently recover the clustering structures. Exact and approximate recovery of various convex relaxations for the -means and mixture models are studied in literature [22, 16, 12, 24, 3]. To the best of our knowledge, existing theoretical guarantees developed for the convex relaxed -means clustering assumes that the clusters are sampled in a neighborhood of a centroid, which in our setup corresponds to the situation that each submanifold contains only one point. Thus results derived for -means in literature cannot be directly compared with our results. On the other hand, spectral clustering methods [25, 21] take the similarity matrix as the input and solve the clustering problem by applying
-means to top eigenvectors of the graph Laplacian matrix or its normalized versions.  study the convergence of spectral properties of random graph Laplacian matrices constructed from sample points and they establish the consistency of the spectral clustering in terms of eigenvectors. However, they do not address the problem of the exact recovery property of the clustering structure.
On the contrary, literature on theoretical guarantees for manifold clustering is rather scarce, with the exception . Near-optimal exact recovery of some emblematic clustering methods based on pairwise distances of data is derived under a condition that the minimal signal separation strength over all pairs of submanifolds is larger than a threshold. Compared with our diffusion -means with local scaling, results established in  are non-adaptive to the local density and (geometric) structures of the submanifolds (cf. Theorem 4.8 and 4.10 ahead).
For a matrix and index subsets , we use notation to denote the submatrix of with rows being selected by and columns by , and the
-dimensional vector composed of all diagonal entries of. Let and denote the and the norm of the vectorization of matrix . Let and denote the nuclear norm and the operator norm of matrix . We shall use to denote positive and finite (non-random) constants whose values may depend the submanifolds and the probability distributions supported on and whose values may vary from place to place.
The rest of the paper is organized as follows. In Section 2, we discuss some related background on diffusion distances and nonlinear embeddings in Euclidean spaces, as well as the Laplace-Beltrami operator for the heat diffusion process on Riemannian manifolds. In Section 3, we introduce the diffusion -means and its SDP relaxations. Regularized and localized diffusion -means clustering methods are also proposed in this section. In Section 4, we present our main results on the exact recovery property of the SDP relaxed diffusion -means. Simulation studies are presented in Section 5. Proofs are relegated to Section 6.
Let and be a probability distribution on . Let be i.i.d. random variables in sampled from , and the empirical distribution. In Section 2, we discuss the Euclidean embedding via the diffusion distance in the general setting. Thus in this section, we do not assume to be a disjoint union of submanifolds in and the sample does not necessarily have a clustering structure.
2.1. Euclidean embedding and diffusion distances
Let be a positive semidefinite kernel that satisfies:
positivity preserving: .
A kernel is a similarity measure between points of . A widely used example is the Gaussian kernel:
where is the bandwidth parameter that captures the local similarity of points in . Given a kernel with property (i) and (ii), we can define a reversible Markov chain on via the normalized graph Laplacian constructed as follows. Specifically, for any , let
be the degree function of the graph on . For simplicity, we assume for all . Define
which satisfies the positivity preserving property (ii) and the conservation property
Thus can be viewed as the one-step transition probability of a (stationary) Markov chain on from to . We shall write this Markov chain (i.e., random walk) as , where is called the transition kernel of . Equivalently, we can describe by the bounded linear operator defined as
Here is the class of squared integrable functions on with respect to . In literature, is often called the diffusion operator for the following reason. If we denote as the -step transition probability of the Markov chain from to in , then
form a semi-group of bounded linear operators on . Let be a stationary distribution of the Markov chain over . Then is absolutely continuous with respect to
, and the probability density functionof with respect to is given by the Radon-Nikodym derivative
Since is the stationary measure of the Markov chain with transition , we have
for all bounded measurable functions , where . Note that, since the kernel is symmetric, is reversible and satisfies the detailed balance condition:
Lemma 2.1 (Spectral decomposition of the Markov chain ).
then the following statements hold.
There exists a sequence of nonnegative eigenvaluessuch that
where is the set of associated eigenfunctions to , and forms an orthonormal basis of .
The transition probability admits the following decomposition
where and .
The diffusion operator satisfies
In addition, and .
The proof of Lemma 2.1 is given in Appendix A, and our argument is similar to Lemma 12.2 in  in the finite-dimensional setting. If is irreducible (i.e., the graph on is connected in that for all , there is some such that ), then the stationary distribution is unique. Thus if we run this Markov chain forward in time, then the local geometry (captured by the kernel which is parameterized by the bandwidth ) will be integrated to reveal global structures of at multiple (time) scales. In particular, we can define a class of diffusion distances  on by
Roughly speaking, for each and , the diffusion distance quantifies the the total number of paths with length connecting and (see Figure 2), thereby reflecting the local connectivity at the time scale .
Lemma 2.2 (Spectral representation of diffusion distances).
If the Markov chain is irreducible, then we have
for all and .
The proof of Lemma 2.2 is given in Appendix A. For an irreducible Markov chain, the spectral gap is strictly positive (i.e., for all ). Based on the spectral decomposition in Lemma 2.2 and noting that , we see that the diffusion distance can be written as
In this case, the diffusion distance decays to zero as increases, provided that and belong to a connected component of the graph on . In particular, the decay rate of the spectrum quantifies the connectivity of points in the graph on . Given a positive integer , the diffusion maps are defined as
where the -th component is the -th diffusion coordinate in . Thus we obtain an embedding of into the Euclidean space in the limiting sense that
2.2. Empirical diffusion embedding
Recall that are i.i.d. random variables in sampled from , and is the empirical distribution. Given , we can consider finite sample approximations to the underlying population level quantities . More precisely, consider a weighted graph with nodes corresponding to the elements in , where the weight between a pair of nodes is , for . Define the (rescaled) empirical degree function by
where we added an extra -factor so that is also the degree of node in the weighted graph. Let denote the -by- diagonal matrix whose -th diagonal entry is . Consider the (empirical) random walk over with transition probability
The (empirical) stationary distribution of the random walk over becomes
For any vector , let denote a weighted space over . We define the empirical diffusion distances as
where are the nonnegative eigenvalues (due to the positive semidefiniteness of the kernel ) of the transition probability operator , which can be identified with a matrix in with as its -th element, and are the associated eigen-functions on with unit norm, which can be identified with vectors in with as the -th element of for and . The empirical diffusion distance between two nodes and is also the Euclidean distance between their embeddings and via the empirical diffusion map
2.3. The Laplace-Beltrami operator on Riemannian manifolds
The Laplace-Beltrami operator on Riemannian manifolds is a generalization of the Laplace operator on Euclidean spaces. Let be an (infinitely) differentiable function with continuous derivatives on a -dimensional compact and smooth Riemannian manifold and be the gradient vector field on (i.e., is the deepest direction of ascent for at the point ). The Laplace-Beltrami operator is defined as the divergence of the gradient vector
where the div operator is relative to the volume form of . Here we adopt the convention with the minus sign of the divergence such that is a positive-definite operator. With integration-by-parts, we have for any two differentiable functions and ,
where the inner product is taken in the -dimensional tangent space of (at the point ). In a Euclidean space (i.e., ), the Laplace-Beltrami operator is the usual Laplace operator
On a general -dimensional Riemannian manifold , the Laplace-Beltrami operator in a local coordinate system
with a metric tensoris given by
where are the entries of . In the special case , is the identity matrix. Note that is a self-adjoint positive-definite compact operator, its spectrum contains a sequence of nonnegative eigenvalues . If in addition is connected, then the second smallest eigenvalue . As we will show, depends on the connectivity of the manifold (Figure 2), thus characterizing the limiting mixing time of the empirical random walk over the as and , when is sampled from the manifold .
3. Diffusion -means
Recall that in our clustering model, is a sample of independent random variables taking values in , where is the union of disjoint Riemannian submanifolds embedded in the ambient space . The clustering problem is to divide these data points into clusters, so that points in the same cluster belongs to the same connected component in , based on certain similarity measures between the points. In particular, the (classical) -means clustering method minimizes the total intra-cluster squared Euclidean distances in
over all possible partitions on , where is the cardinality of . Dropping the sum of squared norms , we see that the -means clustering is equivalent to the maximization of the total within-cluster covariances
Here, can be viewed as a similarity measure specified by the Euclidean space inner product . In general, we can replace the Euclidean inner product with any other inner product over . For manifold clustering, we replace it with the inner product induced from the empirical diffusion distance, that is,
Henceforth, we will refer to as the diffusion affinity. Interestingly, we can obtain this diffusion affinity value without explicitly conducting eigen-decomposition (spectral decomposition) to the transition probability matrix (or the symmetrized matrix ), where recall that is the degree diagonal matrix, and is the empirical kernel matrix. In fact, we may use the following relation that links the empirical diffusion affinity with entries in matrix raising to power (see Appendix B for details),
This motivates a -means clustering method via diffusion distances, referred to as the diffusion -means as
for the tuning parameter interpreted as the number of steps in the empirical random walk
. Note that here the affinity matrixis symmetric. In light of the connections between the diffusion distance and the random walk over in Section 2.2, the diffusion -means attempts to maximize the total within-cluster connectedness.
Note that, for every partition , there is a one-to-one assignment matrix such that if and if . Thus the diffusion -means clustering problem can be recast as a 0-1 integer program:
where denotes the vector of all ones and .
The diffusion -means clustering problem (5) is often computationally intractable, namely, polynomial-time algorithms with exact solutions only exist in special cases . For instances, the (classical) -means clustering is an -hard integer programming problem with a non-linear objective function . Exact and approximate recovery of various SDP relaxations for the -means [22, 16, 12, 24, 14] are studied in literature. However, it remains a challenging task to provide statistical guarantees for clustering methods that can capture non-linear features of data taking values on manifolds.
3.1. Semidefinite programming relaxations
We consider the SDP relaxations for the diffusion -means clustering. Denote the size of the -th cluster as for . Note that every partition of can be represented by a partition function via . If we change the variable in the 0-1 integer program formulation (5) of the diffusion -means, then satisfies the following properties:
For the diffusion -means , the last constraint in (6) reduces to , which does not depend on the partition function . Thus we can relax the diffusion -means clustering to the SDP problem:
where means that is positive semidefinite and means that all entries of are non-negative, and matrix . We shall use to estimate the true “membership matrix” , where
Note that is a block diagonal matrix (up to a permutation) of rank . If (i.e., ) for some Hilbert space and is the inner product between and , then (7) is the SDP for kernel -means proposed in . In particular,  consider the special case for the (Euclidean) -means, where and . Observe that the SDP relaxation (7) does not require the knowledge of the cluster sizes other than the number of clusters . Thus it can handle the general case for unequal cluster sizes.
3.2. Regularized diffusion -means
In practice, the number of clusters is rarely known. Note that the SDP problem (7) depends on only through the constraint . Therefore we propose a regularized diffusion -means estimator by dropping the constraint on the trace and penalizing as follows,
where is the regularization parameter. Recall that denotes the nuclear norm of a matrix (i.e.,
is the sum of the singular values of). Since for , it is interesting to note that (9) is the same as the nuclear norm penalized diffusion -means
Recall that the true membership matrix has a block diagonal structure with rank , the nuclear norm penalized diffusion -means (10) can be thought as an norm convex relaxation of the -hard rank minimization problem (i.e., minimizing the number of non-zero eigenvalues). Thus the parameter in the clustering problem plays a similar role as the sparsity (or low-rankness) parameter in the matrix completion context . Hence the SDP problem (10) can be viewed as a (further) convex relaxation of the infeasible SDP problem (7) when is unknown.
It remains a question to choose the value of . Larger values of will lead to solutions containing less number of clusters (with larger sizes). In particular, when matrix is positive-definite, the following Lemma 3.1 shows that the solution reduces to a rank one matrix that assigns all points into a giant cluster when is large enough, and becomes the identity matrix that assigns points into distinct clusters when is small enough. In addition, the trace of the solution is nonincreasing in .
Suppose is a positive definite matrix, and let and denote its respective largest and smallest eigenvalues.
(1) If , then , where is the matrix of all ones, is the unique solution of the SDP (9).
(2) If , then , the identity matrix, is the unique solution of the SDP (9).
(3) If and are two solutions of the SDP (9) with the regularization parameter taking values and , respectively. If , then . Furthermore, if at least one of the two SDPs has a unique solution, then .
According to the interpretation of the SDP (7), the trace of the solution can be viewed as the fitted number of clusters. Consequently, Lemma 3.1 implies that smaller values of will result in more clusters (with smaller sizes) in . In practice, we need to properly select the tuning parameter . We propose the following decision rule for this purpose. For each , we run the SDP problem (9) and extract the value of . Then we plot the solution path versus and pick the values of which spend the longest time (on the logarithmic scale) with a flat value of . Here we recommend using the logarithmic scale since values of with non-trivial solutions tends to be close to zero. Algorithm 1 below summarizes this decision rule.
Figure 3 shows the empirical result on the three clusters (one disk and two annuli) example in Section 1. According to Lemma 3.1, the estimated number of clusters, proxied by , is a non-increasing function of . In particular, the trace in the solution path in the upper left panel of Figure 3 stabilizes around and , indicating that both and are candidate values for the number of clusters. In particular, the interval of (on the logarithmic scale) corresponding to value is much larger than that to value , indicating that is more likely to be the true number of clusters (cf. the (correct) case in Figure 3). In Section 4, we will use our theory to explain this stabilization phenomenon, which partially justifies our selection rule.
In addition, as we can see from the rest three panels in Figure 3, by gradually increasing , the adaptive diffusion -means method produces a hierarchical clustering structure. Unlike the top-down or bottom-up clustering procedures which are based on certain greedy rule and can incur inconsistency, the hierarchical clustering structure produced by our approach is consistent — it does not depend on the order of partitioning or merging due to the uniqueness of the global solution from the convex optimization via the SDP. Similar observations can be drawn on another example shown in Figure 4 containing a uniform sample on three rectangles (see DGP 2 in our simulation studies Section 5 for details).
Further, it is interesting to observe in Figure 3 that the regularized diffusion -means tuned with two clusters yields a merge between the outer annulus and the disk, which gives the largest total diffusion affinity in the objective function (4) among the three possible combinations of the true clusters. Since diffusion affinity decays exponentially fast to zero in the squared Euclidean distance (for the Gaussian kernel), the diffusion affinity matrix tends to have a block diagonal structure, as weights between points belonging to different clusters are exponentially small (cf. (23) and Lemma 6.3). Thus running SDP for examples with relatively well separated clusters, such as the one in Figure 3, tends to merge two clusters with largest within-cluster diffusion affinities that is irrespective of the between-cluster Euclidean distances. This may lead to a visually less appealing merge as in the Euclidean distance case (cf. the (under) case in Figure 3). On the other hand, the regularized diffusion -means is able to produce more reasonable partition in splitting the clusters (cf. the bottom-left panel in Figure 3). In particular, if the regularization parameter is chosen such that the corresponding number of clusters in the SDP solution is greater than , then this will cause a split in one of the true clustering structures that minimizes the between-cluster diffusion affinities after the splitting. Moreover, in our simulation studies (setup DGP 3 in Section 5), we observe that the SDP relaxed regularized diffusion -means performs much better in harder cases than the spectral clustering methods when the true clusters are not well separated.
3.3. Localized diffusion -means
For clustering problems with different sizes, dimensions and densities, the diffusion -means may have limitations since only one bandwidth parameter is used to control the local geometry on the domain. More precisely, according to our theory (for example, Theorem 4.1 below), the optimal choice of the bandwidth parameter for a -dimensional submanifold as one connected component in our clustering model is , where corresponds to the sample size within this cluster and thus depends on the local cluster size or density level. Figure 5 demonstrates such an example for a mixture of three bivariate Gaussians, which consist of one larger Gaussian component with low density and two smaller Gaussian components with high density. Empirically, the diffusion -means fails on this example (even after tuning) for the reason that the larger and smaller clusters have very different local densities. This motivates us to consider a variant of diffusion -means, termed as localized diffusion -means, by using local adaptive bandwidth for each , . In particular, we adopt the self-tuning procedure from  by setting to be , where denotes the -th nearest neighbor to , and replacing with (and accordingly replacing with corresponding to the diagonal degree matrix associated with ) given by
in the SDP (7). Note that is generally no longer a positive semidefinite matrix. Intuitively for , the local scaling automatically adapts to the local density about , the cluster size and the dimension for the -th Riemannian submanifold. Specifically, for each cluster , the -by- submatrix resembles the a Gaussian kernel matrix with a homogeneous bandwidth that adapts to the local geometry in . For points and belonging to distinct clusters that are properly separated, tends to be close to zero and is less affected by the choice of andis larger for lower density regions where the degree function of is smaller so that the random walk can speed up mixing at such lower density regions. Overall, such a locally adaptive choice of bandwidth improves the mixing time of the random walk within each cluster, while leaves the between cluster jumping probabilities remaining small. As a consequence, the pairwise diffusion affinity matrix in our SDP formulation (7) tends to exhibit a clearer block form reflecting the clustering structure. To compute , we only need to specify to replace the (non-adaptive) bandwidth parameter whose value depend on the unknown cluster sizes , dimensions of submanifolds , and the underlying probability density functions on . In contrast, the simple choice guarantees that the local scaling adapts to the local density (cf. Theorem