# Diffusion K-means clustering on manifolds: provable exact recovery via semidefinite relaxations

We introduce the diffusion K-means clustering method on Riemannian submanifolds, which maximizes the within-cluster connectedness based on the diffusion distance. The diffusion K-means constructs a random walk on the similarity graph with vertices as data points randomly sampled on the manifolds and edges as similarities given by a kernel that captures the local geometry of the manifolds. Thus the diffusion K-means is a multi-scale clustering tool suitable for data with non-linear and non-Euclidean geometric features in mixed dimensions. Given the number of clusters, we propose a polynomial-time convex relaxation algorithm via semidefinite programming (SDP) to solve the diffusion K-means. In addition, we propose a nuclear norm (i.e., trace norm) regularized SDP that is adaptive to the number of clusters. In both cases, we show that exact recovery of the SDPs for diffusion K-means can be achieved under suitable between-cluster separability and within-cluster connectedness of the submanifolds, which together quantify the hardness of the manifold clustering problem. We further propose the localized diffusion K-means, which uses a local adaptive bandwidth estimated from the nearest neighbors. We show that exact recovery of the localized diffusion K-means is fully adaptive to the local probability density and geometric structures of the underlying submanifolds.


## 1. Introduction

This article studies the clustering problem of partitioning data points lying on $K$ disjoint (smooth) Riemannian submanifolds, with $K \geq 2$.

### 1.1. Problem formulation

Let $\mathcal{M}_1,\dots,\mathcal{M}_K$ be compact and connected Riemannian manifolds of dimensions $d_1,\dots,d_K$. Suppose that each $\mathcal{M}_k$ can be embedded as a submanifold of an ambient Euclidean space $\mathbb{R}^p$ equipped with the Euclidean metric (i.e., there is an immersion $\iota_k:\mathcal{M}_k\to\mathbb{R}^p$ such that the differential $d\iota_k$ is injective at every point and $\iota_k$ is a homeomorphism onto its image; cf. [10]). In our clustering setting, we work with $K$ disjoint submanifolds in $\mathbb{R}^p$ and denote $\mathcal{M}=\cup_{k=1}^K\mathcal{M}_k$ as their disjoint union. Each smooth submanifold $\mathcal{M}_k$ is endowed with the Riemannian metric induced from $\mathbb{R}^p$, and we denote $\mathcal{B}(\mathcal{M})$ as the Borel $\sigma$-algebra on $\mathcal{M}$ (i.e., the $\sigma$-algebra generated by the open balls in $\mathcal{M}$). Let $X_1,\dots,X_n$ be a sequence of independent random variables taking values in $\mathcal{M}$. Suppose that there exists a clustering structure $G^*=\{G_1^*,\dots,G_K^*\}$ (i.e., a partition on $[n]=\{1,\dots,n\}$ satisfying $\cup_{k=1}^K G_k^*=[n]$) such that each of the $n$ data points belongs to one of the $K$ clusters: if $i\in G_k^*$, then $X_i\sim\mu_k$ for some probability distribution $\mu_k$ supported on $\mathcal{M}_k$. Given the observations $X_1,\dots,X_n$, the task of this paper is to develop computationally tractable algorithms with strong theoretical guarantees for recovering the true clustering structure $G^*$.

Classical clustering methods such as $K$-means [19] and mixture models [13] assume that data points from each cluster are sampled in a neighborhood (of the same dimension) of a centroid, which in our setup corresponds to each submanifold containing only one point. Such methods are effective for partitioning data with ellipsoidal contours, which implicitly means that the similarity (or affinity) criteria of centroid-based clustering methods target some notion of "compactness". In modern applications such as image processing and computer vision [25, 27, 11], structured data with geometric features commonly appear as clusters that are neither close together nor of the same dimension. Figure 1 illustrates this on synthetic data sampled from a noisy version of three clusters: one disk and two annuli. In this example it is visually easy to distinguish the three clusters, yet the $K$-means method fails to cluster the data points correctly. There are two main reasons for this failure. First, the north and south poles of the outer annulus have the largest Euclidean distance among all data points, even though they belong to the same cluster. Second, the annuli and the disk live in different dimensions: each annulus is a one-dimensional circle that is locally isometric to the real line, while the disk has dimension two. These geometric concerns motivate us to seek a more natural and flexible notion of closeness for clustering analysis.

In this paper, we focus on a clustering criterion based on connectedness, which addresses both issues simultaneously. First, connectedness is a graph property that does not rely on physical distance: two vertices are connected if there is a path joining them. This extends closeness from the local neighborhood to the global sense. Second, connectivity is a viable notion for clustering components of mixed dimensions, as long as all clusters live in the same ambient space where the graph connectivity weights can be computed.

In the population version, a clustering component can be viewed as a smooth submanifold embedded in $\mathbb{R}^p$. In Riemannian geometry, a Riemannian submanifold $\mathcal{M}_k$ of $\mathbb{R}^p$ is said to be connected if for any $x,y\in\mathcal{M}_k$ there is a parameterized regular curve in $\mathcal{M}_k$ joining $x$ and $y$. Thus an appealing notion of $\mathcal{M}_k$ being a cluster is that $\mathcal{M}_k$ is a compact and connected component of $\mathcal{M}$. In our setting, a clustering model is the union of $K$ disjoint submanifolds $\mathcal{M}_1,\dots,\mathcal{M}_K$, and each $\mathcal{M}_k$ is equipped with a probability distribution $\mu_k$. Thus for the underlying true clustering model, there are $K$ connected graphs that do not overlap. In the sample version, data points are randomly generated from $\mathcal{M}$ with a clustering structure $G^*$. Typically, weighted graphs computed from the observed data points are fully connected (e.g., based on the Gaussian kernel). Thus a fundamental challenge of clustering analysis is to recover the true clustering structure from a noisy and fully connected weighted graph on the data.

### 1.2. Our contributions

In this paper, we propose a new clustering method, termed the diffusion $K$-means, for manifold clustering. The diffusion $K$-means contains two key ingredients. First, it constructs a random walk (i.e., a Markov chain) on the weighted random graph with the data points $X_1,\dots,X_n$ as vertices and edge weights computed from a kernel representing the similarity of data in a local neighborhood. By running the Markov chain forward in time, the local geometry (specified by the kernel bandwidth) is integrated at multiple time scales to reveal global (topological) structures such as the connectedness properties of the graph. In the limiting case where the sample size tends to infinity and the bandwidth tends to zero, the random walk becomes a diffusion process over the manifold. Looking at the spectral decomposition of this limiting diffusion process, the evaluations of the eigenfunctions at the vertices can be viewed as a continuous embedding of the data, called the diffusion map, into a higher-dimensional Euclidean space. Second, once the diffusion map is obtained, we can compute the diffusion distance [9], and the $K$-means algorithm (with the Euclidean metric) extends naturally with the diffusion affinity as the similarity measure. Since the diffusion distance/affinity captures the connectedness among vertices on the weighted random graph, the diffusion $K$-means aims to maximize the within-cluster connectedness, which can be recast as an assignment problem via a 0-1 integer program.

Because 0-1 integer programs with a non-linear objective function are generally NP-hard, solving the diffusion $K$-means exactly is computationally intractable: polynomial-time algorithms with exact solutions exist only in special cases. This motivates us to consider semidefinite programming (SDP) relaxations, and we propose two versions. The first requires knowledge of the number of clusters, and it can be viewed as an extension of Peng and Wei's SDP relaxation [22] for the $K$-means (as well as Chen and Yang's SDP relaxation [7] for the generalized $K$-means for non-Euclidean data in an inner product space) to the manifold clustering setting with diffusion distances. Figure 1 (on the right) shows that the SDP relaxed diffusion $K$-means correctly identifies the three clusters in the previous example. The second SDP relaxation does not require the number of clusters as an input. The idea is to drop the constraint on the trace of the clustering membership matrix (which involves the number of clusters $K$) and to add a penalization term to the diffusion $K$-means objective function. Thus it can be seen as a nuclear norm regularized version of the SDP for diffusion $K$-means that is adaptive to the number of clusters. For both SDP relaxations, we show that exact recovery can be achieved when the underlying submanifolds are well separated and samples within each submanifold are well connected.

Since the diffusion $K$-means and its regularized version have only one (non-adaptive) bandwidth parameter to control the local geometry, they may fail for clustering problems with unbalanced sizes, mixed dimensions, and different densities. In such situations, a random walk on vertices sampled from regions of low density mixes more slowly than one on regions of high density. This motivates us to consider a variant of diffusion $K$-means, termed the localized diffusion $K$-means, which uses a data-dependent local bandwidth. We adopt the self-tuning procedure from [33], where the local adaptive bandwidth is estimated from the nearest neighbors, and we show that the localized diffusion $K$-means is adaptive to the local geometry and the local sampling density for the purpose of exactly recovering the true clustering structure.

To summarize, our contributions are as follows.

1. We introduce the diffusion $K$-means clustering method for manifold clustering, which integrates the nonlinear embedding via diffusion maps with $K$-means clustering.

2. We propose two SDP relaxations of the diffusion $K$-means: one requires the number of clusters as an input, while the other does not (and is thus adaptive to the unknown number of clusters).

3. We derive the exact recovery property of the SDP relaxed diffusion $K$-means in terms of two hardness parameters of the clustering problem: one reflects the separation between the submanifolds, and the other quantifies the degree of connectedness within each submanifold.

4. We combine local scaling procedures with the diffusion $K$-means and its regularized version, and derive their adaptivity when the clustering problems have unbalanced sizes, mixed dimensions, and different densities.

### 1.3. Related work

There is a large collection of clustering methods and algorithms in the literature, which can be broadly classified into two categories: hierarchical clustering and partition-based clustering. Hierarchical clustering recursively divides data points into groups in either a top-down or bottom-up fashion. Such algorithms are greedy and often get stuck in locally optimal solutions.

Partition-based clustering methods such as $K$-means clustering [19, 30] directly assign each data point a group membership. Perhaps the most widely used clustering method is $K$-means, due to the existence of algorithms with linear sample complexity (such as Lloyd's algorithm [17]). However, $K$-means clustering converges locally to a stationary point that depends on the initial partition. Recent theoretical studies [18] show that, given a proper initialization (such as spectral clustering), Lloyd's algorithm for optimizing the $K$-means objective function can consistently recover the clustering structures. Exact and approximate recovery of various convex relaxations for the $K$-means and mixture models has been studied in the literature [22, 16, 12, 24, 3]. To the best of our knowledge, existing theoretical guarantees for convex relaxed $K$-means clustering assume that the clusters are sampled in a neighborhood of a centroid, which in our setup corresponds to each submanifold containing only one point. Thus results derived for $K$-means in the literature cannot be directly compared with ours. On the other hand, spectral clustering methods [25, 21] take the similarity matrix as input and solve the clustering problem by applying $K$-means to the top eigenvectors of the graph Laplacian matrix or its normalized versions [8]. [31] study the convergence of spectral properties of random graph Laplacian matrices constructed from sample points and establish the consistency of spectral clustering in terms of eigenvectors. However, they do not address the exact recovery of the clustering structure.

In contrast, the literature on theoretical guarantees for manifold clustering is rather scarce, with the exception of [1], where near-optimal exact recovery of some emblematic clustering methods based on pairwise distances is derived under the condition that the minimal separation strength over all pairs of submanifolds exceeds a threshold. Compared with our diffusion $K$-means with local scaling, the results established in [1] are non-adaptive to the local density and (geometric) structures of the submanifolds (cf. Theorems 4.8 and 4.10 ahead).

### 1.4. Notation

For a matrix $A$ and index subsets $I,J$, we use $A_{I,J}$ to denote the submatrix of $A$ with rows selected by $I$ and columns selected by $J$, and $\mathrm{diag}(A)$ to denote the vector composed of all diagonal entries of $A$. Let $\|A\|_1$ and $\|A\|_\infty$ denote the $\ell_1$ and $\ell_\infty$ norms of the vectorization of the matrix $A$. Let $\|A\|_*$ and $\|A\|_{\mathrm{op}}$ denote the nuclear norm and the operator norm of $A$. We shall use $C, C', c, \dots$ to denote positive and finite (non-random) constants whose values may depend on the submanifolds and the probability distributions supported on them, and whose values may vary from place to place.

The rest of the paper is organized as follows. In Section 2, we discuss some related background on diffusion distances and nonlinear embeddings in Euclidean spaces, as well as the Laplace-Beltrami operator for the heat diffusion process on Riemannian manifolds. In Section 3, we introduce the diffusion -means and its SDP relaxations. Regularized and localized diffusion -means clustering methods are also proposed in this section. In Section 4, we present our main results on the exact recovery property of the SDP relaxed diffusion -means. Simulation studies are presented in Section 5. Proofs are relegated to Section 6.

## 2. Preliminaries

Let $S\subset\mathbb{R}^p$ and let $\mu$ be a probability distribution on $S$. Let $X_1,\dots,X_n$ be i.i.d. random variables in $S$ sampled from $\mu$, and let $\mu_n=n^{-1}\sum_{i=1}^n\delta_{X_i}$ be the empirical distribution. In this section, we discuss the Euclidean embedding via the diffusion distance in the general setting: we do not assume $S$ to be a disjoint union of submanifolds in $\mathbb{R}^p$, and the sample does not necessarily have a clustering structure.

### 2.1. Euclidean embedding and diffusion distances

Let $\kappa:S\times S\to\mathbb{R}$ be a positive semidefinite kernel that satisfies:

1. symmetry: $\kappa(x,y)=\kappa(y,x)$ for all $x,y\in S$;

2. positivity preserving: $\kappa(x,y)\geq 0$ for all $x,y\in S$.

A kernel is a similarity measure between points of $S$. A widely used example is the Gaussian kernel:

$$\kappa(x,y)=\exp\left(-\frac{\|x-y\|^2}{2h^2}\right), \tag{1}$$

where $h>0$ is the bandwidth parameter that captures the local similarity of points in $S$. Given a kernel with properties (i) and (ii), we can define a reversible Markov chain on $S$ via the normalized graph Laplacian constructed as follows. Specifically, for any $x\in S$, let

$$d(x)=\int_S \kappa(x,y)\,d\mu(y)$$

be the degree function of the graph on $S$. For simplicity, we assume $d(x)>0$ for all $x\in S$. Define

$$p(x,y)=\frac{\kappa(x,y)}{d(x)},$$

which satisfies the positivity preserving property (ii) and the conservation property

$$\int_S p(x,y)\,d\mu(y)=1.$$

Thus $p(x,y)$ can be viewed as the one-step transition probability of a (stationary) Markov chain on $S$ from $x$ to $y$. We shall write this Markov chain (i.e., random walk) as $W$, where $p$ is called the transition kernel of $W$. Equivalently, we can describe $W$ by the bounded linear operator $P:L^2(\mu)\to L^2(\mu)$ defined as

$$Pf(x)=\int_S p(x,y)f(y)\,d\mu(y).$$

Here $L^2(\mu)$ is the class of square integrable functions on $S$ with respect to $\mu$. In the literature, $P$ is often called the diffusion operator, for the following reason. If we denote by $p_t(x,y)$ the $t$-step transition probability of the Markov chain from $x$ to $y$, then

$$P^t f(x)=\int_S p_t(x,y)f(y)\,d\mu(y),\qquad t=1,2,\dots,$$

form a semi-group of bounded linear operators on $L^2(\mu)$. Let $\Pi$ be a stationary distribution of the Markov chain over $S$. Then $\Pi$ is absolutely continuous with respect to $\mu$, and the probability density function $\pi$ of $\Pi$ with respect to $\mu$ is given by the Radon-Nikodym derivative

$$\pi(x)=\frac{d\Pi}{d\mu}(x)=\frac{d(x)}{\int_S d(y)\,d\mu(y)}.$$

Since $\Pi$ is the stationary measure of the Markov chain with transition kernel $p$, we have

$$\Pi(P^t f)=\Pi(f)$$

for all bounded measurable functions $f$, where $\Pi(f)=\int_S f\,d\Pi$. Note that, since the kernel $\kappa$ is symmetric, $W$ is reversible and satisfies the detailed balance condition:

$$\pi(x)p(x,y)=\pi(y)p(y,x),\qquad\forall x,y\in S.$$
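As a quick sanity check of these properties in the discrete setting (anticipating the empirical construction of Section 2.2), the following sketch builds the Gaussian kernel, degree function, and transition matrix on a small synthetic point set; the data and bandwidth are illustrative choices, not from the paper:

```python
import numpy as np

# Discrete analogue of the kernel/degree/transition construction above:
# sample points play the role of S, the empirical measure replaces mu.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))          # 6 sample points in R^2
h = 1.0                              # bandwidth parameter

# Gaussian kernel matrix: kappa(x, y) = exp(-||x - y||^2 / (2 h^2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq_dists / (2 * h ** 2))

d = K.sum(axis=1)                    # degree function d(x)
P = K / d[:, None]                   # transition probabilities p(x, y)

# Conservation property: each row of P sums to one.
print(np.allclose(P.sum(axis=1), 1.0))   # True

# Detailed balance: pi(x) p(x, y) = pi(y) p(y, x), with pi proportional to d.
pi = d / d.sum()
print(np.allclose(pi[:, None] * P, (pi[:, None] * P).T))  # True
```

Detailed balance holds here because $\pi(x)p(x,y)\propto\kappa(x,y)$, which is symmetric by construction.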
###### Lemma 2.1 (Spectral decomposition of the Markov chain W).

Let

$$R(x,y)=\frac{\kappa(x,y)}{\sqrt{\pi(x)}\sqrt{\pi(y)}},\qquad\forall x,y\in S.$$

If

$$\int_S\int_S R(x,y)^2\,d\mu(x)\,d\mu(y)<\infty, \tag{2}$$

then the following statements hold.

1. There exists a sequence of nonnegative eigenvalues $\{\lambda_j\}_{j\geq 0}$ such that

$$R(x,y)=\sum_{j=0}^{\infty}\lambda_j\phi_j(x)\phi_j(y),$$

where $\{\phi_j\}_{j\geq 0}$ is the set of eigenfunctions associated with $\{\lambda_j\}_{j\geq 0}$, and $\{\phi_j\}_{j\geq 0}$ forms an orthonormal basis of $L^2(\mu)$.

2. The transition probability admits the following decomposition

$$p(x,y)=\sum_{j=0}^{\infty}\lambda_j\psi_j(x)\varphi_j(y),$$

where $\psi_j=\phi_j/\sqrt{\pi}$ and $\varphi_j=\phi_j\sqrt{\pi}$.

3. The diffusion operator $P$ satisfies

$$P\psi_j=\lambda_j\psi_j,\qquad j=0,1,\dots.$$

The proof of Lemma 2.1 is given in Appendix A; our argument is similar to that of Lemma 12.2 in [15] in the finite-dimensional setting. If $W$ is irreducible (i.e., the graph on $S$ is connected in the sense that for all $x,y\in S$ there is some $t$ such that $p_t(x,y)>0$), then the stationary distribution is unique. Thus if we run this Markov chain forward in time, the local geometry (captured by the kernel $\kappa$, which is parameterized by the bandwidth $h$) is integrated to reveal global structures of $S$ at multiple (time) scales. In particular, we can define a class of diffusion distances [9] on $S$ by

$$D_t(x,y):=\big\|p_t(x,\cdot)-p_t(y,\cdot)\big\|_{L^2(d\mu/\pi)}=\left\{\int_S\big[p_t(x,z)-p_t(y,z)\big]^2\,\frac{d\mu(z)}{\pi(z)}\right\}^{1/2}.$$

Roughly speaking, for each $x,y\in S$ and $t\in\mathbb{N}_+$, the diffusion distance $D_t(x,y)$ quantifies the total number of paths of length $t$ connecting $x$ and $y$ (see Figure 2), thereby reflecting the local connectivity at the time scale $t$.

###### Lemma 2.2 (Spectral representation of diffusion distances).

If the Markov chain $W$ is irreducible, then we have

$$D_t^2(x,y)=\sum_{j=0}^{\infty}\lambda_j^{2t}\big[\psi_j(x)-\psi_j(y)\big]^2$$

for all $x,y\in S$ and $t\in\mathbb{N}_+$.

The proof of Lemma 2.2 is given in Appendix A. For an irreducible Markov chain, the spectral gap is strictly positive (i.e., $1=\lambda_0>\lambda_j$ for all $j\geq 1$). Based on the spectral decomposition in Lemma 2.2 and noting that $\psi_0$ is constant, we see that the diffusion distance can be written as

$$D_t(x,y)=\left\{\sum_{j=1}^{\infty}\lambda_j^{2t}\big[\psi_j(x)-\psi_j(y)\big]^2\right\}^{1/2}. \tag{3}$$

In this case, the diffusion distance decays to zero as $t$ increases, provided that $x$ and $y$ belong to a connected component of the graph on $S$. In particular, the decay rate of the spectrum $\{\lambda_j\}_{j\geq 1}$ quantifies the connectivity of points in the graph on $S$. Given a positive integer $q$, the diffusion maps are defined as

$$\Psi_t^{(q)}(x)=\big(\lambda_1^t\psi_1(x),\dots,\lambda_q^t\psi_q(x)\big)^T,$$

where the $j$-th component $\lambda_j^t\psi_j(x)$ is the $j$-th diffusion coordinate of $x$ in $\mathbb{R}^q$. Thus we obtain an embedding of $S$ into Euclidean space in the limiting sense that

$$D_t(x,y)=\lim_{q\to\infty}\big\|\Psi_t^{(q)}(x)-\Psi_t^{(q)}(y)\big\|_2.$$

### 2.2. Empirical diffusion embedding

Recall that $X_1,\dots,X_n$ are i.i.d. random variables in $S$ sampled from $\mu$, and $\mu_n$ is the empirical distribution. Given $S_n=\{X_1,\dots,X_n\}$, we can consider finite sample approximations of the underlying population level quantities. More precisely, consider a weighted graph with $n$ nodes corresponding to the elements of $S_n$, where the weight between a pair of nodes $X_i$ and $X_j$ is $\kappa(X_i,X_j)$, for $i,j=1,\dots,n$. Define the (rescaled) empirical degree function by

$$d_n(x)=n\int_{S_n}\kappa(x,y)\,d\mu_n(y)=\sum_{i=1}^{n}\kappa(x,X_i),\qquad\forall x\in S_n,$$

where we added the extra $n$-factor so that $d_n(X_i)$ is also the degree of node $X_i$ in the weighted graph. Let $D_n$ denote the $n$-by-$n$ diagonal matrix whose $i$-th diagonal entry is $d_n(X_i)$. Consider the (empirical) random walk over $S_n$ with transition probability

$$P_n(x,y)=\frac{\kappa(x,y)}{d_n(x)},\qquad\forall x,y\in S_n.$$

The (empirical) stationary distribution of the random walk over $S_n$ becomes

$$\pi_n(x)=\frac{d_n(x)}{\sum_{i=1}^{n}d_n(X_i)},\qquad\forall x\in S_n.$$

For any vector $w$ with positive entries, let $L^2(\mathrm{diag}(w))$ denote the $w$-weighted $\ell^2$ space over $S_n$. We define the empirical diffusion distances as

$$D_{n,t}(x,y)=\big\|P_n^t(x,\cdot)-P_n^t(y,\cdot)\big\|_{L^2(\mathrm{diag}(D_n^{-1}))}=\left\{\sum_{i=1}^{n}\big[P_n^t(x,X_i)-P_n^t(y,X_i)\big]^2\,\frac{1}{d_n(X_i)}\right\}^{1/2},$$

for all $x,y\in S_n$ and $t\in\mathbb{N}_+$. Roughly speaking, $D_{n,t}$ provides an empirical estimate of $D_t$. Similar to the spectral representation (3) for $D_t$, we also have the following spectral representation of $D_{n,t}$ (see Appendix B),

$$D_{n,t}(x,y)=\left\{\sum_{j=0}^{n-1}\lambda_{n,j}^{2t}\big[\psi_{n,j}(x)-\psi_{n,j}(y)\big]^2\right\}^{1/2},\qquad\forall t\in\mathbb{N}_+\text{ and }x,y\in S_n,$$

where $\lambda_{n,0}\geq\lambda_{n,1}\geq\cdots\geq\lambda_{n,n-1}\geq 0$ are the nonnegative eigenvalues (due to the positive semidefiniteness of the kernel $\kappa$) of the transition probability operator $P_n$, which can be identified with a matrix in $\mathbb{R}^{n\times n}$ with $P_n(X_i,X_j)$ as its $(i,j)$-th element, and $\psi_{n,j}$ are the associated eigenfunctions on $S_n$ with unit norm, which can be identified with vectors in $\mathbb{R}^n$ with $\psi_{n,j}(X_i)$ as the $i$-th element, for $i=1,\dots,n$ and $j=0,\dots,n-1$. The empirical diffusion distance between two nodes $x$ and $y$ is also the Euclidean distance between their embeddings $\Psi_{n,t}(x)$ and $\Psi_{n,t}(y)$ via the empirical diffusion map

$$\Psi_{n,t}:S_n\to\mathbb{R}^n,\qquad x\mapsto\big(\lambda_{n,1}^t\psi_{n,1}(x),\dots,\lambda_{n,n}^t\psi_{n,n}(x)\big)^T.$$
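The empirical diffusion map can be computed from an eigendecomposition of the symmetrized matrix $D_n^{-1/2}K_nD_n^{-1/2}$ (with $K_n$ the kernel matrix), which shares its spectrum with $P_n$. The sketch below, on synthetic two-cluster data with illustrative choices of $h$ and $t$, computes the embedding and checks that within-cluster diffusion distances are much smaller than between-cluster ones:

```python
import numpy as np

rng = np.random.default_rng(1)
# two tight, well-separated clusters of 5 points each
X = np.vstack([rng.normal(0, 0.2, (5, 2)), rng.normal(5, 0.2, (5, 2))])
h, t = 1.0, 3

sq = ((X[:, None] - X[None]) ** 2).sum(-1)
K = np.exp(-sq / (2 * h**2))
d = K.sum(1)

# Symmetrized matrix shares eigenvalues with the transition matrix P = D^{-1} K.
S = K / np.sqrt(np.outer(d, d))
lam, U = np.linalg.eigh(S)           # ascending order
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]

# Right eigenvectors of P: psi_j = D^{-1/2} u_j.
Psi = U / np.sqrt(d)[:, None]

# Empirical diffusion map at time t (dropping the trivial j = 0 coordinate).
emb = (lam[None, 1:] ** t) * Psi[:, 1:]

# Within-cluster diffusion distance is much smaller than between-cluster.
d01 = np.linalg.norm(emb[0] - emb[1])   # same cluster
d05 = np.linalg.norm(emb[0] - emb[5])   # different clusters
print(d01 < d05)   # True
```

The top eigenvalue is $1$ (the Perron eigenvalue of $P_n$), and the relative scaling of the $\psi_{n,j}$ does not affect the qualitative comparison.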

### 2.3. The Laplace-Beltrami operator on Riemannian manifolds

The Laplace-Beltrami operator on a Riemannian manifold is a generalization of the Laplace operator on Euclidean spaces. Let $f$ be an (infinitely) differentiable function with continuous derivatives on a $q$-dimensional compact and smooth Riemannian manifold $\mathcal{M}$, and let $\nabla_{\mathcal{M}}f$ be the gradient vector field on $\mathcal{M}$ (i.e., $\nabla_{\mathcal{M}}f(x)$ is the direction of steepest ascent of $f$ at the point $x$). The Laplace-Beltrami operator is defined as the divergence of the gradient vector field

$$\Delta_{\mathcal{M}}f=-\mathrm{div}(\nabla_{\mathcal{M}}f),$$

where the div operator is relative to the volume form of $\mathcal{M}$. Here we adopt the convention with the minus sign in front of the divergence so that $\Delta_{\mathcal{M}}$ is a positive-definite operator. By integration by parts, we have, for any two differentiable functions $f$ and $g$,

$$\int_{\mathcal{M}}g(x)\,\Delta_{\mathcal{M}}f(x)\,d\mathrm{Vol}_{\mathcal{M}}(x)=\int_{\mathcal{M}}\big\langle\nabla_{\mathcal{M}}g(x),\nabla_{\mathcal{M}}f(x)\big\rangle\,d\mathrm{Vol}_{\mathcal{M}}(x),$$

where the inner product is taken in the $q$-dimensional tangent space of $\mathcal{M}$ at the point $x$. In a Euclidean space (i.e., $\mathcal{M}=\mathbb{R}^q$), the Laplace-Beltrami operator is the usual Laplace operator

$$\Delta f=-\sum_{j=1}^{q}\frac{\partial^2 f}{\partial x_j^2}.$$

On a general $q$-dimensional Riemannian manifold $\mathcal{M}$, the Laplace-Beltrami operator in a local coordinate system $(e_1,\dots,e_q)$ with metric tensor $G=(g_{ij})$ is given by

$$\Delta_{\mathcal{M}}f=-\frac{1}{\sqrt{\det(G)}}\sum_{j=1}^{q}\frac{\partial}{\partial e_j}\left(\sqrt{\det(G)}\sum_{i=1}^{q}g^{ij}\frac{\partial f}{\partial e_i}\right),$$

where $g^{ij}$ are the entries of $G^{-1}$. In the special case $\mathcal{M}=\mathbb{R}^q$, $G$ is the identity matrix. Since $\Delta_{\mathcal{M}}$ is a self-adjoint positive-definite operator with compact resolvent, its spectrum consists of a sequence of nonnegative eigenvalues. If in addition $\mathcal{M}$ is connected, then the second smallest eigenvalue is strictly positive. As we will show, this eigenvalue depends on the connectivity of the manifold (Figure 2), thus characterizing the limiting mixing time of the empirical random walk over $S_n$ as $n\to\infty$ and $h\to 0$, when $S_n$ is sampled from the manifold $\mathcal{M}$.
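The role of the second smallest eigenvalue as a connectivity measure can be illustrated with the discrete graph Laplacian, the graph analogue of $\Delta_{\mathcal{M}}$; the toy graphs below are illustrative:

```python
import numpy as np

# Second-smallest eigenvalue of the unnormalized graph Laplacian L = D - A
# (the Fiedler value): larger means better connected, zero means disconnected.
def second_eigenvalue(A):
    L = np.diag(A.sum(1)) - A
    return np.sort(np.linalg.eigvalsh(L))[1]

n = 6
path = np.zeros((n, n))              # weakly connected: a path graph
for i in range(n - 1):
    path[i, i + 1] = path[i + 1, i] = 1.0
complete = np.ones((n, n)) - np.eye(n)  # strongly connected: complete graph

# The complete graph is better connected, hence a larger second eigenvalue.
print(second_eigenvalue(path) < second_eigenvalue(complete))  # True

# A graph with two components has second eigenvalue exactly zero.
disconnected = np.zeros((n, n))
disconnected[:3, :3] = complete[:3, :3]
disconnected[3:, 3:] = complete[:3, :3]
print(abs(second_eigenvalue(disconnected)) < 1e-8)  # True
```

For the path graph the value is $2(1-\cos(\pi/6))\approx 0.27$, while for the complete graph on $6$ vertices it is $6$, mirroring the slower versus faster mixing of the corresponding random walks.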

## 3. Diffusion K-means

Recall that in our clustering model, $X_1,\dots,X_n$ is a sample of independent random variables taking values in $\mathcal{M}$, where $\mathcal{M}$ is the union of $K$ disjoint Riemannian submanifolds embedded in the ambient space $\mathbb{R}^p$. The clustering problem is to divide these data points into $K$ clusters, so that points in the same cluster belong to the same connected component of $\mathcal{M}$, based on a similarity measure between the points. In particular, the (classical) $K$-means clustering method minimizes the total intra-cluster squared Euclidean distances

$$\min_{G_1,\dots,G_K}\sum_{k=1}^{K}\frac{1}{|G_k|}\sum_{i,j\in G_k}\|X_i-X_j\|^2$$

over all possible partitions $G_1,\dots,G_K$ on $[n]$, where $|G_k|$ is the cardinality of $G_k$. Dropping the sum of squared norms $\sum_{i=1}^{n}\|X_i\|^2$, we see that $K$-means clustering is equivalent to maximizing the total within-cluster covariances

$$\max_{G_1,\dots,G_K}\sum_{k=1}^{K}\frac{1}{|G_k|}\sum_{i,j\in G_k}a_{ij},\qquad\text{with } a_{ij}=X_i^TX_j.$$

Here, $a_{ij}$ can be viewed as a similarity measure specified by the Euclidean inner product $X_i^TX_j$. In general, we can replace the Euclidean inner product with any other inner product [7]. For manifold clustering, we replace it with the inner product induced from the empirical diffusion distance, that is,

$$\langle x,y\rangle_{D_{n,t}}=\big\langle\Psi_{n,t}(x),\Psi_{n,t}(y)\big\rangle_{\mathbb{R}^n}=\sum_{j=1}^{n}\lambda_{n,j}^{2t}\psi_{n,j}(x)\psi_{n,j}(y),\qquad\forall x,y\in S_n.$$

Henceforth, we will refer to $\langle\cdot,\cdot\rangle_{D_{n,t}}$ as the diffusion affinity. Interestingly, we can obtain this diffusion affinity without explicitly performing an eigendecomposition (spectral decomposition) of the transition probability matrix $P_n$ (or the symmetrized matrix $D_n^{-1/2}K_nD_n^{-1/2}$), where recall that $D_n$ is the diagonal degree matrix and $K_n$ is the empirical kernel matrix. In fact, we may use the following relation linking the empirical diffusion affinity with entries of the matrix $P_n$ raised to the power $2t$ (see Appendix B for details),

$$\langle x,y\rangle_{D_{n,t}}=\sum_{j=1}^{n}\lambda_{n,j}^{2t}\psi_{n,j}(x)\psi_{n,j}(y)=\big[P_n^{2t}D_n^{-1}\big](x,y).$$
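The identity above can be verified numerically. The sketch below (synthetic data, illustrative $h$ and $t$) computes the affinity matrix both ways: once via the spectral decomposition of the symmetrized matrix, and once via the matrix power $P_n^{2t}D_n^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 2))
h, t = 1.0, 2

sq = ((X[:, None] - X[None]) ** 2).sum(-1)
Kmat = np.exp(-sq / (2 * h**2))
d = Kmat.sum(1)
P = Kmat / d[:, None]

# Spectral route: eigendecompose the symmetrized matrix and sum the series.
S = Kmat / np.sqrt(np.outer(d, d))
lam, U = np.linalg.eigh(S)
Psi = U / np.sqrt(d)[:, None]        # right eigenvectors of P
spectral = (Psi * lam**(2 * t)) @ Psi.T

# Matrix-power route: P^{2t} D^{-1}.
direct = np.linalg.matrix_power(P, 2 * t) / d[None, :]

print(np.allclose(spectral, direct))  # True
```

The equality follows from $P_n=D_n^{-1/2}\,S\,D_n^{1/2}$, so $P_n^{2t}D_n^{-1}=D_n^{-1/2}S^{2t}D_n^{-1/2}$, which is exactly the spectral sum with $\psi_{n,j}=D_n^{-1/2}u_j$.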

This motivates a $K$-means clustering method via diffusion distances, referred to as the diffusion $K$-means:

$$\max_{G_1,\dots,G_K}\sum_{k=1}^{K}\frac{1}{|G_k|}\sum_{i,j\in G_k}\big[P_n^{2t}D_n^{-1}\big]_{ij}, \tag{4}$$

where the tuning parameter $t$ is interpreted as the number of steps of the empirical random walk. Note that the affinity matrix $P_n^{2t}D_n^{-1}$ is symmetric. In light of the connections between the diffusion distance and the random walk over $S_n$ in Section 2.2, the diffusion $K$-means attempts to maximize the total within-cluster connectedness.

Note that, for every partition $G_1,\dots,G_K$, there is a one-to-one assignment matrix $H\in\{0,1\}^{n\times K}$ such that $H_{ik}=1$ if $i\in G_k$ and $H_{ik}=0$ otherwise. Thus the diffusion $K$-means clustering problem can be recast as a 0-1 integer program:

$$\max\big\{\langle P_n^{2t}D_n^{-1},\,HBH^T\rangle:\ H\in\{0,1\}^{n\times K},\ H\mathbf{1}_K=\mathbf{1}_n\big\}, \tag{5}$$

where $\mathbf{1}_m$ denotes the $m$-dimensional vector of all ones and $B=\mathrm{diag}(|G_1|^{-1},\dots,|G_K|^{-1})$.
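To see that (5) reproduces the partition-sum objective (4), the following sketch evaluates both forms on a toy symmetric affinity matrix and a fixed partition (all data illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, Kc = 6, 2
A = rng.normal(size=(n, n))
A = (A + A.T) / 2                    # toy symmetric affinity matrix

groups = [[0, 1, 2], [3, 4, 5]]      # a fixed partition into Kc clusters

# Partition-sum form: sum_k (1/|G_k|) * sum_{i,j in G_k} A_ij.
direct = sum(A[np.ix_(g, g)].sum() / len(g) for g in groups)

# Assignment form: H[i,k] = 1 iff i in G_k; B = diag(1/|G_k|).
H = np.zeros((n, Kc))
for k, g in enumerate(groups):
    H[g, k] = 1.0
B = np.diag([1.0 / len(g) for g in groups])
assignment = np.trace(A @ H @ B @ H.T)   # <A, H B H^T> for symmetric A

print(np.isclose(direct, assignment))    # True
```

The equivalence holds because $(HBH^T)_{ij}=1/|G_k|$ when $i,j$ lie in the same cluster $G_k$ and $0$ otherwise.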

The diffusion $K$-means clustering problem (5) is often computationally intractable: polynomial-time algorithms with exact solutions exist only in special cases [26]. For instance, the (classical) $K$-means clustering is an NP-hard integer programming problem with a non-linear objective function [22]. Exact and approximate recovery of various SDP relaxations for the $K$-means [22, 16, 12, 24, 14] have been studied in the literature. However, it remains a challenging task to provide statistical guarantees for clustering methods that capture non-linear features of data taking values on manifolds.

### 3.1. Semidefinite programming relaxations

We consider the SDP relaxations for the diffusion $K$-means clustering. Denote the size of the $k$-th cluster as $n_k=|G_k|$ for $k=1,\dots,K$. Note that every partition $G_1,\dots,G_K$ of $[n]$ can be represented by a partition function $\sigma:[n]\to[K]$ via $\sigma(i)=k$ if and only if $i\in G_k$. If we change the variable $Z=HBH^T$ in the 0-1 integer program formulation (5) of the diffusion $K$-means, then $Z$ satisfies the following properties:

$$Z^T=Z,\quad Z\succeq 0,\quad \mathrm{tr}(Z)=\sum_{k=1}^{K}n_kb_{kk},\quad (Z\mathbf{1}_n)_i=\sum_{k=1}^{K}n_kb_{\sigma(i)k},\quad i=1,\dots,n. \tag{6}$$

For the diffusion $K$-means with $B=\mathrm{diag}(n_1^{-1},\dots,n_K^{-1})$, the last constraint in (6) reduces to $Z\mathbf{1}_n=\mathbf{1}_n$, which does not depend on the partition function $\sigma$. Thus we can relax the diffusion $K$-means clustering to the SDP problem:

$$\hat Z=\arg\max\big\{\langle A,Z\rangle:\ Z\in\mathcal{C}_K\big\}\quad\text{with }\ \mathcal{C}_K=\big\{Z\in\mathbb{R}^{n\times n}:\ Z^T=Z,\ Z\succeq 0,\ \mathrm{tr}(Z)=K,\ Z\mathbf{1}_n=\mathbf{1}_n,\ Z\geq 0\big\}, \tag{7}$$

where $Z\succeq 0$ means that $Z$ is positive semidefinite, $Z\geq 0$ means that all entries of $Z$ are non-negative, and the affinity matrix is $A=P_n^{2t}D_n^{-1}$. We shall use $\hat Z$ to estimate the true "membership matrix" $Z^*$, where

$$Z^*_{ij}=\begin{cases}1/n_k & \text{if } i,j\in G_k^*,\\ 0 & \text{otherwise.}\end{cases} \tag{8}$$

Note that $Z^*$ is a block diagonal matrix (up to a permutation) of rank $K$. If $a_{ij}=\langle X_i,X_j\rangle_{\mathbb{H}}$ for some Hilbert space $\mathbb{H}$, then (7) is the SDP for kernel $K$-means proposed in [7]. In particular, [22] consider the special case of the (Euclidean) $K$-means, where $a_{ij}=X_i^TX_j$. Observe that the SDP relaxation (7) does not require knowledge of the cluster sizes beyond the number of clusters $K$; thus it can handle the general case of unequal cluster sizes.
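As a concrete check, the membership matrix $Z^*$ of (8) is feasible for the constraint set in (7); the sketch below constructs $Z^*$ for a toy partition with unequal cluster sizes and verifies each constraint:

```python
import numpy as np

n, K = 6, 2
groups = [[0, 1, 2, 3], [4, 5]]     # unequal cluster sizes are allowed

# Z*_{ij} = 1/n_k if i, j are in the same cluster G_k, and 0 otherwise.
Z = np.zeros((n, n))
for g in groups:
    Z[np.ix_(g, g)] = 1.0 / len(g)

print(np.allclose(Z, Z.T))                       # symmetric: True
print(np.isclose(np.trace(Z), K))                # trace equals K: True
print(np.allclose(Z @ np.ones(n), np.ones(n)))   # rows sum to one: True
print(bool((Z >= 0).all()))                      # entrywise nonnegative: True
print(bool(np.all(np.linalg.eigvalsh(Z) > -1e-10)))  # PSD: True
print(np.linalg.matrix_rank(Z) == K)             # rank-K block structure: True
```

Each diagonal block $n_k^{-1}\mathbf{1}\mathbf{1}^T$ contributes trace $1$ and a single nonzero eigenvalue $1$, which is why $\mathrm{tr}(Z^*)=\mathrm{rank}(Z^*)=K$.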

### 3.2. Regularized diffusion K-means

In practice, the number of clusters $K$ is rarely known. Note that the SDP problem (7) depends on $K$ only through the constraint $\mathrm{tr}(Z)=K$. Therefore we propose a regularized diffusion $K$-means estimator by dropping the trace constraint and penalizing $\mathrm{tr}(Z)$ as follows,

$$\tilde Z:=\tilde Z_\lambda=\arg\max\big\{\langle A,Z\rangle-n\lambda\,\mathrm{tr}(Z):\ Z\in\mathcal{C}\big\}\quad\text{with }\ \mathcal{C}=\big\{Z\in\mathbb{R}^{n\times n}:\ Z^T=Z,\ Z\succeq 0,\ Z\mathbf{1}_n=\mathbf{1}_n,\ Z\geq 0\big\}, \tag{9}$$

where $\lambda>0$ is the regularization parameter. Recall that $\|Z\|_*$ denotes the nuclear norm of a matrix $Z$ (i.e., the sum of its singular values). Since $\|Z\|_*=\mathrm{tr}(Z)$ for $Z\succeq 0$, it is interesting to note that (9) is the same as the nuclear norm penalized diffusion $K$-means

$$\tilde Z_\lambda=\arg\max\big\{\langle A,Z\rangle-n\lambda\|Z\|_*:\ Z\in\mathcal{C}\big\}. \tag{10}$$

Recalling that the true membership matrix $Z^*$ has a block diagonal structure with rank $K$, the nuclear norm penalized diffusion $K$-means (10) can be thought of as a convex relaxation of the NP-hard rank minimization problem (i.e., minimizing the number of non-zero eigenvalues). Thus the parameter $K$ in the clustering problem plays a role similar to the sparsity (or low-rankness) parameter in the matrix completion context [5]. Hence the SDP problem (10) can be viewed as a (further) convex relaxation of the infeasible SDP problem (7) when $K$ is unknown.

It remains to choose the value of $\lambda$. Larger values of $\lambda$ lead to solutions containing fewer clusters (of larger sizes). In particular, when the matrix $A$ is positive-definite, the following Lemma 3.1 shows that the solution reduces to a rank-one matrix assigning all points to one giant cluster when $\lambda$ is large enough, and becomes the identity matrix assigning the $n$ points to $n$ distinct clusters when $\lambda$ is small enough. In addition, the trace of the solution is nonincreasing in $\lambda$.

###### Lemma 3.1.

Suppose $A$ is a positive definite matrix, and let $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ denote its respective largest and smallest eigenvalues.
(1) If $n\lambda>\lambda_{\max}(A)$, then $\tilde Z_\lambda=J_n/n$, where $J_n$ is the matrix of all ones, is the unique solution of the SDP (9).
(2) If $n\lambda<\lambda_{\min}(A)$, then $\tilde Z_\lambda=I_n$, the identity matrix, is the unique solution of the SDP (9).
(3) Let $\tilde Z_{\lambda_1}$ and $\tilde Z_{\lambda_2}$ be two solutions of the SDP (9) with the regularization parameter taking values $\lambda_1$ and $\lambda_2$, respectively. If $\lambda_1<\lambda_2$, then $\mathrm{tr}(\tilde Z_{\lambda_1})\geq\mathrm{tr}(\tilde Z_{\lambda_2})$. Furthermore, if at least one of the two SDPs has a unique solution, then $\mathrm{tr}(\tilde Z_{\lambda_1})>\mathrm{tr}(\tilde Z_{\lambda_2})$.

According to the interpretation of the SDP (7), the trace of the solution $\tilde Z_\lambda$ can be viewed as the fitted number of clusters. Consequently, Lemma 3.1 implies that smaller values of $\lambda$ result in more clusters (with smaller sizes) in $\tilde Z_\lambda$. In practice, we need to properly select the tuning parameter $\lambda$, and we propose the following decision rule. For each candidate $\lambda$, we solve the SDP problem (9) and record the value of $\mathrm{tr}(\tilde Z_\lambda)$. We then plot the solution path $\mathrm{tr}(\tilde Z_\lambda)$ versus $\lambda$ and pick the value of $\mathrm{tr}(\tilde Z_\lambda)$ that persists over the longest interval of $\lambda$ (on the logarithmic scale). We recommend the logarithmic scale since values of $\lambda$ yielding non-trivial solutions tend to be close to zero. Algorithm 1 below summarizes this decision rule.
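The selection rule can be sketched as follows on a hypothetical solution path (the trace values below are made up for illustration, not computed from an SDP): record $\mathrm{tr}(\tilde Z_\lambda)$ on a logarithmic grid of $\lambda$ and return the trace value whose flat stretch is longest in $\log\lambda$:

```python
import numpy as np

lambdas = np.logspace(-4, 0, 9)                   # candidate grid for lambda
traces  = np.array([6, 6, 3, 3, 3, 3, 2, 1, 1])   # hypothetical tr(Z_lambda) path

# Scan maximal flat stretches of the path and measure their length in log-lambda.
best_val, best_len = None, -np.inf
i = 0
while i < len(traces):
    j = i
    while j + 1 < len(traces) and traces[j + 1] == traces[i]:
        j += 1
    length = np.log(lambdas[j]) - np.log(lambdas[i])   # length on log scale
    if length > best_len:
        best_val, best_len = traces[i], length
    i = j + 1

print(best_val)   # 3: the most persistent trace value is the fitted K
```

On this made-up path, the value $3$ persists over four grid points while $6$, $2$, and $1$ persist over at most two, so the rule selects $3$ clusters.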

Figure 3 shows the empirical result on the three-cluster (one disk and two annuli) example from Section 1. According to Lemma 3.1, the estimated number of clusters, proxied by $\mathrm{tr}(\tilde Z_\lambda)$, is a non-increasing function of $\lambda$. In particular, the trace in the solution path in the upper left panel of Figure 3 stabilizes around the values $3$ and $2$, indicating that both are candidate values for the number of clusters. The interval of $\lambda$ (on the logarithmic scale) corresponding to the value $3$ is much larger than that corresponding to $2$, indicating that $3$ is more likely to be the true number of clusters (cf. the (correct) case in Figure 3). In Section 4, we will use our theory to explain this stabilization phenomenon, which partially justifies our selection rule.

In addition, as we can see from the remaining three panels of Figure 3, gradually increasing $\lambda$ makes the adaptive diffusion $K$-means produce a hierarchical clustering structure. Unlike top-down or bottom-up clustering procedures, which are based on greedy rules and can incur inconsistency, the hierarchical clustering structure produced by our approach is consistent: it does not depend on the order of partitioning or merging, due to the uniqueness of the global solution of the convex optimization via the SDP. Similar observations can be drawn from another example, shown in Figure 4, containing a uniform sample on three rectangles (see DGP 2 in our simulation studies in Section 5 for details).

Further, it is interesting to observe in Figure 3 that the regularized diffusion K-means tuned to two clusters merges the outer annulus and the disk, which gives the largest total diffusion affinity in the objective function (4) among the three possible pairings of the true clusters. Since the diffusion affinity decays exponentially fast to zero in the squared Euclidean distance (for the Gaussian kernel), the diffusion affinity matrix tends to have a block-diagonal structure, as weights between points belonging to different clusters are exponentially small (cf. (23) and Lemma 6.3). Thus, on examples with relatively well-separated clusters such as the one in Figure 3, the SDP tends to merge the two clusters with the largest within-cluster diffusion affinities, irrespective of the between-cluster Euclidean distances. This may lead to a visually less appealing merge than in the Euclidean distance case (cf. the (under) case in Figure 3). On the other hand, the regularized diffusion K-means is able to produce a more reasonable partition when splitting the clusters (cf. the bottom-left panel in Figure 3). In particular, if the regularization parameter λ is chosen such that the corresponding number of clusters in the SDP solution is greater than the true number, then this causes a split of one of the true clusters that minimizes the between-cluster diffusion affinities after the splitting. Moreover, in our simulation studies (setup DGP 3 in Section 5), we observe that the SDP-relaxed regularized diffusion K-means performs much better than the spectral clustering methods in harder cases where the true clusters are not well separated.
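The near block-diagonal structure of the diffusion affinity can be illustrated numerically. The sketch below is ours, not the paper's algorithm: it builds a Gaussian kernel with a fixed (illustrative) bandwidth `h`, normalizes it into a random-walk transition matrix, and takes `t` diffusion steps; for well-separated clusters the off-diagonal blocks remain exponentially small.

```python
import numpy as np

def diffusion_affinity(X, h, t):
    """Sketch: Gaussian kernel -> row-stochastic random walk -> t-step
    diffusion. `h` (bandwidth) and `t` (number of steps) are illustrative
    parameters, not the choices analyzed in the paper."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    K = np.exp(-d2 / h**2)                                # Gaussian weights
    P = K / K.sum(axis=1, keepdims=True)                  # transition matrix
    return np.linalg.matrix_power(P, t)                   # t diffusion steps
```

For two tight clusters at Euclidean distance of a few bandwidths, the between-cluster entries of the result are numerically indistinguishable from zero, which is the block structure the SDP exploits.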

### 3.3. Localized diffusion K-means

For clustering problems with clusters of different sizes, dimensions and densities, the diffusion K-means may have limitations, since a single bandwidth parameter is used to control the local geometry over the whole domain. More precisely, according to our theory (for example, Theorem 4.1 below), the optimal choice of the bandwidth parameter for a d-dimensional submanifold forming one connected component in our clustering model depends on the sample size within this cluster, and thus on the local cluster size or density level. Figure 5 demonstrates such an example for a mixture of three bivariate Gaussians, consisting of one larger Gaussian component with low density and two smaller Gaussian components with high density. Empirically, the diffusion K-means fails on this example (even after tuning) because the larger and smaller clusters have very different local densities. This motivates us to consider a variant of diffusion K-means, termed the localized diffusion K-means, which uses a local adaptive bandwidth hi for each Xi. In particular, we adopt the self-tuning procedure from [33] by setting hi to be the distance from Xi to its k0-th nearest neighbor, and replacing K with K† (and accordingly replacing D with D†, the diagonal degree matrix associated with K†) given by

 K†n(Xi,Xj) = exp(−∥Xi−Xj∥₂² / (hihj)),

in the SDP (7). Note that K† is generally no longer a positive semidefinite matrix. Intuitively, for Xi in the k-th cluster, the local scaling hi automatically adapts to the local density around Xi, the cluster size, and the dimension of the k-th Riemannian submanifold. Specifically, for each cluster, the corresponding principal submatrix of K† resembles a Gaussian kernel matrix with a homogeneous bandwidth that adapts to the local geometry within that cluster. For points Xi and Xj belonging to distinct clusters that are properly separated, K†n(Xi,Xj) tends to be close to zero and is less affected by the choice of hi and hj.
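A minimal sketch of this self-tuning construction follows. The function name `self_tuning_affinity` and the parameter name `k0` (the nearest-neighbor order) are ours; the kernel matches the displayed formula, with hi the distance from Xi to its k0-th nearest neighbor.

```python
import numpy as np

def self_tuning_affinity(X, k0=7):
    """Locally adaptive (self-tuning) affinity: a sketch of the kernel
    K†(Xi, Xj) = exp(-||Xi - Xj||^2 / (hi * hj)), where hi is the distance
    from Xi to its k0-th nearest neighbor. Returns the affinity matrix and
    the row-stochastic random-walk transition matrix."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    # row-sorted distances include the point itself at index 0, so index k0
    # is the k0-th nearest neighbor
    h = np.sqrt(np.sort(d2, axis=1)[:, k0])
    K = np.exp(-d2 / np.outer(h, h))                      # local scaling
    deg = K.sum(axis=1)                                   # degree matrix entries
    P = K / deg[:, None]                                  # transition matrix
    return K, P
```

As the text notes, for points in distinct, well-separated clusters the affinity entries are nearly zero regardless of the local bandwidths, so the matrix retains its block structure.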

The local bandwidth hi is larger in lower-density regions, where the degree function at Xi is smaller, so that the random walk can mix faster in such regions. Overall, such a locally adaptive choice of bandwidth improves the mixing time of the random walk within each cluster, while keeping the between-cluster jumping probabilities small. As a consequence, the pairwise diffusion affinity matrix in our SDP formulation (7) tends to exhibit a clearer block form reflecting the clustering structure. To compute K†, we only need to specify the nearest-neighbor order k0 in place of the (non-adaptive) bandwidth parameter, whose value depends on the unknown cluster sizes, the dimensions of the submanifolds, and the underlying probability density functions. In contrast, the simple choice