Learning by Unsupervised Nonlinear Diffusion

10/15/2018 ∙ by Mauro Maggioni, et al. ∙ Tufts University, Johns Hopkins University

This paper proposes and analyzes a novel clustering algorithm that combines graph-based diffusion geometry with density estimation. The proposed method is suitable for data generated from mixtures of distributions with densities that are both multimodal and have nonlinear shapes. A crucial aspect of this algorithm is to introduce time of a data-adapted diffusion process as a scale parameter that is different from the local spatial scale parameter used in many clustering and learning algorithms. We prove estimates for the behavior of diffusion distances with respect to this time parameter under a flexible nonparametric data model, identifying a range of times in which the mesoscopic equilibria of the underlying process are revealed, corresponding to a gap between within-cluster and between-cluster diffusion distances. This analysis is leveraged to prove sufficient conditions guaranteeing the accuracy of the proposed learning by unsupervised nonlinear diffusion (LUND) algorithm. We implement the LUND algorithm numerically and confirm its theoretical properties on illustrative datasets, showing that the proposed method enjoys both theoretical and empirical advantages over current spectral clustering and density-based clustering techniques.


1 Introduction

Unsupervised learning is a central problem in machine learning, requiring that data be analyzed without a priori knowledge of any class labels. One of the most common unsupervised problems is the problem of clustering, in which the data is to be partitioned into clusters so that each cluster contains similar points and distinct clusters are sufficiently separated. In general, this problem is ill-posed; various geometric, analytic, topological, and statistical assumptions on the data and measurement method are imposed to make the problem tractable. Developing conditions under which empirically effective clustering methods have performance guarantees is an active research topic [42, 62, 90, 11, 50, 69, 36, 37, 54], as is the development of broad analyses and characterizations of generic clustering methods [1, 16].

Clustering techniques abound. Some of the most popular include $K$-means clustering and its variants [34, 9, 64], hierarchical methods [43, 34], density-based methods [31], and mode-based methods [35, 25, 18, 65, 41]. Feature extraction is often combined with these standard methods to improve clustering performance. In particular, spectral methods [70, 62, 23, 89] construct graphs representing the data, and use the spectral properties of the resulting weight matrix or Laplacian to produce structure-revealing features of the data.

Graphs often encode pairwise similarities between points, typically local: this "spatial" scale is often determined by a scale parameter $\sigma$. For example, only points within distance $\sigma$ of each other may be connected, with a weight given by a kernel of the distance between them (e.g. a Gaussian). From the graph, global features on the data may be derived if needed, for example by considering the eigenfunctions of the random walk on the graph. Alternatively, graphs may be used to introduce data-adaptive distances, such as diffusion distances, which are associated to random walks and diffusion processes on graphs [23, 61, 47, 24, 74, 66, 98, 48, 49, 26, 52]. Diffusion distances do not depend only on the graph itself, but also on a time parameter $t$ that determines a scale on the graph at which these distances are considered, related to the time of the diffusion or random walk. Choosing $\sigma$ in graph-based algorithms, and both $\sigma$ and $t$ in the case of diffusion distances, is important in both theory and applications [85, 66, 98, 57]. However, their role is rather well-understood only in certain limiting regimes (e.g. small $\sigma$ and small $t$), which are of interest in some settings (e.g. manifold learning) but not necessarily in the case of clustering.

We propose a clustering algorithm, called learning by unsupervised nonlinear diffusion (LUND), which combines diffusion distances and density estimation to efficiently cluster data generated from a nonparametric model. This method was first proposed in the empirical study of high-dimensional hyperspectral images [57, 58, 59], where it was shown to enjoy competitive performance with state-of-the-art clustering algorithms on specific data sets. At the same time, we advance the understanding of the relationship between the local "spatial" scale parameter $\sigma$ and the diffusion time parameter $t$ in the context of clustering, demonstrating how the role of $t$ can be exploited to successfully cluster data sets for which $K$-means, spectral clustering, or density-based clustering methods fail, and providing quantitative bounds and guarantees on the performance of the proposed clustering algorithm for data that may be highly nonlinear and of variable density. We moreover provide sufficient conditions under which LUND correctly estimates the number of clusters $K$.

1.1 Major Contributions and Outline

This article makes two major contributions. First, explicit estimates on diffusion distances for nonparametric data are proved: we obtain lower bounds for the diffusion distance between clusters, and upper bounds on the diffusion distance within clusters, as a function of the time parameter $t$ and suitable properties of the clusters. Together, these bounds yield a mesoscopic – not too small, not too large – diffusion time-scale at which diffusion distances separate clusters clearly and cohere points in the same cluster. These results, among other things, show how the role of the time parameter, which controls the scale "on the data" of the diffusion distances, is very different from the commonly-used scaling parameter $\sigma$ in the construction of the underlying graph, which is a local spatial scale measured in the ambient space. The relationships between $\sigma$ and $t$ are well-understood in the asymptotic regime of $n \to \infty$, $\sigma \to 0$ (at an appropriate rate with $n$ [23, 47, 89]), and $t \to 0$ (essentially Varadhan's lemma applied to diffusions on a manifold [27, 46]). These relationships imply that the choice of $t$ is essentially irrelevant, since in these limits diffusion distances are essentially geodesic distances. However, the clustering phenomena we are interested in are far from this regime, and we show that the interplay between $\sigma$, $t$, and $n$ becomes crucial.
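For reference, one standard form of the small-time asymptotic invoked above (Varadhan's lemma for the heat kernel $H_t$ on a compact Riemannian manifold $\mathcal{M}$ with geodesic distance $d_{\mathcal{M}}$) is

$$\lim_{t \to 0^{+}} \; -4t \log H_t(x, y) \;=\; d_{\mathcal{M}}(x, y)^2, \qquad x, y \in \mathcal{M},$$

so that in this limit small-time diffusion recovers geodesic distances, and the particular choice of $t$ carries little additional information about cluster structure.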

Second, the LUND clustering algorithm is proposed and shown to enjoy performance guarantees for clustering under a broad class of nonparametric mixture models. We prove sufficient conditions for LUND to correctly determine the number of clusters in the data and to have low clustering error. From the computational perspective, we present an efficient algorithm implementing LUND, which scales essentially linearly in the number of points $n$ and in the ambient dimension $D$ for intrinsically low-dimensional data. We test our algorithm on synthetic data, studying empirically the relationships between the different parameters in LUND, in particular between $\sigma$ and $t$, and comparing with popular and related clustering algorithms, including spectral clustering and fast search and find of density peaks clustering (FSFDPC), illustrating weaknesses of these methods and corresponding advantages of LUND. Our experiments illustrate how LUND combines the benefits of spectral clustering and FSFDPC, allowing it to learn nonlinear structure in data while also being guided by regions of high density.

The outline of the article is as follows. Background is presented in Section 2. In Section 3, motivational datasets and a summary of the theoretical results are presented and discussed. Theoretical comparisons with spectral clustering and density methods are also made in Section 3. Estimates on diffusion distances are proved in Section 4. Performance guarantees for the LUND algorithm are proved in Section 5. Numerical experiments and computational complexity are discussed in Section 6. Conclusions and future research directions are given in Section 7.

2 Background

2.1 Background on Clustering

Clustering is the process of determining groupings within data and assigning labels to data points according to these groupings—without supervision. Given the wide variety of data of interest to scientific practitioners, many approaches to clustering have been developed, whose performance is often wildly variable and data-dependent. Mathematical assumptions are placed on the data in order to prove performance guarantees.

2.1.1 K-Means

A classical and popular clustering algorithm is $K$-means [83, 34], together with its variants [63, 9, 64], which is often used in conjunction with feature extraction methods. In $K$-means the data is partitioned into $K$ (a user-specified parameter) groups, where the partition $\{C_k\}_{k=1}^{K}$ is chosen to minimize the within-cluster dissimilarity $\sum_{k=1}^{K}\sum_{x \in C_k}\|x - \bar{x}_k\|_2^2$, where $\bar{x}_k$ is the mean of the cluster $C_k$ (for a given partition, it is the minimizer of the least squares cost in the inner sum). While very popular in practice, $K$-means and its variants are known to perform poorly for datasets that are not the union of well-separated, near-spherical clusters, and are often sensitive to outliers.

2.1.2 Spectral Methods

The family of clustering methods known as spectral methods or spectral clustering compute features that reveal the structure of data that may deviate from the spherical, Gaussian shapes ideal for $K$-means, and in particular may be nonlinear or elongated in shape. This is done by building local connectivity graphs on the data that encode pairwise similarities between points, then computing a spectral decomposition of adjacency, random walk, or Laplacian operators defined on this graph. Focusing on the graph Laplacian (the other operators are related), one uses its eigenvectors as global features input to $K$-means, enabling clustering of nonlinear data that $K$-means alone would fail to cluster accurately.

More precisely, let $X = \{x_i\}_{i=1}^{n} \subset \mathbb{R}^D$ be a set of points to cluster. Let $G$ be a graph with vertices corresponding to points of $X$ and edges stored in an $n \times n$ symmetric weight matrix $W$. Often one chooses $W_{ij} = \mathcal{K}(x_i, x_j)$ for some (symmetric, often radial and rapidly decaying) nonnegative kernel $\mathcal{K}$, such as the Gaussian kernel $\mathcal{K}(x_i, x_j) = \exp(-\|x_i - x_j\|_2^2/\sigma^2)$. The graph may be fully connected, or it may be a nearest neighbors graph with respect to some metric. Let $\mathrm{diag}(d)$ be the diagonal matrix with entries $d_i = \sum_{j} W_{ij}$. The graph Laplacian is constructed as $L = \mathrm{diag}(d) - W$. One then normalizes $L$ to acquire either the random walk Laplacian $L_{\mathrm{rw}} = \mathrm{diag}(d)^{-1} L$ or the symmetric Laplacian $L_{\mathrm{sym}} = \mathrm{diag}(d)^{-1/2} L\, \mathrm{diag}(d)^{-1/2}$. We focus on $L_{\mathrm{rw}}$ in what follows. It can be shown that $L_{\mathrm{rw}}$ has real eigenvalues $0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n$ and corresponding eigenvectors $\phi_1, \dots, \phi_n$. The original data can be clustered by clustering the embedded data $x_i \mapsto (\phi_1(x_i), \dots, \phi_k(x_i))$ for an appropriate choice of $k$. In this step typically $K$-means is used, though Gaussian mixture models may (and perhaps should) be used, as they enjoy, unlike $K$-means, a suitably-defined statistical consistency guarantee in the infinite sample limit [10]. If clusters in the original data are sufficiently far apart and points within a cluster are sufficiently nearby in a suitable sense, spectral clustering with an appropriate kernel can drastically improve over $K$-means [62].
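A compact sketch of this pipeline (kernel construction, normalized Laplacian, spectral embedding, K-means on the embedding) might read as follows; the kernel scale and the use of scikit-learn's K-means are illustrative choices, not prescriptions from the paper:

import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(X, K, sigma=1.0):
    """Spectral clustering with the random walk Laplacian, computed through
    its symmetric conjugate for numerical stability."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-d2 / sigma**2)                     # Gaussian kernel weights
    deg = W.sum(axis=1)                            # degrees d_i
    n = len(X)
    L_sym = np.eye(n) - W / np.sqrt(np.outer(deg, deg))
    vals, vecs = eigh(L_sym)                       # ascending eigenvalues
    # eigenvectors of L_rw are D^{-1/2} times those of L_sym
    U = vecs[:, :K] / np.sqrt(deg)[:, None]
    return KMeans(n_clusters=K, n_init=10).fit_predict(U)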

It is well-known that spectral clustering relaxes a graph-cut problem. For a collection of disjoint subsets $A_1, \dots, A_K \subset X$, the corresponding normalized cut is $\mathrm{Ncut}(A_1,\dots,A_K) = \sum_{k=1}^{K} \mathrm{cut}(A_k, \bar{A}_k)/\mathrm{vol}(A_k)$, where $\mathrm{cut}(A, \bar{A}) = \sum_{x_i \in A,\, x_j \in \bar{A}} W_{ij}$ and $\mathrm{vol}(A) = \sum_{x_i \in A} d_i$. One can partition $X$ by finding a partition minimizing the Ncut quantity, yielding clusters that are simultaneously separated and balanced [70]. However, computing the minimal Ncut is NP-hard [91]. To relax this problem, one notes that $\mathrm{Ncut}(A_1,\dots,A_K) = \mathrm{Tr}(H^{\top} L H)$, where the $A_k$ are as above and $H \in \mathbb{R}^{n \times K}$ has columns $h_k$ with $h_k(i) = 1/\sqrt{\mathrm{vol}(A_k)}$ for $x_i \in A_k$, and $h_k(i) = 0$ for $x_i \notin A_k$. This formulation may be relaxed to $\min_{H \in \mathbb{R}^{n \times K},\ H^{\top}\mathrm{diag}(d)\, H = I} \mathrm{Tr}(H^{\top} L H)$, which has solution given by the matrix consisting of the first $K$ eigenvectors of $L_{\mathrm{rw}}$. Hence, one can approximate the NP-hard problem of minimizing Ncut with an $O(n^3)$-time spectral decomposition problem.

Spectral clustering methods enjoy strong empirical performance on a variety of data sets. However, they are sensitive to clusters of different sizes and densities, and also to parameters in the construction of the underlying graph. There has been significant work on performance guarantees for spectral clustering, which we discuss in Section 3.2.

Instead of using the eigenfunctions of the Laplacian, it is equivalent to analyze the corresponding random walk, represented by the transition matrix $P = \mathrm{diag}(d)^{-1} W$. The graph-cut problem is related to transition probabilities between the partition elements by the following result, which is a discrete counterpart to classical results in the continuous setting involving Brownian motion (see e.g. Banuelos and Burdzy [12] and references therein):

Theorem 2.1

([55, 90]) Let $(X_t)_{t \ge 0}$ be an irreducible, aperiodic Markov chain with transition matrix $P$. Let $G$ be the graph with nodes corresponding to the states of the chain, and edge weight between the $i^{\text{th}}$ and $j^{\text{th}}$ node given by $P_{ij}$. Let $\mu$ be an initial distribution for the Markov chain, and for disjoint subsets $A, B$ of the state space, let $P_{\mu}(B \mid A) = \mathbb{P}_{\mu}(X_1 \in B \mid X_0 \in A)$. Then, when $\mu$ is the stationary distribution $\pi$, $\mathrm{Ncut}(A, \bar{A}) = P_{\pi}(\bar{A} \mid A) + P_{\pi}(A \mid \bar{A})$.

Thus, the normalized graph cut generated by a subset $A$ is essentially the same as the probability of transitioning between the sets $A$ and $\bar{A}$ in one time step, according to the transition matrix $P$. A crucial aspect of the analysis proposed in this article is to study the behavior across many time steps, which makes the proposed method quite different from spectral clustering.

Weaknesses of spectral clustering were scrutinized in [60]. Their approach starts from the observation that a random walk matrix $P$—defined on data sampled from an underlying density $p = e^{-U}$ for some potential $U$—converges, under a suitable scaling as $n \to \infty$ and $\sigma \to 0$, to the stochastic differential equation (SDE) $\dot{x}(t) = -\nabla U(x(t)) + \sqrt{2}\,\dot{w}(t)$, where $w$ is Brownian motion [23]. Moreover, the eigenvectors of $P$ converge to the eigenfunctions of the Fokker-Planck operator $\mathcal{L}\psi = \Delta\psi - \nabla\psi \cdot \nabla U$. The characteristic time scales of the SDE determine the structure of the leading eigenfunctions of $\mathcal{L}$ [39]. These characteristic time scales correspond to the times of transition between different clusters and the equilibration times within clusters, and the relationship between these quantities determines which eigenfunctions of $\mathcal{L}$—or $P$—reveal the cluster structure in the data. Related connections between normalized cuts and exit times from certain clusters are analyzed in [40]. In the same work, several examples are presented, including that of three Gaussian clusters of different sizes and densities shown in Figure 1. These data cannot be clustered by spectral clustering using either the second eigenfunction [70] or the second, third, and fourth eigenfunctions [62].

[Figure 1 panels: (a) data to cluster [60]; (b) spectral clustering [70]; (c) spectral clustering [62]; (d)-(h) eigenvectors 2-6.]
Figure 1: In (a), three Gaussians of essentially the same density are shown. Results of spectral clustering are shown in (b) [70] and (c) [62]. In (d) - (h), the first five non-trivial eigenvectors are shown. As noted in [60], the underlying density for this data yields a Fokker-Planck operator whose low-energy eigenfunctions cannot distinguish between the two smaller clusters, thus preventing spectral clustering from succeeding: higher energy eigenfunctions are required. For this example, the sixth non-trivial eigenvector localizes sufficiently on the small clusters to allow for correct determination of the cluster structure, while spectral clustering would only use up to the fourth eigenvector.

2.1.3 Density and Mode-Based Methods

Density and mode-based clustering methods detect regions of high and low density to determine clusters. The DBSCAN [31] and DBCLASD [95] algorithms assign to the same cluster points that are close and have many near neighbors, and flag as outliers points that lie alone in low-density regions. The mean-shift algorithm [35, 25] pushes points towards regions of high density, and associates clusters with these high-density points. In this sense, mean-shift clustering computes modes in the data and assigns points to their nearest mode. Both DBSCAN and mean-shift clustering suffer from a lack of robustness to outliers and depend strongly on parameter choices.

The recent and popular fast search and find of density peaks clustering algorithm (FSFDPC) [65] proposes to address these weaknesses. This method characterizes class modes as points of high-density that are far in Euclidean distance from other points of high-density. This algorithm has seen abundant applications to scientific problems [82, 94, 84, 67, 92, 45]. However, little mathematical justification for this approach has been given. This article demonstrates that the standard FSFDPC method, while very popular in scientific applications, correctly clusters the data only under strong and unrealistic assumptions on the data. The main reason is that Euclidean distances are used to find modes, which is inappropriate for data drawn from mixtures of multimodal distributions or distributions nearly supported on nonlinear subsets of Euclidean space.

2.2 Background on Diffusion Distances

One of the primary tools in the proposed clustering algorithm is diffusion distances, a class of data-dependent distances computed by constructing Markov processes on data [23, 22] that capture its intrinsic structure. We consider diffusion on the point cloud $X$ via a Markov chain [51] with state space $X$. Let $P$ be the corresponding transition matrix. The following shall be referred to as the usual assumptions on $P$: $P$ is reversible, irreducible, aperiodic, and therefore ergodic. A common construction for $P$, and the one we consider in the algorithmic sections of this article, is to first compute a weight matrix $W$, where $W_{ij} = \exp(-\mathrm{dist}(x_i, x_j)^2/\sigma^2)$ for some appropriate scale parameter $\sigma$, and $\mathrm{dist}$ is a metric, typically the $\ell^2$ norm. The parameter $\sigma$ encodes the interaction radius of each point: large $\sigma$ allows for long-range interactions between points that are far apart in norm, while small $\sigma$ allows only for short-range interactions between points that are close in norm. $P$ is constructed from $W$ by row-normalizing, so that $P_{ij} = W_{ij}/\sum_{k} W_{ik}$. A unique stationary distribution $\pi$ satisfying $\pi P = \pi$ is guaranteed to exist since $P$ is ergodic. In fact, $\pi(x_i) = d_i/\sum_{j} d_j$, where $d_i = \sum_{j} W_{ij}$ is the degree of $x_i$.
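For concreteness, a dense version of this construction can be sketched as follows (a minimal illustration with the Gaussian kernel and Euclidean distance; the paper's numerical experiments may use sparser or differently tuned variants):

import numpy as np

def diffusion_markov_chain(X, sigma):
    """Weight matrix W, row-stochastic transition matrix P, and stationary
    distribution pi for the kernel construction described above."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-d2 / sigma**2)          # W_ij = exp(-dist(x_i, x_j)^2 / sigma^2)
    deg = W.sum(axis=1)                 # degree of each point
    P = W / deg[:, None]                # row-normalize: P_ij = W_ij / deg_i
    pi = deg / deg.sum()                # stationary distribution: pi_i proportional to deg_i
    return W, P, pi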

Diffusion processes on graphs lead to a data-dependent notion of distance, known as diffusion distance [23, 22]. While the focus of the construction is on diffusion distances and the diffusion process itself, we mention that diffusion maps provide a way of computing and visualizing diffusion distances, and may be understood as a type of nonlinear dimension reduction, in which data in a high number of dimensions may be embedded in a low-dimensional space by a nonlinear coordinate transformation. In this regard, diffusion maps are related to nonlinear dimension reduction techniques such as Isomap [87], Laplacian eigenmaps [13], and locally linear embedding [68], among many others. We focus on the (random walk) process itself.

Definition 2.2

Let $t \in \mathbb{N}$ and let $(X_s)_{s \ge 0}$ be a Markov process on $X$ satisfying the usual assumptions and with stationary distribution $\pi$. Let $\delta_x$ be the probability distribution on $X$ concentrated at $x$. For points $x, y \in X$, let $p_t(x, \cdot) = \delta_x P^t$, the distribution of the process at time $t$ when started from $x$. The diffusion distance at time $t$ between $x$ and $y$ is defined by

$$D_t(x, y) = \|p_t(x,\cdot) - p_t(y,\cdot)\|_{\ell^2(1/\pi)} = \Big(\sum_{u \in X} \frac{\big(p_t(x,u) - p_t(y,u)\big)^2}{\pi(u)}\Big)^{1/2}.$$

For an initial distribution $\mu$ on $X$, the vector $\mu P^t$ is the probability distribution over states at time $t$. As $t$ increases, this diffusion process on $X$ evolves according to the connections between the points encoded by $P$. The computation of $p_t(x, u)$ involves summing over all paths of length $t$ connecting $x$ to $u$, hence $D_t(x,y)$ is small if $x$ and $y$ are strongly connected in the graph according to $P^t$, and large if $x$ and $y$ are weakly connected in the graph. It is known that if the underlying graph is generated from data sampled from a low-dimensional structure, such as a manifold, then diffusion distances parametrize this low-dimensional structure [23, 46, 77, 75, 86, 76]. Indeed, diffusion distances admit a formulation in terms of the eigenfunctions of $P$:

$$D_t(x,y) = \Big(\sum_{i=2}^{n} \lambda_i^{2t}\,\big(\psi_i(x) - \psi_i(y)\big)^2\Big)^{1/2}, \qquad (2.1)$$

where $\{(\lambda_i, \psi_i)\}_{i=1}^{n}$ is the $\ell^2(\pi)$-normalized spectral decomposition of $P$, ordered so that $1 = \lambda_1 > |\lambda_2| \ge \dots \ge |\lambda_n|$, and noting that $\psi_1$ is constant by construction.
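A sketch of how (2.1) can be evaluated numerically, using the symmetric conjugation of $P$ that is available under the reversibility assumption (an illustrative implementation, not the paper's code; integer $t$ is assumed):

import numpy as np

def diffusion_distances(P, pi, t):
    """All pairwise diffusion distances at time t via the spectral formula (2.1)."""
    s = np.sqrt(pi)
    A = P * s[:, None] / s[None, :]          # symmetric conjugate of P (reversible case)
    lam, phi = np.linalg.eigh(A)             # ascending eigenvalues of A
    lam, phi = lam[::-1], phi[:, ::-1]       # reorder so that lam[0] = 1
    psi = phi / s[:, None]                   # right eigenvectors of P, l2(pi)-normalized
    coords = psi[:, 1:] * (lam[1:] ** t)     # diffusion map coordinates at time t
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    return np.sqrt(np.maximum(d2, 0.0))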

Diffusion distances are parametrized by $t$, which measures how long the diffusion process on $X$ has run when the distances are computed. Small values of $t$ allow only a small amount of diffusion, which may prevent the interesting geometry of $X$ from being discovered, but provide detailed, fine-scale information. Large values of $t$ allow the diffusion process to run for so long that the fine geometry may be washed out, leaving only coarse-scale information. It is important to rigorously understand how the properties of the data relate to the choice of $t$. This article develops such a theory.

3 Data Model and Overview of Main Results

Among the main results of this article are sufficient conditions for clustering certain discrete data $X = \{x_i\}_{i=1}^{n} \subset \mathbb{R}^D$. The data is modeled as a realization from a probability distribution

$$\mu = \sum_{k=1}^{K} w_k \mu_k, \qquad (3.1)$$

where each $\mu_k$ is a probability measure and the weights $w_k$ are positive and sum to one. Intuitively, our results require separation and cohesion conditions on the $\mu_k$: each $\mu_k$ is far from the others, and connections are strong (in a suitable sense) within the support of each $\mu_k$. $X$ is generated by drawing, for each $x_i$, one of the clusters, say $k$, according to the multinomial distribution with parameters $(w_1, \dots, w_K)$, and then drawing $x_i$ from $\mu_k$. The clusters in the data are defined as the subsets of $X$ whose entries were drawn from a particular $\mu_k$, that is, we let the cluster $X_k$ consist of the points drawn from $\mu_k$. Given $X$, the goal of clustering is to estimate the $X_k$ accurately, and ideally also determine $K$.

We consider a nonparametric model which makes few explicit assumptions on the $\mu_k$. In particular, each $\mu_k$ may have nonlinear support and be multimodal, with multiple high-density regions. These features cause prominent clustering methods to fail, e.g. $K$-means, which requires spherical or extremely well-separated clusters; spectral clustering, which often fails for highly elongated clusters or clusters of different sizes and densities; or density methods, which are highly sensitive to noise and to the selection of parameters. Two simple, motivating examples are shown in Figure 2. They feature variable densities, variable levels of connectivity, both within and across clusters, and (for the second example) nonlinear cluster shapes.

[Figure 2 panels: (a) bottleneck data; (b) nonlinear data; (c) transition matrix for the data in (a); (d) transition matrix for the data in (b).]

Figure 2: (a), (b) show two datasets—linear and nonlinear—colored by density. In (c), (d), we show the corresponding Markov transition matrices $P$, with the magnitudes of their entries displayed. The Markov chains are ergodic, but close to being reducible. The transition matrices were constructed using the Gaussian kernel as in Section 2.2, with the Euclidean distance as the underlying metric.

Our estimates for the behavior of diffusion distances are then leveraged to prove that the LUND algorithm correctly labels the points and also estimates the number of clusters, while other clustering algorithms fail to cluster these data sets correctly.

3.1 Summary of Main Results

Our first result shows that the within-cluster and between-cluster diffusion distances can be controlled, as soon as $P$ is approximately block constant in some sense. Define the (worst-case) within-cluster and between-cluster diffusion distances as:

$$D_t^{\mathrm{in}} = \max_{1 \le k \le K}\ \max_{x, y \in X_k} D_t(x, y), \qquad D_t^{\mathrm{btw}} = \min_{k \ne k'}\ \min_{x \in X_k,\, y \in X_{k'}} D_t(x, y). \qquad (3.2)$$

The following simplification of Theorem 4.9 holds:

Theorem 3.1

Let and let be a corresponding Markov transition matrix on , inducing diffusion distances . Then there exist constants such that the following holds: for any , and for any satisfying , we have

The constants depend on a flexible notion of clusterability of . To get a sense of these constants, let be an idealized version of , in which all edges from points between clusters are deleted, and redirected back into the cluster, in a sense that will be made precise in Section 4. If is constant on each diagonal block corresponding to a cluster, then and . If is a close approximation to , then will be large. If is approximately block constant with blocks of the same size, then and .

The LUND algorithm is a mode-detection clustering algorithm, in which first modes are computed, then points are assigned to a mode. Our approach supposes that modes of the population clusters should have high density, and be far in diffusion distance from other points of high density—regardless of the shape of the support of the distribution. Let $p$ be some kernel density estimator on $X$, for example $p(x) = \frac{1}{Z}\sum_{y \in NN_k(x)} \exp\!\left(-\|x - y\|_2^2/\sigma_0^2\right)$ for some choice of bandwidth $\sigma_0$ and set of nearest neighbors $NN_k(x)$, normalized by $Z$ so that $\sum_{x \in X} p(x) = 1$. Given $p$ defined on $X$, let

$$\rho_t(x) = \min\{D_t(x, y)\ :\ p(y) \ge p(x),\ y \ne x\}, \qquad (3.3)$$

with the convention that $\rho_t(x) = \max_{y \in X} D_t(x, y)$ at the global maximizer of $p$. The function $\rho_t$ measures the diffusion distance from a point to its $D_t$-nearest neighbor of higher empirical density. LUND proceeds by estimating one representative mode from each $X_k$, then assigning all labels based on these learned modes. The LUND algorithm is detailed in Algorithm 1.
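A sketch of the two ingredients just defined, the kernel density estimate $p$ and the quantity $\rho_t$ of (3.3), assuming a precomputed matrix Dt of pairwise diffusion distances (the bandwidth and number of neighbors below are illustrative):

import numpy as np

def kde_and_rho(X, Dt, sigma0=1.0, n_neighbors=20):
    """Empirical density p (from nearest neighbors, normalized to sum to 1)
    and rho_t: the diffusion distance to the nearest point of higher density,
    with the global density maximizer assigned its maximal diffusion distance."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    knn_d2 = np.sort(d2, axis=1)[:, 1:n_neighbors + 1]   # exclude the point itself
    p = np.exp(-knn_d2 / sigma0**2).sum(axis=1)
    p = p / p.sum()
    rho = np.empty(len(X))
    for i in range(len(X)):
        higher = np.where(p > p[i])[0]
        rho[i] = Dt[i].max() if len(higher) == 0 else Dt[i, higher].min()
    return p, rho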

Input: data $X$, scaling parameter $\sigma$, time parameter $t$, threshold
Output: cluster assignments $Y$, estimated number of clusters $\widehat{K}$

1:  Build the Markov transition matrix $P$ using the scale parameter $\sigma$.
2:  Compute an empirical density estimate $p(x)$ for all $x \in X$.
3:  Compute $\rho_t(x)$ for all $x \in X$.
4:  Compute $\mathcal{D}_t(x) = p(x)\rho_t(x)$ for all $x \in X$.
5:  Sort $X$ according to $\mathcal{D}_t$ in descending order as $\{x_{m_i}\}_{i=1}^{n}$.
6:  Compute the estimated number of clusters $\widehat{K}$ from the sorted values of $\mathcal{D}_t$, using the threshold.
7:  Initialize $Y$ as the zero vector of length $n$.
8:  Assign $Y(x_{m_i}) = i$ for $i = 1, \dots, \widehat{K}$.
9:  Sort $X$ according to $p$ in decreasing order as $\{x_{\ell_i}\}_{i=1}^{n}$.
10:  for $i = 1, \dots, n$ do
11:     if $Y(x_{\ell_i}) = 0$ then
12:        Assign $Y(x_{\ell_i})$ the label of the $D_t$-nearest neighbor of $x_{\ell_i}$ of higher density that is already labeled.
13:     end if
14:  end for
Algorithm 1 Learning by Unsupervised Nonlinear Diffusion (LUND) Algorithm
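Putting the pieces together, a minimal end-to-end sketch of Algorithm 1 might look as follows; it reuses the helper functions sketched in Sections 2.2 and 3.1 and, since the paper's thresholding rule for estimating the number of clusters is not reproduced here, it substitutes a simple largest-gap heuristic (an assumption, not the paper's exact rule):

import numpy as np

def lund(X, sigma, t, sigma0=1.0, n_neighbors=20, K=None):
    """Sketch of LUND: modes are points where p(x) * rho_t(x) is large;
    remaining points inherit the label of their diffusion-nearest
    labeled neighbor of higher density."""
    W, P, pi = diffusion_markov_chain(X, sigma)
    Dt = diffusion_distances(P, pi, t)
    p, rho = kde_and_rho(X, Dt, sigma0, n_neighbors)
    score = p * rho
    order = np.argsort(-score)                   # descending by p * rho_t
    if K is None:
        # crude stand-in for the paper's threshold: largest gap among top scores
        ratios = score[order[:-1]] / score[order[1:]]
        K = int(np.argmax(ratios[:20])) + 1
    labels = np.zeros(len(X), dtype=int)         # 0 means "not yet labeled"
    labels[order[:K]] = np.arange(1, K + 1)      # the K modes seed the clusters
    for i in np.argsort(-p):                     # decreasing empirical density
        if labels[i] == 0:
            cand = np.where((p > p[i]) & (labels > 0))[0]
            if len(cand) == 0:                   # fallback: any labeled point
                cand = np.where(labels > 0)[0]
            labels[i] = labels[cand[np.argmin(Dt[i, cand])]]
    return labels, K

On a point cloud X of shape (n, D), one would call, for example, labels, K_hat = lund(X, sigma=0.5, t=30); the values of sigma and t here are placeholders to be tuned as discussed in Sections 3 and 6.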

The LUND algorithm combines density estimation (as captured by $p$) with diffusion geometry (as captured by $\rho_t$). The crucial parameter of LUND is the time parameter $t$, which dictates what is considered close in diffusion distance.

Theorem 3.1 may be used to show that there is a wide range of $t$ for which applying the proposed LUND algorithm is provably accurate. The first concern is to understand conditions guaranteeing that these modes are estimated accurately. Let be the set of cluster density maxima. Define . The following result summarizes Corollaries 5.2 and 5.3.

Theorem 3.2

Suppose $X$ is drawn as above. LUND labels all points accurately, and correctly estimates $K$, provided that

(3.4)

Theorem 3.1 suggests that the condition (3.4) will hold for a wide range of $t$ for a variety of data $X$, so that, together with Theorem 3.2, the proposed method correctly labels the data and estimates the number of clusters correctly. Note that (3.4) implicitly relates the density of the separate clusters to their geometric properties. Indeed, if the clusters are well-separated and cohesive enough so that the ratio $D_t^{\mathrm{in}}/D_t^{\mathrm{btw}}$ is very small, then a large discrepancy in the density of the clusters can be tolerated. Note that these quantities are invariant to increasing $n$, as long as the scale parameter in the kernel used for constructing diffusion distances and the kernel density estimator adjust according to standard convergence results for graph Laplacians [14, 15, 37, 38]. In this sense these quantities are properties of the mixture model.

We note moreover that Theorem 3.2 suggests $t$ must be taken in a mesoscopic range, that is, sufficiently far from 0 but also bounded. Indeed, for $t$ small, $D_t^{\mathrm{in}}$ is not necessarily small, as the Markov process has not yet mixed locally. For $t$ large, $P^t$ converges to the stationary distribution for all initial conditions, and the ratio $D_t^{\mathrm{in}}/D_t^{\mathrm{btw}}$ is not necessarily small, since $D_t^{\mathrm{btw}}$ will be small. In this case, clusters would only be detectable based on density, requiring a thresholding that is susceptible to spuriously identifying regions around local density maxima as clusters.

3.2 Comparisons with Related Clustering Algorithms

LUND combines graph-based methods with density-based methods, and it is therefore natural to compare it with spectral clustering and FSFDPC.

3.2.1 Spectral Clustering

The construction of the Markov transition matrix inducing diffusion distances is essentially the same as the construction of the random walk graph Laplacian. In Theorem 2.1 the graph-cut problem in spectral clustering is related to probability of transitioning between clusters in one time step, while LUND uses intermediate time scales to separate clusters – roughly the time scale at which the random walk has almost reached the stationary distribution conditioned on not leaving a cluster, and has not yet transitioned (with any sizeable probability) to a different cluster.

Spectral clustering enjoys performance guarantees under a range of model assumptions [19, 20, 6, 7, 88, 97, 30, 93, 78, 8, 54]. Under nonparametric assumptions on (3.1), Shi et al. [71] show that the principal eigenfunctions and eigenvalues of the kernel operator associated to the mixture are closely approximated by the principal spectra of the kernel operators associated to the individual components, possibly mixed up, depending on the spectra of the components and the weights $w_k$. This allows the number of classes to be estimated accurately in some situations, and points to be labeled by determining which component distribution certain eigenvectors come from.

The related work of Schiebinger et al. [69] provides sufficient conditions under the nonparametric model (3.1) for the low-dimensional embedding of spectral clustering to send well-separated, coherent regions in input space to approximately orthogonal regions in the embedding space. Subsequent analysis shows the spectral embedding is provably amenable to $K$-means clustering with high probability, yielding guarantees on the accuracy of spectral clustering. These results depend strongly on two quantities: the distance between clusters and the cohesiveness of clusters; the LUND algorithm depends on similar quantities. The distance between clusters and the cohesiveness of clusters are quantified as follows. Consider $\mu$ as in (3.1) and let $\mathcal{K}$ be a kernel. Separation and cohesion quantities are defined, respectively, in terms of the similarity (under $\mathcal{K}$) between distinct mixture components and the similarity within each component.

A major result of Schiebinger et al. [69] is that spectral clustering is accurate with high probability depending on a confidence parameter and the number of data samples if

(3.5)

where the remaining term is a "coupling parameter" that is not germane to the present discussion. The inequality (3.5) will hold only if the cohesion is large relative to the separation, that is, within-cluster coherence must be large relative to the similarity between clusters. Indeed, fixing the separation, (3.5) is more likely to hold if the clusters—the points generated from distinct components $\mu_k$—are relatively spherical in shape. For example, in Figure 3 we show two data sets, each consisting of two clusters, which have the same separation value but substantially different cohesion values. Also note that in the finite-sample case, when the finite-sample term in (3.5) is non-negligible, the importance of the cohesion being not too small only increases.

[Figure 3 panels: (a) isotropic Gaussian clusters; (b) highly anisotropic Gaussian clusters.]

Figure 3: In (a), unit-variance, isotropic Gaussians with means (0,0) and (3,3), respectively, are shown. Highly anisotropic Gaussian data appear in (b), with the same measure of between-cluster distance but a much lower measure of within-cluster coherence. Spectral clustering will enjoy much stronger performance guarantees for the data in (a), compared to the data in (b), for a range of kernel choices.

It is of related interest to compare LUND to spectral clustering by recalling (2.1). In the generic case that $|\lambda_2| > |\lambda_3|$, the term involving $\psi_2$ dominates asymptotically as $t \to \infty$. Hence, as $t \to \infty$, LUND bears resemblance to spectral clustering with the second eigenvector alone [70]. At the other extreme, for $t = 0$, diffusion distances depend on all eigenvectors equally. Using the first $K$ (or the second through $K^{\text{th}}$) eigenvectors is the basis for many spectral clustering algorithms [62, 69], and is comparable to LUND for $t = 0$, combined with a truncation of (2.1). Note that clustering with the kernel matrix alone relates to using all eigenvectors and $t = 1$. By allowing $t$ to be a tunable parameter, LUND interpolates between these extremes: weighting the principal eigenvectors equally ($t = 0$, with the eigendecomposition cut off after the $K^{\text{th}}$ eigenvector), using the kernel matrix ($t = 1$), and using only the second eigenvector ($t \to \infty$). The results of Section 6 validate the importance of this flexibility.
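The interpolation just described can be seen directly from the weights $\lambda_i^{2t}$ that (2.1) assigns to the non-trivial eigenvectors; a toy computation with a hypothetical spectrum:

import numpy as np

lam = np.array([0.999, 0.99, 0.95, 0.80, 0.50])   # hypothetical non-trivial eigenvalues
for t in [0, 10, 100, 1000]:
    w = lam ** (2 * t)
    # at t = 0 all eigenvectors are weighted equally; as t grows, only the
    # eigenvector with the largest eigenvalue retains non-negligible weight
    print(t, np.round(w / w.max(), 4))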

An additional challenge when using spectral clustering is to robustly estimate $K$. The eigengap of the graph Laplacian is a commonly used heuristic, but it is often ineffective when Euclidean distances are used in the case of non-spherical clusters [6, 54]. In contrast, Theorem 3.2 suggests LUND can robustly estimate $K$, which is shown empirically for synthetic data in Section 6.

Computationally, LUND and spectral methods are essentially the same, with the bottleneck in complexity being either the spectral decomposition of a dense matrix (with cost scaling as $n^2$ times the number of eigenvectors sought), or the computation of nearest neighbors when using a sparse diffusion operator or Laplacian (using an indexing structure for a fast nearest neighbor search, this is quasilinear in $n$, with constants depending on the intrinsic dimension $d$ of the data).
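For the sparse regime mentioned above, the kernel can be restricted to a k-nearest-neighbor graph, so that the neighbor search dominates the cost; a sketch using scikit-learn's neighbor search (parameter values are illustrative):

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

def sparse_kernel(X, n_neighbors=50, sigma=1.0):
    """Sparse, symmetrized Gaussian kernel supported on a kNN graph."""
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(X)
    dist, idx = nn.kneighbors(X)                   # each of shape (n, n_neighbors)
    n = len(X)
    rows = np.repeat(np.arange(n), n_neighbors)
    data = np.exp(-dist.ravel() ** 2 / sigma**2)
    W = csr_matrix((data, (rows, idx.ravel())), shape=(n, n))
    return W.maximum(W.T)                          # symmetrize the graph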

3.2.2 Comparison With Local Graph Cutting Algorithms

The LUND algorithm bears some resemblance to local graph cutting algorithms [79, 4, 5, 3, 80, 81, 96, 33]. These methods compute a cluster $S$ around a given vertex $v$ such that the conductance of $S$ is high (see Definition 4.3), and they can be computed in time sublinear in the total number of vertices in the graph and roughly linear in the size of the output cluster $S$. In order to avoid an algorithm that scales linearly or worse in the number of vertices, global features—such as eigenvectors of a Markov transition matrix or graph Laplacian defined on the data—must be avoided. The Nibble algorithm [80] and related methods [3] compute approximate random walks for points near $v$, and truncate steps that take the random walker too far from already explored points. This accounts for the most important steps a random walker would take, and avoids considering all vertices of the graph. In this sense, Nibble and related methods approximate global diffusion with local diffusion in order to compute a local cluster around a prioritized vertex $v$, while LUND uses diffusion to uncover multitemporal structure.

3.2.3 Comparison with FSFDPC

The FSFDPC algorithm [65] learns the modes of distributions in a manner similar to the method proposed in this article, except that the diffusion distance-based quantity $\rho_t$ is replaced with a corresponding Euclidean distance-based quantity, namely the Euclidean distance from each point to its nearest neighbor of higher empirical density. Points are then iteratively assigned the same label as their nearest Euclidean neighbor of higher density. This difference is fundamental: theoretical guarantees for FSFDPC using Euclidean distances do not accommodate a rich class of distributions, and the guarantees proved in this article fail when Euclidean distances are used to compute modes. This is because, for clusters that are multimodal or nonlinear, there is no reason for high-density regions of one cluster to be well-separated in Euclidean distance. In Section 6, we shall see FSFDPC fails for the motivating data in Section 3; examples comparing FSFDPC and LUND for real hyperspectral imaging data appear in Murphy and Maggioni [57, 58].
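For comparison with $\rho_t$, the Euclidean quantity underlying FSFDPC can be sketched as follows (an illustrative implementation of the quantity described above, not the original authors' code); replacing the diffusion distance matrix with this Euclidean one in the mode-selection and label-assignment steps of the LUND sketch recovers the FSFDPC behavior discussed here:

import numpy as np

def fsfdpc_delta(X, p):
    """Distance from each point to its nearest Euclidean neighbor of higher
    density; the density maximizer gets its maximal distance to any point."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = np.where(p > p[i])[0]
        delta[i] = d[i].max() if len(higher) == 0 else d[i, higher].min()
    return delta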

4 Analysis of Diffusion Processes on Data

In this section, we derive estimates for diffusion distances. Let $D_t^{\mathrm{in}}$ and $D_t^{\mathrm{btw}}$ be as in (3.2). The main result of this section is to show that there exists an interval of times on which $D_t^{\mathrm{in}} < D_t^{\mathrm{btw}}$; that is, for $t$ in this interval, the within-cluster diffusion distance is smaller than the between-cluster diffusion distance. Showing that within-cluster distances are small and between-cluster distances are large is essential for any clustering problem. The benefit of using diffusion distances is their adaptability to the geometry of the data: it is possible that the within-cluster diffusion distance is less than the between-cluster diffusion distance even in the case that the clusters are highly elongated and nonlinear. This property does not hold when points are compared with Euclidean distances or other data-independent distances.

4.1 Near Reducibility of Diffusion Processes

Let $P$ be a Markov chain defined on the points $X = \{x_i\}_{i=1}^{n}$, satisfying the usual assumptions, with stationary distribution $\pi$. We will sometimes consider $\pi$ as a function with domain $X$, other times as a vector with indices $1, \dots, n$.

For any initial distribution $\mu$, $\mu P^t \to \pi$ as $t \to \infty$, and moreover this convergence is uniform over the choice of $\mu$. One can quantify the rate of this convergence by estimating the convergence rate of $P^t$ to its stationary distribution in a normalized sense.

Definition 4.1

For a discrete Markov chain with transition matrix $P$ and stationary distribution $\pi$, the relative pointwise distance at time $t$ is

$$\Delta(t) = \max_{x, y \in X} \frac{|P^t(x, y) - \pi(y)|}{\pi(y)}.$$

The rate of decay of $\Delta(t)$ is regulated by the spectrum of $P$ [44, 73]. Indeed, let $1 = \lambda_1 > \lambda_2 \ge \dots \ge \lambda_n > -1$ be the eigenvalues of $P$; note that $\lambda_2 < 1$ follows from $P$ irreducible and $\lambda_n > -1$ follows from $P$ aperiodic [21]. Let $\lambda_{*} = \max_{2 \le i \le n} |\lambda_i|$.

Theorem 4.2

[44, 73] Let $P$ be the transition matrix of a Markov chain on state space $X$ satisfying the usual assumptions. Then

$$\Delta(t) \le \frac{\lambda_{*}^{\,t}}{\min_{x \in X} \pi(x)}.$$

Instead of analyzing the spectrum of $P$ directly, the conductance of the underlying graph may be used to bound $\lambda_2$.

Definition 4.3

Let $G$ be a weighted graph on $X$ with weight matrix $W$, and let $S \subset X$. The conductance of $S$ is

$$\Phi(S) = \frac{\sum_{x_i \in S,\, x_j \notin S} W_{ij}}{\min\big(\mathrm{vol}(S),\, \mathrm{vol}(X \setminus S)\big)}.$$

The conductance of $G$ is $\Phi(G) = \min_{\emptyset \ne S \subsetneq X} \Phi(S)$.

When discussing the conductance of the graph associated to a Markov chain $P$, we will write $\Phi(P)$. The conductance is closely related to $\lambda_2$ (see e.g. [21]):

Theorem 4.4

(Cheeger's Inequality) Let $G$ be a weighted graph with nonnegative edge weight matrix $W$. The second eigenvalue $\lambda_2$ of the associated random walk matrix satisfies

$$1 - 2\Phi(G) \le \lambda_2 \le 1 - \frac{\Phi(G)^2}{2}.$$

Combining Theorem 4.2 and Cheeger's inequality relates $\Delta(t)$ to $\Phi(P)$.

Theorem 4.5

[44, 73] Let $P$ be the transition matrix for a Markov chain on $X$ satisfying the usual assumptions. Suppose the eigenvalues of $P$ are nonnegative. Then

$$\Delta(t) \le \frac{\big(1 - \Phi(P)^2/2\big)^{t}}{\min_{x \in X} \pi(x)}.$$

Note that any Markov chain can be made to satisfy this condition, simply by replacing $P$ with $(P + I)/2$. This keeps the same stationary distribution and reduces the conductance by a factor of $2$. Whether Theorem 4.2 or 4.5 is used, the convergence of $P^t$ towards its stationary distribution is exponential, with rate determined by $\lambda_{*}$ or $\Phi(P)$, that is, by how close to being reducible the chain is. Note that $\Delta(t)$ pointwise majorizes the quantities appearing in the definition of $D_t$, so that for any $t$, and any initial distribution, the diffusion distance between any two points is controlled by $\Delta(t)$ up to a constant factor.

Thus, as $t \to \infty$, $D_t(x, y) \to 0$ uniformly at an exponential rate depending on the conductance of the underlying graph; a similar result holds for $\lambda_{*}$ in place of the conductance. This gives a global estimate on the diffusion distance in terms of $\Phi(P)$ and $t$. Note that a similar conclusion holds by analyzing (2.1), recalling that $\psi_1$ is constant and $|\lambda_i| < 1$ for $i \ge 2$.

Unfortunately, a global estimate may be too coarse for unsupervised learning. Indeed, to get the desired separation of $D_t^{\mathrm{in}}$ and $D_t^{\mathrm{btw}}$, we need to study not the global mixing time, but rather the mesoscopic mixing times, corresponding to the time it takes for points in each cluster to converge towards their mesoscopic equilibria. For this, we consider the multitemporal behavior of the diffusion process, using results from the theory of nearly reducible Markov processes [72, 56]. In what follows, for a matrix $A$, let $\|A\|_{\infty} = \max_i \sum_j |A_{ij}|$ be the maximal row sum.

Suppose the matrix $P$ is irreducible; write $P$ in block decomposition as

$$P = \begin{pmatrix} P_{11} & P_{12} & \cdots & P_{1K} \\ P_{21} & P_{22} & \cdots & P_{2K} \\ \vdots & & \ddots & \vdots \\ P_{K1} & P_{K2} & \cdots & P_{KK} \end{pmatrix}, \qquad (4.1)$$

where each diagonal block $P_{kk}$ is square. Let $I_k$ be the indices of the points corresponding to the block $P_{kk}$. Recall that if the graph corresponding to $P$ is disconnected, then $P$ is a reducible Markov chain; this corresponds to a block diagonal matrix in which $P_{kl} = 0$ for $k \ne l$. Consider instead the case in which $\|P_{kl}\|_{\infty}$ is small but nonzero for $k \ne l$, that is, most of the interactions for points in $I_k$ are contained within $I_k$. This suggests that diffusion on the blocks has dynamics that converge to their own, mesoscopic equilibria before the entire chain converges to a global equilibrium, depending on the weakness of the connections between blocks. Interpreting the index sets $I_k$ as corresponding to the clusters of $X$, this suggests there will be a time range for which points within each cluster are close in diffusion distance to one another but far in diffusion distance from points in other clusters; such a state corresponds to a mesoscopic equilibrium. To make this precise, consider the notion of stochastic complement.

Definition 4.6

Let $P$ be an irreducible Markov matrix partitioned into square block matrices as in (4.1). For a given index $k$, let $P_k$ denote the principal block submatrix generated by deleting the $k^{\text{th}}$ row and $k^{\text{th}}$ column of blocks from (4.1), and let $P_{k*}$ and $P_{*k}$ denote, respectively, the $k^{\text{th}}$ row and $k^{\text{th}}$ column of blocks with $P_{kk}$ removed. The stochastic complement of $P_{kk}$ is the matrix

$$S_{kk} = P_{kk} + P_{k*}\,(I - P_k)^{-1}\,P_{*k}.$$

One can interpret the stochastic complement $S_{kk}$ as the transition matrix for a reduced Markov chain obtained from the original chain, but in which transitions into or out of $I_k$ are masked. More precisely, in the reduced chain $S_{kk}$, a transition is either direct within $I_k$ or indirect, by moving first through points outside of $I_k$, then back into $I_k$ at some future time. Indeed, the term $P_{k*}(I - P_k)^{-1}P_{*k}$ in the definition of $S_{kk}$ accounts for leaving $I_k$ (the factor $P_{k*}$), traveling some time outside of $I_k$ (the factor $(I - P_k)^{-1}$), then re-entering $I_k$ (the factor $P_{*k}$). Note that the factor $(I - P_k)^{-1}$ may be expanded in Neumann sum as $\sum_{j=0}^{\infty} P_k^{\,j}$, clearly showing that it accounts for exiting from a cluster and returning to it after an arbitrary number of steps outside of it.
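The stochastic complement is straightforward to compute numerically from a partition of the state space; a sketch following the formula above:

import numpy as np

def stochastic_complement(P, blocks, k):
    """Stochastic complement S_kk = P_kk + P_k*(I - P_k)^{-1} P_*k of the
    k-th diagonal block, where `blocks` is a list of index arrays
    partitioning the states."""
    idx_k = np.asarray(blocks[k])
    idx_rest = np.concatenate([np.asarray(b) for j, b in enumerate(blocks) if j != k])
    P_kk = P[np.ix_(idx_k, idx_k)]        # direct transitions within block k
    P_kr = P[np.ix_(idx_k, idx_rest)]     # leaving block k
    P_rk = P[np.ix_(idx_rest, idx_k)]     # re-entering block k
    P_rr = P[np.ix_(idx_rest, idx_rest)]  # wandering outside block k
    detour = np.linalg.solve(np.eye(len(idx_rest)) - P_rr, P_rk)
    return P_kk + P_kr @ detour           # rows of the result sum to 1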

The notion of stochastic complement quantifies the interplay between the mesoscopic and global equilibria of $P$. We say a matrix is primitive if it is nonnegative, irreducible, and aperiodic. The following theorem indicates how $P$ may be analyzed when it is derived from data sampled according to (3.1); a proof appears in the Appendix.

Theorem 4.7

([56]) Let $P$ be an irreducible row-stochastic matrix partitioned into $K \times K$ square block matrices as in (4.1), and let $S = \mathrm{diag}(S_{11}, \dots, S_{KK})$ be the completely reducible row-stochastic matrix consisting of the stochastic complements of the diagonal blocks of $P$.

Suppose each is primitive, so that the eigenvalues of satisfy Let diagonalize , and let

where is the stationary distribution for . Then where and . Moreover, letting be any initial distribution and

Note that this result does not require the Markov chain to be reversible, and hence applies to diffusion processes defined on directed graphs. The assumption that $S$ is diagonalizable is not strictly necessary, and similar estimates hold more generally [56].

The estimate splits into two terms. The first term accounts for the approximation of $P$ by a reducible Markov chain; in the context of clustering, it accounts for the between-cluster connections in $P$. The second term accounts for the propensity of mixing within a cluster; in the clustering context, it quantifies the within-cluster distances.

It follows from Theorem 4.7 that there is a range of times for which the dynamics of $P^t$ are close to the dynamics of the reducible Markov chain $S^t$:

Corollary 4.8

Let be as in Theorem 4.7. Suppose that for some , Then and for every initial state configuration ,

In contrast with $t$, these values may be understood as fixed geometric parameters of the dataset which determine the range of times at which mesoscopic equilibria are reached. More precisely, as $n \to \infty$, they converge to natural continuous quantities independent of $n$, and [38] proved that as $n \to \infty$, there is a natural scaling for $\sigma$ in which the (random) empirical eigenvalues of $P$ converge in a precise sense to the (deterministic) eigenvalues of a corresponding continuous operator defined on the support of $\mu$. Thus, the parameters of Corollary 4.8 may be understood as random fluctuations of geometrically intrinsic quantities depending on $\mu$. In the context of the proposed data model, these quantities all have relatively straightforward interpretations.

  • The first quantity is the largest eigenvalue of $S$ not equal to 1. Since $S$ is block diagonal and each $S_{kk}$ is primitive, it follows that it is the largest second eigenvalue among the $S_{kk}$. These second eigenvalues can be controlled in a variety of ways, perhaps most classically in terms of the conductance using Cheeger's inequality [32, 21]: