1 Introduction
1.1 Context and motivation
The analysis of large and possibly highdimensional datasets is becoming ubiquitous in the sciences. The longterm objective is to gain insight into the structure of measurement or simulation data, for a better understanding of the underlying physical phenomena at work. Clustering is one of the simplest ways of gaining such insight, by finding a suitable decomposition of the data into clusters such that data points within a same cluster share common (and, if possible, exclusive) properties.
Among the variety of existing approaches, mode seeking and spectral clustering are most relevant to our work. The first approach assumes the data points to be drawn from some unknown probability distribution and defines the clusters as the basins of attraction of the maxima of the density, requiring a preliminary density estimation phase
[2, 4, 5, 7, 10]. The second approach computes diffusion distances between data points within some neighborhood graph, then applies the classical Kmeans algorithm in the new ambient space [13].In the end, (hard) clustering methods such as above provide a fairly limited knowledge of the structure of the data. While the partition into clusters is well understood, their interplay (respective locations, proximity relations, interactions) remains unknown. Identifying interfaces between clusters is the first step towards a higherlevel understanding of the data, and it already plays a prominent role in some applications such as the study of the conformations space of a protein, where a fundamental question beyond the detection of metastable states is to understand when and how the protein can switch from one metastable state to another [6]. Hard clustering can be used in this context, for instance by defining the border between two clusters as the set of data points whose neighborhood (in the ambient space or in some neighborhood graph) intersects the two clusters, however this kind of information is by nature unstable with respect to perturbations of the data.
Soft clustering appears as the appropriate tool to deal with interfaces between clusters. Rather than assign each data point to a single cluster, it computes a degree of membership to each cluster for each data point. The promise is that points close to the interface between two clusters will have similar degrees of membership to these clusters and lower degrees of membership to the rest of the clusters. Thus, compared to hard clustering, soft clustering uses a fuzzier notion of cluster membership in order to gain stability on the locations of the clusters and their boundaries.
Spectral clustering can be turned into a soft clustering approach by using fuzzy means rather than the regular means on the reembedded data [12], or by considering soft normalized cuts instead of hard normalized cuts [9]. Meanwhile, mode seeking can be extended to do soft clustering in various ways, for instance by pertubing the density estimator and performing the clustering scheme multiple times, to detect the areas of instability and estimate the probability for each data point belonging to each cluster [17].
Although spectral clustering and mode seeking look quite different at first glance, they do share a common link. Spectral clustering is known to be related to random walks on graphs [15], and theoretical studies have highlighted the convergence of the Laplace operator of neighborhood graphs to continuous differential operators of the form: , where is the density associated to the probability distribution from which the data points are drawn [16, 20]. The presence of the gradient
is actually related to the mode seeking paradigm, as mode seeking studies the trajectories of the flow induced by the gradient vector field
. Thus, the two types of approaches (spectral clustering and mode seeking) are connected, and a natural interpolation between them is given by operators of the form , where is a strictly positive temperature parameter balancing between mode seeking () and pure diffusion (). This is what this work is about.1.2 Our approach
The diffusion process with infinitesimal generator and inital state corresponds to the solution of the following stochastic differential equation
(1) 
where is a dimensional Brownian motion.
Our soft clustering scheme proceeds then as follows. Suppose we are given as input a collection of subsets of corresponding to data we can reliably assign to a single cluster—these subsets are called cluster cores. We use the diffusion process solution to (1) to extend the clustering on the whole dataset in the following way: set the degree of membership of a data point to the th cluster as the probability for the diffusion process starting at to hit before any other . In particular, every data point that already belongs to a cluster core at the beginning of the process gets a membership of to the th cluster. The output is the collection of membership distributions over the dataset for clusters to .
In the discrete setting, assuming the input data points are drawn from a probability density satisfying certain regularity conditions, for any temperature parameter we can approximate the diffusion process solution of Equation 1 by a random walk on a neighborhood graph computed on the data (Theorem 1). Then, under suitable sampling conditions on the input data, and modulo the use of reliable estimators of the cluster cores , we show that the cluster membership distributions, defined from the solution of (1), are well approximated (Corollary 2). In practice, we use the output of a hard modeseeking algorithm as our cluster cores, see Section 3.2. Therefore our method only requires a single additional temperature parameter compared to the hard clustering scheme which is standard for softclustering algorithms.
In parallel to our work, a simple version of this approach was proposed in [3], where is set to and cluster cores are reduced to a single point which, as we shall see, can have a detrimental effect on the clustering output.
2 Random walks and diffusion processes
A quick summary of relevant diffusion processes properties can be found in Appendix A. For a more detailed presentation, we refer the reader to [8] or [18].
Let be a probability density on , and let . We assume once and for all that satisfies the following technical conditions:

is continuous over and continuous over ,

,

the SDE (1) over the domain is wellposed.
Conditions ensuring the wellposedness (particularly the nonexplosion) of the SDE can be found in [1, 11].
Let
be i.i.d random variables drawn from
, and let be a density estimator. We study the relationship between random walks on graphs built on using and the solution of (1), for a fixed temperature parameter .Let
be the Markov Chain whose initial state is the closest neighbour of
in (break ties arbitrarily), and whose transition kernel is(2) 
where is the appropriate renormalization factor:
Under some conditions on the estimator , this graphbased random walk approximates the diffusion process in the continuous domain in the following sense: there exists depending on such that, as tends to infinity, with high probability, converges weakly to the solution of (1). Formally, we prove the following theorem:
Theorem 1 (Appendix B).
Let be a decreasing function such that and . Suppose our estimator satisfies, for any compact set and any ,
Then, for any , for any compact set , and for any Borel set of such that for all ,
where .
This result can be generalized to different kinds of graph (such as knn graphs or weighted graphs) and to the manifold setting following the approach of [19].
3 Application to soft clustering
3.1 The Method
The results of Section 2 make it possible to derive a soft variant of mode seeking. In the continuous setting, our input is a collection of cluster cores corresponding to subsets of the data that are known to belong surely to a single cluster. We assume these cores satisfy the following conditions

The are compact sets of that are wellseparated (i.e. for all ).

For each , the boundary of is smooth or at least satisfies the following cone condition: for any and any there exists a truncated cone belonging to , where is the base of the cone, is its length and is its angle.
For any point of and any cluster core , we define the degree of membership of to the th cluster to be the probability for the solution of SDE (1) initialized at to hit before any other cluster core—in the event that it hits any cluster cores at all. Formally, we let be the stopping time of the solution, and we define
We also let be the probability that the solution does not hit any cluster cores at all. The output of the softclustering is the collection for every point . This is the continuous version of the algorithm.
In the discrete setting, the cluster cores are estimated using the input data points. For this we suppose we have access to estimators of the that satisfy the following property:
(3) 
where and . We then compute approximations of using trajectories of the Markov Chains and the instead of the solution of (1) and the . Under suitable assumptions on the estimators , the results of Section 2 imply that the are good approximations of the :
3.2 Practical considerations
Computation of the .
In practice, the value of for is computed by solving the linear system with if belongs to and if the intersection of the connected component of in the neighborhood graph and is zero. is defined in the following way:
where is defined as in (2). is then defined as .
When one has to deal with large amounts of data for which solving the linear system directly is too expensive, it is possible to use an iterative scheme. Set , then define the sequence by . This sequence converges to the solution of .
Neighborhood size.
Our method requires to compute an estimator of the underlying density. Many density estimation methods rely on the choice of a scale parameter (e.g. the window for a kernel density estimator), which we use also as our neighbourhood size.
Temperature parameter.
The choice of parameter depends on how much the clusters in the ground truth overlap. In cases where they are well separated, small values of should be preferred. On the contrary, large values of should be used when there are large overlapping areas. Another way to interpret is as a tradeoff between the respective influence of the metric and of the density in the diffusion process. When is small, the output of our algorithm is mostly guided by the density and therefore close to the output of mode seeking algorithms. By contrast, when is large, our algorithm becomes oblivious to the density and and thus has a behaviour close to that of means algorithms. We will elaborate on these points in Section 4.
Several heuristics for choosing
can be considered in practice. For instance, in the context of biclustering, one can look for transition phases in the plot of the mean lower membership as a function of .Cluster cores.
In practice we use the hard mode seeking algorithm ToMATo [2] to select the cluster cores. ToMATo considers the local maxima of the logarithm of the density estimator in , and for each one of them it computes a measure of prominence. It then selects the subset of the local maxima with prominence higher than a userdefined threshold , and it outputs clusters where the th cluster is obtained by merging the basin of attraction of with the basins of attraction of some of the discarded local maxima. It is then natural to select the central part of these basins of attraction as the cluster cores.
Consider the subgraph of spanned by those vertices such that . We define the th cluster core as the connected component of this subgraph that contains . It can be readily checked from the analysis of ToMATo that is indeed included in the th cluster. Moreover, convergence guarantees associated to these parts have been shown in Theorem 10.1 of [2].
Alternative methods to select the cluster cores can be considered as well, however, as a rule of thumb, it is important to select large enough cores. Indeed, too small of cluster cores imply that the time required by the diffusion process to hit any one of them is larger than its mixing time, which leads to a poor clustering output. This is related to the fact that a diffusion process has zero probability to hit a given point in if . In the limit case where each cluster core is a single point, the time required to hit any of them increases with the size of the input and eventually becomes unrelated to the starting position of the random walk [14]. As an illustration, we generated samples from an unbalanced mixture of two gaussians, and we only took two points to represent the two cluster cores. The probability to belong to the gaussian with the higher weight is represented in Figure 1, where it is clear that its corresponding cluster is overflowing the other. Moreover, using too small of a cluster core is computationally expensive as the random will end up requiring a long time to hit any of them.
4 Experiments
We first illustrate the effect of the temperature parameter on the clustering output using synthetic data. We then apply our method on simulated protein conformations data.
Synthetic data.
The first dataset is presented in Figure 2(a). The ground truth tells us that the output should be composed of two clusters. Moreover, the underlying density function being symmetric with respect to the horizontal line , it is desireable that the output be also symmetric. Note that there is no symmetry of the density with respect to the vertical line : indeed, there is a large density gap between the clusters on the lefthand side, while there is a large overlapping area between them on the righthand side. The soft clustering output is expected to show a small area of transition between the two clusters on the lefthand side, and on the contrary a large transition area on the righthand side.
We display the values of our density estimator in Figure 2(b). Not surprisingly, the density possesses two high local maxima corresponding to the two clusters. However, the overlap of between both clusters creates a small density peak on the righthand side. Standard mode seeking algorithms can be misled by this peak: for instance, the ToMATo algorithm [2] merges the basins of attraction of the small peak on the right into one of the two clusters, resulting in a nonsymetrical clustering output, see Figure 2(c). We display the results of our algorithm for three values of in Figure 2(d), in Figure 2(e) and in Figure 2(f). As we can see from the output of the algorithm for the small value of , the amount of noise injected in our trajectory is not large enough to counter the influence of the third density peak, so the result obtained is really close to hard clustering. Much larger values of do not give enough weight to the density function, so a highly smoothed transition between the two clusters on the left part. Intermediate values of seem to give more satisfying results.
The second dataset we consider is composed of two interleaved spirals. An interesting property of this dataset is that the head of each spiral is close (in Euclidean distance) to the tail of the other spiral. Thus, the two clusters are wellseparated by a density gap but not by the Euclidean metric. We use our algorithm with two different values of and . We also run the spectral clustering algorithm on this dataset using the same graph. The first thing we want to emphasize is that the result of spectral clustering and our algorithm using are similar, this is to be expected as both algorithms are directed by the same diffusion operator, this also means that other soft clustering techniques based on spectral clustering will fail on this dataset. Moreover, we can see that for , the density gap between the two spirals is not strong enough to counter the proximity of the two clusters in the Euclidean metric. On the other hand, for we recover the two different clusters as we give more weight to the density.
UCI datasets
In order to have quantitative results regarding our soft clustering scheme, we use it in a classification scenario on a couple of datasets from the UCI repository: the pendigits, K points and clusters, and the waveform datasets, K points and clusters. We preprocess each datasets by renormalizing the various coordinates so they lie in . Then, for each dataset, we run our algorithm with various values of the parameter and we report the accuracy of the output. For comparison, we provide the mean accuracy obtained by means, spectral clustering (using the same graph as our algorithm) over runs. We also provide the accuracy obtained by soft clustering algorithms: means and Spectral means clustering with different values for their respective temperature parameters. The results are displayed in Figure 4. As we can see, our softclustering method outperforms all other clustering methods under the right choice of . Moreover, we can see that the resulting accuracy depends smoothly on this and is thus not extremely sensitive to its choice either. On the other hand, changing the fuzziness parameter for means has little impact on the classification performance.
We also evaluate the impact of the choice of the cluster cores on the pendigits dataset, following the approach described in the previous section. For this dataset, the output of the ToMATo algorithm gives us a possible value of between and . Hence, we run the algorithm for different values of for different values for . We also run the algorithm when the cluster cores are reduced to a point per cluster. We show the results of our experiments in Table 1. As we can see, the actual value of has close to no impact on the performances obtained thus the choice of the cluster cores is completely reduced to having a performing a proper hard clustering. It is interesting to remark that the impact of is more important for large values of , which is natural as the lower the value of is the closer the random walk is to a deterministic gradient ascent for which the size of the cluster cores does not matter. On the other hand, as we previously discussed, taking a single point as a cluster core leads to bad performances, especially for large values of .
/  0.5  0.6  0.7  0.8  0.9  1  Single point cluster cores 

0.5  0.842  0.842  0.842  0.842  0.842  0.842  0.803 
1  0.857  0.857  0.857  0.857  0.857  0.857  0.584 
5  0.878  0.879  0.879  0.880  0.880  0.880  0.208 
Alanine dipeptide conformations.
We now turn to the problem of clustering protein conformations. We consider the case of the alaninedipeptide molecule. Our dataset is composed of protein conformations, each one represented as a
dimensional vector. The metric used on this type of data is the rootmeansquared deviation (RMSD). The purpose of softclustering in this case is twofold: first, to find the right number of clusters corresponding to metastable states of the molecule; second, to find the conformations lying at the border between different clusters, as these represent the transition phases between metasable states. It is wellknown that the conformations of alaninedipeptide have only two relevant degrees of freedom, so it is possible to project the data down to two dimensions (called a Ramachadran plot) to have a comfortable view of the clustering output. See Figure
5 for an illustration, and note that the clustering is performed in the original space. In order to highlight interfaces between clusters, we only display the second highest membership function. As we can see there are clusters and to interfaces.In order to deal with this large dataset, we implemented a parallelized outofcore version of our algorithm. Bruteforce computation of pairwise RMSD distances took about a day, while the actual clustering process only took a couple of hours using a computer with two Intel Xeon Processors E52630, each one having 6 cores clocked at 2.30Ghz. The computation used less than 1GB of RAM.
5 Conclusion
We have motivated the use of diffusion processes guided by density in the context of soft clustering, both from a theoretical and from a practical point of view. Our approach allows to interpolate between spectral clustering and mode seeking, moreover it is related to Kmeans clustering when the temperature is infinite.
The main question raised by our algorithm is how to automatically set the value of . Indeed, as we have seen, the other inputs, graph, density estimator and cluster cores, are the usual input and output of a modeseeking clustering algorithm which can be chosen with guarantees. On the other hand, the choice of might gives different result and depends on the task at hand. Still, the clustering result depends continuously on and one can still choose it in order to optimize a given quantity representing the quality of the clustering output.
Acknowledgements.
The authors wish to thank Cecilia Clementi and her student Wenwei Zheng for providing the alaninedipeptide conformation data used in Figure 5. We thank Direction Générale de l’Armement (DGA) for a financial support to Thomas Bonis.
References
 [1] Sergio Albeverio, Yuri Kondratiev, and Michael Röckner. Strong feller properties for distorted brownian motion and applications to finite particle systems with singular interactions. In Finite and infinite dimensional analysis in honor of Leonard Gross (New Orleans, LA, 2001), volume 317 of Contemp. Math., pages 15–35. Amer. Math. Soc., Providence, RI, 2003.
 [2] Frédéric Chazal, Leonidas J. Guibas, Steve Y. Oudot, and Primoz Skraba. Persistencebased clustering in riemannian manifolds. J. ACM, 60(6):41, 2013.
 [3] Y.C. Chen, C. R. Genovese, and L. Wasserman. Enhanced Mode Clustering. ArXiv eprints, June 2014.
 [4] Yizong Cheng. Mean shift, mode seeking, and clustering. IEEE Trans. Pattern Anal. Mach. Intell., 17(8):790–799, August 1995.

[5]
Minsu Cho and Kyoung Mu Lee.
Authorityshift clustering: Hierarchical clustering by authority seeking on graphs.
In CVPR, pages 3193–3200. IEEE, 2010.  [6] John D. Chodera, William C. Swope, Jed W. Pitera, and Ken A. Dill. Longtime protein folding dynamics from shorttime molecular dynamics simulations. Multiscale Modeling & Simulation, 5(4):1214–1226, 2006.
 [7] Dorin Comaniciu and Peter Meer. Mean shift: A robust approach toward feature space analysis. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(5):603–619, 2002.
 [8] Richard Durrett. Stochastic calculus : a practical introduction. Probability and stochastics series. CRC Press, 1996.
 [9] Rong Jin, Chris Ding, and Feng Kang. A probabilistic approach for optimizing spectral clustering. In In Advances in Neural Information Processing Systems 18, 2005.

[10]
W.L.G. Koontz, P.M. Narendra, and K. Fukunaga.
A graphtheoretic approach to nonparametric cluster analysis.
IEEE Transactions on Computers, 25(9):936–944, 1976.  [11] N.V. Krylov and M. Röckner. Strong solutions of stochastic equations with singular time dependent drift. Probab. Theory Relat. Fields, 131(2):154–196, 2005.
 [12] Rocco Langone, Raghvendra Mall, and Johan A. K. Suykens. Soft kernel spectral clustering. In IJCNN, pages 1–8. IEEE, 2013.
 [13] Ulrike Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, December 2007.
 [14] Ulrike V. Luxburg, Agnes Radl, and Matthias Hein. Getting lost in space: Large sample analysis of the resistance distance. In Advances in Neural Information Processing Systems 23, pages 2622–2630. 2010.
 [15] Marina Maila and Jianbo Shi. A random walks view of spectral segmentation. In AI and STATISTICS (AISTATS) 2001, 2001.

[16]
Boaz Nadler, Stéphane Lafon, Ronald R. Coifman, and Ioannis G. Kevrekidis.
Diffusion maps, spectral clustering and eigenfunctions of fokkerplanck operators.
In in Advances in Neural Information Processing Systems 18, pages 955–962. MIT Press, 2005.  [17] P. Skraba, M. Ovsjanikov, F. Chazal, and L. Guibas. Persistencebased segmentation of deformable shapes. In CVPR Workshop on NonRigid Shape Analysis and Deformable Image Alignment, June 2010.
 [18] Daniel W. Stroock and S. R. Srinivasa Varadhan. Multidimensional diffusion processes. Die Grundlehren der mathematischen Wissenschaften. SpringerVerlag, Berlin, Heidelberg, New York, 1979.
 [19] Daniel Ting, Ling Huang, and Michael I. Jordan. An analysis of the convergence of graph laplacians. In ICML, 2010.
 [20] U. von Luxburg, M. Belkin, and O. Bousquet. Consistency of spectral clustering. Annals of Statistics, 36(2):555–586, April 2008.
Appendix A Background on diffusion processes
A diffusion process on an open subset of is the solution of a Stochastic Differential Equation (SDE) of the following form:
(4) 
where is the initial probability distribution on , is a dimensional brownian motion, is a drift term guiding the trajectories, and is the diffusion coefficient controlling the amount of noise added to the trajectories. The solution of such an equation is a random variable taking values in the space of trajectories , where is called the explosion time and corresponds to the (random) time at which a trajectory reaches the boundary of . A stochastic differential equation is said to be wellposed when there exists a unique solution and this solution has an infinite explosion time with probability .
Convergence of Markov chains occurs in the Skorokhod space , composed of the trajectories that are rightcontinuous and have left limits, for some fixed . It is equipped with the following metric:
where denotes the space of strictly increasing automorphisms of the unit segment , and where is the quantity:
Standard results show that Markov chains converge weakly in to diffusion processes truncated at time . Recall that a stochastic process converges weakly in to as tends to if
(5) 
for any continuous and bounded function .
The convergence result that we will be using in the article is the following one. Consider a family of Markov chains defined on discrete state spaces with transition kernels and initial states . For and , let
where is the complementary of the ball of radius centered at .
Theorem 3 (Adapted from Theorem 7.1 [9).
] Let be a compact subset of . Assume the stochastic differential equation (4) is wellposed for any Dirac measure with . Assume also that the maps and in (4) are continuous, and let . Assume further that for any ,
Then, for any , the continuous time process converges weakly in to the solution of (4) with initial condition as tends to . Furthermore, the convergence is uniform with respect to .
Weak convergence can be characterized in different ways via the Portmanteau theorem. In particular, a stochastic process converges weakly to a diffusion process in as tends to if and only if
(6) 
for any Borel set such that . This equivalence allows to rewrite Theorem 3 as follows:
Corollary 4.
Let be a compact subset of . Assume the stochastic differential equation (4) is wellposed for any Dirac measure with . Assume also that the maps and in (4) are continuous, and let . Let be the solution of the SDE (4) with initial condition . Let also be a Borel set in for some such that for all , and let . Then, there exist parameters and such that
whenever the following conditions are met:
Appendix B Proof of Theorem 1
We give an overview of the proof in Section B.1, then we deal with the technical details in Section B.2.
b.1 Proof overview
We denote by the i.i.d sampling which is also the state space of . Let be the superlevelset of .
Throughout the course of the proof, the notation stands for the continuous time process . Let and be strictly positive reals, . For , let . Using Lemma 5, there exists such that, for any , .
To obtain a good approximation of the trajectories in using , we only need to check assumptions (i)(iv) of Corollary 4 on for the diffusion process solution of 4. Let us show that these assumptions are verified.
is closed as is continuous and it is also bounded as , it is thus compact. Since is continuous, there exists such that is strictly positive on the offset of : . Since is also compact, there exists such that if then for all with we have .
Let be the transition kernel of , by definition we have:
Combined with our assumption on and , we can use Lemma 8 along with a concentration inequality and a union bound on the points of the compact set and obtain that,
Thus, the assumptions (i)(iv) of Corollary 4 are verified on for the diffusion process solution of (4).
Since is continuous, is an open set. Therefore using Portmanteau’s Theorem we can derive, in a similar way that we derived Corollary 4, that for
in high probability. Therefore, for any Borel set ,
Thus, we only need to approximate trajectories that do not leave to obtain a good approximation of . So we can apply Corollary 4 on these trajectories with an accuracy of to obtain, in high probability,
Every step of the proof hold with a probability that decreases to as tends to infinity, thus the proof of Theorem 1 is complete.
b.2 Technical Lemmas
Lemma 5.
Let be a compact set of such that is bounded away from zero on . Let . For any and , there exists such that
Proof.
This is just another way to say that the diffusion process does not explode in finite time. ∎
Lemma 6.
For all such that , we have that:
Similarly, for all such that ,
Proof.
If , the result is trivial. If , Consider an orthonormal basis of . The coordinate of in this basis is given by:
For symmetry reasons, for , we have . Now consider the first coordinate, using the volume of the sphere, we have:
We renormalize, let , we have and
We change variables, let , we have and , thus we have:
Using the function,
Using the relationship between and ,
∎
Lemma 7.
Let be a compact set such that and are bounded away from on . Let , . Then, for any , we have:
Proof.
We have:
∎
Lemma 8.
Consider a compact set such that is bounded away from zero on . Let . Let and . Let and , for any we note:
We have that,
Comments
There are no comments yet.