A Java implementation of random forest machine learning algorithm / classifier
A natural way to characterize the cluster structure of a dataset is by finding regions containing a high density of data. This can be done in a nonparametric way with a kernel density estimate, whose modes and hence clusters can be found using mean-shift algorithms. We describe the theory and practice behind clustering based on kernel density estimates and mean-shift algorithms. We discuss the blurring and non-blurring versions of mean-shift; theoretical results about mean-shift algorithms and Gaussian mixtures; relations with scale-space theory, spectral clustering and other algorithms; extensions to tracking, to manifold and graph data, and to manifold denoising; K-modes and Laplacian K-modes algorithms; acceleration strategies for large datasets; and applications to image segmentation, manifold denoising and multivalued regression.READ FULL TEXT VIEW PDF
We provide initial seedings to the Quick Shift clustering algorithm, whi...
Many clustering algorithms exist that estimate a cluster centroid, such ...
In this paper, we study how the mean shift algorithm can be used to deno...
The mean shift algorithm is a popular way to find modes of some probabil...
Epanechnikov Mean Shift is a simple yet empirically very effective algor...
The nonparametric formulation of density-based clustering, known as moda...
Many distance-based algorithms exhibit bias towards dense clusters in
A Java implementation of random forest machine learning algorithm / classifier
One intuitive way of defining clusters is to assume that the data points are a sample of a probability density function, and then to define the clusters through this density. For example, fig.1
shows a 2D dataset and a density estimate for it, whose contours clearly suggest that there are two clusters of a complex shape. The first step, then, is to learn an estimate of the density for the data points. This can be done with a parametric model, such as a Gaussian mixture, typically trained with an EM algorithm to maximize the likelihood. Such an approach is often computationally efficient and can give good results with clusters of elliptical shape, but it has several disadvantages. The likelihood function will typically have local optima, and finding a global optimum is, in general, very difficult; thus, the result is dependent on the initialization, and in practice a user will try different initializations (usually random restarts). The selection of the model (what kernel and how many components) is left to the user, as well as the number of clusters to find. And when the clusters have complex shapes, as for example in image segmentation, many components will be required to approximate them well, increasing the training time and the number of local optima.
We focus on nonparametric, kernel density estimates (KDE). A KDE is a generalization of histograms to define density estimates in any dimension that are smooth. They simplify the mathematical and computational treatment of densities and, crucially, enable one to use continuous optimization to find maxima of the density. With a kernel such as the Gaussian kernel, a KDE requires a single user parameter, the bandwidth (also referred to as scale). Given the bandwidth, the KDE is uniquely determined and, as seen below, so will be its clusters, which can take complex nonconvex shapes. Hence, the user need not select the number of clusters or try random restarts. We will focus on clusters defined by the modes of the KDE (although this is not the only way to define the clusters). A mode is a local maximum of the density. A natural algorithm to find modes of a KDE is the mean-shift iteration, essentially a local average, described in section 2. The basic idea in mean-shift clustering is to run a mean-shift iteration initialized at every data point and then to have each mode define one cluster, with all the points that converged to the same mode belonging to the same cluster. Section 2.3 reviews theoretical results regarding the number and location of modes of a KDE, the convergence of mean-shift algorithms and the character of the cluster domains. Section 2.4 discusses relations of mean-shift algorithms with spectral clustering and other algorithms. Sections 2.5 and 2.6 describe extensions of mean-shift for clustering and manifold denoising, respectively. One disadvantage of mean-shift algorithms is their computational cost, and section 2.7 describes several accelerations. Section 3 describes another family of KDE-based clustering algorithms which are a hybrid of -means and mean-shift, the -modes and Laplacian -modes algorithms, which find exactly
clusters and a mode in each, and work better with high-dimensional data. Section4 shows applications in image segmentation, inverse problems, denoising, and other areas. We assume multivariate, continuous (i.e., not categorical) data throughout.
In mean-shift clustering, the input to the algorithm are the data points (multivariate, continuous feature vectors) and the bandwidth or scale. Callthe data points to be clustered. We define a kernel density estimate 
with bandwidth and kernel , e.g. for the Gaussian kernel or if and if for the Epanechnikov kernel. Many of the results below carry over to kernels where each point has its own weight and its own bandwidth, which can be an isotropic, diagonal or full covariance matrix. To simplify the presentation we focus on the case where all points have the same, scalar bandwidth (the isotropic, homoscedastic case) and the same weight unless otherwise noted. This is the case found most commonly in practice. Also, we mostly focus on Gaussian kernels, which are easier to analyze and give rise to simpler formulas.
We can derive a simple iterative scheme for to find a mode of by equating its gradient to zero and rearranging terms (section 2.7 discusses other ways to find modes). We obtain
where and the vector is the mean shift, since it averages the individual shifts with weights as above. For a Gaussian kernel,
and this simplifies to the following, elegant form (where, by Bayes’ theorem,
is the posterior probability of the component centered atgiven point ) :
As discussed below, under mild conditions this scheme converges to modes of from nearly any initial . Intuitively, each step moves the iterate to a local average of the data, in that data points closer to have larger weight, and this increases the density. Eq. 2 is called the mean-shift iteration, and it can be used in two distinct ways: to find modes, and to filter (or smooth) a dataset. This gives rise to two different clustering algorithms, as follows. We will refer to them as mean shift (MS) (where modes are found) and blurring mean shift (BMS) (where the dataset is filtered).
|A. Gaussian mean-shift (MS) algorithm for repeat : until stop end connected-components(,)||B. Gaussian blurring mean-shift (BMS) algorithm repeat for : end : until stop connected-components(,)|
|C. Gaussian MS algorithm in matrix form repeat until stop connected-components(,)||D. Gaussian BMS algorithm in matrix form repeat until stop connected-components(,)|
Here, we declare each mode of as representative of one cluster, and assign data point to the mode it converges to under the mean-shift iteration, . The algorithm is given in fig. 2A for the Gaussian kernel. We can also estimate error bars for each mode from the local Hessian [8, 10], given in eq. (5), which is related to the local covariance.
Some practical problems need to be solved. Firstly, some points (minima and saddle points) do not converge to modes. It is unlikely that this will happen with a finite sample, but if so such points can be detected by examining the Hessian or by a postprocessing step that checks for small clusters.
Second, the mean-shift iteration is stopped after a finite number of steps, for example when the relative change in the value of is smaller than a set tolerance tol . This means that data points that in theory would converge to the same mode actually stop at numerically slightly different points. A postprocessing step is necessary to merge these into a unique mode. This can be done by finding the connected components111The connected-components algorithm is described in appendix A. of a graph that has vertices, one for every convergence point, and has an edge between any pair of vertices lying within a small distance . The graph need not be explicitly constructed. The user should set tol small enough to converge to the modes with good accuracy, while limiting the computational cost incurred; and should be set quite larger than tol, but smaller than the distance between different true modes.
Here, each point of the dataset actually moves to the point given by eq. (2). That is, given the dataset , for each we obtain a new point by applying one step of the mean-shift algorithm: . Thus, one iteration of blurring mean-shift results in a new dataset which is a blurred (smoothed) version of . By iterating this process we obtain a sequence of datasets (and a sequence of kernel density estimates ) where is the original dataset and is obtained by blurring with one mean-shift step (see fig. 6).
As will be shown below, Gaussian BMS can be seen as an iterated filtering (in the signal processing sense) that eventually leads to a dataset with all points coincident for any starting dataset and bandwidth. However, before that happens, the dataset quickly collapses into meaningful, tight clusters which depend on (see fig. 6), and then these point-like clusters continue to move towards each other relatively slowly. A stopping criterion that detects this situation quite reliably is based on whether the entropy of the dataset changes  (a simpler criterion would be to stop when the update to is small, but this does not always give good clusters). As with MS clustering, a connected-components postprocessing step merges the points into actual clusters. The BMS algorithm is given in fig. 2B.
Although both MS and BMS are based on the same mean-shift iteration, they are different algorithms and can produce different clustering results. Specifically, given a value of the bandwidth, the number of clusters resulting from MS and BMS is usually different. However, the collection of clusterings produced over a range of bandwidths can be quite similar.
BMS is quite faster than MS in number of iterations and in runtime, particularly if using the accelerated BMS algorithm (section 2.7), which introduces essentially no approximation error. However, MS (and also BMS) can be considerably accelerated if a small clustering error is tolerated (section 2.7).
In MS, the optimizations (one for each data point) proceed independently, but they could be done synchronously, as in BMS, without altering the result. However, practically this is wasteful, because the number of iterations required varies considerably among points, and a synchronous scheme would have to run the largest number of iterations. Conversely, it is possible to run BMS with asynchronous iterations, for example moving points as soon as their update is computed. However, this makes the result dependent on the order in which points are picked, and is unlikely to be faster than the accelerated algorithm described below.
The fundamental parameter in mean-shift algorithms is the bandwidth , which determines the number of clusters. The statistics literature has developed various ways to estimate the bandwidth of a KDE [77, 83]
, mostly in the 1D setting, for example based on minimizing a suitable loss function (such as the mean integrated squared error), or more heuristic rules (such as making the bandwidth proportional to the average distance of each point to itsth nearest neighbor). While these bandwidth estimators are useful and can give reasonable results, they should be used with caution, because the bandwidth that gives the best density estimate (in a certain sense) need not give the best clustering—clustering and density estimation are, after all, different problems. Besides, clustering is by nature exploratory, and it is best to explore a range of bandwidths. Computationally, this is particularly easy to do in MS (see the scale-space discussion below).
It is also possible to use a different bandwidth for each point (called adaptive KDE), which can help with areas where points are sparse, for example. A good way to do this is with entropic affinities [45, 82], where the user sets a global number of neighbors and then, for each data point , the bandwidth is computed so that point has a distribution over neighbors with a perplexity (log-entropy) , i.e., each point sets its own bandwidth to have effective neighbors. One could then vary to achieve different clusterings. Other ways to construct adaptive KDEs for MS have been proposed . The mean-shift update with adaptive bandwidths has the form :
where the values are the posterior probabilities
reweighted by the inverse variance and renormalized (compare with the single-bandwidth equation (3)).
Mean-shift has also been used to track moving objects (“blobs”) through a sequence of images . Since the blobs can change size, using a fixed bandwidth (scale) becomes problematic. Collins  used ideas from scale-space theory (see below) to select a scale that adapts to the blob size. An “image feature” is defined as a point in scale-space where a certain differential operator achieves a local maximum (i.e., a mode) with respect to both space and scale. This mode is then tracked by a two-step mean-shift procedure that convolves in image space with a filterbank of spatial difference-of-Gaussian filters, and convolves in scale space with an Epanechnikov kernel, in an efficient way.
Different kernels give rise to different versions of the mean-shift and blurring mean-shift algorithms. Much previous work (including [37, 24]) uses the Epanechnikov kernel for computational efficiency, since the kernel evaluations involve only pairs of neighboring points (at distance ) rather than all pairs of points (though the neighbors must still be found at each iteration), and convergence occurs in a finite number of iterations. However, in practice, the Gaussian kernel produces better results than the Epanechnikov kernel , which generates KDEs that are only piecewise differentiable and can contain spurious modes (see below).
The behavior of modes and other critical points (minima, saddles) as a function of the bandwidth (or scale) is the basis of scale-space theory91, 50, 54]. Here, one studies the evolution of an image under Gaussian convolution (blurring). If we represent the image as a sum of delta functions centred at the pixel locations and with value equal to the grayscale, convolving this with an isotropic Gaussian kernel of scale gives a KDE (with weighted components). As the scale increases, the image blurs, structures in the image such as edges or objects lose detail, and for large scales the image tends to uniformly gray. Some important structures, such as objects, can be associated with modes of the KDE, and the lifetime of a mode—defined by the scale interval between the creation and destruction (merging with another mode) of that mode—is taken as an indication of its importance in the image.
An ideal convolution kernel should blur structure in the image but never create new structure (which would then reflect properties of the kernel rather than of the image). The result would then be a tree of modes, where there are modes for (one mode at each component) and modes merge as increases until there is a single mode. Unfortunately, this is only true for the Gaussian kernel and only in dimension one (section 2.3). In dimension two, i.e., with images, new modes can be created as the scale decreases. However, in practice these creation events seem rare and short-lived, in that the mode created usually merges with another mode or critical point at a slightly larger scale. Thus, they are generally ignored. Computationally, one starts with a mode at each pixel (or data point) for and then tracks the location of the modes as in a numerical continuation method , by running mean-shift at a new, larger scale using as initial points the modes at the previous scale. By construction, this results in a tree—although, unlike in agglomerative clustering, the resulting set of clusterings at each scale need not be nested.
Another notion of lifetime, topological persistence, has been explored in computational geometry , and used to define a hierarchical mean-shift image segmentation [62, 23]. The tree of modes has also been proposed in the statistical literature as a tool for visualization of KDEs 
, but this is practical only for small datasets in 1D or 2D. The mode tree is sensitive to small changes in the data and gives no way to differentiate between important modes and those caused by, for example, outliers, so it can help to combine several trees constructed by jittering or resampling the original dataset.
As indicated in fig. 2D, we can equivalently write each Gaussian BMS iteration in matrix form  as in terms of the random-walk matrix , an stochastic matrix with elements and . Here, is the matrix of data points,
is a Gaussian affinity matrix, which defines a weighted graph where eachis a vertex, and is the degree matrix of that graph. This establishes a connection with spectral clustering, which we describe in section 2.4.
, where the filter depends on the data and is updated at each iteration as well, hence resulting in a nonlinear filtering. This in turn suggests that one could use other filters constructed as a function, where is a scalar function, so it modifies the spectrum of . Carreira-Perpiñán  studied several such filters, including explicit, implicit, power and exponential ones, depending on a step size parameter , resulting in generalized Gaussian blurring mean-shift algorithms. He gave convergence conditions on and and found that the different filters tend to find similar clusters, i.e., over a range of bandwidths they can obtain the same clustering (at possibly different values). However, their runtime varies widely. Implicit filters (which involve solving a linear system) or power filters (which involve iterating a matrix product) have a strong clustering effect in each iteration, but their iterations themselves are more costly. When one considers both the number of iterations and the cost of each iteration, the method found fastest was a slightly overrelaxed explicit function of the form with . However, its runtime was very close to that of the standard BMS (). An interesting extension of this work would be to be able to design the function so that the resulting generalized BMS algorithm is optimal (in some sense) for clustering.
It is also possible to write MS in matrix form (fig. 2C), where we write the random-walk matrix with the symbol to differentiate it from the standard random-walk matrix , since is defined on two sets of points ( and ) while is defined on one (). However, the matrix form implies that the MS updates for all points are synchronous, which, as mentioned before, is slow.
Having run MS or BMS on a dataset, how should we deal with new data points not in the original dataset? The purist option (and the only one for BMS) is to run the clustering algorithm from scratch on the entire dataset (old and new points), but this is computationally costly. A faster option with MS is to use the original KDE and simply run MS on each of the new points, assigning them to the mode they converge to. This reflects the fact that MS clusters not just the points in the dataset, but (implicitly) the whole space. However, this point of view implies no new clusters are created when new points arrive.
The advantages of mean-shift algorithms stem from the nonparametric nature of the KDE: (1) It makes no model assumptions (other than using a specific kernel), unlike Gaussian mixture models or-means, for example. (2) It is able to model complex clusters having nonconvex shape (although this does not imply that all shapes can be modeled well), unlike -means. (3) The user need only set one parameter, the bandwidth, which has an intuitive physical meaning of local scale, and this determines automatically the number of clusters. This is often more convenient than having to select the number of clusters explicitly. (4) It has no local minima, thus the clustering it defines is uniquely determined by the bandwidth, without the need to run the algorithm with different initializations. (5) Outliers, which can be very problematic for Gaussian mixtures and -means, do not overly affect the KDE (other than creating singleton clusters).
Using KDEs and equating modes with clusters as in mean-shift also has some disadvantages. The most important one is that KDEs break down in high dimensions, where the number of clusters changes abruptly from one for large to many, with only a minute decrease in . Indeed, most successful applications of mean-shift have been in low-dimensional problems, in particular image segmentation (using a few features per pixel, such as color in LAB space). A way of using modes in high-dimensional spaces is the -modes algorithm described in section 3.
Other possible disadvantages, depending on the case, are as follows. (1) In some applications (e.g. figure-ground or medical image segmentation) the user may seek a specific number of clusters. However, in mean-shift clustering we have no direct control over the number of clusters: to obtain clusters, one has to search over . This is computationally costly (and sometimes not well defined, since the number of clusters might not be a monotonic function of ). (2) We do not differentiate between meaningful and non-meaningful modes. For example, outliers will typically create their own mode; or, the density in a cluster may genuinely contain multiple modes (especially with clusters that have a nonconvex or manifold structure, as in fig. 1D). Some of these problems may be partially corrected by postprocessing the results from mean-shift (e.g. to remove low-density modes and clusters with few points, which are likely outliers), or by the -modes algorithm. Finally, mean-shift is slow computationally, and this is addressed in section 2.7.
The mean-shift algorithm is so simple that it has probably been discovered many times. In 1975, Fukunaga and Hostetler  were perhaps the first to propose its idea and also introduced the term “mean shift”. They derived the blurring version of the algorithm (BMS) for a KDE with the Epanechnikov kernel as gradient ascent on with a variable step size, without proving convergence or giving a stopping criterion. They observed that it could be used for clustering and dimensionality reduction (or denoising), since points converge locally to cluster centroids or medial lines for appropriate values of the bandwidth. Since 1981, the algorithm was also independently known in the statistics literature as “mean update algorithm” (see  pp. 167ff and references therein). The term “blurring process” is due to Cheng , who discussed the convergence of both the blurring (BMS) and the non-blurring (MS) versions of mean-shift. Carreira-Perpiñán , motivated by the problem of finding all the modes of a Gaussian mixture for multivalued regression [8, 10], independently rediscovered the algorithm for the Gaussian kernel and proved its convergence for arbitrary covariance matrices . Since the early 2000s, the non-blurring mean-shift received much attention thanks to the work of Comaniciu and Meer , who demonstrated its success in image filtering, image segmentation and later in tracking . This was followed by work by many researchers in both theoretical, computational and application issues. Algorithms similar to mean-shift have appeared in scale-space clustering [90, 92, 21, 68], in clustering by deterministic annealing  and in pre-image finding in kernel-based methods .
Although MS and BMS are defined by very simple iterative schemes, they exhibit some remarkable and somewhat counterintuitive properties regarding whether they converge at all, how fast they converge, and the character of the convergence domains. The geometry of the modes of a KDE is also surprising. The relevant literature is scattered over different areas, including computer vision, statistics and machine learning. We give a summary of results, with particular attention to Gaussian kernels, without proof. More details can be found in the references cited.
Intuitively, one might expect that a sum of unimodal kernels with bandwidth as in eq. (1) would have at most modes, and that the number of modes should decrease monotonically as increases from zero and the different components coalesce. In general, this is only true for the Gaussian kernel in dimension one. Motivated by scale-space theory, several papers [50, 3, 95] showed that, in dimension one, the Gaussian kernel is the only kernel that does not create modes as the scale increases. It is easy to see that, in KDEs with non-gaussian kernels (Epanechnikov, Cauchy, etc.), modes can appear as increases . The creation of modes need not occur often, though, and some kernels (such as the Epanechnikov kernel) are more likely than others to create modes. It is less easy to see, but nonetheless true, that modes can also appear with the Gaussian kernel in dimension two, i.e., with images, and thus in any larger dimension, as shown by an example in .
The scale-space theory results were restricted to a single bandwidth , and also the creation of modes does not necessarily imply that a mixture with components may have more than modes. The results for the general case of Gaussian mixtures (GMs) are as follows. Again, there is a qualitative difference between 1D and 2D or more. A GM (with possibly different bandwidths ) can have at most modes in 1D  but more than modes in 2D or above, even if the components are isotropic . In 2D, if the components have diagonal or full covariance matrices, it is easy to construct examples with more modes than components [10, 19]. It is far harder to achieve this if all the components are isotropic with equal bandwidth , but still possible. This was first shown by a construction suggested by Duistermaat and studied by Carreira-Perpiñán and Williams , consisting of 3 Gaussians in the vertices of an equilateral triangle. For a narrow interval of scales (lifetime), an extremely shallow fourth mode appears at the triangle barycenter. They generalized this construction to dimension as a regular simplex with a Gaussian in each vertex, i.e., a GM with components, one at each vertex of the simplex . They showed that the number of modes is either (for small scale, modes near the vertices), (for large scale, at the simplex barycenter) or (for a narrow range of intermediate scales, modes at the barycenter and near the vertices). Interestingly, the lifetime of the mode that is created in the barycenter peaks at and then slowly decreases towards zero as increases. Small perturbations (but not vanishingly small) of the construction prevent the creation of the extra mode, which suggests that it may be unlikely for isotropic GMs to have more modes than components. However, apart from a few isolated studies, the geometry of the modes of GMs in high dimensions is poorly understood.
As for the location of the modes of a GM, they are in the interior of the convex hull of the data points if the components are isotropic (with possibly different bandwidths ), for any dimension [10, 19], as seen in fig. 1B. With KDEs, it is easy to see this from the fact that eq. (2) is a convex sum if . If the components have diagonal or full covariance matrices, the modes can be outside the convex hull [10, 18, 19].
Most work on mean shift and scale space theory tacitly assumes that the modes of a GM are finite in number or at least are isolated. Although sometimes this is claimed to be a consequence of Morse theory, no proof seems to exist for an arbitrary dimension222In dimension one, this is simple to prove assuming infinite differentiabilty (as for the Gaussian kernel), by setting to the gradient of the mixture in the following result. Let be an infinitely differentiable real function of . Then either is identically zero or its zero crossings are isolated. Indeed, if is zero in a nonempty interval , then we have . Repeating this argument for , , etc. we obtain that all derivatives at any are zero, hence by Taylor’s theorem .. Indeed, kernels such as the uniform or triangular kernel do create continuous ridges of modes even in 1D.
An attractive feature of MS is that it is defined without step sizes, which makes it deterministic given the dataset and bandwidth. However, this is only useful if it converges at all, and whether this is the case depends on the kernel used. Not all kernels give rise to convergent MS updates, even if they are valid kernels (i.e., they are nonnegative and integrate to one); an example where MS diverges appears in . For kernels where MS diverges, it is of course possible to devise optimization algorithms that will converge to the modes, e.g. by introducing a line search, but we lose the simplicity of the MS iteration. For kernels where MS converges, the convergence is to a mode for most initial points, but in general to a stationary point (minima and saddle points in addition to modes). Here, we review convergence results for MS, with a focus on the most common kernels (Gaussian and Epanechnikov).
With the Epanechnikov kernel, convergence occurs in a finite number of iterations . The intuitive reason for this is that the KDE, which is the sum of kernels, is piecewise quadratic. There is a finite number of “pieces” (each corresponding to a possible subset of the kernels), and, within one piece, a single Newton iteration would find the maximum (indeed, the MS update coincides with a Newton step for the Epanechnikov kernel ).
With the Gaussian kernel, MS also converges, but in an infinite number of iterations, and the convergence rate is generally linear (sublinear or superlinear in limit cases) [10, 14]. One convenient way to see this is by relating MS with the EM algorithm , as follows.
In general for mixtures of Gaussians (using arbitrary covariance matrices) or other kernels (and thus in particular for KDEs), Gaussian MS is an expectation-maximization (EM) algorithm and non-gaussian MS is a generalized EM algorithm[10, 14]. This can be seen by defining a certain dataset and a certain probabilistic model with hidden variables (missing data), deriving the corresponding EM algorithm to maximize the log-likelihood for it, and verifying it coincides with the MS algorithm. For the GM case, the model consists of a constrained mixture of Gaussians, where each component has the given weight, mean and covariance matrix (constant, data point and bandwidth for KDEs, respectively), but the whole mixture can be freely (but rigidly) translated. The dataset consists solely of one point located at the origin. Thus, maxima of the likelihood occur whenever the translation vector is such that a mode of the mixture coincides with the origin. The missing data is the index of the mixture component that generated the origin. The resulting E step computes the posterior probabilities of eq. (3), and the M step maximizes a lower bound on the log-likelihood in closed form giving the MS update. With non-gaussian kernels, the M step cannot be solved in closed form, and the MS update corresponds to a single iteration to solve this M step. (Whether this iteration actually increases the likelihood, leading to convergence, depends on the kernel.)
Viewing MS as an EM algorithm has several consequences. Convergence for Gaussians MS is assured by the EM convergence theorems  (which apply because the function in the EM algorithm is continuous), and each iterate increases or leaves unchanged the density. General results for EM algorithms also indicate that the convergence order will typically be linear (with a rate dependent on the ratio of information matrices). Finally, MS can be seen as a bound optimization (see also ).
The convergence order can be studied in detail  by linearizing the MS update of eq. (3) around a mode , so the update can be written as , where is the Jacobian of . The Jacobian of and the Hessian of are related as follows:
where is the local covariance matrix at , since (from and the definition of ) the local mean is
itself. The eigenvalues ofare in and the convergence rate is given by the largest one, with the iterates approaching along the eigenvector associated with the largest eigenvalue (assuming distinct eigenvalues). This shows that the convergence order depends on the bandwidth: it is nearly always linear, approaches superlinear convergence when or , and converges sublinearly at mode merges. The practically useful cases of MS use an intermediate value, for which the rate of linear convergence (i.e., the ratio of distances to the mode after and before the update) can be close to , thus convergence will be slow. The MS iterates smoothly approach the mode along the principal component of the local covariance matrix of the data points, from within the convex hull of the data points (see fig. 1B).
We focus on Gaussian MS on a KDE, i.e., with isotropic covariances of bandwidth . As seen in fig. 1B, if initialized at points far from a mode, the first MS steps are often large and make considerable progress towards the mode. This is an advatageous property generally observed in alternating optimization algorithms (such as EM). After that, the steps become small, in agreement with the linear convergence rate.
The path followed by the MS iterates has the following properties (illustrated in fig. 1B). (1) Near convergence, the path follows the direction of the principal eigenvector of the local covariance at the mode . (2) Each iterate is a convex linear combination of the data points, so the path lies in the interior of the convex hull of the data points [9, 14]. This is also true of non-gaussian kernels if they satisfy . (3) The path is smooth in the sense that consecutive steps (consecutive mean-shift vectors ) always make an angle in . (4) The mean shift vector is proportional to the gradient of , so the MS iteration is a gradient step, but the step size is not the best one (i.e., does not maximize ) along the gradient.
In -means, the centroids partition the space into Voronoi cells, which are convex polytopes. In MS, it is harder to characterize the regions that the modes partition the space into, i.e., their domains of convergence, which can in fact have surprising properties . In general, they can be curved, nonconvex and disconnected (for example, one domain can be completely surrounded by another). This allows MS to represent complex-shaped clusters and is an advantage over -means. A less desirable aspect is that the domain boundaries can show fractal behavior, although this seems confined to cluster boundaries, and could be removed if necessary by postprocessing the clusters. This fractal behavior is due to the iterated MS mapping , and would not occur if we defined the clusters purely based on flow lines of the gradient.
Cheng  proved convergence of blurring mean-shift, as follows. (1) For kernels broad enough to cover the dataset (e.g. infinite-support kernels such as the Gaussian) convergence is to a dataset with all points coincident (), regardless of the value of . This can be seen by noting that the diameter of the data set decreases at least geometrically. (2) For finite-support kernels and small enough , convergence is to several clusters with all points coincident in each of them; the clusters depend on the value of .
Another proof can be obtained from the matrix formulation of BMS , since at each iteration the dataset is multiplied times a stochastic matrix . By the Perron-Frobenius theorem [47, ch. 8], with broad kernels this will have a single eigenvalue equal to 1 and all other eigenvalues with magnitude less than 1. Since a fixed point verifies then for some , i.e., all points coincide. For non-broad kernels, the unit eigenvalue is multiple, resulting in multiple clusters where all points coincide.
While, for Gaussian BMS, the dataset converges to a size (therefore variance) zero, its variance need not decrease monotonically, in fact it is easy to construct examples in 1D where it increases at some steps.
completely characterized the behavior with the Gaussian kernel (Gaussian BMS), assuming that the dataset is infinite with a Gaussian distribution. In this case, one can work with distributions rather than finite samples, and the iterationcan be written as a Gaussian integral and solved in closed form. The result is that the data distribution
is Gaussian after each iteration, with the same mean, and it shrinks towards its mean independently along each principal axis and converges to it with cubic order. Specifically, the standard deviationalong a given principal axis evolves as with , where is the BMS bandwidth, and converges to 0 cubically. The reason for this extremely fast convergence is that, since is kept constant but the dataset shrinks, effectively increases. Thus, at each iteration both and decrease. Note that the smaller the initial is, the faster the convergence and so the direction of largest variance (principal component) collapses much more slowly (in relative terms) than all other directions.
This explains the practical behavior shown by Gaussian BMS (see fig. 6): (1) clusters collapse extremely fast (in a handful of iterations, for a suitable bandwidth); (2) after a few iterations only the local principal component survives, resulting in temporary linearly-shaped clusters (that quickly straighten). These two behaviors make BMS useful for clustering and denoising, respectively.
The same proof technique applies to the generalized Gaussian BMS algorithms that use an update of the form . With a Gaussian dataset, each iteration produces a new Gaussian with a standard deviation (separately along each principal axis) . This allows a complete characterization of the conditions and order of convergence in terms of the real function , , instead of a matrix function. Convergence occurs if for and . Depending on , the convergence order can vary from linear to cubic and beyond .
As noted earlier, putting Gaussian BMS in matrix form in terms of a Gaussian affinity matrix uncovers an intimate relation with spectral clustering. Each BMS iteration is a product of the data times , the stochastic matrix of the random walk in a graph [25, 58], which in BMS represents the posterior probabilities of each point under the kernel density estimate (1). is closely related to the matrix (equivalent to the normalized graph Laplacian) commonly used in spectral clustering, e.g. in the normalized cut . The eigenvalue/eigenvector pairs and of and satisfy and . In spectral clustering, given and one computes the eigenvectors associated with the top eigenvalues of (if clusters are desired); in this spectral space the clustering structure of the data is considerably enhanced and so a simple algorithm such as -means can often find the clusters. In BMS, we iterate the product . If were kept constant, this would be the power method  and each column of would converge to the leading left eigenvector of (the vector of ones, i.e., a single cluster), with a rate of convergence given by the second eigenvalue (the Fiedler eigenvalue in spectral clustering). However, the dynamics of BMS is more complex because also changes after each iteration. In practice and quickly reach a quasistable state where points have collapsed in clusters which slowly approach each other and remains almost constant (at which point BMS is stopped). Thus, BMS can be seen as refining the original affinities into a matrix consisting of blocks of (nearly) constant value and then (trivially) extracting piecewise-constant eigenvectors for each cluster with the power method. With the generalized BMS algorithm, one uses instead the matrix , which has the same eigenvectors as but eigenvalues . However, this manipulation of the spectrum of is performed implicitly, without actually having to compute the eigenvectors as in spectral clustering.
While both spectral clustering and BMS rely on the random-walk matrix , they differ in several respects. They do not give the same clustering results. In spectral clustering, the user sets the desired number of clusters and the bandwidth (if using Gaussian affinities), while in BMS the user sets only and is determined by this. Computationally, spectral clustering solves an eigenproblem (and then runs -means), while BMS (especially if using its accelerated version) performs a small number of matrix products (and then runs connected-components), thus it is considerably faster.
Many image processing algorithms (for example, for image denoising) operate on range variables (intensity, color) defined on the space variables (of the image lattice). This includes bilateral filtering , nonlinear diffusion  and others. Mean-shift is basically an iterated, local averaging that operates jointly on range and space, and bears both similarities and differences with those algorithms, which are described in . For example, in bilateral filtering the spatial component is fixed during iterations, so only the range variables are updated, and a stopping criterion is necessary to prevent excessive smoothing.
Mean-shift algorithms appear whenever we have expressions having the form of a sum over data points of a function of squared distances of the parameter (whether the latter is a vector or a matrix or a set thereof). Equating the gradient of this expression to zero we can solve for and obtain a fixed-point iteration with the form of a weighted average of the data points, where the weights depend on . One example mentioned below are Riemannian centers of mass. Another example are Laplacian objective functions, of the form , where is of are coordinates of the data points in a low-dimensional space, is an affinity matrix and is its graph Laplacian. Alternating optimization over each can be done with a mean-shift algorithm. Laplacian objectives appear in algorithms for (among other problems) dimensionality reduction and clustering, such as Laplacian eigenmaps , the elastic embedding  or spectral clustering . However, as noted in section 2.3, the resulting mean-shift iteration need not converge in general. This is the case, for example, when the weights or the kernels can take negative values, as in the elastic embedding. In this case, one should use a line search or a different optimization algorithm altogether.
In some applications, the distribution of the data changes over time, and we want to track clusters and their location, in order to predict their location at a future time (assuming the distribution changes slowly). One example is tracking a nonrigid object over a sequence of images, such as a video of a moving car. A robust way to represent the object is by the color histogram of all the pixels within the object. In its simplest form, one can use a region of fixed shape and size but variable location, initialized by the user in the first image. Then, the most likely location of the object in the next image can be defined as the region (near the current region) having the histogram closest to the current histogram. Comaniciu et al.  noted that we can use a differentiable KDE instead of a histogram, and that any differentiable similarity function of two KDEs also has the form (up to constant factors) of a weighted KDE to first order (a good approximation if the object moves slowly). Hence, we can use mean-shift iterations to maximize the (linearized) similarity over the location, thus finding the location where pixels look most like in the previous image’s region. The resulting algorithm is fast enough to track several objects in real-time in video, and is robust to partial occlusion, clutter, distractors and camera motion.
Hall et al.  define a spatiotemporal KDE with a product kernel . The kernel over the spatial variables is typically Gaussian and the kernel over the temporal variable considers only past observations and gives more importance to recent ones. At each new value of , one finds modes over by starting from the modes of the previous time value (as in scale-space theory, but using time rather than scale).
The original mean shift algorithm is defined on the Euclidean space
. Sometimes, the data to be clustered lies on a low-dimensional manifold, so the mean-shift iterates and the modes should also lie on this manifold. In some applications (such as motion segmentation, diffusion tensor imaging and other computer vision problems) this is a known Riemannian manifold, such as rotation matrices, Grassmann manifolds or symmetric positive definite matrices. Subbarao and Meer  extended mean-shift to Riemannian manifolds by defining the squared distances appropriately with respect to the manifold (although this does not result in a proper KDE on the manifold). However, most times the manifold is not known a priori. Shamir et al.  extended mean-shift to 3D point clouds in computer graphics applications, by constraining the mean shift steps to lie on the surfaces of a triangulated mesh of the data. The Laplacian -modes described later uses a different approach to find modes that are valid patterns with data lying on nonconvex manifolds.
The mean-shift algorithm operates on data where each data point is defined by a feature vector in . In some applications, the data is best represented as a weighted graph, where each vertex is a data point , and an edge represents a neighborhood relation between a pair of data points, with a real-valued weight representing distance. This allows one to work with data that need not live in a Euclidean space, and can be used to represent distances along a manifold (for example, by approximating geodesics through shortest paths over the graph).
Some algorithms have been proposed for graph data that are based on a KDE. They work by assigning to each data point a “parent” in the graph, which is another data point in their neighborhood having a higher density (roughly speaking). This results in a forest of directed trees that spans the graph, whose roots are “modes” in the sense of not having any neighboring point with higher density. Each tree corresponds to a cluster. The parent of a data point is defined as the data point that optimizes a criterion based on the KDE and the distances between data points. The crucial aspect is that this optimization is defined only over the data points rather than over the space (as happens with -medoid algorithms).
One of the earliest approaches is the algorithm of . This uses as criterion , which is a numerical approximation to the directional gradient of the KDE. This is maximized over the data points that are in a neighborhood of (e.g. the nearest neighbors or the points within a distance ). The output of this algorithm can be improved with cluster-merging based on topological persistence .
The medoidshift algorithm of  is more directly related to mean-shift. It uses as criterion the Riemannian center of mass (sometimes called Fréchet mean [1, 64]) , where is evaluated at the current iterate . The algorithm then alternates computing the Riemannian center of mass (over the data points) with updating the weights. We recover mean-shift by using the squared Euclidean distance for and optimizing over . Unlike mean-shift (which finds maxima of the KDE), this does not seem to maximize a global objective function, but rather maximizes a local objective at each iteration and data point. The algorithm gives better clusterings and is faster in its “blurring” version, where at every iteration the entire dataset is updated. In matrix form (using the random-walk matrix and the matrix of pairwise distances, and updating all points synchronously), this takes the form , where the minimization is columnwise. Hence, medoidshift can be seen as a discrete filter, while the BMS iteration is a continuous filter. As in the accelerated BMS, each iteration contracts the graph, reducing the number of points. It is possible to define other variations by using different types of distances , such as the local Tukey median in the medianshift algorithm of .
The mean-shift update as originally proposed  results from estimating the gradient of a density in two steps: first, one estimates the density with a KDE, and then one differentiates this. Sasaki et al.  directly estimate the gradient of the log-density by using score matching  and derive a mean-shift algorithm based on this (by equating the gradient to zero and obtaining a fixed-point iteration). In score matching, one does a least-squares fit to the true log-density gradient using a model. If the model is a linear combination of basis functions (one per data point), one obtains an update that is identical to the mean-shift update of eq. 3 but with weighted kernels (although, since these weights can be negative, this fixed-point iteration is not guaranteed to converge). A faster, parametric method is obtained by using fewer basis functions than data points. They observe better results than the original mean-shift with high-dimensional data. Note that, as seen in eq. (4), using an adaptive KDE (e.g. with bandwidths obtained using the entropic affinities) also gives a weighted mean-shift update, but the weights obey a different criterion.
Ranking data consists of permutations of a given set of items, and are discrete. Meilă and Bao  extended blurring mean-shift to ranking data by using the Kendall distance rather than the Euclidean distance, and by rounding the continuous average of eq. (3) to the nearest permutation after each iteration (so the algorithm stops in a finite number of steps).
Functional data, such as a collection of curves or surfaces, live in an infinite-dimensional space. Mean-shift clustering can be extended to functional data, where it corresponds to a form of adaptive gradient ascent on an estimated surrogate density .
The mean-shift iteration (2) or (3) is essentially a local smoothing, which provides a smooth or denoised version of a data point as a weighted average of its nearby data points. Clustering is just one specific problem that can make use of this smoothing. Here we describe work based on mean-shift for manifold denoising. This was already pointed out in . Consider data lying on a low-dimensional manifold in , but with noise added (as in fig. 3). If we run one mean-shift iteration for each data point in parallel, and then replace each point with its denoised version, we obtain a denoised dataset, where points have moved towards the manifold. Repeating this eventually compresses the dataset into clusters, and this is the basis of the BMS algorithm, although of course it destroys the manifold structure. However, if we stop BMS very early, usually after one or two iterations, we obtain an algorithm that can remove noise from a dataset with manifold structure.
The denoising ability of local smoothing was noted independently in the computer graphics literature for the problem of surface fairing or smoothing. Earlier, the Laplacian had also been used to smooth finite element meshes . In surface smoothing, 3D point clouds representing the surface of an object can be recorded using LIDAR, but usually contain noise. This noise can be eliminated by Laplacian smoothing, which replaces the 3D location of each point with the average of its neighbors [79, 80]. The neighbors of each point are obtained from a triangulated graph of the cloud, which is usually available from the scanning pattern of LIDAR. Typically, the Laplacian is constructed without the notion of a KDE or a random-walk matrix, and both the values of its entries and the connectivity pattern are kept constant during iterations.
One important problem of BMS (or Laplacian smoothing) is that points lying near the boundary of a manifold or surface will move both orthogonally to and tangentially along the manifold, away from the boundary and thus shrinking the manifold. While this is good for clustering, it is not for surface smoothing because it distorts the object shape, or for manifold learning because it distorts the manifold structure. For example, in the MNIST dataset of handwritten digit images , motion along the manifold changes the style of a digit (e.g. slant, thickness). Various approaches, such as rescaling to preserve the object volume, have been proposed in computer graphics . In machine learning, the manifold blurring mean-shift (MBMS) algorithm  is an extension of BMS to preserve the local manifold structure. Rather than applying to each point the mean-shift vector directly, this vector is corrected to eliminate tangential motion. The mean-shift vector is first projected onto the local tangent space to the manifold at that point (estimated using local PCA on the nearest neighbors of the point), and this projection is removed from the motion. Hence, the motion is constrained to be locally orthogonal to the manifold. Comparing the local PCA eigenvalues between the orthogonal and tangent spaces gives a criterion to stop iterating.
Fig. 4 shows MNIST images before and after denoising with MBMS. The digits look smoother (as if they had been anti-aliased to reduce pixelation) and easier to read (compare the original vs the denoised
), and indeed a classifier trained on the denoised data performs better. While this smoothing homogenizes the digits somewhat, it preserves distinctive style aspects of each digit. MBMS performs a sophisticated denoising (very different from simple averaging or filtering) by intelligently closing loops, removing or shortening spurious strokes, enlarging holes, removing speckle noise and, in general, subtly reshaping the digits while respecting their orientation, slant and thickness. MBMS has also been applied to the identification of the structure of tectonic faults from seismic data .
(Manifold) denoising can be used as a preprocessing stage for several tasks, leading to more robust models, such as dimensionality reduction , classification [44, 86], density estimation  or matrix completion .
In some algorithms, one can establish a continuum between clustering and dimensionality reduction, with clustering being the extreme case of dimensionality reduction where the manifold dimensionality is zero. This is the case with MBMS, where a tangent space of dimension zero recovers BMS. Another case is spectral clustering and spectral dimensionality reduction, e.g. normalized cut  vs Laplacian eigenmaps 
: both operate by extracting eigenvectors from the graph Laplacian, but in spectral clustering the number of eigenvectors is the number of clusters, while in spectral dimensionality reduction one cluster is assumed and the eigenvectors correspond to degrees of freedom along the manifold defined by the cluster.
Computationally, mean-shift algorithms are slow, because their complexity is quadratic on the number of points (which would limit their applicability to small data sets), and many iterations may be required to converge (for MS). Significant work has focused on faster, approximate mean-shift algorithms, using a variety of techniques such as subsampling, discretization, search data structures , numerical optimization , and others, which we review below. Some of them are specialized to image segmentation while others apply to any data. A practical implementation of a fast mean-shift algorithm could combine several of these techniques. In addition, although it does not seem to have been studied in detail, mean-shift algorithms benefit from embarrasing parallelism, since the iterations for each point proceed independently from the rest of the points.
With the Epanechnikov kernel, mean-shift converges in a finite number of iterations. If the neighborhood structure of the data is known a priori, as is the case in image segmentation , MS can be very efficient without any approximations, since the iteration for each point involves only its neighbors.
With the Gaussian kernel and with arbitrary data, for which the neighborhood structure is not known and varies over iterations, the runtime of MS can be large, particularly with large datasets, and much work has tried to accelerate Gaussian MS. Computationally, MS has two bottlenecks : (1) Accurately converging to a mode can require many iterations, since its convergence is typically linear and can approach sublinearity in regions where two modes are close. (2) Each iteration is linear on the number of points , because it is an average of the data. Thus, running MS for the entire dataset is , where is the average number of iterations per point and the dimension. For example, in image segmentation, typical values are 10 to 100 for (depending on the accuracy sought and the bandwidth ), (higher if using texture features) and thousands to millions for (the number of pixels). Acceleration algorithms should attack either or both bottlenecks, while keeping a low approximation error, i.e., producing a clustering (assignment of points to modes found) that is close to that of the naive MS—otherwise one would really be running a different clustering algorithm.
As for accelerating the convergence, the obvious answer is using Newton’s method (with a modified Hessian to ensure ascent), because it has quadratic convergence, computing the Hessian of a Gaussian mixture is simple  and Newton iterations are typically not much costlier than those of MS. The reason is that, for the low-dimensional problems for which MS is best suited, computing the Newton direction (which involves solving a linear system) is not costly compared to computing the gradient or Hessian themselves. However, Newton’s method does have other problems with MS: (1) it introduces user parameters such as step sizes or damping coefficients; (2) it need not be effective far from a mode, where the Hessian is undefined; (3) since Newton’s method is very different from MS, it can make a point converge to a different mode than under MS. A solution to this is to use a MS-Newton method  that starts with MS iterations (which often suffice to move the iterate relatively close to a mode) and then switches to Newton’s method.
As for reducing the cost of one iteration below , the most obvious strategy is to approximate the average in each iteration with a small subset of the data, namely the points closest to the current iterate. This can be done by simply ignoring faraway points in the average, which introduces an approximation error in the modes, or by updating faraway points infrequently as in the sparse EM algorithm of , which guarantees convergence to a true mode. Unfortunately, either way this requires finding nearest neighbors, which is itself unless we know the neighborhood structure a priori. And, if the bandwidth is relatively large (which is the case if we want a few clusters), then a significant portion of the data have non-negligible weights in the average. Another approach is to use -body methods such as the (improved) fast Gauss transform [40, 93] or various tree structures . These can reduce the cost of one iteration for every data point to or , respectively. However, this implies we run synchronous iterations (all points move at once), and does not scale to high dimensions (because the data structures involved grow exponentially with the dimension), although it is generally appropriate for the low dimensions used in image segmentation.
A final approach consists of eliminating many of the iterations for many of the points altogether, in effect predicting what mode a given point will converge to. This cannot be simply done by running MS for a small subset of the points and then assigning the remaining points to the mode of their closest point in the subset, because mean-shift clusters have complex shapes that significantly differ from a nearest-neighbor clustering, and the clustering error would be large. A very effective approach for image segmentation is the spatial discretization approximation of . This is based on the fact that the trajectories followed by different points as they converge towards a given mode collect close together (see section 2.3 and fig. 1B). Thus, we can run the naive MS for a subset of the points, keeping track of the trajectories they followed. For a new point, as soon as we detect that its iterate is close to an existing trajectory, we can stop and assign it to that trajectory’s mode. With image segmentation (even if the number of features, and thus dimensions, is large), the trajectories can be coded by discretizing the image plane into subpixel cells and marking each cell the first time an iterate visits it with the mode it converges to. Since nearly all these cells end up empty, the memory required is small. This reduces the average number of iterations per point to nearly 1 with only a very small clustering error . This error can be controlled through the subpixel size.
Yuan et al.  show that, if is the result of applying a mean-shift step to a point , then all the points within a distance from have a KDE value . Not all these points actually converge to the same mode as , however many do. A fast mean-shift algorithm results from not running mean-shift iterations for all those points and assigning them to the same mode as . Hence, as in the discretization approach above, only a few data points actually run mean-shift iterations. The approximation error was not controlled in , although one could use instead a distance and use to control it.
Paris and Durand  combine several techniques to accelerate mean-shift image segmentation, including spatial discretization and 1D convolutions (classifying most pixels without iterating, as above), and construct a hierarchy of clusterings based on the notion of topological persistence from computational geometry . The algorithm is fast enough to segment large images and video, although it is not clear how good an approximation it is to the true mean-shift.
As the dimension of the data points increases, searching for nearest neighbors (to compute the mean shift update in eq. (2) or (3) with a truncated kernel) becomes a bottleneck. One can then use approximate nearest-neighbor algorithms. For example, Locality-Sensitive Hashing (LSH)  is used in .
A simple and very effective acceleration that has essentially zero approximation error was given in . It essentially consists of interleaving connected-components and blurring steps. The fact that BMS collapses clusters to a single point suggests that as soon as one cluster collapses we could replace it with a single point with a weight porportional to the cluster’s number of points. This will be particularly effective if clusters collapse at different speeds, which happens if they have different sizes, as predicted in section 2.3; e.g. see fig. 6. The total number of iterations remains the same as for the original BMS but each iteration uses a dataset with fewer points and is thus faster. Specifically, the Gaussian kernel density estimate is now where and the posterior probability is as in . At the beginning and when clusters and merge then the combined weight is . Using the matrix notation of section 2.1 we have , , and . This can be proven to be equivalent to the original BMS.
The reduction step where coincident points are replaced with a single point can be approximated by a connected-components step where points closer than are considered coincident, where takes the same value as in the final connected-components step of BMS (fig. 2B). Thus, is the resolution of the method (below which points are indistinguishable), and while BMS applies it only after having stopped iterating, the accelerated version applies it at each iteration. Hence, the accelerated BMS algorithm alternates a connected-components graph contraction with a BMS step.
Experimentally with image segmentation, a remarkable result arises: the runtime of this accelerated BMS is almost constant over bandwidth values, unlike for both MS and BMS, corresponding to about 4 to 5 BMS iterations, and is dominated by the first few iterations, where almost no merging occurs. This means a speedup factor of 2 to 4 over BMS and 5 to 60 over MS .
As discussed earlier, equating modes with clusters as in mean-shift implies we lose direct control on the number of clusters and do not differentiate between meaningful and non-meaningful modes. Recently, a -modes algorithm  has been proposed that addresses these problems333There exists another algorithm called “-modes” [48, 22], but defined for categorical data, rather than continuous data.. The (Gaussian) -modes algorithm takes both and as user parameters and maximizes the objective function
over the cluster centroids and the assignments of points to clusters , where is the Gaussian kernel. For a given assignment , this can be seen as the sum of a KDE as in eq. (1) but separately for each cluster. Thus, a good clustering must move centroids to local modes, but also define separate KDEs. This naturally combines the idea of clustering through binary assignment variables (as in -means or -medoids) with the idea that, for suitable bandwidth values, high-density points are representative of a cluster (as in mean-shift). As a function of the bandwidth , the -modes objective function becomes -means for and a version of -medoids for . The training algorithm alternates an assignment step as in -means, where each data point is assigned to its closest centroid, with a mode-finding step as in mean-shift, but only for each centroid in the KDE defined by its current cluster (rather than for each data point). A more robust, homotopy-based algorithm results from starting with (i.e., -means) and gradually decreasing while optimizing the objective for each . The computational complexity of this algorithm is the same as for -means ( per iteration), although it requires more iterations, and is thus much faster than mean-shift ( per iteration).
-modes obtains exactly modes even if the KDE of the data has more or fewer than modes, because it splits the data into KDEs. Using the homotopy-based algorithm, it tends to track a major mode in each cluster and avoid outliers. Thus, -modes can work well even in high dimensions, unlike mean-shift.
The fundamental disadvantage of -modes is that the clusters it defines are convex, as with -means, since they are the Voronoi cells defined by the modes. This is solved in the Laplacian -modes algorithm , which minimizes the objective function
This relaxes the hard assignments of -modes to soft assignments (so each point may belong to different clusters in different proportions) and adds a term to the objective with a weight that encourages neighboring points to have similar soft assignments. This term can be equivalently written as in terms of the graph Laplacian constructed from an affinity matrix , where is the degree matrix. The training algorithm is as with -modes, but the assignment step becomes a convex quadratic program on , which can be solved efficiently with a variety of solvers.
Laplacian -modes becomes the -modes algorithm with , and a Laplacian -means algorithm with . The introduction of the Laplacian term allows the clusters to be nonconvex, as happens with other Laplacian-based algorithms such as spectral clustering. However, here we solve a quadratic program rather than a spectral problem, and obtain a KDE, centroids (= modes) and soft assignments of points to clusters (unlike spectral clustering, which only returns hard assignments).
An out-of-sample mapping to predict soft assignments for a test point not in the original training set can be obtained by augmenting (7) with and solving but keeping the centroids and the soft assignments for the training points fixed. The result equals the projection on the simplex of the average of two terms: the average of the assignments of ’s neighbors, and a term dependent on the distances of to the centroids . The assignment can be readily interpreted as a posterior probability of belonging to cluster . Hence, Laplacian -modes can be seen as incorporating nonparametric posterior probabilities into mean-shift (or spectral clustering). The usual way of obtaining posterior probabilities in clustering is by using a mixture model, for which an EM algorithm that maximizes the likelihood is derived. In contrast to this parametric approach, in Laplacian -modes the assignments optimize an objective function that is designed specifically for clustering, unlike the likelihood; the clusters are not obliged to follow a model (such as Gaussian); and, consequently, the optimization algorithm does not depend on the particular type of clusters (unlike the EM algorithm, which depends on the cluster model). These soft assignments or posterior probabilities are helpful to estimate the uncertainty in the clustering or for other uses. For example, in image segmentation they can be used as a smooth pixel mask for image matting, conditional random fields or saliency maps.
In centroid-based clustering algorithms, each cluster is associated with a centroid: the cluster mean in -means, an exemplar (data point) in -medoids, and the KDE mode in mean-shift, -modes and Laplacian -modes. A desirable property of centroids, which makes them interpretable, is that they should be valid patterns and be representative of their cluster (“look” like a typical pattern in the cluster). A well-known disadvantage of -means is that the mean of a nonconvex cluster need not be a valid pattern, since it may lie in a low-density area. This is often the case with clusters with manifold structure. For example, given a sequence of rotated digit-1 images , its mean is not a valid digit-1 image itself. In -medoids, exemplars (data points) are valid patterns by definition, although when the data is noisy, individual exemplars may look somewhat atypical (see the manifold denoising discussion in section 2.6). Modes can strike an optimal tradeoff, in being typical, valid and yet denoised representatives, as follows. First, modes are by definition on high-density areas, so in this sense they are representative of the KDE and cluster. Second, a mode can be characterized as a scale-dependent, local average of the data. Indeed, a mode is a maximum of the density , so its gradient at is zero, and this implies that equals the weighted average of the data in the sense of eq. (2) and (3) (where ), as seen in section 2. As a function of the scale, a mode spans a continuum between equaling the regular mean of the whole data () and equaling any individual data point (). With an intermediate bandwidth in the KDE, they are local averages of the data and so can remove noise that affects each individual exemplar, thus being even more prototypical than actual data points. In this sense, (Laplacian) -modes achieves a form of intelligent denoising similar to that of MBMS (section 2.6).
Table 1 compares several clustering algorithms in terms of whether their centroids are valid patterns, whether they can model nonconvex clusters, whether they estimate a density, and whether the cluster assignments are hard or soft.
|Centroids||likely invalid||“valid”||“valid”||N/A||valid||valid||likely invalid|
|no||depends||yes||yes||no||yes||to some extent|
|Assignment||hard||hard||hard||hard||hard||soft, nonparam.||soft, param.|
We illustrate MS and BMS in the clustering application where they have been most effective (image segmentation), and point out clustering applications where they are not so effective (high-dimensional data or data with manifold structure), as well as applications beyond clustering where MS and BMS have been used. Matlab code for most of the algorithms described may be obtained from the author.
MS and BMS are most effective with low-dimensional data, and its most successful application is in image segmentation . They are applied as follows. Given an image, we consider each pixel as a data point in a space consisting of two types of features: spatial features (the and location of a pixel) and range features (the intensity or grayscale value, the color values, texture features, etc. of the pixel). The range features are scaled to span approximately the range of the spatial features. This way, all features and the bandwidth have pixel units. For example, for the image of fig. 5, we rescale the original intensity values to the range , so a feature vector would correspond to the pixel located at coordinates , which has an intensity equal to % of the maximum intensity (white). The precise scaling will affect the clustering and should be done carefully. Using spatial features is beneficial because they introduce spatial coherence (nearby pixels tend to belong to the same cluster), although sometimes only the range features are used. One should use a perceptually uniform color space such as LAB rather than the RGB space, so that Euclidean distances approximately match perceptual differences in color .
|A: original image||B: MS segmentation||C: datapoints & iterates||D: data/iter. (2D proj.)|
|I H V[b]|
Fig. 5 shows an example with a grayscale image of pixels. Thus, the dataset contains points in 3D (location , and intensity ). Fig. 5 shows the result with MS while fig. 6 shows the result with BMS. As noted earlier, MS and BMS generally give similar clusterings over a range of bandwidths.
With the Gaussian kernel, reasonable clusterings arise for bandwidths around of the image size (vertical or horizontal), although one should explore a range of bandwidths. In some computer vision applications, MS is sometimes used with a very small bandwidth in order to oversegment an image into uniform patches which are then further processed.
Fig. 7 illustrates two situations where MS (and BMS) do not produce good clusterings. The first one is data having manifold structure (of lower dimension than the feature space), as in the 1D spirals in a 2D space shown in the left panel. The KDE will have either many modes spread over the data manifolds (for smaller bandwidths) or few modes that mix manifolds (for larger bandwidths). No bandwidth value will cluster the spirals correctly. The Laplacian -modes algorithm is able to cluster the spirals correctly, while estimating a reasonable KDE and mode for each spiral .
The second situation is high-dimensional data, as in the MNIST handwritten digits , where each data point is a grayscale image of pixels (i.e., feature dimensions). In high-dimensional spaces, MS tends to produce either a single mode (for larger bandwidths), which is uninformative, or many modes (for smaller bandwidths), which often sit on small groups of points or outliers. In the results in fig. 7 (right panel), the MS bandwidth was tuned to produce clusters, and -means and Laplacian -modes were run with . Since the data forms nonconvex clusters, -means produces centroids that average distant data points, of possibly different digit classes, and thus do not look like digits. Again, Laplacian -modes can partition the data into exactly clusters, while finding a mode for each that looks like a valid digit and is more representative of its class .
In (univalued) regression, we assume there exists a mapping which assigns a unique vector to each possible input (i.e., is a function in the mathematical sense). We estimate given a training set of pairs . This is the usual regression setting. In multivalued regression, the mapping assigns possibly multiple vectors to each input . A classical example is learning the inverse of a (univalued) mapping . For example, the inverse of is , which has , or outputs depending on the value of (smaller, equal or greater than zero, respectively). Carreira-Perpiñán [8, 10, 11] proposed to represent a multivalued mapping by the modes of the conditional distribution . Hence, in this case, we apply mean-shift to a conditional distribution . One advantage of this approach is that the number of inverses is selected automatically for each value of ; essentially, it is the number of “clusters” found by mean-shift in . Once the modes have been found, one can also estimate error bars for them from the local Hessian at each mode [8, 10]. The conditional distribution may be obtained from a Gaussian KDE or Gaussian mixture
for the joint distribution ofand (since the marginal and conditional distributions of any Gaussian mixture are also Gaussian mixtures), but it can also be learned directly, for example using mixture density networks or particle filters .
Fig. 8 illustrates two examples. On the left panels, we learn the inverse kinematics mapping of a robot arm [65, 66]. The forward kinematics mapping is univalued and gives the position in workspace of the arm’s end-effector given the angles at the joints. The inverse mapping is multivalued, as shown by the elbow-up and elbow-down configurations. In this case, the conditional distribution was learned using a particle filter. On the right panel, we learn to map speech sounds to tongue shapes (articulatory inversion) . The American English /r/ sound (as in “rag” or “roll”) can be produced by one of two tongue shapes: bunched or retroflex. The figure shows, for a specific /r/ sound, 3 modes in tongue shape space (two bunched and one retroflex). In both inverse kinematics and articulatory inversion, using the conditional mean (shown in magenta), or equivalently doing univalued least-squares regression, leads to invalid inverses.
|theta1[t] theta2[-90]||x1[t] x2[-90] t1 t2|
Mean-shift algorithms are based on the general idea that locally averaging data results in moving to higher-density, and therefore more typical, regions. Iterating this can be done in two distinct ways, depending on whether the dataset itself is updated: mode finding (MS), or smoothing (BMS), both of which can be used for clustering, but also for manifold denoising, multivalued regression and other tasks.
In practice, mean-shift algorithms are very attractive because—being based on nonparametric kernel density estimates—they make few assumptions about the data and can model nonconvex clusters. The user need only specify the desired scale of the clustering but not the number of clusters itself. Although computationally slow, mean-shift algorithms can be accelerated and scale to large datasets. They are very popular with low-dimensional data having nonconvex clusters, such as image segmentation. The (Laplacian) -modes algorithms remain nonparametric but still perform well with high-dimensional data and outliers.
Mean-shift is intimately related to kernel density estimates and Gaussian mixtures, and to neighborhood graphs. The graph Laplacian or random-walk matrix arises as a fundamental object in clustering and dimensionality reduction that encapsulates geometric structure about a dataset. Although not usually seen this way, MS/BMS algorithms are essentially based on , and they alternate between smoothing the data with (a power iteration), with possibly modified eigenvalues, and updating itself.
Some directions for future research are as follows. (1) Understanding the geometry of Gaussian mixtures and its modes in high dimensions. (2) Finding a meaningful extension of (blurring) mean-shift to directed graphs. (3) Designing, or learning, random-walk or Laplacian matrices that are optimal for clustering in some sense. (4) Finding a meaningful definition of clusters that is based on “bumps” of a kernel density estimate, i.e., distinct regions of high probability rather than modes.
I thank Weiran Wang and Marina Meilă for comments on the paper.
Consider an undirected graph with vertex set and edge set . A connected component of is a maximal subset of such that every pair of vertices in it can be connected through a path using edges in . Hence, the vertex set can be partitioned into connected components. The connected component of a given vertex can be found by depth-first search (DFS), which recursively follows edges adjacent to until all vertices reachable from have been found. This can be repeated for the remaining vertices until all connected components have been found, with a total runtime .
Connected-components is a clustering algorithm in its own right. To use it, one provides a matrix of distances between every pair of vertices (data points and ) and a threshold , and defines a graph having as vertices the data points and as edges if for (this is sometimes called an -ball graph). The connected components of this graph give the clusters.
Connected-components gives poor clustering results unless the clusters are tight and clearly separated. However, it can be reliably used as a postprocessing step with mean-shift algorithms, as described in the main text, to merge points that ideally would be identical. For example, with MS, the final iterates for points that in the limit would converge to the same mode are numerically different from each other (by a small amount if using a sufficiently accurate stopping criterion in the mean-shift iteration). Hence, these iterates form a tight cluster around their mode, which is widely separated from the tight clusters corresponding to other modes.
Naively implemented (fig. 9 left), connected-components would run in time, because to construct the graph we have to threshold all pairwise distances, each of which costs . However, in the case of tight, clearly separated clusters, we need not construct the graph explicitly, and it runs in if there are connected components. Let the data satisfy the following “tight clusters” assumption: there exists that is larger than the diameter of each component (the largest distance between every pair of points in a component) but smaller than the distance between any two components (the smallest distance between two points belonging to different components). Then the connected components can be found incrementally by connecting each point to the representative of its component (fig. 9 right), where the representative of component is , which is a point in that component. This can be done on the fly, as we process each data point in MS, or given the final set of iterates in BMS.
Some heuristics can further accelerate the computation. 1) The distances can be computed incrementally (dimension by dimension), so that as soon as is exceeded we exit the distance calculation. In the tight cluster assumption, this could reduce the cost to , since for distances above each individual distance dimension will typically exceed . 2) In image segmentation, we can scan the pixels in raster order, so that for most pixels their component is the same as for the previous pixel (this will not be true when crossing a cluster boundary, but that happens infrequently). Then, when computing the distance, we always try first the last pixels’s component. The average cost for the entire connected-components runtime is then , because most pixels compute a single distance.
The value of depends on the problem but it can usually be chosen in a wide interval. For example, in image segmentation with features in pixel units, we can set , since we do not need subpixel accuracy to locate a mode, and modes must be at least one pixel apart to be meaningful. In BMS, we can safely set to a smaller value (say, ), since its cubic convergence rate produces extremely tight clusters quickly.
|A. Naive connected-components algorithm Define an -ball graph: vertices edges , . Apply DFS to this graph.||B. Efficient connected-components algorithm first component for to for to if old component assign to component ; break if was not assigned new component|
Scale-based clustering using the radial basis function network.
IEEE Trans. Neural Networks, 7(5):1250–1261, Sept. 1996.
Proc. of the 11th Int. Conf. Artificial Intelligence and Statistics (AISTATS 2007), San Juan, Puerto Rico, Mar. 21–24 2007.