A natural way to characterize the cluster structure of a dataset is by finding regions containing a high density of data. This can be done in a nonparametric way with a kernel density estimate, whose modes and hence clusters can be found using meanshift algorithms. We describe the theory and practice behind clustering based on kernel density estimates and meanshift algorithms. We discuss the blurring and nonblurring versions of meanshift; theoretical results about meanshift algorithms and Gaussian mixtures; relations with scalespace theory, spectral clustering and other algorithms; extensions to tracking, to manifold and graph data, and to manifold denoising; Kmodes and Laplacian Kmodes algorithms; acceleration strategies for large datasets; and applications to image segmentation, manifold denoising and multivalued regression.
One intuitive way of defining clusters is to assume that the data points are a sample from a probability density function, and then to define the clusters through this density. For example, fig. 1 shows a 2D dataset and a density estimate for it, whose contours clearly suggest that there are two clusters of a complex shape. The first step, then, is to learn an estimate of the density of the data points. This can be done with a parametric model, such as a Gaussian mixture, typically trained with an EM algorithm to maximize the likelihood [55]. Such an approach is often computationally efficient and can give good results with clusters of elliptical shape, but it has several disadvantages. The likelihood function will typically have local optima, and finding a global optimum is, in general, very difficult; thus, the result depends on the initialization, and in practice a user will try different initializations (usually random restarts). The selection of the model (which kernel and how many components) is left to the user, as is the number of clusters to find. And when the clusters have complex shapes, as for example in image segmentation, many components will be required to approximate them well, increasing the training time and the number of local optima.

We focus on nonparametric, kernel density estimates (KDEs). A KDE is a generalization of histograms that defines smooth density estimates in any dimension. KDEs simplify the mathematical and computational treatment of densities and, crucially, allow one to use continuous optimization to find maxima of the density. With a kernel such as the Gaussian kernel, a KDE requires a single user parameter, the bandwidth (also referred to as the scale). Given the bandwidth, the KDE is uniquely determined and, as seen below, so are its clusters, which can take complex nonconvex shapes. Hence, the user need not select the number of clusters or try random restarts. We will focus on clusters defined by the modes of the KDE (although this is not the only way to define the clusters). A mode is a local maximum of the density. A natural algorithm to find modes of a KDE is the mean-shift iteration, essentially a local average, described in section 2. The basic idea in mean-shift clustering is to run a mean-shift iteration initialized at every data point and then to have each mode define one cluster, with all the points that converged to the same mode belonging to the same cluster.
Section 2.3 reviews theoretical results regarding the number and location of modes of a KDE, the convergence of mean-shift algorithms and the character of the cluster domains. Section 2.4 discusses relations of mean-shift algorithms with spectral clustering and other algorithms. Sections 2.5 and 2.6 describe extensions of mean shift for clustering and manifold denoising, respectively. One disadvantage of mean-shift algorithms is their computational cost, and section 2.7 describes several accelerations. Section 3 describes another family of KDE-based clustering algorithms which are a hybrid of K-means and mean shift, the K-modes and Laplacian K-modes algorithms, which find exactly K clusters and a mode in each, and work better with high-dimensional data. Section 4 shows applications in image segmentation, inverse problems, denoising, and other areas. We assume multivariate, continuous (i.e., not categorical) data throughout.

In mean-shift clustering, the input to the algorithm is the data points (multivariate, continuous feature vectors) and the bandwidth or scale. Call
X = {x_1, …, x_N} ⊂ R^D the data points to be clustered. We define a kernel density estimate [83]

p(x) = (1/N) ∑_{n=1}^N K(‖(x − x_n)/σ‖²)    (1)

with bandwidth σ > 0 and kernel K(t), e.g. K(t) ∝ e^{−t/2} for the Gaussian kernel, or K(t) ∝ 1 − t if t ∈ [0, 1] and K(t) = 0 if t > 1 for the Epanechnikov kernel. Many of the results below carry over to kernels where each point has its own weight and its own bandwidth, which can be an isotropic, diagonal or full covariance matrix. To simplify the presentation we focus on the case where all points have the same scalar bandwidth σ (the isotropic, homoscedastic case) and the same weight, unless otherwise noted. This is the case found most commonly in practice. Also, we mostly focus on Gaussian kernels, which are easier to analyze and give rise to simpler formulas.
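As a concrete illustration (a minimal sketch written for this text, not code from the paper), eq. (1) with a Gaussian kernel can be evaluated in a few lines of NumPy; the dataset and bandwidth below are arbitrary examples:

```python
import numpy as np

def kde(x, X, sigma):
    """Gaussian KDE of eq. (1), normalized to integrate to one:
    p(x) = (1/N) sum_n exp(-||x - x_n||^2 / (2 sigma^2)) / (2 pi sigma^2)^(D/2).
    X: (N, D) data matrix; x: (D,) query point; sigma: scalar bandwidth."""
    N, D = X.shape
    sq = np.sum((X - x) ** 2, axis=1)               # squared distances ||x - x_n||^2
    norm = (2.0 * np.pi * sigma ** 2) ** (D / 2.0)  # Gaussian normalization constant
    return np.exp(-sq / (2.0 * sigma ** 2)).sum() / (N * norm)

# two well-separated 1D clusters: the density is high inside a cluster, low between
X = np.array([[0.0], [0.1], [-0.1], [5.0], [5.1], [4.9]])
```

With this toy dataset, `kde` is larger at a cluster center (x = 0) than midway between the clusters (x = 2.5), which is exactly the structure mean shift exploits.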
We can derive a simple iterative scheme to find a mode of p by equating its gradient to zero and rearranging terms (section 2.7 discusses other ways to find modes). We obtain

x^(τ+1) = f(x^(τ)),   f(x) = ∑_{n=1}^N w_n(x) x_n,   w_n(x) = K′(‖(x − x_n)/σ‖²) / ∑_{n′=1}^N K′(‖(x − x_{n′})/σ‖²)    (2)

where the vector f(x) − x is the mean shift, since it averages the individual shifts x_n − x with the weights w_n(x) as above. For a Gaussian kernel, K′ ∝ K and this simplifies to the following, elegant form (where, by Bayes' theorem, p(n|x) is the posterior probability of the component centered at x_n given point x) [9]:

x^(τ+1) = f(x^(τ)) = ∑_{n=1}^N p(n|x^(τ)) x_n,   p(n|x) = exp(−½‖(x − x_n)/σ‖²) / ∑_{n′=1}^N exp(−½‖(x − x_{n′})/σ‖²).    (3)
As discussed below, under mild conditions this scheme converges to a mode of p from nearly any initial x. Intuitively, each step moves the iterate to a local average of the data, in that data points closer to x have larger weight, and this increases the density. Eq. (2) is called the mean-shift iteration, and it can be used in two distinct ways: to find modes, and to filter (or smooth) a dataset. This gives rise to two different clustering algorithms, which we will refer to as mean shift (MS), where modes are found, and blurring mean shift (BMS), where the dataset is filtered.
Fig. 2. Gaussian mean-shift algorithms. A. Mean-shift (MS): for each n = 1, …, N: set x ← x_n and repeat x ← ∑_m p(m|x) x_m until stop; record the convergence point z_n. Then run connected-components({z_1, …, z_N}, ε). B. Blurring mean-shift (BMS): repeat: for n = 1, …, N: y_n ← ∑_m p(m|x_n) x_m; then X ← Y; until stop. Then run connected-components(X, ε). C. Gaussian MS algorithm in matrix form: repeat Z ← X P̃(Z, X) until stop; connected-components(Z, ε). D. Gaussian BMS algorithm in matrix form: repeat X ← X P(X, X) until stop; connected-components(X, ε).
Here, we declare each mode of p as the representative of one cluster, and assign data point x_n to the mode it converges to under the mean-shift iteration. The algorithm is given in fig. 2A for the Gaussian kernel. We can also estimate error bars for each mode from the local Hessian [8, 10], given in eq. (5), which is related to the local covariance.
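To make the MS procedure concrete, here is a minimal NumPy sketch (an illustration written for this text, not code from the paper): `ms_step` implements the Gaussian update of eq. (3) and `find_mode` iterates it from one starting point; the dataset, bandwidth and tolerances are arbitrary examples:

```python
import numpy as np

def ms_step(x, X, sigma):
    """One Gaussian mean-shift step, eq. (3): x <- sum_n p(n|x) x_n."""
    w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2.0 * sigma ** 2))
    return (w / w.sum()) @ X          # posterior-weighted local average of the data

def find_mode(x, X, sigma, tol=1e-8, max_iter=1000):
    """Iterate mean shift from x until the step is smaller than tol."""
    for _ in range(max_iter):
        x_new = ms_step(x, X, sigma)
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x_new

# two tight groups of points: each group keeps its own mode at this bandwidth
X = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2],
              [5.0, 5.0], [5.2, 5.0], [5.0, 5.2]])
m0 = find_mode(X[0], X, sigma=1.0)
m3 = find_mode(X[3], X, sigma=1.0)
```

With σ = 1 the two groups converge to separate modes near their centroids; a much larger σ would merge everything into a single mode, which is how the bandwidth controls the number of clusters.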
Some practical problems need to be solved. First, some starting points do not converge to modes but to other stationary points (minima and saddle points). This is unlikely to happen with a finite sample, but if it does, such points can be detected by examining the Hessian or by a postprocessing step that checks for small clusters.
Second, the mean-shift iteration is stopped after a finite number of steps, for example when the relative change in the value of x is smaller than a set tolerance tol. This means that data points that in theory would converge to the same mode actually stop at numerically slightly different points. A postprocessing step is necessary to merge these into a unique mode. This can be done by finding the connected components (the connected-components algorithm is described in appendix A) of a graph that has N vertices, one for every convergence point, and an edge between any pair of vertices lying within a small distance ε of each other. The graph need not be explicitly constructed. The user should set tol small enough to converge to the modes with good accuracy while limiting the computational cost incurred; ε should be set considerably larger than tol, but smaller than the distance between different true modes.
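The merging step can be sketched as follows (an illustrative implementation written for this text, not the algorithm of appendix A): convergence points within ε of each other, directly or through a chain of neighbors, receive the same cluster label.

```python
import numpy as np

def merge_modes(Z, eps):
    """Label convergence points Z (N, D): points within eps of each other,
    transitively, get the same label (connected components of the eps-graph)."""
    N = Z.shape[0]
    labels = -np.ones(N, dtype=int)
    k = 0
    for i in range(N):
        if labels[i] >= 0:
            continue                       # already reached from an earlier point
        stack = [i]
        labels[i] = k
        while stack:                       # BFS/DFS over the eps-neighborhood graph
            j = stack.pop()
            near = np.linalg.norm(Z - Z[j], axis=1) < eps
            for m in np.nonzero(near)[0]:
                if labels[m] < 0:
                    labels[m] = k
                    stack.append(m)
        k += 1
    return labels

# numerically slightly different convergence points collapse to two clusters
Z = np.array([[0.0, 0.0], [1e-4, 0.0], [5.0, 5.0], [5.0, 5.0001]])
labels = merge_modes(Z, eps=1e-3)
```

Note the graph is never built explicitly; the ε-neighborhoods are tested on the fly, as the text suggests.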
Here, each point of the dataset actually moves to the point given by eq. (2). That is, given the dataset X, for each x_n we obtain a new point y_n by applying one step of the mean-shift algorithm: y_n = f(x_n). Thus, one iteration of blurring mean shift results in a new dataset Y which is a blurred (smoothed) version of X. Iterating this process yields a sequence of datasets (and a sequence of kernel density estimates), where the first dataset is the original one and each subsequent dataset is obtained by blurring the previous one with one mean-shift step (see fig. 6).
As will be shown below, Gaussian BMS can be seen as an iterated filtering (in the signal-processing sense) that eventually leads to a dataset with all points coincident, for any starting dataset and bandwidth. However, before that happens, the dataset quickly collapses into meaningful, tight clusters which depend on σ (see fig. 6), and then these point-like clusters continue to move towards each other relatively slowly. A stopping criterion that detects this situation quite reliably is based on whether the entropy of the dataset changes [13] (a simpler criterion would be to stop when the update to X is small, but this does not always give good clusters). As with MS clustering, a connected-components postprocessing step merges the points into actual clusters. The BMS algorithm is given in fig. 2B.
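A naive NumPy sketch of BMS follows (illustrative only; for simplicity it runs a fixed number of iterations instead of the entropy-based stopping criterion of [13], and stores points as rows):

```python
import numpy as np

def bms(X, sigma, n_iter=30):
    """Blurring mean shift: every point moves by one mean-shift step per
    iteration, so the whole dataset is repeatedly smoothed."""
    X = X.copy()
    for _ in range(n_iter):
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise sq. distances
        W = np.exp(-sq / (2.0 * sigma ** 2))                  # Gaussian affinities
        P = W / W.sum(axis=1, keepdims=True)                  # row n holds p(m|x_n)
        X = P @ X                                             # blur all points at once
    return X

# two tight groups 7 units apart: each collapses onto (essentially) one point
X0 = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2],
               [5.0, 5.0], [5.2, 5.0], [5.0, 5.2]])
Y = bms(X0, sigma=1.0)
```

After a few iterations the within-group spread is numerically negligible while the two groups remain far apart, matching the description above: fast collapse into tight clusters, then very slow drift of the clusters towards each other.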
Although both MS and BMS are based on the same mean-shift iteration, they are different algorithms and can produce different clustering results. Specifically, for a given value of the bandwidth, the number of clusters resulting from MS and BMS is usually different. However, the collections of clusterings produced over a range of bandwidths can be quite similar.
BMS is considerably faster than MS both in number of iterations and in runtime, particularly when using the accelerated BMS algorithm (section 2.7), which introduces essentially no approximation error. However, MS (and also BMS) can be accelerated considerably further if a small clustering error is tolerated (section 2.7).
In MS, the optimizations (one for each data point) proceed independently, but they could be done synchronously, as in BMS, without altering the result. In practice, however, this is wasteful, because the number of iterations required varies considerably among points, and a synchronous scheme would have to run for the largest number of iterations. Conversely, it is possible to run BMS with asynchronous iterations, for example moving points as soon as their update is computed. However, this makes the result dependent on the order in which the points are picked, and it is unlikely to be faster than the accelerated algorithm described below.
The fundamental parameter in mean-shift algorithms is the bandwidth σ, which determines the number of clusters. The statistics literature has developed various ways to estimate the bandwidth of a KDE [77, 83], mostly in the 1D setting, for example based on minimizing a suitable loss function (such as the mean integrated squared error), or on more heuristic rules (such as making the bandwidth proportional to the average distance of each point to its k-th nearest neighbor). While these bandwidth estimators are useful and can give reasonable results, they should be used with caution, because the bandwidth that gives the best density estimate (in a certain sense) need not give the best clustering: clustering and density estimation are, after all, different problems. Besides, clustering is by nature exploratory, so it is best to explore a range of bandwidths. Computationally, this is particularly easy to do in MS (see the scale-space discussion below).

It is also possible to use a different bandwidth σ_n for each point (called an adaptive KDE), which can help, for example, in regions where points are sparse. A good way to do this is with entropic affinities [45, 82], where the user sets a global number of neighbors K and then, for each data point x_n, the bandwidth σ_n is computed so that the point has a distribution over neighbors with perplexity (log-entropy) K, i.e., each point sets its own bandwidth to have K effective neighbors. One can then vary K to achieve different clusterings. Other ways to construct adaptive KDEs for MS have been proposed [28]. The mean-shift update with adaptive bandwidths has the form [14]:
x^(τ+1) = ∑_{n=1}^N q(n|x^(τ)) x_n,   q(n|x) = (p(n|x)/σ_n²) / ∑_{n′=1}^N (p(n′|x)/σ_{n′}²)    (4)

where the values q(n|x) are the posterior probabilities p(n|x) reweighted by the inverse variances 1/σ_n² and renormalized (compare with the single-bandwidth equation (3)).

Mean shift has also been used to track moving objects ("blobs") through a sequence of images [30]. Since the blobs can change size, using a fixed bandwidth (scale) becomes problematic. Collins [27] used ideas from scale-space theory (see below) to select a scale that adapts to the blob size. An "image feature" is defined as a point in scale space where a certain differential operator achieves a local maximum (i.e., a mode) with respect to both space and scale. This mode is then tracked by a two-step mean-shift procedure that convolves in image space with a filter bank of spatial difference-of-Gaussian filters, and convolves in scale space with an Epanechnikov kernel, in an efficient way.
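The adaptive update of eq. (4) can be sketched as follows (illustrative code written for this text; the per-point bandwidths are passed in directly rather than computed via entropic affinities, and the data are arbitrary):

```python
import numpy as np

def ms_step_adaptive(x, X, sigmas):
    """Mean-shift step with per-point bandwidths sigma_n, eq. (4):
    posteriors p(n|x) reweighted by 1/sigma_n^2 and renormalized."""
    D = X.shape[1]
    sq = np.sum((X - x) ** 2, axis=1)
    # unnormalized posteriors for components N(x_n, sigma_n^2 I)
    post = np.exp(-sq / (2.0 * sigmas ** 2)) / sigmas ** D
    q = post / sigmas ** 2            # reweight by the inverse variance
    q /= q.sum()                      # renormalize
    return q @ X

# with equal bandwidths, eq. (4) reduces to the fixed-bandwidth update (3)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
x = np.array([0.5, 0.5])
y_adapt = ms_step_adaptive(x, X, np.full(len(X), 0.7))
```

A quick consistency check is that, for a constant bandwidth vector, this step coincides with the plain Gaussian mean-shift step.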
Different kernels give rise to different versions of the mean-shift and blurring mean-shift algorithms. Much previous work (including [37, 24]) uses the Epanechnikov kernel for computational efficiency, since the kernel evaluations involve only pairs of neighboring points (at distance at most σ) rather than all pairs of points (though the neighbors must still be found at each iteration), and convergence occurs in a finite number of iterations. However, in practice the Gaussian kernel produces better results than the Epanechnikov kernel [29], which generates KDEs that are only piecewise differentiable and can contain spurious modes (see below).
The behavior of modes and other critical points (minima, saddles) as a function of the bandwidth (or scale) is the basis of scale-space theory [91, 50, 54]. Here, one studies the evolution of an image under Gaussian convolution (blurring). If we represent the image as a sum of delta functions centered at the pixel locations, each weighted by its grayscale value, convolving this with an isotropic Gaussian kernel of scale σ gives a KDE (with weighted components). As the scale increases, the image blurs, structures in the image such as edges or objects lose detail, and for large scales the image tends to uniform gray. Some important structures, such as objects, can be associated with modes of the KDE, and the lifetime of a mode (the scale interval between the creation of that mode and its destruction, i.e., its merging with another mode) is taken as an indication of its importance in the image.

An ideal convolution kernel should blur structure in the image but never create new structure (which would then reflect properties of the kernel rather than of the image). The result would then be a tree of modes, with N modes as σ → 0 (one mode at each component) and modes merging as σ increases until a single mode remains. Unfortunately, this is only true for the Gaussian kernel and only in dimension one (section 2.3). In dimension two, i.e., with images, new modes can be created as the scale increases. However, in practice these creation events seem rare and short-lived, in that the mode created usually merges with another mode or critical point at a slightly larger scale. Thus, they are generally ignored. Computationally, one starts with a mode at each pixel (or data point) for σ → 0 and then tracks the location of the modes as in a numerical continuation method [61], by running mean shift at a new, larger scale using as initial points the modes at the previous scale. By construction, this results in a tree, although, unlike in agglomerative clustering, the resulting clusterings at different scales need not be nested.
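The continuation scheme can be sketched as follows (a toy illustration with hypothetical 1D data, written for this text): at each successively larger scale, mean shift is warm-started at the modes found at the previous scale, and the number of distinct modes is recorded.

```python
import numpy as np

def ms_modes(starts, X, sigma, tol=1e-7, max_iter=500):
    """Run Gaussian mean shift from each start point; return convergence points."""
    out = []
    for x in starts:
        for _ in range(max_iter):
            w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2.0 * sigma ** 2))
            x_new = (w / w.sum()) @ X
            if np.linalg.norm(x_new - x) < tol:
                break
            x = x_new
        out.append(x_new)
    return np.array(out)

def count_distinct(Z, eps=1e-3):
    """Crude merge of near-identical convergence points."""
    distinct = []
    for z in Z:
        if all(np.linalg.norm(z - d) >= eps for d in distinct):
            distinct.append(z)
    return len(distinct)

X = np.array([[0.0], [0.3], [4.0], [4.3], [8.0]])   # three well-separated groups
modes = X.copy()                  # sigma -> 0: one mode per data point
counts = []
for sigma in [0.2, 1.0, 5.0]:     # increasing scales
    modes = ms_modes(modes, X, sigma)   # continuation: warm-start at previous modes
    counts.append(count_distinct(modes))
```

On this toy data the mode count shrinks from 3 (each pair of nearby points already merged at σ = 0.2) to 1 at the largest scale, tracing one branch-merging history of the mode tree.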
Another notion of lifetime, topological persistence, has been explored in computational geometry [34] and used to define a hierarchical mean-shift image segmentation [62, 23]. The tree of modes has also been proposed in the statistical literature as a tool for visualization of KDEs [59], but this is practical only for small datasets in 1D or 2D. The mode tree is sensitive to small changes in the data and gives no way to differentiate between important modes and those caused by, for example, outliers, so it can help to combine several trees constructed by jittering or resampling the original dataset
[60].

As indicated in fig. 2D, we can equivalently write each Gaussian BMS iteration in matrix form [13] as X ← X P, in terms of the random-walk matrix P = W D⁻¹, an N × N stochastic matrix with elements p_nm = w_nm/d_m and d_m = ∑_{n′} w_{n′m}. Here, X = (x_1, …, x_N) is the D × N matrix of data points, W = (w_nm) with w_nm = exp(−½‖(x_n − x_m)/σ‖²) is a Gaussian affinity matrix, which defines a weighted graph where each x_n is a vertex, and D = diag(d_1, …, d_N) is the degree matrix of that graph. This establishes a connection with spectral clustering, which we describe in section 2.4.

Also, we can regard the iteration X ← X P as a smoothing, or more generally a filtering [79, 33, 80, 15], of (each dimension of) the data X with an inhomogeneous filter P, where the filter depends on the data and is updated at each iteration as well, hence resulting in a nonlinear filtering. This in turn suggests that one could use other filters constructed as a function φ(P), where φ is a scalar function, so it modifies the spectrum of P. Carreira-Perpiñán [15] studied several such filters, including explicit, implicit, power and exponential ones, depending on a step-size parameter η, resulting in generalized Gaussian blurring mean-shift algorithms. He gave convergence conditions on φ and η and found that the different filters tend to find similar clusters, i.e., over a range of bandwidths they can obtain the same clustering (at possibly different σ values). However, their runtimes vary widely. Implicit filters (which involve solving a linear system) and power filters (which involve iterating a matrix product) have a strong clustering effect in each iteration, but their iterations are also more costly. When one considers both the number of iterations and the cost of each iteration, the method found fastest was a slightly over-relaxed explicit filter with a step size slightly above that of standard BMS; however, its runtime was very close to that of standard BMS. An interesting extension of this work would be to design the function φ so that the resulting generalized BMS algorithm is optimal (in some sense) for clustering.

It is also possible to write MS in matrix form (fig. 2C), where we write the random-walk matrix with the symbol P̃ to differentiate it from the standard random-walk matrix P, since P̃ is defined on two sets of points (the iterates Z and the data X) while P is defined on one (X). However, the matrix form implies that the MS updates for all points are synchronous, which, as mentioned before, is slow.
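The equivalence between the matrix form and the per-point mean-shift updates is easy to check numerically. In the sketch below (written for this text), points are stored as rows, the transpose of the D × N convention above, so one BMS iteration reads X ← P X with P row-stochastic:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))       # arbitrary example data, points as rows
sigma = 1.0

# random-walk matrix P: Gaussian affinities with rows normalized to sum to 1
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq / (2.0 * sigma ** 2))
P = W / W.sum(axis=1, keepdims=True)

X_matrix = P @ X                  # one BMS iteration in matrix form

# the same iteration computed point by point with the mean-shift update
X_loop = np.empty_like(X)
for n in range(len(X)):
    w = np.exp(-np.sum((X - X[n]) ** 2, axis=1) / (2.0 * sigma ** 2))
    X_loop[n] = (w / w.sum()) @ X

assert np.allclose(X_matrix, X_loop)   # matrix form == per-point updates
```

The matrix form is what makes the spectral analysis of BMS (and the generalized filters φ(P)) convenient: one iteration is a single matrix product.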
Having run MS or BMS on a dataset, how should we deal with new data points that were not in the original dataset? The purist option (and the only one for BMS) is to run the clustering algorithm from scratch on the entire dataset (old and new points), but this is computationally costly. A faster option with MS is to use the original KDE and simply run MS on each of the new points, assigning them to the modes they converge to. This reflects the fact that MS clusters not just the points in the dataset but (implicitly) the whole space. However, this point of view implies that no new clusters are created when new points arrive.
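The out-of-sample option for MS can be sketched as follows (illustrative code written for this text; the data and query point are arbitrary): the KDE is always built on the original dataset only, and a new point is simply iterated to a mode of that fixed density.

```python
import numpy as np

def find_mode(x, X, sigma, tol=1e-8, max_iter=500):
    """Run mean shift on query x using the KDE of the ORIGINAL data X only."""
    for _ in range(max_iter):
        w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2.0 * sigma ** 2))
        x_new = (w / w.sum()) @ X
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# original dataset and its modes (two clusters)
X = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2],
              [5.0, 5.0], [5.2, 5.0], [5.0, 5.2]])
modes = np.array([find_mode(x, X, 1.0) for x in X])

# a new point is assigned to whichever existing mode it converges to
x_new = np.array([0.5, 0.5])
m = find_mode(x_new, X, 1.0)
```

Since the KDE is unchanged, the new point can only land on an existing mode, which is exactly the limitation noted above: no new clusters appear this way.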
The advantages of mean-shift algorithms stem from the nonparametric nature of the KDE: (1) It makes no model assumptions (other than using a specific kernel), unlike Gaussian mixture models or K-means, for example. (2) It is able to model complex clusters of nonconvex shape (although this does not imply that all shapes can be modeled well), unlike K-means. (3) The user need only set one parameter, the bandwidth, which has an intuitive physical meaning of local scale, and this automatically determines the number of clusters. This is often more convenient than having to select the number of clusters explicitly. (4) It has no local minima, so the clustering it defines is uniquely determined by the bandwidth, without the need to run the algorithm with different initializations. (5) Outliers, which can be very problematic for Gaussian mixtures and K-means, do not overly affect the KDE (other than creating singleton clusters).

Using KDEs and equating modes with clusters as in mean shift also has some disadvantages. The most important one is that KDEs break down in high dimensions, where the number of clusters changes abruptly from one, for large σ, to many, with only a minute decrease in σ. Indeed, most successful applications of mean shift have been in low-dimensional problems, in particular image segmentation (using a few features per pixel, such as color in LAB space). A way of using modes in high-dimensional spaces is the K-modes algorithm described in section 3.
Other possible disadvantages, depending on the case, are as follows. (1) In some applications (e.g. figure-ground or medical image segmentation) the user may seek a specific number of clusters. However, in mean-shift clustering we have no direct control over the number of clusters: to obtain K clusters, one has to search over σ. This is computationally costly (and sometimes not well defined, since the number of clusters need not be a monotonic function of σ). (2) We do not differentiate between meaningful and non-meaningful modes. For example, outliers will typically create their own mode; or the density in a cluster may genuinely contain multiple modes (especially with clusters that have a nonconvex or manifold structure, as in fig. 1D). Some of these problems may be partially corrected by postprocessing the results from mean shift (e.g. to remove low-density modes and clusters with few points, which are likely outliers), or by the K-modes algorithm. Finally, mean shift is computationally slow; this is addressed in section 2.7.
The mean-shift algorithm is so simple that it has probably been discovered many times. In 1975, Fukunaga and Hostetler [37] were perhaps the first to propose the idea, and they also introduced the term "mean shift". They derived the blurring version of the algorithm (BMS) for a KDE with the Epanechnikov kernel as gradient ascent on the density with a variable step size, without proving convergence or giving a stopping criterion. They observed that it could be used for clustering and dimensionality reduction (or denoising), since points converge locally to cluster centroids or medial lines for appropriate values of the bandwidth. Since 1981, the algorithm was also independently known in the statistics literature as the "mean update algorithm" (see [81] pp. 167ff and references therein). The term "blurring process" is due to Cheng [24], who discussed the convergence of both the blurring (BMS) and the nonblurring (MS) versions of mean shift. Carreira-Perpiñán [9], motivated by the problem of finding all the modes of a Gaussian mixture for multivalued regression [8, 10], independently rediscovered the algorithm for the Gaussian kernel and proved its convergence for arbitrary covariance matrices [10]. Since the early 2000s, the nonblurring mean shift has received much attention thanks to the work of Comaniciu and Meer [29], who demonstrated its success in image filtering, image segmentation and, later, tracking [30]. This was followed by work by many researchers on theoretical, computational and application issues. Algorithms similar to mean shift have appeared in scale-space clustering [90, 92, 21, 68], in clustering by deterministic annealing [69] and in pre-image finding in kernel-based methods [72].
Although MS and BMS are defined by very simple iterative schemes, they exhibit some remarkable and somewhat counterintuitive properties regarding whether they converge at all, how fast they converge, and the character of the convergence domains. The geometry of the modes of a KDE is also surprising. The relevant literature is scattered over different areas, including computer vision, statistics and machine learning. We give a summary of results, with particular attention to Gaussian kernels, without proof. More details can be found in the references cited.
Intuitively, one might expect that a sum of N unimodal kernels with bandwidth σ as in eq. (1) would have at most N modes, and that the number of modes should decrease monotonically as σ increases from zero and the different components coalesce. In general, this is only true for the Gaussian kernel in dimension one. Motivated by scale-space theory, several papers [50, 3, 95] showed that, in dimension one, the Gaussian kernel is the only kernel that does not create modes as the scale increases. It is easy to see that, in KDEs with non-Gaussian kernels (Epanechnikov, Cauchy, etc.), modes can appear as σ increases [18]. The creation of modes need not occur often, though, and some kernels (such as the Epanechnikov kernel) are more likely than others to create modes. It is less easy to see, but nonetheless true, that modes can also appear with the Gaussian kernel in dimension two, i.e., with images, and thus in any larger dimension, as shown by an example in [53].
The scale-space theory results were restricted to a single bandwidth σ, and the creation of modes does not by itself imply that a mixture with N components may have more than N modes. The results for the general case of Gaussian mixtures (GMs) are as follows. Again, there is a qualitative difference between 1D and 2D or more. A GM (with possibly different bandwidths σ_n) can have at most N modes in 1D [19], but more than N modes in 2D or above, even if the components are isotropic [20]. In 2D, if the components have diagonal or full covariance matrices, it is easy to construct examples with more modes than components [10, 19]. It is far harder to achieve this if all the components are isotropic with equal bandwidth σ, but still possible. This was first shown by a construction suggested by Duistermaat and studied by Carreira-Perpiñán and Williams [20], consisting of 3 Gaussians at the vertices of an equilateral triangle. For a narrow interval of scales (lifetime), an extremely shallow fourth mode appears at the triangle barycenter. They generalized this construction to dimension D as a regular simplex with a Gaussian at each vertex, i.e., a GM with N = D + 1 components [20]. They showed that the number of modes is either D + 1 (for small scales, with modes near the vertices), 1 (for large scales, at the simplex barycenter) or D + 2 (for a narrow range of intermediate scales, with modes at the barycenter and near the vertices). Interestingly, the lifetime of the mode created at the barycenter peaks at a finite dimension and then slowly decreases towards zero as D increases. Small perturbations (but not vanishingly small ones) of the construction prevent the creation of the extra mode, which suggests that it may be unlikely for isotropic GMs to have more modes than components. However, apart from a few isolated studies, the geometry of the modes of GMs in high dimensions is poorly understood.
As for the location of the modes of a GM, they lie in the interior of the convex hull of the data points if the components are isotropic (with possibly different bandwidths σ_n), in any dimension [10, 19], as seen in fig. 1B. With KDEs, it is easy to see this from the fact that the right-hand side of eq. (2) is then a convex combination of the data points (the weights being nonnegative and summing to one). If the components have diagonal or full covariance matrices, the modes can be outside the convex hull [10, 18, 19].
Most work on mean shift and scale-space theory tacitly assumes that the modes of a GM are finite in number, or at least isolated. Although this is sometimes claimed to be a consequence of Morse theory, no proof seems to exist for arbitrary dimension. (In dimension one it is simple to prove, assuming infinite differentiability as for the Gaussian kernel, by taking f in the following result to be the gradient of the mixture. Let f be an infinitely differentiable real function of a real variable. Then either f is identically zero or its zero crossings are isolated. Indeed, if f is zero on a nonempty interval, then so is f′. Repeating this argument for f′, f′′, etc., we obtain that all derivatives at any point of the interval are zero, hence by Taylor's theorem f is identically zero.) Indeed, kernels such as the uniform or triangular kernel do create continuous ridges of modes even in 1D.
An attractive feature of MS is that it is defined without step sizes, which makes it deterministic given the dataset and the bandwidth. However, this is only useful if it converges at all, and whether that is the case depends on the kernel used. Not all kernels give rise to convergent MS updates, even if they are valid kernels (i.e., nonnegative and integrating to one); an example where MS diverges appears in [29]. For kernels where MS diverges, it is of course possible to devise optimization algorithms that will converge to the modes, e.g. by introducing a line search, but then we lose the simplicity of the MS iteration. For kernels where MS converges, the convergence is to a mode for most initial points, but in general to a stationary point (minima and saddle points in addition to modes). Here, we review convergence results for MS, with a focus on the most common kernels (Gaussian and Epanechnikov).
With the Epanechnikov kernel, convergence occurs in a finite number of iterations [29]. The intuitive reason is that the KDE, being a sum of truncated quadratic kernels, is piecewise quadratic. There is a finite number of "pieces" (each corresponding to a possible subset of the kernels) and, within one piece, a single Newton iteration finds the maximum (indeed, the MS update coincides with a Newton step for the Epanechnikov kernel [35]).
With the Gaussian kernel, MS also converges, but in an infinite number of iterations, and the convergence rate is generally linear (sublinear or superlinear in limit cases) [10, 14]. One convenient way to see this is by relating MS to the EM algorithm [56], as follows.
In general, for mixtures of Gaussians (with arbitrary covariance matrices) or of other kernels (and thus in particular for KDEs), Gaussian MS is an expectation-maximization (EM) algorithm and non-Gaussian MS is a generalized EM algorithm [10, 14]. This can be seen by defining a certain dataset and a certain probabilistic model with hidden variables (missing data), deriving the corresponding EM algorithm to maximize the log-likelihood, and verifying that it coincides with the MS algorithm. For the GM case, the model consists of a constrained mixture of Gaussians, where each component has the given weight, mean and covariance matrix (constant, data point and bandwidth for KDEs, respectively), but the whole mixture can be freely (though rigidly) translated. The dataset consists solely of one point located at the origin. Thus, maxima of the likelihood occur whenever the translation vector is such that a mode of the mixture coincides with the origin. The missing data is the index of the mixture component that generated the origin. The resulting E step computes the posterior probabilities of eq. (3), and the M step maximizes a lower bound on the log-likelihood in closed form, giving the MS update. With non-Gaussian kernels, the M step cannot be solved in closed form, and the MS update corresponds to a single iteration towards solving this M step. (Whether this iteration actually increases the likelihood, leading to convergence, depends on the kernel.)

Viewing MS as an EM algorithm has several consequences. Convergence of Gaussian MS is assured by the EM convergence theorems [56] (which apply because the objective in the EM algorithm is continuous), and each iterate increases or leaves unchanged the density. General results for EM algorithms also indicate that the convergence order will typically be linear (with a rate dependent on the ratio of information matrices). Finally, MS can be seen as a bound optimization (see also [35]).
The convergence order can be studied in detail [14] by linearizing the MS update of eq. (3) around a mode x*, so that the update x̃ = f(x) can be written as x̃ ≈ x* + J(x*)(x − x*), where J is the Jacobian of the MS mapping f. The Jacobian of f and the Hessian of p are related as follows:

∇²p(x*) = (p(x*)/σ²) (J(x*) − I),  with  J(x*) = Σ(x*)/σ²,   (5)

where Σ(x*) = Σ_n p(n|x*) (x_n − x*)(x_n − x*)ᵀ is the local covariance matrix at x*, since (from ∇p(x*) = 0 and the definition of f) the local mean is x* itself. The eigenvalues of J(x*) lie in [0, 1) and the convergence rate is given by the largest one, with the iterates approaching x* along the eigenvector associated with that largest eigenvalue (assuming distinct eigenvalues). This shows that the convergence order depends on the bandwidth: it is nearly always linear, approaches superlinear convergence when σ → 0 or σ → ∞, and becomes sublinear where modes merge. The practically useful cases of MS use an intermediate σ, for which the rate of linear convergence (i.e., the ratio of distances to the mode after and before the update) can be close to 1, so convergence will be slow. The MS iterates smoothly approach the mode along the principal component of the local covariance matrix of the data points, from within the convex hull of the data points (see fig. 1B).

We focus on Gaussian MS on a KDE, i.e., with isotropic covariances of bandwidth σ. As seen in fig. 1B, if initialized at a point far from a mode, the first MS steps are often large and make considerable progress towards the mode. This is an advantageous property generally observed in alternating optimization algorithms (such as EM). After that, the steps become small, in agreement with the linear convergence rate.
The path followed by the MS iterates has the following properties (illustrated in fig. 1B). (1) Near convergence, the path follows the direction of the principal eigenvector of the local covariance at the mode [14]. (2) Each iterate is a convex linear combination of the data points, so the path lies in the interior of the convex hull of the data points [9, 14]. This is also true of non-Gaussian kernels whose profile is nonincreasing, so that the weights in eq. (3) are nonnegative. (3) The path is smooth in the sense that consecutive steps (consecutive mean-shift vectors x̃ − x) always make an angle in (−90°, 90°) [29]. (4) The mean-shift vector is proportional to the gradient of p, so the MS iteration is a gradient step, but the step size is not the best one (i.e., it does not maximize p) along the gradient direction.
In K-means, the centroids partition the space into Voronoi cells, which are convex polytopes. In MS, it is harder to characterize the regions into which the modes partition the space, i.e., their domains of convergence, which can in fact have surprising properties [14]. In general, they can be curved, nonconvex and disconnected (for example, one domain can be completely surrounded by another). This allows MS to represent complex-shaped clusters and is an advantage over K-means. A less desirable aspect is that the domain boundaries can show fractal behavior, although this seems confined to cluster boundaries, and could be removed if necessary by postprocessing the clusters. This fractal behavior is due to the iterated MS mapping f, and would not occur if we defined the clusters purely based on the flow lines of the gradient.
Cheng [24] proved convergence of blurring mean-shift, as follows. (1) For kernels broad enough to cover the dataset (e.g. infinite-support kernels such as the Gaussian), convergence is to a dataset with all points coincident (x₁ = ⋯ = x_N), regardless of the value of σ. This can be seen by noting that the diameter of the dataset decreases at least geometrically. (2) For finite-support kernels and small enough σ, convergence is to several clusters with all points coincident in each of them; the clusters depend on the value of σ.
Another proof can be obtained from the matrix formulation of BMS [13], since at each iteration the dataset is multiplied by a stochastic matrix P, whose entries are the posterior probabilities p(m|x_n). By the Perron–Frobenius theorem [47, ch. 8], with broad kernels P will have a single eigenvalue equal to 1 (with the vector of ones as eigenvector) and all other eigenvalues with magnitude less than 1. Since a fixed point verifies X = PX, then x_n = c for all n and some c, i.e., all points coincide. For non-broad kernels, the unit eigenvalue is multiple, resulting in multiple clusters within which all points coincide.
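In code, the matrix formulation is one multiplication per iteration: build the row-stochastic matrix P of posteriors from the current dataset and multiply. This is an illustrative sketch (our own naming), not an optimized implementation:

```python
import numpy as np

def bms_step(X, sigma):
    """One blurring mean-shift step X <- P X, where P is the row-stochastic
    random-walk matrix built from Gaussian affinities of the current dataset."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. distances
    W = np.exp(-0.5 * D2 / sigma ** 2)                   # Gaussian affinities
    P = W / W.sum(axis=1, keepdims=True)                 # posteriors p(m | x_n)
    return P @ X                                         # all points move at once
```

With a kernel broad relative to the data diameter, iterating this step collapses the whole dataset towards a single point, as the Perron–Frobenius argument predicts.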
While, for Gaussian BMS, the dataset converges to a set of zero size (and therefore zero variance), its variance need not decrease monotonically; in fact, it is easy to construct 1D examples where it increases at some steps.
Carreira-Perpiñán [13] completely characterized the behavior with the Gaussian kernel (Gaussian BMS), assuming that the dataset is infinite with a Gaussian distribution. In this case, one can work with distributions rather than finite samples, and the iteration can be written as a Gaussian integral and solved in closed form. The result is that the data distribution is Gaussian after each iteration, with the same mean, and it shrinks towards its mean independently along each principal axis, converging to it with cubic order. Specifically, the standard deviation s along a given principal axis evolves as s ← r·s with r = 1/(1 + (σ/s)²) < 1, where σ is the BMS bandwidth, and converges to 0 cubically (one step gives s ← s³/(s² + σ²) ≈ s³/σ² for s ≪ σ). The reason for this extremely fast convergence is that, since σ is kept constant but the dataset shrinks, σ/s effectively increases. Thus, at each iteration both r and s decrease. Note that the smaller the initial s is, the faster the convergence, and so the direction of largest variance (principal component) collapses much more slowly (in relative terms) than all other directions. This explains the practical behavior shown by Gaussian BMS (see fig. 6): (1) clusters collapse extremely fast (in a handful of iterations, for a suitable bandwidth); (2) after a few iterations only the local principal component survives, resulting in temporary linearly-shaped clusters (that quickly straighten). These two behaviors make BMS useful for clustering and denoising, respectively.
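The per-axis recurrence s ← s/(1 + (σ/s)²) is easy to verify numerically; once s ≪ σ it behaves like s³/σ², i.e., cubic convergence to 0 (the numbers below are only an illustration):

```python
# Per-axis standard deviation of a Gaussian dataset under Gaussian BMS:
# s <- s / (1 + (sigma/s)**2), which is ~ s**3 / sigma**2 once s << sigma.
sigma = 1.0
s = 0.5
history = [s]
for _ in range(5):
    s = s / (1.0 + (sigma / s) ** 2)
    history.append(s)
# history shrinks faster and faster: 0.5, 0.1, ~1e-3, ~1e-9, ...
```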
The same proof technique applies to the generalized Gaussian BMS algorithms that use an update of the form X ← f(P)X. With a Gaussian dataset, each iteration produces a new Gaussian with a standard deviation (separately along each principal axis, with r as above) s ← |f(r)|·s. This allows a complete characterization of the conditions and order of convergence in terms of the real function f: [0, 1] → ℝ, instead of a matrix function. Convergence occurs if |f(r)| < 1 for r ∈ [0, 1) and f(1) = 1. Depending on f, the convergence order can vary from linear to cubic and beyond [15].
As noted earlier, putting Gaussian BMS in matrix form in terms of a Gaussian affinity matrix W uncovers an intimate relation with spectral clustering. Each BMS iteration is a product of the data with P = D⁻¹W, the stochastic matrix of the random walk in a graph [25, 58] (D being the diagonal degree matrix of W), which in BMS represents the posterior probabilities of each point under the kernel density estimate (1). P is closely related to the matrix N = D^(−1/2) W D^(−1/2) (equivalent to the normalized graph Laplacian) commonly used in spectral clustering, e.g. in the normalized cut [76]: the eigenvalue/eigenvector pairs (λ, u) of N and (λ, v) of P share the same eigenvalues and satisfy v = D^(−1/2) u. In spectral clustering, given the desired number of clusters K, one computes the eigenvectors associated with the top K eigenvalues of N; in this spectral space the clustering structure of the data is considerably enhanced, so a simple algorithm such as K-means can often find the clusters. In BMS, we iterate the product X ← PX. If P were kept constant, this would be the power method [39] and each column of X would converge to the leading eigenvector of P (the vector of ones, i.e., a single cluster), with a rate of convergence given by the second eigenvalue (the Fiedler eigenvalue in spectral clustering). However, the dynamics of BMS is more complex because P also changes after each iteration. In practice, X and P quickly reach a quasi-stable state where points have collapsed into clusters which slowly approach each other, and P remains almost constant (at which point BMS is stopped). Thus, BMS can be seen as refining the original affinities into a matrix consisting of blocks of (nearly) constant value and then (trivially) extracting piecewise-constant eigenvectors for each cluster with the power method. With the generalized BMS algorithm, one uses instead the matrix f(P), which has the same eigenvectors as P but eigenvalues f(λ). However, this manipulation of the spectrum of P is performed implicitly, without actually having to compute the eigenvectors as in spectral clustering.
While both spectral clustering and BMS rely on the random-walk matrix P, they differ in several respects. They do not give the same clustering results. In spectral clustering, the user sets the desired number of clusters K and the bandwidth σ (if using Gaussian affinities), while in BMS the user sets only σ, and K is determined by it. Computationally, spectral clustering solves an eigenproblem (and then runs K-means), while BMS (especially in its accelerated version) performs a small number of matrix products (and then runs connected-components), and is thus considerably faster.
Many image processing algorithms (for example, for image denoising) operate on range variables (intensity, color) defined over the space variables (of the image lattice). This includes bilateral filtering [63], nonlinear diffusion [89] and others. Mean-shift is basically an iterated, local averaging that operates jointly on range and space, and bears both similarities and differences with those algorithms, which are described in [4]. For example, in bilateral filtering the spatial component is fixed during the iterations, so only the range variables are updated, and a stopping criterion is necessary to prevent excessive smoothing.
Mean-shift algorithms appear whenever we have an expression having the form of a sum, over data points, of a function of the squared distance of each data point to the parameter x (whether the latter is a vector, a matrix or a set thereof). Equating the gradient of this expression to zero, we can solve for x and obtain a fixed-point iteration with the form of a weighted average of the data points, where the weights depend on x. One example, mentioned below, are Riemannian centers of mass. Another example are Laplacian objective functions, of the form E(X) = Σ_{n,m} w_{nm} ‖x_n − x_m‖² = 2 tr(XLXᵀ), where the columns of X are the coordinates x_n of the data points in a low-dimensional space, W = (w_{nm}) is an affinity matrix and L = D − W is its graph Laplacian. Alternating optimization over each x_n can be done with a mean-shift algorithm. Laplacian objectives appear in algorithms for (among other problems) dimensionality reduction and clustering, such as Laplacian eigenmaps [5], the elastic embedding [16] or spectral clustering [76]. However, as noted in section 2.3, the resulting mean-shift iteration need not converge in general. This is the case, for example, when the weights or the kernels can take negative values, as in the elastic embedding. In this case, one should use a line search or a different optimization algorithm altogether.
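As a tiny illustration of this fixed-point pattern (our own code and naming; assumes symmetric, nonnegative weights), the exact minimizer of the Laplacian objective over a single point, with the others fixed, is just a weighted average of its neighbors:

```python
import numpy as np

def laplacian_coordinate_step(Xlow, W, n):
    """Exact minimizer over x_n of sum_m w_nm ||x_n - x_m||^2 with the other
    points fixed: a weighted (mean-shift-style) average, well defined when
    the weights are nonnegative with at least one positive entry in row n."""
    w = W[n].copy()
    w[n] = 0.0                     # the objective has no self term
    return (w @ Xlow) / w.sum()    # weighted average of the other points
```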
In some applications, the distribution of the data changes over time, and we want to track clusters and their location, in order to predict their location at a future time (assuming the distribution changes slowly). One example is tracking a nonrigid object over a sequence of images, such as a video of a moving car. A robust way to represent the object is by the color histogram of all the pixels within the object. In its simplest form, one can use a region of fixed shape and size but variable location, initialized by the user in the first image. Then, the most likely location of the object in the next image can be defined as the region (near the current region) having the histogram closest to the current histogram. Comaniciu et al. [30] noted that we can use a differentiable KDE instead of a histogram, and that any differentiable similarity function of two KDEs also has the form (up to constant factors) of a weighted KDE to first order (a good approximation if the object moves slowly). Hence, we can use mean-shift iterations to maximize the (linearized) similarity over the location, thus finding the location whose pixels look most like those in the previous image's region. The resulting algorithm is fast enough to track several objects in real time in video, and is robust to partial occlusion, clutter, distractors and camera motion.
Hall et al. [43] define a spatiotemporal KDE with a product kernel K(x, t) = K_x(x) K_t(t). The kernel over the spatial variables x is typically Gaussian, and the kernel over the temporal variable t considers only past observations and gives more importance to recent ones. At each new value of t, one finds modes over x by starting from the modes of the previous time value (as in scale-space theory, but using time rather than scale).
The original mean-shift algorithm is defined on the Euclidean space ℝᴰ. Sometimes, the data to be clustered lie on a low-dimensional manifold, so the mean-shift iterates and the modes should also lie on this manifold. In some applications (such as motion segmentation, diffusion tensor imaging and other computer vision problems) this is a known Riemannian manifold, such as the rotation matrices, Grassmann manifolds or symmetric positive definite matrices [6]. Subbarao and Meer [78] extended mean-shift to Riemannian manifolds by defining the squared distances appropriately with respect to the manifold (although this does not result in a proper KDE on the manifold). However, most times the manifold is not known a priori. Shamir et al. [73] extended mean-shift to 3D point clouds in computer graphics applications by constraining the mean-shift steps to lie on the surfaces of a triangulated mesh of the data. The Laplacian K-modes algorithm described later uses a different approach to find modes that are valid patterns when the data lie on nonconvex manifolds.

The mean-shift algorithm operates on data where each data point is defined by a feature vector in ℝᴰ. In some applications, the data are best represented as a weighted graph, where each vertex is a data point and an edge represents a neighborhood relation between a pair of data points, with a real-valued weight representing their distance. This allows one to work with data that need not live in a Euclidean space, and can be used to represent distances along a manifold (for example, by approximating geodesics through shortest paths over the graph).
Some algorithms have been proposed for graph data that are based on a KDE. They work by assigning to each data point a "parent" in the graph, which is another data point in its neighborhood having a higher density (roughly speaking). This results in a forest of directed trees that spans the graph, whose roots are "modes" in the sense of not having any neighboring point with higher density. Each tree corresponds to a cluster. The parent of a data point is defined as the data point that optimizes a criterion based on the KDE and the distances between data points. The crucial aspect is that this optimization is defined only over the data points rather than over the whole space (as happens with K-medoid algorithms).
One of the earliest approaches is the algorithm of [51]. This uses as criterion the difference in KDE values between a candidate parent x_m and the point x_n, divided by their distance, which is a numerical approximation to the directional derivative of the KDE. This is maximized over the data points x_m in a neighborhood of x_n (e.g. its k nearest neighbors, or the points within a distance ε). The output of this algorithm can be improved with cluster merging based on topological persistence [23].
The medoidshift algorithm of [75] is more directly related to mean-shift. It uses as criterion the Riemannian center of mass (sometimes called Fréchet mean [1, 64]), min over x of Σ_n p(n|x′) d²(x, x_n), where the posterior p(n|·) is evaluated at the current iterate x′. The algorithm then alternates computing the Riemannian center of mass (over the data points) with updating the weights. We recover mean-shift by using the squared Euclidean distance for d² and optimizing over x ∈ ℝᴰ. Unlike mean-shift (which finds maxima of the KDE), this does not seem to maximize a global objective function, but rather maximizes a local objective at each iteration and data point. The algorithm gives better clusterings and is faster in its "blurring" version, where at every iteration the entire dataset is updated. In matrix form (using the random-walk matrix P and the matrix Δ of pairwise squared distances, and updating all points synchronously), the update is a columnwise minimization of ΔP. Hence, medoidshift can be seen as a discrete filter, while the BMS iteration is a continuous filter. As in the accelerated BMS, each iteration contracts the graph, reducing the number of points. It is possible to define other variations by using different types of distance d, such as the local Tukey median in the medianshift algorithm of [74].
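One medoidshift pass in this matrix form can be sketched as follows (illustrative code and naming, not the authors' implementation; the minimization is columnwise over the data points only):

```python
import numpy as np

def medoidshift_parents(X, sigma):
    """Each point's parent is the data point minimizing the kernel-weighted
    sum of squared distances (a center of mass restricted to the dataset):
    columnwise argmin of Delta @ P."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # Delta: sq. distances
    W = np.exp(-0.5 * D2 / sigma ** 2)
    P = W / W.sum(axis=0, keepdims=True)   # column n holds posteriors p(k|x_n)
    C = D2 @ P                             # C[m, n] = sum_k d2(x_m, x_k) p(k|x_n)
    return C.argmin(axis=0)                # parent (a data point) for each point
```

Following parent links until a fixed point yields the root of each point's tree, i.e., its cluster.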
The mean-shift update as originally proposed [37] results from estimating the gradient of a density in two steps: first, one estimates the density with a KDE, and then one differentiates this. Sasaki et al. [71] directly estimate the gradient of the log-density by using score matching [49] and derive a mean-shift algorithm based on this (by equating the gradient to zero and obtaining a fixed-point iteration). In score matching, one does a least-squares fit to the true log-density gradient using a model. If the model is a linear combination of basis functions (one per data point), one obtains an update that is identical to the mean-shift update of eq. (3) but with weighted kernels (although, since these weights can be negative, this fixed-point iteration is not guaranteed to converge). A faster, parametric method is obtained by using fewer basis functions than data points. They observe better results than the original mean-shift with high-dimensional data. Note that, as seen in eq. (4), using an adaptive KDE (e.g. with bandwidths obtained using the entropic affinities) also gives a weighted mean-shift update, but the weights obey a different criterion.
Ranking data consist of permutations of a given set of items, and are discrete. Meilă and Bao [57] extended blurring mean-shift to ranking data by using the Kendall distance rather than the Euclidean distance, and by rounding the continuous average of eq. (3) to the nearest permutation after each iteration (so the algorithm stops in a finite number of steps).
Functional data, such as a collection of curves or surfaces, live in an infinite-dimensional space. Mean-shift clustering can be extended to functional data, where it corresponds to a form of adaptive gradient ascent on an estimated surrogate density [26].
The mean-shift iteration (2) or (3) is essentially a local smoothing, which provides a smooth or denoised version of a data point as a weighted average of its nearby data points. Clustering is just one specific problem that can make use of this smoothing; this was already pointed out in [37]. Here we describe work based on mean-shift for manifold denoising. Consider data lying on a low-dimensional manifold in ℝᴰ, but with noise added (as in fig. 3). If we run one mean-shift iteration for each data point in parallel, and then replace each point with its denoised version, we obtain a denoised dataset, where points have moved towards the manifold. Repeating this eventually compresses the dataset into clusters, which is the basis of the BMS algorithm, although of course it destroys the manifold structure. However, if we stop BMS very early, usually after one or two iterations, we obtain an algorithm that can remove noise from a dataset with manifold structure.
The denoising ability of local smoothing was noted independently in the computer graphics literature for the problem of surface fairing or smoothing. Earlier, the Laplacian had also been used to smooth finite element meshes [46]. In surface smoothing, 3D point clouds representing the surface of an object can be recorded using LIDAR, but usually contain noise. This noise can be eliminated by Laplacian smoothing, which replaces the 3D location of each point with the average of its neighbors [79, 80]. The neighbors of each point are obtained from a triangulated graph of the cloud, which is usually available from the scanning pattern of LIDAR. Typically, the Laplacian is constructed without the notion of a KDE or a randomwalk matrix, and both the values of its entries and the connectivity pattern are kept constant during iterations.
One important problem of BMS (or Laplacian smoothing) is that points lying near the boundary of a manifold or surface will move both orthogonally to and tangentially along the manifold, away from the boundary and thus shrinking the manifold. While this is good for clustering, it is not for surface smoothing, because it distorts the object shape, or for manifold learning, because it distorts the manifold structure. For example, in the MNIST dataset of handwritten digit images [52], motion along the manifold changes the style of a digit (e.g. slant, thickness). Various approaches, such as rescaling to preserve the object volume, have been proposed in computer graphics [33]. In machine learning, the manifold blurring mean-shift (MBMS) algorithm [86] is an extension of BMS that preserves the local manifold structure. Rather than applying to each point the mean-shift vector directly, this vector is corrected to eliminate tangential motion. The mean-shift vector is first projected onto the local tangent space to the manifold at that point (estimated using local PCA on the nearest neighbors of the point), and this projection is removed from the motion. Hence, the motion is constrained to be locally orthogonal to the manifold. Comparing the local PCA eigenvalues between the orthogonal and tangent spaces gives a criterion to stop iterating.
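One MBMS step can be sketched compactly (our own code and parameter naming, not the authors' implementation: k nearest neighbors for the local PCA, tangent dimension d): compute mean-shift vectors, estimate each point's tangent space with local PCA, and keep only the orthogonal part of the motion.

```python
import numpy as np

def mbms_step(X, sigma, k, d):
    """One manifold blurring mean-shift step: move each point by its
    mean-shift vector minus that vector's projection on the local tangent
    space (the d leading principal directions of its k nearest neighbors)."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-0.5 * D2 / sigma ** 2)
    ms = (W / W.sum(axis=1, keepdims=True)) @ X - X   # mean-shift vectors
    Xnew = X.copy()
    for n in range(len(X)):
        nbrs = np.argsort(D2[n])[1:k + 1]             # k nearest neighbors
        Z = X[nbrs] - X[nbrs].mean(axis=0)
        _, _, Vt = np.linalg.svd(Z, full_matrices=False)
        U = Vt[:d].T                                  # local tangent basis
        Xnew[n] = X[n] + ms[n] - U @ (U.T @ ms[n])    # orthogonal motion only
    return Xnew
```

On a noisy curve, this shrinks the noise (motion orthogonal to the manifold) while leaving the extent along the manifold essentially unchanged.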
Fig. 4 shows MNIST images before and after denoising with MBMS. The digits look smoother (as if they had been antialiased to reduce pixelation) and easier to read (compare the original and denoised versions), and indeed a classifier trained on the denoised data performs better [86]. While this smoothing homogenizes the digits somewhat, it preserves the distinctive style aspects of each digit. MBMS performs a sophisticated denoising (very different from simple averaging or filtering) by intelligently closing loops, removing or shortening spurious strokes, enlarging holes, removing speckle noise and, in general, subtly reshaping the digits while respecting their orientation, slant and thickness. MBMS has also been applied to the identification of the structure of tectonic faults from seismic data [41]. (Manifold) denoising can be used as a preprocessing stage for several tasks, leading to more robust models, such as dimensionality reduction [86], classification [44, 86], density estimation [42] or matrix completion [88].
In some algorithms, one can establish a continuum between clustering and dimensionality reduction, with clustering being the extreme case of dimensionality reduction where the manifold dimensionality is zero. This is the case with MBMS, where a tangent space of dimension zero recovers BMS. Another case is spectral clustering and spectral dimensionality reduction, e.g. normalized cut [76] vs Laplacian eigenmaps [5]: both operate by extracting eigenvectors from the graph Laplacian, but in spectral clustering the number of eigenvectors equals the number of clusters, while in spectral dimensionality reduction one cluster is assumed and the eigenvectors correspond to degrees of freedom along the manifold defined by the cluster.
(Figure: denoising results with MBMS and with BMS.)
Computationally, mean-shift algorithms are slow, because their complexity is quadratic in the number of points (which would limit their applicability to small datasets), and many iterations may be required to converge (for MS). Significant work has focused on faster, approximate mean-shift algorithms, using a variety of techniques such as subsampling, discretization, search data structures [70], numerical optimization [61] and others, which we review below. Some of them are specialized to image segmentation while others apply to any data. A practical implementation of a fast mean-shift algorithm could combine several of these techniques. In addition, although it does not seem to have been studied in detail, mean-shift algorithms benefit from embarrassing parallelism, since the iterations for each point proceed independently of the rest of the points.
With the Epanechnikov kernel, mean-shift converges in a finite number of iterations. If the neighborhood structure of the data is known a priori, as is the case in image segmentation [29], MS can be very efficient without any approximations, since the iteration for each point involves only its neighbors.
With the Gaussian kernel and with arbitrary data, for which the neighborhood structure is not known and varies over iterations, the runtime of MS can be large, particularly with large datasets, and much work has tried to accelerate Gaussian MS. Computationally, MS has two bottlenecks [12]: (1) accurately converging to a mode can require many iterations, since convergence is typically linear and can approach sublinearity in regions where two modes are close; (2) each iteration is linear in the number of points N, because it is an average of the data. Thus, running MS for the entire dataset is O(kN²D), where k is the average number of iterations per point and D the dimension. For example, in image segmentation, typical values are 10 to 100 for k (depending on the accuracy sought and the bandwidth σ), around 5 for D (higher if using texture features) and thousands to millions for N (the number of pixels). Acceleration algorithms should attack either or both bottlenecks while keeping a low approximation error, i.e., producing a clustering (assignment of points to the modes found) that is close to that of the naive MS; otherwise one would really be running a different clustering algorithm.
As for accelerating the convergence, the obvious answer is to use Newton's method (with a modified Hessian to ensure ascent), because it has quadratic convergence, computing the Hessian of a Gaussian mixture is simple [9], and Newton iterations are typically not much costlier than those of MS. The reason is that, for the low-dimensional problems for which MS is best suited, computing the Newton direction (which involves solving a linear system) is not costly compared to computing the gradient or Hessian themselves. However, Newton's method does have other problems with MS: (1) it introduces user parameters such as step sizes or damping coefficients; (2) it need not be effective far from a mode, where the Hessian is indefinite; (3) since Newton's method is very different from MS, it can make a point converge to a different mode than under MS. A solution is an MS–Newton method [12] that starts with MS iterations (which often suffice to move the iterate relatively close to a mode) and then switches to Newton's method.
As for reducing the cost of one iteration below O(N), the most obvious strategy is to approximate the average in each iteration with a small subset of the data, namely the points closest to the current iterate. This can be done by simply ignoring faraway points in the average, which introduces an approximation error in the modes, or by updating faraway points infrequently, as in the sparse EM algorithm of [12], which guarantees convergence to a true mode. Unfortunately, either way this requires finding nearest neighbors, which is itself costly unless we know the neighborhood structure a priori. And, if the bandwidth is relatively large (which is the case if we want a few clusters), then a significant portion of the data has non-negligible weight in the average. Another approach is to use N-body methods such as the (improved) fast Gauss transform [40, 93] or various tree structures [85]. These can reduce the cost of one iteration for every data point to O(N) or O(N log N), respectively. However, this implies we run synchronous iterations (all points move at once), and it does not scale to high dimensions (because the data structures involved grow exponentially with the dimension), although it is generally adequate for the low dimensions used in image segmentation.
A final approach consists of eliminating many of the iterations for many of the points altogether, in effect predicting which mode a given point will converge to. This cannot be done by simply running MS for a small subset of the points and then assigning the remaining points to the mode of their closest point in the subset, because mean-shift clusters have complex shapes that significantly differ from a nearest-neighbor clustering, and the clustering error would be large. A very effective approach for image segmentation is the spatial discretization approximation of [12]. This is based on the fact that the trajectories followed by different points as they converge towards a given mode collect close together (see section 2.3 and fig. 1B). Thus, we can run the naive MS for a subset of the points, keeping track of the trajectories they followed. For a new point, as soon as we detect that its iterate is close to an existing trajectory, we can stop and assign it to that trajectory's mode. With image segmentation (even if the number of features, and thus dimensions, is large), the trajectories can be coded by discretizing the image plane into subpixel cells and marking each cell, the first time an iterate visits it, with the mode it converges to. Since nearly all these cells end up empty, the memory required is small. This reduces the average number of iterations per point to nearly 1, with only a very small clustering error [12]. This error can be controlled through the subpixel size.
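The discretization idea can be sketched as follows (illustrative code, ours, using a generic grid over the feature space instead of the image-plane cells of [12]): cache each visited cell with the mode its trajectory leads to, and stop any later point as soon as it enters a known cell.

```python
import numpy as np

def ms_discretized(X, sigma, cell=0.05, tol=1e-5, max_iter=300):
    """Mean-shift over all points where each iterate's grid cell is cached
    with the label of the mode it leads to; later points stop early as soon
    as they enter an already-labeled cell."""
    cache, modes, labels = {}, [], []
    for x0 in X:
        x, visited, label = x0.astype(float), [], None
        for _ in range(max_iter):
            key = tuple(np.floor(x / cell).astype(int))
            if key in cache:                 # hit a known trajectory: stop
                label = cache[key]
                break
            visited.append(key)
            w = np.exp(-0.5 * ((X - x) ** 2).sum(1) / sigma ** 2)
            x_new = (w / w.sum()) @ X
            if np.linalg.norm(x_new - x) < tol:
                break
            x = x_new
        if label is None:                    # converged: match or create a mode
            for j, m in enumerate(modes):
                if np.linalg.norm(m - x) < 10 * cell:
                    label = j
                    break
            if label is None:
                label = len(modes)
                modes.append(x)
        for key in visited:                  # label this trajectory's cells
            cache[key] = label
        labels.append(label)
    return np.array(labels), modes
```

The cell size plays the role of the subpixel size above: smaller cells mean fewer early stops but a smaller clustering error.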
Yuan et al. [94] show that, if x̃ is the result of applying a mean-shift step to a point x, then all the points within a distance ‖x̃ − x‖ of x̃ have a KDE value at least p(x). Not all these points actually converge to the same mode as x, but many do. A fast mean-shift algorithm results from not running mean-shift iterations for all those points and simply assigning them to the same mode as x. Hence, as in the discretization approach above, only a few data points actually run mean-shift iterations. The approximation error was not controlled in [94], although one could instead use a fraction of that distance and use it to control the error.
Paris and Durand [62] combine several techniques to accelerate mean-shift image segmentation, including spatial discretization and 1D convolutions (classifying most pixels without iterating, as above), and construct a hierarchy of clusterings based on the notion of topological persistence from computational geometry [34]. The algorithm is fast enough to segment large images and video, although it is not clear how good an approximation it is to the true mean-shift.
As the dimension of the data points increases, searching for nearest neighbors (to compute the mean-shift update in eq. (2) or (3) with a truncated kernel) becomes a bottleneck. One can then use approximate nearest-neighbor algorithms; for example, Locality-Sensitive Hashing (LSH) [2] is used in [38].
A simple and very effective acceleration that has essentially zero approximation error was given in [13]. It consists of interleaving connected-components and blurring steps. The fact that BMS collapses clusters to a single point suggests that, as soon as one cluster collapses, we could replace it with a single point with a weight proportional to the cluster's number of points. This will be particularly effective if clusters collapse at different speeds, which happens if they have different sizes, as predicted in section 2.3; e.g. see fig. 6. The total number of iterations remains the same as for the original BMS, but each iteration uses a dataset with fewer points and is thus faster. Specifically, the Gaussian kernel density estimate is now p(x) = Σ_m π_m K((x − x_m)/σ), with weights π_m ≥ 0 summing to 1, and the posterior probabilities in the update of eq. (3) are weighted accordingly. At the beginning π_n = 1/N for every point, and when clusters i and j merge, the combined weight is π_i + π_j. In the matrix notation of section 2.1, the weights simply enter the affinity and random-walk matrices. This can be proven to be equivalent to the original BMS.
The reduction step where coincident points are replaced with a single point can be approximated by a connected-components step where points closer than ε are considered coincident, where ε takes the same value as in the final connected-components step of BMS (fig. 2B). Thus, ε is the resolution of the method (below which points are indistinguishable), and while BMS applies it only after having stopped iterating, the accelerated version applies it at each iteration. Hence, the accelerated BMS algorithm alternates a connected-components graph contraction with a BMS step.
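An illustrative sketch of this accelerated BMS (our own code and naming; a simple O(N²) connected-components pass stands in for the graph contraction): points closer than eps are merged into one weighted point after every blurring step.

```python
import numpy as np

def merge_close(X, w, eps):
    """Connected components at resolution eps: points within eps are linked;
    each component becomes one point (weighted mean, summed weight)."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    comp, c = -np.ones(len(X), dtype=int), 0
    for i in range(len(X)):
        if comp[i] >= 0:
            continue
        stack, comp[i] = [i], c
        while stack:                         # flood-fill one component
            j = stack.pop()
            for k in np.where((D[j] < eps) & (comp < 0))[0]:
                comp[k] = c
                stack.append(k)
        c += 1
    Xm = np.stack([np.average(X[comp == k], axis=0, weights=w[comp == k])
                   for k in range(c)])
    wm = np.array([w[comp == k].sum() for k in range(c)])
    return Xm, wm

def accelerated_bms(X, sigma, eps, iters=30):
    """Weighted BMS: each step uses kernels weighted by the merged points'
    weights, then coincident (within eps) points are merged, so the dataset
    shrinks as clusters collapse."""
    X, w = X.astype(float), np.ones(len(X))
    for _ in range(iters):
        D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        K = np.exp(-0.5 * D2 / sigma ** 2) * w      # weight-scaled kernels
        X = (K / K.sum(axis=1, keepdims=True)) @ X  # weighted BMS step
        X, w = merge_close(X, w, eps)
        if len(X) == 1:
            break
    return X, w
```

Here the weights count how many original points each surviving point represents, so the final weights give the cluster sizes directly.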
Experimentally with image segmentation, a remarkable result arises: the runtime of this accelerated BMS is almost constant over bandwidth values, unlike that of both MS and BMS; it corresponds to about 4 to 5 BMS iterations and is dominated by the first few iterations, where almost no merging occurs. This translates into a speedup factor of 2 to 4 over BMS and of 5 to 60 over MS [13].
As discussed earlier, equating modes with clusters as in mean-shift means we lose direct control over the number of clusters and do not differentiate between meaningful and non-meaningful modes. Recently, a K-modes algorithm [17] has been proposed that addresses these problems (note that there exists another algorithm called "K-modes" [48, 22], but it is defined for categorical, rather than continuous, data). The (Gaussian) K-modes algorithm takes both K and the bandwidth $\sigma$ as user parameters and maximizes the objective function
\[ \max_{\mathbf{C},\mathbf{Z}}\; \sum_{k=1}^{K}\sum_{n=1}^{N} z_{nk}\, G\!\left(\Bigl\|\frac{\mathbf{x}_n-\mathbf{c}_k}{\sigma}\Bigr\|^2\right) \quad \text{s.t.}\; \sum_{k=1}^{K} z_{nk}=1,\; z_{nk}\in\{0,1\} \tag{6} \]
over the cluster centroids $\mathbf{c}_1,\dots,\mathbf{c}_K$ and the assignments $z_{nk}$ of points to clusters, where $G$ is the Gaussian kernel. For a given assignment $\mathbf{Z}$, this can be seen as the sum of a KDE as in eq. (1) but computed separately for each cluster. Thus, a good clustering must move centroids to local modes, but also define separate KDEs. This naturally combines the idea of clustering through binary assignment variables (as in K-means or K-medoids) with the idea that, for suitable bandwidth values, high-density points are representative of a cluster (as in mean-shift). As a function of the bandwidth $\sigma$, the K-modes objective function becomes K-means for $\sigma\to\infty$ and a version of K-medoids for $\sigma\to 0$. The training algorithm alternates an assignment step as in K-means, where each data point is assigned to its closest centroid, with a mode-finding step as in mean-shift, but run only for each centroid in the KDE defined by its current cluster (rather than for each data point). A more robust, homotopy-based algorithm results from starting with $\sigma=\infty$ (i.e., K-means) and gradually decreasing $\sigma$ while optimizing the objective for each value. The computational complexity of this algorithm is the same as for K-means ($\mathcal{O}(KND)$ per iteration), although it requires more iterations, and it is thus much faster than mean-shift ($\mathcal{O}(N^2 D)$ per iteration).
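The alternating scheme above can be sketched as follows. This is an illustrative NumPy sketch, not the reference implementation of [17]; the crude first-K initialization and the fixed iteration counts are simplifying assumptions of ours.

```python
import numpy as np

def k_modes(X, K, sigma, n_iter=20, ms_steps=10):
    """Sketch of the Gaussian K-modes alternating scheme:
    a K-means-style assignment step plus a per-cluster mean-shift step.
    (Illustrative: simple first-K initialization, fixed iteration counts.)"""
    X = np.asarray(X, float)
    C = X[:K].copy()                              # crude deterministic init
    z = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assignment step: each point goes to its closest centroid.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        z = d2.argmin(axis=1)
        # Mode-finding step: move each centroid uphill in the KDE
        # defined by its own cluster only (a few mean-shift updates).
        for k in range(K):
            Xk = X[z == k]
            if len(Xk) == 0:
                continue
            c = C[k]
            for _ in range(ms_steps):
                w = np.exp(-0.5 * ((Xk - c) ** 2).sum(-1) / sigma**2)
                c = (w / w.sum()) @ Xk            # mean-shift update
            C[k] = c
    return C, z
```

A production version would use a better initialization (e.g. K-means++) and the homotopy over $\sigma$ described in the text.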
K-modes obtains exactly K modes even if the KDE of the full data has more or fewer than K modes, because it splits the data into K KDEs. Using the homotopy-based algorithm, it tends to track a major mode in each cluster and to avoid outliers. Thus, K-modes can work well even in high dimensions, unlike mean-shift.
The fundamental disadvantage of K-modes is that the clusters it defines are convex, as with K-means, since they are the Voronoi cells defined by the modes. This is solved in the Laplacian K-modes algorithm [87], which minimizes the objective function
\[ \min_{\mathbf{C},\mathbf{Z}}\; \frac{\lambda}{2}\sum_{n,m=1}^{N} w_{nm}\,\bigl\|\mathbf{z}_n-\mathbf{z}_m\bigr\|^2 \;-\; \sum_{k=1}^{K}\sum_{n=1}^{N} z_{nk}\, G\!\left(\Bigl\|\frac{\mathbf{x}_n-\mathbf{c}_k}{\sigma}\Bigr\|^2\right) \quad \text{s.t.}\; \mathbf{Z}\mathbf{1}=\mathbf{1},\; \mathbf{Z}\ge\mathbf{0} \tag{7} \]
This relaxes the hard assignments of K-modes to soft assignments $z_{nk}\in[0,1]$ with $\sum_k z_{nk}=1$ (so each point may belong to different clusters in different proportions) and adds a term to the objective, with a weight $\lambda$, that encourages neighboring points to have similar soft assignments. This term can be equivalently written as $\lambda\,\mathrm{tr}(\mathbf{Z}^T\mathbf{L}\mathbf{Z})$ in terms of the graph Laplacian $\mathbf{L}=\mathbf{D}-\mathbf{W}$ constructed from an affinity matrix $\mathbf{W}=(w_{nm})$, where $\mathbf{D}$ is the degree matrix. The training algorithm is as with K-modes, but the assignment step becomes a convex quadratic program on $\mathbf{Z}$, which can be solved efficiently with a variety of solvers.
Laplacian K-modes becomes the K-modes algorithm for $\lambda=0$, and a Laplacian K-means algorithm for $\sigma\to\infty$. The introduction of the Laplacian term allows the clusters to be nonconvex, as happens with other Laplacian-based algorithms such as spectral clustering. However, here we solve a quadratic program rather than a spectral problem, and we obtain a KDE, centroids (= modes) and soft assignments of points to clusters (unlike spectral clustering, which returns only hard assignments).
An out-of-sample mapping to predict the soft assignment $\mathbf{z}(\mathbf{x})$ for a test point $\mathbf{x}$ not in the original training set can be obtained by augmenting (7) with $\mathbf{x}$ and solving for $\mathbf{z}$ while keeping the centroids and the soft assignments of the training points fixed. The result equals the projection onto the probability simplex of the average of two terms: the average of the assignments of $\mathbf{x}$'s neighbors, and a term dependent on the distances of $\mathbf{x}$ to the centroids [87]. The assignment $z_k(\mathbf{x})$ can be readily interpreted as a posterior probability of $\mathbf{x}$ belonging to cluster $k$. Hence, Laplacian K-modes can be seen as incorporating nonparametric posterior probabilities into mean-shift (or spectral clustering). The usual way of obtaining posterior probabilities in clustering is through a mixture model, for which an EM algorithm that maximizes the likelihood is derived. In contrast to this parametric approach, in Laplacian K-modes the assignments optimize an objective function designed specifically for clustering, unlike the likelihood; the clusters are not obliged to follow a model (such as a Gaussian); and, consequently, the optimization algorithm does not depend on the particular type of clusters (unlike the EM algorithm, which does depend on the cluster model). These soft assignments or posterior probabilities help to estimate the uncertainty in the clustering and have other uses. For example, in image segmentation they can serve as a smooth pixel mask for image matting, conditional random fields or saliency maps.
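The projection onto the probability simplex that appears in this out-of-sample mapping can be computed with the standard sort-based algorithm; a minimal sketch (the function name is ours):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex
    {z : z >= 0, sum(z) = 1}, via the standard sort-based algorithm."""
    v = np.asarray(v, float)
    u = np.sort(v)[::-1]                          # sort descending
    css = np.cumsum(u) - 1.0
    # largest index rho with u_rho * (rho+1) > cumsum(u)_rho - 1
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)
```

In Laplacian K-modes, the out-of-sample assignment would then be `project_simplex` applied to the average of the two terms described above (neighbors' assignments and the centroid-distance term).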
In centroid-based clustering algorithms, each cluster is associated with a centroid: the cluster mean in K-means, an exemplar (data point) in K-medoids, and the KDE mode in mean-shift, K-modes and Laplacian K-modes. A desirable property of centroids, which makes them interpretable, is that they should be valid patterns and be representative of their cluster ("look" like a typical pattern in the cluster). A well-known disadvantage of K-means is that the mean of a nonconvex cluster need not be a valid pattern, since it may lie in a low-density area. This is often the case with clusters having manifold structure. For example, given a sequence of rotated digit-1 images, their mean is not a valid digit-1 image itself. In K-medoids, exemplars (data points) are valid patterns by definition, although when the data is noisy, individual exemplars may look somewhat atypical (see the manifold denoising discussion in section 2.6). Modes can strike an optimal tradeoff, being typical, valid and yet denoised representatives, as follows. First, modes by definition lie in high-density areas, so in this sense they are representative of the KDE and the cluster. Second, a mode can be characterized as a scale-dependent, local average of the data. Indeed, a mode is a maximum of the density $p(\mathbf{x})$, so the gradient of $p$ at the mode is zero, and this implies that the mode equals the weighted average of the data in the sense of eq. (2) and (3) (i.e., a fixed point $\mathbf{x}=\mathbf{f}(\mathbf{x})$ of the mean-shift update), as seen in section 2. As a function of the scale, a mode spans a continuum between the regular mean of the whole data ($\sigma\to\infty$) and any individual data point ($\sigma\to 0$). With an intermediate bandwidth in the KDE, modes are local averages of the data and so can remove noise that affects each individual exemplar, thus being even more prototypical than actual data points. In this sense, (Laplacian) K-modes achieves a form of intelligent denoising similar to that of MBMS (section 2.6).
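The scale-dependent behavior of the mean-shift update is easy to see numerically: with a huge bandwidth, one step returns essentially the global mean of the data, while with a tiny bandwidth it returns essentially the nearest data point. A small illustrative sketch:

```python
import numpy as np

def ms_update(x, X, sigma):
    """One Gaussian mean-shift step: a weighted average of the data,
    with weights given by the posterior probabilities p(n|x)."""
    w = np.exp(-0.5 * ((X - x) ** 2).sum(-1) / sigma**2)
    return (w / w.sum()) @ X

X = np.array([[0.0], [1.0], [10.0]])
big = ms_update(np.array([2.0]), X, sigma=1e3)     # huge bandwidth: ~ global mean
small = ms_update(np.array([0.9]), X, sigma=1e-2)  # tiny bandwidth: ~ nearest point
```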
Table 1 compares several clustering algorithms in terms of whether their centroids are valid patterns, whether they can model nonconvex clusters, whether they estimate a density, and whether the cluster assignments are hard or soft.
Table 1. Comparison of clustering algorithms.

                     K-means         K-medoids  Mean-shift  Spectral clust.  K-modes  Laplacian K-modes  Gaussian mixture
Centroids            likely invalid  "valid"    "valid"     N/A              valid    valid              likely invalid
Nonconvex clusters   no              depends    yes         yes              no       yes                to some extent
Density              no              no         yes         no               yes      yes                yes
Assignment           hard            hard       hard        hard             hard     soft, nonparam.    soft, param.
We illustrate MS and BMS in the clustering application where they have been most effective (image segmentation), and point out clustering applications where they are not so effective (high-dimensional data or data with manifold structure), as well as applications beyond clustering where MS and BMS have been used. Matlab code for most of the algorithms described may be obtained from the author.
MS and BMS are most effective with low-dimensional data, and their most successful application is image segmentation [29]. They are applied as follows. Given an image, we consider each pixel as a data point in a space consisting of two types of features: spatial features (the x and y location of a pixel) and range features (the intensity or grayscale value, the color values, texture features, etc. of the pixel). The range features are rescaled so that they span approximately the same range as the spatial features. This way, all features and the bandwidth have pixel units. For example, for the image of fig. 5, the original intensity values are rescaled so that a feature vector (x, y, i) corresponds to the pixel located at coordinates (x, y) whose intensity equals i% of the maximum intensity (white). The precise scaling will affect the clustering and should be done carefully. Using spatial features is beneficial because they introduce spatial coherence (nearby pixels tend to belong to the same cluster), although sometimes only the range features are used. One should use a perceptually uniform color space such as LAB rather than RGB, so that Euclidean distances approximately match perceptual differences in color [36].
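Building the joint spatial-range feature matrix can be sketched as follows; the default `range_scale` here is only an illustrative heuristic, and as noted above the exact scaling should be chosen carefully:

```python
import numpy as np

def pixel_features(img, range_scale=None):
    """Build the (x, y, intensity) feature matrix for mean-shift image
    segmentation. `range_scale` rescales intensities to roughly span the
    spatial range, so all features share pixel units (a common heuristic;
    the default below is illustrative and problem dependent)."""
    H, W = img.shape
    if range_scale is None:
        range_scale = max(H, W) / img.max()       # illustrative default
    ys, xs = np.mgrid[0:H, 0:W]                   # pixel coordinates
    return np.column_stack([xs.ravel(), ys.ravel(),
                            range_scale * img.ravel().astype(float)])
```

Each row is then one data point for MS or BMS; for color images one would stack the rescaled color channels instead of a single intensity column.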
[Fig. 5. A: original image; B: MS segmentation; C: data points and iterates; D: data/iterates (2D projection).]
Fig. 5 shows an example with a grayscale image. Each pixel gives a data point in 3D (location x, y and intensity i), so the dataset contains as many points as pixels. Fig. 5 shows the result with MS, while fig. 6 shows the result with BMS. As noted earlier, MS and BMS generally give similar clusterings over a range of bandwidths.
With the Gaussian kernel, reasonable clusterings arise for bandwidths that are a small fraction of the image size (vertical or horizontal), although one should explore a range of bandwidths. In some computer vision applications, MS is used with a very small bandwidth in order to oversegment an image into uniform patches, which are then processed further.
Fig. 7 illustrates two situations where MS (and BMS) do not produce good clusterings. The first is data with manifold structure (of lower dimension than the feature space), as in the 1D spirals in a 2D space shown in the left panel. The KDE will have either many modes spread over the data manifolds (for smaller bandwidths) or few modes that mix manifolds (for larger bandwidths). No bandwidth value will cluster the spirals correctly. The Laplacian K-modes algorithm is able to cluster the spirals correctly, while estimating a reasonable KDE and a mode for each spiral [87].
The second situation is high-dimensional data, as in the MNIST handwritten digits [52], where each data point is a grayscale image of 28×28 pixels (i.e., 784 feature dimensions). In high-dimensional spaces, MS tends to produce either a single mode (for larger bandwidths), which is uninformative, or many modes (for smaller bandwidths), which often sit on small groups of points or outliers. In the results of fig. 7 (right panel), the MS bandwidth was tuned to produce K clusters, and K-means and Laplacian K-modes were run with the same K. Since the data forms nonconvex clusters, K-means produces centroids that average distant data points, possibly of different digit classes, and thus do not look like digits. Again, Laplacian K-modes can partition the data into exactly K clusters, while finding for each a mode that looks like a valid digit and is more representative of its class [87].


In (univalued) regression, we assume there exists a mapping y = f(x) which assigns a unique output vector y to each possible input x (i.e., f is a function in the mathematical sense). We estimate f given a training set of input-output pairs. This is the usual regression setting. In multivalued regression, the mapping assigns possibly multiple output vectors to each input x. A classical example is learning the inverse of a (univalued) mapping. For example, the inverse of y = x² is x = ±√y, which has 0, 1 or 2 outputs depending on the value of y (smaller than, equal to or greater than zero, respectively). Carreira-Perpiñán [8, 10, 11] proposed to represent a multivalued mapping by the modes of the conditional distribution p(y|x); hence, in this case, we apply mean-shift to a conditional distribution. One advantage of this approach is that the number of inverses is selected automatically for each value of x: essentially, it is the number of "clusters" found by mean-shift in p(y|x). Once the modes have been found, one can also estimate error bars for them from the local Hessian at each mode [8, 10]. The conditional distribution may be obtained from a Gaussian KDE or Gaussian mixture
for the joint distribution of x and y (since the marginal and conditional distributions of any Gaussian mixture are also Gaussian mixtures), but it can also be learned directly, for example using mixture density networks or particle filters [7]. Fig. 8 illustrates two examples. On the left panels, we learn the inverse kinematics mapping of a robot arm [65, 66]. The forward kinematics mapping is univalued and gives the position in the workspace of the arm's end-effector given the angles at the joints; the inverse mapping is multivalued, as shown by the elbow-up and elbow-down configurations. In this case, the conditional distribution was learned using a particle filter. On the right panel, we learn to map speech sounds to tongue shapes (articulatory inversion) [67]. The American English /r/ sound (as in "rag" or "roll") can be produced by one of two tongue shapes: bunched or retroflex. The figure shows, for a specific /r/ sound, 3 modes in tongue-shape space (two bunched and one retroflex). In both inverse kinematics and articulatory inversion, using the conditional mean (shown in magenta), or equivalently doing univalued least-squares regression, leads to invalid inverses.
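For a Gaussian KDE on the joint data, p(y|x) is itself a Gaussian mixture over the training outputs y_n with x-dependent weights, so the multiple outputs can be found by running mean-shift in y only. A sketch under that assumption (shared isotropic bandwidth; the function name and the mode-merging tolerance are illustrative choices of ours):

```python
import numpy as np

def conditional_modes(x, X, Y, sigma, n_iter=200, tol=1e-8):
    """Multivalued regression sketch: modes of p(y|x) for a Gaussian KDE
    on the joint data (X, Y) with shared isotropic bandwidth sigma.
    The conditional is a Gaussian mixture over the y_n with x-dependent
    weights pi_n, so mean-shift in y finds the multiple outputs."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    pi = np.exp(-0.5 * ((X - x) ** 2).sum(-1) / sigma**2)
    pi = pi / pi.sum()                            # conditional mixture weights
    found = []
    for y0 in Y:                                  # start mean-shift from every y_n
        y = y0.copy()
        for _ in range(n_iter):
            w = pi * np.exp(-0.5 * ((Y - y) ** 2).sum(-1) / sigma**2)
            y_new = (w / w.sum()) @ Y             # conditional mean-shift update
            if np.abs(y_new - y).max() < tol:
                break
            y = y_new
        found.append(y_new)
    modes = []                                    # merge near-coincident modes
    for m in found:
        if all(np.abs(m - u).max() > sigma / 10 for u in modes):
            modes.append(m)
    return np.array(modes)
```

For example, given training pairs sampled from y = x² with inputs x = u² and outputs y = u, querying x = 0.25 recovers the two inverses near ±0.5 automatically.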
[Fig. 8. Left: inverse kinematics of a robot arm (joint angles θ1, θ2; workspace positions x1, x2). Right: articulatory inversion of the /r/ sound.]
Mean-shift algorithms are based on the general idea that locally averaging data moves it toward higher-density, and therefore more typical, regions. Iterating this can be done in two distinct ways, depending on whether the dataset itself is updated: mode finding (MS) or smoothing (BMS). Both can be used for clustering, but also for manifold denoising, multivalued regression and other tasks.
In practice, mean-shift algorithms are very attractive because, being based on nonparametric kernel density estimates, they make few assumptions about the data and can model nonconvex clusters. The user need only specify the desired scale of the clustering, not the number of clusters itself. Although computationally slow, mean-shift algorithms can be accelerated and scale to large datasets. They are very popular with low-dimensional data having nonconvex clusters, as in image segmentation. The (Laplacian) K-modes algorithms remain nonparametric but still perform well with high-dimensional data and outliers.
Mean-shift is intimately related to kernel density estimates and Gaussian mixtures, and to neighborhood graphs. The graph Laplacian, or random-walk matrix P, arises as a fundamental object in clustering and dimensionality reduction that encapsulates geometric structure of a dataset. Although not usually seen this way, MS/BMS algorithms are essentially based on P: they alternate between smoothing the data with P (a power iteration), with possibly modified eigenvalues, and updating P itself.
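The "smoothing with the random-walk matrix" view can be written in a few lines. This sketch uses a row-vector convention (points as rows, so the update reads X ← PX; the text's X ← XP notation corresponds to points as columns):

```python
import numpy as np

def bms_step(X, sigma):
    """One blurring mean-shift step seen as smoothing with the
    random-walk matrix P = D^{-1} W of the Gaussian affinity W.
    X has one data point per row, so the update is X <- P X."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-0.5 * D2 / sigma**2)              # Gaussian affinities
    P = W / W.sum(axis=1, keepdims=True)          # row-stochastic: D^{-1} W
    return P @ X                                  # smooth the data; P is rebuilt next step
```

Because P is recomputed from the smoothed data at every step, BMS is not a fixed power iteration, but each individual step is exactly one application of the current random-walk matrix.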
Some directions for future research are as follows. (1) Understanding the geometry of Gaussian mixtures and their modes in high dimensions. (2) Finding a meaningful extension of (blurring) mean-shift to directed graphs. (3) Designing, or learning, random-walk or Laplacian matrices that are optimal for clustering in some sense. (4) Finding a meaningful definition of clusters based on "bumps" of a kernel density estimate, i.e., distinct regions of high probability, rather than on modes.
I thank Weiran Wang and Marina Meilă for comments on the paper.
Consider an undirected graph G = (V, E) with vertex set V and edge set E. A connected component of G is a maximal subset of V such that every pair of vertices in it can be connected through a path using edges in E [31]. Hence, the vertex set can be partitioned into connected components. The connected component of a given vertex v can be found by depth-first search (DFS), which recursively follows edges adjacent to v until all vertices reachable from v have been found. This can be repeated for the remaining vertices until all connected components have been found, with a total runtime of O(|V| + |E|).
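The DFS-based procedure can be written compactly as follows (iterative rather than recursive, to avoid recursion limits on large components; names are ours):

```python
def connected_components(n_vertices, edges):
    """Connected components of an undirected graph by depth-first search.
    Returns a component label per vertex; runtime O(|V| + |E|)."""
    adj = [[] for _ in range(n_vertices)]         # adjacency lists
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    label = [-1] * n_vertices                     # -1 = not yet visited
    c = 0
    for s in range(n_vertices):
        if label[s] != -1:
            continue
        stack = [s]                               # iterative DFS from s
        label[s] = c
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if label[v] == -1:
                    label[v] = c
                    stack.append(v)
        c += 1
    return label
```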
Connected-components is a clustering algorithm in its own right. To use it, one provides the matrix of distances between every pair of data points and a threshold ε, and defines a graph having the data points as vertices and an edge between two points whenever their distance is at most ε (this is sometimes called an ε-ball graph). The connected components of this graph give the clusters.
Connected-components gives poor clustering results unless the clusters are tight and clearly separated. However, it can be reliably used as a postprocessing step with mean-shift algorithms, as described in the main text, to merge points that ideally would be identical. For example, with MS, the final iterates for points that in the limit would converge to the same mode are numerically different from each other (by a small amount, if using a sufficiently accurate stopping criterion in the mean-shift iteration). Hence, these iterates form a tight cluster around their mode, which is widely separated from the tight clusters corresponding to other modes.
Naively implemented (fig. 9, left), connected-components would run in O(N²D) time, because constructing the graph requires thresholding all pairwise distances, each of which costs O(D). However, in the case of tight, clearly separated clusters, we need not construct the graph explicitly, and the algorithm runs in O(NCD) if there are C connected components. Let the data satisfy the following "tight clusters" assumption: there exists ε that is larger than the diameter of each component (the largest distance between any pair of points in a component) but smaller than the distance between any two components (the smallest distance between two points belonging to different components). Then the connected components can be found incrementally by comparing each point to a representative of each existing component (fig. 9, right), where the representative of a component is the first point assigned to it. This can be done on the fly, as we process each data point in MS, or given the final set of iterates in BMS.
Some heuristics can further accelerate the computation. 1) Each distance can be computed incrementally (dimension by dimension), exiting the computation as soon as ε is exceeded. Under the tight-clusters assumption this reduces the cost to nearly O(NC), since for points in different components the partial distance will typically exceed ε after only a few dimensions. 2) In image segmentation, we can scan the pixels in raster order, so that for most pixels the component is the same as for the previous pixel (this fails when crossing a cluster boundary, but that happens infrequently). Then, when computing distances, we always try the last pixel's component first. The average cost of the entire connected-components step is then O(ND), because most pixels compute a single distance.
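Heuristic 1 and the representative-based scheme of fig. 9 (right) can be combined in a short sketch (illustrative; the raster-order heuristic 2 is omitted, and names are ours):

```python
import numpy as np

def tight_cluster_components(X, eps):
    """Connected components under the 'tight clusters' assumption:
    compare each point only to one representative per component,
    with an early exit in the incremental distance computation."""
    X = np.asarray(X, float)
    reps, label = [], np.empty(len(X), dtype=int)
    for n, x in enumerate(X):
        assigned = False
        for c, r in enumerate(reps):
            # incremental squared distance with early exit at eps^2
            s, e2 = 0.0, eps * eps
            for d in range(X.shape[1]):
                s += (x[d] - r[d]) ** 2
                if s > e2:
                    break
            if s <= e2:
                label[n] = c                      # joins an old component
                assigned = True
                break
        if not assigned:                          # starts a new component
            label[n] = len(reps)
            reps.append(x)
    return label, np.array(reps)
```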
The value of ε depends on the problem, but it can usually be chosen in a wide interval. For example, in image segmentation with features in pixel units, we can set ε = 1 (one pixel), since we do not need subpixel accuracy to locate a mode, and modes must be at least one pixel apart to be meaningful. In BMS, we can safely set ε to a smaller value, since its cubic convergence rate produces extremely tight clusters quickly.
Fig. 9. A: Naive connected-components algorithm.
    Define an ε-ball graph: vertices = the data points; edges between points at distance ≤ ε.
    Apply DFS to this graph.
B: Efficient connected-components algorithm.
    assign the first point to component 1 (as its representative); C ← 1
    for n = 2 to N
        for c = 1 to C
            if d(point n, representative of c) ≤ ε: assign point n to (old) component c; break
        if point n was not assigned: C ← C + 1; make point n the representative of new component C