Up until the 1970’s there were two main ways of clustering points in space. One of them, perhaps pioneered by Pearson [44], was to fit a (usually Gaussian) mixture to the data and, that being done, classify each data point — as well as any other point available at a later date — according to the most likely component in the mixture. The other one was based on a direct partitioning of the space, most notably by minimization of the average minimum squared distance to a center: the $k$-means problem, whose computational difficulty led to a number of famous algorithms [37, 22, 31, 36, 39] and likely played a role in motivating the development of hierarchical clustering [63, 21, 25, 54].
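To make the $k$-means objective concrete, here is a minimal sketch of the alternating scheme that underlies Lloyd-type algorithms; the function, toy data, and parameters are our own illustration, not taken from any of the cited works.

```python
import random

def lloyd(points, k, iters=50, seed=0):
    """Minimal sketch of Lloyd-type iterations for the k-means problem:
    alternate between assigning each point to its nearest center and
    moving each center to the mean of its assigned points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: nearest center in squared Euclidean distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        # Update step: each non-empty cluster pulls its center to its mean.
        for j, cl in enumerate(clusters):
            if cl:
                centers[j] = tuple(sum(c) / len(cl) for c in zip(*cl))
    return centers

# Two well-separated blobs: the centers converge to the blob means.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
        (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centers = sorted(lloyd(data, 2))
```

(Lloyd-type iterations only find a local optimum; the computational difficulty alluded to above concerns the global problem.)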
In the 1970’s, two decidedly nonparametric approaches to clustering were proposed, both based on the topography given by the population density. Of course, in practice, the density is estimated, often by some form of kernel density estimation.
Clustering via level sets
One of these approaches is that of Hartigan, who proposed to look at the connected components of the upper-level sets of the population density. Thinking of clusters as “regions of high density separated from other such regions by regions of low density”, at a given level, each connected component represents a cluster, while the remaining region in space is sometimes considered as noise. The basic idea was definitely in the zeitgeist. For example, a similar approach was suggested around the same time by Koontz and Fukunaga [33].
The choice of level is not at all obvious, and in fact Hartigan recommended looking at the entire tree structure — which he called the “density-contour tree” and is now better known as the cluster tree — that arises by the nesting property of the upper-level sets considered as a whole. Note, however, that the cluster tree does not provide a complete partitioning of the space.
Hartigan [30, 29, 28], and later Penrose [45], showed that the cluster tree can be estimated by single linkage, achieving a weak notion of consistency called fractional consistency. Since then, the estimation of cluster trees using different algorithms or notions of consistency has been studied in [58, 59, 49, 10, 62, 18]. At a fixed level in a cluster tree, clustering is naturally related to level set estimation, which has in itself received a lot of attention in the literature, e.g., [46, 60, 48, 12, 38, 61, 50, 53]. To address the problem of choosing the level, [55, 57, 56] considered the lowest split level in the cluster tree, which can be used to recover the full cluster tree when applied recursively.
Clustering via gradient lines
The other approach is that of Fukunaga and Hostetler [23], who proposed to use the gradient lines of the population density. Simply put, assuming the density has the proper regularity (which in particular requires that it is differentiable everywhere), a point is ‘moved’ upward along the curve of steepest ascent in the topography given by the density, and the points ending at the same critical point form a cluster. This gradient flow definition of clustering is particularly relevant when the density is a Morse function, as in that case each local maximum has its own basin of attraction and the union of these basins covers the entire space except for a set of Lebesgue measure zero.
The general idea of clustering by gradient ascent has been proposed or rediscovered a few times [14, 35, 51]. For a fairly recent review of this literature, see [7]. And the substantial amount of research work on density modes over the last several decades, which includes [27, 52, 42, 17, 3, 24], is partly motivated by this perspective on clustering.
These two approaches — by level sets and by gradient lines — seem intuitively related, and in fact they are discussed together in a few recent review papers [40, 9] under the umbrella name of modal clustering. We argue here that the gradient flow is a natural way to partition the support of the density in concordance with the cluster tree. In doing so we provide a unified perspective on modal clustering, essentially equating the use of level sets and the use of gradient lines for the purpose of clustering points in space.
Both approaches to clustering that we discuss here rely on features of an underlying density, which throughout the paper will be denoted by $f$. Although $f$ is typically unknown, as the sample increases in size it becomes possible to estimate it consistently under standard assumptions, and the topographical features of $f$ that determine a clustering become estimable as well. In the spirit of [8], for example, we focus on $f$ itself, which allows us to bypass technical finite sample derivations for the benefit of providing a more concise and clear picture.
2 Clustering via level sets: the cluster tree
Given a density $f$ with respect to the Lebesgue measure on $\mathbb{R}^d$, for a positive real number $\lambda$, the $\lambda$-upper level set of $f$ is given by
$$\mathcal{U}_\lambda := \{x \in \mathbb{R}^d : f(x) \ge \lambda\}. \tag{1}$$
Throughout, whether specified or not, we will only consider levels $\lambda$ that are in $(0, \max f)$.
Hartigan, in his classic book on clustering, suggests that “clusters may be thought of as regions of high density separated from other such regions by regions of low density” [26, Sec 11.13]. This naturally leads him to define clusters as the connected components of a certain upper level set of the underlying density: if the level is $\lambda$, then the clusters are the connected components of $\mathcal{U}_\lambda$ as defined above. See Figures 1 and 2 for illustrations in dimension 1 and 2, respectively.
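This definition can be illustrated numerically. The sketch below is our own (with an arbitrary two-bump density): it approximates the connected components of an upper level set by thresholding the density on a grid and flood-filling adjacent high-density cells.

```python
import math
from collections import deque

def level_set_components(density, grid, level):
    """Count the connected components of the upper level set
    {x : f(x) >= level}, approximated by thresholding f on a finite
    grid and flood-filling 4-adjacent high-density cells."""
    n = len(grid)
    high = [[density(x, y) >= level for y in grid] for x in grid]
    seen = [[False] * n for _ in range(n)]
    count = 0
    for i in range(n):
        for j in range(n):
            if high[i][j] and not seen[i][j]:
                count += 1                      # found a new component
                queue, seen[i][j] = deque([(i, j)]), True
                while queue:                    # flood-fill it
                    a, b = queue.popleft()
                    for u, v in ((a + 1, b), (a - 1, b), (a, b + 1), (a, b - 1)):
                        if 0 <= u < n and 0 <= v < n and high[u][v] and not seen[u][v]:
                            seen[u][v] = True
                            queue.append((u, v))
    return count

# A two-bump density (unnormalized): two clusters at a high level,
# a single one at a low level, illustrating the role of the level.
def f(x, y):
    return math.exp(-((x + 2) ** 2 + y ** 2)) + math.exp(-((x - 2) ** 2 + y ** 2))

grid = [0.2 * i - 4 for i in range(41)]   # covers [-4, 4] in each axis
high_level = level_set_components(f, grid, 0.5)
low_level = level_set_components(f, grid, 0.01)
```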
The choice of level is rather unclear in this definition, but can be determined by the number of clusters, which in turn is often set by the data analyst. Indeed, the situation is very much like that in hierarchical clustering: there is a tree structure. This structure comes from the nesting property of upper level sets, where $\mathcal{U}_{\lambda'} \subseteq \mathcal{U}_\lambda$ whenever $\lambda' \ge \lambda$, which also implies that each cluster at level $\lambda'$ is included in a cluster at level $\lambda$. The set of all clusters (each one being a connected component of some upper level set) equipped with this tree structure or partial ordering is what is called the cluster tree — and what Hartigan calls the “density-contour tree”. (Here we consider the continuous cluster tree that includes all levels; in some other works, the levels are restricted to those where a topological change occurs.) Note that the root represents the entire population while the leaves are the modes (i.e., local maxima).
The clusters at a particular level do not constitute a partition of the population. Indeed, regardless of $\lambda$, the clusters at level $\lambda$, meaning the connected components of $\mathcal{U}_\lambda$, obviously form a partition of $\mathcal{U}_\lambda$ itself, but not a partition of the support of $f$, since $\mathcal{U}_\lambda$ is in general a strict subset of the support. And the cluster tree is only an organization of all the clusters at all levels, and thus also fails to provide a partition of the population. According to a recent review paper by Menardi [40], the region outside the upper level set of interest is dealt with via (supervised) classification. Suppose the level is chosen in some way, perhaps according to the desired number of clusters, and denoted $\lambda_0$. The connected components of $\mathcal{U}_{\lambda_0}$ are then computed, and each point outside $\mathcal{U}_{\lambda_0}$ is assigned to one of these clusters by some method for classification, the simplest one being by proximity (a point is assigned to the closest cluster).
3 Clustering via gradient lines: the gradient flow
To talk about gradient lines, we need to assume that the population density $f$ is differentiable. The gradient ascent line starting at a point $x$ is the curve given by the image of $\gamma_x$, the parameterized curve defined by the following ordinary differential equation (ODE)
$$\gamma_x(0) = x, \qquad \gamma_x'(t) = \nabla f(\gamma_x(t)). \tag{2}$$
By standard existence and uniqueness results for ODEs [32, Ch 17], if $\nabla f$ is locally Lipschitz, this curve exists and is unique, and it is defined on $[0, \infty)$ with $\gamma_x(t)$ converging to a critical point of $f$ as $t \to \infty$. See Figure 3 for an illustration. Henceforth, we assume that $f$ is twice continuously differentiable, which is certainly enough for such gradient lines to exist.
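A simple Euler discretization makes the flow tangible; the two-bump density, step size, and starting points below are arbitrary choices of ours, meant only to show two points flowing to different modes.

```python
import math

def gradient_ascent(grad, x0, step=0.01, iters=5000):
    """Euler discretization of the gradient flow dx/dt = grad f(x):
    starting from x0, follow the (approximate) gradient line upward
    until it settles near a critical point."""
    x = list(x0)
    for _ in range(iters):
        x = [xi + step * gi for xi, gi in zip(x, grad(x))]
    return x

# Two-bump density f(t) = exp(-(t+2)^2) + exp(-(t-2)^2) on the line,
# with its gradient written out analytically.
def grad_f(x):
    (t,) = x
    return [-2 * (t + 2) * math.exp(-(t + 2) ** 2)
            - 2 * (t - 2) * math.exp(-(t - 2) ** 2)]

left = gradient_ascent(grad_f, [-1.0])    # in the basin of the mode near -2
right = gradient_ascent(grad_f, [0.5])    # in the basin of the mode near +2
```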
It is intuitive to define clusters based on local maxima, and Fukunaga and Hostetler [23] suggest to “assign each [point] to the nearest mode along the direction of the gradient” — as opposed to the closest mode in Euclidean distance, for instance. Define the basin of attraction of a critical point $m$ as $B(m) := \{x : \gamma_x(t) \to m \text{ as } t \to \infty\}$. It turns out that, if $f$ is of Morse type inside its support, meaning that the Hessian of $f$ at any of its critical points is non-degenerate, then all these basins of attraction, sometimes called stable manifolds, provide a partition of the entire population. In fact, the basins of attraction of the local maxima, by themselves, cover the population, except for a set of zero measure. For more background on Morse functions and their use in statistics, see the recent articles of Chacón [8] and Chen et al. [13].
In their original article, Fukunaga and Hostetler [23] also proposed an implementation: “The approach uses the fact that the expected value of the observations within a small region about a point can be related to the density gradient at that point.” The procedure is now known as the blurring mean-shift algorithm after Cheng [15], who suggested what is now known as the mean-shift algorithm, which is much closer to the plug-in approach suggested in our narrative. The mean-shift algorithm, and the twin problem of estimating the gradient lines of a density, are now well-understood [15, 16, 14, 2, 4]. The behavior of the blurring mean-shift algorithm is not as well understood, although some results do exist [15, 6, 5].
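As an illustration of the (non-blurring) mean-shift iteration with a Gaussian kernel — a sketch of ours, with an arbitrary sample and bandwidth:

```python
import math

def mean_shift(points, x0, bandwidth=0.5, iters=100):
    """Mean-shift iteration with a Gaussian kernel: repeatedly replace x
    by the kernel-weighted average of the sample, which ascends an
    estimated gradient line of the kernel density estimate."""
    x = list(x0)
    d = len(x)
    for _ in range(iters):
        w = [math.exp(-sum((p[i] - x[i]) ** 2 for i in range(d))
                      / (2 * bandwidth ** 2)) for p in points]
        total = sum(w)
        x = [sum(wi * p[i] for wi, p in zip(w, points)) / total for i in range(d)]
    return x

# A sample with two well-separated groups: starting points in different
# basins are shifted to the two modes of the kernel density estimate.
sample = [(-2.1,), (-2.0,), (-1.9,), (1.9,), (2.0,), (2.1,)]
mode_left = mean_shift(sample, [-1.0])
mode_right = mean_shift(sample, [1.0])
```

Running the iteration from every sample point and grouping points that land on the same mode yields the clustering described above.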
4 Relating the cluster tree and the gradient flow
We assume that the density $f$ is twice continuously differentiable and of Morse type within its support, so that we may freely discuss the partition of the population given by the gradient ascent lines.
For the same population, consider the cluster tree. We saw that this object does not by itself provide a partition of the population, but a look at Figure 2 points to that possibility: with a little imagination, we may see the contours (representing the level sets) as ‘moving’ upward. As it happens, this can be formalized, and the result is a partition that coincides with that defined by the gradient flow. Despite being quite intuitive, this correspondence appears to be novel. It is developed in the present section.
4.1 The gradient flow follows the cluster tree
A close relationship between the cluster tree and the gradient flow can be anticipated by a look at Figure 3, where it appears that gradient lines do not cross different clusters at the same level. This happens to be true, as we argue below.
To see the situation more clearly, take a point $x$ in the basin of attraction of a local maximum $m$, and let $\gamma_x$ denote the gradient line between $x$ and $m$ as defined in (2). Assume that $x \ne m$, so that the situation is not trivial. We have assumed that $f$ is Morse, so that this situation is (almost) generic. The function $t \mapsto f(\gamma_x(t))$ is non-decreasing by construction, since $\frac{d}{dt} f(\gamma_x(t)) = \|\nabla f(\gamma_x(t))\|^2 \ge 0$, and, as a consequence, $\gamma_x$ is ‘compatible’ with the cluster tree in the sense that it does not cross from one connected component to another one at the same level. Indeed, such a crossing would require an incursion into the region between the two connected components, say at level $\lambda$, and that region is, by definition, outside the upper level set $\mathcal{U}_\lambda$ and thus displays values of $f$ that are $< \lambda$. Crossing from one cluster to another at level $\lambda$ would thus imply going from values of $f$ that are $\ge \lambda$ when in the first cluster, to values $< \lambda$ when in the intermediary region, and finally to values $\ge \lambda$ when entering the second cluster, contradicting monotonicity.
To offer a different perspective, let $\lambda := f(x)$ and let $C(x)$ denote the connected component of $\mathcal{U}_\lambda$ containing $x$ — so that $C(x)$ is the ‘last’ cluster containing $x$ when moving upward in the cluster tree (away from the root). Then $\gamma_x(t)$ belongs to $C(x)$ or to a descendant of $C(x)$ for all $t \ge 0$; and if $C(\gamma_x(t))$ denotes the last cluster containing $\gamma_x(t)$, then $C(\gamma_x(s))$ is a descendant of (or equal to) $C(\gamma_x(t))$ for all $s \ge t$.
What was just said hinges on the fact that $\gamma_x$ does not pass through any critical point, except when reaching $m$ in the limit, and in particular does not come into contact with any point where a level set splits. While the first part of this claim is well-known in dynamical systems theory, the second part is justified as follows.
Any point at the intersection of the closures of two distinct connected components of the interior of $\mathcal{U}_\lambda$ is a critical point.
Note that the statement is void unless $\lambda$ is such that a topological event happens at level $\lambda$, i.e., the number of connected components changes there.
The level set of $f$ at level $\lambda$ is defined by
$$\mathcal{L}_\lambda := \{x \in \mathbb{R}^d : f(x) = \lambda\}.$$
Note that an upper level set, say $\mathcal{U}_\lambda$, is determined by its border, $\mathcal{L}_\lambda$, and here we work with the latter.
Let $x_0$ be a regular point in $\mathcal{L}_\lambda$. There exists an open set $A$ containing $x_0$ such that $A \cap \mathcal{L}_\lambda$ is a $(d-1)$-dimensional surface, by the Constant Rank Level Set Theorem [34, Th 5.12]. If necessary, shrink $A$ so that $\|\nabla f(x)\| \ge c$ for some $c > 0$ for all $x \in A$. [20, Lem 4.11] can be used to show that the set $A \cap \mathcal{L}_\lambda$ has a positive reach. Denote $\nu(x) := \nabla f(x)/\|\nabla f(x)\|$. Following the calculations in the proof of [47, Lem 2.1], it can be shown that there exists $\delta > 0$ such that, for all $x \in A \cap \mathcal{L}_\lambda$ and all $0 < t < \delta$,
$$f(x + t\,\nu(x)) > \lambda \quad \text{and} \quad f(x - t\,\nu(x)) < \lambda.$$
In other words, in a small neighborhood, the two sides of the surface carry values of $f$ strictly above and strictly below $\lambda$, respectively. This then implies that $x_0$ cannot be at the intersection of the closures of two connected components of the interior of $\mathcal{U}_\lambda$. ∎
4.2 The cluster tree follows the gradient flow
The cluster tree organizes the clusters (i.e., the connected components of the upper level sets) across all levels. We show now that the gradient flow provides a natural, almost canonical way of moving along the tree from the root (the population) to the leaves (the modes), thus reinforcing the case that the cluster tree and the gradient flow are intimately related when it comes to defining a partition of the entire population.
Suppose that there are no critical points at level $\lambda$, i.e., no $x$ such that $f(x) = \lambda$ and $\nabla f(x) = 0$. Note that this is the generic situation, as the critical points form a discrete set by the fact that $f$ is Morse. Then, for $\varepsilon > 0$ small enough, there are no critical points at any level between $\lambda$ and $\lambda + \varepsilon$. By standard Morse theory [43, Th 2.6], this implies that there are no ‘topological events’ between levels $\lambda$ and $\lambda + \varepsilon$, in that $\mathcal{U}_\lambda$ and $\mathcal{U}_{\lambda+\varepsilon}$ are homeomorphic. In particular, these two level sets have the same number of connected components, and if $C$ is a connected component of $\mathcal{U}_\lambda$, then it contains a single connected component of $\mathcal{U}_{\lambda+\varepsilon}$. Knowing all this, a natural way to move from $\mathcal{U}_\lambda$ to $\mathcal{U}_{\lambda+\varepsilon}$ is by (metric) projection of the boundary of each component of $\mathcal{U}_\lambda$ onto the boundary of the component of $\mathcal{U}_{\lambda+\varepsilon}$ that it contains. It turns out that, by taking $\varepsilon$ small enough but still positive, this operation becomes well-defined and a diffeomorphism. Moreover, in the infinitesimal regime where $\varepsilon \to 0$, the transformation coincides with the gradient flow (at a certain speed). We support these assertions with several lemmas presented below.
In the present context, for $\varepsilon > 0$ small enough, the metric projection onto $\mathcal{L}_{\lambda+\varepsilon}$ is a homeomorphism sending $\mathcal{L}_\lambda$ to $\mathcal{L}_{\lambda+\varepsilon}$.
Let $\rho$ denote the Hausdorff distance between $\mathcal{L}_\lambda$ and $\mathcal{L}_{\lambda+\varepsilon}$, and let $r$ denote the minimum of their reaches. By applying [11, Th 1], it suffices to show that, for $\varepsilon$ small enough, $\rho$ is small compared to $r$, which guarantees that $\mathcal{L}_\lambda$ and $\mathcal{L}_{\lambda+\varepsilon}$ are ‘normal compatible’, meaning that both metric projections (of either surface onto the other) are homeomorphisms. First, it follows from [20, Lem 4.11] that there exists $r_0 > 0$ such that both reaches are bounded from below by $r_0$ for all $\varepsilon$ small enough.
Henceforth, the projection of $\mathcal{L}_\lambda$ onto $\mathcal{L}_{\lambda+\varepsilon}$ will be denoted $\pi_\varepsilon$. The previous lemma states that $\pi_\varepsilon$ is a homeomorphism when there are no critical points at level $\lambda$ and $\varepsilon$ is small enough, but in fact $\pi_\varepsilon$ is a diffeomorphism under the same circumstances. This can be established following standard lines, and further details are omitted, in particular since this is not essential to our narrative.
For any $x \in \mathcal{L}_\lambda$ such that $\nabla f(x) \ne 0$,
$$\pi_\varepsilon(x) = x + \varepsilon\, \frac{\nabla f(x)}{\|\nabla f(x)\|^2} + o(\varepsilon), \qquad \varepsilon \to 0.$$
Here $\lambda$ and $x$ are considered fixed. Let $\pi$ be short for $\pi_\varepsilon$. It is well-known that for any non-critical point $x \in \mathcal{L}_\lambda$, $\nabla f(x)$ is orthogonal to $\mathcal{L}_\lambda$ and pointing inwards. When $\varepsilon$ is small enough, $\pi(x) - x$ is parallel to $\nabla f(x)$ by [20, Lem 4.8(2)]. So we can write
$$\pi(x) = x + t\, v, \qquad v := \frac{\nabla f(x)}{\|\nabla f(x)\|},$$
for some scalar $t \ge 0$. Note that $t \to 0$ as $\varepsilon \to 0$, since $\|\pi(x) - x\| \to 0$. Using a Taylor expansion, we have, for some $u$ on the segment between $x$ and $\pi(x)$,
$$\varepsilon = f(\pi(x)) - f(x) = t\, \|\nabla f(x)\| + \tfrac{1}{2}\, t^2\, v^\top \nabla^2\! f(u)\, v.$$
We then conclude by noting that, as $\varepsilon \to 0$, $u \to x$. This implies that $\nabla^2\! f(u) \to \nabla^2\! f(x)$, by the fact that $f$ is twice continuously differentiable, so that $\nabla^2\! f$ is continuous. And this also implies that $t = \varepsilon/\|\nabla f(x)\| + o(\varepsilon)$, because of the uniform continuity of $\nabla^2\! f$ on the line segment between $x$ and $\pi(x)$, so that $\pi(x) = x + \varepsilon\, \nabla f(x)/\|\nabla f(x)\|^2 + o(\varepsilon)$, as claimed. ∎
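As a small numerical sanity check of this first-order behavior of the metric projection — our own illustration, using an isotropic density for which everything is available in closed form:

```python
import math

# Isotropic density f(x) = exp(-|x|^2) on the plane: its level sets are
# circles, so the metric projection onto the next contour is radial and
# can be computed exactly, then compared with eps / |grad f|, the step
# of the level-parameterized gradient flow.
lam, eps = 0.5, 1e-4
r = math.sqrt(-math.log(lam))            # radius of the contour {f = lam}
r_up = math.sqrt(-math.log(lam + eps))   # radius of {f = lam + eps}
step = r - r_up                          # length of the radial projection
grad_norm = 2 * r * lam                  # |grad f| on {f = lam}
predicted = eps / grad_norm              # first-order gradient-flow step
```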
Consider the following gradient flow
$$\eta_x(0) = x, \qquad \eta_x'(t) = \frac{\nabla f(\eta_x(t))}{\|\nabla f(\eta_x(t))\|^2}. \tag{10}$$
Starting at the same point $x$, the gradient line defined by (10) is the same as the gradient line defined by $\gamma_x$ in (2), but it is traveled at a different speed, and when relating the gradient lines to the cluster tree, this variant of the gradient flow is particularly compelling because it is, in effect, parameterized by the level.
The gradient flow given in (10) has the following property. Starting from a point $x$ at level $\lambda$, meaning that $f(x) = \lambda$, it holds that $f(\eta_x(t)) = \lambda + t$ for all $t$ as long as $\nabla f(\eta_x(s)) \ne 0$ for all $s \le t$. In fact, the transformation $x \mapsto \eta_x(t)$ provides a homeomorphism from $\mathcal{L}_\lambda$ to $\mathcal{L}_{\lambda+t}$ whenever there are no critical points at any level anywhere between $\lambda$ and $\lambda + t$, inclusive.
The result is essentially known, but since we were not able to locate it in this form, we provide a short proof. Throughout, $\lambda$ and $t$ are considered fixed.
Using elementary calculus and the definition in (10), we have
$$\frac{d}{ds} f(\eta_x(s)) = \big\langle \nabla f(\eta_x(s)),\, \eta_x'(s) \big\rangle = 1.$$
Hence, $f(\eta_x(t)) = f(x) + t = \lambda + t$, giving the first part of the statement.
Let $F_t(x) := \eta_x(t)$. We show that $F_t$ is a homeomorphism when there are no critical points in $\mathcal{L}_{\lambda+s}$ for any $s \in [0, t]$, or equivalently, no critical point in $K := \{x : \lambda \le f(x) \le \lambda + t\}$. Under this condition, the trajectories of $\eta_{x_1}$ and $\eta_{x_2}$ do not intersect for any two different starting points $x_1 \ne x_2$, as is well-known. This implies that the map $F_t$ is injective. For any $y \in \mathcal{L}_{\lambda+t}$, consider the gradient descent flow ($\eta$ in reverse) given by
$$\bar\eta_y(0) = y, \qquad \bar\eta_y'(s) = -\frac{\nabla f(\bar\eta_y(s))}{\|\nabla f(\bar\eta_y(s))\|^2}. \tag{12}$$
Let $x := \bar\eta_y(t)$. It follows from a calculation similar to the one above that $f(\bar\eta_y(s)) = \lambda + t - s$, so that $x \in \mathcal{L}_\lambda$. Notice that $\eta_x(t) = y$, so that we can write $y = F_t(x)$, and therefore the map $F_t$ is surjective. In the process, we have shown that the inverse of the gradient ascent operation — the function $F_t$ defined via $\eta$ in (10) — is given by the corresponding gradient descent operation — the function defined via $\bar\eta$ in (12).
The proof is completed after we show that $F_t$ is continuous. (The continuity of $F_t^{-1}$ can be proved in a similar way.) We start by stating that, because $\nabla f(x) \ne 0$ for all $x \in K$, and by the assumption that $\nabla f$ is continuous, there is $c > 0$ such that $\|\nabla f(x)\| \ge c$ for all $x \in K$. Define $G(x) := \nabla f(x)/\|\nabla f(x)\|^2$, which is the map that drives the gradient flow in (10). This map is continuous, and $K$ is compact, so that there is $C$ such that $\|G(x)\| \le C$ for all $x \in K$. The map $G$ is also differentiable, with
$$\nabla G = \frac{\nabla^2\! f}{\|\nabla f\|^2} - \frac{2\, \nabla f\, (\nabla^2\! f\, \nabla f)^\top}{\|\nabla f\|^4},$$
where $\nabla G$ denotes here the Jacobian matrix of $G$, while $\nabla^2\! f$ denotes the Hessian of $f$. And by continuity of $\nabla f$ and of $\nabla^2\! f$, the fact that $\|\nabla f\| \ge c$ on $K$, and again the fact that $K$ is compact, there exists $C'$ such that $\|\nabla G(x)\| \le C'$ for all $x \in K$, where $\|\cdot\|$ denotes any matrix norm. In other words, $G$ is bounded and has bounded gradient on $K$.
Let $\{z_1, \dots, z_m\}$ be a $\delta$-packing of $K$, meaning that $K \subseteq \bigcup_i B(z_i, \delta)$ and also that the balls $B(z_i, \delta/2)$ are pairwise disjoint. And define $K_\delta := \bigcup_i B(z_i, \delta)$. Note that $K_\delta$ is open with $K \subseteq K_\delta$, and, for $\delta$ small enough, $G$ is still bounded by some constant $C$ and $\|\nabla G\|$ by some constant $C'$ on the slightly larger set $K_{2\delta} := \bigcup_i B(z_i, 2\delta)$, which still contains no critical points. If $x, y \in K_\delta$ are such that $\|x - y\| < \delta$, there must be $i$ such that $x, y \in B(z_i, 2\delta)$, and because that ball is convex, we have
$$\|G(x) - G(y)\| \le C' \|x - y\|.$$
If, on the other hand, $\|x - y\| \ge \delta$, then we can simply write
$$\|G(x) - G(y)\| \le 2C \le \frac{2C}{\delta}\, \|x - y\|.$$
Hence, $G$ is Lipschitz on $K_\delta$ with corresponding constant bounded by $L := \max(C', 2C/\delta)$. Finally, note that $K \subseteq K_\delta$ and that for any $x \in \mathcal{L}_\lambda$, $\eta_x(s) \in K$ for all $s \in [0, t]$, since $f(\eta_x(s)) = \lambda + s$. All together, we are in a position to apply a standard result on the dependence of the gradient flow on the initial condition, for example, the main theorem in [32, Sec 17.3], which gives the bound
$$\|\eta_{x_1}(s) - \eta_{x_2}(s)\| \le \|x_1 - x_2\|\, e^{L s}, \qquad s \in [0, t],$$
which, in particular, implies
$$\|F_t(x_1) - F_t(x_2)\| \le A\, \|x_1 - x_2\|,$$
with $A := e^{L t}$. ∎
In this short paper we have established what we regard as an important correspondence between level set clustering and gradient line clustering (i.e., the mean-shift algorithm), which can now be understood as representing the same fundamental approach to clustering.
This approach can keep the name modal clustering, as this term has already been used to refer to a class of methods that are inspired by this approach [40, 9], and which already includes single-linkage clustering [30, 29, 28, 45] and other related methods based on nearest neighbors [58, 59], including DBSCAN [19] and some forms of spectral clustering.
- Arias-Castro, E. (2011). Clustering based on pairwise distances when the data is of mixed dimensions. IEEE Transactions on Information Theory 57(3), 1692–1706.
- Arias-Castro, E., D. Mason, and B. Pelletier (2016). On the estimation of the gradient lines of a density and the consistency of the mean-shift algorithm. The Journal of Machine Learning Research 17(1), 1487–1514.
- Burman, P. and W. Polonik (2009). Multivariate mode hunting: Data analytic tools with measures of significance. Journal of Multivariate Analysis 100(6), 1198–1218.
- Carreira-Perpiñán, M. A. (2000). Mode-finding for mixtures of Gaussian distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(11), 1318–1323.
- Carreira-Perpiñán, M. A. (2006). Fast nonparametric clustering with Gaussian blurring mean-shift. In International Conference on Machine Learning, pp. 153–160.
- Carreira-Perpiñán, M. A. (2008). Generalised blurring mean-shift algorithms for nonparametric clustering. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8.
- Carreira-Perpiñán, M. Á. (2015). Clustering methods based on kernel density estimators: Mean-shift algorithms. In C. Hennig, M. Meila, F. Murtagh, and R. Rocci (Eds.), Handbook of Cluster Analysis, pp. 404–439. Chapman and Hall/CRC.
- Chacón, J. E. (2015). A population background for nonparametric density-based clustering. Statistical Science 30(4), 518–532.
- Chacón, J. E. (2020). The modal age of statistics. International Statistical Review 88(1), 122–141.
- Chaudhuri, K., S. Dasgupta, S. Kpotufe, and U. Von Luxburg (2014). Consistent procedures for cluster tree estimation and pruning. IEEE Transactions on Information Theory 60(12), 7900–7912.
- Chazal, F., A. Lieutier, and J. Rossignac (2007). Normal-map between normal-compatible manifolds. International Journal of Computational Geometry & Applications 17(05), 403–421.
- Chen, Y.-C., C. R. Genovese, and L. Wasserman (2017a). Density level sets: Asymptotics, inference, and visualization. Journal of the American Statistical Association 112(520), 1684–1696.
- Chen, Y.-C., C. R. Genovese, and L. Wasserman (2017b). Statistical inference using the Morse-Smale complex. Electronic Journal of Statistics 11(1), 1390–1433.
- Cheng, M.-Y., P. Hall, and J. A. Hartigan (2004). Estimating gradient trees. In A Festschrift for Herman Rubin, pp. 237–249. Institute of Mathematical Statistics.
- Cheng, Y. (1995). Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 17(8), 790–799.
- Comaniciu, D. and P. Meer (2002). Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(5), 603–619.
- Dümbgen, L. and G. Walther (2008). Multiscale inference about a density. The Annals of Statistics 36(4), 1758–1785.
- Eldridge, J., M. Belkin, and Y. Wang (2015). Beyond Hartigan consistency: Merge distortion metric for hierarchical clustering. In Conference on Learning Theory, pp. 588–606.
- Ester, M., H.-P. Kriegel, J. Sander, and X. Xu (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Knowledge Discovery and Data Mining, pp. 226–231.
- Federer, H. (1959). Curvature measures. Transactions of the American Mathematical Society 93(3), 418–491.
- Fisher, W. D. (1958). On grouping for maximum homogeneity. Journal of the American Statistical Association 53(284), 789–798.
- Forgy, E. (1965). Cluster analysis of multivariate data: Efficiency versus interpretability of classifications. Biometrics 21, 768–780.
- Fukunaga, K. and L. Hostetler (1975). The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory 21(1), 32–40.
- Genovese, C. R., M. Perone-Pacifico, I. Verdinelli, and L. Wasserman (2016). Non-parametric inference for density modes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78(1), 99–126.
- Gower, J. C. and G. J. Ross (1969). Minimum spanning trees and single linkage cluster analysis. Journal of the Royal Statistical Society: Series C 18(1), 54–64.
- Hartigan, J. (1975). Clustering algorithms. Wiley.
- Hartigan, J. and P. Hartigan (1985). The dip test of unimodality. The Annals of Statistics 13(1), 70–84.
- Hartigan, J. A. (1977a). Clusters as modes. In First International Symposium on Data Analysis and Informatics. IRIA, Versailles.
- Hartigan, J. A. (1977b). Distribution problems in clustering. In J. V. Ryzin (Ed.), Classification and Clustering, pp. 45–71. Elsevier.
- Hartigan, J. A. (1981). Consistency of single linkage for high-density clusters. Journal of the American Statistical Association 76(374), 388–394.
- Hartigan, J. A. and M. Wong (1979). A K-means clustering algorithm. Journal of the Royal Statistical Society: Series C 28(1), 100–108.
- Hirsch, M. W., S. Smale, and R. L. Devaney (2012). Differential equations, dynamical systems, and an introduction to chaos (3rd ed.). Elsevier Science & Technology.
- Koontz, W. L. and K. Fukunaga (1972). A nonparametric valley-seeking technique for cluster analysis. IEEE Transactions on Computers C-21(2), 171–178.
- Lee, J. M. (2013). Introduction to smooth manifolds (2nd ed.). Springer.
- Li, J., S. Ray, and B. G. Lindsay (2007). A nonparametric statistical approach to clustering via mode identification. Journal of Machine Learning Research 8(59), 1687–1723.
- Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2), 129–137. The procedure was first proposed in 1957 in unpublished work when the author was at Bell Labs.
- MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297.
- Mason, D. M. and W. Polonik (2009). Asymptotic normality of plug-in level set estimates. The Annals of Applied Probability 19(3), 1108–1142.
- Max, J. (1960). Quantizing for minimum distortion. IRE Transactions on Information Theory 6(1), 7–12.
- Menardi, G. (2016). A review on modal clustering. International Statistical Review 84(3), 413–433.
- Milnor, J. (1963). Morse theory. Princeton University Press.
- Minnotte, M. C. (1997). Nonparametric testing of the existence of modes. The Annals of Statistics 25(4), 1646–1660.
- Nicolaescu, L. (2011). An invitation to Morse theory. Springer Science & Business Media.
- Pearson, K. (1894). Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London: Series A 185, 71–110.
- Penrose, M. D. (1995). Single linkage clustering and continuum percolation. Journal of Multivariate Analysis 53(1), 94–109.
- Polonik, W. (1995). Measuring mass concentrations and estimating density contour clusters - an excess mass approach. The Annals of Statistics 23(3), 855–881.
- Qiao, W. (2021). Nonparametric estimation of surface integrals on level sets. Bernoulli 27(1), 155–191.
- Rigollet, P. and R. Vert (2009). Optimal rates for plug-in estimators of density level sets. Bernoulli 15(4), 1154–1178.
- Rinaldo, A., A. Singh, R. Nugent, and L. Wasserman (2012). Stability of density-based clustering. Journal of Machine Learning Research 13, 905.
- Rinaldo, A. and L. Wasserman (2010). Generalized density clustering. The Annals of Statistics 38(5), 2678–2722.
- Roberts, S. J. (1997). Parametric and non-parametric unsupervised cluster analysis. Pattern Recognition 30(2), 261–272.
- Silverman, B. W. (1981). Using kernel density estimates to investigate multimodality. Journal of the Royal Statistical Society: Series B (Methodological) 43(1), 97–99.
- Singh, A., C. Scott, and R. Nowak (2009). Adaptive Hausdorff estimation of density level sets. The Annals of Statistics 37(5B), 2760–2782.
- Sneath, P. H. (1957). The application of computers to taxonomy. Microbiology 17(1), 201–226.
- Sriperumbudur, B. and I. Steinwart (2012). Consistency and rates for clustering with DBSCAN. In Artificial Intelligence and Statistics, pp. 1090–1098.
- Steinwart, I. (2011). Adaptive density level set clustering. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 703–738.
- Steinwart, I. (2015). Fully adaptive density-based clustering. The Annals of Statistics 43(5), 2132–2167.
- Stuetzle, W. (2003). Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. Journal of Classification 20(1), 25–47.
- Stuetzle, W. and R. Nugent (2010). A generalized single linkage method for estimating the cluster tree of a density. Journal of Computational and Graphical Statistics 19(2), 397–418.
- Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. The Annals of Statistics 25(3), 948–969.
- Walther, G. (1997). Granulometric smoothing. The Annals of Statistics 25(6), 2273–2299.
- Wang, D., X. Lu, and A. Rinaldo (2019). DBSCAN: Optimal rates for density-based cluster estimation. Journal of Machine Learning Research 20, 1–50.
- Ward, J. H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association 58(301), 236–244.