Level Sets or Gradient Lines? A Unifying View of Modal Clustering

September 17, 2021 · Ery Arias-Castro, et al.

The paper establishes a strong correspondence, if not an equivalence, between two important clustering approaches that emerged in the 1970's: clustering by level sets or cluster tree as proposed by Hartigan, and clustering by gradient lines or gradient flow as proposed by Fukunaga and Hostetler.

1 Introduction

Up until the 1970’s there were two main ways of clustering points in space. One of them, perhaps pioneered by Pearson [44], was to fit a (usually Gaussian) mixture to the data and, that being done, classify each data point — as well as any other point available at a later date — according to the most likely component in the mixture. The other one was based on a direct partitioning of the space, most notably by minimization of the average minimum squared distance to a center: the $k$-means problem, whose computational difficulty led to a number of famous algorithms [37, 22, 31, 36, 39] and likely played a role in motivating the development of hierarchical clustering [63, 21, 25, 54].

In the 1970’s, two decidedly nonparametric approaches to clustering were proposed, both based on the topography given by the population density. Of course, in practice, the density is estimated, often by some form of kernel density estimation.
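By way of illustration (a sketch of ours, not the paper's), the plug-in step can look as follows in Python, with a Gaussian kernel density estimator from SciPy; the two-component mixture, the seed, and the grid are arbitrary illustrative choices. Later snippets reuse `sample`, `grid`, and `density`.

```python
# Minimal plug-in step: estimate a density from a sample by kernel density
# estimation. The mixture below is an illustrative choice, not the paper's.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Sample from a two-component Gaussian mixture in dimension d = 1.
sample = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(2.0, 1.0, 300)])

f_hat = gaussian_kde(sample)        # bandwidth set by Scott's rule by default
grid = np.linspace(-6.0, 6.0, 500)  # evaluation grid
density = f_hat(grid)               # estimated density values on the grid
```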

Clustering via level sets

One of these approaches is that of Hartigan [26], who proposed to look at the connected components of the upper-level sets of the population density. Thinking of clusters as “regions of high density separated from other such regions by regions of low density”, at a given level, each connected component represents a cluster, while the remaining region in space is sometimes considered as noise. The basic idea was definitely in the zeitgeist. For example, a similar approach was suggested around the same time by Koontz and Fukunaga [33].
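As a toy illustration of this definition (a sketch under the setup of the previous snippet, with `grid` and `density` as above, and an arbitrary level), the clusters at level `lam` in dimension 1 are simply the maximal intervals on which the estimated density stays at or above `lam`:

```python
# Hartigan-style clusters at a single level: the connected components of the
# upper level set {x : f(x) >= lam}, which in dimension 1 are intervals.
import numpy as np

def upper_level_components(grid, density, lam):
    """Return the components of {f >= lam} as (left, right) grid intervals."""
    above = density >= lam
    edges = np.diff(above.astype(int))       # +1 at run starts, -1 at run ends
    starts = list(np.where(edges == 1)[0] + 1)
    ends = list(np.where(edges == -1)[0])
    if above[0]:
        starts.insert(0, 0)
    if above[-1]:
        ends.append(len(above) - 1)
    return [(grid[i], grid[j]) for i, j in zip(starts, ends)]

print(upper_level_components(grid, density, lam=0.1))  # two intervals, one per mode
```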

The choice of level is not at all obvious, and in fact Hartigan recommended looking at the entire tree structure — which he called the “density-contour tree” and is now better known as the cluster tree — that arises by the nesting property of the upper-level sets considered as a whole. Note, however, that the cluster tree does not provide a complete partitioning of the space.

Hartigan [30, 29, 28], and later [45], showed that the cluster tree can be estimated by single linkage, achieving a weak notion of consistency called fractional consistency. Since then, the estimation of cluster trees using different algorithms or notions of consistency has been studied in [58, 59, 49, 10, 62, 18]. At a fixed level in a cluster tree, clustering is naturally related to level set estimation, which has in itself received a lot of attention in the literature, e.g., [46, 60, 48, 12, 38, 61, 50, 53]. To address the problem of choosing the level, [55, 57, 56] considered the lowest split level in the cluster tree, which can be used to recover the full cluster tree when applied recursively.

Clustering via gradient lines

The other approach is that of Fukunaga and Hostetler [23], who proposed to use the gradient lines of the population density. Simply put, assuming the density has the proper regularity (which in particular requires that it is differentiable everywhere), a point is ‘moved’ upward along the curve of steepest ascent in the topography given by the density, and the points ending at the same critical point form a cluster. This gradient flow definition of clustering is particularly relevant when the density is a Morse function [8], as in that case each local maximum has its own basin of attraction and the union of these basins covers the entire space except for a set of Lebesgue measure zero.
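To make the definition concrete, here is a sketch (ours, for a density known in closed form rather than estimated): a crude Euler discretization of gradient ascent on a two-component Gaussian mixture, with points grouped by the critical point they reach. The step size, tolerance, and mixture are arbitrary choices.

```python
# Fukunaga-Hostetler-style clustering for an explicit mixture density:
# follow the gradient by Euler steps and cluster points by their endpoint.
import numpy as np

MEANS, WEIGHTS, SIGMA = np.array([-2.0, 2.0]), np.array([0.5, 0.5]), 1.0

def grad_f(x):
    """Gradient of the two-component Gaussian mixture density at x."""
    z = (x - MEANS) / SIGMA
    comps = WEIGHTS * np.exp(-0.5 * z ** 2) / (SIGMA * np.sqrt(2 * np.pi))
    return float(np.sum(comps * -(x - MEANS) / SIGMA ** 2))

def ascend(x, step=0.05, tol=1e-9, max_iter=100_000):
    """Move x upward along the curve of steepest ascent, to near-stationarity."""
    for _ in range(max_iter):
        g = grad_f(x)
        if abs(g) < tol:
            break
        x += step * g
    return x

points = np.linspace(-5.0, 5.0, 11)
endpoints = [round(ascend(float(x)), 2) for x in points]
# Shared endpoint = same cluster. The midpoint 0.0 sits exactly at the local
# minimum, a critical point whose basin has measure zero: it does not move.
print(dict(zip(points.tolist(), endpoints)))
```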

The general idea of clustering by gradient ascent has been proposed or rediscovered a few times [14, 35, 51]. For a fairly recent review of this literature, see [7]. And the substantial amount of research work on density modes over the last several decades, which includes [27, 52, 42, 17, 3, 24], is partly motivated by this perspective on clustering.

Contribution

These two approaches — by level sets and by gradient lines — seem intuitively related, and in fact they are discussed together in a few recent review papers [40, 9] under the umbrella name of modal clustering. We argue here that the gradient flow is a natural way to partition the support of the density in concordance with the cluster tree. In doing so we provide a unified perspective on modal clustering, essentially equating the use of level sets and the use of gradient lines for the purpose of clustering points in space.

Setting

Both approaches to clustering that we discuss here rely on features of an underlying density, which throughout the paper will be denoted by $f$. Although $f$ is typically unknown, as the sample increases in size it becomes possible to estimate it consistently under standard assumptions, and the topographical features of $f$ that determine a clustering become estimable as well. In the spirit of [8], for example, we focus on $f$ itself, which allows us to bypass technical finite-sample derivations for the benefit of providing a more concise and clear picture.

2 Clustering via level sets: the cluster tree

Given a density $f$ with respect to the Lebesgue measure on $\mathbb{R}^d$, for a positive real number $\lambda$, the $\lambda$-upper level set of $f$ is given by

$$\mathcal{L}(\lambda) := \{x \in \mathbb{R}^d : f(x) \ge \lambda\}. \qquad (1)$$

Throughout, whether specified or not, we will only consider levels $\lambda$ that are in $(0, \max f]$.

Hartigan, in his classic book on clustering, suggests that “clusters may be thought of as regions of high density separated from other such regions by regions of low density” [26, Sec 11.13]. This naturally leads him to define clusters as the connected components of a certain upper level set of the underlying density: if the level is $\lambda$, then the clusters are the connected components of $\mathcal{L}(\lambda)$ as defined above. See Figures 1 and 2 for illustrations in dimension 1 and 2, respectively.

Figure 1: A sample of upper level sets of a density in dimension $d = 1$ with two modes (which happens to be the mixture of two normal distributions). At any level $\lambda \in (0, \lambda_1]$, where $\lambda_1$ is the value of the density at the local minimum, $\mathcal{L}(\lambda)$ is connected, and thus corresponds to the cluster at that level. At any level $\lambda \in (\lambda_1, \lambda_2]$, where $\lambda_2$ is the value of the density at its lower local maximum, $\mathcal{L}(\lambda)$ has exactly two connected components, and these are the clusters at that level. (At $\lambda = \lambda_2$, one of the clusters is a singleton.) Finally, at any level $\lambda \in (\lambda_2, \lambda_3]$, where $\lambda_3$ is the value of the density at its global maximum, $\mathcal{L}(\lambda)$ is again connected, and is thus the cluster at that level. (At $\lambda = \lambda_3$, the cluster is a singleton.)

Figure 2: A sample of upper level sets of a density in dimension $d = 2$ with two modes (which happens to be the mixture of two normal distributions, one with a non-scalar covariance matrix). The situation is similar to that of Figure 1, where the number of connected components of the upper level set at level $\lambda$ is one when $\lambda \le \lambda_1$, where $\lambda_1$ is the value of the density at the saddle point; two when $\lambda_1 < \lambda \le \lambda_2$, where $\lambda_2$ is the value of the density at the local (but not global) maximum; and one again when $\lambda_2 < \lambda \le \lambda_3$, where $\lambda_3$ is the maximum value of the density.

The choice of level is rather unclear in this definition, but can be determined by the number of clusters, which in turn is often set by the data analyst. Indeed, the situation is very much like that in hierarchical clustering: there is a tree structure. This structure comes from the nesting property of upper level sets, where $\mathcal{L}(\lambda') \subset \mathcal{L}(\lambda)$ whenever $\lambda' \ge \lambda$, which also implies that each cluster at level $\lambda'$ is included in a cluster at level $\lambda$. The set of all clusters (each one being the connected component of an upper level set) equipped with this tree structure or partial ordering is what is called the cluster tree (here we consider the continuous cluster tree that includes all levels; in some other works, the levels are restricted to those where a topological change occurs) — and what Hartigan calls the “density-contour tree”. Note that the root represents the entire population while the leaves are the modes (i.e., local maxima).
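The nesting property can be seen directly in the toy example from the introduction (a sketch reusing `grid`, `density`, and `upper_level_components` from the earlier snippets; the level grid is arbitrary): sweeping the levels from high to low and recording the components at each level traces out the cluster tree, with each component at a higher level contained in exactly one component at any lower level.

```python
# Sweep levels from high to low; containment of components across levels is
# the parent relation of the cluster tree.
import numpy as np

for lam in np.linspace(0.95, 0.05, 7) * density.max():
    comps = upper_level_components(grid, density, lam)
    print(f"level {lam:.3f}:", [(round(a, 2), round(b, 2)) for a, b in comps])
```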

The clusters at a particular level do not constitute a partition of the population. Indeed, regardless of $\lambda$, the clusters at level $\lambda$, meaning the connected components of $\mathcal{L}(\lambda)$, form a partition of $\mathcal{L}(\lambda)$ itself, obviously, but not a partition of the support of $f$, of which $\mathcal{L}(\lambda)$ is in general a strict subset. And the cluster tree is only an organization of all the clusters at all levels, and thus also fails to provide a partition of the population. According to a recent review paper by Menardi [40], the region outside the upper level set of interest is dealt with via (supervised) classification. Suppose the level is chosen in some way, perhaps according to the desired number of clusters, and denoted $\lambda_0$. The connected components of $\mathcal{L}(\lambda_0)$ are then computed. Then each point outside $\mathcal{L}(\lambda_0)$ is assigned to one of these clusters by some method for classification, the simplest one being by proximity (a point is assigned to the closest cluster).
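Continuing the toy example (a sketch; the level and the query point are arbitrary), the proximity rule amounts to assigning a point outside the chosen upper level set to the component whose interval is closest:

```python
# Proximity rule: assign a point outside the upper level set to the closest
# cluster, here an interval (a, b) at distance max(a - x, 0, x - b) from x.
def assign_by_proximity(x, intervals):
    """Index of the interval closest to the point x."""
    def dist(a, b):
        return max(a - x, 0.0, x - b)
    return min(range(len(intervals)), key=lambda k: dist(*intervals[k]))

clusters = upper_level_components(grid, density, lam=0.1)
print(assign_by_proximity(0.3, clusters))  # 0.3 lies between the two clusters
```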

3 Clustering via gradient lines: the gradient flow

To talk about gradient lines, we need to assume that the population density $f$ is differentiable. The gradient ascent line starting at a point $x$ is the curve given by the image of $\gamma_x$, the parameterized curve defined by the following ordinary differential equation (ODE)

$$\dot{\gamma}_x(t) = \nabla f(\gamma_x(t)), \qquad \gamma_x(0) = x. \qquad (2)$$

By standard existence and uniqueness results for ODEs [32, Ch 17], if $\nabla f$ is locally Lipschitz, this curve exists and is unique, and it is defined on $[0, \infty)$ with $\gamma_x(t)$ converging to a critical point of $f$ as $t \to \infty$. See Figure 3 for an illustration. Henceforth, we assume that $f$ is twice continuously differentiable, which is certainly enough for such gradient lines to exist.
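As a quick numerical illustration (a sketch reusing `grad_f` from the earlier mixture example; the starting point and time horizon are arbitrary), one can trace a gradient ascent line of (2) with a generic ODE solver, a large finite time standing in for the limit $t \to \infty$:

```python
# Integrate the gradient ascent ODE (2) for the illustrative mixture density.
from scipy.integrate import solve_ivp

sol = solve_ivp(lambda t, y: [grad_f(y[0])], t_span=(0.0, 100.0), y0=[0.5])
print(sol.y[0, -1])  # approaches a critical point, here the mode near x = 2
```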

Figure 3: A sample of gradient ascent lines for the density of Figure 2. The square points are the starting points, while the round points are the end points, which are not only critical points but also local maxima (modes) in this example.

It is intuitive to define clusters based on local maxima, and Fukunaga and Hostetler [23] suggest to “assign each [point] to the nearest mode along the direction of the gradient” — as opposed to the closest mode in Euclidean distance, for instance. Define the basin of attraction of a critical point $x_0$ as $\{x : \gamma_x(t) \to x_0 \text{ as } t \to \infty\}$. It turns out that, if $f$ is of Morse type [41] inside its support, meaning that the Hessian of $f$ at any of its critical points is non-degenerate, then all these basins of attraction, sometimes called stable manifolds, provide a partition of the entire population. In fact, the basins of attraction of the local maxima, by themselves, cover the population, except for a set of zero measure. For more background on Morse functions and their use in statistics, see the recent articles of Chacón [8] and Chen et al. [13].

Remark 3.1.

In their original article, Fukunaga and Hostetler [23] also proposed an implementation: “The approach uses the fact that the expected value of the observations within a small region about a point can be related to the density gradient at that point.” The procedure is now known as the blurring mean-shift algorithm after Cheng [15], who suggested what is now known as the mean-shift algorithm, which is much closer to the plug-in approach suggested in our narrative. The mean-shift algorithm, and the twin problem of estimating the gradient lines of a density, are now well-understood [15, 16, 14, 2, 4]. The behavior of the blurring mean-shift algorithm is not as well understood, although some results do exist [15, 6, 5].
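For reference, here is a sketch of the mean-shift iteration in its common Gaussian-kernel form (our illustration, not the authors' implementation; `sample` is the array from the first snippet and the bandwidth is arbitrary): each point is repeatedly replaced by a kernel-weighted average of the sample, which moves it along an estimate of the gradient line.

```python
# Gaussian-kernel mean shift in dimension 1: iterate the weighted average
# until the point stops moving; the fixed point is (near) a mode of the KDE.
import numpy as np

def mean_shift(x, sample, bandwidth=0.5, tol=1e-8, max_iter=1000):
    """Run the mean-shift iteration from x and return the limit point."""
    for _ in range(max_iter):
        w = np.exp(-0.5 * ((sample - x) / bandwidth) ** 2)  # kernel weights
        x_new = np.sum(w * sample) / np.sum(w)              # weighted average
        if abs(x_new - x) < tol:
            break
        x = x_new
    return x

print(mean_shift(0.5, sample))  # lands near a mode of the estimated density
```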

4 Relating the cluster tree and the gradient flow

We assume that the density $f$ is twice continuously differentiable and of Morse type within its support so that we may freely discuss the partition of the population given by the gradient ascent lines.

For the same population, consider the cluster tree. We saw that this object does not by itself provide a partition of the population, but a look at Figure 2 points to that possibility where, with a little imagination, we may see the contours (representing the level sets) as ‘moving’ upward. As it happens, this can be formalized, and the result is a partition that coincides with that defined by the gradient flow. Despite being quite intuitive, this correspondence appears to be novel. It is developed in the present section.

4.1 The gradient flow follows the cluster tree

A close relationship between the cluster tree and the gradient flow can be anticipated by a look at Figure 3, where it appears that gradient lines do not cross different clusters at the same level. This happens to be true, as we argue below.

To see the situation more clearly, take a point $x$ in the basin of attraction of a local maximum $x_0$ and let $\gamma_x$ denote the gradient line between $x$ and $x_0$ as defined in (2). Assume that $x \ne x_0$, so that the situation is not trivial. We have assumed that $f$ is Morse, so that this situation is (almost) generic. The function $t \mapsto f(\gamma_x(t))$ is non-decreasing by construction and, as a consequence, $\gamma_x$ is ‘compatible’ with the cluster tree in the sense that it does not cross from one connected component to another one at the same level. Indeed, such a crossing would require an incursion in the region between the two connected components, say at level $\lambda$, and that region is, by definition, outside the upper level set $\mathcal{L}(\lambda)$ and thus displays values of $f$ that are $< \lambda$. Crossing from one cluster to another at level $\lambda$ would thus imply going from values of $f$ that are $\ge \lambda$ when in the first cluster, to values $< \lambda$ when in the intermediary region, and finally to values $\ge \lambda$ when entering the second cluster.

To offer a different perspective, take $x$ with $f(x) > 0$, and let $C(x)$ be the component of $\mathcal{L}(f(x))$ such that $x \in C(x)$ — so that $C(x)$ is the ‘last’ cluster containing $x$ when moving upward in the cluster tree (away from the root) — then $\gamma_x(0) = x \in C(x)$; $\gamma_x(t)$ belongs to a descendant of $C(x)$ for all $t \ge 0$; and if $C(\gamma_x(t))$ denotes the last cluster containing $\gamma_x(t)$, then $C(\gamma_x(t))$ is a descendant of $C(\gamma_x(s))$ for all $t \ge s$.

What was just said hinges on the fact that $\gamma_x$ does not pass through any critical point, except in the limit when reaching $x_0$, and in particular does not come into contact with any point where a level set splits. While the first part of this claim is well-known in dynamical systems theory, the second part is justified as follows.

Lemma 4.1.

Any point at the intersection of the closures of two connected components of $\mathcal{L}(\lambda)^\circ$ (the interior of $\mathcal{L}(\lambda)$) is a critical point.

Note that the statement is void unless $\lambda$ is such that $\mathcal{L}(\lambda)$ has strictly fewer components than $\mathcal{L}(\lambda)^\circ$, meaning that a topological event happens at level $\lambda$.

Proof.

The level set of $f$ at level $\lambda$ is defined by

$$\mathcal{M}(\lambda) := \{x \in \mathbb{R}^d : f(x) = \lambda\}. \qquad (3)$$

Note that an upper level set, say $\mathcal{L}(\lambda)$, is defined by its border, $\mathcal{M}(\lambda)$, and here we work with the latter.

Let $x_*$ be a regular point in $\mathcal{M}(\lambda)$. There exists an open set $U$ containing $x_*$ such that $\mathcal{M}(\lambda) \cap U$ is a $(d-1)$-dimensional surface by the Constant Rank Level Set Theorem [34, Th 5.12]. If necessary, shrink $U$ so that $\|\nabla f(x)\| \ge c_0$ for some $c_0 > 0$ for all $x \in U$. [20, Lem 4.11] can be used to show that the set $\mathcal{M}(\lambda) \cap U$ has a positive reach. Denote $\nu(x) := \nabla f(x)/\|\nabla f(x)\|$. Following the calculations in the proof of [47, Lem 2.1], it can be shown that there exists $r > 0$ such that, for all $x \in \mathcal{M}(\lambda) \cap U$ and $t \in (0, r]$,

$$f(x - t\,\nu(x)) < \lambda \quad \text{and} \quad f(x + t\,\nu(x)) > \lambda. \qquad (4)$$

In other words, in a small neighborhood the two sides of the surface have values of $f$ strictly below and above $\lambda$, respectively. This then implies that $x_*$ cannot be at the intersection of the closures of two connected components of $\mathcal{L}(\lambda)^\circ$. ∎

4.2 The cluster tree follows the gradient flow

The cluster tree organizes the clusters (i.e., the connected components of the upper level sets) across all levels. We show now that the gradient flow provides a natural, almost canonical way of moving along the tree from the root (the population) to the leaves (the modes), thus reinforcing the case that the cluster tree and the gradient flow are intimately related when it comes to defining a partition of the entire population.

Suppose that there are no critical points at level $\lambda$, i.e., no $x$ such that $f(x) = \lambda$ and $\nabla f(x) = 0$. Note this is the generic situation, as the critical points form a discrete set by the fact that $f$ is Morse. Then, for $\delta > 0$ small enough, there are no critical points at any level between $\lambda$ and $\lambda + \delta$. By standard Morse theory [43, Th 2.6], this implies that there are no ‘topological events’ between levels $\lambda$ and $\lambda + \delta$ in that $\mathcal{L}(\lambda)$ and $\mathcal{L}(\lambda + \delta)$ are homeomorphic. In particular, these two level sets have the same number of connected components, and if $C$ is a connected component of $\mathcal{L}(\lambda)$, then it contains a single connected component, say $C'$, of $\mathcal{L}(\lambda + \delta)$. Knowing all this, a natural way to move $\mathcal{L}(\lambda)$ to $\mathcal{L}(\lambda + \delta)$ is by (metric) projection of each component of $\mathcal{L}(\lambda)$ onto the component of $\mathcal{L}(\lambda + \delta)$ that it contains. It turns out that, by taking $\delta$ small enough but still positive, this operation becomes well-defined and a diffeomorphism. Moreover, in the infinitesimal regime where $\delta \to 0$, the transformation coincides with the gradient flow (at a certain speed). We support these assertions with several lemmas presented below.

Lemma 4.2.

In the present context, for $\delta > 0$ small enough, the metric projection onto $\mathcal{M}(\lambda + \delta)$ is a homeomorphism sending $\mathcal{M}(\lambda)$ to $\mathcal{M}(\lambda + \delta)$.

Proof.

Let $H_\delta$ denote the Hausdorff distance between $\mathcal{M}(\lambda)$ and $\mathcal{M}(\lambda + \delta)$, and let $\rho_\delta$ denote the minimum of the reach of $\mathcal{M}(\lambda)$ and that of $\mathcal{M}(\lambda + \delta)$. By applying [11, Th 1], it suffices to show that, for $\delta$ small enough,

$$H_\delta < c\, \rho_\delta \quad \text{for a suitable absolute constant } c > 0, \qquad (5)$$

which guarantees that $\mathcal{M}(\lambda)$ and $\mathcal{M}(\lambda + \delta)$ are ‘normal compatible’, meaning that both the projection of $\mathcal{M}(\lambda)$ onto $\mathcal{M}(\lambda + \delta)$ and the projection of $\mathcal{M}(\lambda + \delta)$ onto $\mathcal{M}(\lambda)$ are homeomorphisms. First, it follows from [20, Lem 4.11] that there exists $\rho_0 > 0$ such that

$$\mathrm{reach}(\mathcal{M}(\lambda')) \ge \rho_0 \quad \text{for all } \lambda' \in [\lambda, \lambda + \delta]. \qquad (6)$$

Then (5) holds for $\delta$ small enough, because $H_\delta \to 0$ as $\delta \to 0$, as shown, for example, in the proof of [47, Lem 2.1]. ∎

Remark 4.3.

Henceforth, the projection of $\mathcal{M}(\lambda)$ onto $\mathcal{M}(\lambda + \delta)$ will be denoted $P_\delta$. The previous lemma states that $P_\delta$ is a homeomorphism when there are no critical points at level $\lambda$ and $\delta$ is small enough, but in fact $P_\delta$ is a diffeomorphism under the same circumstances. This can be established following standard lines and further details are omitted, in particular since this is not essential to our narrative.

Lemma 4.4.

For any $x$ such that $f(x) = \lambda$,

$$\lim_{\delta \to 0^+} \frac{P_\delta(x) - x}{\delta} = \frac{\nabla f(x)}{\|\nabla f(x)\|^2}.$$

Proof.

Here $x$ and $\lambda := f(x)$ are considered fixed. Let $x_\delta$ be short for $P_\delta(x)$. It is well-known that for any non-critical point $y$, $\nabla f(y)$ is orthogonal to the level set passing through $y$ and pointing inwards. When $\delta$ is small enough, $x_\delta - x$ is parallel to $\nabla f(x_\delta)$ by [20, Lem 4.8(2)]. So we can write

$$x_\delta - x = s_\delta\, \nabla f(x_\delta) \qquad (7)$$

for some scalar $s_\delta$. Note that $x_\delta \to x$ as $\delta \to 0$, since $\|x_\delta - x\| \le H_\delta \to 0$. Using a Taylor expansion, we have for some $\theta_\delta \in [0, 1]$,

$$\delta = f(x_\delta) - f(x) = \big\langle \nabla f(x + \theta_\delta (x_\delta - x)),\, x_\delta - x \big\rangle. \qquad (8)$$

Extracting the expression of $s_\delta$ from (8) and plugging it into (7), we get

$$\frac{x_\delta - x}{\delta} = \frac{\nabla f(x_\delta)}{\big\langle \nabla f(x + \theta_\delta (x_\delta - x)),\, \nabla f(x_\delta) \big\rangle}. \qquad (9)$$

We then conclude by noting that, as $\delta \to 0$, $x_\delta \to x$. This implies that $\nabla f(x_\delta) \to \nabla f(x)$, by the fact that $f$ is twice continuously differentiable, so that $\nabla f$ is continuous. And this also implies that $\langle \nabla f(x + \theta_\delta (x_\delta - x)), \nabla f(x_\delta) \rangle \to \|\nabla f(x)\|^2$, because of the uniform continuity of $\nabla f$ on the line segment joining $x$ and $x_\delta$. ∎

Consider the following gradient flow

$$\dot{\zeta}_x(t) = \frac{\nabla f(\zeta_x(t))}{\|\nabla f(\zeta_x(t))\|^2}, \qquad \zeta_x(0) = x. \qquad (10)$$

Starting at the same point $x$, the gradient line defined by $\zeta_x$ is the same as the gradient line defined by $\gamma_x$ given in (2), but it is traveled at a different speed, and when relating the gradient lines to the cluster tree, this variant of the gradient flow is particularly compelling because it is, in effect, parameterized by the level.

Lemma 4.5.

The gradient flow given in (10) has the following property. Starting from a point $x$ at level $\lambda$, meaning that $f(x) = \lambda$, it holds that $f(\zeta_x(t)) = \lambda + t$ for all $t \ge 0$ as long as $\nabla f(\zeta_x(s)) \ne 0$ for all $s \in [0, t]$. In fact, the transformation $x \mapsto \zeta_x(t)$ provides a homeomorphism from $\mathcal{M}(\lambda)$ to $\mathcal{M}(\lambda + t)$ whenever there are no critical points at any level anywhere between $\lambda$ and $\lambda + t$, inclusive.

Proof.

The result is essentially known, but since we were not able to locate it in this form, we provide a short proof. Throughout, $\lambda$ and $t$ are considered fixed.

Using elementary calculus and the definition in (10), we have

$$\frac{d}{ds} f(\zeta_x(s)) = \Big\langle \nabla f(\zeta_x(s)),\, \dot{\zeta}_x(s) \Big\rangle = \bigg\langle \nabla f(\zeta_x(s)),\, \frac{\nabla f(\zeta_x(s))}{\|\nabla f(\zeta_x(s))\|^2} \bigg\rangle = 1. \qquad (11)$$

Hence, $f(\zeta_x(t)) = f(x) + t = \lambda + t$, giving the first part of the statement.

Let $T_t(x) := \zeta_x(t)$. We show that $T_t$ is a homeomorphism when there are no critical points in $\mathcal{M}(\lambda + s)$ for any $s \in [0, t]$, or equivalently, no critical point in $K := \{y : \lambda \le f(y) \le \lambda + t\}$. Under this condition, the trajectories of $\zeta_x$ and $\zeta_y$ do not intersect for any two different starting points $x \ne y$, as is well-known. This implies that the map $T_t$ is injective. For any $y \in \mathcal{M}(\lambda + t)$, consider the gradient descent flow ($\zeta$ in reverse) given by

$$\dot{\eta}_y(s) = -\frac{\nabla f(\eta_y(s))}{\|\nabla f(\eta_y(s))\|^2}, \qquad \eta_y(0) = y. \qquad (12)$$

Let $S_t(y) := \eta_y(t)$. It follows from a calculation similar to (11) that $f(\eta_y(s)) = f(y) - s$, so that $S_t(y) \in \mathcal{M}(\lambda)$. Notice that $T_t(S_t(y)) = y$, so that we can write $y = T_t(S_t(y))$, and therefore the map $T_t$ is surjective. In the process, we have shown that the inverse of the gradient ascent operation — the function $T_t$ defined via $\zeta$ in (10) — is given by the corresponding gradient descent operation — the function $S_t$ defined via $\eta$ in (12).

The proof is completed after we show that $T_t$ is continuous. (The continuity of $T_t^{-1} = S_t$ can be proved in a similar way.) We start by stating that, because $\nabla f(y) \ne 0$ for all $y \in K$, and by the assumption that $\nabla f$ is continuous, there is $c_1 > 0$ such that $\|\nabla f(y)\| \ge c_1$ for all $y \in K$. Define $V := \nabla f / \|\nabla f\|^2$, which is the map that drives the gradient flow in (10). This map is continuous, and $K$ is compact, so that there is $c_2$ such that $\|V(y)\| \le c_2$ for all $y \in K$. The map $V$ is also differentiable, with

$$\nabla V = \frac{\nabla^2 f}{\|\nabla f\|^2} - \frac{2\, \nabla f\, (\nabla^2 f\, \nabla f)^\top}{\|\nabla f\|^4}, \qquad (13)$$

where $\nabla V$ denotes here the Jacobian matrix of $V$, while $\nabla^2 f$ denotes the Hessian of $f$. And by continuity of $\nabla f$ and of $\nabla^2 f$, and the fact that $\|\nabla f\| \ge c_1$ on $K$, and again the fact that $K$ is compact, there exists $c_3$ such that $\|\nabla V(y)\| \le c_3$ for all $y \in K$, where $\|\cdot\|$ denotes here any matrix norm. In other words, $V$ is bounded and has bounded gradient on $K$.

Let $\{y_1, \dots, y_m\}$ be a $\rho$-packing of $K$, meaning that $K \subset \bigcup_i B(y_i, \rho)$ and also that $\|y_i - y_j\| \ge \rho$ whenever $i \ne j$. And define $K_\rho := \bigcup_i B(y_i, 2\rho)$. Note that $K_\rho$ is open with $K \subset K_\rho$, so that, taking $\rho$ small enough and enlarging the constants if needed, $\|V(y)\| \le c_2$ and $\|\nabla V(y)\| \le c_3$ for all $y \in K_\rho$. If $y, z \in K$ are such that $\|y - z\| \le \rho$, there must be $i$ such that $y, z \in B(y_i, 2\rho)$, and because that ball is convex, we have

$$\|V(y) - V(z)\| \le c_3\, \|y - z\|. \qquad (14)$$

If, on the other hand, $\|y - z\| > \rho$, then we can simply write

$$\|V(y) - V(z)\| \le 2 c_2 \le (2 c_2 / \rho)\, \|y - z\|. \qquad (15)$$

Hence, $V$ is Lipschitz on $K$ with corresponding constant bounded by $c_4 := \max(c_3, 2 c_2 / \rho)$. Finally, note that $\zeta_x(0) = x \in K$ and that, for any $x \in \mathcal{M}(\lambda)$, $\zeta_x(s) \in K$ for all $s \in [0, t]$, since $f(\zeta_x(s)) = \lambda + s \in [\lambda, \lambda + t]$. All together, we are in a position to apply a standard result on the dependence of the gradient flow on the initial condition, for example, the main theorem in [32, Sec 17.3], which gives the bound

$$\|\zeta_y(s) - \zeta_z(s)\| \le \|y - z\|\, e^{c_4 s} \quad \text{for all } s \in [0, t], \qquad (16)$$

which, in particular, implies

$$\|T_t(y) - T_t(z)\| \le c_5\, \|y - z\|, \qquad (17)$$

with $c_5 := e^{c_4 t}$. ∎

Remark 4.6.

The gradient flow defined in (2) is the vector field corresponding to $\nabla f$, while the variant of (10) corresponds to $\nabla f / \|\nabla f\|^2$. Fukunaga and Hostetler [23] proposed to use a different variant: the one based on $\nabla f / f$, the idea being to quickly move points in low density regions to higher density regions.
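To make the comparison explicit (a side computation of ours, by the chain rule as in (11)): if $\dot{\gamma}(t) = V(\gamma(t))$ for a vector field $V$, then $\frac{d}{dt} f(\gamma(t)) = \langle \nabla f(\gamma(t)), V(\gamma(t)) \rangle$, so that

$$V = \nabla f \;\Rightarrow\; \frac{d}{dt} f(\gamma(t)) = \|\nabla f(\gamma(t))\|^2; \qquad V = \frac{\nabla f}{\|\nabla f\|^2} \;\Rightarrow\; \frac{d}{dt} f(\gamma(t)) = 1; \qquad V = \frac{\nabla f}{f} \;\Rightarrow\; \frac{d}{dt} f(\gamma(t)) = \frac{\|\nabla f(\gamma(t))\|^2}{f(\gamma(t))}.$$

The last variant thus moves fastest through the levels where the density is low, consistent with the stated motivation.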

5 Discussion

In this short paper we have established what we regard as an important correspondence between level set clustering and gradient line clustering (i.e., the mean-shift algorithm), which can now be understood as representing the same fundamental approach to clustering.

This approach can keep the name modal clustering, as this term has already been used to refer to a class of methods that are inspired by this approach [40, 9], which already includes single-linkage clustering [30, 29, 28, 45] and other related methods based on nearest neighbors [58, 59], including DBSCAN [19] and some forms of spectral clustering [1].

References

  • Arias-Castro [2011] Arias-Castro, E. (2011). Clustering based on pairwise distances when the data is of mixed dimensions. IEEE Transactions on Information Theory 57(3), 1692–1706.
  • Arias-Castro et al. [2016] Arias-Castro, E., D. Mason, and B. Pelletier (2016). On the estimation of the gradient lines of a density and the consistency of the mean-shift algorithm. The Journal of Machine Learning Research 17(1), 1487–1514.
  • Burman and Polonik [2009] Burman, P. and W. Polonik (2009). Multivariate mode hunting: Data analytic tools with measures of significance. Journal of Multivariate Analysis 100(6), 1198–1218.
  • Carreira-Perpiñán [2000] Carreira-Perpiñán, M. A. (2000). Mode-finding for mixtures of Gaussian distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(11), 1318–1323.
  • Carreira-Perpinán [2006] Carreira-Perpinán, M. A. (2006). Fast nonparametric clustering with Gaussian blurring mean-shift. In International Conference on Machine Learning, pp. 153–160.
  • Carreira-Perpinán [2008] Carreira-Perpinán, M. A. (2008). Generalised blurring mean-shift algorithms for nonparametric clustering. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8.
  • Carreira-Perpiñán [2015] Carreira-Perpiñán, M. Á. (2015). Clustering methods based on kernel density estimators: Mean-shift algorithms. In C. Hennig, M. Meila, F. Murtagh, and R. Rocci (Eds.), Handbook of Cluster Analysis, pp. 404–439. Chapman and Hall/CRC.
  • Chacón [2015] Chacón, J. E. (2015). A population background for nonparametric density-based clustering. Statistical Science 30(4), 518–532.
  • Chacón [2020] Chacón, J. E. (2020). The modal age of statistics. International Statistical Review 88(1), 122–141.
  • Chaudhuri et al. [2014] Chaudhuri, K., S. Dasgupta, S. Kpotufe, and U. Von Luxburg (2014). Consistent procedures for cluster tree estimation and pruning. IEEE Transactions on Information Theory 60(12), 7900–7912.
  • Chazal et al. [2007] Chazal, F., A. Lieutier, and J. Rossignac (2007). Normal-map between normal-compatible manifolds. International Journal of Computational Geometry & Applications 17(05), 403–421.
  • Chen et al. [2017a] Chen, Y.-C., C. R. Genovese, and L. Wasserman (2017a). Density level sets: Asymptotics, inference, and visualization. Journal of the American Statistical Association 112(520), 1684–1696.
  • Chen et al. [2017b] Chen, Y.-C., C. R. Genovese, and L. Wasserman (2017b). Statistical inference using the Morse-Smale complex. Electronic Journal of Statistics 11(1), 1390–1433.
  • Cheng et al. [2004] Cheng, M.-Y., P. Hall, and J. A. Hartigan (2004). Estimating gradient trees. In A Festschrift for Herman Rubin, pp. 237–249. Institute of Mathematical Statistics.
  • Cheng [1995] Cheng, Y. (1995). Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 17(8), 790–799.
  • Comaniciu and Meer [2002] Comaniciu, D. and P. Meer (2002). Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(5), 603–619.
  • Dümbgen and Walther [2008] Dümbgen, L. and G. Walther (2008). Multiscale inference about a density. The Annals of Statistics 36(4), 1758–1785.
  • Eldridge et al. [2015] Eldridge, J., M. Belkin, and Y. Wang (2015). Beyond Hartigan consistency: Merge distortion metric for hierarchical clustering. In Conference on Learning Theory, pp. 588–606.
  • Ester et al. [1996] Ester, M., H.-P. Kriegel, J. Sander, and X. Xu (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Knowledge Discovery and Data Mining, pp. 226–231.
  • Federer [1959] Federer, H. (1959). Curvature measures. Transactions of the American Mathematical Society 93(3), 418–491.
  • Fisher [1958] Fisher, W. D. (1958). On grouping for maximum homogeneity. Journal of the American Statistical Association 53(284), 789–798.
  • Forgy [1965] Forgy, E. (1965). Cluster analysis of multivariate data: Efficiency versus interpretability of classifications. Biometrics 21, 768–780.
  • Fukunaga and Hostetler [1975] Fukunaga, K. and L. Hostetler (1975). The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory 21(1), 32–40.
  • Genovese et al. [2016] Genovese, C. R., M. Perone-Pacifico, I. Verdinelli, and L. Wasserman (2016). Non-parametric inference for density modes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78(1), 99–126.
  • Gower and Ross [1969] Gower, J. C. and G. J. Ross (1969). Minimum spanning trees and single linkage cluster analysis. Journal of the Royal Statistical Society: Series C 18(1), 54–64.
  • Hartigan [1975] Hartigan, J. (1975). Clustering algorithms. Wiley.
  • Hartigan and Hartigan [1985] Hartigan, J. and P. Hartigan (1985). The dip test of unimodality. The Annals of Statistics 13(1), 70–84.
  • Hartigan [1977a] Hartigan, J. A. (1977a). Clusters as modes. In First International Symposium on Data Analysis and Informatics. IRIA, Versailles.
  • Hartigan [1977b] Hartigan, J. A. (1977b). Distribution problems in clustering. In J. V. Ryzin (Ed.), Classification and Clustering, pp. 45–71. Elsevier.
  • Hartigan [1981] Hartigan, J. A. (1981). Consistency of single linkage for high-density clusters. Journal of the American Statistical Association 76(374), 388–394.
  • Hartigan and Wong [1979] Hartigan, J. A. and M. Wong (1979). A K-means clustering algorithm. Journal of the Royal Statistical Society: Series C 28(1), 100–108.
  • Hirsch et al. [2012] Hirsch, M. W., S. Smale, and R. L. Devaney (2012). Differential equations, dynamical systems, and an introduction to chaos (3rd ed.). Elsevier Science & Technology.
  • Koontz and Fukunaga [1972] Koontz, W. L. and K. Fukunaga (1972). A nonparametric valley-seeking technique for cluster analysis. IEEE Transactions on Computers C-21(2), 171–178.
  • Lee [2013] Lee, J. M. (2013). Introduction to smooth manifolds (2nd ed.). Springer.
  • Li et al. [2007] Li, J., S. Ray, and B. G. Lindsay (2007). A nonparametric statistical approach to clustering via mode identification. Journal of Machine Learning Research 8(59), 1687–1723.
  • Lloyd [1982] Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2), 129–137. The procedure was first proposed in 1957 in unpublished work when the author was at Bell Labs.
  • MacQueen [1967] MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297.
  • Mason and Polonik [2009] Mason, D. M. and W. Polonik (2009). Asymptotic normality of plug-in level set estimates. The Annals of Applied Probability 19(3), 1108–1142.
  • Max [1960] Max, J. (1960). Quantizing for minimum distortion. IRE Transactions on Information Theory 6(1), 7–12.
  • Menardi [2016] Menardi, G. (2016). A review on modal clustering. International Statistical Review 84(3), 413–433.
  • Milnor [1963] Milnor, J. (1963). Morse theory. Princeton University Press.
  • Minnotte [1997] Minnotte, M. C. (1997). Nonparametric testing of the existence of modes. The Annals of Statistics 25(4), 1646–1660.
  • Nicolaescu [2011] Nicolaescu, L. (2011). An invitation to Morse theory. Springer Science & Business Media.
  • Pearson [1894] Pearson, K. (1894). Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London: Series A 185, 71–110.
  • Penrose [1995] Penrose, M. D. (1995). Single linkage clustering and continuum percolation. Journal of Multivariate Analysis 53(1), 94–109.
  • Polonik [1995] Polonik, W. (1995). Measuring mass concentrations and estimating density contour clusters-an excess mass approach. The Annals of Statistics 23(3), 855–881.
  • Qiao [2021] Qiao, W. (2021). Nonparametric estimation of surface integrals on level sets. Bernoulli 27(1), 155–191.
  • Rigollet and Vert [2009] Rigollet, P. and R. Vert (2009). Optimal rates for plug-in estimators of density level sets. Bernoulli 15(4), 1154–1178.
  • Rinaldo et al. [2012] Rinaldo, A., A. Singh, R. Nugent, and L. Wasserman (2012). Stability of density-based clustering. Journal of Machine Learning Research 13, 905.
  • Rinaldo and Wasserman [2010] Rinaldo, A. and L. Wasserman (2010). Generalized density clustering. The Annals of Statistics 38(5), 2678–2722.
  • Roberts [1997] Roberts, S. J. (1997). Parametric and non-parametric unsupervised cluster analysis. Pattern Recognition 30(2), 261–272.
  • Silverman [1981] Silverman, B. W. (1981). Using kernel density estimates to investigate multimodality. Journal of the Royal Statistical Society: Series B (Methodological) 43(1), 97–99.
  • Singh et al. [2009] Singh, A., C. Scott, and R. Nowak (2009). Adaptive Hausdorff estimation of density level sets. The Annals of Statistics 37(5B), 2760–2782.
  • Sneath [1957] Sneath, P. H. (1957). The application of computers to taxonomy. Microbiology 17(1), 201–226.
  • Sriperumbudur and Steinwart [2012] Sriperumbudur, B. and I. Steinwart (2012). Consistency and rates for clustering with DBSCAN. In Artificial Intelligence and Statistics, pp. 1090–1098.
  • Steinwart [2011] Steinwart, I. (2011). Adaptive density level set clustering. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 703–738.
  • Steinwart [2015] Steinwart, I. (2015). Fully adaptive density-based clustering. The Annals of Statistics 43(5), 2132–2167.
  • Stuetzle [2003] Stuetzle, W. (2003). Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. Journal of Classification 20(1), 25–47.
  • Stuetzle and Nugent [2010] Stuetzle, W. and R. Nugent (2010). A generalized single linkage method for estimating the cluster tree of a density. Journal of Computational and Graphical Statistics 19(2), 397–418.
  • Tsybakov [1997] Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. The Annals of Statistics 25(3), 948–969.
  • Walther [1997] Walther, G. (1997). Granulometric smoothing. The Annals of Statistics 25(6), 2273–2299.
  • Wang et al. [2019] Wang, D., X. Lu, and A. Rinaldo (2019). DBSCAN: Optimal rates for density-based cluster estimation. Journal of Machine Learning Research 20, 1–50.
  • Ward [1963] Ward, J. H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association 58(301), 236–244.