DBSCAN (DBSCAN) is one of the most popular clustering algorithms among practitioners and has had profound success in a wide range of data analysis applications. Despite this, its statistical properties are not fully understood. The goal of this work is to give a theoretical analysis of the procedure and, to the best of our knowledge, to provide the first analysis of density level-set estimation on manifolds. We also contribute ideas to related areas that may be of independent interest.
DBSCAN aims at discovering clusters which turn out to be the high-density regions of the dataset. It takes in two hyperparameters: minPts and $\epsilon$. It defines a point as a core-point if there are at least minPts sample points in its $\epsilon$-radius neighborhood. The points within the $\epsilon$-radius neighborhood of a core-point are said to be directly reachable from that core-point. Then, a point $q$ is reachable from a core-point $p$ if there exists a path from $p$ to $q$ where each point is directly reachable from the previous point. This definition of reachability gives a partitioning of the dataset (and remaining points not reachable from any core-point are considered noise). This partitioning is the clustering that is returned by DBSCAN.
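The procedure just described can be illustrated with a minimal Python sketch. It is not the reference implementation from (DBSCAN): the function name `dbscan`, the brute-force neighbor search, and the toy data are assumptions made for illustration, and a point is counted as lying in its own neighborhood.

```python
import math
from collections import deque

def dbscan(points, min_pts, eps):
    """Return (labels, core): labels[i] is a cluster id, or -1 for noise."""
    n = len(points)
    # eps-radius neighborhood of every point (each point is its own neighbor).
    nbrs = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
            for i in range(n)]
    core = [len(nbrs[i]) >= min_pts for i in range(n)]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # Grow one cluster: every point reachable from core-point i.
        labels[i] = cluster
        queue = deque([i])
        while queue:
            p = queue.popleft()
            for q in nbrs[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if core[q]:  # only core-points extend the path
                        queue.append(q)
        cluster += 1
    return labels, core

# Two tight groups and one isolated point (hypothetical toy data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (10.0, 0.0)]
labels, core = dbscan(points, min_pts=3, eps=0.2)
print(labels)
```

In this sketch a non-core point within $\epsilon$ of core-points from two clusters is assigned to whichever cluster reaches it first, which is one common convention.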
The problem of analyzing DBSCAN has recently been explored in (dbscanConsistency). Their analysis is for a modified version of DBSCAN and is not focused on estimating a fixed density level. Their results have many desirable properties, but are not immediately applicable to the problem this paper addresses. Using recent developments in topological data analysis along with some tools we develop in this paper, we show that it is now possible to analyze the original procedure.
The clusters DBSCAN aims at discovering can be viewed as approximations of the connected components of the level sets $\{x : f(x) \ge \lambda\}$ where $f$ is the density and $\lambda$ is some density level. We provide the first comprehensive analysis of tuning minPts and $\epsilon$ to estimate the density level set for a particular level. Here, the density level $\lambda$ is known to the algorithm while the density remains unknown. Density level set estimation has been studied extensively, e.g., (carmichael; hartigan; polonik95; cuevas; walther97; tysbakovMinimax; baillo; cadre; willet; biau; RV09; maier09; adaptive; generalizedDensity; S11; rinaldo12; S15; chen16; jiang17). However, approaches that obtain state-of-the-art consistency results are largely impractical (i.e. unimplementable). Our work shows that DBSCAN, a procedure that has been known for decades and used widely, can achieve the strongest known results. Also, unlike much of the existing work, we show that DBSCAN can recover the connected components of the level sets separately and bijectively.
Our work begins with the insight that DBSCAN behaves like an $\epsilon$-neighborhood graph, which is different from, but related to, the $k$-nearest neighbor graph. The latter has been heavily used for cluster-tree estimation (CD10; stuetzle10; KV11; CDKvL14; jiang2017modal) and in this paper we adapt some of these ideas for $\epsilon$-neighborhood graphs.
Cluster-tree estimation aims at discovering the hierarchical tree structure of the connected components as the levels vary. balakrishnan2013cluster extends results by CD10 to the setting where the data lies on a lower dimensional manifold and provides consistency results depending on the lower dimension and independent of the ambient dimension. Here we are instead interested in how to set minPts and $\epsilon$ in order to estimate a particular level and provide rates on the Hausdorff distance error. This is different from works on cluster tree estimation, which focus on how to recover the tree structure rather than recovering a particular level. In that regard, we also require density estimation bounds in order to get a handle on the true density levels and the empirical ones.
optimalknn gives us optimal high-probability finite-sample $k$-NN density estimation bounds which hold uniformly; this is key to obtaining optimal level-set estimation rates under the Hausdorff error. Much of the previous work on density level-set estimation, e.g. (RV09), provides rates under risk measures such as symmetric set difference. These metrics are considerably weaker than the Hausdorff metric; the latter is a uniform guarantee. There are such bounds for the histogram density estimator. This allowed adaptive to obtain optimal rates under the Hausdorff metric, while having a fully adaptive procedure. This was a significant breakthrough for level set estimation, as discussed by chazal15. We believe these to be the strongest consistency results obtained thus far. However, a downside is that the histogram density estimator has little practical value. Here, aided with the desired bounds on the $k$-NN density estimator, we can actually obtain similar results to adaptive but with the clearly practical DBSCAN.
We extend the $k$-NN density estimation results of optimalknn to the manifold case, as the bulk of our analysis is about the more general case that the data lies on a manifold. Density-based procedures perform poorly in high dimensions since the number of samples required increases exponentially in the dimension, the so-called curse of dimensionality. Thus, the consequences of handling the manifold case are of practical significance. Since the estimation rates we obtain depend only on the intrinsic dimension, this explains why DBSCAN can do well in high dimensions if the data has low intrinsic dimension (i.e. the manifold hypothesis). Given the modern capacity of systems to collect data of increasing complexity, it has become ever more important to understand the feasibility of practical algorithms in high dimensions.
To analyze DBSCAN, we write minPts and $\epsilon$ in terms of $d$, the unknown manifold dimension; $k$, which controls the density estimator; and $\lambda$, which determines which level to estimate. We assume knowledge of $\lambda$ with the goal of estimating the $\lambda$-level set of the density. We give a range of $k$ in terms of $n$ and corresponding consistency guarantees and estimation rates for such choices. We then adaptively tune $d$ and $k$ in order to attain close to optimal performance with no a priori knowledge of the distribution. Adaptivity is highly desirable because it allows for automatic tuning of the hyper-parameters, which is a core tenet of unsupervised learning. To solve for the unknown dimension, we use an estimator from intrinsicKnnNew, which we show to have considerably better finite-sample behavior than previously thought. More details and discussion of related works are in the main text. We then provide a new method of choosing $k$ such that it will asymptotically approach a value that provides near-optimal level set estimation rates.
We start by analyzing the procedure under the manifold assumption. The end of the paper will discuss the full-dimensional setting. The bulk of our contribution lies in analyzing the former situation, while the analysis of the latter uses a subset of those techniques.
Section 7 explains how to adaptively tune the parameters so that they fall within the theoretical ranges. The main contributions of this section are a stronger result about a known $k$-nearest neighbor based approach to estimating the unknown dimension (Theorem 3) and a new way to tune $k$ to approach an optimal choice of $k$ (Theorem 4).
Section 8 gives the result when the data lives in $\mathbb{R}^D$ without the manifold assumption.
3 The connection to neighborhood graphs
This section is dedicated to understanding the clusters produced by DBSCAN. The algorithm can be found in (DBSCAN) and is not shown here since Lemma 1 characterizes what DBSCAN returns.
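We have $n$ i.i.d. samples $X = \{x_1, \ldots, x_n\}$ drawn from a distribution $\mathcal{F}$ over $\mathbb{R}^D$.
Define the $k$-NN radius of $x \in \mathbb{R}^D$ as
$$r_k(x) := \inf\{r > 0 : |X \cap B(x, r)| \ge k\},$$
where $B(x, r)$ denotes the Euclidean ball of radius $r$ centered at $x$. Let $G(k, \epsilon)$ denote the $\epsilon$-neighborhood level graph of $X$ with vertices $\{x \in X : r_k(x) \le \epsilon\}$ and an edge between $x$ and $x'$ iff $|x - x'| \le \epsilon$.
This is slightly different from the $\epsilon$-neighborhood graph, which includes all vertices. Here we exclude vertices below a certain empirical density level (i.e. $r_k(x) > \epsilon$).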
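As a hedged illustration of this construction, the sketch below computes a $k$-NN radius and the connected components of an $\epsilon$-neighborhood level graph by brute force. The function names, the union-find bookkeeping, and the conventions that a sample point counts toward its own $k$-NN radius and that an edge joins two retained vertices at Euclidean distance at most $\epsilon$ are assumptions of this sketch.

```python
import math

def knn_radius(x, sample, k):
    """Smallest r such that the closed ball B(x, r) holds at least k sample points."""
    dists = sorted(math.dist(x, s) for s in sample)
    return dists[k - 1]

def level_graph_components(sample, k, eps):
    """Connected components of the level graph, as sorted lists of indices."""
    # Keep only vertices whose k-NN radius is at most eps.
    verts = [i for i, x in enumerate(sample) if knn_radius(x, sample, k) <= eps]
    parent = {i: i for i in verts}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Join retained vertices that lie within eps of each other.
    for a in verts:
        for b in verts:
            if a < b and math.dist(sample[a], sample[b]) <= eps:
                parent[find(a)] = find(b)

    comps = {}
    for i in verts:
        comps.setdefault(find(i), []).append(i)
    return sorted(comps.values())

# Hypothetical toy sample: two dense groups and one isolated point.
sample = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (10.0, 0.0)]
components = level_graph_components(sample, k=3, eps=0.2)
print(components)
```

The isolated point has a 3-NN radius far above 0.2, so it is excluded as a vertex rather than forming a singleton component.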
The next definition is relevant to DBSCAN and is from (DBSCAN) but in the notation of Definition 1.
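The following is with respect to fixed $k$ and $\epsilon$.
$p$ is a core-point if $r_k(p) \le \epsilon$.
$q$ is directly density-reachable from $p$ if $|p - q| \le \epsilon$ and $p$ is a core-point.
$q$ is density-reachable from $p$ if there exists a sequence $p = p_1, p_2, \ldots, p_m = q$ such that $p_{i+1}$ is directly density-reachable from $p_i$ for $i = 1, \ldots, m - 1$.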
The following result is paraphrased from Lemmas 1 and 2 from (DBSCAN), which characterizes the clusters learned by DBSCAN.
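(DBSCAN) Let $\mathcal{C}$ be the clusters returned by DBSCAN(minPts, $\epsilon$). For any core-point $x$, there exists $C \in \mathcal{C}$ with $x \in C$. On the other hand, for any $C \in \mathcal{C}$, there exists a core-point $x$ such that $C = \{x' : x' \text{ is density-reachable from } x\}$.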
We now show the following result relating the $\epsilon$-neighborhood level graphs and the clusters obtained from DBSCAN. Such an interpretation of DBSCAN has been given in previous works such as campello2015hierarchical.
Lemma 2 (DBSCAN and $\epsilon$-neighborhood level graphs).
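Let $\mathcal{C}$ be the clusters obtained from DBSCAN(minPts, $\epsilon$) on $X$. Let $\mathcal{G}$ be the connected components of $G(\text{minPts}, \epsilon)$. Then, there exists a one-to-one correspondence between $\mathcal{C}$ and $\mathcal{G}$ such that if $C \in \mathcal{C}$ and $G \in \mathcal{G}$ correspond, then
$$G \subseteq C \subseteq \bigcup_{x \in G} B(x, \epsilon) \cap X.$$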
Take any . Each point in is a core-point and by Lemma 1 and the definition of density-reachable, each point in belongs to the same . Thus, . Next we show that .
Suppose there exists core-point but and let . By Lemma 1, there exists core-point such that all points in are directly reachable from . Then there exists a path of core-points from to with pairwise edges of length at most . The same holds for to . Thus there exists such a path of core-points from to , which means that are in the same CC of , contradicting the assumption that and . Thus, in fact . The result now follows since consists of points that are at most from its core-points. ∎
We can now see that DBSCAN’s clusterings can be viewed as the connected components (CCs) of an appropriate $\epsilon$-neighborhood level graph. Using a neighborhood graph to approximate the level set has been studied in (generalizedDensity). The difference is that they use a kernel density estimator instead of a $k$-NN density estimator and study the convergence properties under different settings.
4 Manifold Setting
We make the following regularity assumptions which are standard among works on manifold learning e.g. (manifold07; manifold12; balakrishnan2013cluster).
$\mathcal{F}$ is supported on $M$ where:
$M$ is a $d$-dimensional smooth compact Riemannian manifold without boundary embedded in a compact subset $\mathcal{X} \subseteq \mathbb{R}^D$.
The volume of $M$ is bounded above by a constant.
$M$ has condition number $1/\tau$, which controls the curvature and prevents self-intersection.
Let $f$ be the density of $\mathcal{F}$ with respect to the uniform measure on $M$.
$f$ is continuous and bounded.
4.2 Basic Supporting Bounds
The following result relates the empirical mass of Euclidean balls to the true mass under $\mathcal{F}$. It is a direct consequence of Lemma 6 of balakrishnan2013cluster.
Lemma 3 (Uniform convergence of empirical Euclidean balls (Lemma 6 of balakrishnan2013cluster)).
Let be a minimal fixed set such that each point in is at most distance from some point in . There exists a universal constant such that the following holds with probability at least . For all ,
where , is the empirical distribution, and .
For the rest of the paper, many results are qualified to hold with probability at least . This is precisely the event in which Lemma 3 holds.
If , then .
Next, we need the following bound on the volume of the intersection of a Euclidean ball and $M$; this is required to get a handle on the true mass of the ball under $\mathcal{F}$ in later arguments. The upper and lower bounds follow from upperBoundBall and Lemma 5.3 of lowerBoundBall. The proof is given in the appendix.
Lemma 4 (Ball Volume).
If , and then
where $v_d$ is the volume of a unit ball in $\mathbb{R}^d$ and $\text{vol}_M$ is the volume w.r.t. the uniform measure on $M$.
4.3 $k$-NN Density Estimation
Here, we establish density estimation rates for the $k$-NN density estimator in the manifold setting. This builds on work in density estimation on manifolds, e.g. (hendriks90; pelletier05; ozakin09; kim13; berry17); thus, it may be of independent interest. The estimator is defined as follows.
Definition 3 ($k$-NN Density Estimator).
$$f_k(x) := \frac{k}{n \cdot v_d \cdot r_k(x)^d}.$$
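For concreteness, a minimal sketch of a $k$-NN density estimate on a manifold follows. It assumes the standard form $k / (n \, v_d \, r_k(x)^d)$ with the intrinsic dimension $d$ in place of the ambient one; the function names and the unit-circle example are hypothetical choices for illustration.

```python
import math

def unit_ball_volume(d):
    """Volume of the unit ball in R^d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

def knn_density(x, sample, k, d):
    """k-NN density estimate at x, using the intrinsic dimension d."""
    # k-NN radius; this must be positive, so the sample should not contain
    # k or more copies of x.
    r_k = sorted(math.dist(x, s) for s in sample)[k - 1]
    return k / (len(sample) * unit_ball_volume(d) * r_k ** d)

# Evenly spaced points on the unit circle: a 1-dimensional manifold in R^2
# whose uniform density w.r.t. arc length is 1 / (2 * pi).
n = 200
circle = [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
          for i in range(n)]
f_hat = knn_density(circle[0], circle, k=11, d=1)
print(f_hat, 1 / (2 * math.pi))
```

Using $d = 1$ rather than the ambient dimension 2 is what keeps the estimate on the scale of the arc-length density in this example.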
The following extends previous work of optimalknn to the manifold case. The proofs can be found in the appendix.
Lemma 5 ($f_k$ upper bound).
Lemma 6 ($f_k$ lower bound).
We will often bound the density of points with low density. In low-density regions, there is less data and thus we require more points to get a tight bound. However, in many cases a tight bound is not necessary; thus the purpose of is to allow some slack. The higher the , the easier it is for the lemma conditions to be satisfied. In particular, if is -Hölder continuous (i.e. ), we have .
5 Consistency and Rates
5.1 Level-Set Conditions
Many of the results depend on the behavior of level set boundaries. Thus, we require sufficient drop-off at the boundaries, as well as separation between the CCs at a particular level set. We give the following notion of separation.
are -separated in if there exists a set such that every path from to intersects and .
Define the following shorthands for distance from a point to a set, the intersection of with a neighborhood around a set under the Euclidean distance, and the largest Euclidean distance from a point in a set to its closest sample point.
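$d(x, A) := \inf_{x' \in A} |x - x'|$, $B(C, r) := \{x \in M : d(x, C) \le r\}$, $r_n(C) := \sup_{c \in C} d(c, X)$.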
We make the following mild assumptions, which ensure that the CCs can be separated from the rest of the density by sufficiently wide valleys and that there is sufficient decay around the level set boundaries.
Assumption 3 (Separation Conditions).
Let and be a CCs of . There exists and such that the following holds:
For each , there exists , a connected component of such that:
separates by a valley: does not intersect with any other CC in ; and are -separated by some .
-regularity: For , we have
We can choose any . The -regularity assumption appears in e.g. (adaptive). This is very general and also allows us to make a separate global smoothness assumption.
We currently characterize the smoothness w.r.t. the Euclidean distance. One could alternatively use the geodesic distance on $M$, $d_M$. It follows from Proposition 6.3 of lowerBoundBall that when , we have . Since the distances we deal with in our analysis are of such small order, these distances can thus essentially be treated as equivalent. We use the Euclidean distance throughout the paper for simplicity.
We can define a region which isolates away from other clusters of .
5.2 Parameter Settings
Fix and . Let satisfy the following
where , and and are positive constants depending on which are implicit in the proofs later in this section.
The remainder of this section will be to show that DBSCAN(minPts, ) with
will consistently estimate each CC of . Throughout the text, we denote as the clusters returned by DBSCAN under this setting.
5.3 Separation and Connectedness
Take . We show that DBSCAN will return an estimated CC , such that does not contain any points outside of . Then, we show that contains all the sample points in . The proof ideas are similar to those of standard results in cluster tree estimation; the proofs can be found in the appendix.
Lemma 7 (Separation).
There exists sufficiently large and sufficiently small such that the following holds with probability at least . Let . There exists such that .
Lemma 8 (Connectedness).
There exists sufficiently large and sufficiently small such that the following holds with probability at least . Let . If there exists such that , then .
These results allow to have any dimension between to since we reason with , which contains samples, instead of simply .
5.4 Hausdorff Error
We give the estimation rate under the Hausdorff metric.
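Definition 7 (Hausdorff Distance).
$$d_{\text{Haus}}(A, A') := \max\left\{\sup_{x \in A} d(x, A'), \; \sup_{x' \in A'} d(x', A)\right\},$$
with $d(\cdot, \cdot)$ the Euclidean point-to-set distance.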
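To make the metric concrete, the following sketch computes the Hausdorff distance between two finite point sets by brute force, assuming the usual definition as the larger of the two directed sup-inf distances. The function name and toy sets are illustrative only.

```python
import math

def hausdorff(A, B):
    """Hausdorff distance between two finite, non-empty point sets."""
    def directed(P, Q):
        # Largest distance from a point of P to its nearest point in Q.
        return max(min(math.dist(p, q) for q in Q) for p in P)
    return max(directed(A, B), directed(B, A))

# B equals A plus one far-away point (hypothetical toy sets).
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(0.0, 0.0), (1.0, 0.0), (1.0, 3.0)]
print(hausdorff(A, B))
```

The two sets differ by a single point, so a set-difference style measure would regard them as close, yet the Hausdorff distance equals the full distance from the stray point to $A$. This is the sense in which the Hausdorff metric gives a uniform guarantee.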
There exists sufficiently large and sufficiently small such that the following holds with probability at least . For each , there exists such that
Define . We show that , which involves two directions to show from the Hausdorff metric: that and .
We start by proving . Define . We have
where the first inequality holds when is chosen sufficiently small, and the last inequality holds because . Hence . Therefore, it suffices to show
We have that for , . Thus, for any and letting , we have
For chosen sufficiently small, the last equation will be large enough (i.e. of order ) so that the conditions of Lemma 5 hold. Thus, applying this for each , we obtain
We have the r.h.s. is at most for chosen appropriately and the first direction follows.
We now turn to the other direction, that . Let . Then there exists sample point by definition of and we have that . Finally, for sufficiently large, and thus . The result follows. ∎
When taking , we obtain the error rate of , ignoring logarithmic factors. When , this matches the known lower bound established in Theorem 4 of tysbakovMinimax. However, we do not obtain this rate when . In this case, the density estimation error will be of order at least due in part to the error from resolving the geodesic balls with Euclidean balls. This does not arise in the full dimensional setting, which will be described later.
6 Removal of False Clusters
The result of Theorem 1 guarantees us that for each , there exists that estimates it. In this section, we show how a second application of DBSCAN (Algorithm 1) can remove the false clusters discovered by the first application of DBSCAN with no additional parameters. This gives us the other direction, that each estimate in corresponds to a true CC in , and thus DBSCAN can identify with a one-to-one correspondence each CC of the level-set.
We state our result below. The proof is less involved and is in the appendix.
Theorem 2 (Removal of False CC Estimates).
Define , which is positive. There exists sufficiently large and sufficiently small depending on in addition to the constants mentioned in Section 5.2 so that the following holds with probability at least . For all , there exists such that
7 Adaptive Parameter Tuning
In this section, we show how to obtain the near optimal rates by estimating and adaptively choosing such that without knowledge of .
Knowing the manifold dimension $d$ is necessary to tune the parameters as described in Section 5.2. There has been much work on estimating the intrinsic dimension, as many learning procedures (including this one) require $d$ as an input. Such work on intrinsic dimension estimation includes (intrinsicKegl; intrinsicBickel; intrinsicHein). intrinsicKnnOld and more recently intrinsicKnnNew take a $k$-nearest neighbor approach. We work with the estimate of the dimension at a point proposed in the latter work:
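As a rough illustration of a $k$-nearest neighbor dimension estimate at a point, the sketch below uses the ratio form $\log 2 / \log(r_{2k}(x) / r_k(x))$. This specific form is an assumption of the sketch and may differ in detail from the estimator of intrinsicKnnNew; the function names and the helix data are hypothetical.

```python
import math

def knn_radius(x, sample, k):
    """Distance from x to its k-th nearest sample point (x itself included)."""
    return sorted(math.dist(x, s) for s in sample)[k - 1]

def knn_dimension(x, sample, k):
    """Local dimension estimate from the growth of the k-NN radius."""
    r_k = knn_radius(x, sample, k)
    r_2k = knn_radius(x, sample, 2 * k)
    # Needs r_2k > r_k > 0; otherwise the ratio or its logarithm is degenerate.
    return math.log(2) / math.log(r_2k / r_k)

# Evenly spaced points on a helix: a 1-dimensional curve embedded in R^3.
helix = [(math.cos(0.01 * i), math.sin(0.01 * i), 0.005 * i)
         for i in range(1000)]
d_hat = knn_dimension(helix[500], helix, k=20)
print(d_hat)
```

The intuition is that on a $d$-dimensional manifold the mass of a small ball scales like $r^d$, so doubling the neighbor count should multiply the radius by roughly $2^{1/d}$.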
The main result of intrinsicKnnNew gives a high-probability bound for a single sample . Here we give a high-probability bound under milder smoothness assumptions which holds uniformly for all samples above some density level, given our new knowledge of $k$-NN density estimation rates. This may be of independent interest.
Suppose that is -Hölder continuous for some . Choose and . Then there exist constants depending on such that if satisfies
then with probability at least ,
uniformly for all with .
We have for such that if , then by Lemma 5 for chosen appropriately large and chosen appropriately small.
We now try to get a handle on and show it is sufficiently close to . Applying Lemmas 5 and 6 with and , appropriately chosen so that the conditions for the two lemmas hold (remember that here we have ), we obtain
where the last inequality holds when is chosen sufficiently large so that is sufficiently small. On the other hand, we similarly obtain (for and appropriately chosen):
It is now clear that by the expansion , and for chosen sufficiently large so that is sufficiently small, we have
The result now follows by combining this with the earlier established expression for , as desired. ∎
In intrinsicKnnNew, it is the case that ; under this setting, we match their bound with an error rate of with being the optimal choice for (ignoring log factors).
After determining , the next parameter we look at is . In particular, to obtain the optimal rate, we must choose without knowledge of . We present a consistent estimator for .
We need the following definitions. The first characterizes how much varies in balls of a certain radius along the boundaries of the -level set (where denotes the boundary of ). The second is meant to be an estimate of the first, which can be computed from the data alone. The final is our estimate of .
The next result shows how well estimates .
Suppose that is -Hölder continuous for some . Let and . Then there exist positive constants and depending on such that when , then the following holds with probability at least .
Suppose that the value of is attained at and the value of is attained at . Let be the points that maximize on and , respectively. Let be the sample points that maximize on and , respectively. Now, we have
Now let be the closest sample point to in . Then,
On the other hand, we have
Let be the closest sample point to in . Then,
Thus it suffices to bound . First take and use Lemmas 5 and 6 for . Using Lemma 3, we can show that . Next we bound . so we have guarantees on its value. Note that . Let . This implies that . Now since , we have . The same holds for the bounds related to . ∎
Theorem 4 ( in probability).
Suppose is -Hölder continuous for some with . Let and . Then for all ,
Based on the -regularity assumption, we have for :
Combining this with Lemma 9, we have with probability at least that
Thus with probability at least ,