1 Introduction
DBSCAN (Ester et al., 1996) is a popular density-based clustering algorithm which has had a wide impact on machine learning and data mining. Recent applications include superpixel segmentation Shen et al. (2016), object tracking and detection in self-driving Wagner et al. (2015); Guan et al. (2010), wireless networks Emadi and Mazinani (2018); Wang et al. (2019), GPS Diker and Nasibov (2012); Panahandeh and Åkerblom (2015), social network analysis Khatoon and Banu (2019); Yao et al. (2019), urban planning Fan et al. (2019); Pavlis et al. (2018), and medical imaging Tran et al. (2012); Baselice et al. (2015). The clusters that DBSCAN discovers are based on the connected components of the neighborhood graph of the data points of sufficiently high density (i.e. those with a sufficiently high number of data points in their neighborhood), where the neighborhood radius ε and the density threshold are the hyperparameters.
One of the main differences of density-based clustering algorithms such as DBSCAN compared to popular objective-based approaches such as k-means Arthur and Vassilvitskii (2006) and spectral clustering Von Luxburg (2007) is that density-based algorithms are nonparametric. As a result, DBSCAN makes very few assumptions on the data, automatically finds the number of clusters, and allows clusters to be of arbitrary shape and size Ester et al. (1996). However, one of its drawbacks is a worst-case quadratic runtime Gan and Tao (2015). With the continued growth of modern datasets in both size and richness, nonparametric unsupervised procedures are becoming ever more important in understanding such datasets. Thus, there is a critical need to establish more efficient and scalable versions of these algorithms.

The computation of DBSCAN can be broken up into two steps. The first is computing the neighborhood graph of the data points, where the neighborhood graph is defined with data points as vertices and edges between pairs of points that are distance at most ε apart. The second is processing the neighborhood graph to extract the clusters. The first step has worst-case quadratic complexity simply because each data point may have a linear number of points in its neighborhood for sufficiently large ε. However, even if the neighborhood graph does not have that many edges, computing this graph remains costly: for each data point, we must query for the neighbors in its ε-neighborhood, which takes worst-case linear time per point. There has been much work on using space-partitioning data structures such as KD-Trees (Bentley, 1975) to improve neighborhood queries, but these methods still run in linear time in the worst case. Approximate methods (e.g. Indyk and Motwani (1998); Datar et al. (2004)) answer queries in sublinear time, but such methods come with few guarantees. The second step is processing the neighborhood graph to extract the clusters, which consists of finding the connected components of the subgraph induced by nodes with degree above a certain threshold (i.e. the MinPts hyperparameter in the original DBSCAN Ester et al. (1996)). This step is linear in the number of edges in the neighborhood graph.
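To make the second step concrete, here is a minimal Python sketch (ours, for illustration; not the original implementation) of extracting clusters from a given neighborhood graph represented as an adjacency dict:

```python
from collections import deque

def extract_clusters(adj, min_pts):
    """Given the neighborhood graph as {point: set of neighbors},
    label core points (degree >= min_pts), find connected components
    of the core points via BFS, and attach each border point to an
    adjacent component; everything else is noise (label -1)."""
    core = {u for u, nbrs in adj.items() if len(nbrs) >= min_pts}
    labels = {u: -1 for u in adj}
    cluster_id = 0
    for u in core:
        if labels[u] != -1:
            continue
        # BFS restricted to core points gives one connected component.
        queue = deque([u])
        labels[u] = cluster_id
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w in core and labels[w] == -1:
                    labels[w] = cluster_id
                    queue.append(w)
        cluster_id += 1
    # Border points: non-core points adjacent to some core point.
    for u in adj:
        if u not in core:
            for w in adj[u]:
                if w in core:
                    labels[u] = labels[w]
                    break
    return labels
```

The work here is linear in the number of edges, which is why shrinking the graph in the first step pays off in the second as well.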
Our proposal is based on a simple but powerful insight: the full neighborhood graph may not be necessary to extract the desired clustering. We show that we can subsample the edges of the neighborhood graph while still preserving the connected components of the core-points (the high-density points) on which DBSCAN’s clusters are based.
To analyze this idea, we assume that the points are sampled from a distribution defined by a density function satisfying certain standard conditions Jiang (2017); Steinwart (2011) (e.g., cluster density is sufficiently high, and clusters do not become arbitrarily thin). Such an assumption is natural because DBSCAN recovers the high-density regions as clusters Ester et al. (1996); Jiang (2017). Under this assumption we show that the minimum cut of the neighborhood graph is as large as Ω(n), where n is the number of data points. This, combined with a sampling lemma by Karger (1999), implies that we can sample as little as an O(log n / n) fraction of the edges uniformly while preserving the connected components of the neighborhood graph exactly. Our algorithm, SNG-DBSCAN, proceeds by constructing and processing this subsampled neighborhood graph, all in O(n log n) time. Moreover, our procedure only requires access to similarity queries for random pairs of points (adding an edge between pairs if they are at most ε apart). Thus, unlike most implementations of DBSCAN, which take advantage of space-partitioning data structures, we don’t require the embeddings of the data points themselves. In particular, our method is compatible with arbitrary similarity functions instead of being restricted to a handful of distance metrics such as the Euclidean.
We provide an extensive empirical analysis showing that SNG-DBSCAN is effective on real datasets. We show on large datasets (on the order of a million data points) that we can subsample as little as 0.1% of the neighborhood graph and attain performance competitive with scikit-learn’s implementation of DBSCAN while consuming far fewer resources – as much as a 200x speedup and 250x less RAM consumption on cloud machines with up to 750GB of RAM. In fact, for larger settings of ε on these datasets, DBSCAN fails to run at all due to insufficient RAM. We also show that our method is effective even on smaller datasets. Sampling only a small fraction of the edges (the rate depending on the dataset), SNG-DBSCAN shows a substantial improvement in runtime while still maintaining competitive clustering performance.
2 Related Work
There is a large body of work on making DBSCAN more scalable; due to space we can only mention some of it here. The first approach is to more efficiently perform the nearest neighbor queries that DBSCAN uses when constructing the neighborhood graph, either explicitly or implicitly Huang and Bian (2009); Kumar and Reddy (2016). However, while these methods do speed up the computation of the neighborhood graph, they may not save memory costs overall, as the number of edges remains similar. Our method of subsampling the neighborhood graph brings both memory and computational savings.
A natural idea for speeding up the construction of the nearest neighbor graph is to compute it approximately, for example by using locality-sensitive hashing (LSH) Indyk et al. (1997), thus improving the overall running time Shiqiu and Qingsheng (2019); Wu et al. (2007). At the same time, since the resulting nearest neighbor graph is incomplete, the current state-of-the-art LSH-based methods lack any guarantees on the quality of the solutions they produce. This is in sharp contrast with SNG-DBSCAN, which, under certain assumptions, gives exactly the same result as DBSCAN.
Another approach is to first find a set of "leader" points that preserve the structure of the original dataset, cluster those "leader" points, and then cluster the remaining points to these "leader" points Geng et al. (2000); Viswanath and Babu (2009). Liu (2006) modified DBSCAN by selecting clustering seeds among unlabeled core points to reduce computation time in regions that have already been clustered. Other heuristics include (Patwary et al., 2012; Kryszkiewicz and Lasek, 2010). More recently, Jang and Jiang (2019) pointed out that it’s not necessary to compute the density estimates for each of the data points and presented a method that chooses a subsample of the data using k-centers to save runtime on density computations. These approaches all reduce the number of data points on which we need to perform the expensive neighborhood graph computation. SNG-DBSCAN preserves all of the data points but subsamples the edges instead.

There are also a number of approaches based on leveraging parallel computing Chen et al. (2010); Patwary et al. (2012); Götz et al. (2015), including MapReduce-based approaches Fu et al. (2011); He et al. (2011); Dai and Lin (2012); Noticewala and Vaghela (2014). There are also distributed approaches to DBSCAN where data is partitioned across different locations and there may be communication cost constraints Neto et al. (2015); Lulli et al. (2016). Andrade et al. (2013) provide a GPU implementation. In this paper, we assume a single processor, although our method can be implemented in parallel, which is a possible future research direction.
3 Algorithm
We now introduce our algorithm, SNG-DBSCAN (Algorithm 1). It proceeds by first constructing the sampled neighborhood graph given a sampling rate s. We do this by initializing a graph whose vertices are the data points, sampling a fraction s of all pairs of data points, and adding the corresponding edges to the graph if the points are at most ε apart. DBSCAN, by comparison, computes the full neighborhood graph, typically using space-partitioning data structures such as KD-Trees, whereas SNG-DBSCAN can be seen as using a sampled version of brute force (which examines each pair of points to compute the graph). Despite not leveraging space-partitioning data structures, we show in the experiments that SNG-DBSCAN can still be much more scalable in both runtime and memory than DBSCAN.
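The construction step can be sketched as follows (a minimal Python illustration, with our own hypothetical names; the actual implementation is in Cython). Note that only similarity queries on sampled pairs are needed:

```python
import random
from itertools import combinations

def sampled_neighborhood_graph(points, eps, s, dist, seed=0):
    """Examine each pair of points with probability s and add an edge
    when the pair is within distance eps. Only similarity queries are
    issued: no embeddings or space-partitioning structures required."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(len(points))}
    for i, j in combinations(range(len(points)), 2):
        if rng.random() < s and dist(points[i], points[j]) <= eps:
            adj[i].add(j)
            adj[j].add(i)
    return adj
```

With s = 1 this reduces to the brute-force ε-neighborhood graph; in general the expected work is O(sn²) similarity queries.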
The remaining stage of processing the graph is the same as in DBSCAN: we find the core-points (points with degree at least MinPts), compute the connected components induced by the core-points, and then cluster the border-points (those within ε of a connected component). The remaining points are not clustered and are referred to as noise points or outliers. The entire procedure runs in O(sn²) time. We will show in the theory that s can be of order log n / n while still maintaining statistical consistency guarantees of recovering the true clusters under certain density assumptions. Under this setting, the procedure runs in O(n log n) time, and we show on simulations in Figure 1 that such a setting for s is sufficient.

4 Theoretical Analysis
For the analysis, we assume that we draw n i.i.d. samples from a distribution. The goal is to show that the sampled version of DBSCAN (Algorithm 1) can recover the true clusters with statistical guarantees, where the true clusters are the connected components of a particular upper-level set of the underlying density function, as it is known that DBSCAN recovers such connected components of a level set Jiang (2017); Wang et al. (2017); Sriperumbudur and Steinwart (2012).
We make the following Assumption 1 on the underlying distribution. The assumption is threefold. The first part ensures that the true clusters are of sufficiently high density and the noise areas are of sufficiently low density. The second part ensures that the true clusters are pairwise separated by a sufficiently wide distance and that they are also separated from the noise regions; such an assumption is common in analyses of cluster trees (e.g. Chaudhuri and Dasgupta (2010); Chaudhuri et al. (2014); Jiang (2017)). Finally, the last part ensures that the clusters don’t become arbitrarily thin anywhere; otherwise it would be difficult to show that the entire cluster will be recovered as one connected component. This condition has been used in other analyses of level-set estimation Singh et al. (2009).
Assumption 1.
Data is drawn from a distribution 𝒫 on ℝ^D with density function f. There exist density levels λ ≥ λ_N ≥ 0, compact connected sets C_1, …, C_ℓ (the clusters), and a subset N ⊂ ℝ^D (the noise region) such that

f(x) ≥ λ for all x ∈ C_1 ∪ ⋯ ∪ C_ℓ (i.e. clusters are high-density), f(x) ≤ λ_N for all x ∈ N (i.e. outlier regions are low-density), f = 0 everywhere else, and λ is sufficiently larger than λ_N (the cluster density is sufficiently higher than the noise density).

The clusters are pairwise separated by a sufficiently wide distance (i.e. clusters are separated), and the noise region N is at least that distance away from every cluster (i.e. outlier regions are away from the clusters).

For all x ∈ C_i and 0 < r ≤ r_0, we have Vol(B(x, r) ∩ C_i) ≥ c · v_D · r^D, where c > 0, B(x, r) is the ball of radius r centered at x, and v_D is the volume of a unit D-dimensional ball (i.e. clusters don’t become arbitrarily thin).
The analysis proceeds in three steps:

We give a lower bound on the mincut of the subgraph of the neighborhood graph corresponding to each cluster. This will be useful later, as it determines the sampling rate we can use while still ensuring these subgraphs remain connected. (Lemma 1)

We use standard concentration inequalities to show that if MinPts is set appropriately, we will with high probability determine which samples belong to clusters and which ones don’t in the sampled graph. (Lemma 2)

We apply a classical edge-sampling result to show that, at a sufficient sampling rate, each cluster’s subgraph remains connected after subsampling, which together with the previous steps yields the main consistency guarantee. (Lemma 3, Theorem 1)
We now give the lower bound on the mincut of the subgraph of the neighborhood graph corresponding to each cluster. This will be useful later as it determines the sampling rate we can use while still ensuring these subgraphs remain connected. As a reminder, for a graph G = (V, E) and S ⊆ V, the size of the cut-set of S is the number of edges with exactly one endpoint in S, and the size of the mincut of G is the smallest cut-set size over proper nonempty subsets S ⊊ V.
Lemma 1 (Lower bound on mincut of neighborhood graph of core-points).
Suppose that Assumption 1 holds and ε is sufficiently small relative to the separation and thickness parameters. Let G be the ε-neighborhood graph of the sample, and let G_i be the subgraph of G with nodes in C_i. Then there exists a constant c > 0 depending only on the distribution and ε such that, for n sufficiently large, with probability at least 1 − 1/n, for each i ∈ {1, …, ℓ} the mincut of G_i is at least c · n.
We next show that if MinPts is set appropriately, we will with high probability determine which samples belong to clusters and which ones don’t in the sampled graph.
Lemma 2.
Suppose that Assumption 1 holds and ε is sufficiently small. There exists a universal constant C such that the following holds. Let the sampling rate be s, and suppose s · n ≥ C · log n and MinPts is chosen between the sampled-degree levels of noise points and of cluster points (i.e. above the high-probability sampled degree of any noise point and below that of any cluster point). Then with probability at least 1 − 1/n, all samples in C_i for some i ∈ {1, …, ℓ} are identified as core-points and the rest are noise points.
The next result shows a rate at which we can sample a graph while still having it be connected, which depends on the mincut and the size of the graph. It follows from classical results in graph theory showing that cut sizes are preserved under sampling Karger (1999).
Lemma 3.
There exists a universal constant C such that the following holds. Let G be a graph on n vertices with mincut c. If p ≥ C · log n / c, then with high probability the graph obtained by sampling each edge of G with probability p is connected.
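Lemma 3 is easy to probe numerically. The sketch below (illustrative only; a complete graph K_n stands in for a dense cluster subgraph, with mincut n − 1) samples edges at rate p proportional to log n / mincut and checks connectivity with union-find:

```python
import math
import random

def sample_edges(edges, p, rng):
    """Keep each edge independently with probability p."""
    return [e for e in edges if rng.random() < p]

def is_connected(n, edges):
    """Union-find connectivity check on vertices 0..n-1."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(v) for v in range(n)}) == 1

rng = random.Random(0)
n = 200
edges = [(i, j) for i in range(n) for j in range(i + 1, n)]  # K_n, mincut n-1
p = 3 * math.log(n) / (n - 1)  # rate suggested by Lemma 3 (with C = 3 here)
kept = sample_edges(edges, p, rng)
print(len(kept), "of", len(edges), "edges kept; connected:", is_connected(n, kept))
```

Even though only a small fraction of the edges survives, the sampled graph typically remains connected, which is exactly the behavior the lemma quantifies.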
Theorem 1.
Suppose that Assumption 1 holds and ε is sufficiently small. There exist a universal constant C and a constant c depending only on the distribution and ε such that the following holds. Suppose the sampling rate satisfies s ≥ C · log n / (c · n) and MinPts is set as in Lemma 2. Then with probability at least 1 − O(1/n), Algorithm 1 returns (up to permutation) the clusters C_1, …, C_ℓ.
Remark 1.
As a consequence, we can take s = Θ(log n / n) and, for n sufficiently large, the clusters are recovered with high probability, leading to a computational complexity of O(n log n) for Algorithm 1.
5 Experiments
Datasets and hyperparameter settings: We compare the performance of SNG-DBSCAN against DBSCAN on 5 large (~1,000,000+ data points) and 12 smaller (~100 to ~100,000 data points) datasets from UCI Dua and Graff (2017) and OpenML Vanschoren et al. (2013). Details about each large dataset shown in the main text are summarized in Figure 2, and the rest, including all hyperparameter settings, can be found in the Appendix; due to space constraints, we couldn’t show the results for all of the datasets in the main text. The settings of MinPts and the range of ε that we ran SNG-DBSCAN and DBSCAN on, as well as how the sampling rate s was chosen, are shown in the Appendix. For simplicity we fixed MinPts and only tuned ε, which is the more essential hyperparameter. We compare our implementation of SNG-DBSCAN to scikit-learn’s DBSCAN Pedregosa et al. (2011), both of which are implemented with Cython, which allows the code to have a Python API while the expensive computations are done in a C/C++ backend.
Clustering evaluation: To score cluster quality, we use two popular clustering scores: the Adjusted Rand Index (ARI) Hubert and Arabie (1985) and Adjusted Mutual Information (AMI) Vinh et al. (2010). ARI measures the fraction of pairs of points that are correctly clustered to the same or different clusters. AMI is an information-theoretic measure of agreement between the clusters and the ground truth. Both are normalized and adjusted for chance, so that a perfect clustering receives a score of 1 and a random one receives a score of 0 in expectation. The datasets we use are classification datasets where we cluster on the features and use the labels to evaluate our algorithm’s clustering performance. It is standard practice to evaluate performance using the labels as a proxy for the ground truth Jiang et al. (2018); Jang and Jiang (2019).
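For concreteness, ARI can be computed directly from the contingency table of the two labelings. The following stdlib sketch (ours, for illustration; standard library implementations exist) shows the pair-counting and the adjustment for chance:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI: pair-counting agreement between two labelings, normalized
    so a perfect match scores 1 and a random labeling scores 0 in
    expectation."""
    n = len(labels_true)
    pairs = Counter(zip(labels_true, labels_pred))  # contingency table
    a = Counter(labels_true)   # row sums
    b = Counter(labels_pred)   # column sums
    index = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)   # chance level
    max_index = (sum_a + sum_b) / 2
    return (index - expected) / (max_index - expected)
```

Note that the score is invariant to permuting cluster labels, which is why label-permutation between a clustering and the classification labels is not an issue.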
5.1 Performance on Large Datasets
Dataset  n  D  # classes  s  MinPts

Australian  1,000,000  14  2  0.001  10
Still Stisen et al. (2015)  949,983  3  6  0.001  10
Watch Stisen et al. (2015)  3,205,431  3  7  0.01  10
Satimage  1,000,000  36  6  0.02  10
Phone Stisen et al. (2015)  1,000,000  3  7  0.001  10
We now discuss the performance on the large datasets, which we ran in a cloud compute environment. Due to computational costs, we ran DBSCAN only once for each setting. We ran SNG-DBSCAN 10 times for each setting and averaged the clustering scores and runtimes. The results are summarized in Figure 3, which reports the highest clustering scores for the respective algorithms along with the runtime and RAM usage required to achieve these scores. We also plot the clustering performance and runtime/RAM usage across different settings of ε in Figure 4.
Base ARI  SNG ARI  Base AMI  SNG AMI  

Australian  0.0933 (6h 4m)  0.0933 (1m 36s)  0.1168 (4h 28m)  0.1167 (1m 36s) 
282 GB  1.5 GB  146 GB  1.5 GB  
Still  0.7901 (14m 37s)  0.7902 (1m 21s)  0.8356 (14m 37s)  0.8368 (1m 57s) 
419 GB  1.6 GB  419 GB  1.5 GB  
Watch  0.1360 (27m 2s)  0.1400 (1h 27m)  0.1755 (7m 35s)  0.1851 (1h 29m) 
518 GB  8.7 GB  139 GB  7.6 GB  
Satimage  0.0975 (1d 3h)  0.0981 (33m 22s)  0.1019 (1d 3h)  0.1058 (33m 22s) 
11 GB  2.1 GB  11 GB  2.1 GB  
Phone  0.1902 (12m 36s)  0.1923 (59s)  0.2271 (4m 48s)  0.2344 (46s) 
138 GB  1.7 GB  32 GB  1.5 GB 
From Figure 4, we see that DBSCAN often exceeds the 750GB RAM limit on our machines and fails to complete; the amount of memory (as well as runtime) required for DBSCAN escalates quickly as ε increases, suggesting that the size of the neighborhood graph grows quickly. Meanwhile, SNG-DBSCAN’s memory usage remains reasonable and increases slowly across ε. This suggests that SNG-DBSCAN can run on much larger datasets that are infeasible for DBSCAN, opening up applications which may have previously not been possible due to scalability constraints – all while attaining competitive clustering quality (Figure 3).
Similarly, SNG-DBSCAN shows a significant runtime improvement for almost all datasets, and its runtime stays relatively constant across ε. We note that occasionally (e.g. the Watch dataset for small ε), SNG-DBSCAN is slower. This is likely because SNG-DBSCAN does not take advantage of space-partitioning data structures such as KD-Trees. This is more likely on lower-dimensional datasets, where space-partitioning data structures tend to perform faster since the number of partitions is exponential in dimension Bentley (1975). Conversely, we see the largest speedup on Australian, which is the highest-dimensional large dataset we evaluated. However, DBSCAN was unable to finish clustering Watch past the first few values of ε due to exceeding memory limits. During preliminary experimentation, not shown in the results, we tested DBSCAN on the UCI Character Font Images dataset, which has dimension 411 and 745,000 data points. DBSCAN failed to finish after running on the cloud machine for over 6 days, while SNG-DBSCAN was able to finish with the same settings in under 15 minutes. We don’t show those results here because we were unable to obtain clustering scores for DBSCAN.
5.2 Performance on Smaller Datasets
We now show that we don’t require large datasets to enjoy the advantages of SNG-DBSCAN: speedups are attainable for even the smallest datasets without sacrificing clustering quality. Due to space constraints, we provide the summary of the datasets and hyperparameter setting details in the Appendix. In Figure 5, we show performance metrics for some datasets under optimal tuning. The results for the rest of the datasets are in the Appendix, where we also provide charts showing performance across ε and different settings of the sampling rate s to better understand the effect of the sampling rate on cluster quality; there we show that SNG-DBSCAN is stable in this hyperparameter. Overall, we see that SNG-DBSCAN can give considerable savings in computational costs while remaining competitive in clustering quality on smaller datasets.
Base ARI  SNG ARI  Base AMI  SNG AMI  

Wearable  0.2626 (55s)  0.3064 (8.2s)  0.4788 (26s)  0.4720 (6.6s) 
Iris  0.5681 (<0.01s)  0.5681 (<0.01s)  0.7316 (<0.01s)  0.7316 (<0.01s) 
LIBRAS  0.0713 (0.03s)  0.0903 (<0.01s)  0.2711 (0.02s)  0.3178 (<0.01s) 
Page Blocks  0.1118 (0.38s)  0.1134 (0.06s)  0.0742 (0.49s)  0.0739 (0.06s) 
kc2  0.3729 (<0.01s)  0.3733 (<0.01s)  0.1772 (<0.01s)  0.1671 (<0.01s) 
Faces  0.0345 (1.7s)  0.0409 (0.16s)  0.2399 (1.7s)  0.2781 (0.15s) 
Ozone  0.0391 (<0.01s)  0.0494 (<0.01s)  0.1214 (<0.01s)  0.1278 (<0.01s) 
Bank  0.1948 (3.4s)  0.2265 (0.03s)  0.0721 (3.3s)  0.0858 (0.03s) 
5.3 Performance against DBSCAN++
We now compare the performance of SNG-DBSCAN against a recent DBSCAN speedup called DBSCAN++ Jang and Jiang (2019), which proceeds by performing k-centers to select candidate points, computing densities for these candidate points, using these densities to identify a subset of core-points, and finally clustering the dataset based on the neighborhood graph of these core-points. Like our method, DBSCAN++ also has subquadratic runtime. SNG-DBSCAN can be seen as subsampling the edges while DBSCAN++ subsamples the vertices.
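The k-centers step that DBSCAN++ relies on is typically the classic greedy farthest-first traversal; a minimal sketch (ours, for illustration; not the DBSCAN++ authors’ code):

```python
def greedy_k_centers(points, k, dist):
    """Farthest-first traversal: repeatedly pick the point farthest
    from the centers chosen so far (a 2-approximation for k-centers)."""
    centers = [0]  # start from an arbitrary point
    d = [dist(points[0], p) for p in points]  # distance to nearest center
    while len(centers) < k:
        far = max(range(len(points)), key=lambda i: d[i])
        centers.append(far)
        for i, p in enumerate(points):
            d[i] = min(d[i], dist(points[far], p))
    return centers
```

The contrast with SNG-DBSCAN is that the expensive density computations are then restricted to these candidate points, whereas SNG-DBSCAN keeps every point and thins the edge set instead.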
We show comparative results on our large datasets in Figure 6. We ran DBSCAN++ with the same sampling rate as SNG-DBSCAN. We see that SNG-DBSCAN is competitive and beats DBSCAN++ on 70% of the metrics. Occasionally, DBSCAN++ fails to produce reasonable clusters with the given settings, as shown by the low scores for some datasets. We see that SNG-DBSCAN is slower than DBSCAN++ for the low-dimensional datasets (i.e. Phone, Watch, and Still, which are all of dimension 3), while there is a considerable speedup on Australian. As in the experiments against DBSCAN on the large datasets, this is because DBSCAN++, like DBSCAN, leverages space-partitioning data structures such as KD-Trees, which are faster in low dimensions. Overall, we see that SNG-DBSCAN is a better choice than DBSCAN++ when using the same sampling rate.
DBSCAN++ ARI  SNG ARI  DBSCAN++ AMI  SNG AMI  

Australian  0.1190 (1185s)  0.0933 (96s)  0.1166 (1178s)  0.1167 (96s) 
1.6 GB  1.5 GB  1.6 GB  1.5 GB  
Satimage  0.0176 (5891s)  0.0981 (2002s)  0.1699 (4842s)  0.1058 (2002s) 
2.3 GB  2.1 GB  2.3 GB  2.1 GB  
Phone  0.0001 (11s)  0.1923 (59s)  0.0023 (11s)  0.2344 (46s) 
1.0 GB  1.7 GB  1.0 GB  1.5 GB  
Watch  0.0000 (973s)  0.1400 (5230s)  0.0025 (1017s)  0.1851 (5371s) 
1.4 GB  8.7 GB  1.4 GB  7.6 GB  
Still  0.7900 (12s)  0.7902 (81s)  0.8370 (15s)  0.8368 (117s) 
1.4 GB  1.6 GB  1.4 GB  1.5 GB 
6 Conclusion
Density clustering has had a profound impact on machine learning and data mining; however, it can be very expensive to run on large datasets. We showed that the simple idea of subsampling the neighborhood graph leads to a procedure, which we call SNG-DBSCAN, that runs in O(sn²) time and comes with statistical consistency guarantees. We showed empirically on a variety of real datasets that SNG-DBSCAN can offer tremendous savings in computational resources while maintaining clustering quality. Future research directions include using adaptive instead of uniform sampling of edges, combining SNG-DBSCAN with other speedup techniques, and parallelizing the procedure for even faster computation.
Broader Impact
As stated in the introduction, DBSCAN has a wide range of applications within machine learning and data mining. Our contribution is a more efficient variant of DBSCAN. The potential impact of SNG-DBSCAN lies in considerable savings in computational resources and in enabling applications of density clustering which weren’t possible before due to scalability constraints.
References
 GDBSCAN: a GPU accelerated algorithm for density-based clustering. Procedia Computer Science 18, pp. 369–378. Cited by: §2.
 K-means++: the advantages of careful seeding. Technical report, Stanford. Cited by: §1.

 A DBSCAN based approach for jointly segment and classify brain MR images. In 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 2993–2996. Cited by: §1.
 Multidimensional binary search trees used for associative searching. Communications of the ACM 18 (9), pp. 509–517. Cited by: §1, §5.1.
 Consistent procedures for cluster tree estimation and pruning. IEEE Transactions on Information Theory 60 (12), pp. 7900–7912. Cited by: §4.
 Rates of convergence for the cluster tree. In Advances in Neural Information Processing Systems, pp. 343–351. Cited by: §4, Lemma 4.
 Parallel dbscan with priority rtree. In Information Management and Engineering (ICIME), 2010 The 2nd IEEE International Conference on, pp. 508–511. Cited by: §2.
 Efficient map/reducebased dbscan algorithm with optimized data partition. In 2012 IEEE Fifth International Conference on Cloud Computing, pp. 59–66. Cited by: §2.
 Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry, pp. 253–262. Cited by: §1.
 Estimation of traffic congestion level via fndbscan algorithm by using gps data. In 2012 IV International Conference" Problems of Cybernetics and Informatics"(PCI), pp. 1–4. Cited by: §1.
 UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: §5.

 A novel anomaly detection algorithm using DBSCAN and SVM in wireless sensor networks. Wireless Personal Communications 98 (2), pp. 2025–2035. Cited by: §1.
 A density-based algorithm for discovering clusters in large spatial databases with noise. In Kdd, Vol. 96, pp. 226–231. Cited by: §1, §1, §1, §1.
 Consumer clusters detection with geotagged social network data using dbscan algorithm: a case study of the pearl river delta in china. GeoJournal, pp. 1–21. Cited by: §1.
 Research on parallel dbscan algorithm design based on mapreduce. In Advanced Materials Research, Vol. 301, pp. 1133–1138. Cited by: §2.
 DBSCAN revisited: misclaim, unfixability, and approximation. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 519–530. Cited by: §1.
 A fast density based clustering algorithm [j]. Journal of Computer Research and Development 11, pp. 001. Cited by: §2.
 HPDBSCAN: highly parallel dbscan. In Proceedings of the workshop on machine learning in highperformance computing environments, pp. 2. Cited by: §2.
 Improved dbscan clustering algorithm based vehicle detection using a vehiclemounted laser scanner. Transactions of Beijing Institute of Technology 30 (6), pp. 732–736. Cited by: §1.
 Mrdbscan: an efficient parallel densitybased clustering algorithm using mapreduce. In Parallel and Distributed Systems (ICPADS), 2011 IEEE 17th International Conference on, pp. 473–480. Cited by: §2.
 A grid and density based fast spatial clustering algorithm. In Artificial Intelligence and Computational Intelligence, 2009. AICI’09. International Conference on, Vol. 4, pp. 260–263. Cited by: §2.
 Comparing partitions. Journal of classification 2 (1), pp. 193–218. Cited by: §5.

 Locality-preserving hashing in multidimensional spaces. In Proceedings of the twenty-ninth annual ACM symposium on Theory of computing, pp. 618–625. Cited by: §2.
 Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pp. 604–613. Cited by: §1.
 DBSCAN++: towards fast and scalable density clustering. In International Conference on Machine Learning, pp. 3019–3029. Cited by: §2, Figure 6, §5.3, §5.
 Quickshift++: provably good initializations for samplebased mean shift. In 35th International Conference on Machine Learning, ICML 2018, pp. 3591–3600. Cited by: §5.
 Density level set estimation on manifolds with dbscan. In International Conference on Machine Learning, pp. 1684–1693. Cited by: §1, §4, §4.
 Random sampling in cut, flow, and network design problems. Mathematics of Operations Research 24 (2), pp. 383–413. Cited by: Appendix A, §1, §4.
 An efficient method to detect communities in social networks using dbscan algorithm. Social Network Analysis and Mining 9 (1), pp. 9. Cited by: §1.
 TIdbscan: clustering with dbscan by means of the triangle inequality. In International Conference on Rough Sets and Current Trends in Computing, pp. 60–69. Cited by: §2.
 A fast dbscan clustering algorithm by accelerating neighbor searching using groups method. Pattern Recognition 58, pp. 39–48. Cited by: §2.
 A fast densitybased clustering algorithm for large databases. In Machine Learning and Cybernetics, 2006 International Conference on, pp. 996–1000. Cited by: §2.
 NGdbscan: scalable densitybased clustering for arbitrary data. Proceedings of the VLDB Endowment 10 (3), pp. 157–168. Cited by: §2.
 G2P: a partitioning approach for processing dbscan with mapreduce. In International Symposium on Web and Wireless Geographical Information Systems, pp. 191–202. Cited by: §2.
 MRidbscan: efficient parallel incremental dbscan algorithm using mapreduce. International Journal of Computer Applications 93 (4). Cited by: §2.
 Clustering driving destinations using a modified dbscan algorithm with locallydefined mapbased thresholds. In European Congress on Computational Methods in Applied Sciences and Engineering, pp. 97–103. Cited by: §1.
 A new scalable parallel dbscan algorithm using the disjointset data structure. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 62. Cited by: §2, §2.
 A modified dbscan clustering method to estimate retail center extent. Geographical Analysis 50 (2), pp. 141–161. Cited by: §1.
 Scikitlearn: machine learning in python. the Journal of machine Learning research 12, pp. 2825–2830. Cited by: §5.
 Realtime superpixel segmentation by dbscan clustering algorithm. IEEE transactions on image processing 25 (12), pp. 5933–5942. Cited by: §1.
 DBSCAN clustering algorithm based on locality sensitive hashing. In Journal of Physics: Conference Series, Vol. 1314, pp. 012177. Cited by: §2.
 The promise repository of software engineering databases. school of information technology and engineering, university of ottawa, canada. Cited by: Figure 7.
 Vehicle recognition using rule based methods. Cited by: Figure 7.
 Adaptive hausdorff estimation of density level sets. The Annals of Statistics 37 (5B), pp. 2760–2782. Cited by: §4.
 Consistency and rates for clustering with dbscan. In Artificial Intelligence and Statistics, pp. 1090–1098. Cited by: §4.
 Adaptive density level set clustering. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 703–738. Cited by: §1.
 Smart devices are different: assessing and mitigating mobile sensing heterogeneities for activity recognition. In Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems, pp. 127–140. Cited by: Figure 2.
 A densitybased segmentation for 3d images, an application for xray microtomography. Analytica chimica acta 725, pp. 14–21. Cited by: §1.
 Wearable computing: accelerometers’ data classification of body postures and movements. In Brazilian Symposium on Artificial Intelligence, pp. 52–61. Cited by: Figure 7.
 OpenML: networked science in machine learning. SIGKDD Explorations 15 (2), pp. 49–60. External Links: Link, Document Cited by: §5.
 Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. Journal of Machine Learning Research 11 (Oct), pp. 2837–2854. Cited by: §5.
 Roughdbscan: a fast hybrid density based clustering method for large data sets. Pattern Recognition Letters 30 (16), pp. 1477–1488. Cited by: §2.
 A tutorial on spectral clustering. Statistics and computing 17 (4), pp. 395–416. Cited by: §1.
 Modification of dbscan and application to range/doppler/doa measurements for pedestrian recognition with an automotive radar system. In 2015 European Radar Conference (EuRAD), pp. 269–272. Cited by: §1.

 Optimal rates for cluster tree estimation using kernel density estimators. arXiv preprint arXiv:1706.03113. Cited by: §4.
 Learning to improve WLAN indoor positioning accuracy based on DBSCAN-KRF algorithm from RSS fingerprint data. IEEE Access 7, pp. 72308–72315. Cited by: §1.
 A linear DBSCAN algorithm based on LSH. In 2007 International Conference on Machine Learning and Cybernetics, Vol. 5, pp. 2608–2614. Cited by: §2.
 Density-based community detection in geo-social networks. In Proceedings of the 16th International Symposium on Spatial and Temporal Databases, pp. 110–119. Cited by: §1.
Appendix A Proofs
Let be the empirical distribution w.r.t. . We need the following result giving uniform guarantees on the masses of empirical balls with respect to the mass of true balls w.r.t. .
Lemma 4 (Chaudhuri and Dasgupta (2010)).
Pick . Then with probability at least , for every ball and , we have
where .
Proof of Lemma 1.
Suppose that is a proper subset of the nodes in . Let be the complement of in . We lower bound the cut of and in .
Let . Then for any , we have:
Thus, by Lemma 4, there exists a sample point in . It follows that forms a cover of (i.e., ). Since is connected, there exist and such that and intersect, and thus .
The cut of and in thus contains all edges from to and from to , which is at least the number of nodes in . We have
We have
which holds for sufficiently large as in the statement of the Lemma for some . By Lemma 4, we have
which also holds for sufficiently large as in the statement of the Lemma for some . The result follows. ∎
Proof of Lemma 2.
Let be the neighbors of in the sampled neighborhood graph. Suppose that for some . Then by Hoeffding’s inequality, we have
Thus, with probability at least , we have
Now, suppose that . Then by Hoeffding’s inequality, we have
Thus, with probability at least , we have
Hence, the following holds uniformly with probability at least . When for some , we have
and otherwise we have
The result follows because minPts is the threshold on that decides whether a point is a core point. There are no border points because and there are no points within of for . ∎
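The concentration argument in the proof of Lemma 2 can be illustrated numerically: if each edge of a point's neighborhood is retained independently with probability s, the sampled degree is binomial and, by Hoeffding's inequality, its rescaled value concentrates around the true degree, so thresholding at minPts recovers the core-point decision with high probability. A minimal simulation (the sampling rate s, degree, and trial counts below are illustrative choices, not values from the paper):

```python
import random

def sampled_degree(true_degree: int, s: float, rng: random.Random) -> int:
    # Keep each of the true_degree neighborhood edges independently with prob. s.
    return sum(1 for _ in range(true_degree) if rng.random() < s)

rng = random.Random(0)
s = 0.01
true_degree = 10_000  # hypothetical dense point
trials = [sampled_degree(true_degree, s, rng) for _ in range(100)]
avg = sum(trials) / len(trials)
# By Hoeffding's inequality, avg / s concentrates around true_degree.
print(round(avg / s))
```

Rescaling the sampled degree by 1/s gives an unbiased estimate of the true degree, which is why a fixed threshold minPts on the sampled degree suffices.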
Appendix B Additional Experiments
Dataset  n  d  # clusters  s  minPts  Range of ε

Australian  1,000,000  14  2  0.001  10  [0.2, 2.5) 
Still  949,983  3  6  0.001  10  [0.1, 13.5) 
Watch  3,205,431  3  7  0.01  10  [0.001, 0.05) 
Satimage  1,000,000  36  6  0.02  10  [1, 6) 
Phone  1,000,000  3  7  0.001  10  [0.001, 0.06) 
Wearable Ugulino et al. (2012)  165,633  16  5  0.01  10  [1, 80) 
Iris  150  4  3  0.3  10  [0.1, 2.2) 
LIBRAS  360  90  15  0.3  10  [0.5, 1.6) 
Page Blocks  5,473  10  5  0.1  10  [1, 10,000) 
kc2 Shirabad and Menzies (2005)  522  21  2  0.3  10  [50, 7,000) 
Faces  400  4,096  40  0.3  10  [6, 10) 
Ozone  330  9  35  0.3  10  [100, 800) 
Bank  8,192  32  10  0.01  10  [3.7, 7.4) 
Ionosphere  351  33  2  0.3  10  [1, 5.5) 
Mozilla  15,545  5  2  0.03  2  [1, 7,000) 
Tokyo  959  44  2  0.1  2  [10K, 1,802K) 
Vehicle Siebert (1987)  846  18  4  0.3  10  [10, 40) 
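For concreteness, the two hyperparameters swept in the table above (minPts and the range of ε) play the roles shown in the following minimal brute-force DBSCAN sketch. This is the standard quadratic-time baseline on invented toy data, not the subsampled algorithm evaluated in the experiments:

```python
from collections import deque

def dbscan(points, eps, min_pts):
    """Label points with cluster ids 0, 1, ... or -1 for noise (brute-force O(n^2))."""
    n = len(points)
    dist2 = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    # Neighborhood graph: i ~ j iff distance at most eps (self included in count).
    nbrs = [[j for j in range(n) if dist2(points[i], points[j]) <= eps * eps]
            for i in range(n)]
    core = [len(nbrs[i]) >= min_pts for i in range(n)]
    labels = [-1] * n
    cid = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        # BFS over core points; border points inherit the cluster label but do not expand it.
        queue = deque([i])
        labels[i] = cid
        while queue:
            p = queue.popleft()
            for q in nbrs[p]:
                if labels[q] == -1:
                    labels[q] = cid
                    if core[q]:
                        queue.append(q)
        cid += 1
    return labels

# Two well-separated blobs plus one outlier.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
       (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
       (10.0, 10.0)]
print(dbscan(pts, eps=0.5, min_pts=3))  # -> [0, 0, 0, 1, 1, 1, -1]
```

Raising minPts above the blob size (e.g. min_pts=5 here) turns every point into noise, which is why each dataset in the table pairs its minPts with a tuned range of ε.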
Dataset  Base ARI  SNG ARI  Base AMI  SNG AMI

Wearable  0.2626 (55s)  0.3064 (8.2s)  0.4788 (26s)  0.4720 (6.6s) 
Iris  0.5681 (<0.01s)  0.5681 (<0.01s)  0.7316 (<0.01s)  0.7316 (<0.01s) 
LIBRAS  0.0713 (0.03s)  0.0903 (<0.01s)  0.2711 (0.02s)  0.3178 (<0.01s) 
Page Blocks  0.1118 (0.38s)  0.1134 (0.06s)  0.0742 (0.49s)  0.0739 (0.06s) 
kc2  0.3729 (<0.01s)  0.3733 (<0.01s)  0.1772 (<0.01s)  0.1671 (<0.01s) 
Faces  0.0345 (1.7s)  0.0409 (0.16s)  0.2399 (1.7s)  0.2781 (0.15s) 
Ozone  0.0391 (<0.01s)  0.0494 (<0.01s)  0.1214 (<0.01s)  0.1278 (<0.01s) 
Bank  0.1948 (3.4s)  0.2265 (0.03s)  0.0721 (3.3s)  0.0858 (0.03s) 
Ionosphere  0.6243 (0.01s)  0.6289 (<0.01s)  0.5606 (0.01s)  0.5437 (<0.01s) 
Mozilla  0.1943 (0.08s)  0.2642 (0.05s)  0.1452 (0.09s)  0.1558 (0.05s) 
Tokyo  0.4398 (0.02s)  0.4379 (<0.01s)  0.2872 (0.02s)  0.3053 (<0.01s) 
Vehicle  0.0905 (0.01s)  0.0845 (<0.01s)  0.1643 (<0.01s)  0.1653 (<0.01s) 
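The ARI columns above score each clustering against the ground-truth labels. For reference, the adjusted Rand index admits a short self-contained implementation from the contingency table (AMI is analogous but entropy-based and omitted here); this is a generic sketch, not code from the paper:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """ARI between two labelings of the same points (1 = identical, ~0 = chance)."""
    n = len(labels_a)
    # Contingency counts: how many points fall in cluster i of A and cluster j of B.
    contingency = Counter(zip(labels_a, labels_b))
    sum_ij = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)       # chance-level agreement
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

print(adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 1]))  # -> 1.0
```

The chance correction is what lets ARI be compared across datasets with very different cluster counts, as in the table above.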