Clustering is an unsupervised data mining technique that divides a set of objects into subgroups by maximizing inter-group distances and minimizing intra-group distances. Clustering algorithms are usually divided into four classes: partition-based, hierarchy-based, grid-based, and density-based. Among them, DBSCAN (Ester et al., 1996), a density-based algorithm, is one of the most popular. The key idea of DBSCAN is that, for a point p of the data set in d-dimensional space, if its neighborhood within the d-dimensional ball of radius ε, i.e., its ε-neighborhood, contains at least minPts points, then all the points inside this ball, including p, form a cluster, and p is defined as a core point. Whenever a new core point is added to the cluster of p, all the points within the new core point's ε-neighborhood are added to the cluster as well. This process continues recursively until all clusters have been extended to their maximum size.
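The expansion process described above can be sketched in a few lines of Python. This is a minimal sequential illustration, not the parallel method of this paper: it uses a brute-force neighborhood search (quadratic in the number of points), and the names `region_query`, `dbscan`, and `NOISE` are chosen here for illustration only.

```python
from math import dist

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks unclustered points."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster  # i is a core point: start a new cluster
        frontier = list(neighbors)
        while frontier:
            j = frontier.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point, not expanded further
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                frontier.extend(j_neighbors)  # j is also a core point
        cluster += 1
    return labels
```

For example, `dbscan([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)], eps=1.5, min_pts=2)` groups the first three points, groups the next two, and marks the last point as noise.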
DBSCAN is a computationally expensive algorithm, with O(n^2) time complexity as shown in (Gao and Tao, 2015), which makes it inefficient for clustering large-scale data sets. We therefore focus on parallelizing DBSCAN in this study. A recent benchmark compared the runtime of current parallel implementations and found that MPI-based implementations are more efficient than the others (Neukirchen, 2016). In a typical MPI implementation of DBSCAN such as PDSDBSCAN-D, data points are partitioned uniformly across processors, so core points and their ε-neighborhood points may be distributed among processors, and communication is required to merge these points into one cluster. As the number of processors increases, the number of messages and the communication frequency increase, and communication time becomes dominant (Patwary et al., 2012).
Fig. 1 illustrates the communication mode of a typical MPI-based DBSCAN. In Fig. 1(a), the core points that form a single cluster are distributed over three workers. Through common neighbor points, a node has to route via intermediate nodes to reach its parent node. Since MPI uses a peer-to-peer communication pattern, this process generates many merging requests. In general, the MPI-based setting incurs more communication overhead when points from the same cluster are scattered over more partitions, and the situation worsens as the number of workers increases. More details of the communication process are given in (Patwary et al., 2012).
To overcome the communication bottleneck, we employ a parameter server framework (Li et al., 2014) to implement a parallel DBSCAN algorithm using the disjoint-set data structure described in (Patwary et al., 2012). Details of our Parameter Server framework can be found in (Chen et al., 2017; Zhou et al., 2017). In our proposed algorithm, a global vector that records the class labels of all data points is stored on the server processors. On the worker processors, we employ a fast global union approach to union the disjoint-sets locally and push the resulting label vector to the servers to update the global vector. This method alleviates the communication burden. Experiments on datasets of different scales demonstrate that PS-DBSCAN outperforms PDSDBSCAN-D with a 2-10 times speedup in communication efficiency.
The remainder of this paper is organized as follows. Section 2 describes our parallel implementation of DBSCAN based on the Parameter Server framework, referred to as PS-DBSCAN. Section 3 compares the communication speedup of our algorithm against the MPI-based method PDSDBSCAN-D. Section 4 demonstrates the usage of PS-DBSCAN in PAI. Section 5 surveys related work. Section 6 gives a brief conclusion and an overview of future work.
Our PS-DBSCAN is built on Alibaba's parameter server system, KunPeng (Zhou et al., 2017). The KunPeng architecture is shown in Fig. 2. We use the KunPeng SDK to implement the distributed algorithm.
To illustrate our algorithm, we use Fig. 3 as a running example. Our algorithm starts by randomly dividing the input data points into partitions and distributing them to the workers, as in Fig. 3(a), where the nodes are split between two workers. In our setting, the servers maintain the global labels, and each worker maintains its own local labels. Initially, all workers perform clustering operations in parallel: each worker finds the ε-nearest neighbors of each local data point and marks core points accordingly (MarkCorePoint). The local core-point information is synchronized with the servers to obtain the global core-point information. Each worker then performs a local merge to create its local clusters based on the ε-nearest neighborhood information and the global core-point information. With the local clusters, each worker labels its local data points and communicates with the servers to remove labeling conflicts. The steps PropagateMaxLabel, MaxReduceToServer, PullFromServer, GlobalUnion, and GetMaxLabel are performed iteratively until no labeling conflicts are found. The key steps are discussed as follows.
MarkCorePoint: A point is marked as a core point if its ε-neighborhood size is at least minPts.
PropagateMaxLabel: This is a local clustering process in which all the nodes in the same cluster are labeled with the maximum local node id, as illustrated in Fig. 3(b).
MaxReduceToServer: A synchronous max-reduce operator merges the local clustering results with the server results, so that each node is labeled with the maximum node id reported by any local worker, as illustrated in Fig. 3(c).
PullFromServer: This is a typical PS operator to pull results from the server. Interested readers can refer to (Li et al., 2014) for details.
GlobalUnion: This step iterates from the maximum node id down to 0. For each node, if its root node id does not equal the corresponding global label, we set it to the global label. This is an effective way to compress the paths of the disjoint-set and redirect each local node to its root parent. For example, in Fig. 3(c), a node links directly to its root node, unlike in Fig. 1, where a node must route through intermediate nodes to reach its parent. This is the key step in reducing the communication burden.
GetMaxLabel: This step is performed on each local cluster to label every data point with the maximum node id within the cluster. The detailed algorithm is described in Fig. 4. After this step, all the local nodes are labeled with the maximum node id within their cluster. As shown in Fig. 3(d), after this step all the local nodes in the example are labeled as node 11.
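The iteration described by these steps can be pictured with a single-process simulation. The sketch below is not the KunPeng implementation. It assumes that the density-connected links between core points are already available as an edge list, models the server as a plain Python list, and folds GlobalUnion and GetMaxLabel into one relabeling pass per worker. The function names `local_components` and `ps_max_label` are hypothetical.

```python
def local_components(local_nodes, edges):
    """Group a worker's nodes, plus the remote neighbors they link to,
    into connected components using only edges that touch a local node."""
    adjacency = {u: set() for u in local_nodes}
    for u, v in edges:
        if u in local_nodes or v in local_nodes:
            adjacency.setdefault(u, set()).add(v)
            adjacency.setdefault(v, set()).add(u)
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components


def ps_max_label(num_nodes, edges, partitions):
    """Simulate several workers and one server.

    Returns (labels, iterations); every node ends up labeled with the
    maximum node id of its connected component.
    """
    global_label = list(range(num_nodes))  # server-side label vector
    worker_components = [local_components(set(p), edges) for p in partitions]
    iterations = 0
    while True:
        iterations += 1
        pulled = list(global_label)  # PullFromServer: snapshot for this round
        changed = False
        for components in worker_components:
            for component in components:
                # PropagateMaxLabel on the worker's local cluster
                best = max(pulled[n] for n in component)
                for n in component:
                    # MaxReduceToServer: the server keeps the larger label
                    if best > global_label[n]:
                        global_label[n] = best
                        changed = True
        if not changed:  # no labeling conflicts remain
            return global_label, iterations
```

For a chain of eight nodes `0-1-...-7` split into partitions `[[0, 1, 2, 3], [4, 5, 6, 7]]`, the simulation labels every node 7. Labels only ever increase and are bounded by the largest id in each component, so the loop terminates.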
We present our PS-DBSCAN method in Algorithm 1.
In a nutshell, compared with the MPI-based PDSDBSCAN-D, our method has two advantages. First, each worker maintains a local cluster, and merging requests are generated only when the worker has modified labels, which reduces communication overhead. Second, with GlobalUnion, each data point can find its root parent directly without generating many merge requests. This makes the communication of our algorithm 2-10 times faster than that of PDSDBSCAN-D.
We quantitatively evaluate PS-DBSCAN in this section. We first designed experiments to examine the communication efficiency and speedup gain of our method compared with the MPI-based PDSDBSCAN-D. Our method scales better than PDSDBSCAN-D and shows good performance with up to 1600 CPU cores.
Setup. We evaluated our method on a cluster in which each computer node has 24 cores (4 Intel Xeon E5-2430 hex-core processors) and 96GB memory. We ran PDSDBSCAN-D on the cluster using its open source code (http://cucis.ece.northwestern.edu/projects/Clustering/). As only a single-threaded implementation of PDSDBSCAN-D is available, we used only one core per computer node in our experiments. Note that the cluster is a production cluster shared by many applications. To limit the impact of other tasks, we repeated each experiment 6 times and report the mean after discarding the best and worst results.
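The averaging rule used for the repeated runs can be written as a small helper. This is a sketch of that rule only; the function name `trimmed_mean` is chosen here for illustration, and the numbers in the usage note are made up rather than measurements from our experiments.

```python
def trimmed_mean(runtimes):
    """Mean of the runs after dropping the single best (smallest)
    and the single worst (largest) measurement."""
    if len(runtimes) < 3:
        raise ValueError("need at least three runs to drop the best and worst")
    kept = sorted(runtimes)[1:-1]
    return sum(kept) / len(kept)
```

With six hypothetical runtimes `[10.0, 12.0, 11.0, 30.0, 9.0, 13.0]`, the values 9.0 and 30.0 are dropped and the remaining four are averaged, so one slow run caused by a competing job does not distort the reported number.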
Datasets. To investigate the performance of PS-DBSCAN, we first generated two synthetic datasets. The first has 10 million data points, and each data point has an average of 25 directly density-reachable core points (its ε-neighborhood); the second has 100 million points, each with 15 ε-neighbors. We pre-computed pairwise distance information for both.
Furthermore, we used two large real-world datasets from (Götz et al., 2015): geo-tagged Tweets and BremenSmall. The Tweets dataset was obtained using the free Twitter streaming API and contains the locations of all geo-tagged tweets; it consists of 16,602,137 2D points. BremenSmall is a 3D point cloud of the old town of Bremen and contains 2,543,712 points.
3.1. Examination of Communication Efficiency
Table 1 shows the communication time of the MPI-based PDSDBSCAN-D and our PS-DBSCAN on synthetic and real-world datasets using 100, 200, 400, 800, and 1600 cores. Some important observations are discussed in order.
First, on all the datasets, PDSDBSCAN-D tends to become slower than PS-DBSCAN as the number of CPU nodes increases. The reason is that the peer-to-peer communication pattern of PDSDBSCAN-D incurs growing communication overhead with a large number of CPU nodes.
Second, PS-DBSCAN needs only a small number of communication iterations regardless of the growing number of CPU nodes, because our global union method reduces the number of merging requests.
Third, the MPI-based PDSDBSCAN-D is not stable with a large number of CPU nodes. For example, with 1600 CPU nodes, PDSDBSCAN-D fails to generate results, while PS-DBSCAN still works. Furthermore, PDSDBSCAN-D is severely affected by a large number of neighbors. On the Tweets dataset, with 169 ε-nearest neighbors under one ε setting and 3600 neighbors under another, PDSDBSCAN-D fails. Both problems make PDSDBSCAN-D unsuitable for very large data sets.
Last but not least, on the largest dataset, the communication time of PS-DBSCAN first decreases and then increases as the number of nodes increases. Closer examination shows that, when the number of data points is very large, the total merge time benefits to some extent from an increase in the number of nodes.
3.2. Examination of Speedup Gains
We further examined the speedup gains of our PS-DBSCAN over PDSDBSCAN-D.
As shown in Fig. 5, our method achieves a larger speedup gain with more CPU cores. In general, PS-DBSCAN outperforms PDSDBSCAN-D with a 2-10 times speedup in communication efficiency.
Particularly, we found that PS-DBSCAN achieves a 10 times speedup with 800 CPU nodes on one dataset, which is significantly larger than on the other datasets. Closer examination shows that the MPI-based DBSCAN suffers from a large ε-nearest neighborhood size. To illustrate this, we used three datasets with different neighborhood sizes and evaluated the performance on them in Fig. 6. Clearly, the performance of PDSDBSCAN-D degrades with a larger neighborhood size. The reason is that, with a larger neighborhood size, each core point has more neighbors distributed to different workers, which generates more merging requests in the MPI setting. In PS-DBSCAN, by contrast, maintaining a global label and using GlobalUnion results in far fewer merging requests.
We have released PS-DBSCAN on an algorithm platform called Platform of AI (PAI) in Alibaba Cloud. Below we demonstrate the usage of PS-DBSCAN on this cloud-based platform.
PAI provides an interface for interacting with the PS-DBSCAN component. The whole workflow is shown in Fig 7(a), where an input table named “hxdb_sample_6” is linked to the PS-DBSCAN component “DBSCAN-1”. The output of the component is linked to an output table “hxdb_tmp_output-1”. With this workflow, the component automatically pulls the data from the input table and runs the PS-DBSCAN algorithm, and the final results are stored in the output table.
We also provide an interface for users to tune the parameters, as in Fig 7(b). Specifically, the following parameters can be tuned through the interface.
- Input type: vector or linkage
- Dimension: input data dimension
- Epsilon: the distance threshold of DBSCAN
- minPts: the density threshold of DBSCAN
- Input format: the number of input columns
- Server number: the number of server nodes
- Worker number: the number of worker nodes
- Server cores: CPU cores for each server
- Worker cores: CPU cores for each worker
- Server memory: memory for each server
- Worker memory: memory for each worker
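To make the parameter list concrete, the sketch below collects the same settings in a small Python structure with basic sanity checks. It is an illustration only and not the PAI API: the class name `PsDbscanConfig`, the field names, and the choice of megabytes for the memory fields are all assumptions made here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PsDbscanConfig:
    """Hypothetical container mirroring the tunable parameters listed above."""
    input_type: str        # "vector" or "linkage"
    dimension: int         # input data dimension
    epsilon: float         # distance threshold of DBSCAN
    min_pts: int           # density threshold of DBSCAN
    input_format: int      # number of input columns
    server_number: int
    worker_number: int
    server_cores: int
    worker_cores: int
    server_memory_mb: int  # unit is an assumption
    worker_memory_mb: int  # unit is an assumption

    def problems(self):
        """Return a list of readable problems; an empty list means no issue found."""
        found = []
        if self.input_type not in ("vector", "linkage"):
            found.append("input_type must be 'vector' or 'linkage'")
        if self.epsilon <= 0:
            found.append("epsilon must be positive")
        if self.min_pts < 1:
            found.append("min_pts must be at least 1")
        if self.server_number < 1 or self.worker_number < 1:
            found.append("need at least one server and one worker")
        return found
```

Checking the values before submitting a job catches simple mistakes, such as a non-positive epsilon, before any cluster resources are allocated.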
We present the input and output tables of our PS-DBSCAN algorithm in Fig 8. The tables are stored on the MaxCompute platform. Interested readers can find the details here: https://www.aliyun.com/product/odps/.
We support two types of data as input:
- Vector: each node has an index and is represented by a vector, as shown in Fig 8(a).
- Linkage: each record in the table is a link between two nodes.
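The difference between the two input types can be illustrated as follows. This sketch assumes that a link in the linkage input plays the role of an ε-neighbor relation, so that both input types lead to the same neighbor-set structure. The row layouts and the function names `neighbors_from_vectors` and `neighbors_from_links` are hypothetical and do not reflect the actual table schema in Fig 8.

```python
from math import dist


def neighbors_from_vectors(rows, eps):
    """Build neighbor sets from (index, vector) rows by brute-force distance checks."""
    neighbors = {idx: set() for idx, _ in rows}
    for a, (i, p) in enumerate(rows):
        for j, q in rows[a + 1:]:
            if dist(p, q) <= eps:
                neighbors[i].add(j)
                neighbors[j].add(i)
    return neighbors


def neighbors_from_links(rows):
    """Build neighbor sets from (index, index) link rows; no distances are needed."""
    neighbors = {}
    for i, j in rows:
        neighbors.setdefault(i, set()).add(j)
        neighbors.setdefault(j, set()).add(i)
    return neighbors
```

Under this assumption, vector input requires a neighborhood search with the chosen epsilon, while linkage input already encodes which nodes are neighbors.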
After running the algorithm, we obtain the clustering result for the input data, as shown in Fig 8(b).
To try PS-DBSCAN, users can register for PAI online at https://pai.base.shuju.aliyun.com/ and search for PS-DBSCAN in the search bar.
5. Related Work
There are generally two lines of work on parallelizing DBSCAN: one on MapReduce-based big data platforms such as Apache Spark, and the other on distributed-memory systems using the Message Passing Interface (MPI).
The studies in (Fu et al., 2011; He et al., 2011) are the first to implement a parallel DBSCAN based on the MapReduce paradigm. A similar idea is used in RDD-DBSCAN (Cordova and Moh, 2015). In (Cordova and Moh, 2015; Litouka, 2014), the data space is split into roughly equal-sized boxes until the data size of a box is less than or equal to a threshold, a maximum number of levels is reached, or the shortest side of a box becomes smaller than 2 eps. Each resulting box is a record of an RDD that can be processed in parallel. Another work (Raad, 2016) implements an approximation of the DBSCAN algorithm, which yields better efficiency at the cost of slightly worse clustering results.
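The box-splitting rule with its three stopping conditions can be sketched as a short recursion. This is an illustration under assumptions, not the code of the cited Spark implementations: it cuts the longest side of a box at its midpoint, which is only one way to obtain roughly equal-sized boxes, and the function name `split_boxes` and its parameters are chosen here.

```python
def split_boxes(points, lower, upper, eps, max_points, max_levels, level=0):
    """Recursively split the box [lower, upper] and return its leaf boxes.

    points are coordinate tuples; lower and upper are corner tuples.
    Each leaf is returned as (lower, upper, points_in_box).
    """
    sides = [u - l for l, u in zip(lower, upper)]
    if (len(points) <= max_points        # box is small enough
            or level >= max_levels       # maximum depth reached
            or min(sides) < 2 * eps):    # shortest side below 2 eps
        return [(lower, upper, points)]
    axis = sides.index(max(sides))       # assumption: halve the longest side
    mid = (lower[axis] + upper[axis]) / 2
    left = [p for p in points if p[axis] < mid]
    right = [p for p in points if p[axis] >= mid]
    left_upper = upper[:axis] + (mid,) + upper[axis + 1:]
    right_lower = lower[:axis] + (mid,) + lower[axis + 1:]
    return (
        split_boxes(left, lower, left_upper, eps, max_points, max_levels, level + 1)
        + split_boxes(right, right_lower, upper, eps, max_points, max_levels, level + 1)
    )
```

Each leaf box can then be clustered independently. The 2 eps limit keeps boxes from becoming too thin relative to the neighborhood radius.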
However, a recent benchmark study (Neukirchen, 2016) shows that MPI-based distributed implementations of DBSCAN, e.g., PDSDBSCAN, outperform the Spark implementations (Cordova and Moh, 2015; Litouka, 2014). For MPI-based parallel DBSCAN implementations, many existing methods use a master-slave model (Chen et al., 2010; Coppola and Vanneschi, 2002; Xu et al., 2002; Brecheisen et al., 2006; Fu et al., 2011). In the master-slave model, the data is partitioned among the slaves, each of which clusters its local data and sends the result to a master node for merging. The master node sequentially merges all the local clusters to obtain the clustering result. This method has a high communication overhead, which makes it inefficient in the merging stage. PDSDBSCAN (Patwary et al., 2012) is a fully distributed parallel algorithm that employs the disjoint-set structure to speed up the communication process. However, since data points within the same cluster may be distributed over different workers, resulting in several disjoint-sets, merging them still incurs significant communication costs.
Another work (Götz et al., 2015) proposes a more scalable approach based on grid-based data index pre-processing, in which the data indices are re-sorted and neighboring data points are assigned to the same processor to reduce communication cost. In contrast, our algorithm employs a fast global union approach based on the parameter server framework to union the disjoint-sets and alleviate the communication burden. Our method does not require specific data pre-processing and is communication-efficient compared with the competing MPI-based DBSCAN methods.
We presented a communication-efficient parallel DBSCAN based on Parameter Server, named PS-DBSCAN. This algorithm uses the disjoint-set data structure from (Patwary et al., 2012) and employs a fast global union approach to union the disjoint-sets and alleviate the communication burden. We compared the performance of PS-DBSCAN with the MPI implementation PDSDBSCAN-D on real-world and synthetic datasets of different scales. Experiments show that PS-DBSCAN outperforms the MPI-based PDSDBSCAN-D with a 2-10 times speedup in communication efficiency on both real-world and synthetic datasets, and that the speedup increases with the number of processor cores and the dataset scale. It has been shown that combining multithreading with a distributed-memory system can bring further speedup; we plan to employ multithreading in PS-DBSCAN to further boost overall efficiency in the future.
We have released our PS-DBSCAN in an algorithm platform called Platform of AI (PAI) in Alibaba Cloud and also demonstrated how to use it in PAI.
Acknowledgements. The authors would like to thank Jun Zhou, Xu Chen and Ang Wang for providing valuable comments and helpful suggestions. The authors would also like to thank the anonymous reviewers for their valuable comments.
- Brecheisen et al. (2006) Stefan Brecheisen, Hans-Peter Kriegel, and Martin Pfeifle. 2006. Parallel Density-Based Clustering of Complex Objects. In PAKDD.
- Chen et al. (2017) Cen Chen, Peilin Zhao, Longfei Li, Jun Zhou, Xiaolong Li, and Minghui Qiu. 2017. Locally Connected Deep Learning Framework for Industrial-scale Recommender Systems. In WWW ’17.
- Chen et al. (2010) Min Chen, Xuedong Gao, and Huifei Li. 2010. Parallel DBSCAN with priority r-tree. In The 2nd IEEE International Conference on Information Management and Engineering (ICIME). 508–501.
- Coppola and Vanneschi (2002) Massimo Coppola and Marco Vanneschi. 2002. High-performance data mining with skeleton-based structured parallel programming. Parallel Comput. 28, 5 (2002), 793–813.
- Cordova and Moh (2015) Irving Cordova and Teng-Sheng Moh. 2015. DBSCAN on Resilient Distributed Datasets. (2015), 531–540.
- Ester et al. (1996) Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In SIGKDD. 226–231.
- Fu et al. (2011) Yanxiang Fu, Weizhong Zhao, and Huifang Ma. 2011. Research on parallel DBSCAN algorithm design based on mapreduce. In Advanced Materials Research. 1133–1138.
- Gao and Tao (2015) Junhao Gao and Yufei Tao. 2015. DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation. In SIGMOD. 519–530.
- Götz et al. (2015) Markus Götz, Christian Bodenstein, and Morris Riedel. 2015. HPDBSCAN: Highly Parallel DBSCAN. In Proc. of the Workshop on Machine Learning in High-Performance Computing Environments (MLHPC ’15). Article 2, 10 pages.
- He et al. (2011) Yaobin He, Haoyu Tan, Wuman Luo, Huajian Mao, Di Ma, Shengzhong Feng, and Jianping Fan. 2011. MR-DBSCAN: An Efficient Parallel Density-Based Clustering Algorithm Using MapReduce. In ICPADS ’11. 473–480.
- Li et al. (2014) Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing Su. 2014. Scaling Distributed Machine Learning with the Parameter Server. In OSDI. 583–598.
- Litouka (2014) Aliaksei Litouka. 2014. Spark DBSCAN source code. (2014). https://github.com/alitouka/spark_dbscan
- Neukirchen (2016) Helmut Neukirchen. 2016. Survey and Performance Evaluation of DBSCAN Spatial Clustering Implementations for Big Data and High-Performance Computing Paradigms. (2016).
- Patwary et al. (2012) Mostofa Ali Patwary, Diana Palsetia, Ankit Agrawal, Wei-keng Liao, Fredrik Manne, and Alok Choudhary. 2012. A New Scalable Parallel DBSCAN Algorithm Using the Disjoint-set Data Structure. In Proc. of the International Conference on High Performance Computing, Networking, Storage and Analysis.
- Raad (2016) Mansour Raad. 2016. DBSCAN On Spark source code. (2016). https://github.com/mraad/
- Xu et al. (2002) Xiaowei Xu, Jochen Jäger, and Hans-Peter Kriegel. 2002. A fast parallel clustering algorithm for large spatial databases. In High Performance Data Mining. 263–290.
- Zhou et al. (2017) Jun Zhou, Xiaolong Li, Peilin Zhao, Chaochao Chen, Longfei Li, Xinxing Yang, Qing Cui, Jin Yu, Xu Chen, Yi Ding, and Yuan Alan Qi. 2017. KunPeng: Parameter Server Based Distributed Learning Systems and Its Applications in Alibaba and Ant Financial. In KDD. 1693–1702.