A Domain Adaptive Density Clustering Algorithm for Data with Varying Density Distribution

11/23/2019 ∙ by Jianguo Chen, et al. ∙ University of Illinois at Chicago

As one type of efficient unsupervised learning methods, clustering algorithms have been widely used in data mining and knowledge discovery with noticeable advantages. However, clustering algorithms based on density peak have limited clustering effect on data with varying density distribution (VDD), equilibrium distribution (ED), and multiple domain-density maximums (MDDM), leading to the problems of sparse cluster loss and cluster fragmentation. To address these problems, we propose a Domain-Adaptive Density Clustering (DADC) algorithm, which consists of three steps: domain-adaptive density measurement, cluster center self-identification, and cluster self-ensemble. For data with VDD features, clusters in sparse regions are often neglected by using uniform density peak thresholds, which results in the loss of sparse clusters. We define a domain-adaptive density measurement method based on K-Nearest Neighbors (KNN) to adaptively detect the density peaks of different density regions. We treat each data point and its KNN neighborhood as a subgroup to better reflect its density distribution in a domain view. In addition, for data with ED or MDDM features, a large number of density peaks with similar values can be identified, which results in cluster fragmentation. We propose a cluster center self-identification and cluster self-ensemble method to automatically extract the initial cluster centers and merge the fragmented clusters. Experimental results demonstrate that compared with other comparative algorithms, the proposed DADC algorithm can obtain more reasonable clustering results on data with VDD, ED and MDDM features. Benefitting from a few parameter requirements and non-iterative nature, DADC achieves low computational complexity and is suitable for large-scale data clustering.


1 Introduction

Clustering algorithms have been widely used in various data analysis fields [1, 2]. Numerous clustering algorithms have been proposed, including the partitioning-based, hierarchical-based, density-based, grid-based, model-based, and density-peak-based methods [3, 4, 5, 6]. Among them, density-based methods (e.g., DBSCAN, CLIQUE, and OPTICS) can effectively discover clusters of arbitrary shape using the density connectivity of clusters, and do not require a pre-defined number of clusters [6]. In recent years, Density-Peak-based Clustering (DPC) algorithms, as a branch of density-based clustering, were introduced in [7, 8], assuming that the cluster centers are surrounded by low-density neighbors and can be detected by efficiently searching for local density peaks.

Benefiting from few parameter requirements and a non-iterative nature, DPC algorithms can efficiently detect clusters of arbitrary shape from large-scale datasets with low computational complexity. However, as shown in Fig. 1, DPC algorithms have limited clustering effect on data with varying density distribution (VDD), multiple domain-density maximums (MDDM), or equilibrium distribution (ED). (1) For data with VDD characteristics, there are regions of varying density, and data points in sparse regions are usually ignored as outliers or misallocated to adjacent dense clusters when uniform density-peak thresholds are used, which results in the loss of sparse clusters. (2) Clustering results of DPC algorithms depend on a strict constraint that there is only one local density maximum in each candidate cluster. However, for data with MDDM or ED, there are zero or multiple local density maximums in a natural cluster, and DPC algorithms might lead to the problem of cluster fragmentation. (3) In addition, how to determine the parameter thresholds of local density and Delta distance in the clustering decision graph is another problem for DPC algorithms. Therefore, it is critical to address the problems of sparse cluster loss and cluster fragmentation on data with VDD, ED, and MDDM features and to improve clustering accuracy.

Fig. 1: Challenges of DPC algorithms on data with VDD, ED, and MDDM.
Fig. 2: Workflow of the proposed DADC algorithm.

Aiming at the problems of sparse cluster loss and cluster fragmentation, we propose a Domain-Adaptive Density Clustering (DADC) algorithm. As shown in Fig. 2, the DADC algorithm consists of three steps: domain-adaptive density measurement, cluster center self-identification, and cluster self-ensemble. A domain-adaptive density measurement method based on K-Nearest Neighbors (KNN) is defined, which can be used to adaptively detect the density peaks of different density regions. On this basis, cluster center self-identification and cluster self-ensemble methods are proposed to automatically extract the initial cluster centers and merge the fragmented clusters. Extensive experiments indicate that DADC outperforms comparison algorithms in clustering accuracy and robustness. The contributions of this paper are summarized as follows.

  • To address the problem of sparse cluster loss of data with VDD, a domain-adaptive density measurement method is proposed to detect density peaks in different density regions. According to these density peaks, cluster centers in both dense and sparse regions are effectively discovered, which well addresses the sparse cluster loss problem.

  • To automatically extract the initial cluster centers, we draw a clustering decision graph based on domain density and Delta distance. We then propose a cluster center self-identification method and automatically determine the parameter thresholds and cluster centers from the clustering decision graph.

  • To address the problem of cluster fragmentation on data with ED or MDDM, an innovative Cluster Fusion Degree (CFD) model is proposed, which consists of the inter-cluster density similarity, cluster crossover degree, and cluster density stability. Then, a cluster self-ensemble method is proposed to automatically merge the fragmented clusters by evaluating the CFD between adjacent clusters.

The rest of the paper is organized as follows. Section 2 reviews the related work. Section 3 presents the domain adaptive method for cluster center detection. A cluster self-identification method and cluster ensemble method are respectively introduced in Section 4. Experimental results and evaluations are shown in Section 5. Finally, Section 6 concludes the paper.

2 Related Work

As efficient unsupervised data mining methods, numerous clustering algorithms have been proposed and widely applied in various applications [9, 2, 10]. Partition-based methods (e.g., K-Means and K-Medoids) [3] are easy to understand and implement, but they are sensitive to noisy data and can only detect round or spherical clusters. Hierarchical methods (e.g., BIRCH, CURE, and ROCK) [5] do not need a pre-defined number of clusters and can extract the hierarchical relationships of clusters, but they have high computational complexity. Density-based methods (e.g., DBSCAN, CLIQUE, and OPTICS) [6] also do not require a pre-defined number of clusters and can discover clusters of arbitrary shapes, but their clustering results are sensitive to parameter thresholds.

Focusing on density-based clustering analysis, abundant improvements of traditional algorithms have been presented and novel algorithms explored [7, 11, 12, 13]. Groups of Density-Peak-based Clustering (DPC) algorithms were proposed in [7, 8], where cluster centers are detected by efficiently searching for density peaks. In [7], Rodriguez et al. proposed a DPC algorithm titled “Clustering by fast search and find of density peaks” (widely cited as CFSFDP). CFSFDP can effectively detect arbitrarily shaped clusters from large-scale datasets. Benefiting from its non-iterative nature, CFSFDP achieves low computational complexity and high efficiency for big data processing. In addition, for large-scale noisy datasets, robust clustering algorithms were discussed in [14], which detect density peaks and assign points based on a fuzzy weighted KNN method.

For data that exhibit varying-density distribution or multiple local-density maximums, DPC algorithms face a variety of limitations, such as sparse cluster loss and cluster fragmentation. To address these problems, a variety of optimization solutions were presented in [15, 16, 17]. Zheng et al. proposed an approximate nearest neighbor search method for multiple distance functions with a single index [15]. To overcome the limitations of DPC, an adaptive clustering method was presented in [18], in which heat diffusion is used to estimate density and the cutoff distance is simplified. In [19], an adaptive density-based clustering algorithm for spatial databases with noise was introduced, which uses a novel adaptive strategy for neighbor selection based on spatial object distribution to improve clustering accuracy.

Aiming at clustering ensemble, an automatic clustering approach via outward statistical testing on density metrics was introduced in [16]. A nonparametric Bayesian clustering ensemble method was explored in [20] to seek the number of clusters in consensus clustering, which achieves versatility and superior stability. Yu et al. proposed an adaptive ensemble framework for semi-supervised clustering solutions [17]. Zheng et al. proposed a framework for hierarchical ensemble clustering [5]. Yu et al. introduced an incremental semi-supervised clustering ensemble approach for high-dimensional data clustering [21].

Compared with the existing clustering algorithms, the proposed domain-adaptive density method in this work can adaptively detect the domain densities and cluster centers in regions with different densities. This method is very feasible and practical in actual big data applications. The proposed cluster self-identification method can effectively identify the candidate cluster centers with minimum artificial intervention. Moreover, the proposed CFD model takes full account of the relationships between clusters of large-scale datasets, including the inter-cluster density similarity, cluster crossover degree, and cluster density stability.

3 Domain-Adaptive Density Method

We propose a domain-adaptive density method to address the problem of sparse cluster loss of DPC algorithms on VDD data. Domain-density peaks of data points in regions with different densities are adaptively detected. In addition, candidate cluster centers are identified based on the decision parameters of the domain densities and Delta distances.

3.1 Problem Definitions

Most DPC algorithms [7, 8] are based on the assumption that a cluster center is surrounded by neighbors with lower local densities and has a large Delta distance from any point with a higher density. For each data point x_i, its local density ρ_i and Delta distance δ_i are calculated with respect to the higher-density points. These two quantities depend only on the distances d_ij between the data points. The local density ρ_i of x_i is defined as:

\rho_i = \sum_{j} \chi(d_{ij} - d_c), \quad (1)

where d_c is a cutoff distance and \chi(x) = 1 if x < 0; otherwise, \chi(x) = 0. Basically, ρ_i is equal to the number of points closer than d_c to x_i. The Delta distance δ_i of x_i is measured by computing the shortest distance between x_i and any other point with a higher density, defined as:

\delta_i = \min_{j: \rho_j > \rho_i} d_{ij}. \quad (2)

For the point with the highest density, \delta_i = \max_{j} d_{ij}. Points with a high ρ_i and a high δ_i are considered as cluster centers, while points with a low ρ_i and a high δ_i are considered as outliers. After the cluster centers are found, each remaining point is assigned to the same cluster as its nearest neighbor of higher density.
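To make these two decision quantities concrete, the following minimal NumPy sketch computes ρ and δ for a small dataset under the definitions above. The function name and the brute-force distance matrix are illustrative choices of ours, not the authors' released implementation.

import numpy as np

def cfsfdp_decision_quantities(X, d_c):
    """Return (rho, delta) for an n x d point array X and cutoff distance d_c."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    rho = np.sum(D < d_c, axis=1) - 1        # Eq. (1): points closer than d_c (self excluded)
    n = len(X)
    delta = np.zeros(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]   # points with strictly higher local density
        # Eq. (2): nearest higher-density point; the global peak takes the farthest distance
        delta[i] = D[i].max() if higher.size == 0 else D[i, higher].min()
    return rho, delta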

Most data in actual applications are characterized by noise, irregular distribution, and sparsity. In particular, the density distribution of data points is unpredictable and discrete in most cases. In a VDD dataset, regions with different degrees of density coexist, such as dense and sparse regions, as defined below.

Definition 3.1 VDD Data. For a dataset that has multiple regions, the average density of the data points in each region is set as the region's density. If regions with obviously different region densities coexist, such as dense and sparse regions, we denote the dataset as a Varying-Density Distributed (VDD) dataset.

The CFSFDP algorithm and other DPC algorithms suffer from the limitations of sparse cluster loss on VDD datasets. According to Eq. (1), points in the relatively sparse area are easily ignored as outliers. An example of the CFSFDP clustering results on a VDD dataset is shown in Fig. 3.

(a) Data points
(b) Decision graph
Fig. 3: Example of the CFSFDP algorithm on a VDD dataset.

In Fig. 3 (a), the heart-shaped dataset has three regions with different densities. The clustering decision graph achieved by CFSFDP is shown in Fig. 3 (b), where only one point is obtained with high values of both and . Consequently, the dataset is clustered into one cluster, while the data points in the sparse regions, indicated by blue dots and purple squares, are removed as outliers or incorporated into the dense cluster.

3.2 Domain-Adaptive Density Measurements

To adaptively detect the domain-density peaks in different density areas of VDD data, a domain-adaptive density calculation method is presented in this section. Domain distance and domain density calculation methods are presented based on the KNN method [22, 23]. These methods are very useful and handy on large-scale datasets that likely contain varying distribution densities in actual applications.

To more precisely describe the locality of VDD data, we propose a new definition of domain density based on the KNN method. Given a dataset X, the K-distance and K-density of each data point in X are calculated, respectively.

Definition 3.2 K-Distance. Given a dataset X, the K-distance of a data point x_i refers to the average distance from x_i to its K nearest neighbors. The K-distance of x_i is defined as:

KDis(x_i) = \frac{1}{K} \sum_{x_j \in KNN(x_i)} d_{ij}, \quad (3)

where K is the number of neighbors of x_i and KNN(x_i) is the set of its neighbors. Based on the K-distance, we calculate the K-density of each data point.

Definition 3.3 K-Density. The K-density of a data point x_i in dataset X refers to the reciprocal of its K-distance. The smaller the K-density of a data point, the sparser the area in which it is located. The K-density of x_i is defined as:

KDen(x_i) = \frac{1}{KDis(x_i)}. \quad (4)

After obtaining the K-distance and K-density, the domain density of each data point is defined. We treat each data point x_i together with its K nearest neighbors as a subgroup to observe its density distribution in X.

Definition 3.4 Domain Density. The domain density of a data point x_i in dataset X is the sum of the K-density of x_i and the weighted K-densities of its K nearest neighbors. The domain density ρ^dom_i of x_i is defined as:

\rho^{dom}_i = KDen(x_i) + \sum_{x_j \in KNN(x_i)} w_{ij} \, KDen(x_j), \quad (5)

where w_{ij} is the weight of the K-density between each neighbor x_j and x_i. Compared to the K-density, the domain density better reflects the density distribution of data points in the local area.

Consider a dataset with 50 samples as an example; the distances among data points are calculated as shown in Fig. 4. Setting K = 5, the five nearest neighbors of a data point x_i lie at distances of 8.05, 8.05, 8.70, 8.79, and 12.58, respectively. According to Eq. (3), the K-distance of x_i is (8.05 + 8.05 + 8.70 + 8.79 + 12.58)/5 ≈ 9.23. Hence, the K-density of x_i is 1/9.23 ≈ 0.11, and the domain density of x_i is 0.16. In the same way, the K-distances, K-densities, and domain densities of the neighbors of x_i are calculated successively. It is obvious that x_i has a higher domain density than any of its neighbors, reaching the value of 0.16.

Fig. 4: Example of domain density calculation (partial).
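The following short sketch computes the K-distance, K-density, and domain density of Eqs. (3)-(5) with NumPy. Because the exact form of the neighbor weights w_ij is not given above, the sketch assumes uniform weights 1/K as a placeholder, so its domain-density values are illustrative rather than those reported in the example.

import numpy as np

def knn_domain_density(X, K):
    """Return (K-distance, K-density, domain density) for each row of X; weights assumed uniform."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    knn = np.argsort(D, axis=1)[:, 1:K + 1]                    # indices of the K nearest neighbors
    k_dis = np.take_along_axis(D, knn, axis=1).mean(axis=1)    # Eq. (3)
    k_den = 1.0 / k_dis                                        # Eq. (4)
    w = 1.0 / K                                                # assumed placeholder weight w_ij
    dom = k_den + (w * k_den[knn]).sum(axis=1)                 # Eq. (5)
    return k_dis, k_den, dom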

3.3 Clustering Decision Parameter Measurement

Based on the domain density, the Delta distance of each data point is computed as a clustering decision parameter. As defined in Eq. (2), the Delta distance δ_i of x_i is measured by calculating the shortest distance between x_i and any other point with a higher density. In such a case, only the point with the highest global density has the maximum value of the Delta distance, and the domain-density peak of a sparse region may yield a Delta distance value that is lower than those of the remaining points in a relatively dense region. An example of the Delta distances of a dataset is shown in Fig. 5.

Fig. 5: Example of Delta distance calculation (partial).

In Fig. 5, the domain densities of all data points are calculated. The point that owns the highest domain density takes as its Delta distance the distance to the point farthest from it. Because the remaining data points do not have the highest domain density, their Delta distances are the shortest distances to points with higher domain density.

Considering that a dataset may have multiple regions with different densities, the domain densities of data points in a dense region are higher than those of points in a sparse region. To adaptively identify the density peaks of each region, we update the definition of the domain-adaptive density by combining the values of the domain density and the Delta distance. The domain-adaptive density of each data point x_i is updated as:

\rho^{DA}_i = \rho^{dom}_i \times \delta_i. \quad (6)

There are three levels of domain density for data points: global density maximum, domain density maximum, and normal density. (1) It is easy to identify the point with the highest global density and set it as a cluster center. For the global density maximum point, we set the largest distance between this point and any other point as its Delta distance. (2) For the density maximum point of a region, the nearest point with a higher density must lie in another region with a greater density rather than in the current region, so its Delta distance is large. Therefore, to clearly identify the density peaks of each region, we multiply the domain density and the Delta distance of each point. (3) For the remaining points in each region, both their domain densities and their Delta distances are much smaller than those of the peak points of the same region.

Based on the values of the domain density ρ^dom and the Delta distance δ, a clustering decision graph is drawn to identify the candidate cluster centers. In the clustering decision graph, the horizontal axis represents ρ^dom and the vertical axis represents δ. Points with high values of both ρ^dom and δ are considered as cluster centers, while points with a low ρ^dom and a high δ are considered as outliers.

The process of domain density and Delta distance calculation of DADC is presented in Algorithm 1. Assuming that the number of data points in X is n, we calculate the K nearest neighbors and the domain density of each data point from the pairwise distance matrix. Therefore, the computational complexity of Algorithm 1 is O(n^2).

Input: X: the dataset for clustering; K: the number of neighbors of each data point.
Output: (ρ^dom, δ): the domain-adaptive densities and Delta distances of the data points of X.
1:  calculate the distance matrix D for X;
2:  for each x_i in X do
3:     obtain the K-nearest neighbors KNN(x_i) of x_i;
4:     calculate the K-distance KDis(x_i);
5:     calculate the K-density KDen(x_i);
6:     calculate the domain density ρ^dom_i;
7:  get the maximum domain density ρ_max ← max(ρ^dom);
8:  for each x_i in X do
9:     if ρ^dom_i == ρ_max then
10:       set the Delta distance δ_i ← max_j(d_ij);
11:    else
12:       set the Delta distance δ_i ← min_{j: ρ^dom_j > ρ^dom_i}(d_ij);
13:    calculate the domain-adaptive density ρ^DA_i ← ρ^dom_i × δ_i;
14: return (ρ^dom, δ).
Algorithm 1 Domain-adaptive density and Delta distance calculation of DADC.
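A runnable Python sketch of Algorithm 1 is given below. It combines the pieces shown earlier and keeps the uniform-weight assumption for Eq. (5); the function and variable names are ours, not those of the reference implementation.

import numpy as np

def dadc_decision_quantities(X, K):
    """Return (domain density, Delta distance, domain-adaptive density) for each row of X."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)      # step 1: distance matrix
    n = len(X)
    knn = np.argsort(D, axis=1)[:, 1:K + 1]                        # step 3: K nearest neighbors
    k_den = 1.0 / np.take_along_axis(D, knn, axis=1).mean(axis=1)  # steps 4-5: K-distance, K-density
    dom = k_den + (k_den[knn] / K).sum(axis=1)                     # step 6: Eq. (5), assumed weights
    delta = np.empty(n)
    for i in range(n):                                             # steps 8-12: Delta distance
        higher = np.where(dom > dom[i])[0]
        delta[i] = D[i].max() if higher.size == 0 else D[i, higher].min()
    rho_da = dom * delta                                           # step 13: Eq. (6)
    return dom, delta, rho_da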

4 Cluster Self-identification Method

For data with ED or MDDM features, a large number of density peaks with similar values can be identified, which results in cluster fragmentation. In this section, aiming at the problem of cluster fragmentation, we propose a cluster self-identification method to extract initial cluster centers by automatically determining the parameter thresholds of the clustering decision graph. Then, a Cluster Fusion Degree (CFD) model is proposed to evaluate the relationship of adjacent clusters. Finally, a CFD-based cluster self-ensemble method is proposed to merge the fragmented clusters.

4.1 Problem Definitions

Based on the domain-adaptive densities and Delta distances, candidate cluster centers of a dataset can be obtained from the corresponding clustering decision graph. After the cluster centers are identified, each of the remaining data points is assigned to the cluster to which its nearest higher-density neighbor belongs.

(1) Decision-parameter threshold determination.

A limitation of the CFSFDP algorithm is how to determine the thresholds of the decision parameters in the clustering decision graph. In CFSFDP, data points with high values of both local density and Delta distance are regarded as cluster centers. In practice, however, these parameter thresholds are often set manually. An example of the Frame dataset and the corresponding clustering decision graph is shown in Fig. 6.

(a) Frame dataset
(b) Decision graph for Frame
Fig. 6: Decision-parameter threshold determination.

Fig. 6 (b) is the clustering decision graph of CFSFDP for the dataset in Fig. 6 (a). It is difficult to decide whether only the points in the red box, or those in both the red and blue boxes, should be regarded as cluster centers. Therefore, how to determine the threshold values of the decision parameters in an effective way is an important issue for our algorithm.

(2) Cluster fragmentation on MDDM or ED Data.

Most DPC algorithms have limitations of cluster fragmentation on the datasets with multiple domain-density maximums (MDDM) or equilibrium distribution (ED).

Definition 4.1 MDDM Dataset. Given a dataset, the domain-adaptive densities of its data points are calculated. If multiple points with the same highest domain density coexist in a region, we say that the dataset holds the characteristic of multiple domain-density maximums. A dataset with multiple domain-density maximums is defined as an MDDM dataset.

Definition 4.2 ED Dataset. Given a dataset, the domain densities of data points in the dataset are calculated. If each data point has the same value of domain density, the dataset is under an equilibrium distribution and is defined as an ED dataset. In such a case, each data point having the same value of domain density is regarded as a domain density peak and further considered as a candidate cluster center.

Clustering results of the CFSFDP algorithm depend on a strict constraint that only one local density maximum is assumed to exist in each candidate cluster. However, when there exist multiple local density maximums in a natural cluster, CFSFDP might lead to the problem of cluster fragmentation. Namely, a cluster is split into many fragmented clusters. Two examples of the clustering decision graph of CFSFDP on MDDM and ED datasets are shown in Fig. 7.

(a) MDDM dataset
(b) Decision graph
(c) ED dataset
(d) Decision graph
Fig. 7: Clustering decision graph of MDDM and ED dataset.

In Fig. 7 (b), there are as many as 29 decision points that hold high values of both domain density ρ^dom and Delta distance δ. In such a case, the dataset is divided into 29 fragmented clusters instead of the 2 natural clusters shown in Fig. 7 (a). As shown in Fig. 7 (c), there are two isolated regions in the dataset, and the data points in each region are in an equilibrium distribution. Hence, this dataset is expected to be divided into 2 natural clusters. However, many points exhibit similar values of local/domain density and are regarded as cluster centers, as shown in Fig. 7 (d). Consequently, this dataset is incorrectly divided into numerous fragmented sub-clusters rather than the expected two clusters.

4.2 Initial Cluster Self-identification

4.2.1 Cluster Center Identification

We propose a self-identification method to automatically extract the cluster centers based on the clustering decision graph. To automatically determine the parameter threshold values of the domain density and Delta distance, a critical point of the clustering decision graph is defined. The critical point of a clustering decision graph is a splitting point by which the candidate cluster centers, outliers, and remaining points can be clearly separated, as shown in Fig. 8.

Fig. 8: Critical point of a clustering decision graph.

Under the assumption of the CFSFDP and DADC algorithms, cluster centers are the points with relatively high domain-density peaks, while outliers have the lowest domain densities. It follows that the domain densities of density peaks differ obviously from those of outliers. Therefore, we take half of the maximum domain density as the horizontal-axis value of the critical point, namely, ρ_cp = ρ_max / 2. In addition, based on extensive experiments and applications, it is an effective solution to set the vertical-axis value of the critical point to one quarter of the maximum Delta distance, namely, δ_cp = δ_max / 4. Therefore, the critical point CP of the clustering decision graph is defined as:

CP = \left( \frac{\rho_{max}}{2}, \; \frac{\delta_{max}}{4} \right), \quad (7)

where ρ_max and δ_max are the maximum values of ρ^dom and δ, respectively.

Based on the critical point, data points in the clustering decision graph can be divided into three subsets, namely, cluster centers, outliers, and remaining points. The division method of data points in the clustering decision graph is defined as:

subset(x_i) =
\begin{cases}
\text{cluster center}, & \text{if } \rho^{dom}_i \ge \rho_{max}/2 \text{ and } \delta_i \ge \delta_{max}/4, \\
\text{outlier}, & \text{if } \rho^{dom}_i < \rho_{max}/2 \text{ and } \delta_i \ge \delta_{max}/4, \\
\text{remaining point}, & \text{otherwise},
\end{cases} \quad (8)

where subset(x_i) refers to the subset that x_i belongs to. In Fig. 8, after getting the value of the critical point, the data points in the decision graph can be easily divided into three subsets. Red points are detected as candidate cluster centers. Black points have low values of domain density and high values of Delta distance; they are identified as outliers and removed from the clustering results. Blue points are the remaining data points, which are assigned to the related clusters in the next step. Hence, the initial cluster centers of the dataset are obtained with few parameter requirements and minimum artificial intervention.
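A minimal sketch of the critical-point rule in Eqs. (7)-(8) follows; the boundary conditions (≥ versus >) are our reading of the decision graph in Fig. 8.

import numpy as np

def split_by_critical_point(dom, delta):
    """Split points into candidate centers, outliers, and remaining points (Eqs. (7)-(8))."""
    rho_cp, delta_cp = dom.max() / 2.0, delta.max() / 4.0          # Eq. (7): critical point
    centers   = np.where((dom >= rho_cp) & (delta >= delta_cp))[0]
    outliers  = np.where((dom <  rho_cp) & (delta >= delta_cp))[0]
    remaining = np.where(delta < delta_cp)[0]                      # everything else
    return centers, outliers, remaining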

4.2.2 Remaining Data Point Assignment

After the cluster centers are detected, each of the remaining data points is assigned to the cluster to which its nearest higher-domain-density neighbor belongs. For each remaining data point x_i, the neighbor x_j with a higher domain density and the shortest distance d_ij is found. If x_j has already been assigned to a cluster c_a, then x_i is also assigned to c_a; otherwise, the cluster of x_j is determined iteratively in the same way. This step is repeated until all of the remaining data points are assigned to the related clusters. An example of the remaining data point assignment is shown in Fig. 9. The process of initial cluster self-identification of the DADC algorithm is presented in Algorithm 2.

Fig. 9: Example of the remaining data points assignment.
Input: X: the raw dataset for clustering; ρ^dom: the domain densities of the data points of X; δ: the Delta distances of the data points of X.
Output: C: the initial clusters of X.
1:  get the maximum domain density ρ_max ← max(ρ^dom);
2:  get the maximum Delta distance δ_max ← max(δ);
3:  calculate the critical point CP ← (ρ_max/2, δ_max/4);
4:  for each x_i in X do
5:     if ρ^dom_i ≥ ρ_max/2 and δ_i ≥ δ_max/4 then
6:        append x_i to the set of cluster centers CC;
7:     else if ρ^dom_i < ρ_max/2 and δ_i ≥ δ_max/4 then
8:        append x_i to the set of outliers O;
9:     else
10:       append x_i to the set of remaining data points R;
11: for each x_i in R do
12:    append x_i to the cluster of its nearest higher-density neighbor;
13: return C.
Algorithm 2 Cluster center self-identification of the DADC algorithm.

Algorithm 2 consists of two steps: cluster center identification and remaining data point assignment. Assume that the number of data points in X is n, the number of cluster centers is m_c, and that of the remaining data points is n_r. In general, the number of cluster centers and outliers is far less than that of the remaining points, so n_r is close to n. Therefore, the computational complexity of Algorithm 2 is O(n).
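The assignment step can be sketched as follows; it assumes the global domain-density peak is among the detected centers (as stated in Section 3.3), so every chain of nearest higher-density neighbors terminates at a labeled center, and it omits outlier removal for brevity.

import numpy as np

def assign_remaining(D, dom, center_idx):
    """Label every point with the cluster of its nearest neighbor of higher domain density.

    D: n x n distance matrix; dom: domain densities; center_idx: indices of cluster centers.
    """
    n = len(dom)
    labels = np.full(n, -1)
    labels[np.asarray(center_idx)] = np.arange(len(center_idx))  # one cluster id per center
    nn_higher = np.full(n, -1)
    for i in range(n):
        higher = np.where(dom > dom[i])[0]
        if higher.size:
            nn_higher[i] = higher[np.argmin(D[i, higher])]        # nearest higher-density neighbor

    def resolve(i):
        if labels[i] == -1:                                       # follow the chain until a center
            labels[i] = resolve(nn_higher[i])
        return labels[i]

    return np.array([resolve(i) for i in range(n)])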

With the cluster self-identification method of DADC, we can obtain cluster centers and initial clusters quickly and simply. Although the number of detected cluster centers might exceed the real number and produce many fragmented clusters, this does not cause multiple natural clusters to be wrongly merged into one cluster. Focusing on the scenario of fragmented clusters, we introduce a cluster self-ensemble method to merge the preliminary clustering results.

4.3 Fragmented Cluster Self-ensemble

To address the limitation of cluster fragmentation of DADC on MDDM datasets, a fragmented cluster self-ensemble method is proposed in this section. The basic principle of clustering analysis is that individuals in the same cluster have high similarity with each other, while differing from individuals in other clusters. Therefore, to find out which clusters are misclassified into multiple sub-clusters, we propose an inter-cluster similarity measurement and a cluster fusion degree model for fragmented cluster self-ensemble. Clusters with a superior density similarity and cluster fusion degree are merged into the same cluster.

Definition 4.3 Inter-cluster Density Similarity (IDS). The inter-cluster density similarity between two clusters refers to the degree of similarity of their cluster densities. The average density of a cluster is the average value of the domain densities of all data points in the cluster.

Let s(c_i, c_j) be the inter-cluster density similarity between clusters c_i and c_j. The larger the value of s(c_i, c_j), the more similar the densities of the two clusters. s(c_i, c_j) is defined as:

(9)

where \bar{ρ}_i and \bar{ρ}_j are the average domain densities of clusters c_i and c_j. s(c_i, c_j) can be shown to be strictly monotonically increasing, and the closer its value is to 1, the more similar the two clusters c_i and c_j are.

In addition, the distance between every two clusters is considered. In the relevant studies, different methods were introduced to calculate the distance between two clusters, such as the distance between the center points, the nearest points, or the farthest points of the two clusters [5]. However, these measures are easily affected by noise or outliers, while noisy data elimination is another challenge. We propose an innovative method to measure the distance between clusters. Crossing points between every two clusters are found and the crossover degree of the clusters is calculated.

For each boundary point x_i in cluster c_a, let KNN(x_i) be the K nearest neighbors of x_i. We denote by KNN_a(x_i) the set of neighbors of x_i that belong to cluster c_a, and by KNN_b(x_i) the set of neighbors that belong to c_b. If the number of neighbors belonging to c_b is close to the number belonging to the current cluster c_a, then x_i is defined as a crossing point of c_a with respect to c_b. The crossover degree CD_{ab}(x_i) of a crossing point x_i in c_a between clusters c_a and c_b is defined as:

(10)

Based on the crossover degrees of all crossing points of each cluster, we can define the crossover degree between every two clusters.

Definition 4.4 Cluster Crossover Degree (CCD). The cluster crossover degree of two clusters c_a and c_b is calculated as the sum of the crossover degrees of all crossing points between c_a and c_b:

CCD(c_a, c_b) = \sum_{x_i \in CP_{ab}} CD_{ab}(x_i), \quad (11)

where CP_{ab} denotes the set of crossing points between c_a and c_b.

To measure whether the data points in a cluster have similar domain densities, we give a definition of the cluster density stability. By analyzing the internal density stability of the clusters to be merged and that of the merged cluster, we can determine whether the merger is conducive to the stability of these clusters. The internal density stability of clusters is an important indicator of cluster quality.

Definition 4.5 Cluster Density Stability (CDS). Cluster density stability is the reciprocal of the cluster density variance, which is calculated from the deviations between the domain density of each point and the average domain density of the cluster. The larger the CDS of a cluster, the smaller the domain-density differences among the points in the cluster. The CDS of a cluster c_a is defined as:

CDS(c_a) = \left[ \frac{1}{|c_a|} \sum_{x_i \in c_a} \left( \rho^{dom}_i - \bar{\rho}_a \right)^2 \right]^{-1}, \quad (12)

where \bar{ρ}_a is the average value of the domain densities of the data points in c_a, and |c_a| is the number of data points in c_a. A cluster with a high CDS means that the data points in the cluster have low domain-density differences; namely, most data points in the same cluster have similar domain densities.
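A small sketch of this measure, under the reciprocal-of-variance reading of Eq. (12), is given below.

import numpy as np

def cluster_density_stability(dom_densities):
    """CDS of one cluster from its members' domain densities (degenerate if the variance is 0)."""
    rho = np.asarray(dom_densities, dtype=float)
    return 1.0 / np.mean((rho - rho.mean()) ** 2)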

For two clusters c_a and c_b with a high inter-cluster density similarity and a high crossover degree, we further calculate their CDS. Assume that CDS_a and CDS_b are the CDSs of c_a and c_b, and CDS_{a∪b} is that of the new cluster merged from c_a and c_b. The CDS between c_a and c_b is calculated in Eq. (13):

(13)

If the CDS of the merged cluster is close to the average value of CDS_a and CDS_b, the merger of the two clusters does not reduce their overall density stability.

Based on the above indicators of clusters, including inter-cluster density similarity, cluster crossover degree, and cluster density stability, the definition of cluster fusion degree is proposed.

Definition 4.6 Cluster Fusion Degree (CFD). Cluster fusion degree of two clusters is the degree of the correlation between the clusters in terms of the location and density distribution, which is calculated depending upon the values of IDS, CCD, and CDS. Two clusters with a high degree of fusion should satisfy the following conditions: (1) having a high value of IDS, (2) having a high value of CCD, and (3) the CDS of the merged cluster should be close to the average value of the two initial clusters’ CDSs. If two adjacent and crossed clusters hold a high IDS and similar CDS, they have a high fusion degree.

Based on Definition 4.6, the fusion degree between two clusters is expressed as a triangle in an equilateral triangle framework, as shown in Fig. 10. The vertices of the triangle represent IDS, CCD, and CDS, respectively, and the value of each indicator is represented by the segment from the triangle's center point to the corresponding vertex. The value of CFD(c_a, c_b) between clusters c_a and c_b is then obtained by calculating the area of the triangle formed by the IDS, CCD, and CDS values:

CFD(c_a, c_b) = \frac{\sqrt{3}}{4} \left( IDS \cdot CCD + CCD \cdot CDS + CDS \cdot IDS \right). \quad (14)

If the value of CFD(c_a, c_b) exceeds a given threshold, clusters c_a and c_b are merged into a single cluster. In Fig. 10, there are three triangles with different edge colors, representing the corresponding fusion degrees of three cluster pairs. Fusion degrees between the merged cluster and the other clusters continue to be evaluated, and the process is repeated until the CFDs of all clusters are below the threshold. An example of the cluster fusion degree between three clusters is given in Fig. 10.

Fig. 10: Cluster fusion degree measurement.

An example of cluster ensemble is shown in Fig. 11. The detailed steps of the cluster self-ensemble process of DADC are presented in Algorithm 3.

(a) Fragmented sub-clusters
(b) Ensembled clusters
Fig. 11: Example of fragmented cluster ensemble.
Input: C: the initial clusters of X; θ: the threshold value of the cluster fusion degree for cluster self-ensemble.
Output: C': the merged clusters of dataset X.
1:  while C is not empty do
2:     get the first cluster c_a from C;
3:     for each remaining cluster c_b in C do
4:        calculate the inter-cluster density similarity s(c_a, c_b);
5:        calculate the crossing points between c_a and c_b;
6:        calculate the cluster crossover degree CCD(c_a, c_b);
7:        calculate the cluster density stabilities CDS_a, CDS_b, and CDS_{a∪b};
8:        calculate the cluster density stability CDS(c_a, c_b);
9:        calculate the cluster fusion degree CFD(c_a, c_b);
10:       if CFD(c_a, c_b) ≥ θ then
11:          merge the clusters: c_a ← merge(c_a, c_b);
12:          remove c_b from C;
13:    if c_a was not merged with any cluster then
14:       append c_a to the merged clusters C';
15:       remove c_a from C;
16: return C'.
Algorithm 3 Cluster self-ensemble of DADC.

In Algorithm 3, for each initial cluster c_a in C, we respectively calculate the cluster crossover degree, cluster density stability, and cluster fusion degree between c_a and each residual cluster c_b. Then, in each iteration, we try to merge the two clusters with the highest cluster fusion degree. Assuming that the number of initial clusters is m, the computational complexity of Algorithm 3 is O(m^2).

The DADC algorithm consists of Algorithms 1, 2, and 3, which require computational complexities of O(n^2), O(n), and O(m^2), respectively. Thus, the overall computational complexity of the DADC algorithm is O(n^2), where n is the number of points in the dataset and m is the number of initial clusters.
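To make the ensemble loop concrete, the following structural sketch follows Algorithm 3 with the fusion degree abstracted behind a callable; the triangle-area helper is only one plausible reading of Eq. (14) and can be replaced by the exact formula.

import math

def triangle_area_cfd(ids, ccd, cds):
    """Area of the triangle spanned by three radial indicator values 120 degrees apart."""
    return (math.sqrt(3) / 4.0) * (ids * ccd + ccd * cds + cds * ids)

def self_ensemble(clusters, cfd, threshold):
    """clusters: list of point-index lists; cfd(a, b) -> fusion degree; returns merged clusters."""
    pending, merged = list(clusters), []
    while pending:
        current = pending.pop(0)                 # take the first cluster (Algorithm 3, line 2)
        fused = False
        for other in pending[:]:                 # compare with every remaining cluster
            if cfd(current, other) >= threshold:
                current = current + other        # merge the two clusters
                pending.remove(other)
                fused = True
        if fused:
            pending.append(current)              # re-evaluate the merged cluster later
        else:
            merged.append(current)               # no merge possible: finalize this cluster
    return merged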

5 Experiments

5.1 Experiment Setup

Experiments are conducted to evaluate the proposed DADC algorithm by comparing it with the CFSFDP [7], OPTICS [24], and DBSCAN [25] algorithms in terms of clustering result analysis and performance evaluation. The experiments are performed on a workstation equipped with an Intel Core i5-6400 quad-core CPU, 8 GB of DRAM, and 2 TB of disk storage. Two groups of datasets, i.e., synthetic and large-scale real-world datasets, are used in the experiments. These datasets are downloaded from published online benchmarks, such as the clustering benchmark datasets [26] and the UCI Machine Learning Repository [27], as shown in Tables I and II. An implementation of DADC is available from GitHub at https://github.com/JianguoChen2015/DADC.

Datasets #.Samples #.Dimensions #.Clusters
Aggregation 350 2 7
Compound 399 2 6
Heartshapes 788 2 3
Yeast 1484 2 10
Gaussian clusters (G2) 2048 2 2
TABLE I: Synthetic datasets used in experiments.
Datasets #.Samples #.Dimensions #.Clusters
Individual household electric power consumption (IHEPC) 275,259 9 196
Flixster (ASU) 523,386 2 153
Heterogeneity activity recognition (HAR) 930,257 16 289
Twitter (ASU) 316,811 2 194
TABLE II: Large-scale datasets used in experiments.

5.2 Clustering Results Analysis on Synthetic Datasets

To clearly and vividly illustrate the clustering results of DADC, multiple experiments are conducted on synthetic datasets in this section, comparing the related clustering algorithms, including DADC, CFSFDP [7], DBSCAN [25], and OPTICS [24]. Synthetic datasets with the features of VDD, MDDM, and ED are used in the experiments.

5.2.1 Clustering Results on VDD Datasets

To illustrate the effectiveness of the proposed domain-adaptive density measurement in DADC, we conduct experiments on VDD datasets. Fig. 3 (a) shows a synthetic dataset (Heartshapes) described in Table I, which is composed of three heart-shaped regions with different densities. Each region contains 71 data points.

(a) Data points
(b) Local/domain density
(c) Decision graph of CFSFDP
(d) Decision graph of DADC
Fig. 12: Decision graphs for VDD dataset.

The local density and the domain density of each data point are calculated by the original measurement of CFSFDP and the domain-adaptive measurement of DADC, respectively, as shown in Fig. 12 (b). It is evident from Fig. 12 (b) that, according to CFSFDP, the local densities of data points in the second region (no. 72-142) are far higher than those of the other two relatively sparse regions. In this case, it is difficult to detect the local-density peaks in the sparse regions. In contrast, according to DADC, although the density distributions of the three regions are different, the domain-density peaks of each region are clearly identified. As shown in Fig. 3 (b), there is only one decision point in the CFSFDP clustering decision graph that is detected as a cluster center, while more than 140 points have low values of local density and high values of Delta distance and are detected as outliers. In contrast, in the DADC clustering decision graph of Fig. 12 (d), there are three decision points with high values of both domain-adaptive density and Delta distance, which are identified as the cluster centers of the three regions. The clustering results show that the proposed DADC algorithm can effectively detect the domain-density peaks of data points and identify clusters in regions of different density.

Two groups of VDD datasets (Aggregation and Compound) are used in the experiments to further evaluate the clustering effectiveness of DADC in comparison with the CFSFDP algorithm. The local density of each data point is obtained by the CFSFDP algorithm, while the KNN-density, domain density, and domain-adaptive density of each data point are calculated by the proposed DADC algorithm. The comparison results are illustrated in Fig. 13.

(a) Data points of Aggregation
(b) Local/domain densities on Aggregation
(c) Data points of Compound
(d) Local/domain densities on Compound
Fig. 13: Adaptive domain-densities on VDD datasets.

As shown in Fig. 13 (b) and Fig. 13 (d), the local densities of data points in dense regions are obviously higher than those of data points in sparse regions, so data points in sparse regions are easily treated as noisy data rather than as independent clusters. In contrast, with the method of DADC, the domain-adaptive densities of all data points are detected with obvious differences. Although the datasets have multiple regions with different densities, the domain-density peaks in each region are quickly identified. More comparison results on VDD datasets are described in the supplementary material.

5.2.2 Clustering Results on ED Datasets

To evaluate the effect of the cluster self-identification method of the proposed DADC algorithm, experiments are performed on an ED dataset (Hexagon) by comparing to CFSFDP, OPTICS, and DBSCAN algorithms, respectively. The clustering results are shown in Fig. 14. More comparison results on ED datasets are described in supplementary material.

(a) DADC on Hexagon
(b) CFSFDP on Hexagon
(c) OPTICS on Hexagon
(d) DBSCAN on Hexagon
Fig. 14: Clustering results on equilibrium distributed datasets.

Since the ED dataset does not have local-density peaks, CFSFDP obtains numerous fragmented clusters. Making use of the cluster self-identification and cluster self-ensemble process, the proposed DADC algorithm merges the fragmented clusters effectively and obtains accurate clustering results. As shown in Fig. 14 (a), the dataset is clustered into two clusters by DADC. In contrast, as shown in Fig. 14 (b) - (d), more than 231 fragmented clusters are produced by CFSFDP. The clustering results of OPTICS and DBSCAN are very sensitive to the parameter thresholds of eps (connectivity radius) and minpts (minimum number of shared neighbors). For example, when the parameter thresholds of eps and minpts are set to 10 and 5 for both OPTICS and DBSCAN, 14 and 11 fragmented clusters are generated by OPTICS and DBSCAN, respectively. The experimental results show that the proposed DADC algorithm is more accurate than the other algorithms for ED dataset clustering.

5.2.3 Clustering Results on MDDM Datasets

For datasets with MDDM characteristics, multiple domain-density maximums might lead to fragmented clusters whose overall distributions are similar to those of adjacent clusters. We conduct comparison experiments on a dataset with MDDM characteristics to evaluate the clustering effect of the comparative clustering algorithms. A synthesized dataset (G2) with MDDM characteristics is used in the experiments. The clustering results are shown in Fig. 15. More comparison results on MDDM datasets are described in the supplementary material.

(a) DADC on G2
(b) CFSFDP on G2
(c) OPTICS on G2
(d) DBSCAN on G2
Fig. 15: Clustering results on MDDM dataset.

After obtaining 17 density peaks using CFSFDP, 17 corresponding clusters are generated, as shown in Fig. 15 (b). However, these clusters have a similar overall density distribution, and it is reasonable to merge them into a single cluster. DADC eventually merges the 17 fragmented clusters into one cluster, as shown in Fig. 15 (a). Again, the clustering results of OPTICS and DBSCAN are very sensitive to the parameter thresholds of eps and minpts. As shown in Fig. 15 (c) and (d), when the parameter thresholds of eps and minpts of the OPTICS and DBSCAN algorithms are set to 13 and 10, 22 and 11 fragmented clusters are produced by OPTICS and DBSCAN, respectively. Compared with CFSFDP, OPTICS, and DBSCAN, the experimental results show that DADC achieves more reasonable clustering results on MDDM datasets.

5.3 Performance Evaluation

5.3.1 Clustering Accuracy Analysis

Clustering Accuracy (CA) is introduced to evaluate the clustering algorithms. CA measures the ratio of correctly classified/clustered instances to the pre-defined class labels. Let X be the dataset in this experiment, C be the set of classes/clusters detected by the corresponding algorithm, and L be the set of pre-defined class labels. CA is defined in Eq. (15):

CA = \frac{1}{|X|} \sum_{i=1}^{|C|} |L_i^{max}|, \quad (15)

where C_i is the set of data points in the i-th class/cluster, L_i is the set of pre-defined class labels of the data points in C_i, |X| is the number of data points in X, and |L_i^{max}| is the number of data points that have the majority label in C_i. A greater value of CA means a higher accuracy of the classification/clustering algorithm, with each cluster achieving high purity. The experimental results of the clustering accuracy comparison are given in Table III.

Datasets DADC CFSFDP OPTICS DBSCAN
Heartshapes 100.00% 83.42% 91.33% 91.33%
Yeast 91.67% 83.23% 82.54% 80.41%
G2 100.00% 90.45% 84.23% 82.85%
IHEPC 92.34% 87.72% 73.98% 62.03%
Flixster 87.67% 79.09% 65.31% 55.51%
Twitter 72.26% 68.85% 53.90% 51.42%
HAR 83.29% 84.23% 58.26% 56.92%
TABLE III: Clustering accuracy comparison.

As shown in Table III, DADC outperforms the other algorithms on both synthetic and large-scale real-world datasets. In the case of Flixster, the average CA of DADC is 87.67%, while that of CFSFDP is 79.09%, that of OPTICS is 65.31%, and that of DBSCAN is 55.51%. For the synthetic datasets, DADC achieves a high average CA of 97.22%. The average accuracies of the OPTICS and DBSCAN algorithms are noticeably lower than those of CFSFDP and DADC. For the large-scale real-world datasets, the CA of DADC is higher than that of the compared algorithms, ranging from 72.26% to 92.34%. This illustrates that DADC achieves higher clustering accuracy than the CFSFDP, OPTICS, and DBSCAN algorithms.
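For reference, the CA measure of Eq. (15) can be computed as below, under our reading that each detected cluster contributes its majority-label count; the names are illustrative.

from collections import Counter

def clustering_accuracy(pred_clusters, true_labels):
    """CA/purity: fraction of points carrying the majority label of their assigned cluster."""
    groups = {}
    for c, y in zip(pred_clusters, true_labels):
        groups.setdefault(c, []).append(y)
    majority = sum(Counter(ys).most_common(1)[0][1] for ys in groups.values())
    return majority / len(true_labels)

# e.g. clustering_accuracy([0, 0, 1, 1, 1], ['a', 'a', 'a', 'b', 'b']) == 0.8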

5.3.2 Robustness Analysis

Experiments are conducted to evaluate the robustness of the compared algorithms on noisy datasets. Four groups of real-world datasets from practical applications described in Table II are used in the experiments with different degrees of noise. We generate different amounts of random and non-repetitive data points as noise in the value space of the original dataset. The noise-level of each dataset gradually increases from 1.0% to 15.0%. The experimental results are presented in Fig. 16.

Fig. 16: Comparison of algorithm robustness.

As observed in Fig. 16, with an increasing proportion of noise, the average clustering accuracy of each algorithm decreases. However, the average clustering accuracy of DADC drops at the lowest rate, those of CFSFDP and OPTICS rank second and third, and that of DBSCAN declines the fastest. When the noise level rises from 1.0% to 15.0%, the average accuracy of DADC decreases from 96.14% to 80.39%, which indicates that DADC is the most robust to data with different noise levels. The average accuracy of CFSFDP drops from 94.21% to 53.18%, and that of DBSCAN decreases from 78.52% to 31.74%. Compared with the other algorithms, DADC retains higher accuracy in each case. Therefore, DADC shows higher robustness to noisy data than the compared algorithms.
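The noise-injection step used in this experiment can be sketched as follows, under our reading that noise points are drawn uniformly at random within each feature's value range and appended to the original dataset.

import numpy as np

def add_uniform_noise(X, noise_level, seed=None):
    """Append round(noise_level * len(X)) random points drawn from X's bounding box."""
    rng = np.random.default_rng(seed)
    n_noise = int(round(noise_level * len(X)))
    lo, hi = X.min(axis=0), X.max(axis=0)
    noise = rng.uniform(lo, hi, size=(n_noise, X.shape[1]))
    return np.vstack([X, noise])

# e.g. X_noisy = add_uniform_noise(X, 0.05)  # add 5% noise points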

6 Conclusions

This paper presented a domain-adaptive density clustering (DADC) algorithm, which is effective on datasets with varying-density distribution (VDD), multiple domain-density maximums (MDDM), or equilibrium distribution (ED). A domain-adaptive method was proposed to calculate the domain densities and detect the density peaks of data points in VDD datasets, from which cluster centers are identified. In addition, a cluster fusion degree model and a CFD-based cluster self-ensemble method were proposed to merge fragmented clusters with minimum artificial intervention on MDDM and ED datasets. In comparison with existing clustering algorithms, the proposed DADC algorithm requires fewer parameters, is non-iterative, and achieves outstanding advantages in terms of accuracy and robustness.

As future work, we will further research issues of big data clustering analysis, including incremental clustering, time-series data clustering, and parallel clustering in distributed and parallel computing environments.

References

  • [1] X. Liu, C. Wan, and L. Chen, “Returning clustered results for keyword search on xml documents,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 12, pp. 1811–1825, 2011.
  • [2] Y. Wang, X. Lin, L. Wu, W. Zhang, and Q. Zhang, “Exploiting correlation consensus: Towards subspace clustering for multi-modal data,” in MM’14, 2014, pp. 981–984.
  • [3] J. Shao, X. He, C. Bohm, Q. Yang, and C. Plant, “Synchronization-inspired partitioning and hierarchical clustering,” IEEE Trans. Knowl. Data Eng., vol. 25, no. 4, pp. 893–905, 2013.
  • [4] B. Jiang, J. Pei, Y. Tao, and X. Lin, “Clustering uncertain data based on probability distribution similarity,” IEEE Trans. Knowl. Data Eng., vol. 25, no. 4, pp. 751–763, 2013.
  • [5] L. Zheng, T. Li, and C. Ding, “A framework for hierarchical ensemble clustering,” ACM Trans. Knowl. Discov. Data, vol. 9, no. 2, pp. 9:1–23, 2014.
  • [6] J. Gan and Y. Tao, “Dynamic density based clustering,” in SIGMOD’17, 2017, pp. 1493–1507.
  • [7] A. Rodriguez and A. Laio, “Clustering by fast search and find of density peaks,” Science, vol. 344, no. 6191, pp. 1492–1496, 2014.
  • [8] M. Du, S. Ding, and H. Jia, “Study on density peaks clustering based on k-nearest neighbors and principal component analysis,” Knowledge-Based Systems, vol. 99, pp. 135–145, 2016.
  • [9] H. Liu, M. Shao, S. Li, and Y. Fu, “Infinite ensemble for image clustering,” in KDD’16, 2016, pp. 1745–1754.
  • [10] W. Jiang, J. Wu, F. Li, G. Wang, and H. Zheng, “Trust evaluation in online social networks using generalized network flow,” IEEE Transactions on Computers, vol. 65, no. 3, pp. 952–963, 2015.
  • [11] A. Fahad, N. Alshatri, Z. Tari, A. Alamri, I. Khalil, A. Y. Zomaya, S. Foufou, and A. Bouras, “A survey of clustering algorithms for big data: Taxonomy and empirical analysis,” IEEE Trans. Emerg. Top. Comp., vol. 2, no. 3, pp. 267–279, 2014.
  • [12] W. Jiang, G. Wang, M. Z. A. Bhuiyan, and J. Wu, “Understanding graph-based trust evaluation in online social networks: Methodologies and challenges,” ACM Computing Surveys, vol. 49, no. 1, p. 10, 2016.
  • [13] G. Li, S. Chen, J. Feng, K. lee Tan, and W.-S. Li, “Efficient location-aware influence maximization,” in SIGMOD’14, 2014, pp. 87–98.
  • [14] J. Xie, H. Gao, W. Xie, X. Liu, and P. W. Grant, “Robust clustering by detecting density peaks and assigning points based on fuzzy weighted k-nearest neighbors,” Information Sciences, vol. 354, pp. 19–40, 2016.
  • [15] Y. Zheng, Q. Guo, A. K. Tung, and S. Wu, “Lazylsh: Approximate nearest neighbor search for multiple distance functions with a single index,” in SIGMOD’16, 2016, pp. 2023–2037.
  • [16] G. Wang and Q. Song, “Automatic clustering via outward statistical testing on density metrics,” IEEE Trans. Knowl. Data Eng., vol. 28, no. 8, pp. 1971–1985, 2016.
  • [17] Z. Yu, Z. Kuang, J. Liu, H. Chen, J. Zhang, J. You, H.-S. Wong, and G. Han, “Adaptive ensembling of semi-supervised clustering solutions,” IEEE Trans. Knowl. Data Eng., vol. 29, no. 8, pp. 1577–1590, 2017.
  • [18] S. Ruan, R. Mehmood, A. Daud, H. Dawood, and J. S. Alowibdi, “An adaptive method for clustering by fast search-and-find of density peaks: Adaptive-dp,” in WWW’17, 2017, pp. 119–127.
  • [19] D. Ma and A. Zhang, “An adaptive density-based clustering algorithm for spatial database with noise,” in ICDM’04, 2004, pp. 467–470.
  • [20] P. Wang, C. Domeniconi, and K. B. Laskey, “Nonparametric bayesian clustering ensembles,” in SDM’10, 2010, pp. 331–342.
  • [21] Z. Yu, P. Luo, S. Wu, G. Han, J. You, H. Leung, H. Wong, and J. Zhang, “Incremental semi-supervised clustering ensemble for high dimensional data clustering,” in ICDE’16, 2016, pp. 1484–1485.
  • [22] S. Yang, M. A. Cheema, X. Lin, Y. Zhang, and W. Zhang, “Reverse k nearest neighbors queries and spatial reverse top-k queries,” The VLDB Journal, vol. 26, no. 2, pp. 151–176, 2017.
  • [23] D. Jiang, G. Chen, B. C. Ooi, K.-L. Tan, and S. Wu, “Epic: An extensibleand scalable system for processing big data,” in VLDB’14, 2014, pp. 541–552.
  • [24] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander, “Optics: ordering points to identify the clustering structure,” in SIGMOD’99, 1999, pp. 49–60.
  • [25] M. Ester, H. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in KDD’96, 1996, pp. 226–231.
  • [26] P. Fränti et al., “Clustering datasets,” 2017. [Online]. Available: http://cs.uef.fi/sipu/datasets/
  • [27] University of California, Irvine, “UCI Machine Learning Repository,” 2017. [Online]. Available: http://archive.ics.uci.edu/ml/datasets