We are living in the era of data explosion. Due to the increase in data size and types of data, traditional manual data analyses have become an impossible task without powerful tools ZHOU2003139 . In order to meet the demands, data mining, also referred to as knowledge discovery from data, has emerged to automate the process of discovering insightful, interesting, and novel patterns han2011data .
Clustering is an important and useful tool in data mining and knowledge discovery. It has been widely used for partitioning instances in a dataset such that similar instances are grouped together to form a cluster han2011data . It is the most common unsupervised knowledge discovery technique for automatic data-labelling in various areas, e.g., information retrieval, image segmentation, and pattern recognition kaufman2009finding . Depending on the basis of categorisation, clustering methods can be divided into several kinds, e.g., partitioning clustering versus hierarchical clustering; and density-based clustering versus representative-based clustering zaki2014dataminingbook .
Partitioning clustering methods are the simplest and most fundamental clustering methods han2011data . They are relatively fast, and easy to understand and implement. They organise data points in a given dataset into non-overlapping partitions, where each partition represents a cluster; and each point belongs to one cluster only han2011data . However, traditional distance-based partitioning methods, such as k-means hartigan1979algorithm and k-medoids kaufman1987clustering , which are representative-based clustering methods, usually cannot find clusters with arbitrary shapes aggarwal2013data . In contrast, density-based clustering algorithms can find clusters with arbitrary sizes and shapes while effectively separating noise. Thus, this kind of clustering is attracting more research and development.
DBSCAN ester1996density and DENCLUE hinneburg2007denclue are examples of an important class of density-based clustering algorithms. They define clusters as regions of high densities which are separated by regions of low densities. However, these algorithms have difficulty finding clusters with widely varied densities because a global density threshold is used to identify high-density regions ZHU2016983 .
Rodriguez et al. proposed a clustering algorithm based on density peaks (DP) rodriguez2014clustering . It identifies cluster modes (the original DP paper regards detected cluster modes as "cluster centres" rodriguez2014clustering ) which have local maximum density and are well separated, and then assigns each remaining point in the dataset to a cluster mode via a linking scheme. Compared with the classic density-based clustering algorithms (e.g., DBSCAN and DENCLUE), DP has a better capability in detecting clusters with varied densities. Despite this improved capability, Chen et al. Chen2018 have recently identified a condition under which DP fails to detect all clusters with varied densities. They proposed a new measure called Local Contrast (LC) (instead of density) to enhance DP such that the resultant algorithm LC-DP is more robust against clusters with varied densities.
It is important to note that the progression from DBSCAN or DENCLUE to DP, and subsequently LC-DP, with improved clustering performance, is achieved without formally defining the types of clusters DP and LC-DP can detect.
In this paper, we are motivated to formally define the type of clusters that an algorithm is designed to detect before investigating the weaknesses of the algorithm. This approach enables us to determine two additional weaknesses of DP; and we show that the use of LC does not overcome these weaknesses.
This paper makes the following contributions:
Formalising a new type of clusters called η-linked clusters; and providing a necessary condition for a clustering algorithm to correctly detect all η-linked clusters in a dataset.
Uncovering that DP is a clustering algorithm which is designed to detect η-linked clusters; and identifying two weaknesses of DP, i.e., the conditions under which DP cannot correctly detect all clusters in a dataset.
Introducing a different view of DP as a hierarchical procedure. Instead of producing flat clusters, this procedure generates a dendrogram, enabling a user to identify clusters in a hierarchical way.
Formalising a second new type of clusters called η-density-connected clusters, which encompass all η-linked clusters as well as the kind of non-η-linked clusters that DP fails to detect.
Proposing a density-connected hierarchical DP to overcome the identified weaknesses of DP. The new algorithm DC-HDP merges two cluster modes only if they are density-connected at a high level in the hierarchical structure.
Completing an empirical evaluation by comparing with 7 state-of-the-art clustering algorithms on 14 datasets: 3 density-based clustering algorithms (DBSCAN ester1996density , DP rodriguez2014clustering and LC-DP Chen2018 ), 2 hierarchical clustering algorithms (HML 7440832 and PHA LU20131227 ) and 2 density-based hierarchical clustering algorithms (HDBSCAN Campello:2015:HDE and OPTICS ankerst1999optics ).
The formal analysis of DP provides an insight into the key weaknesses of DP. This has enabled a simple and effective method (DC-HDP) to overcome the weaknesses. The proposed method takes advantage of the individual strengths of DBSCAN and DP, i.e., DC-HDP has an enhanced ability to identify all clusters of arbitrary shapes and varied densities, which neither DBSCAN nor DP has. In addition, the dendrogram generated by DC-HDP gives richer information about the hierarchical components of clusters in a dataset than the flat partitioning provided by DBSCAN and DP. This is achieved with the same computational time complexity as DP, with only one additional parameter, which can usually be set to a default value in practice.
Since hierarchical clustering algorithms allow a user to choose a particular clustering granularity, hierarchical clustering is very popular and has been used far more than non-hierarchical clustering GILPIN201795 . Thus, DC-HDP provides a new perspective which can be widely used in various applications.
The rest of the paper is organised as follows: we provide an overview of density-based clustering algorithms and related work in Section 2. Section 3 formalises the η-linked clusters. Section 4 uncovers that DP is an algorithm which detects η-linked clusters; and reveals two weaknesses of DP. Section 5 reiterates the definition of density-connected clusters used by DBSCAN, and states the known weakness of DBSCAN. Section 6 presents the definition of the second new type of clusters called η-density-connected clusters. The new density-connected hierarchical clustering algorithm is proposed in Section 7. In Section 8, we empirically evaluate the performance of the proposed algorithm by comparing it with 7 other state-of-the-art clustering algorithms. Discussion and the conclusions are provided in the last two sections.
2 Related work
Density-based clustering algorithms define clusters as dense regions separated by sparse regions. They first identify points in dense regions using a density estimator and then link neighbouring points in dense regions to form clusters. They can identify arbitrarily shaped clusters in noisy datasets. DBSCAN defines the density of a point as the number of points from the dataset that lie in its ε-neighbourhood. A "core" point is a point having density higher than a threshold MinPts. DBSCAN visits every core point and links all core points in its ε-neighbourhood together, until all core points are visited. Then, points which are directly or indirectly linked are grouped into the same cluster. Finally, non-core points that are in the ε-neighbourhood of other core points, called boundary points, are linked to the nearest clusters. If a point is neither a core point nor a boundary point, then it is considered to be "noise". DENCLUE uses a Gaussian kernel estimator to estimate the density of every point and applies a hill-climbing procedure to link neighbouring points with high densities. Although DBSCAN and DENCLUE can detect clusters with varied sizes and shapes, they have difficulty finding clusters with widely varied densities because a global density threshold is used to identify points in high-density areas ZHU2016983 .
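The linking procedure described above can be sketched in a few lines. This is an illustrative implementation only, not DBSCAN's original code; `eps` and `min_pts` are our names for the neighbourhood radius and the core-point density threshold.

```python
import math
from collections import deque

def dbscan(points, eps, min_pts, dist=math.dist):
    """Minimal DBSCAN sketch: returns a label per point, -1 for noise."""
    n = len(points)
    # Density of a point = number of points in its eps-neighbourhood.
    neigh = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
             for i in range(n)]
    core = [len(neigh[i]) >= min_pts for i in range(n)]
    label = [None] * n
    cid = 0
    for i in range(n):
        if label[i] is not None or not core[i]:
            continue
        # Link all directly/transitively reachable core points via BFS.
        label[i] = cid
        queue = deque([i])
        while queue:
            p = queue.popleft()
            for q in neigh[p]:
                if label[q] is None:
                    label[q] = cid      # core or boundary point
                    if core[q]:         # only core points expand the cluster
                        queue.append(q)
        cid += 1
    return [l if l is not None else -1 for l in label]
```

Points that are never reached (neither core nor in a core point's neighbourhood) keep the label -1, matching the "noise" notion above.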
Many variants of DBSCAN have attempted to overcome the weakness of detecting clusters with varied densities. OPTICS ankerst1999optics draws a "reachability" plot. On the x-axis of the plot, adjacent points follow close to each other such that each point is the closest to its predecessor in terms of the "reachability distance" (the "reachability distance" of an object p to an object o is the greater of the "core distance" of o and the distance between p and o; the "core distance" of o is the minimum ε that makes o a "core" object, i.e., the distance to its MinPts-nearest neighbour). The reachability distance of each point is shown on the y-axis. Since a cluster centre normally has a higher density, or lower reachability distance, than the cluster boundaries, each cluster is visible as a "valley" in this plot. A hierarchical method can then be used to extract different clusters. The overall clustering performance depends on the hierarchical method employed on the reachability plot.
HDBSCAN Campello:2015:HDE is a hierarchical clustering based on DBSCAN. The idea is to produce many DBSCAN clustering outcomes through increasing density thresholds. (It first builds a Minimum Spanning Tree (MST) for all points, where the weight of each edge is the mutual reachability distance and the weight of each vertex is the core distance. It then removes edges from the MST progressively in decreasing order of weights, which is equivalent to obtaining many DBSCAN clustering outcomes with increasing density thresholds.) As the density threshold increases, a cluster may shrink, be divided into smaller clusters, or disappear. A dendrogram is built based on these clustering outcomes via a top-down method to yield the hierarchical cluster structure, e.g., the root is one cluster with all points, which is then split into different sub-clusters at the following levels. To produce a flat partitioning, it extracts a set of 'significant' clusters at different levels of the dendrogram via an optimisation process. HDBSCAN can detect clusters with varied densities because it employs different density thresholds. However, it has a bias towards low-density clusters. To separate overlapping high-density clusters, HDBSCAN needs to use a high density threshold, so that points in the boundary region are treated as noise. As a result, high-density clusters can lose many cluster members when a high density level is used.
Recently, the density peak (DP) clustering algorithm was proposed without using any density threshold rodriguez2014clustering . It assumes that cluster modes in the given dataset are sufficiently separated from each other. The clustering process has three steps. First, DP calculates the density ρ of each point using an ε-neighbourhood density estimator, and the distance δ between each point and its nearest neighbour with a higher density. Second, DP plots a decision graph for all points, with ρ on the x-axis and δ on the y-axis. The k points with the highest γ = ρ × δ (i.e., high density values and a relatively far nearest neighbour with a higher density) are selected as the cluster modes. Third, each remaining point is connected to its nearest neighbour of higher density, and the points connected or transitively connected to the same cluster mode are grouped into the same cluster. Noise can be detected by applying an additional step. (Once every point is assigned to a cluster, a border region is identified for each cluster: the set of points which are assigned to that cluster but located within the ε-neighbourhood of a point belonging to another cluster. Let ρ_b be the highest density of all points within the border region. Noise points are all points of the cluster which have density values less than ρ_b.) Since DP does not rely on any density threshold to extract clusters, it avoids the weakness of DBSCAN in detecting clusters with varied densities.
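The three steps can be sketched as follows. This is illustrative only: the raw neighbour count serves as the density, ties in density are broken by index order, and the name `dp_cluster` is ours.

```python
import math

def dp_cluster(points, eps, k):
    """Density-peak sketch: density via eps-neighbourhood counting, then
    link each point to its nearest neighbour of higher density."""
    n = len(points)
    d = lambda a, b: math.dist(a, b)
    rho = [sum(1 for j in range(n) if d(points[i], points[j]) <= eps)
           for i in range(n)]
    eta = [None] * n       # nearest higher-density neighbour
    delta = [0.0] * n      # distance to that neighbour
    order = sorted(range(n), key=lambda i: -rho[i])  # descending density
    for pos, i in enumerate(order):
        if pos == 0:       # global density peak: use the max distance
            delta[i] = max(d(points[i], points[j]) for j in range(n))
        else:
            j = min(order[:pos], key=lambda j: d(points[i], points[j]))
            eta[i], delta[i] = j, d(points[i], points[j])
    # Select the top-k points by gamma = rho * delta as cluster modes.
    gamma = [rho[i] * delta[i] for i in range(n)]
    modes = sorted(range(n), key=lambda i: -gamma[i])[:k]
    label = [None] * n
    for m in modes:
        label[m] = modes.index(m)
    for i in order:        # descending density, so eta[i] is labelled first
        if label[i] is None:
            label[i] = label[eta[i]]
    return label
```

Processing points in descending density guarantees that each point's nearest higher-density neighbour is already labelled when the point is assigned.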
Figure 1 illustrates the clustering results of DBSCAN, HDBSCAN and DP on a dataset having clusters with varied densities. Only DP can near-perfectly detect all clusters in this dataset. DBSCAN performed poorly: with a low density threshold it merged the two high-density clusters into one; with a high density threshold it assigned only a few points to the low-density cluster. HDBSCAN can separate all three clusters, but many boundary points of the high-density clusters are regarded as noise. This is because HDBSCAN has to use a high density threshold to separate the two dense clusters, and the low-density points of the dense clusters then become noise.
DP is one of the most promising clustering algorithms, with the ability to detect clusters with varied densities and arbitrary shapes. Without formally articulating the weaknesses of DP, some papers have proposed to improve DP. For example, a fuzzy weighted k-nearest-neighbour method is used to search for density modes more efficiently XIE201619 ; a linear fitting method is proposed to determine more meaningful cluster modes on the decision graph XU2016200 ; and an incremental method is designed to enable DP to cluster large dynamic data in the industrial Internet of Things 7882646 . Only Chen et al. Chen2018 have formally identified one necessary condition under which DP fails to detect all clusters of varied densities in a dataset, and proposed a new measure called Local Contrast (LC) to make DP more robust against clusters of varied densities.
However, to the best of our knowledge, there is no paper formalising the type of clusters that DP is designed to detect in a dataset. This knowledge is important to understand the weaknesses of DP and devise a way to overcome the weaknesses.
3 Defining clusters based on higher-density-nearest-neighbours: η-linked clusters
In this section, we formally define a type of clusters called η-linked clusters, where η(x) denotes x's nearest neighbour with higher density. The main symbols and notations used are listed in Table 1.
| x | a point in D |
| D | a d-dimensional dataset with n points |
| g | the point with the highest density in D |
| C, C_i, C_a, C_b | a cluster (a group of points) |
| c_i | the mode (point of the highest density) in cluster C_i |
| ρ(x) | estimated density of x via an ε-neighbourhood estimator |
| d(x, y) | the distance between two points x and y |
| ε | radius of the neighbourhood |
| V | volume of a d-dimensional ball of radius ε |
| η(x) | x's nearest neighbour which has a higher density than x |
| η_c(x) | x's nearest density-connected neighbour which has a higher density than x |
| path(x, y) | an η-linked path connecting x and y |
| DCpath(x, y) | an η-density-connected path connecting x and y |
| len(path(x, y)) | the distance between x and y along path(x, y) |
| len(DCpath(x, y)) | the distance between x and y along DCpath(x, y) |
Let D = {x_1, …, x_n} denote a dataset of n points, each sampled independently and identically from a distribution F. Let N_ε(x) be the ε-neighbourhood of x, N_ε(x) = {y ∈ D | d(x, y) ≤ ε}, where d is a distance function; and ε is a user-defined constant. The density of x is estimated as ρ(x) = |N_ε(x)| / (nV), where V is the volume of a d-dimensional sphere of radius ε. Note that since nV is a constant for every point, it can be omitted in practice. As a result, many density-based clustering algorithms use the number of points in the ε-neighbourhood as a proxy for the estimated density.
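As a minimal illustration of this proxy (the function name is ours):

```python
import math

def eps_density_proxy(points, eps):
    """Proxy density: the number of points in each eps-neighbourhood.
    Dividing by n * V (both constant across points) would give the
    actual density estimate, but does not change any density ranking."""
    return [sum(1 for q in points if math.dist(p, q) <= eps)
            for p in points]
```

Because only density *comparisons* matter in the definitions that follow, the raw count is interchangeable with the normalised estimate.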
Let g denote the point with the global maximum density, g = argmax_{x ∈ D} ρ(x). For each point x ∈ D \ {g}, let η(x) be x's nearest neighbour which has a higher density than x, i.e., η(x) = argmin_{y ∈ D : ρ(y) > ρ(x)} d(x, y).
Assume we have k cluster modes (which will be defined in Definition 3). Each of the remaining points in D is then assigned to the same cluster as its nearest neighbour of higher density; and the points which are path-linked or transitively path-linked to the same closest cluster mode, in terms of their path length, are grouped into the same cluster. Such a path is defined as follows:
An η-linked path connecting points x and y, path(x, y), is defined as the sequence of the smallest number of unique points p_1, …, p_m starting with p_1 = x and ending with p_m = y, where p_{i+1} = η(p_i) for 1 ≤ i < m.
The length of path(x, y) is defined as len(path(x, y)) = Σ_{i=1}^{m-1} d(p_i, p_{i+1}), where m is the number of points along the path from x to y.
Note that len(path(x, y)) = 0 if y = x; and len(path(x, y)) = ∞ if no η-linked path exists between x and y.
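Assuming the nearest higher-density neighbour of each point is stored as a parent pointer (here called `eta`, with `None` for the global peak g; the names are ours), the path length can be computed by following the pointers from x until y is reached:

```python
import math

def path_length(points, eta, x, y):
    """len(path(x, y)): follow eta pointers from x and sum the step
    distances; 0 if y == x, infinity if y is never reached.
    eta[p] is None for the global density peak g."""
    if x == y:
        return 0.0
    total, p = 0.0, x
    while eta[p] is not None:
        total += math.dist(points[p], points[eta[p]])
        p = eta[p]
        if p == y:
            return total
    return math.inf
```

The infinite case captures exactly the situation where y does not lie on x's chain of higher-density neighbours.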
An η-linked cluster C_i, which has only one mode c_i, is a maximal set of points having the shortest path length to its cluster mode with respect to other cluster modes, i.e., C_i = {x ∈ D | len(path(x, c_i)) < len(path(x, c_j)), ∀ j ≠ i}.
As a result of the above definitions, three lemmas are stated as follows:
Every point x in D has a path to g, i.e., ∀x ∈ D, len(path(x, g)) < ∞.
Given a dataset with k non-overlapping clusters, every cluster C_i is an η-linked cluster if the density distribution of every cluster is a monotonically decreasing function having its only mode at c_i.
Figure 1(a) illustrates an example of two clusters having monotonically decreasing density functions. After selecting Peak 1 and Peak 2 as cluster modes and assigning the rest of the points based on Definition 3, the red dashed line indicates the boundary between the two clusters.
Given a dataset with k non-overlapping clusters, if the density distribution of at least one cluster C_i is a monotonically non-increasing function having its only mode at c_i, then C_i may not be an η-linked cluster.
This occurs when not every point in C_i has its shortest path to c_i. An example is shown in Figure 1(b), where some points to the left of the boundary are assigned to the left cluster, though they should belong to the right cluster. A dataset which has clusters with uniform density distributions only does not contain η-linked clusters for the same reason.
In practice, a dataset having clusters with slow rates of density decrease from the modes, even if every cluster is a monotonically decreasing function, can raise the same issue because a density estimator may identify multiple local peaks for such a cluster due to estimation errors.
In addition, a dataset having clusters with multiple local peaks may not contain η-linked clusters when the local peaks do not have the shortest path to the correct modes. Figures 1(c) and 1(d) illustrate two example density distributions with two clusters, one of which has a local peak (Peak 3) in addition to the cluster mode (Peak 2). Peak 3 in Figure 1(c) has the shortest path to Peak 2; and it is assigned to the same cluster as Peak 2. However, Peak 3 in Figure 1(d) has the shortest path to Peak 1. This is because its nearest point with higher density is a point which has the shortest path to Peak 1. Note that this occurs even if Peak 3 is just a tiny bump in the data distribution.
In a nutshell, η-linked clusters are sensitive to (i) the distance separation between clusters; and (ii) the data density distribution.
Based on the first two lemmas and Definition 3, we formulate a theorem for an algorithm which identifies η-linked clusters as follows:
Given a dataset with k η-linked clusters, a necessary condition for a clustering algorithm implementing Definition 3 to correctly identify all k clusters is that it must correctly select the mode in each and every cluster.
A violation of the condition means at least one of the two following situations will occur:
If a point, other than the true mode, is selected by the algorithm as a cluster mode, then the cluster will be split into at least two clusters. This is because all cluster members with higher density than the selected cluster mode do not have the shortest path to this mode and will be assigned to other cluster(s).
If no point in a cluster is selected as a cluster mode, then each member of this cluster will be linked to a different cluster with a mode having the shortest path from this member.
Therefore, in either situation, not all clusters are identified correctly by this clustering algorithm. ∎
4 DP — An algorithm which identifies η-linked clusters
Density Peak or DP rodriguez2014clustering has two main procedures as follows:
Identifying cluster modes via a ranking scheme over all points. The top k points are selected as the modes of k clusters.
Linking each non-mode point to its nearest neighbour with higher density. The points directly linked or transitively linked to the same cluster mode are assigned to the same cluster. This produces k clusters at the end of the process.
Therefore, DP is an algorithm implementing Definition 3 to identify η-linked clusters in a dataset (in step 2 above).
The first step is critical, as specified in Theorem 4. To effectively identify cluster modes, DP assumes that different cluster modes should have a relatively large distance between them in order to detect well-separated clusters rodriguez2014clustering . To prevent a cluster from breaking into multiple clusters when it has a slow rate of density decrease from the cluster mode, it applies a ranking scheme to all points, and then selects the top k points as cluster modes. This is done as follows.
DP defines a distance function δ as follows: δ(x) = d(x, η(x)) for x ≠ g; and δ(g) = max_{y ∈ D} d(g, y).
DP selects the top k points with the highest γ = ρ × δ as cluster modes. This means that each cluster mode should have high density and be far away from other cluster modes.
4.1 Weaknesses of DP
While DP generally produces a better clustering result than DBSCAN (see the evaluation conducted by Chen et al Chen2018 reported in Appendix C of the paper), we identify two fundamental weaknesses of DP:
Given a dataset of η-linked clusters, if the data distribution is such that the cluster modes are not ranked as the top k points with the highest γ values, then DP cannot correctly identify these cluster modes, as stated in Theorem 4. The source of this weakness is the ranking scheme in step 1 of the DP procedure.
An example is a dataset having two Gaussian clusters and an elongated cluster with two local peaks (the left one is the cluster mode), as shown in Figure 3. DP with k = 3 splits the elongated cluster into two sub-clusters because the two local peaks are ranked among the top 3 points with the highest γ values; and it merges the bottom two clusters into one, as shown in Figure 2(a). A better clustering result can be obtained by using a larger k, which results in a correct identification of the two bottom clusters, as shown in Figure 2(b); but it still splits the single top cluster into two. Note that all three clusters would be correctly identified by DP if the three true cluster modes were pre-selected for DP using a different process. This data distribution is similar to that shown in Figure 1(c), which has valid η-linked clusters.
Figure 3: Clustering result of DP on the 3C dataset (a dataset with three clusters). The colours used in each scatter plot indicate clusters labelled by DP.
Figure 4: Clustering results of DP on two datasets (2O and 2Q), when k is the same as the true number of clusters. The colours used in each scatter plot indicate clusters labelled by DP.
Given a dataset of non-η-linked clusters, DP cannot correctly identify these clusters because DP can detect η-linked clusters only. The source of this weakness is the limitation of the cluster definition used.
Figure 4 demonstrates two examples where DP fails to correctly detect the clusters. These clusters are non-η-linked clusters, i.e., some points of a cluster do not have an η-linked path to their own cluster mode, but have an η-linked path to the mode of another cluster, even though the two clusters are well separated. Specifically, the 2O dataset in Figure 3(a) has clusters with uniform distribution; and the data distribution of the 2Q dataset in Figure 3(b) has small bumps on the side of the large cluster near the smaller cluster (a similar data characteristic to that shown in Figure 1(d)), as specified in Lemma 3.
4.2 An existing improvement of DP
Chen et al. Chen2018 provide a method called Local Contrast (LC-DP), which aims to improve the ranking of cluster modes for detecting all clusters in a dataset with clusters of varied densities, i.e., the condition they discovered under which DP fails to correctly detect all clusters.
LC-DP uses the local contrast LC(x), instead of the density ρ(x), in Equation 2 to determine δ. LC(x) is defined as the number of times that x has a higher density than its K-nearest neighbours, which takes values between 0 and K. The ranking is then based on γ = LC × δ', where δ' is the version of δ determined by LC instead of ρ. Chen et al. Chen2018 show that the use of LC makes DP more robust to clusters with varied densities.
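A sketch of the Local Contrast measure as described (the function name is ours; K-nearest neighbours are found by brute force):

```python
import math

def local_contrast(points, rho, K):
    """LC(x): among x's K-nearest neighbours, how many have a lower
    density than x; an integer between 0 and K."""
    n = len(points)
    lc = []
    for i in range(n):
        knn = sorted((j for j in range(n) if j != i),
                     key=lambda j: math.dist(points[i], points[j]))[:K]
        lc.append(sum(1 for j in knn if rho[i] > rho[j]))
    return lc
```

Because LC compares each point only against its local neighbourhood, a mode of a sparse cluster can score as highly as a mode of a dense cluster, which is the intuition behind its robustness to varied densities.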
Here we show that LC-DP enhances DP's clustering performance on some clusters with varied densities, e.g., the 2Q and 3C datasets. However, LC does not overcome the two weaknesses of DP mentioned above. For example, LC-DP still fails to identify all clusters in the 2O dataset, which does not have η-linked clusters. Therefore, it is important to design a method to overcome DP's two weaknesses. Here we propose a hierarchical method based on density-connectivity with this aim in mind.
We first reiterate the currently known definitions of density connectivity and density-connected clusters in Section 5. Then, we define a new type of clusters based on density connectivity in the following section.
5 Density-connected clusters
The classic density-based clustering algorithms, such as DBSCAN ester1996density , define a cluster based on density connectivity as follows:
Using an ε-neighbourhood density estimator with a density threshold τ, a point x is density connected with another point y if there exists a sequence of unique points p_1 = x, …, p_m = y such that d(p_i, p_{i+1}) ≤ ε and ρ(p_i) ≥ τ for every point in the sequence.
A density-connected cluster C, which has a mode c, is a maximal set of points that are density connected with its mode, i.e., C = {x ∈ D | x is density connected with c}.
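The density-connectivity test can be sketched as a breadth-first search (an illustration with our names; for simplicity this version requires every point in the chain, endpoints included, to meet the density threshold `tau`):

```python
import math
from collections import deque

def density_connected(points, rho, eps, tau, x, y):
    """Is there a chain of points from x to y where consecutive points
    are within eps of each other and all points have density >= tau?"""
    if rho[x] < tau or rho[y] < tau:
        return False
    seen, queue = {x}, deque([x])
    while queue:
        p = queue.popleft()
        if p == y:
            return True
        for q in range(len(points)):
            if q not in seen and rho[q] >= tau \
               and math.dist(points[p], points[q]) <= eps:
                seen.add(q)
                queue.append(q)
    return False
```

Equivalently, x and y are density connected exactly when they lie in the same connected component of the graph over dense points whose edges join pairs within distance ε.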
The following lemma is based on the density-connectivity:
Points in a density-connected cluster C are density connected to each other via the mode c, i.e., ∀x, y ∈ C, x and y are density connected via c.
Note that a set of points having multiple modes (of the same peak density) must be density connected together in order to form a density-connected cluster.
The key characteristic of a density-connected cluster is that the cluster can have an arbitrary shape and size ester1996density . Although an η-linked cluster can have an arbitrary shape and size as well, DP, which detects η-linked clusters, has issues with the two types of data distributions mentioned in Section 4. The η-linked path may link points from different clusters which are separated by low-density regions, e.g., the two circle-clusters in Figure 3(a).
On the other hand, though a clustering algorithm such as DBSCAN which detects density-connected clusters does not have the above issues, DBSCAN has issues in identifying all clusters of varying densities ZHU2016983 , as shown in Figure 0(b) in Section 2.
In a nutshell, the η-linked clusters and the density-connected clusters have different limitations.
6 η-density-connected clusters
To overcome the limitations of (i) η-linked clusters, stated in Sections 3 and 4; and (ii) density-connected clusters, stated in Section 5, we strengthen the η-linked path based on density connectivity as follows:
An η-density-connected path linking points x and y, DCpath(x, y), is defined as the sequence of the smallest number of unique points p_1 = x, …, p_m = y such that p_{i+1} = η_c(p_i) for 1 ≤ i < m, where η_c(x) is x's nearest density-connected neighbour which has a higher density than x, i.e., η_c(x) = argmin_{y ∈ D : ρ(y) > ρ(x) and y is density connected with x} d(x, y).
The length of DCpath(x, y) is defined as len(DCpath(x, y)) = Σ_{i=1}^{m-1} d(p_i, p_{i+1}).
Note that len(DCpath(x, y)) = 0 if y = x; and len(DCpath(x, y)) = ∞ if no η-density-connected path exists between x and y.
An η-density-connected cluster C_i, which has only one mode c_i, is a maximal set of density-connected points having the shortest η-density-connected path to its cluster mode with respect to other cluster modes, in terms of the path length, i.e., C_i = {x ∈ D | len(DCpath(x, c_i)) < len(DCpath(x, c_j)), ∀ j ≠ i}.
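The nearest density-connected higher-density neighbour can be computed by first finding the density-connected components and then restricting each point's candidate neighbours to its own component; a brute-force sketch (all names are ours):

```python
import math
from collections import deque

def eta_c(points, rho, eps, tau):
    """For each point, its nearest neighbour of higher density that is
    also density connected with it; None if no such neighbour exists."""
    n = len(points)
    # Connected components of the (eps, tau)-graph over dense points.
    comp = [None] * n
    cid = 0
    for i in range(n):
        if comp[i] is not None or rho[i] < tau:
            continue
        comp[i] = cid
        queue = deque([i])
        while queue:
            p = queue.popleft()
            for j in range(n):
                if comp[j] is None and rho[j] >= tau \
                   and math.dist(points[p], points[j]) <= eps:
                    comp[j] = cid
                    queue.append(j)
        cid += 1
    out = [None] * n
    for i in range(n):
        cands = [j for j in range(n)
                 if rho[j] > rho[i] and comp[i] is not None
                 and comp[i] == comp[j]]
        if cands:
            out[i] = min(cands, key=lambda j: math.dist(points[i], points[j]))
    return out
```

Note how this differs from plain η: the densest point of each well-separated region keeps `None`, so a mode of one region can never be linked to a higher-density point in another region.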
Based on these definitions, we have two lemmas as follows:
An η-linked cluster becomes an η-density-connected cluster if all points in the η-linked cluster are density-connected.
In other words, if a dataset has η-linked clusters, it also has η-density-connected clusters when all points within each cluster are density-connected under proper ε and τ settings.
If a dataset has only density-connected clusters and each cluster has only one mode, then all clusters are η-density-connected clusters.
This is because when a dataset has only density-connected clusters, all points in each cluster are density-connected, while any two points from different clusters are not density-connected. Then every point in each cluster has the shortest η-density-connected path to its own cluster mode with respect to other cluster modes. Therefore, all clusters are η-density-connected clusters.
With a proper neighbourhood threshold ε and density threshold τ, well-separated clusters cannot be linked together into one η-density-connected cluster. This enables us to identify clusters which are density-connected but are not η-linked in a dataset. Figure 5 illustrates the cluster boundaries of two clusters after selecting Peak 1 and Peak 2 as cluster modes and assigning the rest of the points based on Definition 8. It can be seen that all these clusters can now be identified as η-density-connected clusters, although the clusters in Figures 4(c) and 4(d) are not η-linked clusters.
7 An η-density-connected hierarchical clustering algorithm
Here we propose an η-density-connected hierarchical clustering algorithm. It is described in two subsections. In Section 7.1, we introduce a different view of the (flat) DP as a hierarchical procedure. Instead of employing a decision graph to rank points, as proposed in the original DP paper rodriguez2014clustering , the proposed hierarchical procedure merges clusters bottom-up to produce a dendrogram. The dendrogram enables a user to identify clusters in a hierarchical way, which cannot be produced by the current flat DP procedure. In Section 7.2, we incorporate the density-connectivity check into this hierarchical procedure to produce the new algorithm DC-HDP.
7.1 A different view of DP: a hierarchical procedure
We show that the DP clustering rodriguez2014clustering can be accomplished as a hierarchical clustering; the two clustering procedures produce exactly the same flat clustering result when the same k is used. If we ran DP n times by setting k = n, n − 1, …, 1, we would get a bottom-up hierarchical clustering result. To avoid running DP n times, which would take O(n³) time, we propose a hierarchical procedure as follows.
The initialisation step of the hierarchical DP is as follows. After calculating ρ, η and δ for all points, let every point be a cluster mode (which is equivalent to running DP with k = n); and each cluster mode x is tagged with its δ(x). Let M = D \ {g} be the set of cluster modes used for merging in the next step.
The first merging of two clusters (which is equivalent to running DP with k = n − 1) is conducted as follows. Select the cluster whose mode x has the smallest δ value in M; this cluster is merged with the cluster containing η(x). x is then removed from M. The above merging process is repeated iteratively, merging two clusters at each iteration, until M is empty.
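This bottom-up procedure can be sketched as follows, recording each merge together with the δ value of the removed mode as the dendrogram height (an illustration with our names; `eta[i]` is `None` for the global peak g):

```python
def hierarchical_dp_merges(delta, eta):
    """Hierarchical DP sketch: every point starts as a cluster mode;
    at each step the mode x with the smallest delta merges into the
    cluster of eta[x] and stops being a mode.
    Returns (merged_mode, eta_target, height) triples in merge order."""
    modes = {i for i in range(len(delta)) if eta[i] is not None}
    merges = []
    while modes:
        x = min(modes, key=lambda i: delta[i])
        merges.append((x, eta[x], delta[x]))   # merge at height delta[x]
        modes.remove(x)
    return merges
```

Since each of the n − 1 non-peak points is removed exactly once, the merge sequence itself is produced in O(n log n) once ρ, η and δ are available; the quadratic cost lies in computing those quantities.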
Figure 6 illustrates the clustering results produced by the hierarchical DP as dendrograms on the three datasets used in Figures 3 and 4. Figure 5(a) shows that the elongated cluster is split at the top level in the dendrogram. The dendrogram in Figure 5(c) shows that points from the two circles are (incorrectly) merged at low levels in the hierarchical structure. Figure 5(e) illustrates that many points from the sparse-and-large cluster are linked to the dense-and-small cluster at the two-cluster level of the dendrogram. The flat clustering results extracted from the dendrograms produced by the hierarchical DP are the same as those produced by the flat DP shown in Figures 3 and 4.
Because the cluster modes are selected according to the ranking of δ, a clustering result with k clusters can be obtained by setting an appropriate threshold of δ on the bottom-up hierarchical clustering result such that the number of clusters below the threshold is k. Since both the hierarchical DP and the flat (original) DP produce the same flat clustering result, the name DP is used hereafter to denote both versions, as far as the flat clustering result is concerned.
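Extracting a flat clustering of k clusters from such a dendrogram amounts to applying all but the k − 1 highest merges, which is equivalent to a global height threshold; a union-find sketch (names and the merge-triple format are ours):

```python
def cut_dendrogram(n, merges, k):
    """Flat clustering with k clusters: apply all but the (k - 1)
    highest merges.  merges = (child_mode, target, height) triples,
    as produced by a bottom-up merging procedure."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    kept = sorted(merges, key=lambda m: m[2])[:len(merges) - (k - 1)]
    for child, target, _ in kept:
        parent[find(child)] = find(target)
    return [find(i) for i in range(n)]     # one root label per cluster
```

Skipping the k − 1 highest merges leaves exactly k connected groups, which is the same outcome as cutting the dendrogram just below the height of the (k − 1)-th highest merge.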
Advantages of the hierarchical DP: There are two advantages of the hierarchical DP over the flat DP. First, the former avoids the need to select k cluster modes in the first step of the clustering process. Instead, after the dendrogram is produced at the end of the hierarchical clustering process, k is required only if a flat clustering is to be extracted from the dendrogram. Second, the dendrogram produced by the hierarchical DP provides richer information about the hierarchical structure of clusters in a dataset than the flat partitioning provided by the flat DP.
The hierarchical DP has the same time complexity as the flat DP, i.e., O(n²), since ρ, η and δ are calculated for all points only once.
7.2 A density-connected hierarchical DP
In order to enhance DP to detect clusters from a larger set of data distributions than that covered by density-connected clusters or $\eta$-linked clusters, the clusters based on Definition 8 are used.
Using the hierarchical DP, it turns out that only one simple rule needs to be incorporated, i.e., to check whether two cluster modes at the current level are density-connected before merging them at the next level in the hierarchical structure: two clusters $C_i$ and $C_j$ can only be merged if there is an $\epsilon$-density-connected path between their cluster modes. This check is conducted at each level of the hierarchy, where the procedure selects the cluster whose mode has the smallest $\gamma$, and merges it with another cluster that is density-connected to it.
We call the new algorithm, DC-HDP, as it employs this cluster merging rule based on the density connectivity with the hierarchical DP procedure. The DC-HDP algorithm is shown in Algorithm 1.
Note that, besides the cluster modes, there may exist points which are not density-connected to any cluster. At the end of line 12, such points become isolated points.
Also note that it is possible to have more than one cluster at the end of line 12, if the clusters are not density-connected to each other. Therefore, these clusters, together with all remaining isolated points, are merged to yield only one cluster at the top of the hierarchical structure (see line 13 in Algorithm 1).
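The density-connectivity check in the merge rule can be sketched as a reachability test on the $\epsilon$-neighbourhood graph. This is a DBSCAN-style reading of density connectivity (a chain of points, each within $\epsilon$ of the next, relayed through points whose $\epsilon$-neighbourhood contains at least MinPts points); the exact condition of Definition 8 may differ in detail, and the function name is hypothetical:

```python
from collections import deque

def eps_density_connected(dist_matrix, eps, minpts, a, b):
    """Return True iff modes a and b are linked by an eps-density-connected
    path, under a DBSCAN-style notion of connectivity: the path may only be
    relayed through 'dense' points (>= minpts points, including itself, in
    their eps-neighbourhood)."""
    n = len(dist_matrix)
    nbrs = [[j for j in range(n) if j != i and dist_matrix[i][j] <= eps]
            for i in range(n)]
    dense = [len(nbrs[i]) + 1 >= minpts for i in range(n)]  # +1: the point itself

    seen, queue = {a}, deque([a])
    while queue:
        i = queue.popleft()
        if i == b:
            return True
        if i != a and not dense[i]:
            continue                      # non-dense points cannot relay the path
        for j in nbrs[i]:
            if j not in seen:
                seen.add(j)
                queue.append(j)
    return False
```

With precomputed neighbourhood lists, each check is a breadth-first search, so it adds no more than linear work per merge on top of the $O(n^2)$ distance computation.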
Once the dendrogram is obtained from Algorithm 1, a global threshold can be used to select the clusters as a flat clustering result.
Note that the algorithm for the hierarchical DP is the same as the DC-HDP algorithm, except that the former (i) merges clusters without the density-connectivity check; and (ii) operates on the whole dataset without the global peak point, rather than excluding the isolated points.
Compared with DP, DC-HDP has one more parameter, MinPts, used for the density-connectivity check; and the same $\epsilon$ is used for both density estimation and the density-connectivity check. DC-HDP maintains the same time complexity as DP, i.e., $O(n^2)$.
DC-HDP has the ability to enhance the clustering performance of DP on a dataset having $\epsilon$-density-connected clusters, which encompass the two kinds of clusters DP is weak at, mentioned in Section 3. This is because DC-HDP does not establish any DCpath between points from different clusters which are not density-connected. Since clusters which are not density-connected are merged only at the top of the dendrogram, at the highest level, a global threshold can separate all these clusters. Unlike DBSCAN, DC-HDP does not rely on a global density threshold to link points; thus, DC-HDP has the ability to detect clusters with varied densities.
In a nutshell, DC-HDP takes advantage of the individual strengths of DBSCAN and DP: it has an enhanced ability to identify clusters of arbitrary shapes and varied densities, which neither DBSCAN nor DP has.
Figure 7 illustrates a clustering result from DC-HDP as a dendrogram on each of the three datasets used in Figure 3 and Figure 4. It shows that all clusters can be detected perfectly by DC-HDP when an appropriate threshold (blue horizontal line) is used on the dendrogram.
Furthermore, DC-HDP has an additional advantage in comparison with DP and DBSCAN, i.e., the dendrogram produced has a rich structure of clusters at different levels. This is the advantage of a hierarchical clustering over a flat clustering.
8 Empirical evaluation
This section presents experiments designed to evaluate the effectiveness of DC-HDP. We compare DC-HDP with 3 state-of-the-art density-based clustering algorithms (DBSCAN ester1996density , DP rodriguez2014clustering and LC-DP Chen2018 ), the 2 most recent state-of-the-art hierarchical clustering algorithms (PHA LU20131227 and HML 7440832 ), and 2 density-based hierarchical clustering algorithms (HDBSCAN Campello:2015:HDE and OPTICS ankerst1999optics ).
The clustering performance is measured in terms of the F-measure.\footnote{It is worth noting that other evaluation measures, such as purity and Normalized Mutual Information (NMI) strehl2002cluster , only take into account the points assigned to clusters and do not account for noise. A clustering algorithm which assigns the majority of the points to noise may thus obtain a high score. The F-measure is therefore more suitable than purity or NMI in assessing the clustering performance of density-based clustering, e.g., DBSCAN and OPTICS.} Given a clustering result, we calculate the precision score $P_i$ and the recall score $R_i$ for each cluster $C_i$ based on the confusion matrix; the F-measure score of $C_i$ is the harmonic mean $F_i = \frac{2 P_i R_i}{P_i + R_i}$. The overall F-measure score is the unweighted average over all $k$ clusters: F-measure $= \frac{1}{k}\sum_{i=1}^{k} F_i$.
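A minimal sketch of this scoring, assuming the common convention that each predicted cluster is matched to the ground-truth class it overlaps most (the paper's exact cluster-to-class matching via the confusion matrix is not spelled out here):

```python
from collections import Counter

def overall_f_measure(true_labels, pred_labels):
    """Unweighted average of per-cluster F-measures.

    Each predicted cluster is scored against its best-overlapping
    ground-truth class (an assumed matching convention)."""
    f_scores = []
    for c in set(pred_labels):
        members = [i for i, p in enumerate(pred_labels) if p == c]
        # best-matching ground-truth class for this cluster
        overlap = Counter(true_labels[i] for i in members)
        cls, hits = overlap.most_common(1)[0]
        precision = hits / len(members)
        recall = hits / sum(1 for t in true_labels if t == cls)
        f_scores.append(2 * precision * recall / (precision + recall))
    # unweighted average over clusters (small clusters count as much as large ones)
    return sum(f_scores) / len(f_scores)
```

The unweighted average ensures that a small, correctly recovered cluster contributes as much to the score as a large one.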
We used 4 artificial datasets (2O, 3C, 3G and 2Q) and 10 real-world datasets with different data sizes and dimensions from the UCI Machine Learning Repository Lichman:2013 . Table 2 presents the properties of these datasets.
All algorithms used in our experiments were implemented in Matlab.\footnote{The source codes of all algorithms used in our experiments can be obtained at: DBSCAN, DC-HDP, LC-DP and DP: https://sourceforge.net/projects/hierarchical-dp/ ; HDBSCAN: https://goo.gl/i86JPJ ; OPTICS: https://github.com/alexgkendall/OPTICS_Clustering ; HML: https://goo.gl/BjtWAA ; PHA: https://goo.gl/fGD5p6 } The experiments were run on a machine with eight cores (Intel Core i7-7820X 3.60GHz) and 32GB memory. All datasets were normalised using min-max normalisation, so that each attribute is in [0,1], before the experiments began.
For DP, we normalised both $\rho$ and $\delta$ to be in [0,1] before selecting cluster modes so that these two quantities have the same weight in their product $\gamma$. We report the best clustering performance within a range of parameter search for each algorithm. Table 3 lists the parameters and their search ranges for each algorithm. Note that the parameter $\xi$ in OPTICS is used to identify downward and upward areas of the reachability plot in order to extract all clusters using a bottom-up hierarchical method ankerst1999optics . For all algorithms using the $\epsilon$-neighbourhood for density estimation, $\epsilon$ is searched within a range of fractions of the maximum pairwise distance. Note that for DC-HDP, the parameter $k$ is only required to extract clusters from the dendrogram (at the end of Algorithm 1) by setting a corresponding threshold.
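The preprocessing and the $\epsilon$ search range can be sketched as follows; the particular fractions used here are illustrative, not the paper's actual search grid:

```python
import math

def minmax_normalise(X):
    """Min-max normalise each attribute of X (a list of rows) to [0, 1]."""
    d = len(X[0])
    lo = [min(row[j] for row in X) for j in range(d)]
    hi = [max(row[j] for row in X) for j in range(d)]
    # constant attributes get range 1.0 so they map to 0 rather than divide by zero
    rng = [h - l if h > l else 1.0 for l, h in zip(lo, hi)]
    return [[(row[j] - lo[j]) / rng[j] for j in range(d)] for row in X]

def eps_candidates(X, fractions=(0.01, 0.05, 0.1, 0.2)):
    """eps candidates as fractions of the maximum pairwise distance;
    the fractions are illustrative placeholders."""
    dmax = max(math.dist(a, b) for a in X for b in X)
    return [f * dmax for f in fractions]
```

Tying $\epsilon$ to the maximum pairwise distance of the normalised data makes the search range comparable across datasets of different scales and dimensions.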
[Table 3: Algorithm | Parameter with search range]
Table 4 shows the best F-measures of the 8 algorithms. In terms of the average F-measure (shown in the fourth row from the bottom), DC-HDP has the highest average F-measure of 0.82. The average F-measures of LC-DP, DP and OPTICS are 0.79, 0.76 and 0.76, respectively. HDBSCAN and DBSCAN have average F-measures of 0.67 and 0.66, respectively. HML and PHA have the lowest average F-measures of 0.51 and 0.61, respectively.
Among the 8 algorithms, DC-HDP was the top performer on 8 out of 14 datasets, followed by LC-DP, which was the top performer on 5 datasets. HML obtained very low F-measures on many datasets, e.g., the Iris, Ecoli, Control and Segment datasets, because it could not separate different clusters properly.
Notably, on three synthetic datasets (3C, 2Q and 2O), only DC-HDP, OPTICS, HDBSCAN and DBSCAN identified all clusters perfectly, while the other algorithms failed on at least one of these datasets. LC-DP could not obtain a perfect result on the 2O dataset because it failed to separate the two circles.
For the 3G dataset, which has clusters with varied densities, DBSCAN and HML obtained the lowest F-measure of only 0.67. HDBSCAN achieved an F-measure of only 0.92 on this dataset because it assigned many high-density points to noise. While OPTICS, DP, LC-DP and DC-HDP all produced near-perfect clustering results, DC-HDP had the best result.
9.1 Relation to existing hierarchical clustering algorithms
Among existing hierarchical clustering methods, DC-HDP is closest to traditional agglomerative methods, in that they all employ a bottom-up strategy that merges two individual clusters at each successive level of a tree structure.
However, DC-HDP and the hierarchical DP differ from traditional agglomerative methods in two ways. First, when a traditional agglomerative method merges two clusters, they are selected based on a set-based dissimilarity measure. Different measures can be used to merge two clusters by trading off quality versus efficiency, such as the single-linkage, complete-linkage, all-pairs linkage and centroid-linkage measures aggarwal2013data . In contrast, DC-HDP and the hierarchical DP do not simply employ a dissimilarity measure to determine the two clusters to merge at each level. Instead, they first identify the cluster whose mode has the smallest $\gamma$, and then select another cluster which has the shortest path length to it. While the path length may be considered a kind of dissimilarity measure, it is only a supporting measure; the key determinant is $\gamma$.
Second, different from traditional methods, DC-HDP and the hierarchical DP are a new agglomerative approach detecting $\epsilon$-density-connected clusters and $\eta$-linked clusters, respectively. Therefore, they can detect arbitrarily shaped clusters, while existing agglomerative methods generally detect clusters with specific shapes: e.g., the single-linkage measure tends to output elongated clusters, the complete-linkage measure tends to detect compact clusters, and the all-pairs linkage and centroid-linkage measures tend to find globular clusters aggarwal2013data .
The standard algorithm for hierarchical agglomerative clustering has a time complexity of and a space complexity of Day1984 . However, many efficient hierarchical agglomerative clustering approaches have the same quadratic time complexity and space complexity as DC-HDP, e.g., SLINK Sibson1973 for single-linkage and CLINK SLINK1977 for complete-linkage clustering.
HML 7440832 and PHA LU20131227 are two state-of-the-art hierarchical agglomerative clustering algorithms proposed recently. They employ similarity measures which are different from those in the traditional agglomerative methods mentioned above. HML uses a maximum likelihood method to identify the two most similar clusters to be merged at each level. It was designed for biological and genomic data and can perform well on data that follow a Gaussian distribution. PHA measures the similarity between two clusters based on a hypothetical potential field that relies on both local and global data distribution information. It can detect slightly overlapping clusters with non-spherical shapes in noisy data. However, compared with density-based methods (e.g., DBSCAN, HDBSCAN, OPTICS and DC-HDP), both HML and PHA performed much worse in detecting arbitrarily shaped clusters, e.g., on the 2Q and 2O datasets.
There is another class of algorithms which employs a method to produce an initial set of subclusters from data, before applying a hierarchical clustering. For example, CHAMELEON 781637 produces a $k$-nearest-neighbour graph from data and then breaks the graph into many small subgraphs (as subclusters). An agglomerative method is then used to merge the subclusters iteratively based on a similarity measure. The same general approach is used in two more recent methods, HDBSCAN Campello:2015:HDE and OPTICS ankerst1999optics , though different methods are used to produce the subclusters in the preprocessing step before building a hierarchical structure on them.
We show that DC-HDP is a much simpler yet effective approach compared with this class of algorithms, because DC-HDP applies agglomerative clustering directly to the individual points in the given dataset, without any preprocessing to create subclusters. Section 8 shows that DC-HDP produces significantly better clustering results than the most recent representative of this class of algorithms, i.e., HDBSCAN, as well as OPTICS.
9.2 Parameter settings
DC-HDP requires two parameters, $\epsilon$ and MinPts, to build a dendrogram from a dataset, as shown in Table 3, where $\epsilon$ is the more important one as it is used in both density estimation and the density-connectivity check. In our experiments, we found that MinPts can be set to 2 (i.e., at least 2 points in the $\epsilon$-neighbourhood) on most datasets in terms of getting the best clustering results.
In all empirical evaluations reported in Section 8, we used the same parameter $\epsilon$ for both the density estimation and the density-connectivity check in DC-HDP. However, $\epsilon$ can be split into two different parameters, one for each of the two processes. By doing so, we found that DC-HDP can perform even better than the results shown in Table 4 on some datasets.
9.3 Ability to detect noise
It is worth mentioning that density-based clustering has the ability to identify noise and filter it out during clustering. For example, DBSCAN uses a global density threshold to identify noise, i.e., points with a density lower than the threshold, in the first step of the algorithm ester1996density . DP rodriguez2014clustering employs a different method: it identifies noise as points with low densities in the border regions of clusters (see footnote 4 for details), and this is conducted at the end of the clustering process. The same method can be used by DC-HDP to identify noise. In addition, clusters having few points can also be treated as noise, as in DBSCAN.
10 Conclusions
The lack of a definition of the kind of clusters that a state-of-the-art density-based algorithm called Density Peak (DP) can detect has motivated the work in this paper.
We formally defined two new kinds of clusters: $\eta$-linked clusters and $\epsilon$-density-connected clusters. A further analysis revealed that DP is a clustering algorithm that detects $\eta$-linked clusters, and that it has weaknesses in data distributions which contain a special kind of $\eta$-linked clusters or some non-$\eta$-linked clusters. We showed that $\epsilon$-density-connected clusters encompass all $\eta$-linked clusters as well as the kind of non-$\eta$-linked clusters that DP fails to detect.
After showing that DP clustering can be accomplished as a hierarchical clustering, we proposed a density-connected hierarchical DP clustering algorithm called DC-HDP, which is designed to detect $\epsilon$-density-connected clusters.
By taking advantage of the individual strengths of DBSCAN and DP, DC-HDP produces clustering outputs which are superior in two key aspects. First, DC-HDP has an enhanced ability to identify clusters of arbitrary shapes and varied densities, which neither DBSCAN nor DP has. Second, the dendrogram generated by DC-HDP gives richer information about the hierarchical structure of clusters in a dataset than the flat partitioning provided by DBSCAN and DP. DC-HDP achieves this enhanced ability with the same time complexity as DP, requiring only one additional parameter, which can usually be set to a default value in practice.
Our empirical evaluation validates this superiority by showing that DC-HDP produces the best clustering results on 14 datasets in comparison with 7 state-of-the-art clustering algorithms, including density-based clustering algorithms, i.e., DBSCAN, DP and an existing improvement upon DP; and density-based hierarchical clustering algorithms, i.e., HDBSCAN and OPTICS.
- (1) Z.-H. Zhou, Three perspectives of data mining, Artificial Intelligence 143 (1) (2003) 139–146.
- (2) J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques, 3rd Edition, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2011.
- (3) L. Kaufman, P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990.
- (4) M. J. Zaki, J. Wagner Meira, Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, 2014.
- (5) J. A. Hartigan, M. A. Wong, Algorithm AS 136: A k-means clustering algorithm, Journal of the Royal Statistical Society. Series C (Applied Statistics) 28 (1) (1979) 100–108.
- (6) L. Kaufman, P. Rousseeuw, Clustering by means of medoids, Statistical Data Analysis Based on the L1 Norm and Related Methods (1987) 405–416.
- (7) C. C. Aggarwal, C. K. Reddy, Data Clustering: Algorithms and Applications, Chapman and Hall/CRC Press, 2013.
- (8) M. Ester, H.-P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), AAAI Press, 1996, pp. 226–231.
- (9) A. Hinneburg, H.-H. Gabriel, DENCLUE 2.0: Fast clustering based on kernel density estimation, in: Proceedings of the 7th International Conference on Intelligent Data Analysis, IDA'07, Springer-Verlag, Berlin, Heidelberg, 2007, pp. 70–80.
- (10) Y. Zhu, K. M. Ting, M. J. Carman, Density-ratio based clustering for discovering clusters with varying densities, Pattern Recognition 60 (2016) 983 – 997.
- (11) A. Rodriguez, A. Laio, Clustering by fast search and find of density peaks, Science 344 (6191) (2014) 1492–1496.
- (12) B. Chen, K. M. Ting, T. Washio, Y. Zhu, Local contrast as an effective means to robust clustering against varying densities, Machine Learning (2018) 1–25.
- (13) A. Sharma, K. A. Boroevich, D. Shigemizu, Y. Kamatani, M. Kubo, T. Tsunoda, Hierarchical maximum likelihood clustering approach, IEEE Transactions on Biomedical Engineering 64 (1) (2017) 112–122.
- (14) Y. Lu, Y. Wan, PHA: A fast potential-based hierarchical agglomerative clustering method, Pattern Recognition 46 (5) (2013) 1227 – 1239.
- (15) R. J. G. B. Campello, D. Moulavi, A. Zimek, J. Sander, Hierarchical density estimates for data clustering, visualization, and outlier detection, ACM Transactions on Knowledge Discovery from Data 10 (1) (2015) 5:1–5:51.
- (16) M. Ankerst, M. M. Breunig, H.-P. Kriegel, J. Sander, OPTICS: Ordering points to identify the clustering structure, in: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, SIGMOD ’99, ACM, New York, NY, USA, 1999, pp. 49–60.
- (17) S. Gilpin, I. Davidson, A flexible ILP formulation for hierarchical clustering, Artificial Intelligence 244 (2017) 95 – 109.
- (18) J. Xie, H. Gao, W. Xie, X. Liu, P. W. Grant, Robust clustering by detecting density peaks and assigning points based on fuzzy weighted k-nearest neighbors, Information Sciences 354 (2016) 19 – 40.
- (19) J. Xu, G. Wang, W. Deng, DenPEHC: Density peak based efficient hierarchical clustering, Information Sciences 373 (2016) 200 – 218.
- (20) Q. Zhang, C. Zhu, L. T. Yang, Z. Chen, L. Zhao, P. Li, An incremental CFS algorithm for clustering large data in industrial internet of things, IEEE Transactions on Industrial Informatics 13 (3) (2017) 1193–1201.
- (21) A. Strehl, J. Ghosh, Cluster ensembles—a knowledge reuse framework for combining multiple partitions, Journal of Machine Learning Research 3 (Dec) (2002) 583–617.
- (22) M. Lichman, UCI Machine Learning Repository, 2013.
- (23) W. H. E. Day, H. Edelsbrunner, Efficient algorithms for agglomerative hierarchical clustering methods, Journal of Classification 1 (1) (1984) 7–24.
- (24) R. Sibson, SLINK: An optimally efficient algorithm for the single-link cluster method, The Computer Journal 16 (1) (1973) 30–34.
- (25) D. Defays, An efficient algorithm for a complete link method, The Computer Journal 20 (4) (1977) 364–366.
- (26) G. Karypis, E.-H. Han, V. Kumar, Chameleon: hierarchical clustering using dynamic modeling, Computer 32 (8) (1999) 68–75.