1 Introduction
We are living in the era of data explosion. Due to the increase in data size and types of data, traditional manual data analyses have become an impossible task without powerful tools ZHOU2003139 . In order to meet the demands, data mining, also referred to as knowledge discovery from data, has emerged to automate the process of discovering insightful, interesting, and novel patterns han2011data .
Clustering is an important and useful tool in data mining and knowledge discovery. It has been widely used for partitioning instances in a dataset such that similar instances are grouped together to form a cluster han2011data . It is the most common unsupervised knowledge discovery technique for automatic data labelling in various areas, e.g., information retrieval, image segmentation, and pattern recognition kaufman2009finding . Depending on the basis of categorisation, clustering methods can be divided into several kinds, e.g., partitioning clustering versus hierarchical clustering; and density-based clustering versus representative-based clustering zaki2014dataminingbook .
Partitioning clustering methods are the simplest and most fundamental clustering methods han2011data . They are relatively fast, and easy to understand and implement. They organise the data points in a given dataset into non-overlapping partitions, where each partition represents a cluster and each point belongs to one cluster only han2011data . However, traditional distance-based partitioning methods, such as k-means hartigan1979algorithm and k-medoids kaufman1987clustering , which are representative-based clustering methods, usually cannot find clusters with arbitrary shapes aggarwal2013data . In contrast, density-based clustering algorithms can find clusters with arbitrary sizes and shapes while effectively separating noise. Thus, this kind of clustering is attracting more research and development.
DBSCAN ester1996density and DENCLUE hinneburg2007denclue are examples of an important class of density-based clustering algorithms. They define clusters as regions of high density which are separated by regions of low density. However, these algorithms have difficulty finding clusters with widely varied densities because a global density threshold is used to identify high-density regions ZHU2016983 .
Rodriguez et al. proposed a clustering algorithm based on density peaks (DP) rodriguez2014clustering . It identifies cluster modes (the original DP paper regards detected cluster modes as "cluster centres" rodriguez2014clustering ) which have local maximum density and are well separated, and then assigns each remaining point in the dataset to a cluster mode via a linking scheme. Compared with the classic density-based clustering algorithms (e.g., DBSCAN and DENCLUE), DP has a better capability of detecting clusters with varied densities. Despite this improved capability, Chen et al. Chen2018 have recently identified a condition under which DP fails to detect all clusters with varied densities. They proposed a new measure called Local Contrast (LC), used in place of density, to enhance DP such that the resultant algorithm, LC-DP, is more robust against clusters with varied densities.
It is important to note that the progression from DBSCAN and DENCLUE to DP, and subsequently to LC-DP, with improved clustering performance, has been achieved without formally defining the types of clusters that DP and LC-DP can detect.
In this paper, we are motivated to formally define the type of clusters that an algorithm is designed to detect before investigating the weaknesses of the algorithm. This approach enables us to determine two additional weaknesses of DP; and we show that the use of LC does not overcome these weaknesses.
This paper makes the following contributions:

Formalising a new type of clusters called η-linked clusters; and providing a necessary condition for a clustering algorithm to correctly detect all η-linked clusters in a dataset.

Uncovering that DP is a clustering algorithm which is designed to detect η-linked clusters; and identifying two weaknesses of DP, i.e., the conditions under which DP cannot correctly detect all clusters in a dataset.

Introducing a different view of DP as a hierarchical procedure. Instead of producing flat clusters, this procedure generates a dendrogram, enabling a user to identify clusters in a hierarchical way.

Formalising a second new type of clusters called η̂-density-connected clusters, which encompasses all η-linked clusters as well as the kind of non-η-linked clusters that DP fails to detect.

Proposing an η̂-density-connected hierarchical DP to overcome the identified weaknesses of DP. The new algorithm, DC-HDP, merges two cluster modes only if they are density connected at a high level in the hierarchical structure.

Completing an empirical evaluation by comparing with 7 state-of-the-art clustering algorithms on 14 datasets: 3 density-based clustering algorithms (DBSCAN ester1996density , DP rodriguez2014clustering and LC-DP Chen2018 ), 2 hierarchical clustering algorithms (HML 7440832 and PHA LU20131227 ) and 2 density-based hierarchical clustering algorithms (HDBSCAN Campello:2015:HDE and OPTICS ankerst1999optics ).
The formal analysis of DP provides an insight into its key weaknesses, and has enabled a simple and effective method (DC-HDP) to overcome them. The proposed method takes advantage of the individual strengths of DBSCAN and DP, i.e., DC-HDP has an enhanced ability to identify all clusters of arbitrary shapes and varied densities, which neither DBSCAN nor DP has. In addition, the dendrogram generated by DC-HDP gives richer information about the hierarchical components of clusters in a dataset than the flat partitioning provided by DBSCAN and DP. This is achieved with the same computational time complexity as DP, with only one additional parameter, which can usually be set to a default value in practice.
Since hierarchical clustering algorithms allow a user to choose a particular clustering granularity, hierarchical clustering is very popular and has been used far more widely than non-hierarchical clustering GILPIN201795 . Thus, DC-HDP provides a new perspective which can be widely used in various applications.
The rest of the paper is organised as follows: we provide an overview of density-based clustering algorithms and related work in Section 2. Section 3 formalises η-linked clusters. Section 4 uncovers that DP is an algorithm which detects η-linked clusters, and reveals two weaknesses of DP. Section 5 reiterates the definition of density-connected clusters used by DBSCAN, and states the known weakness of DBSCAN. Section 6 presents the definition of the second new type of clusters, called η̂-density-connected clusters. The new η̂-density-connected hierarchical clustering algorithm is proposed in Section 7. In Section 8, we empirically evaluate the performance of the proposed algorithm by comparing it with 7 other state-of-the-art clustering algorithms. Discussion and conclusions are provided in the last two sections.
2 Related work
The two most representative algorithms of density-based clustering are DBSCAN ester1996density and DENCLUE hinneburg2007denclue . They first identify points in dense regions using a density estimator and then link neighbouring points in dense regions to form clusters. They can identify arbitrarily shaped clusters in noisy datasets. DBSCAN defines the density of a point as the number of points from the dataset that lie in its ε-neighbourhood. A "core" point is a point having density higher than a threshold τ. DBSCAN visits every core point and links all core points in its ε-neighbourhood together, until all core points are visited. Then, points which are directly or transitively linked are grouped into the same cluster. Finally, non-core points that are in the neighbourhood of other core points, called boundary points, are linked to the nearest clusters. If a point is neither a core point nor a boundary point, then it is considered to be "noise". DENCLUE uses a Gaussian kernel estimator to estimate the density of every point and applies a hill-climbing procedure to link neighbouring points with high densities. Although DBSCAN and DENCLUE can detect clusters with varied sizes and shapes, they have difficulty finding clusters with widely varied densities because a global density threshold is used to identify points in high-density areas ZHU2016983 .
Many variants of DBSCAN have attempted to overcome the weakness of detecting clusters with varied densities. OPTICS ankerst1999optics draws a "reachability" plot based on the nearest neighbour distance. On the x-axis of the plot, adjacent points follow close to each other such that each point is the closest to its predecessor in terms of the "reachability distance" (the "reachability distance" of object p with respect to object o is the greater of the "core distance" of o and the distance between p and o; the "core distance" of o is the minimum radius that makes o a "core" object, i.e., the distance from o to its MinPts-th nearest neighbour). The reachability distance of each point is shown on the y-axis. Since a cluster centre normally has a higher density, or lower reachability distance, than the cluster boundaries, each cluster is visible as a "valley" in this plot. Then a hierarchical method can be used to extract different clusters. The overall clustering performance depends on the hierarchical method employed on the reachability plot.
HDBSCAN Campello:2015:HDE is a hierarchical clustering algorithm based on DBSCAN. The idea is to produce many DBSCAN clustering outcomes through increasing density thresholds. (It first builds a minimum spanning tree (MST) over all points, where the weight of each edge is the mutual reachability distance and the weight of each vertex is the core distance. It then removes edges from the MST progressively in decreasing order of weight, which is equivalent to obtaining many DBSCAN clustering outcomes with increasing density thresholds.) As the density threshold increases, a cluster may shrink, be divided into smaller clusters, or disappear. A dendrogram is built from these clustering outcomes via a top-down method to yield the hierarchical cluster structure, e.g., the root is one cluster with all points, which is then split into different subclusters at the following levels. To produce a flat partitioning, it extracts a set of 'significant' clusters at different levels of the dendrogram via an optimisation process. HDBSCAN can detect clusters with varied densities because it employs different density thresholds. However, it has a bias towards low-density clusters. To separate overlapping high-density clusters, HDBSCAN needs to use a high density threshold, so that points in the boundary region are treated as noise. As a result, high-density clusters can lose many cluster members when a high density level is used.
Recently, the density peak (DP) clustering algorithm was proposed without using any density threshold rodriguez2014clustering . It assumes that cluster modes in the given dataset are sufficiently separated from each other. The clustering process has three steps as follows. First, DP calculates the density ρ̂(x) of each point x using an ε-neighbourhood density estimator, and the distance δ(x) between x and its nearest neighbour with a higher density value. Second, DP plots a decision graph for all points, where the y-axis is γ(x) = ρ̂(x) × δ(x), sorted in descending order along the x-axis. The k points with the highest γ (i.e., high density values and a relatively distant nearest neighbour of higher density) are selected as the cluster modes. Third, each remaining point is connected to its nearest neighbour of higher density, and the points connected or transitively connected to the same cluster mode are grouped into the same cluster. Noise can be detected by applying an additional step. (Once every point is assigned to a cluster, a border region is identified for each cluster: the set of points which are assigned to that cluster but located within the ε-neighbourhood of a point belonging to another cluster. Let ρ_b be the highest density of all points within the border region. Noise points of the cluster are all its points which have density values less than ρ_b.) Since DP does not rely on any density threshold to extract clusters, it avoids the weakness of DBSCAN in detecting clusters with varied densities.
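The three steps above can be sketched in code. The following is a minimal brute-force illustration only, not the authors' implementation: the function name, the O(n²) loops, and the index-based breaking of density ties are our own choices, and the sketch assumes the selected modes terminate every η chain.

```python
import math

def dp_cluster(points, eps, k):
    """Minimal sketch of DP: eps-neighbourhood densities, delta,
    gamma = rho * delta ranking, then linking via eta pointers."""
    n = len(points)
    dist = lambda a, b: math.dist(points[a], points[b])
    # Step 1a: density = number of points in the eps-neighbourhood.
    rho = [sum(1 for j in range(n) if j != i and dist(i, j) <= eps)
           for i in range(n)]
    # Step 1b: eta(x) = nearest neighbour of higher density;
    # delta(x) = distance to it (max pairwise distance for the peak).
    eta, delta = [None] * n, [0.0] * n
    for i in range(n):
        higher = [j for j in range(n)
                  if rho[j] > rho[i] or (rho[j] == rho[i] and j < i)]
        if higher:
            eta[i] = min(higher, key=lambda j: dist(i, j))
            delta[i] = dist(i, eta[i])
        else:  # global density peak
            delta[i] = max(dist(i, j) for j in range(n) if j != i)
    # Step 2: rank by gamma and take the top k points as cluster modes.
    gamma = [rho[i] * delta[i] for i in range(n)]
    modes = sorted(range(n), key=lambda i: -gamma[i])[:k]
    # Step 3: each remaining point follows its eta pointer to a mode.
    label = {m: c for c, m in enumerate(modes)}
    def assign(i):
        if i not in label:
            label[i] = assign(eta[i])
        return label[i]
    return [assign(i) for i in range(n)]
```

On two well-separated blobs, this sketch recovers the two groups, since the blobs' densest points receive the two largest γ values.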
Figure 1 illustrates the clustering results of DBSCAN, HDBSCAN and DP on a dataset having clusters with varied densities. Only DP can near-perfectly detect all clusters in this dataset. DBSCAN performed poorly: it either merged the two high-density clusters when using a low density threshold, or assigned only a few points to the low-density cluster when using a high density threshold. HDBSCAN can separate all three clusters, but many boundary points of the high-density clusters are regarded as noise. This is because HDBSCAN has to use a high density threshold to separate the two dense clusters, and the low-density points of the dense clusters then become noise.
DP is one of the most promising clustering algorithms, with the ability to detect clusters with varied densities and arbitrary shapes. Without formally articulating the weaknesses of DP, some papers have proposed improvements to it. For example, a fuzzy weighted k-nearest-neighbour method is used to search for density modes more efficiently XIE201619 ; a linear fitting method is proposed to determine more meaningful cluster modes on the decision graph XU2016200 ; and an incremental method is designed to enable DP to cluster large dynamic data in the industrial Internet of Things 7882646 . Only Chen et al. Chen2018 have formally identified a condition under which DP fails to detect all clusters of varied densities in a dataset, and proposed a new measure called Local Contrast (LC) to make DP more robust against clusters of varied densities.
However, to the best of our knowledge, there is no paper formalising the type of clusters that DP is designed to detect in a dataset. This knowledge is important to understand the weaknesses of DP and devise a way to overcome the weaknesses.
3 Defining clusters based on higher-density nearest neighbours: η-linked clusters
In this section, we formally define a type of clusters called η-linked clusters, where η(x) stands for x's nearest neighbour with higher density. The main symbols and notations used are listed in Table 1.
x, y : a point in D
D = {x_1, ..., x_n} : a d-dimensional dataset with n points
g : the point with the highest density in D
C, C_i, C_j, C_k : a cluster (a group of points)
m_i : the mode (point of the highest density) in a cluster C_i
ρ(x) : density of x
ρ̂(x) : estimated density of x via an ε-neighbourhood estimator
d(x, y) : the distance between two points x and y
N(x) : ε-neighbourhood of x
ε : radius of neighbourhood
τ : density threshold
v_ε : volume of a d-dimensional ball of radius ε
η(x) : x's nearest neighbour which has a higher density than x
η̂(x) : x's nearest density-connected neighbour which has a higher density than x
path(x, y) : an η-linked path connecting x and y
path_c(x, y) : an η̂-density-connected path connecting x and y
ℓ(path(x, y)) : length of path(x, y)
ℓ(path_c(x, y)) : length of path_c(x, y)
δ(x) : the distance between x and η(x)
δ̂(x) : the distance between x and η̂(x)
Let D = {x_1, ..., x_n} denote a dataset of n points, each sampled independently and identically from the same (unknown) distribution. Let N(x) be the ε-neighbourhood of x, N(x) = {y ∈ D | d(x, y) ≤ ε}, where d(·, ·) is the distance function; and ε is a user-defined constant.
In general, the true density of x, i.e., ρ(x), can be estimated via an ε-neighbourhood estimator (as used by the density-based clustering algorithms DBSCAN ester1996density and DP rodriguez2014clustering ):

ρ̂(x) = |N(x)| / (n v_ε)    (1)

where v_ε is the volume of a d-dimensional ball of radius ε. Note that since n v_ε is the same constant for every point, it can be omitted in practice. As a result, many density-based clustering algorithms use the number of points in the ε-neighbourhood, |N(x)|, as a proxy for the estimated density.
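The count-based proxy can be computed as follows (a naive O(n²) sketch; the function name is our own):

```python
import math

def neighbourhood_density(points, eps):
    """Count-based proxy for the eps-neighbourhood density estimate.
    The constant factor 1/(n * v_eps) is dropped, since dropping it
    does not change the density ranking of the points."""
    n = len(points)
    return [sum(1 for j in range(n)
                if j != i and math.dist(points[i], points[j]) <= eps)
            for i in range(n)]
```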
Let g = argmax_{x ∈ D} ρ̂(x) denote the point with the global maximum density. For each point x ∈ D ∖ {g}, let η(x) be x's nearest neighbour which has a higher density than x, i.e.,

η(x) = argmin_{y ∈ D : ρ̂(y) > ρ̂(x)} d(x, y)    (2)
Assume we have k cluster modes (which will be defined in Definition 3). Each of the points in D is then assigned to the same cluster as its nearest neighbour of higher density; and the points which are path-linked or transitively path-linked to the same closest cluster mode, in terms of their path length, are grouped into the same cluster. Such a path is defined as follows:
Definition 1.
An η-linked path connecting points x and y, path(x, y), is defined as a sequence of the smallest number of unique points a_1, a_2, ..., a_p starting with a_1 = x and ending with a_p = y, where a_{i+1} = η(a_i) for i = 1, ..., p − 1.
Definition 2.
The length of path(x, y) is defined as

ℓ(path(x, y)) = Σ_{i=1}^{p−1} d(a_i, a_{i+1})    (3)

where p is the number of points along the path from x to y.
Note that ℓ(path(x, y)) = 0 if x = y; and ℓ(path(x, y)) = ∞ if no η-linked path exists between x and y.
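Definitions 1 and 2 can be read operationally: follow the η pointers from x, accumulating the step distances, until y is reached. A small sketch (our own function name; η is assumed to be given as an acyclic dictionary mapping each non-peak point to its nearest higher-density neighbour):

```python
import math

def eta_path_length(points, eta, x, y):
    """Length of the eta-linked path from x to y: follow each point's
    nearest higher-density neighbour, summing the step distances.
    Returns math.inf if y is never reached (the chain ends at the
    global density peak, which has no eta entry)."""
    length, cur = 0.0, x
    while cur != y:
        nxt = eta.get(cur)
        if nxt is None:  # reached the global peak without meeting y
            return math.inf
        length += math.dist(points[cur], points[nxt])
        cur = nxt
    return length
```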
Definition 3.
An η-linked cluster C_i, which has only one mode m_i, is a maximal set of points having the shortest path length to its cluster mode with respect to the other cluster modes, i.e., C_i = {x ∈ D | ℓ(path(x, m_i)) < ℓ(path(x, m_j)), ∀ j ≠ i}.
As a result of the above definitions, three lemmas are stated as follows:
Lemma 1.
Every point x in D has a path to g, i.e., ∀ x ∈ D, ℓ(path(x, g)) < ∞.
Lemma 2.
Given a dataset with non-overlapping clusters, every cluster C_i is an η-linked cluster if the density distribution of every cluster is a monotonically decreasing function having its only mode at m_i, i.e., ∀ x, y ∈ C_i, d(x, m_i) < d(y, m_i) ⇒ ρ(x) > ρ(y).
Figure 2(a) illustrates an example of two clusters having monotonically decreasing density functions. After selecting Peak 1 and Peak 2 as the cluster modes and assigning the rest of the points based on Definition 3, the red dashed line indicates the boundary between the two clusters.
Lemma 3.
Given a dataset with non-overlapping clusters, if the density distribution of at least one cluster C_i is a monotonically non-increasing function having its only mode at m_i, i.e., ∀ x, y ∈ C_i, d(x, m_i) < d(y, m_i) ⇒ ρ(x) ≥ ρ(y), then C_i may not be an η-linked cluster.
This occurs when not every point in C_i has the shortest path length to its own cluster mode. An example is shown in Figure 2(b), where some points to the left of the boundary are assigned to the left cluster, though they should belong to the right cluster. A dataset which has only clusters with uniform density distributions does not contain η-linked clusters for the same reason.
In practice, a dataset having clusters with slow rates of density decrease from the modes can raise the same issue, even if the density distribution of every cluster is a monotonically decreasing function, because a density estimator may identify multiple local peaks for such a cluster due to estimation errors.
In addition, a dataset having clusters with multiple local peaks may not contain η-linked clusters when the local peaks do not have the shortest paths to the correct modes. Figures 2(c) and 2(d) illustrate two example density distributions with two clusters, where one cluster has a local peak (Peak 3) in addition to the cluster mode (Peak 2). Peak 3 in Figure 2(c) has the shortest path to Peak 2, and it is assigned to the same cluster as Peak 2. However, Peak 3 in Figure 2(d) has the shortest path to Peak 1, because its nearest neighbour with higher density is a point which has the shortest path to Peak 1. Note that this occurs even if Peak 3 is just a tiny bump in the data distribution.
In a nutshell, η-linked clusters are sensitive to (i) the distance separation between clusters; and (ii) the data density distribution.
Based on the first two lemmas and Definition 3, we formulate a theorem for any algorithm which identifies η-linked clusters as follows:
Theorem 4.
Given a dataset with η-linked clusters, a necessary condition for a clustering algorithm implementing Definition 3 to correctly identify all clusters is that it must correctly select the mode in each and every cluster.
Proof.
A violation of the condition means at least one of the two following situations will occur:

If a point, other than the true mode, is selected by the algorithm as a cluster mode, then the cluster will be split into at least two clusters. This is because all cluster members with higher density than the selected cluster mode do not have the shortest path to this mode and will be assigned to other cluster(s).

If no point in a cluster is selected as a cluster mode, then each member of this cluster will be linked to a different cluster with a mode having the shortest path from this member.
Therefore, in either situation, not all clusters are identified correctly by this clustering algorithm. ∎
4 DP: an algorithm which identifies η-linked clusters
Density Peak or DP rodriguez2014clustering has two main procedures as follows:

Identifying k cluster modes via a ranking scheme which ranks all points. The top k points are selected as the modes of the k clusters.

Linking each non-mode point to its nearest neighbour with higher density. The points directly linked or transitively linked to the same cluster mode are assigned to the same cluster. This produces k clusters at the end of the process.
Therefore, DP is an algorithm implementing Definition 3 to identify η-linked clusters in a dataset (in step 2 above).
The first step is critical, as specified in Theorem 4. To effectively identify cluster modes, DP assumes that different cluster modes have a relatively large distance between them, in order to detect well-separated clusters rodriguez2014clustering . To prevent a cluster from breaking into multiple clusters when it has a slow rate of density decrease from the cluster mode, DP applies a ranking scheme to all points and then selects the top k points as cluster modes. This is done as follows.
DP defines a distance function δ as follows:

δ(x) = d(x, η(x)) if x ≠ g;  δ(g) = max_{y ∈ D} d(g, y)    (4)

DP selects the top k points with the highest γ(x) = ρ̂(x) × δ(x) as cluster modes. This means that each cluster mode should have a high density and be far away from the other cluster modes.
4.1 Weaknesses of DP
While DP generally produces a better clustering result than DBSCAN (see the evaluation conducted by Chen et al. Chen2018 , reported in Appendix C of that paper), we identify two fundamental weaknesses of DP:

Given a dataset of η-linked clusters, if the data distribution is such that the cluster modes are not ranked as the top k points with the highest γ values, then DP cannot correctly identify these cluster modes, as stated in Theorem 4. The source of this weakness is the ranking scheme in step 1 of the DP procedure.
An example is a dataset having two Gaussian clusters and an elongated cluster with two local peaks (the left one is the cluster mode), as shown in Figure 3. DP with k = 3 splits the elongated cluster into two subclusters, because the two local peaks are ranked among the top 3 points with the highest γ values, and it merges the bottom two clusters into one, as shown in Figure 3(a). A better clustering result can be obtained by using a larger k, which yields a correct identification of the two bottom clusters, as shown in Figure 3(b); but it still splits the single top cluster into two. Note that all three clusters would be correctly identified by DP if the three true cluster modes were preselected for DP using a different process. This data distribution is similar to that shown in Figure 2(c), which has valid η-linked clusters.
Figure 3: Clustering result of DP on the 3C dataset (a dataset with three clusters). The colours used in each scatter plot indicate clusters labelled by DP.
Figure 4: Clustering results of DP on two datasets (2O and 2Q), when k is the same as the true number of clusters. The colours used in each scatter plot indicate clusters labelled by DP.
Given a dataset of non-η-linked clusters, DP cannot correctly identify these clusters, because DP can detect η-linked clusters only. The source of this weakness is the limitation of the cluster definition used.
Figure 4 demonstrates two examples where DP fails to correctly detect the clusters. These clusters are non-η-linked clusters, i.e., some points of a cluster do not have an η-linked path to their own cluster mode, but have an η-linked path to the mode of another cluster, even though the two clusters are well separated. Specifically, the 2O dataset in Figure 4(a) has clusters with uniform distributions; and the data distribution of the 2Q dataset in Figure 4(b) has small bumps on the side of the large cluster near the smaller cluster (a data characteristic similar to that shown in Figure 2(d)), as specified in Lemma 3.
4.2 An existing improvement of DP
Chen et al. Chen2018 provide a method called Local Contrast DP (LC-DP), which aims to improve the ranking of cluster modes for detecting all clusters in a dataset with clusters of varied densities, i.e., the condition they discovered under which DP fails to correctly detect all clusters.
LC-DP uses local contrast LC(x), instead of density ρ̂(x), in Equation 2 to determine η(x). LC(x) is defined to be the number of times that x has a higher density than its K nearest neighbours, which has values between 0 and K. The ranking is then based on γ_LC(x) = LC(x) × δ_LC(x), where δ_LC is the version of δ based on the η which is determined by LC. Chen et al. Chen2018 show that the use of LC makes DP more robust to clusters with varied densities.
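As an illustration, LC can be computed naively as follows (a brute-force sketch with our own function name; K is the number of nearest neighbours used by LC, and the densities rho are assumed precomputed):

```python
import math

def local_contrast(points, rho, K):
    """LC(x): the number of x's K nearest neighbours whose density is
    lower than x's own; values range from 0 to K."""
    n = len(points)
    lc = []
    for i in range(n):
        knn = sorted((j for j in range(n) if j != i),
                     key=lambda j: math.dist(points[i], points[j]))[:K]
        lc.append(sum(1 for j in knn if rho[j] < rho[i]))
    return lc
```

Note that, unlike raw density, LC is bounded by K regardless of how dense a cluster is, which is what makes the subsequent ranking less sensitive to density variation across clusters.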
Here we show that LC-DP enhances DP's clustering performance on some clusters with varied densities, e.g., the 2Q and 3C datasets. However, LC does not overcome the two weaknesses of DP mentioned above. For example, LC-DP still fails to identify all clusters on the 2O dataset, which does not have η-linked clusters. Therefore, it is important to design a method to overcome DP's two weaknesses. Here we propose a hierarchical method based on density connectivity with this aim in mind.
We first reiterate the currently known definitions of density connectivity and density-connected clusters in Section 5. Then, we define a new type of clusters based on density connectivity in the following section.
5 Density-connected clusters
Classic density-based clustering algorithms, such as DBSCAN ester1996density , define a cluster based on density connectivity as follows:
Definition 4.
Using an ε-neighbourhood density estimator with a density threshold τ, a point x is density connected with another point y via a sequence of unique points s_1, ..., s_p from D, denoted x ⇌ y, defined as:

x ⇌ y ⟺ ∃ s_1, ..., s_p ∈ D such that s_1 = x, s_p = y; and ∀ i, d(s_i, s_{i+1}) ≤ ε and ρ̂(s_i) ≥ τ    (5)
Definition 5.
A density-connected cluster C, which has a mode m, is a maximal set of points that are density connected with its mode, i.e., C = {x ∈ D | x ⇌ m}.
The following lemma is based on density connectivity:
Lemma 5.
Points in a density-connected cluster C are density connected to each other via the mode m, i.e., ∀ x, y ∈ C, x ⇌ y.
Note that a set of points having multiple modes (of the same peak density) must be density connected together in order to form a single density-connected cluster.
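Definition 4 amounts to a reachability test on the graph that links points of density at least τ lying within ε of each other. A brute-force sketch (our own function name and data representation; densities rho are assumed precomputed):

```python
import math

def density_connected(points, rho, x, y, eps, tau):
    """Check whether x and y are density connected: a chain of points,
    each with density >= tau and each within eps of the next, links
    x to y (a simple graph search over the eps-graph)."""
    if rho[x] < tau or rho[y] < tau:
        return False
    frontier, seen = [x], {x}
    while frontier:
        cur = frontier.pop()
        if cur == y:
            return True
        for j in range(len(points)):
            if (j not in seen and rho[j] >= tau
                    and math.dist(points[cur], points[j]) <= eps):
                seen.add(j)
                frontier.append(j)
    return False
```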
The key characteristic of a density-connected cluster is that the cluster can have an arbitrary shape and size ester1996density . Although an η-linked cluster can have an arbitrary shape and size as well, DP, which detects η-linked clusters, has issues with the two types of data distributions mentioned in Section 4: the η-linked path may link points from different clusters which are separated by low-density regions, e.g., the two circle-clusters in Figure 4(a).
On the other hand, though a clustering algorithm such as DBSCAN, which detects density-connected clusters, does not have the above issues, DBSCAN has difficulty identifying all clusters of varying densities ZHU2016983 , as shown in Figure 1(b) in Section 2.
In a nutshell, η-linked clusters and density-connected clusters have different limitations.
6 η̂-density-connected clusters
To overcome the limitations of (i) η-linked clusters, stated in Sections 3 and 4; and (ii) density-connected clusters, stated in Section 5, we strengthen the η-linked path based on density connectivity as follows:
Definition 6.
An η̂-density-connected path linking points x and y, path_c(x, y), is defined as a sequence of the smallest number of unique points a_1, ..., a_p starting with a_1 = x and ending with a_p = y such that a_{i+1} = η̂(a_i), where η̂(x) is x's nearest density-connected neighbour which has a higher density than x, i.e.,

η̂(x) = argmin_{y ∈ D : ρ̂(y) > ρ̂(x) ∧ x ⇌ y} d(x, y)    (6)
Definition 7.
The length of path_c(x, y) is defined as

ℓ(path_c(x, y)) = Σ_{i=1}^{p−1} d(a_i, a_{i+1})    (7)

Note that ℓ(path_c(x, y)) = 0 if x = y; and ℓ(path_c(x, y)) = ∞ if no η̂-density-connected path exists between x and y.
Definition 8.
An η̂-density-connected cluster C_i, which has only one mode m_i, is a maximal set of density-connected points having the shortest η̂-density-connected path to its cluster mode with respect to the other cluster modes in terms of the path length, i.e., C_i = {x ∈ D | ℓ(path_c(x, m_i)) < ℓ(path_c(x, m_j)), ∀ j ≠ i}.
Based on these definitions, we have two lemmas as follows:
Lemma 6.
An η-linked cluster becomes an η̂-density-connected cluster if all points in the η-linked cluster are density connected.
In other words, if a dataset has η-linked clusters, it also has η̂-density-connected clusters when all points within each cluster are density connected under proper ε and τ settings.
Lemma 7.
If a dataset has only density-connected clusters and each cluster has only one mode, then all clusters are η̂-density-connected clusters.
This is because, when a dataset has only density-connected clusters, all points in each cluster are density connected, while any two points from different clusters are not. Then every point in each cluster has the shortest η̂-density-connected path to its own cluster mode with respect to the other cluster modes. Therefore, all clusters are η̂-density-connected clusters.
With a proper neighbourhood radius ε and density threshold τ, well-separated clusters cannot be linked together as a single η̂-density-connected cluster. This enables us to identify clusters which are density-connected but not η-linked in a dataset. Figure 5 illustrates the cluster boundaries of two clusters after selecting Peak 1 and Peak 2 as cluster modes and assigning the rest of the points based on Definition 8. It can be seen that all these clusters can now be identified as η̂-density-connected clusters, although the clusters in Figures 5(c) and 5(d) are not η-linked clusters.
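Operationally, η̂(x) restricts the candidate neighbours of Equation 2 to points in the same connected component of the graph that links points of density at least τ lying within ε of each other. A naive sketch (our own function name; it returns no entry for points, such as component peaks, that have no density-connected neighbour of higher density):

```python
import math

def eta_hat(points, rho, eps, tau):
    """x's nearest higher-density neighbour restricted to points that
    are density connected with x (same component of the eps-graph
    over points with density >= tau)."""
    n = len(points)
    comp = {}
    for i in range(n):  # label components by flood fill
        if rho[i] >= tau and i not in comp:
            comp[i] = i
            stack = [i]
            while stack:
                cur = stack.pop()
                for j in range(n):
                    if (j not in comp and rho[j] >= tau
                            and math.dist(points[cur], points[j]) <= eps):
                        comp[j] = i
                        stack.append(j)
    result = {}
    for i in comp:
        cands = [j for j in comp
                 if comp[j] == comp[i] and rho[j] > rho[i]]
        if cands:
            result[i] = min(cands,
                            key=lambda j: math.dist(points[i], points[j]))
    return result
```

In the test below, the two components {0, 1, 2} and {3, 4} are never linked across the gap, so η̂ pointers stay within each component.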
7 An η̂-density-connected hierarchical clustering algorithm
Here we propose an η̂-density-connected hierarchical clustering algorithm. It is described in two subsections. In Section 7.1, we introduce a different view of the (flat) DP as a hierarchical procedure. Instead of employing a decision graph to rank points, as proposed in the original DP paper rodriguez2014clustering , the proposed hierarchical procedure merges clusters bottom-up to produce a dendrogram. The dendrogram enables a user to identify clusters in a hierarchical way, which cannot be done with the current flat DP procedure.
In Section 7.2, we describe how the hierarchical DP procedure is modified to identify η̂-density-connected clusters based on Definition 8.
7.1 A different view of DP: a hierarchical procedure
We show that DP clustering rodriguez2014clustering can be accomplished as a hierarchical clustering, and that the two clustering procedures produce exactly the same flat clustering result when the same k is used. If we run DP n times by setting k = n, n − 1, ..., 1, we get a bottom-up clustering result. To avoid running DP n times, which would have a time complexity of O(n³), we propose a hierarchical procedure as follows.
The initialisation step in the hierarchical DP is as follows. After calculating η(x) and δ(x) for all points, let every point be a cluster mode (which is equivalent to running DP with k = n); each cluster mode x is tagged with its δ(x). Let S = D ∖ {g} be the set of modes used for merging in the next step.
The first merging of two clusters (which is equivalent to running DP with k = n − 1) is conducted as follows. Select the cluster whose mode x has the smallest δ value in S; this cluster is merged with the cluster containing η(x), and x is then removed from S. The above merging process is repeated iteratively, merging two clusters at each iteration, until S = ∅.
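The procedure above can be sketched compactly: since each mode x in S is merged exactly once, in ascending order of δ(x), sorting by δ yields the entire merge sequence (the dendrogram) at once, and a flat k-clustering follows by keeping the k − 1 modes with the largest δ. This is a sketch under our own naming, with η and δ assumed precomputed as dictionaries keyed by the non-peak points:

```python
def hdp_merge_order(eta, delta):
    """Hierarchical DP: every point starts as its own cluster; the
    cluster whose mode x has the smallest delta is merged into the
    cluster containing eta(x). Sorting by delta gives the whole
    bottom-up merge sequence."""
    return [(x, eta[x]) for x in sorted(eta, key=lambda x: delta[x])]

def cut_dendrogram(eta, delta, g, k):
    """Flat k-clustering from the dendrogram: keep the k - 1 modes
    with the largest delta (plus the global peak g); every other
    point follows its eta pointer to a kept mode."""
    modes = set(sorted(eta, key=lambda x: -delta[x])[:k - 1]) | {g}
    label = {m: c for c, m in enumerate(sorted(modes))}
    def assign(x):
        if x not in label:
            label[x] = assign(eta[x])
        return label[x]
    for x in eta:
        assign(x)
    return label
```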
Figure 6 illustrates the clustering results produced by the hierarchical DP as dendrograms on the three datasets used in Figures 3 and 4. Figure 6(a) shows that the elongated cluster is split at the top level of the dendrogram. The dendrogram in Figure 6(c) shows that points from the two circles are (incorrectly) merged at low levels of the hierarchical structure. Figure 6(e) illustrates that many points from the sparse-and-large cluster are linked to the dense-and-small cluster. The flat clustering results extracted from the dendrograms produced by the hierarchical DP are the same as those produced by the flat DP, shown in Figures 3 and 4.
Because the cluster modes are selected according to the ranking of δ, the clustering result with k clusters can be obtained by setting an appropriate threshold on δ in the bottom-up hierarchical clustering result, such that the number of clusters below the threshold is k. Since both the hierarchical DP and the flat (original) DP produce the same flat clustering result, the name DP is used hereafter to denote both versions, as far as the flat clustering result is concerned.
Advantages of the hierarchical DP: There are two advantages of the hierarchical DP over the flat DP. First, the former avoids the need to select cluster modes in the first step of the clustering process. Instead, after the dendrogram is produced at the end of the hierarchical clustering process, is required only if a flat clustering is to be extracted from the dendrogram. Second, the dendrogram produced by the hierarchical DP provides a richer information of the hierarchical structure of clusters in a dataset than a flat partitioning provided by the flat DP.
The hierarchical DP has the same time complexity as the flat DP, i.e., O(n²), since ρ and δ are calculated for all points only once.
7.2 A density-connected hierarchical DP
In order to enhance DP to detect clusters from a larger set of data distributions than that covered by linked clusters, the density-connected clusters based on Definition 8 are used.
Using the hierarchical DP, it turns out that only a simple rule needs to be incorporated, i.e., to check whether two cluster modes at the current level are density-connected before merging them at the next level in the hierarchical structure: two clusters can only be merged if there is a density-connected path between their cluster modes. This is checked at each level of the hierarchy, where the procedure selects the cluster having the mode with the smallest γ to merge with another density-connected cluster, where
(8) 
We call the new algorithm DCHDP, as it employs this cluster-merging rule based on density connectivity within the hierarchical DP procedure. The DCHDP algorithm is shown in Algorithm 1.
Note that there may exist points which are not density-connected to any cluster mode. At the end of line 12, such points become isolated points.
Also note that it is possible to have more than one cluster at the end of line 12, if the clusters are not density-connected to each other. Therefore, these clusters are merged with all remaining isolated points to yield only one cluster at the top of the hierarchical structure (see line 13 in Algorithm 1).
Once the dendrogram is obtained from Algorithm 1, a global threshold can be used to extract a flat clustering result.
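A flat clustering can be read off the dendrogram by applying, in order, only the merges that occur below the chosen threshold. A minimal sketch, assuming the merge sequence is recorded together with non-decreasing merge heights (the names are ours):

```python
def cut_dendrogram(n, merges, heights, tau):
    """Return a flat clustering of n points: apply each merge (src, dst)
    whose height is below the global threshold tau."""
    label = list(range(n))
    for (src, dst), h in zip(merges, heights):
        if h > tau:        # heights are non-decreasing up the dendrogram,
            break          # so every later merge is above the cut as well
        label = [dst if x == src else x for x in label]
    return label
```

For example, with merges [(1, 0), (3, 2), (2, 0)] at heights [0.1, 0.2, 0.9] and tau = 0.5, the last merge is cut off and two clusters remain.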
Note that the algorithm for the hierarchical DP is the same as the DCHDP algorithm, except that the former skips the density-connectivity check when merging two clusters, and it operates on the whole dataset without the global peak point.
Compared with DP, DCHDP has one more parameter, MinPts, used for the density-connectivity check; and the same ε is used for both density estimation and the density-connectivity check. DCHDP maintains the same time complexity as DP, i.e., O(n²).
DCHDP has the ability to enhance the clustering performance of DP on a dataset having density-connected clusters, which encompass the two kinds of clusters DP is weak at, mentioned in Section 3. This is because DCHDP does not establish any density-connected path between points from different clusters which are not density-connected. Since clusters which are not density-connected are merged only at the top of the dendrogram, a global threshold can separate all these clusters. Unlike DBSCAN, DCHDP does not rely on a global density threshold to link points; thus, DCHDP has the ability to detect clusters with varied densities.
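The density-connectivity check itself amounts to a graph reachability test: two modes are density-connected if a chain of ε-close points of sufficient density links them. A sketch of such a check (the names are ours; `core` is assumed to be a boolean mask of points whose density reaches MinPts):

```python
import numpy as np
from collections import deque

def density_connected(dist, eps, core, a, b):
    """Breadth-first search for an eps-path from point a to point b,
    stepping only through core points."""
    seen, queue = {a}, deque([a])
    while queue:
        u = queue.popleft()
        if u == b:
            return True
        for v in np.where((dist[u] <= eps) & core)[0]:
            v = int(v)
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return False
```

Because no global density threshold is involved, the same check links points within a sparse cluster as readily as within a dense one.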
In a nutshell, DCHDP takes advantage of the individual strengths of DBSCAN and DP: it has an enhanced ability to identify clusters of arbitrary shapes and varied densities, which neither DBSCAN nor DP has.
Figure 7 illustrates the clustering results produced by DCHDP as dendrograms on the three datasets used in Figures 3 and 4. It shows that all clusters can be detected perfectly by DCHDP when an appropriate threshold (blue horizontal line) is used on the dendrogram.
Furthermore, DCHDP has an additional advantage in comparison with DP and DBSCAN: the dendrogram produced has a rich structure showing clusters at different levels. This is the advantage of a hierarchical clustering over a flat clustering.
8 Empirical evaluation
This section presents experiments designed to evaluate the effectiveness of DCHDP. We compare DCHDP with three state-of-the-art density-based clustering algorithms (DBSCAN ester1996density , DP rodriguez2014clustering and LCDP Chen2018 ), two recent state-of-the-art hierarchical clustering algorithms (PHA LU20131227 and HML 7440832 ), and two density-based hierarchical clustering algorithms (HDBSCAN Campello:2015:HDE and OPTICS ankerst1999optics ).
The clustering performance is measured in terms of F-measure. (It is worth noting that other evaluation measures such as purity and Normalized Mutual Information (NMI) strehl2002cluster only take into account the points assigned to clusters and do not account for noise. A clustering algorithm which assigns the majority of the points to noise may thus obtain a high score. The F-measure is therefore more suitable than purity or NMI in assessing the clustering performance of density-based clustering algorithms, e.g., DBSCAN and OPTICS.) Given a clustering result, we calculate the precision score and the recall score for each cluster based on the confusion matrix; the F-measure score of a cluster is the harmonic mean of its precision and recall. The overall F-measure score is the unweighted average over all clusters.
We used 4 artificial datasets (2O, 3C, 3G and 2Q) and 10 real-world datasets with different data sizes and dimensions from the UCI Machine Learning Repository Lichman:2013 . Table 2 presents the properties of these datasets.
Dataset  Data Size  #Dimensions  #Classes

3C  900  2  3 
2Q  1100  2  2 
3G  1500  2  3 
2O  1500  2  2 
Iris  150  4  3 
GPS  163  6  2 
Sonar  208  60  2 
Haberman  306  3  2 
Ecoli  336  7  8 
Liver  345  6  2 
Wilt  500  5  2 
Control  600  60  6 
Breast  699  9  2 
Segment  2310  19  7 
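The F-measure computation described above can be sketched as follows, assuming (as one common convention; this matching step is our assumption, not spelled out in the text) that each cluster is scored against its majority class in the confusion matrix:

```python
import numpy as np

def overall_f_measure(conf):
    """conf[i][j] = number of points of true class j placed in cluster i.
    Per-cluster F is the harmonic mean of precision and recall; the overall
    score is the unweighted average over clusters."""
    conf = np.asarray(conf, dtype=float)
    scores = []
    for row in conf:
        j = int(np.argmax(row))                  # majority class of this cluster
        tp = row[j]
        precision = tp / row.sum() if row.sum() else 0.0
        col = conf[:, j].sum()
        recall = tp / col if col else 0.0
        s = precision + recall
        scores.append(2 * precision * recall / s if s else 0.0)
    return float(np.mean(scores))
```

A perfect two-cluster result, e.g., a diagonal confusion matrix, yields an overall F-measure of 1.0.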
All algorithms used in our experiments were implemented in Matlab. (The source codes of all algorithms used in our experiments can be obtained at: DBSCAN, DCHDP, LCDP and DP: https://sourceforge.net/projects/hierarchicaldp/ ; HDBSCAN: https://goo.gl/i86JPJ ; OPTICS: https://github.com/alexgkendall/OPTICS_Clustering ; HML: https://goo.gl/BjtWAA ; PHA: https://goo.gl/fGD5p6 .) The experiments were run on a machine with eight cores (Intel Core i7-7820X 3.60GHz) and 32GB memory. Before the experiments began, all datasets were normalised using min-max normalisation so that each attribute is in [0,1].
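The normalisation step can be sketched as a per-attribute min-max scaling into [0,1] (the guard for constant attributes is our addition):

```python
import numpy as np

def min_max_normalise(X):
    """Scale each attribute (column) of X into [0, 1]."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero
    return (X - lo) / span
```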
For DP, we normalised both ρ and δ to be in [0,1] before selecting cluster modes so that these two parameters have the same weight in their product γ. We report the best clustering performance within a range of parameter search for each algorithm. Table 3 lists the parameters and their search ranges for each algorithm. Note that the parameter ξ in OPTICS is used to identify downward and upward areas of the reachability plot in order to extract all clusters using a bottom-up hierarchical method ankerst1999optics . For all algorithms using the ε-neighbourhood for density estimation, ε is set as a fraction of the maximum pairwise distance. Note that for DCHDP, the parameter k is only required to extract k clusters from the dendrogram (at the end of Algorithm 1) by setting a corresponding threshold.
Algorithm  Parameter with search range 

HML  
PHA  
DBSCAN  ; 
HDBSCAN  ; 
OPTICS  ; 
DP  ; 
LCDP  ; ; 
DCHDP  ; ; 
Table 4 shows the best F-measures of the 8 algorithms. In terms of the average F-measure (shown in the 'Average' row), DCHDP has the highest average F-measure of 0.82. The average F-measures of LCDP, DP and OPTICS are 0.79, 0.76 and 0.76, respectively. HDBSCAN and DBSCAN have average F-measures of 0.67 and 0.66, respectively. HML and PHA have the lowest average F-measures of 0.51 and 0.61, respectively.
Data  HML  PHA  DBSCAN  HDBSCAN  OPTICS  DP  LCDP  DCHDP 

3C  0.77  1.00  1.00  1.00  1.00  0.92  1.00  1.00 
2Q  0.87  0.84  1.00  1.00  1.00  0.97  1.00  1.00 
3G  0.67  0.92  0.67  0.92  0.98  0.99  0.99  0.997 
2O  0.48  0.51  1.00  1.00  1.00  0.68  0.84  1.00 
Iris  0.22  0.88  0.85  0.83  0.85  0.97  0.97  0.96 
GPS  0.75  0.75  0.75  0.68  0.76  0.81  0.81  0.83 
Sonar  0.59  0.38  0.38  0.38  0.67  0.67  0.66  0.69 
Haberman  0.61  0.52  0.47  0.42  0.63  0.56  0.57  0.628 
Ecoli  0.23  0.53  0.37  0.46  0.44  0.50  0.58  0.67 
Liver  0.48  0.37  0.37  0.37  0.66  0.56  0.53  0.65 
Wilt  0.39  0.39  0.38  0.39  0.53  0.54  0.55  0.58 
Control  0.08  0.50  0.53  0.60  0.64  0.72  0.77  0.74 
Breast  0.90  0.43  0.82  0.72  0.84  0.97  0.96  0.97 
Segment  0.04  0.58  0.59  0.65  0.69  0.78  0.77  0.74 
Average  0.51  0.61  0.66  0.67  0.76  0.76  0.79  0.82 
# Best  0  1  3  3  5  2  4  8 
Average rank  6.54  6.07  5.86  5.68  3.57  3.46  2.86  1.96 
Among the 8 algorithms, DCHDP was the top performer on 8 out of 14 datasets, followed by OPTICS, which was the top performer on 5 datasets. HML obtained very low F-measures on many datasets, e.g., the Iris, Ecoli, Control and Segment datasets. This is because HML could not separate different clusters properly.
Notably, on three synthetic datasets (3C, 2Q and 2O), only DCHDP, OPTICS, HDBSCAN and DBSCAN identified all clusters perfectly, while the other algorithms failed on at least one of these datasets. LCDP could not obtain a perfect result on the 2O dataset because it failed to separate the two circles.
For the 3G dataset, which has clusters with varied densities, DBSCAN and HML obtained the lowest F-measure of 0.67. HDBSCAN achieved an F-measure of only 0.92 on this dataset since it assigned many high-density points to noise. While OPTICS, DP, LCDP and DCHDP all have near-perfect clustering results, DCHDP has the best result.
9 Discussion
9.1 Relation to existing hierarchical clustering algorithms
Among existing hierarchical clustering methods, DCHDP is closest to traditional agglomerative methods, in that they all employ the bottom-up strategy to merge two clusters at each level, successively building a tree structure.
However, DCHDP and the hierarchical DP differ from traditional agglomerative methods in two ways. First, when a traditional agglomerative method merges two clusters, they are selected based on a set-based dissimilarity measure. Different measures can be used to select the two clusters to merge, trading off quality versus efficiency, such as the single-linkage, complete-linkage, all-pairs linkage and centroid-linkage measures aggarwal2013data . In contrast, DCHDP and the hierarchical DP do not simply employ a dissimilarity measure to determine the two clusters to merge at each level. Instead, they first identify the cluster whose mode has the smallest γ, and then select another cluster which has the shortest path length to it. While the path length may be considered a kind of dissimilarity measure, it is a supporting measure; the key determinant is γ.
Second, different from traditional methods, DCHDP and the hierarchical DP constitute a new agglomerative approach, detecting density-connected clusters and linked clusters, respectively. Therefore, they can detect arbitrarily shaped clusters, while existing agglomerative methods generally detect clusters with specific shapes, e.g., the single-linkage measure tends to output elongated clusters, the complete-linkage measure tends to detect compact clusters, and the all-pairs linkage and centroid-linkage measures tend to find globular clusters aggarwal2013data .
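For contrast, the set-based linkage measures mentioned above can be sketched in a few lines (our own minimal implementations; A and B are arrays of points, one row per point):

```python
import numpy as np

def linkage_dissimilarity(A, B, method="single"):
    """Set-based dissimilarity between clusters A and B."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    if method == "single":       # closest pair: favours elongated clusters
        return float(d.min())
    if method == "complete":     # farthest pair: favours compact clusters
        return float(d.max())
    if method == "all_pairs":    # mean over all pairs: favours globular clusters
        return float(d.mean())
    if method == "centroid":     # distance between cluster centroids
        return float(np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)))
    raise ValueError(f"unknown method: {method}")
```

Each measure reduces two point sets to a single number, which is what makes the shape bias of each linkage unavoidable.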
The standard algorithm for hierarchical agglomerative clustering has a time complexity of O(n³) and a space complexity of O(n²) Day1984 . However, some efficient hierarchical agglomerative clustering approaches have the same quadratic time complexity and space complexity as DCHDP, e.g., SLINK Sibson1973 for single-linkage and CLINK SLINK1977 for complete-linkage clustering.
HML 7440832 and PHA LU20131227 are two recently proposed state-of-the-art hierarchical agglomerative clustering algorithms. They employ similarity measures which are different from those in the traditional agglomerative methods mentioned above. HML uses a maximum likelihood method to identify the two most similar clusters to be merged at each level. It was designed for biological and genomic data and performs well on data that follows a Gaussian distribution. PHA measures the similarity between two clusters based on a hypothetical potential field that relies on both local and global data distribution information. It can detect slightly overlapping clusters with non-spherical shapes in noisy data. However, compared with density-based methods (e.g., DBSCAN, HDBSCAN, OPTICS and DCHDP), both HML and PHA performed much worse in detecting arbitrarily shaped clusters, e.g., on the 2Q and 2O datasets.
There is another class of algorithms which produces an initial set of subclusters from the data before applying a hierarchical clustering. For example, CHAMELEON 781637 produces a nearest-neighbour graph from the data and then breaks the graph into many small subgraphs (as subclusters). An agglomerative method is finally used to merge subclusters iteratively based on a similarity measure. The same general approach is used in two more recent methods, i.e., HDBSCAN Campello:2015:HDE and OPTICS ankerst1999optics , though different methods are used to produce the subclusters in the preprocessing step before building a hierarchical structure on them.
We show that DCHDP is a much simpler yet effective approach than this class of algorithms, because DCHDP applies agglomerative clustering directly to individual points in the given dataset without any preprocessing to create subclusters. Section 8 shows that DCHDP produces significantly better clustering results than the most recent representative of this class of algorithms, i.e., HDBSCAN, as well as OPTICS.
9.2 Parameter settings
DCHDP requires two parameters, ε and MinPts, to build a dendrogram from a dataset, as shown in Table 3; ε is more important as it is used in both the density estimation and the density-connectivity check. In our experiments, we found that MinPts can be set to 2 (i.e., at least 2 points in the neighbourhood) for most datasets in terms of getting the best clustering results.
In all empirical evaluations reported in Section 8, we used the same ε for both the density estimation and the density-connectivity check in DCHDP. However, ε can be split into two different parameters, one for each of the two processes. By doing so, we found that DCHDP can perform even better than the results shown in Table 4 on some datasets.
9.3 Ability to detect noise
It is worth mentioning that density-based clustering has the ability to identify noise and then filter it out in clustering. For example, DBSCAN uses a global density threshold to identify noise as points with a density lower than the threshold in the first step of the algorithm ester1996density . DP rodriguez2014clustering employs a different method, identifying noise as points with low densities at the border regions of clusters (see footnote 4 for details); this is conducted at the end of the clustering process. The same method can be used by DCHDP to identify noise. In addition, clusters having few points can also be treated as noise, as in DBSCAN.
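As an illustration of the first (DBSCAN-style) option, a global density threshold can flag noise directly. A minimal sketch (the names are ours):

```python
import numpy as np

def noise_mask(X, eps, min_pts):
    """Flag as noise every point with fewer than min_pts points
    (including itself) in its eps-neighbourhood."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    density = (d <= eps).sum(axis=1)   # global density threshold = min_pts
    return density < min_pts           # boolean mask: True marks noise
```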
10 Conclusions
The lack of a definition of the kind of clusters that the state-of-the-art density-based algorithm Density Peak (DP) can detect has motivated the work in this paper.
We formally defined two new kinds of clusters: linked clusters and density-connected clusters. A further analysis revealed that DP is a clustering algorithm detecting linked clusters, and that it has weaknesses in data distributions which contain a special kind of linked clusters or some non-linked clusters. We showed that density-connected clusters encompass all linked clusters as well as the kind of non-linked clusters that DP fails to detect.
After showing that DP clustering can be accomplished as a hierarchical clustering, we proposed a density-connected hierarchical DP clustering algorithm called DCHDP, which is designed to detect density-connected clusters.
By taking advantage of the individual strengths of DBSCAN and DP, DCHDP produces clustering outputs which are superior in two key aspects. First, DCHDP has an enhanced ability to identify clusters of arbitrary shapes and varied densities, which neither DBSCAN nor DP has. Second, the dendrogram generated by DCHDP gives richer information about the hierarchical structure of clusters in a dataset than the flat partitioning provided by DBSCAN and DP. DCHDP achieves this enhanced ability with the same time complexity as DP, and with only one additional parameter, which can usually be set to a default value in practice.
Our empirical evaluation validates this superiority by showing that DCHDP produces the best clustering results on 14 datasets in comparison with 7 state-of-the-art clustering algorithms, including density-based clustering algorithms, i.e., DBSCAN, DP and an existing improvement upon DP; and density-based hierarchical clustering algorithms, i.e., HDBSCAN and OPTICS.
References

(1) Z.-H. Zhou, Three perspectives of data mining, Artificial Intelligence 143 (1) (2003) 139–146.
 (2) J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques, 3rd Edition, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2011.

(3) L. Kaufman, P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990.
 (4) M. J. Zaki, J. Wagner Meira, Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, 2014.
(5) J. A. Hartigan, M. A. Wong, Algorithm AS 136: A k-means clustering algorithm, Journal of the Royal Statistical Society. Series C (Applied Statistics) 28 (1) (1979) 100–108.
 (6) L. Kaufman, P. Rousseeuw, Clustering by means of medoids, Statistical Data Analysis Based on the L1 Norm and Related Methods (1987) 405–416.
 (7) C. C. Aggarwal, C. K. Reddy, Data Clustering: Algorithms and Applications, Chapman and Hall/CRC Press, 2013.
(8) M. Ester, H.-P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), AAAI Press, 1996, pp. 226–231.

(9) A. Hinneburg, H.-H. Gabriel, DENCLUE 2.0: Fast clustering based on kernel density estimation, in: Proceedings of the 7th International Conference on Intelligent Data Analysis, IDA'07, Springer-Verlag, Berlin, Heidelberg, 2007, pp. 70–80.
(10) Y. Zhu, K. M. Ting, M. J. Carman, Density-ratio based clustering for discovering clusters with varying densities, Pattern Recognition 60 (2016) 983–997.
 (11) A. Rodriguez, A. Laio, Clustering by fast search and find of density peaks, Science 344 (6191) (2014) 1492–1496.
 (12) B. Chen, K. M. Ting, T. Washio, Y. Zhu, Local contrast as an effective means to robust clustering against varying densities, Machine Learning (2018) 1–25.
 (13) A. Sharma, K. A. Boroevich, D. Shigemizu, Y. Kamatani, M. Kubo, T. Tsunoda, Hierarchical maximum likelihood clustering approach, IEEE Transactions on Biomedical Engineering 64 (1) (2017) 112–122.
(14) Y. Lu, Y. Wan, PHA: A fast potential-based hierarchical agglomerative clustering method, Pattern Recognition 46 (5) (2013) 1227–1239.

(15) R. J. G. B. Campello, D. Moulavi, A. Zimek, J. Sander, Hierarchical density estimates for data clustering, visualization, and outlier detection, ACM Transactions on Knowledge Discovery from Data 10 (1) (2015) 5:1–5:51.
(16) M. Ankerst, M. M. Breunig, H.-P. Kriegel, J. Sander, OPTICS: Ordering points to identify the clustering structure, in: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, SIGMOD '99, ACM, New York, NY, USA, 1999, pp. 49–60.
(17) S. Gilpin, I. Davidson, A flexible ILP formulation for hierarchical clustering, Artificial Intelligence 244 (2017) 95–109.
(18) J. Xie, H. Gao, W. Xie, X. Liu, P. W. Grant, Robust clustering by detecting density peaks and assigning points based on fuzzy weighted k-nearest neighbors, Information Sciences 354 (2016) 19–40.
(19) J. Xu, G. Wang, W. Deng, DenPEHC: Density peak based efficient hierarchical clustering, Information Sciences 373 (2016) 200–218.
 (20) Q. Zhang, C. Zhu, L. T. Yang, Z. Chen, L. Zhao, P. Li, An incremental CFS algorithm for clustering large data in industrial internet of things, IEEE Transactions on Industrial Informatics 13 (3) (2017) 1193–1201.
 (21) A. Strehl, J. Ghosh, Cluster ensembles—a knowledge reuse framework for combining multiple partitions, Journal of Machine Learning Research 3 (Dec) (2002) 583–617.

(22) M. Lichman, UCI Machine Learning Repository (2013). URL http://archive.ics.uci.edu/ml
(23) W. H. E. Day, H. Edelsbrunner, Efficient algorithms for agglomerative hierarchical clustering methods, Journal of Classification 1 (1) (1984) 7–24.
(24) R. Sibson, SLINK: An optimally efficient algorithm for the single-link cluster method, The Computer Journal 16 (1) (1973) 30–34.
 (25) D. Defays, An efficient algorithm for a complete link method, The Computer Journal 20 (4) (1977) 364–366.
 (26) G. Karypis, E.H. Han, V. Kumar, Chameleon: hierarchical clustering using dynamic modeling, Computer 32 (8) (1999) 68–75.