CDF Transform-Shift: An effective way to deal with inhomogeneous density datasets

10/05/2018 ∙ by Ye Zhu, et al. ∙ Deakin University, IEEE, Monash University, Federation University Australia

Many distance-based algorithms exhibit bias towards dense clusters in inhomogeneous datasets (i.e., those which contain clusters in both dense and sparse regions of the space). For example, density-based clustering algorithms tend to join neighbouring dense clusters together into a single group in the presence of a sparse cluster; while distance-based anomaly detectors exhibit difficulty in detecting local anomalies which are close to a dense cluster in datasets also containing sparse clusters. In this paper, we propose the CDF Transform-Shift (CDF-TS) algorithm which is based on a multi-dimensional Cumulative Distribution Function (CDF) transformation. It effectively converts a dataset with clusters of inhomogeneous density to one with clusters of homogeneous density, i.e., the data distribution is converted to one in which all locally low/high-density locations become globally low/high-density locations. Thus, after performing the proposed Transform-Shift, a single global density threshold can be used to separate the data into clusters and their surrounding noise points. Our empirical evaluations show that CDF-TS overcomes the shortcomings of existing density-based clustering and distance-based anomaly detection algorithms and significantly improves their performance.


1 Introduction

Many distance-based algorithms have a bias towards dense clusters [2, 5, 13, 26, 36]. For example, (i) DBSCAN [12] is biased towards grouping neighbouring dense clusters into a single cluster in the presence of a sparse cluster [36]. (ii) The kNN anomaly detector [4, 27] employs the kNN density estimator [16, 23] to estimate the density of every point in a dataset, and then ranks all points in ascending order of their estimated density. Low-density points are regarded as more likely to be anomalous than high-density points. Since the ranking is based on global densities, this bias results in the misclassification of so-called "local anomalies" (points considered anomalous with respect to their neighbourhood) in high-density regions as normal points [7, 8, 9].

A number of techniques have been proposed to "correct" this bias, notably, the density-ratio approach of ReCon [36], ReScale [36] and DScale [37] (for density-based clustering such as DBSCAN [14] and DP [29]), and the reachability-distance approach of LOF [7] (for the kNN anomaly detector [27]).

The density-ratio approach [36, 37] aims to transform the data in such a way that the estimated density of each transformed point approximates the density-ratio of that point in the original space, and as a result all locally low-density points are easily separated from all locally high-density points using a single global threshold.

While the current density-ratio methods have been shown to improve the clustering performance of density-based clustering, we identify their shortcomings and propose a new algorithm that better achieves the aim of the density-ratio approach. Furthermore, we extend its application to distance-based anomaly detection.

We propose to perform a multi-dimensional Cumulative Distribution Function (CDF) based transformation on the given dataset as a preprocessing step. The transformation learns a mapping¹ such that the Euclidean distance between pairs of transformed points is "locally consistent" with the Euclidean distance between the corresponding points in the original space (i.e., all local inter-point distances are proportional to their original values), while the final transformed dataset has a uniform distribution, as estimated by a density estimator using a certain bandwidth. This effectively achieves the desired aim of the CDF transform, i.e., density equalisation w.r.t. a density estimator with a certain bandwidth, without impairing the cluster structure in the original dataset.

¹ The mapping depends on all points in the given dataset, where each point asserts a (local and point-based) scaling factor.

This paper makes the following contributions:

  1. It generalises a current CDF scaling method, DScale [37], so that it can be applied to density estimators using uniform kernels with either fixed or variable bandwidths.

  2. It proposes a new Transform-Shift method called CDF-TS (CDF Transform-Shift) that changes the volumes of each point's neighbourhood simultaneously in the full multi-dimensional space. Existing CDF transform methods are either individual-attribute based (e.g., ReScale [36]) or individual-point based (e.g., DScale [37]). As far as we know, this is the first attempt to perform a multi-dimensional CDF transform to achieve the desired effect of homogenising (making uniform) the distribution of an entire dataset w.r.t. a density estimator.

  3. It applies the new Transform-Shift method to density-based clustering and distance-based anomaly detection to demonstrate its impact in overcoming the weaknesses of three existing algorithms.

The proposed approach CDF-TS differs from existing approaches in terms of:

  1. Methodology: While ReScale [36], DScale [37] and the proposed CDF-TS follow the same principled approach and have the same aim, the proposed CDF Transform-Shift method is a multi-dimensional technique which incorporates both transformation and point-shift, which greatly expands the approach's applicability. In contrast, ReScale is a one-dimensional transformation and point-shift technique; and DScale incorporates transformation without point-shift. The 'complete' CDF-TS method leads to a better result in both clustering and anomaly detection, as we show in Section 6.

  2. Metric compliance: CDF-TS and ReScale create a similarity matrix which is a metric; whereas the one produced by DScale [37] is not a metric, i.e., it satisfies neither symmetry nor the triangle inequality.

  3. Ease of use: Like ReScale, CDF-TS transforms the data as a preprocessing step. The targeted (clustering or anomaly detection) algorithm can then be applied unaltered to the transformed data. In contrast, DScale requires an algorithmic modification in order to be used.

To demonstrate the effects of different dissimilarity measures on image segmentation, we took the image shown in Figure 1(a), applied different rescaling methods (no rescaling, ReScale, DScale and CDF-TS), and clustered the 3-dimensional pixel values of the image in LAB space [32]. Table 1 shows the results of applying the clustering algorithm DP [29] to produce three clusters. The scatter plots in LAB space in Figure 1 show that, with CDF-TS, the three clusters become more uniformly distributed with similar densities, and the gaps between their boundaries are much larger than with the other three scaling methods. As a result, DP with CDF-TS yields the best clustering result, as shown in Table 1.

(a) Image
(b) No Scaling
(c) ReScale
(d) DScale
(e) CDF-TS
Figure 1: Clustering results by DP in LAB space of the image shown in (a). The colours in the scatter plots indicate the three clusters identified by DP using each of the four methods. The scatter plots of DScale and No Scaling are based on the original LAB attributes; and the other two are based on the transformed attributes.
Table 1: DP's image segmentation of the image shown in Figure 1(a) (rows: No Scaling, ReScale, DScale, CDF-TS; columns: Clusters 1–3).

The rest of the paper is organised as follows. We describe issues of inhomogeneous density in density-based clustering and distance-based anomaly detection in Section 2. Section 3 presents density-ratio estimation as a principled approach to address the issues of inhomogeneous density. Section 4 presents the two existing CDF scaling approaches based on density-ratio. Section 5 proposes the local-density transform-shift as a multi-dimensional CDF scaling method. Section 6 empirically evaluates the performance of existing density-based clustering and distance-based anomaly detection algorithms on the transformed-and-shifted datasets. Discussion and conclusions are provided in the last two sections.

2 Issues of inhomogeneous density

2.1 Density-based clustering

Let D = {x₁, …, x_n}, x_i ∈ ℝ^d, denote a dataset of n points, each sampled independently from a distribution F. Let f̂(x) denote the density estimate of point x which approximates the true density f(x). In addition, let N_ε(x) be the ε-neighbourhood of x, N_ε(x) = {y ∈ D | dist(x, y) ≤ ε}, where dist(·,·) is a distance function (the Euclidean distance in this paper).

In general, the density of a point x can be estimated via a small ε-neighbourhood (as used by the classic density-based clustering algorithm DBSCAN [14]) defined as follows:

    f̂_ε(x) = |N_ε(x)| / (n · V_d(ε))    (1)

where V_d(ε) is the volume of a d-dimensional ball of radius ε.
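To make Equation 1 concrete, here is a minimal sketch of the ε-neighbourhood density estimator in Python (our own illustration; the paper's released code is in Matlab, and the function names here are ours):

```python
import numpy as np
from math import gamma, pi

def ball_volume(d, radius):
    # Volume of a d-dimensional ball: V_d(r) = pi^(d/2) * r^d / Gamma(d/2 + 1)
    return pi ** (d / 2) * radius ** d / gamma(d / 2 + 1)

def eps_density(X, eps):
    # Equation 1: f_hat_eps(x) = |N_eps(x)| / (n * V_d(eps)); x counts as its own neighbour
    n, d = X.shape
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    return (dist <= eps).sum(axis=1) / (n * ball_volume(d, eps))
```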

A set of clusters is defined as non-empty and non-intersecting subsets C₁, …, C_k ⊂ D. Let m_i denote the mode (the point of the highest estimated density) of cluster C_i; and f̂(m_i) denote the corresponding peak density value.

DBSCAN uses a global density threshold to identify core points (which have densities higher than the threshold); and then it links neighbouring core points together to form clusters [14]. It is defined as follows.

Definition 1.

A core point is a point with an estimated density above or equal to a user-specified threshold τ, i.e., core(x) = 𝟙(f̂_ε(x) ≥ τ), where 𝟙(·) denotes the indicator function.

Definition 2.

Using a density estimator f̂_ε with density threshold τ, a point x is density connected with another point y if there exists a sequence of unique core points p₁ = x, p₂, …, p_m = y from D in which each pair of adjacent points lies in each other's ε-neighbourhood.

Definition 3.

A cluster detected by the density-based algorithm DBSCAN is a maximal set of density-connected instances, i.e., C_i = {x ∈ D | x is density connected with m_i}, where m_i is the cluster mode.

For this kind of algorithm to find all clusters in a dataset, the data distribution must satisfy the following necessary condition: the peak density of each cluster must be greater than the maximum over all pairs of modes of the minimum density along the best path linking the two modes.² This condition is formally described by Zhu et al. [36] as follows:

    min_i f̂(m_i) > max_{j≠l} g_{jl}    (2)

where g_{jl} is the largest of the minimum densities along any path linking the modes m_j and m_l of clusters C_j and C_l.

² A path linking two modes m_j and m_l is defined as a sequence of unique points starting with m_j and ending with m_l, where adjacent points lie in each other's ε-neighbourhood.

This condition implies that there must exist a threshold τ that can be used to break all paths between the modes by assigning regions with density less than τ to noise. Otherwise, if the mode of some cluster has a density lower than that of a low-density region between other clusters, then this kind of density-based clustering algorithm will fail to find all clusters. Either some high-density clusters will be merged together (when a lower density threshold is used), or low-density clusters will be designated as noise (when a higher density threshold is used). To illustrate, Figure 2(a) shows a case where using a high threshold causes all points in the sparse cluster to be assigned to noise, while using a low threshold causes points in the two dense clusters to be assigned to the same cluster.

(a) Original data
(b) ReScaled data
Figure 2: (a) A mixture of three Gaussian distributions that cannot be separated using a single density threshold; (b) the density distribution of the ReScaled data of (a), where a single density threshold can be found to separate all three clusters. Note that the marked points in (a) are shifted to the correspondingly marked points in (b), and the density in (b) is estimated with a larger bandwidth η.

2.2 kNN anomaly detection

A classic nearest-neighbour based anomaly detection algorithm assigns an anomaly score to an instance based on its distance to its k-th nearest neighbour [27]. The instances with the largest anomaly scores are identified as anomalies.

Given a dataset D and the parameter k, the density of x can be estimated using a k-th nearest neighbour density estimator (as used by the classic kNN anomaly detector [27]):

    f̂_k(x) = k / (n · V_d(d_k(x)))    (3)

where d_k(x) is the distance between x and its k-th nearest neighbour in D.

Note that the k-th nearest neighbour distance is a proxy for the density of x, i.e., high d_k(x) indicates low density, and vice versa.
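A corresponding sketch of the k-th nearest neighbour distance d_k(x) and the resulting kNN anomaly score (again our own illustration, with self-distances excluded):

```python
import numpy as np

def knn_distance(X, k):
    # d_k(x): distance from each point to its k-th nearest neighbour (self excluded)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)          # exclude self-distance
    return np.sort(dist, axis=1)[:, k - 1]  # k-th smallest remaining distance

def knn_anomaly_scores(X, k):
    # The classic kNN detector ranks points by d_k(x): the larger, the more anomalous
    return knn_distance(X, k)
```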

Let 𝒩 ⊂ D be the set of all normal points in a dataset D. The condition under which the classic kNN anomaly detector could, with an appropriate setting of a density/distance threshold, identify every anomaly in D is given as follows:

    ∀ a ∈ D∖𝒩, ∀ x ∈ 𝒩:  d_k(a) > d_k(x)    (4)

Equation 4 states that all anomalies must have the highest kNN distances (or lowest densities) in order for all of them to be detected. In other words, kNN anomaly detectors can detect both global anomalies and scattered anomalies, which have lower densities than all normal points [1, 20].

However, based on this characteristic, NN anomaly detectors are unable to detect:

  1. Local anomalies. This is because a local anomaly with low density relative to nearby normal (non-anomalous) instances in a region of high average density may still have a higher density than normal instances in regions of lower average density [7]. Translated into k-th NN distances: a local anomaly a can have d_k(a) smaller than d_k(x) for normal points x in sparser regions. Here, some local anomalies (for example, points located around the boundaries of the two dense clusters) are ranked lower than the normal points located in the sparse cluster at the centre, as shown in Figure 2(a).

  2. Anomalous clusters (sometimes referred to as clustered anomalies [21]) are groups of points that are too small to be considered "true clusters" and are found in low-density parts of the space (i.e., they are separate from the other clusters). A purported benefit of the kNN anomaly detector is that it is able to remove such anomalous clusters provided k is sufficiently large (larger than the size of the anomalous group) [1]. By rescaling the data we are able to extend this property to locally anomalous clusters, i.e., to be tolerant to variations in the local (background or average) density over the space. An example is shown in Figure 3.

(a) Original data
(b) ReScaled data
Figure 3: (a) A mixture of two Gaussian distributions, where C is a normal cluster and A is an anomalous cluster; (b) the density distribution of the ReScaled data of (a), where the anomalous cluster is farther from the normal cluster centre. Note that the marked points in (a) are shifted to the correspondingly marked points in (b).

2.3 Summary

Both density-based clustering algorithms and distance-based anomaly detectors have weaknesses when it comes to handling datasets with clusters of inhomogeneous density. Rather than creating a new density estimator or modifying existing clustering and anomaly detection procedures, we advocate transforming the data to be more uniformly distributed than it is in the original space, such that the separation between clusters can be identified easily.

3 Density-ratio estimation

Density-ratio estimation is a principled approach to overcome the weakness of density-based clustering for detecting clusters with inhomogeneous densities 36.

The density-ratio of a point is the ratio of two density estimates calculated using the same density estimator, but with two different bandwidth settings.

Let f̂_ε and f̂_η be density estimators using kernels of bandwidth ε and η, respectively. Given the constraint that the denominator has a larger bandwidth than the numerator (η > ε), the density-ratio of x is estimated as

    r(x) = f̂_ε(x) / f̂_η(x)    (5)

We correct the lemma from [36] regarding the density-ratio value r(x):

Lemma 1.

For any data distribution F and sufficiently small values of ε and η s.t. ε < η, if x is at a local maximum of the density of F, then r(x) ≥ 1; and if x is at a local minimum of the density of F, then r(x) ≤ 1.

Since points located at local high-density areas (almost invariably) have density-ratio higher than points located at local low-density areas, a global density-ratio threshold around unity can be used to identify all cluster peaks and break all paths between different clusters. Thus, based on density-ratio estimation, existing density-based clustering algorithms such as DBSCAN can identify clusters as regions of locally high density, separated by regions of locally low density.
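Combining the two estimates gives the density-ratio of Equation 5. A minimal sketch, reusing eps_density from the earlier sketch in Section 2.1 (names are ours):

```python
def density_ratio(X, eps, eta):
    # Equation 5: r(x) = f_hat_eps(x) / f_hat_eta(x), with eta > eps
    assert eta > eps, "the denominator must use the larger bandwidth"
    return eps_density(X, eps) / eps_density(X, eta)
```

A global threshold near 1 on these ratios then separates locally high-density points from locally low-density points, as described above.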

Similarly, a density-based anomaly detector is able to detect local anomalies, since their density-ratio values are lower than those of normal points with locally high densities, and they are therefore ranked as more anomalous.

4 CDF scaling to overcome the issue of inhomogeneous densities

4.1 One-dimensional CDF scaling

Let f̂_η and F̂_η be a density estimator and a cumulative distribution estimator, respectively, using a kernel of bandwidth η.

Let x′ be the transformed point of x using F̂_η as follows:

    x′ = F̂_η(x)    (6)

Then the transformed point has the property [36] that its estimated density approximates the density-ratio of x in the original space,

    f̂′_ε(x′) ≈ f̂_ε(x) / f̂_η(x)    (7)

and consequently the transformed data is approximately uniformly distributed w.r.t. the density estimator with bandwidth η:

    f̂′_η(x′) ≈ 1    (8)

ReScale [36] is a representative implementation of this transform. Figures 2(b) and 3(b) show how ReScale rescales the data distribution on two one-dimensional datasets. They show that clusters and anomalies are easier to identify after the application of ReScale.

Since this transform can only be performed on a one-dimensional dataset, ReScale must apply the transformation to each attribute independently for a multi-dimensional dataset. A per-attribute sketch is given below.
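The following sketch illustrates a per-attribute CDF transform in the spirit of ReScale, using a uniform kernel of bandwidth η to smooth the empirical CDF. This is our own simplified rendering; the actual ReScale additionally discretises the CDF with a precision parameter ψ.

```python
import numpy as np

def cdf_scale_1d(x, eta):
    # Smoothed empirical CDF with a uniform kernel of bandwidth eta:
    # F_eta(t) = mean over points x_i of the kernel CDF clip((t - x_i + eta)/(2*eta), 0, 1)
    diff = x[:, None] - x[None, :]
    weights = np.clip((diff + eta) / (2 * eta), 0.0, 1.0)
    return weights.mean(axis=1)

def rescale(X, eta):
    # Apply the 1-D CDF transform to each attribute independently (ReScale-style)
    return np.column_stack([cdf_scale_1d(X[:, j], eta) for j in range(X.shape[1])])
```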

4.2 Multi-dimensional CDF scaling

Using a distance-scaling method, a multi-dimensional transform can be achieved by simply rescaling the distances between each point and all the other instances in its local neighbourhood (thereby considering all of the dimensions of the dataset at once).

Given a point x, the distance between x and all points in its η-neighbourhood N(x; η, dist)³ can be rescaled using a scaling function s:

    dist′(x, y) = s(x, η) · dist(x, y),  y ∈ N(x; η, dist)    (9)

where dist′(x, y) is the scaled distance of dist(x, y).

Here, the scaling function s depends on both the position x and the size η of the neighbourhood. It is defined as follows using the estimated density f̂_η, with the aim of making the density distribution within the η-neighbourhood uniform:

    s(x, η) = ( f̂_η(x; dist) · V_{d_max/2} )^{1/d}    (10)

where d_max is the maximum pairwise distance in D and V_{d_max/2} is the volume of the d-dimensional ball of radius d_max/2. Note that we have now reparameterised the density estimator to include the distance function dist, in order to facilitate the calculations below.

³ For reasons that will become obvious shortly, we now reparameterise the neighbourhood function to depend on the distance measure dist.

With reference to the uniform distribution, which has density f_u = 1 / V_{d_max/2}, where V_{d_max/2} is the volume of the ball having radius d_max/2:

    f̂_η(x; dist) < f_u  ⟹  s(x, η) < 1    (11)
    f̂_η(x; dist) > f_u  ⟹  s(x, η) > 1    (12)

That is, the process rescales sparse regions to be denser by using s < 1, and dense regions to be sparser by using s > 1, such that the entire dataset is approximately uniformly distributed in the scaled η-neighbourhoods. More specifically, after rescaling distances with s, the density of points in the neighbourhood of size η around x equals the density of points across the whole dataset:

    f̂_η(x; dist′) ≈ f_u    (13)

Note that the above derivation is possible because (i) the scaling is isotropic about x (hence the shape of the unit ball does not change, only its size), and (ii) the uniform-kernel density estimator is local (i.e., its value depends only on points within the η-neighbourhood).

In order to maintain the same relative ordering of distances between x and all other points in the dataset, the distance between x and any point y not in the η-neighbourhood can be normalised by a simple min–max normalisation:

    dist′(x, y) = s(x, η)·η + (dist(x, y) − η)·(d_max − s(x, η)·η) / (d_max − η),  y ∉ N(x; η, dist)    (14)

which maps distances in (η, d_max] monotonically onto (s(x, η)·η, d_max].

It is interesting to note that, provided η is sufficiently large that the average of the density estimates across the data points approximates the average density over the space, the average rescaling factor is approximately 1, i.e., E[s(x, η)] ≈ 1, and thus the average distance between points is unchanged after rescaling:⁴

    E[dist′(x, y)] ≈ E[dist(x, y)]    (15)

⁴ Proof sketch: for y ∈ N(x; η, dist) we have E[dist′(x, y)] = E[s(x, η)]·E[dist(x, y)] ≈ E[dist(x, y)], since s(x, η) and dist(x, y) are independent. The same can be shown for points y ∉ N(x; η, dist).
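The sketch below renders our reconstruction of Equations 9–14 as code (reusing ball_volume from the earlier sketch in Section 2.1). Since those equations were reconstructed from the surrounding prose, treat the exact formulas here as assumptions rather than the paper's verbatim definitions:

```python
import numpy as np

def rescale_distances(dist, density, d, eta):
    # dist: n x n pairwise distances; density: f_hat_eta for each point; d: dimensionality.
    # Returns the (asymmetric) rescaled distances dist'(x, y), row-indexed by x.
    n = dist.shape[0]
    d_max = dist.max()                           # assumes eta < d_max
    f_u = 1.0 / ball_volume(d, d_max / 2.0)      # uniform reference density
    s = (density / f_u) ** (1.0 / d)             # per-point scaling factor (Eq. 10)
    out = np.empty_like(dist)
    for i in range(n):
        inside = dist[i] <= eta
        out[i, inside] = s[i] * dist[i, inside]  # Eq. 9: scale within the eta-neighbourhood
        # Eq. 14: map (eta, d_max] linearly onto (s*eta, d_max] to preserve the ordering
        out[i, ~inside] = s[i] * eta + (dist[i, ~inside] - eta) * \
                          (d_max - s[i] * eta) / (d_max - eta)
    return out
```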

The implementation of DScale [37], which is a representative algorithm based on this multi-dimensional CDF scaling, is provided in Algorithm 2 in the Appendix.

A lemma about the density based on the rescaled distance is given as follows:

Lemma 2.

The density estimated with the rescaled distance dist′ is approximately proportional to the density-ratio in terms of the original distance within the η-neighbourhood of x.

Proof.

For points within the η-neighbourhood of x, the points within radius ε of x under dist′ are those within radius ε/s(x, η) under dist, so

    f̂_ε(x; dist′) ≈ f̂_ε(x; dist) / s(x, η)^d = f_u · f̂_ε(x; dist) / f̂_η(x; dist),

where ε < η and f_u is a constant. ∎

Note that the above lemma is only valid within the η-neighbourhood of x, and when each x is treated independently. In other words, the transform is only valid locally. In addition, the rescaled distance is asymmetric, i.e., dist′(x, y) ≠ dist′(y, x) when s(x, η) ≠ s(y, η).

To be a valid transform globally for the entire dataset, we propose in this paper to perform an iterative process that involves two steps: distance rescaling and point shifting.

5 CDF Transform-Shift (CDF-TS)

Instead of just rescaling the distances between each data instance and all other points in the dataset, we can apply the rescaling directly to the dataset by translating points in approximate accordance with the rescaled distances. Two advantages of the new approach are: (i) the shifted dataset becomes approximately uniformly distributed as a whole; and (ii) the standard Euclidean distance can then be used to measure distances between the transformed points, rather than the non-symmetric, and as a result non-metric, rescaled distance.

Consider two points x, y ∈ D. In order to make the distribution around point x more uniform, we wish to rescale (expand or contract) the distance between x and y to be dist′(x, y) as defined by Equations 9 and 14. We can do this by translating the point y to a new point, denoted y′_x, which lies along the direction from x to y, as follows:

    y′_x = x + (dist′(x, y) / dist(x, y)) · (y − x)    (16)

Note that the distance between x and the newly transformed point satisfies the rescaled-distance requirement:

    ‖x − y′_x‖ = dist′(x, y)    (17)

Above we considered the effect of translating point y to y′_x to scale the space around a single point x. In order to approximately scale the space around all points in the dataset, point y needs to be translated to the average of the above translations over all reference points:

    y′ = (1/n) Σ_{x ∈ D} [ x + (dist′(x, y)/dist(x, y)) · (y − x) ]    (18)

(with the convention that the term for x = y is y itself).
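A sketch of the point-shift step of Equation 18 (our own rendering; the per-reference translation targets are averaged, with the x = y term left at y itself):

```python
import numpy as np

def shift_points(X, new_dist, dist):
    # Equation 18: translate each point y to the average, over all reference points x,
    # of the per-reference target x + (dist'(x, y) / dist(x, y)) * (y - x).
    n = X.shape[0]
    ratio = np.divide(new_dist, dist, out=np.ones_like(dist), where=dist > 0)
    shifted = np.empty_like(X)
    for j in range(n):
        targets = X + ratio[:, j:j + 1] * (X[j] - X)  # one target per reference point x
        shifted[j] = targets.mean(axis=0)             # average over all references
    return shifted
```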

A theorem regarding the final shifted data is provided as follows:

Theorem 1.

Given a sufficiently large bandwidth η, the η-neighbourhood density of y′ is expected to be closer to the uniform density than that of y:

    E[ |f̂_η(y′) − f_u| ] ≤ E[ |f̂_η(y) − f_u| ]

Proof.

Let us define the bandwidth for the transformed neighbourhood around y′ as η′ = s(y, η)·η, where s is the scaling function of Equation 10. We can now estimate f̂_η(y′) as follows.

Since x is in the η-neighbourhood of y, we have, by Equations 16 and 17, dist(x, y′_x) = s(x, η)·dist(x, y).

Supposing that f̂_η varies slowly within the neighbourhood N(y; η, dist), we have s(x, η) ≈ s(y, η) for all reference points x in that neighbourhood.

Then, based on Equation 15, given a sufficiently large η such that the average rescaling factor is approximately 1, and provided the number of points outside the neighbourhood is sufficiently small, the shift of y in Equation 18 is mainly determined by the reference points located in the η-neighbourhood of y. Then we have, for x ∈ N(y; η, dist),

    dist(x, y′) ≈ s(y, η)·dist(x, y)

The above relation can be rewritten as: the points within distance η′ = s(y, η)·η of y′ are approximately the points within distance η of y.

As a consequence, the relation between f̂_η(y′) and f̂_η(y) depends on whether y is in a sparse or a dense region:

    f̂_η(y) < f_u  ⟹  s(y, η) < 1  ⟹  f̂_η(y′) > f̂_η(y)    (19)
    f̂_η(y) > f_u  ⟹  s(y, η) > 1  ⟹  f̂_η(y′) < f̂_η(y)    (20)

When the density varies slowly in N(y; η, dist), we can safely assume that each point's nearest neighbours remain similar after the transform and that the density still varies slowly in the η′-neighbourhood. Then we have dist(x, y′) ≈ s(y, η)·dist(x, y) and can estimate f̂_η(y′) as

    f̂_η(y′) ≈ f̂_η(y) / s(y, η)^d    (21)

Here we have two non-trivial scenarios, depending on the density of y (the case f̂_η(y) = f_u is trivial):

  1. In a sparse region we have s(y, η) < 1, as shown in Equation 19. Using this inequality in Equations 13 and 21, we have f̂_η(y) < f̂_η(y′) ≤ f_u.

  2. In a dense region we have s(y, η) > 1, as shown in Equation 20. Using this inequality in Equations 13 and 21, we have f_u ≤ f̂_η(y′) < f̂_η(y).

Therefore, |f̂_η(y′) − f_u| ≤ |f̂_η(y) − f_u|, i.e., the density of y′ is closer to uniform than that of y. ∎

Since our target is to obtain uniformly distributed data w.r.t. f̂_η, this transform–shift process can be repeated multiple times, using an iterative method to reduce the η-neighbourhood density variation of the dataset shifted in the previous iteration.

This is accomplished by using an expectation-maximisation-like algorithm. The objective is to minimise the variance of the η-neighbourhood densities across all points.

A condition for terminating the iterative process is that the total distance of the point shifts from y to y′ in the latest iteration is less than a threshold δ, i.e.,

    Σ_{y ∈ D} ‖y′ − y‖ < δ    (22)

Note that, due to the shifts in each iteration, the value ranges of the attributes of the shifted dataset are likely to differ from those of the original dataset D. We use min–max normalisation [3] on each attribute at the end of each iteration to keep the values in the same fixed range.

Figure 4 illustrates the effects of CDF-TS on a two-dimensional dataset over different numbers of iterations. It shows that the original clusters of different densities become increasingly uniform as the number of iterations increases.

(a) The original two-dimensional dataset
(b) After 1 iteration
(c) After 3 iterations
(d) After 5 iterations
Figure 4: Illustrating the effects of CDF-TS on a two-dimensional dataset.

The implementation of CDF-TS is shown in Algorithm 1. The DScale procedure used in step 6 is provided in Algorithm 2 in the Appendix. Algorithm 1 requires two parameters, η and δ.

1: Input: D – input data matrix (n × d); η – bandwidth parameter; δ – threshold for the total shifted distance.
2: Output: D′ – data matrix after rescaling and point shifting.
3: Normalise D using min–max normalisation
4: D′ ← D
5: Δ ← ∞
6: while Δ ≥ δ do
7:      M ← the pairwise distance matrix of D′
8:      M′ ← DScale(M, η, d)
9:      for each y ∈ D′ (where every x ∈ D′ acts as a reference point) do
10:         Shift point y to ŷ in the direction of x to y with magnitude M′(x, y), averaged over all reference points x (Equation 18)
11:     end for
12:     D̂ ← the matrix of shifted points ŷ
13:     Normalise D̂ using min–max normalisation
14:     Δ ← Σ_y ‖ŷ − y‖
15:     D′ ← D̂
16: end while
17: return D′
Algorithm 1 CDF-TS(D, η, δ)
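Putting the pieces together, here is a sketch of the main CDF-TS loop of Algorithm 1, composed from the earlier sketches (eps_density, rescale_distances, shift_points); the max_iter safeguard is our own addition:

```python
import numpy as np

def minmax(X):
    rng = X.max(axis=0) - X.min(axis=0)
    return (X - X.min(axis=0)) / np.where(rng > 0, rng, 1.0)

def cdf_ts(X, eta, delta, max_iter=50):
    # Iterate: estimate densities, rescale distances (DScale step), shift points,
    # re-normalise; stop when the total shift falls below delta (Equation 22).
    X = minmax(np.asarray(X, dtype=float))
    for _ in range(max_iter):
        dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        density = eps_density(X, eta)                 # uniform kernel, bandwidth eta
        new_dist = rescale_distances(dist, density, X.shape[1], eta)
        X_new = minmax(shift_points(X, new_dist, dist))
        total_shift = np.linalg.norm(X_new - X, axis=1).sum()
        X = X_new
        if total_shift < delta:
            break
    return X
```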

5.1 Density estimators applicable to CDF-TS

The following two types of density estimators, using uniform kernels with fixed-size and variable-size bandwidths respectively, are applicable to CDF-TS for the CDF scaling:

(a) ε-neighbourhood density estimator:

    f̂_ε(x) = |N(x; ε, dist)| / (n · V_d(ε))

where V_d(ε) is the volume of the ball with radius ε.

When this density estimator is used in CDF-TS, the neighbourhood described in Sections 4.2 and 5 is N(x; η, dist) with a fixed η, denoting the fixed-size bandwidth uniform kernel.

(b) k-th nearest neighbour density estimator:

    f̂_k(x) = k / (n · V_d(d_k(x)))

where d_k(x) is the distance between x and its k-th nearest neighbour.

When this density estimator is used in CDF-TS, the neighbourhood size becomes dependent on the location of x and the distribution of its surrounding points. In this case the density is calculated over the neighbourhood N(x; d_k(x), dist), denoting the variable-size bandwidth uniform kernel, i.e., the bandwidth is small in dense regions and large in sparse regions. Note that, in this circumstance, the density of the shifted points still becomes more uniform w.r.t. their surroundings.

In the following experiments, we employ in CDF-TS the same density estimators as used by DBSCAN and DP (in clustering) and by the kNN anomaly detector (in anomaly detection) to transform D to D′, i.e., the ε-neighbourhood density estimator is used in CDF-TS for clustering, and the k-th nearest neighbour density estimator is used in CDF-TS for anomaly detection.

6 Empirical Evaluation

Dataset Data Size #Dimensions #Clusters
Segment 2310 19 7
Mice 1080 83 8
Biodeg 1055 41 2
ILPD 579 9 2
ForestType 523 27 4
Wilt 500 5 2
Musk 476 166 2
Libras 360 90 15
Dermatology 358 34 6
Ecoli 336 7 8
Haberman 306 3 2
Seeds 210 7 3
Lung 203 3312 5
Wine 178 13 3
3L 560 2 3
4C 1250 2 4
Table 2: Properties of clustering datasets

This section presents experiments designed to evaluate the effectiveness of CDF-TS.

All algorithms used in our experiments were implemented in Matlab R2017a (the source code can be obtained at https://sourceforge.net/p/cdf-ts/). The experiments were run on a machine with an eight-core CPU (Intel Core i7-7820X @ 3.60GHz), 32GB memory and a GPU with 2560 CUDA cores (GeForce GTX 1080). All datasets were normalised using min–max normalisation so that each attribute lies in [0,1] before the experiments began.

For clustering, we used 2 artificial datasets and 14 real-world datasets with different data sizes and dimensions from the UCI Machine Learning Repository [11]. Table 2 presents the properties of these datasets. 3L is a 2-dimensional dataset containing three elongated clusters with different densities, as shown in Figure 5(a). 4C is a 2-dimensional dataset containing four clusters with different densities (three Gaussian clusters and one elongated cluster), as shown in Figure 5(b). Note that DBSCAN is unable to correctly identify all clusters in either of these datasets because they do not satisfy the condition specified in Equation 2. Furthermore, the clusters in 3L significantly overlap on individual attribute projections, which violates the requirement of ReScale that the one-dimensional projections allow the density peaks of each cluster to be identified.

(a) 3L data distribution
(b) 4C data distribution
Figure 5: (a) A two-dimensional data containing three elongated clusters. (b) A two-dimensional data containing four clusters.

For anomaly detection, we compared anomaly detection performance on 2 synthetic datasets with anomalous clusters and 10 real-world benchmark datasets⁵. The data sizes, dimensions and percentages of anomalies are shown in Table 3. Both the Syn 1 and Syn 2 datasets contain clusters of anomalies. Their data distributions are shown in Figure 6.

⁵ Velocity, Ant and Tomcat are from http://openscience.us/repo/defect/ck/ and the others are from the UCI Machine Learning Repository [11].

Dataset Data Size #Dimensions % Anomaly
AnnThyroid 7200 6 7.42%
Pageblocks 5473 10 10.23%
Tomcat 858 20 8.97%
Ant 745 20 22.28%
BloodDonation 604 4 5.63%
Vowel 528 10 9.09%
Mfeat 410 649 2.44%
Dermatology 366 20 5.46%
Balance 302 4 4.64%
Velocity 229 20 34.06%
Syn 1 520 2 3.85%
Syn 2 860 1 6.98%
Table 3: Properties of anomaly detection datasets
(a) Data distribution on Syn 1 dataset
(b) Stacked histogram on Syn 2 dataset
Figure 6: Data distributions of the Syn 1 and Syn 2 datasets, where the red colour indicates the anomalies.

6.1 Clustering

In this section, we compare CDF-TS with ReScale and DScale using two existing density-based clustering algorithms, DBSCAN [14] and DP [29], in terms of F-measure [28]: given a clustering result, we calculate the precision score P_i and the recall score R_i for each cluster C_i based on the confusion matrix, and then the F-measure score of C_i is the harmonic mean F_i = 2·P_i·R_i / (P_i + R_i). The overall F-measure score is the unweighted (macro) average of F_i over all clusters.⁶

⁶ It is worth noting that other evaluation measures, such as purity and Normalised Mutual Information (NMI) [31], only take into account the points assigned to clusters and do not account for noise. A clustering algorithm which assigns the majority of the points to noise may therefore obtain a high score. Thus the F-measure is more suitable than purity or NMI for assessing the clustering performance of density-based clustering.
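For clarity, a sketch of the macro-averaged F-measure as described above, assuming the predicted cluster indices have already been matched to the ground-truth classes via the confusion matrix (the matching step is omitted here; names are ours):

```python
import numpy as np

def macro_f_measure(true_labels, pred_labels, n_clusters):
    # Unweighted (macro) average of per-cluster F-measures; noise points
    # (e.g., label -1) count against recall but form no cluster of their own.
    scores = []
    for c in range(n_clusters):
        tp = np.sum((pred_labels == c) & (true_labels == c))
        prec = tp / max(np.sum(pred_labels == c), 1)
        rec = tp / max(np.sum(true_labels == c), 1)
        scores.append(0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec))
    return float(np.mean(scores))
```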

We report the best clustering performance after performing a parameter search for each algorithm. Table 4 lists the parameters and their search ranges for each algorithm. ψ in ReScale controls the precision of the CDF estimate F̂_η, i.e., the number of intervals used for estimating it. We set δ to a fixed default value for CDF-TS.

Algorithm Parameter with search range
DBSCAN ε
DP ε; k
ReScale η; ψ
DScale η
CDF-TS η; δ
Table 4: Parameters and their search ranges. The search ranges of ε and η are as used by Zhu et al. [36].

Table 5 shows the best F-measures of DBSCAN, DP, and their ReScale, DScale and CDF-TS versions. The average F-measures, shown in the second-to-last row, reveal that CDF-TS improves the clustering performance of both DBSCAN and DP by a larger margin than either ReScale or DScale. In addition, CDF-TS is the best performer on many more datasets than the other contenders (shown in the last row of Table 5).

Data DBSCAN DP
Orig ReS DS CDF-TS Orig ReS DS CDF-TS
Segment 0.59 0.62 0.61 0.67 0.78 0.77 0.80 0.84
Mice 0.99 0.99 0.99 0.993 1.00 1.00 1.00 1.00
Biodeg 0.45 0.44 0.47 0.52 0.72 0.74 0.73 0.76
ILPD 0.41 0.42 0.56 0.52 0.60 0.63 0.62 0.64
ForestType 0.27 0.51 0.48 0.65 0.69 0.83 0.70 0.85
Wilt 0.38 0.39 0.54 0.55 0.54 0.68 0.54 0.74
Musk 0.51 0.52 0.51 0.53 0.55 0.58 0.61 0.62
Libras 0.41 0.44 0.46 0.49 0.376 0.375 0.376 0.38
Dermatology 0.52 0.73 0.74 0.83 0.91 0.97 0.91 0.96
Ecoli 0.37 0.40 0.54 0.60 0.48 0.55 0.63 0.64
Haberman 0.47 0.64 0.59 0.66 0.56 0.63 0.58 0.67
Seeds 0.75 0.88 0.85 0.83 0.91 0.92 0.92 0.94
Lung 0.49 0.54 0.49 0.66 0.70 0.74 0.64 0.72
Wine 0.64 0.86 0.80 0.90 0.93 0.95 0.96 0.962
3L 0.59 0.63 0.90 0.88 0.82 0.81 0.86 0.89
4C 0.71 0.90 0.92 0.95 0.87 0.92 0.95 0.91
Average 0.53 0.62 0.65 0.70 0.72 0.75 0.74 0.78
#Top 1 0 1 2 13 1 3 2 13
Table 5: Best F-measure of DBSCAN, DP, and their ReScale, DScale and CDF-TS versions. For each clustering algorithm, the best performer on each dataset is boldfaced. Orig, ReS and DS represent the original algorithm, ReScale and DScale, respectively.

It is interesting to see that CDF-TS yields a large performance improvement for both DBSCAN and DP on many datasets, such as ForestType, Wilt, Dermatology, Ecoli, Lung and Wine. On many of these datasets, CDF-TS also shows a large performance gap in comparison with ReScale and/or DScale.

Note that the performance gap is smaller for DP than it is for DBSCAN because DP is a more powerful algorithm which does not rely on a single density threshold to identify clusters.

To evaluate whether the performance differences among the three scaling algorithms are significant, we conducted the Friedman test with the post-hoc Nemenyi test [10].

Figures 7(a) and 7(b) show the results of the significance test for DBSCAN and DP, respectively. The results show that the CDF-TS versions are significantly better than the DScale and ReScale versions for both DBSCAN and DP.

(a) Significance test for DBSCAN
(b) Significance test for DP
Figure 7: Critical difference (CD) diagram of the post-hoc Nemenyi test. Two algorithms are significantly different if the gap between their ranks is larger than the CD; otherwise, a line links them.

Here we compare the different effects of ReScale and CDF-TS on the transformed datasets using MDS visualisation⁷.

⁷ Multidimensional scaling (MDS) [6] is used for visualising a high-dimensional dataset in a 2-dimensional space through a projection which preserves the original pairwise dissimilarities between instances as well as possible.

The effects on the two synthetic datasets are shown in Figures 8 and 9. They show that the rescaled datasets are more axis-parallel in their distribution under ReScale than under CDF-TS. For the 3L dataset, the blue cluster is still very sparse after running ReScale, as shown in Figure 8(a). In contrast, it becomes denser using CDF-TS, as shown in Figure 9(a). As a result, DBSCAN performs much better with CDF-TS on the 3L dataset.

Figure 10 shows the MDS plots of the Wilt dataset, on which CDF-TS has the largest performance improvement for both DBSCAN and DP. The figure shows that the two clusters are more uniformly distributed in the new space, which makes their boundary easier to identify than in the original space. In contrast, based on the original distance, most points of the red cluster are surrounded by points of the blue cluster, as shown in Figure 10(a).

(a) 3L data distribution
(b) 4C data distribution
Figure 8: Data distribution after running ReScale when achieving the best DBSCAN clustering performance.
(a) 3L data distribution
(b) 4C data distribution
Figure 9: Data distribution after running CDF-TS when achieving the best DBSCAN clustering performance.
(a) Based on distance
(b) Based on CDF-TS
Figure 10: MDS plots of the Wilt dataset.

6.2 Anomaly detection

In this section, we evaluate the ability of CDF-TS to detect local anomalies based on kNN anomaly detection.

Two state-of-the-art local anomaly detectors, Local Outlier Factor (LOF) [7] and iForest [19, 22], are also used in the comparison. Table 6 lists the parameters and their search ranges for each algorithm. The parameters ψ and t in iForest control the sub-sample size and the number of iTrees, respectively. We report the best performance of each algorithm on each dataset in terms of AUC (area under the ROC curve) [15].

Algorithm Parameters and their search range
kNN/LOF k
iForest ψ; t
ReScale η; ψ
DScale η
CDF-TS η; δ
Table 6: Parameters and their search ranges.

Table 7 compares the best AUC score of each algorithm. CDF-TS-kNN achieves the highest average AUC of 0.90 and performs the best on 6 out of 12 datasets. On the Syn 1 dataset, which has overlap on the individual attribute projections, DScale and CDF-TS produce much better results than ReScale.

Data kNN LOF iForest ReS DS CDF-TS
AnnThyroid 0.65 0.68 0.88 0.76 0.71 0.94
Pageblocks 0.89 0.93 0.90 0.88 0.86 0.94
Tomcat 0.63 0.67 0.81 0.77 0.69 0.78
Ant 0.67 0.68 0.77 0.76 0.71 0.76
BloodDonation 0.69 0.79 0.82 0.84 0.80 0.86
Vowel 0.93 0.93 0.92 0.932 0.90 0.94
Mfeat 0.99 0.98 0.98 0.99 1.00 0.99
Dermatology 0.91 0.99 0.86 0.99 0.96 1.00
Balance 0.94 0.95 0.91 0.91 0.83 0.92
Velocity 0.66 0.62 0.65 0.67 0.694 0.69
Syn 1 0.92 0.93 0.90 0.89 0.95 0.98
Syn 2 0.88 0.88 0.86 0.92 0.95 0.94
Average 0.81 0.83 0.86 0.86 0.84 0.90
#Top 1 0 1 2 1 3 6
Table 7: Best AUC on 12 datasets. The best performer on each dataset is boldfaced. ReS and DS represent ReScale-kNN and DScale-kNN, respectively.
Figure 11: Critical difference (CD) diagram of the post-hoc Nemenyi test for the anomaly detection algorithms.

Figure 11 shows the significance test on the three scaling algorithms applied to the kNN anomaly detector. The results show that CDF-TS-kNN is significantly better than DScale-kNN and ReScale-kNN.

It is interesting to note that CDF-TS-kNN has the largest AUC improvements over kNN on Dermatology and AnnThyroid, as shown in Table 7. Table 8 compares the MDS plots of the original datasets and the shifted (density-equalised) datasets. It shows that, on the shifted datasets, most anomalies form anomalous clusters and are farther away from the normal points, making them easier for the kNN anomaly detector to detect. Note that some anomalies are close to normal clusters and have higher densities than many normal instances in the original datasets.

Table 8: MDS plots of the Dermatology and AnnThyroid datasets, based on the original distance (left) and after CDF-TS (right), where red points indicate anomalies.

6.3 Run-time

ReScale, DScale and CDF-TS are all pre-processing methods; their computational complexities are shown in Table 9. Because many existing density-based clustering algorithms have time and space complexities of O(n²), these methods do not increase the overall complexities.

Algorithm Time complexity Space complexity
ReScale O(n·d·ψ) O(n·d)
DScale and CDF-TS O(n²·d) O(n²)
Table 9: Computational complexity of ReScale, DScale and CDF-TS.

Table 10 shows the runtime of the dissimilarity-matrix calculation for each of the three methods on the Mfeat, Segment, Pageblocks and AnnThyroid datasets. The parameters were set to the same values for all datasets. The results show that CDF-TS took the longest runtime because it requires multiple DScale calculations.

Dataset CPU GPU
ReScale DScale CDF-TS ReScale DScale CDF-TS
Mfeat 0.69 0.03 1.26 0.05 0.02 2.89
Segment 0.29 0.28 2.68 0.02 0.05 0.11
Pageblocks 0.16 1.64 18.64 0.02 0.17 2.84
AnnThyroid 0.19 3.09 24.40 0.02 0.26 2.97
Table 10: Runtime comparison of dissimilarity matrix calculation (in seconds).

7 Discussion

7.1 Parameter sensitivity

Both DScale and CDF-TS have one critical parameter, η, which defines the η-neighbourhood. The density-ratio based on a small η approximates 1 and provides no information, while that based on a large η approximates the estimated density and loses its advantage. Generally, ε in DBSCAN and the corresponding neighbourhood parameter in DP should be set slightly smaller than η.

The parameter δ in CDF-TS controls the number of iterations, i.e., a smaller δ usually results in a larger number of iterations in the CDF-TS process and makes the shifted dataset more uniform. In our experiments, we set δ to a default value and obtained significantly better results than the existing algorithms. Tuning δ for individual datasets would yield even better results on most datasets.

7.2 Relation to LOF

LOF [7] is a local density-based approach for anomaly detection. The LOF score of x is the ratio of the average density of x's k-nearest neighbours to the density of x, where the density of x is inversely proportional to the average distance to its k-nearest neighbours [7].

Though LOF is a density-ratio approach, it does not employ the CDF transform, which is the core technique in CDF-TS. While LOF has the ability to detect local anomalies, it is weaker at identifying anomalous clusters than kNN employing CDF-TS.
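For reference, here is a sketch of the simplified LOF score as described above (our own rendering; the full LOF of Breunig et al. [7] uses reachability distances in place of plain average kNN distances):

```python
import numpy as np

def simplified_lof(X, k):
    # Simplified LOF: score(x) = (average density of x's kNN) / density(x),
    # with density(x) = 1 / (average distance from x to its k nearest neighbours).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    nn_idx = np.argsort(dist, axis=1)[:, :k]
    density = 1.0 / np.take_along_axis(dist, nn_idx, axis=1).mean(axis=1)
    return density[nn_idx].mean(axis=1) / density  # scores > 1 suggest local anomalies
```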

7.3 Relation to metric learning

There are many unsupervised distance metric learning algorithms which transform data, e.g., global methods such as PCA [18], KPCA [30] and KUMMP [35]; and local methods such as Isomap [33], t-SNE [24] and LPP [17]. These methods are usually used for dimension reduction, such that data clusters in the learned low-dimensional space are adjusted based on some objective function [34].

CDF-TS is an unsupervised algorithm based on the CDF transform, such that the data become more uniformly distributed in the scaled space. Though it is a local transformation method, CDF-TS is not a metric learning method for dimension reduction.

We have examined the performance of t-SNE in density-based clustering and kNN anomaly detection, setting the dimensionality of its output space to be the same as that of the original data. We found that t-SNE has inconsistent clustering performance. For example, while it produced the best clustering results on some high-dimensional datasets such as ForestType and Lung, it yielded the worst results on the Breast and 3L datasets. In addition, t-SNE could not improve the performance of the kNN anomaly detector. This is because anomalies are still dense, or close to some dense clusters, in the t-SNE transformed space.

7.4 Comparison with a density equalisation method

In the context of kernel k-means [25], the kNN kernel [26] has been suggested as a way to equalise the density of a given dataset, to reduce the bias of the kernel k-means algorithm and to improve clustering performance on datasets with inhomogeneous densities. The dissimilarity matrix generated by the kNN kernel is a binary matrix which can be seen as the adjacency matrix of the kNN graph, where "1" means a point is among the k nearest neighbours of another point, and "0" otherwise. A k-means clustering algorithm can then be used to group points based on their kNN graph, as sketched below.
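A sketch of the binary kNN-kernel (adjacency) matrix described above (our own rendering; names are ours):

```python
import numpy as np

def knn_kernel_matrix(X, k):
    # Binary adjacency matrix of the kNN graph: entry (i, j) = 1 iff x_j is among
    # the k nearest neighbours of x_i; k-means can then cluster the rows.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    nn_idx = np.argsort(dist, axis=1)[:, :k]
    A = np.zeros((len(X), len(X)))
    A[np.arange(len(X))[:, None], nn_idx] = 1.0
    return A
```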

However, the kNN kernel cannot be applied to density-based clustering or kNN anomaly detection. This is because it converts all points to have the same density in the new space, regardless of the bandwidth used by the density estimator of the intended algorithm. Thus, DBSCAN will either group neighbouring clusters into a single cluster or assign all clusters to noise. This outcome is not conducive to producing a good clustering result.

DP also cannot work with the kNN kernel, for the same reason. DP links a point to another point with a higher density to form clusters in the last step of the clustering process [29]; DP would not be able to link points when all points have the same density.

For kNN anomaly detection, replacing the distance measure with the kNN kernel produces either 0 or 1 similarity between any two points. This does not work for anomaly detection.

8 Conclusions

We have introduced the CDF Transform-Shift (CDF-TS) algorithm, based on a multi-dimensional CDF transform, to effectively deal with datasets of inhomogeneous density. It is the first attempt to change the volumes of every point's neighbourhood in the entire multi-dimensional space in order to obtain the desired effect of a uniform distribution for the whole dataset w.r.t. a density estimator. Existing CDF transform methods such as ReScale [36] and DScale [37] achieve a much reduced effect on the rescaled dataset. We show that this more thorough CDF transform brings about a much better outcome.

In addition, CDF-TS is more generic than existing CDF transform methods, with the ability to use different density estimators, i.e., uniform kernels with either fixed or variable bandwidths. As a result, CDF-TS can be applied to more algorithms/tasks, with a better outcome than existing methods such as the kNN kernel [26].

Through extensive evaluations, we show that CDF-TS significantly improves the performance of two existing density-based clustering algorithms and one existing distance-based anomaly detector.

Appendix: DScale algorithm

Algorithm 2 is a generalisation of the implementation of DScale [37] which can employ a density estimator with a bandwidth parameter η. It requires only this one parameter.

1: Input: M – input distance matrix (n × n); η – bandwidth parameter; d – dimensionality of the dataset.
2: Output: M′ – distance matrix after scaling.
3: d_max ← the maximum distance in M
4: Initialise an n × n matrix M′
5: for i = 1 to n do
6:      Compute f̂_η(x_i) from row i of M, and the scaling factor s(x_i, η) via Equation 10
7:      Rescale row i of M via Equations 9 and 14, storing the result in row i of M′
8: end for
9: return M′
Algorithm 2 DScale(M, η, d)