Adaptive Wavelet Clustering for High Noise Data

11/27/2018 · Zengjian Chen et al. · University of Massachusetts Amherst, Cornell University

In this paper we make progress on the unsupervised task of mining arbitrarily shaped clusters in highly noisy datasets, which arises in many real-world applications. Building on the fundamental work that first applied a wavelet transform to data clustering, we propose an adaptive clustering algorithm, denoted AdaWave, which exhibits favorable characteristics for clustering. Through a self-adaptive thresholding technique, AdaWave is parameter-free and can handle data in various situations. It is deterministic, runs in linear time, is order-insensitive, shape-insensitive, robust to highly noisy data, and requires no prior knowledge of data models. Moreover, AdaWave inherits from the wavelet transform the ability to cluster data at different resolutions. We adopt a `grid labeling' data structure to drastically reduce the memory consumption of the wavelet transform so that AdaWave can be used for relatively high-dimensional data. Experiments on synthetic as well as natural datasets demonstrate the effectiveness and efficiency of the proposed method.


I Introduction

Clustering is a fundamental data mining task that finds groups of similar objects while keeping dissimilar objects separated in different groups or in the group of noise (noisy points) [15, 17]. The objects can be spatial data points, feature vectors, or patterns. Typical clustering techniques include centroid-based clustering [18], spectral clustering [16], density-based clustering [17], etc. These techniques usually perform well on "clean" data. However, they face a big challenge in real-world applications where patterns usually "swim" in a sea of clutter or noise. Furthermore, complex and irregularly shaped groups render these once-effective clustering techniques intractable, because typical clustering approaches either lack a clear notion of "noise" or are limited to specific situations due to their shape-sensitive nature.

This work addresses the problem of effectively uncovering arbitrarily shaped clusters when the noise is extremely high. Based on the pioneering work of Sheikholeslami that applies the wavelet transform, originally used for signal processing, to spatial data clustering [12], we propose a new wavelet-based algorithm called AdaWave that can adaptively and effectively uncover clusters in highly noisy data. To cover general applications, we assume that the clusters in a dataset do not follow any specific distribution and can be arbitrarily shaped.

To show the hardness of the clustering task, we first design a running example, as illustrated in Fig. 1(a), which is highly noisy and contains clusters of various types. Fig. 1(b) shows that AdaWave can correctly locate these clusters. Without estimating any explicit model of the data, the proposed AdaWave algorithm exhibits properties that are favorable to large-scale real-world applications.

Fig. 1: A running example. (a) raw data; (b) AdaWave clustering.

For comparison, we illustrate the results of several typical clustering methods on the running example: k-means [23, 24] as the representative of centroid-based clustering methods, DBSCAN [19] as the representative of density-based clustering methods [17, 20], and SkinnyDip [14] as a newly proposed method that handles extremely noisy environments. Besides the illustration, we also use Adjusted Mutual Information (AMI) [21], a standard metric ranging from 0 at worst to 1 at best, to evaluate the performance.

  • Centroid-based clustering algorithms tend to lack a clear notion of "noise" and behave poorly on very noisy data. As shown in Fig. 2(b), the standard k-means yields poor results, and the AMI is very low at 0.25.

  • Density-based clustering algorithms usually perform well when clusters have heterogeneous forms. As a representative, DBSCAN is known to robustly discover clusters of varying shapes in noisy environments. After fine-tuning the parameters, which requires graphical inspection, the result of DBSCAN is shown in Fig. 2(c). DBSCAN roughly finds the shapes of the clusters, but there is still much room for improvement. It detects 21 clusters with an AMI of 0.28, primarily because there are various areas in which, through the randomness of noise, the local density threshold is exceeded. We tested many parameter settings and did not find a combination for which DBSCAN solves the problem.

  • SkinnyDip is known to robustly cluster spatial data in extremely noisy environments. For the running example, we sample starting points from the nodes of a sparse grid [13], which is regarded as the best mechanism for choosing starting points for gradient ascent in SkinnyDip. However, the clustering result in Fig. 2 shows that SkinnyDip performs poorly, as the dataset does not satisfy its assumption that the projection of the clusters onto each dimension is unimodal.

The above methods, which rely heavily on the estimation of explicit models of the data, produce high-quality results when the data is organized according to the assumed models. However, when facing more complicated and unknown cluster shapes or highly noisy environments, these methods do not perform as well as expected. By comparison, as shown in Fig. 2(d), the proposed AdaWave, which applies wavelet decomposition for denoising and the "elbow theory" for adaptive threshold setting, correctly detects all five clusters and groups all the noisy points as the sixth cluster. AdaWave achieves an AMI value as high as 0.76 and, furthermore, computes deterministically and runs in linear time.

Fig. 2: Clustering results on the running example.

We propose AdaWave (the data and code will be made public after review) to cluster highly noisy data, which frequently appears in real-world applications where clusters sit in extreme-noise settings and have arbitrary shapes. We also demonstrate the efficiency and effectiveness of AdaWave on synthetic as well as real-world data. The main characteristics of AdaWave that distinguish it from existing clustering methods are as follows:

  • AdaWave is deterministic, fast, order-insensitive, shape-insensitive, robust in highly noisy environments, and requires no prior knowledge of the data models. To our knowledge, no other clustering method meets all of these properties.

  • We design a new data structure for the wavelet transform such that, compared to the classic WaveCluster algorithm, AdaWave is able to handle high-dimensional data while remaining storage-friendly.

  • We propose a heuristic method based on the elbow theory to adaptively estimate the best threshold for noise filtering. With this self-adaptive threshold, AdaWave exhibits high robustness in extremely noisy environments.

The rest of this paper is organized as follows. Section II provides a quick review of existing clustering methods. Section III analyzes the properties of the wavelet transform and explains why it exhibits distinctive strength in clustering and automatic denoising. In Section IV we discuss the algorithm in detail and summarize the proposed clustering method AdaWave. Section V presents experiments and comparisons with related clustering methods. The results are discussed in Section VI, and Section VII concludes our work.

II Related Work

Various approaches have been proposed to improve the robustness of clustering algorithms on noisy data [19, 9, 14]. Here we highlight several algorithms most related to this problem and focus on the preconditions of these clustering methods.

DBSCAN [19] is a typical density-based clustering method designed to reveal clusters of arbitrary shapes in noisy data. When varying the noise level on the running example, we find that DBSCAN performs well only when the noise is kept below 15%; its performance degrades drastically as we continue to increase the noise percentage. Meanwhile, the overall average time complexity of DBSCAN is O(n log n) for n data points and, in the worst case, O(n^2). Thus, its running time can also be a limiting factor when dealing with large-scale datasets. Another density-based clustering method, Sync [21], exhibits the same weakness in time complexity (it relies on pair-wise distances, with a time complexity of O(n^2)).

Regarding data with high noise, as early as 1998, Dasgupta et al. [9] proposed an algorithm to detect minefields and seismic faults in "cluttered" data. Their method is limited to the two-dimensional case, and an extended version for slightly higher dimensions requires significant parameter tuning.

In 2016, Maurus and Plant [14] proposed an intriguing method called SkinnyDip. SkinnyDip optimizes DipMeans [7] with an elegant dip test of unimodality [22]. It focuses on the high-noise situation and yields good results when the clusters take a unimodal form in each coordinate direction. However, this condition is very strict: the projections of the clusters have to be unimodal in every dimension. When this condition does not hold, SkinnyDip cannot uncover the clusters.

A work proposed in 2017 [25] applies a sparse and latent decomposition of the similarity graph used in spectral clustering to jointly learn the spectral embedding and the corrupted data. It proposes a robust spectral clustering (RSC) technique that is not sensitive to noisy data. Their experiments demonstrated the robustness of RSC against standard spectral clustering methods. However, it can only deal with data whose noise level is relatively low.

Our proposed AdaWave algorithm targets extreme-noise settings, as SkinnyDip does. Moreover, as a grid-based algorithm, AdaWave shares a common characteristic with STING [10] and CLIQUE [11]: its processing is fast and largely independent of the number of data objects. Though the time complexity of AdaWave, O(n), is slightly higher than that of SkinnyDip, it still runs in linear time and yields good results when the dataset consists of clusters with irregular shapes.

Experiments in Section V show that AdaWave outperforms the other algorithms, especially in the following scenarios that commonly arise in large-scale real applications: 1) the data contains clusters of irregular shapes such as rings; 2) the dataset is very large and of relatively high dimension; 3) the dataset contains a very high percentage (for example 80%) of noise.

III Wavelet Transform

The wavelet transform is known as an efficient denoising technique. Surpassing its predecessor, the Fourier transform, the wavelet transform can analyze the frequency content of a signal while retaining its spatial information.

III-A Principles of Wavelet Transform

In this section, we focus on the discrete wavelet transform (DWT), which is applied in the AdaWave algorithm. The 'transform' in DWT separates the original signal into a scale space and a wavelet space. The scale space stores an approximation of the outline of the signal, while the wavelet space stores the detail of the signal. Thanks to the Mallat algorithm [1], we can simplify the complicated process of DWT into two filters.

III-A1 1D Discrete Wavelet Transform

As shown in Fig. 3, the signal passes through two filters and is down-sampled by 2. According to the Mallat algorithm [12], the signal can be decomposed into the scale space and the wavelet space by passing through a low-pass filter and a high-pass filter, respectively. Choosing different wavelet functions, we obtain the corresponding filters by looking up a precalculated table.

Fig. 3: Mallat Algorithm [12].

Just by passing the signal through the two filters and downsampling by 2, we are able to decompose it into a space that contains only the outline of the signal and another space that contains only the detail. The signal discussed here includes high-dimensional signals: passing a filter over a d-dimensional signal simply repeats the 1D process d times.
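To make the two-filter view concrete, the following minimal Python sketch performs one level of a 1D DWT with the Haar basis; the filter values and function name are illustrative choices, not the implementation used in the paper.

    import numpy as np

    def dwt_1d_haar(signal):
        """One level of a 1D discrete wavelet transform with the Haar basis.

        The low-pass filter yields the scale space (outline of the signal);
        the high-pass filter yields the wavelet space (detail). Both outputs
        are downsampled by 2, as in the Mallat algorithm."""
        signal = np.asarray(signal, dtype=float)
        low_pass = np.array([1.0, 1.0]) / np.sqrt(2)    # Haar scaling filter
        high_pass = np.array([1.0, -1.0]) / np.sqrt(2)  # Haar wavelet filter

        approx = np.convolve(signal, low_pass)[1::2]    # scale space, downsampled by 2
        detail = np.convolve(signal, high_pass)[1::2]   # wavelet space, downsampled by 2
        return approx, detail

    # Example: a noisy step signal; the approximation keeps the outline,
    # while detail coefficients of smooth regions stay near zero.
    x = np.concatenate([np.ones(8), 4 * np.ones(8)]) + 0.1 * np.random.randn(16)
    a, d = dwt_1d_haar(x)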

In both data science and information theory, noise is defined as the collection of unstable, meaningless points in a signal (dataset). As shown in Fig. 4, in a denoising task we try to maintain the outline of the signal and amplify the contrast between high and low values, which is exactly what DWT provides.

Fig. 4: Illustration of denoising.

III-A2 2D Discrete Wavelet Transform

To further show DWT's denoising ability on spatial data, we apply two separate 1D wavelet transforms to a 2D dataset and filter the coefficients in the transformed feature space. Following [3], the 2D feature space is first convolved along the horizontal (x) dimension, resulting in a low-pass feature space L and a high-pass feature space H. We then downsample each of the convolved spaces along the x dimension by 2. Both L and H are then convolved along the vertical (y) dimension, resulting in four subspaces: LL (average signal), LH (horizontal features), HL (vertical features), and HH (diagonal features). Next, by extracting the signal part (LL) and discarding low-value coefficients, we obtain the transformed feature space shown in Fig. 5.

Intuitively, according to Fig. 5, dense regions in the original feature space act as attractors to nearby points and, at the same time, as inhibitors to points that are not close enough. This means that clusters in the data and the clear regions around them automatically stand out and become more distinct. Also, it is evident that the number of sparsely scattered points (outliers) in the transformed feature space is lower than in the original space. The decrease in outliers reveals the robustness of DWT with regard to extreme noise.

Fig. 5: 2D discrete wavelet transform. (a) original feature space; (b) transformed feature space.

As mentioned earlier, the above wavelet model can similarly be generalized to a d-dimensional feature space, where the one-dimensional wavelet transform is applied d times.
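As an illustration of this filter-along-x, then filter-along-y procedure, here is a small NumPy sketch of one level of a separable 2D Haar transform (assuming the grid has even side lengths; established libraries such as PyWavelets provide the same decomposition for other bases):

    import numpy as np

    def dwt_2d_haar(grid):
        """One level of a separable 2D DWT: filter and downsample along x, then y.

        Returns the four subspaces LL (average signal), LH (horizontal features),
        HL (vertical features) and HH (diagonal features); LL is the part kept
        for clustering, while small coefficients elsewhere can be discarded."""
        g = np.asarray(grid, dtype=float)
        # low-/high-pass along the horizontal (x) axis, downsampling columns by 2
        lo_x = (g[:, 0::2] + g[:, 1::2]) / np.sqrt(2)
        hi_x = (g[:, 0::2] - g[:, 1::2]) / np.sqrt(2)
        # repeat along the vertical (y) axis, downsampling rows by 2
        ll = (lo_x[0::2, :] + lo_x[1::2, :]) / np.sqrt(2)
        lh = (lo_x[0::2, :] - lo_x[1::2, :]) / np.sqrt(2)
        hl = (hi_x[0::2, :] + hi_x[1::2, :]) / np.sqrt(2)
        hh = (hi_x[0::2, :] - hi_x[1::2, :]) / np.sqrt(2)
        return ll, lh, hl, hh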

III-B Properties of Wavelet Transform

Besides its remarkable ability to analyze frequency and spatial information at the same time, other properties of DWT also distinguish it among denoising methods.

Low entropy. The sparse distribution of wavelet coefficients reduces the entropy after DWT. After the signal decomposition, many wavelet coefficients are close to zero, and these generally correspond to the noise. The main component of the signal is concentrated in a few wavelet bases, so removing the low-value coefficients is an effective denoising method that retains the original signal well.

Multi-resolution. As shown in Fig. 3, DWT is implemented in a layered structure. In each layer of decomposition, we further decompose only the scale space, which stores the approximation of the signal, and preserve the wavelet space, which stores the detailed features. This layered structure makes it possible to observe the signal at different resolutions of the scale space in a single application of DWT. Even though we do not implement this idea in AdaWave, the property may inspire new algorithm designs.

De-correlation. As illustrated above, DWT separates a signal into the scale space and the wavelet space. Through this separation, DWT de-correlates the 'detail' (the rapidly oscillating part in Fig. 4) from the outline. With this de-correlation property, DWT works especially well at separating noise from highly noisy data.

Flexibility in choosing the basis. The wavelet transform can be applied with a flexible choice of wavelet function. As described above, for each wavelet function there exists a precalculated filter. For different requirements, various wavelet families have been designed, including Daubechies, Biorthogonal, and so on. Users of AdaWave have the flexibility to choose any wavelet basis, which makes the algorithm applicable under various requirements.
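For instance, assuming the PyWavelets package is available, one can list the supported wavelet families and switch the basis with a single argument; this is only an illustration of the flexibility, not a dependency of AdaWave:

    import pywt

    print(pywt.families())            # e.g. ['haar', 'db', 'sym', 'coif', 'bior', ...]
    print(pywt.wavelist('db')[:5])    # first few members of the Daubechies family

    # the same 1D signal decomposed with two different bases
    signal = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
    cA_haar, cD_haar = pywt.dwt(signal, 'haar')
    cA_db2, cD_db2 = pywt.dwt(signal, 'db2')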

IV Wavelet-Based Adaptive Clustering

AdaWave is an efficient algorithm based on wavelet decomposition. It is divided into four parts, as shown in Algorithm 1. In the first step we propose a new data structure for clustering high-dimensional data: with space quantization, we divide the data into grids and project them onto a coarser, quantized space. Step 2 is the wavelet decomposition, which performs a preliminary denoising by removing wavelet coefficients close to zero. Step 3 is threshold filtering, a further step to eliminate noise, in which we apply the "elbow theory" to adaptively set the threshold. In the last step, we label the grids and build the lookup table, thereby mapping the grids back to the original data.

Input: data matrix
Output: clustered objects
1: Quantize the feature space, then assign objects to the grids.
2: Apply the wavelet transform to the quantized feature space.
3: Adaptively find the threshold and filter the noise.
4: Find the connected components (clusters) in the subbands of the transformed feature space at different levels.
5: Assign labels to the grids.
6: Make the lookup table and map objects to clusters.
Algorithm 1: The AdaWave algorithm

IV-A Quantization

The first step of AdaWave is to quantize the feature space. Assume that each dimension of the d-dimensional feature space is divided into a number of intervals. By making such a division in every dimension, we separate the feature space into multiple grids. Objects are allocated to these grids according to their coordinates in each dimension.

In the original wavelet clustering algorithm [12], a unit u = (u_1, ..., u_d) contains an object o = (o_1, ..., o_d) if o_j lies in u_j for 1 ≤ j ≤ d, where u_j is the right-open interval in the partitioning of the j-th dimension. For each unit (or grid), we use the number of objects contained in it as the grid density, representing the statistical feature of the grid. The choice of the number of grids to generate and of the statistical feature to use can significantly affect the performance of AdaWave.

It is tempting to keep the quantized data by storing every grid in the feature space. Even though quantizing the data can be finished in linear time, such storage leads to memory consumption that grows exponentially with the dimension d. In AdaWave, inspired by the DENCLUE [2] algorithm, we achieve the goal of "only storing the grids with non-zero density" by "labeling" the grids in the data space. For low-dimensional data, storing only grids with non-zero density brings little advantage because the data densely covers the entire space. In high dimensions, however, the number of grids far exceeds the number of data points; when many grids have zero density, this strategy saves a considerable amount of memory, making it possible to apply AdaWave to high-dimensional clustering problems.

Input: data matrix D
Output: quantized grid set G, stored as {id : density}
for each object o in D do
    /* compute the coordinates (id) of the grid containing o */
    id = getCoordinate(o)
    if id is in G then
        /* if id exists, add 1 to its density */
        G.get(id) += 1
    else
        /* otherwise, insert id with density 1 */
        G.add(id, density = 1)
    end if
end for
Algorithm 2: Data space quantization
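A minimal Python sketch of this quantization step, assuming the data has been rescaled to the unit hypercube and using a plain dictionary keyed by the grid coordinates (all names are illustrative):

    import numpy as np
    from collections import defaultdict

    def quantize(data, intervals_per_dim):
        """Map each object to its grid id and count densities, storing only
        non-empty grids (cf. Algorithm 2). `data` is an (n, d) array in [0, 1)."""
        data = np.asarray(data, dtype=float)
        grids = defaultdict(int)
        for point in data:
            # grid id = tuple of interval indices, one per dimension
            grid_id = tuple(np.minimum((point * intervals_per_dim).astype(int),
                                       intervals_per_dim - 1))
            grids[grid_id] += 1          # increment the grid's density
        return grids

    # usage: 10,000 2D points, 64 intervals per dimension
    points = np.random.rand(10000, 2)
    grid_densities = quantize(points, 64)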

IV-B Transformation and Coefficient Denoising

In the second step, the discrete wavelet transform is applied to the quantized feature space. Applying the wavelet transform to the grids {g_i : 1 ≤ i ≤ K} results in a new feature space, and hence a new set of grids {g'_i : 1 ≤ i ≤ K'}. Given this set of grids, AdaWave detects the connected components in the transformed feature space. Each connected component is a set of grids and is considered a cluster. Corresponding to each resolution of the wavelet transform there is a set of clusters, where, usually at the coarser resolutions, the number of clusters is lower. In the experiments, we applied three-level wavelet transforms with the Daubechies and Cohen-Daubechies-Feauveau ((4,2) and (2,2)) bases; the resulting feature spaces give approximations of the original feature space at different scales, allowing clusters to be found at different levels.

Input: grid set G after quantization, stored as {id : density}
Output: grid set T after wavelet decomposition, stored as {id : density}
/* wavelet decomposition, dimension by dimension */
for j in range(0, d) do
    /* traverse the grid set G */
    for each (id, density) in G do
        t = dwt(id : density)
        /* the length of the basis may be greater than 2, so outputs may overlap */
        if t.id is in T then
            T.get(t.id) += t.density
        else
            T.add(t)
        end if
    end for
end for
Algorithm 3: Wavelet decomposition
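The sketch below mirrors the accumulation logic of Algorithm 3 on the sparse {grid id : density} representation, using a simple two-tap averaging low-pass filter and downsampling by 2 along one dimension; the filter and helper names are assumptions for illustration, and a fuller implementation would also keep the detail coefficients or use a longer basis.

    from collections import defaultdict

    def decompose_dimension(grids, dim, low_pass=(0.5, 0.5)):
        """Apply a 1D low-pass filter + downsampling along dimension `dim`
        of a sparse grid set {grid_id: density} (cf. Algorithm 3).

        Each non-empty grid contributes to the downsampled coefficient(s) it
        overlaps; contributions to the same output grid are accumulated."""
        out = defaultdict(float)
        for grid_id, density in grids.items():
            pos = grid_id[dim]
            for tap, weight in enumerate(low_pass):
                # input position `pos` feeds the coefficient at (pos - tap) / 2
                target = pos - tap
                if target < 0 or target % 2 != 0:
                    continue                       # downsampling by 2: keep even positions
                new_id = grid_id[:dim] + (target // 2,) + grid_id[dim + 1:]
                out[new_id] += weight * density
        return dict(out)

    def wavelet_decompose(grids, num_dims):
        """Decompose along every dimension in turn (scale space only)."""
        for dim in range(num_dims):
            grids = decompose_dimension(grids, dim)
        return grids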

IV-C Threshold Filtering

At the third step, the key step of AdaWave, we identify noise and clusters by removing the noise grids from the grid set. Under a high noise percentage, it is hard for the original WaveCluster to eliminate the noise simply by applying the wavelet transform to the original feature space and relying on the low-pass filters used in the transform to remove the noise automatically.

After performing the wavelet transform and eliminating the low-value coefficients, we can preliminarily remove the outliers in sparse-density regions. When the noise percentage is below 20%, the wavelet transform alone already yields outstanding performance at a low computational cost. This automatic and efficient removal of noise enables AdaWave to outperform many other clustering algorithms.

However, if 50 percent of the dataset is noise, many noise grids also have high density and WaveCluster cannot distinguish them from clusters. Therefore, an additional technique is applied to further eliminate the noise grids.

Fig. 6: Threshold selection. (a) sorted grid densities; (b) adaptively finding the threshold.

The density chart after sorting is shown in Fig. 6(a). The grid densities take this shape because the entire data space is "averaged" during the wavelet transform. The essence of the wavelet transform is a process of "filtering": since the filter corresponding to the scale space is a low-pass filter, the high-frequency part that changes rapidly in the grid data is filtered out, leaving the low-frequency data to represent the signal profile.

After low-pass filtering, the densities of the grids roughly fall into three categories: signal data, middle data, and noise data. The sorted-density chart of these three kinds of grids can be statistically fitted with three line segments. The grid density of the signal data is the largest, represented by the leftmost, nearly vertical segment in Fig. 6(b). The middle part consists of the grids between clusters and noise; due to the low-pass filtering in the wavelet decomposition, these grids are lower in density than the grids inside clusters but higher than the noise-only grids, and their density decreases with their distance from the cluster. The noise part also appears relatively smooth due to the low-pass filtering; since the density of the noise data is much lower than that of the data inside clusters, the noise data is represented by the third, almost horizontal segment.

In extensive experiments on various datasets, the position where the "middle line" and the "noise line" intersect is generally the best threshold. The algorithm below is the adaptive technique to find this statistically best threshold.

Input: sorted grid set T, stored as {id : density}
Output: the threshold for filtering
/* scan the sorted grids */
for each grid g in T do
    p1 = g
    p2 = g.next()
    p3 = g.next().next()
    angle = computeAngle(p1, p2, p3)
    /* find the turning point */
    if angle indicates the turning point then return p2.density
    end if
end for
Algorithm 4: Threshold choosing

IV-D Label and Make Lookup Table

Each cluster c_i, 1 ≤ i ≤ k, has a cluster number i. The fourth step of the algorithm labels each grid in the feature space that belongs to a cluster with that cluster's number; that is, label(g) = i for every grid g in cluster c_i. The clusters found so far live in the transformed feature space and are based on wavelet coefficients, so they cannot be used directly to define the clusters in the original feature space. AdaWave therefore builds a lookup table LT to map the grids in the transformed feature space to the grids in the original feature space. Each entry in the table specifies the relationship between one grid in the transformed feature space and the corresponding grid(s) of the original feature space, so the label of each grid in the original feature space can be easily determined. Finally, AdaWave assigns the label of each grid in the original feature space to all objects whose feature vectors fall inside that grid, and thus determines the clusters. Formally, label(o) = label(g) for every object o whose feature vector lies in grid g, where label(o) is the cluster label of object o.
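The labeling and lookup-table step can be sketched as follows, assuming a single decomposition level so that each transformed grid covers a block of 2 original grids per dimension (the block size, helper names and noise label are illustrative assumptions):

    import itertools
    import numpy as np

    def build_lookup_table(clusters, levels=1):
        """Map original-space grid ids to cluster labels (the lookup table LT).

        `clusters` is a list of clusters, each a set of grid ids found in the
        transformed feature space; with `levels` decomposition levels and
        downsampling by 2, one transformed grid covers 2**levels original
        grids per dimension."""
        scale = 2 ** levels
        lookup = {}
        for label, transformed_ids in enumerate(clusters):
            for tid in transformed_ids:
                # enumerate every original grid covered by this transformed grid
                for offsets in itertools.product(range(scale), repeat=len(tid)):
                    orig_id = tuple(t * scale + o for t, o in zip(tid, offsets))
                    lookup[orig_id] = label
        return lookup

    def assign_labels(data, lookup, intervals_per_dim, noise_label=-1):
        """Give every object the label of its original-space grid (noise otherwise)."""
        labels = []
        for point in np.asarray(data, dtype=float):
            gid = tuple(np.minimum((point * intervals_per_dim).astype(int),
                                   intervals_per_dim - 1))
            labels.append(lookup.get(gid, noise_label))
        return labels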

IV-E Time Complexity

Let n be the number of objects in the database and K be the number of (non-empty) grids. Assuming that the feature vectors of the objects are d-dimensional, we have a d-dimensional feature space. Here we suppose n is large and K is comparatively small.

For the first step, the time complexity is O(n), because it scans all the database objects and assigns them to the corresponding grids, where each dimension of the d-dimensional feature space is divided into intervals. Assuming m intervals per dimension, there would be m^d grids in total [4].

For the second step, the complexity of applying the wavelet transform to the feature space is O(lK), where l is a small constant representing the length of the filter used in the wavelet transform. As l is very small, we regard it as a constant, so the cost is O(K). If we apply the wavelet transform over several decomposition levels, downsampling at each level, the required time remains O(lK) because the grid count shrinks geometrically [5]. That is, the cost of the wavelet transform stays linear in the number of grids, which indicates that multi-resolution clustering can be very effective.

The third step aims at finding a suitable threshold for further denoising. Sorting the grid densities takes O(K log K), and filtering takes O(K). Therefore, the total complexity of this step is O(K log K).

To find the connected components in the feature space, the required time is O(cK), where c is a small constant. Making the lookup table requires O(K) time. After reading the n data objects, data processing is performed in steps 2 to 5; thus the time complexity of processing the data (without considering I/O) is O(K log K), which is independent of the number of data objects n. The time complexity of the last step, assigning labels to the objects, is O(n).

Since we assume that AdaWave is applied to very large datasets, that is, n is much larger than K, the cost is dominated by the scans over the objects. Thus, the overall time complexity of the AdaWave algorithm is O(n).
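Putting the pieces together (a back-of-the-envelope restatement under the notation above, with filter length l and a small constant c for the connected-component search, not a new result):

T_{\text{AdaWave}} = \underbrace{O(n)}_{\text{quantization}} + \underbrace{O(lK)}_{\text{wavelet transform}} + \underbrace{O(K\log K)}_{\text{thresholding}} + \underbrace{O(cK)+O(K)}_{\text{components, lookup}} + \underbrace{O(n)}_{\text{labeling}} = O(n) \quad \text{when } K \ll n .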

IV-F Properties of AdaWave

AdaWave is able to detect clusters of irregular shapes. In the AdaWave algorithm, the spatial relationships in the data are preserved during quantization and the wavelet transform. By grouping connected grids into the same cluster, AdaWave makes no assumption on the shape of the clusters: it can find convex, concave, or nested clusters.

AdaWave is noise robust. The wavelet transform is broadly known for its strong denoising performance. AdaWave takes advantage of this property and can automatically remove noise and outliers from the feature space.

AdaWave is memory efficient. We overcome the problem of exponential memory growth in the wavelet transform for high-dimensional data. By using 'grid labeling' and the strategy of 'only storing non-zero grids', AdaWave is able to process data in comparatively high-dimensional spaces, which drastically extends the limits of the WaveCluster algorithm.

AdaWave is computationally efficient. As discussed theoretically and shown experimentally, the running time of AdaWave is linear in the number n of data points, plus a small overhead that depends only on the number K of grids stored. Because in most situations K is much smaller than n, AdaWave is very efficient for large datasets; it is especially efficient when the number of grids and the number of feature-space dimensions are low.

AdaWave is input-order insensitive. When objects are assigned to the grids in the quantization step, the final content of the grids is independent of the order in which the objects are presented. The subsequent steps of the algorithm operate only on these grids; hence, the algorithm produces the same result for any order of the input data.

AdaWave can cluster at multiple resolutions simultaneously by taking advantage of the multi-resolution property of the wavelet transform. By tweaking the quantization parameters or the wavelet basis, users can choose different resolutions for clustering.

V Experimental Evaluation

In this section, we turn to a practical evaluation of AdaWave. First, following the general classification of acknowledged clustering methods, we choose state-of-the-art representatives from different families for comparison. Then, we generate a synthetic dataset that exhibits the challenging properties on which we focus and compare AdaWave with the selected algorithms. We also apply our method to real-world datasets and run a runtime experiment to evaluate the efficiency of AdaWave.

V-A Competitors

For comparison, we evaluate AdaWave against a set of representative competitors from different clustering families.

We begin with k-means [23, 24], a widely known centroid-based clustering technique; to obtain its best performance, we supply k-means with the correct parameter settings. With DBSCAN [19], we have a popular member of the density-based family, which is famous for clustering groups of arbitrary shape. EM [26] focuses on probability instead of distance computation: a multivariate Gaussian probability distribution model is used to estimate the probability that a data point belongs to a cluster, and each cluster is regarded as a Gaussian model.

Next, we compare AdaWave to state-of-the-art clustering methods. RIC [8] fine-tunes an initial coarse clustering based on the minimum description length principle (a model-selection strategy that balances accuracy with complexity through data compression). DipMeans [7] is a wrapper for automating k-means. Self-tuning spectral clustering (STSC) [21] is a popular automated approach to spectral clustering. We also include SkinnyDip [14] in the comparison, a newly proposed method that performs well in handling high noise; the continuity properties of the dip enable SkinnyDip to exploit multimodal projection pursuit and find an appropriate basis for clustering.

V-B Synthetic datasets

In the synthetic data, we try to mimic the general situation of clustering under very high noise. In the dataset, we emulate different cluster shapes and various spatial relations between clusters; thus, clusters may not form a regular shape when projected onto a single dimension.

By default, we choose five clusters of 5600 objects each in two dimensions. One typical cluster is an approximately rectangular distribution that has no overlap with the others in any dimension and is Gaussian with a standard deviation of 0.005; this is designed to increase the difficulty of clustering and to fully show AdaWave's strength in finding arbitrarily shaped clusters. The next two clusters are overlapping circular distributions along the two coordinate directions. The remaining two are pairs of clusters lying close to each other: circles in a concentric distribution and approximate rectangles that do not completely overlap. To evaluate AdaWave and its competitors at different noise levels, we systematically vary the noise percentage over 20, 25, ..., 90 by sampling from the uniform distribution over the unit square. One of the datasets we obtain (50% noise) is shown in Fig. 7.

Fig. 7: A synthetic data set (noise 50%).

As a parameter-free algorithm, AdaWave uses its default settings in all cases. We likewise use the default parameter values for the provided implementations of the baseline algorithms that require neither obscure parameters nor additional processing. To automate DBSCAN, we fix minPts = 8 and run DBSCAN over a range of eps values, reporting the best AMI result from these parameter combinations in each case. For k-means, we similarly supply the correct k to achieve automatic clustering and ensure its best AMI result.
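The automation described above can be reproduced with a simple sweep; the sketch below uses scikit-learn's DBSCAN and AMI, and the eps grid is a placeholder, since the exact values used in the experiments are not reproduced here.

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.metrics import adjusted_mutual_info_score

    def best_dbscan_ami(X, true_labels, eps_grid, min_samples=8):
        """Run DBSCAN over a grid of eps values and report the best AMI
        (illustrative automation of the sweep described in the text)."""
        best = 0.0
        for eps in eps_grid:
            pred = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
            best = max(best, adjusted_mutual_info_score(true_labels, pred))
        return best

    # usage sketch with a hypothetical eps grid
    # ami = best_dbscan_ami(X, y, eps_grid=np.linspace(0.01, 0.2, 20))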

Fig. 8 presents the results of AdaWave and the baselines on these data. Each point corresponds to the mean AMI value over all datasets for that parameter combination. In fairness to the techniques that have no concept of noise (e.g. centroid-based techniques), the AMI only considers the objects that truly belong to a cluster (non-noise points). The results are discussed in Section VI.

Fig. 8: Experimental results on the synthetic dataset.

V-C Real-World Datasets

For real-world data, we analyzed nine datasets of varying size from the UCI repository (http://archive.ics.uci.edu/ml/). These classification-style datasets are often used to quantitatively evaluate clustering techniques in real-world settings; the nine datasets we use include 2D map data and other higher-dimensional data. It should be noted that every point is assigned a semantic class label (none of the datasets includes a noise label). For this reason, we run a k-means iteration (based on Euclidean distance) on the final AdaWave result to assign any detected noise objects to a "true" cluster. The class information is used as the ground truth for validation.

Table I summarizes the results for the different datasets (n denotes the total number of points and d the dimension of the dataset). In a quantitative sense, AdaWave's results are promising when compared to the competitors: it achieves the strongest AMI value on six of the nine datasets and ranks third on two of the remaining datasets. On the Seeds dataset it ranks fourth.

Dataset (n, d)  Seeds    Roadmap     Iris     Glass    DUMDH     HTRU2      Derm.     Motor   Whol.
                (210,7)  (434874,2)  (150,4)  (214,9)  (869,13)  (17898,9)  (366,33)  (94,3)  (440,8)
AdaWave         0.475    0.735       0.663    0.467    0.470     0.217      0.667     1.000   0.735
SkinnyDip       0.348    0.484       0.306    0.268    0.348     0.154      0.638     1.000   0.866
DBSCAN          0.000    0.313       0.604    0.170    0.073     0.000      0.620     1.000   0.696
EM              0.512    0.246       0.750    0.243    0.343     0.151      0.336     0.705   0.578
K-Means         0.607    0.619       0.601    0.136    0.213     0.116      0.465     0.835   0.826
STSC            0.523    0.564       0.734    0.367    *0.000    *0.000     0.608     1.000   0.568
DipMean         0.000    0.459       0.657    0.135    0.000     *0.000     0.296     1.000   0.426
RIC             0.003    0.001       0.424    0.350    0.131     0.000      0.053     0.522   0.308
* The three marked runs failed due to non-trivial bugs in the provided implementations.

TABLE I: Experimental results on real-world datasets.

As a qualitative supplement, we investigate two of the results in detail. The Roadmap dataset was constructed by adding elevation information to a 2D road network in North Jutland, Denmark (covering a region of 185 x 135 km²). In this experiment, we choose the original 2D road network as the dataset for clustering. The horizontal and vertical axes in Fig. 9 represent latitude and longitude respectively, and every data point represents a road segment. Roadmap is clearly a typical highly noisy dataset because the majority of road segments can be regarded as "noise": long arterials connecting cities, or sparse road networks in the relatively thinly populated countryside. In AdaWave, we apply the 2D wavelet transform to the highly noisy Roadmap data and further filter the transformed grids, so that dense groups are automatically detected (with an AMI value of 0.735). The clusters AdaWave detects are generally highly populated areas (cities such as Aalborg, Hjørring and Frederikshavn, with populations over 20,000), which also verifies the plausibility of our results.

Fig. 9: AdaWave result on Roadmap

For the second example, Glass Identification, there are 9 attributes: the refractive index and 8 chemical elements (Na, Mg, etc.). Due to the high dimension and each attribute's weak correlation with the class (as shown in Table II), most techniques, according to our results, perform poorly on glass classification (with an average AMI value of less than 0.3). Instead of projecting onto each direction independently, AdaWave detects 3 connected components of grids in the 9D feature space after the discrete wavelet transform, where the one-dimensional wavelet transform is applied nine times. Though the clustering result of AdaWave is not particularly good, with an AMI value of 0.467, it is nonetheless better than its competitors. The Glass results also reveal the difficulty of clustering high-dimensional data that is only weakly correlated with the class in any single dimension.

Attribute     RI       Na      Mg       Al      Si      K        Ca      Ba      Fe
Correlation   -0.1642  0.5030  -0.7447  0.5988  0.1515  -0.0100  0.0007  0.5751  -0.1879
TABLE II: Each attribute's correlation with the class (Glass)

V-D Runtime comparison

To further explore the efficiency of AdaWave and compare it with commonly used clustering algorithms, we carry out a runtime experiment on synthetic data while scaling n (the number of objects). AdaWave is implemented in Python, SkinnyDip is provided in R, and the remaining algorithms are provided in Java. Due to the difference in languages (interpreted vs. compiled), we focus only on the asymptotic trends.

Fig. 10: Comparison of running time.

We generate the experimental data by increasing the number of objects in each cluster equally; by this means we are able to scale n (the noise percentage is fixed at 75%). The experiments are conducted on a computer with an Intel Core i5 4 GHz CPU, 500 GB of memory, and the Windows 10 Professional operating system.

Fig. 10 illustrates the comparison of running times: AdaWave ranks second among the five algorithms in terms of cost. Compared to methods like k-means and DBSCAN with higher computational complexity, AdaWave is much faster. Although the cost of AdaWave is slightly higher than that of SkinnyDip, which grows sub-linearly, AdaWave attains a much higher AMI in this situation. In other words, the high efficiency of SkinnyDip comes at the cost of shape limitations.

In general, it is not surprising that AdaWave performs well in terms of practical run-time growth. AdaWave is essentially a grid-based algorithm: the number of grids is much smaller than the number of objects, especially when the dimension is high. Due to the quantization process and the linear complexity of the wavelet transform, AdaWave eventually outperforms the other algorithms in the runtime experiments.

VI Discussion

In this section, we discuss AdaWave and its results compared with the related clustering methods. As in the previous section, the competitors include representatives of the major clustering families and some state-of-the-art automatic clustering methods, which enables us to discuss AdaWave in depth.

We first make observations on the performance of the competitors. The standard k-means [23, 24], which lacks a clear notion of "noise", has a low AMI on the synthetic data even when the correct k is given. Model-based approaches like EM [26] also fail to perform well when the cluster shapes do not fit a simple model. The popular DBSCAN [19] method is known to robustly discover clusters of varying shapes in noisy environments. With increasing noise, however, the limitation of DBSCAN (finding clusters of varying density) is magnified when the dataset also contains a large amount of noise. As shown in Fig. 8, the AMI of DBSCAN suffers a sharp decline once the noise percentage reaches 20%, and DBSCAN performs much worse than the others in extremely noisy environments.

STSC [21] and DipMeans [7] are two well-known approaches to automatic clustering, but they are unable to differentiate themselves from the other techniques in the experiments (particularly on the challenging, high-noise synthetic cases). RIC [8] is designed as a wrapper for arbitrary clustering techniques: given a preliminary clustering, RIC purifies these clusters from noise and adjusts itself based on information-theoretic concepts. Unfortunately, it runs into difficulties when the amount of noise in the dataset is non-trivial: in almost all our experiments the number of clusters it detects is one (giving an AMI value of zero). The dip-based technique SkinnyDip [14], which is specialized for high-noise data, also meets challenges in the synthetic experiment because the projection of the clusters onto each dimension is usually not unimodal. Due to this strict precondition, SkinnyDip does not work well on real-world data either.

In general, our results on synthetic and real-world data show that AdaWave is able to outperform the baselines by a wide margin, and this margin is more or less maintained as the number of clusters and the level of noise vary. This is mainly due to the wavelet transform and the threshold filtering. The hat-shaped part of the wavelet filters emphasizes regions where points cluster and simultaneously suppresses weaker information at the boundary, thus automatically removing the outliers and making clusters more salient. Threshold filtering further uncovers the true clusters from dense noise grids, ensuring strong robustness to noise. The main reason for the inferiority of the baselines can be seen by investigating the ring-shaped case: the clusters are two overlapping circular distributions with dense noise around them, which the comparison methods tend to either group together as one cluster or separate into rectangle-style clusters (causing a near-zero AMI). Applying the DWT to the feature space, followed by coefficient reduction and threshold filtering, is a feasible solution for exactly the kind of problem that AdaWave focuses on.

Like other grid-based techniques, AdaWave shows its limitations as the dimension d increases: the number of grid cells rises sharply (exponentially) when every dimension is rasterized, and grid-based clustering tends to become ineffective due to the high computational cost. Our new data structure provides a storage-friendly solution; however, it cannot avoid the high-dimensionality problem in a fundamental sense. On the other hand, according to our experimental results, AdaWave gives lower performance when the noise percentage is low (less than 10%), sometimes even inferior to the results at nearly 30% noise. The main reason is that the "elbow theory" may not work well in the low-noise situation: threshold filtering aims at further eliminating dense noise grids, but most outliers are already removed during coefficient reduction, which influences the determination of the threshold. In practice this rarely matters, because AdaWave targets datasets with extremely high noise and arbitrarily shaped clusters.

VII Conclusion

In this paper, we propose a novel clustering algorithm AdaWave for clustering data in extremely noisy environments. AdaWave is a grid-based clustering algorithm implemented by applying a wavelet transform to the quantized feature space.

AdaWave exhibits strong robustness to noise, outperforming other clustering algorithms including SkinnyDip, DBSCAN and k-means on both synthetic and natural data. In addition, AdaWave does not require the computation of pair-wise distances, resulting in low complexity. By deriving the "grid labeling" method, we drastically reduce the memory consumption of the wavelet transform, so AdaWave can be used to analyze datasets of relatively high dimension. Furthermore, AdaWave is able to detect arbitrarily shaped clusters, which the SkinnyDip algorithm cannot. Through the wavelet transform, AdaWave also inherits the ability to analyze data at different resolutions.

Such properties enable AdaWave to fundamentally distinguish itself from the centroid-, distribution-, density- and connectivity-based approaches. AdaWave outperforms competing techniques especially when clusters are of irregular shapes and the environment is extremely noisy.

References

  • [1] Stéphane Mallat, “A Theory for Multiresolution Signal Decomposition: The Wavelet Representation”, IEEE Trans. Pattern Anal. Mach. Intell., vol. 11, pp.674–693, 1989.
  • [2] Alexander Hinneburg and Daniel A. Keim, “An Efficient Approach to Clustering in Large Multimedia Databases with Noise,” SIGKDD, pp. 58–65, 1998.
  • [3] Michael L. Hilton, Bjorn D. Jawerth and Ayan Sengupta, "Compressing Still and Moving Images with Wavelets," Multimedia Systems, pp. 218–227, 1994.
  • [4] Wu, Guowei and Yao, Lin and Yao, Kai, “An Adaptive Clustering Algorithm for Intrusion Detection,” Information Acquisition, 2006 IEEE International Conference on, pp. 1443–1447, 2006.
  • [5] Zhao, Mingwei and Liu, Yang and Jiang, Rongan, "Research of WaveCluster Algorithm in Intrusion Detection System," Computational Intelligence and Security, 2008. CIS'08. International Conference on, vol. 1, pp. 259–263, 2008.
  • [6] Maurus, Samuel and Plant, Claudia, “Skinny-dip: Clustering in a Sea of Noise,” SIGKDD, pp. 1055–1064, 2016.
  • [7] Kalogeratos, Argyris and Likas, Aristidis, “Dip-means: an incremental clustering method for estimating the number of clusters,”NIPS, pp. 2393–2401, 2012.
  • [8] Böhm, Christian and Faloutsos, Christos and Pan, Jia-Yu and Plant, Claudia, “Robust information-theoretic clustering,” SIGKDD, pp. 65–75, 2006.
  • [9] Dasgupta, Abhijit and Raftery, Adrian E, “Detecting features in spatial point processes with clutter via model-based clustering,” Journal of the American Statistical Association, vol. 93, pp. 294–302, 1998.
  • [10] Wang, Wei and Yang, Jiong and Muntz, Richard and others, "STING: A statistical information grid approach to spatial data mining," VLDB, vol. 97, pp. 186–195, 1997.
  • [11] Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos and Prabhakar Raghavan, “Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications,” SIGMOD, pp. 94–105, 1998.
  • [12] Gholamhosein Sheikholeslami, Surojit Chatterjee and Aidong Zhang, “WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases,” VLDB, pp. 428–439, 1998.
  • [13] Bungartz, Hans-Joachim and Griebel, Michael, "Sparse grids," Acta Numerica, vol. 13, pp. 147–269, 2004.
  • [14] Maurus, Samuel and Plant, Claudia, "Skinny-dip: Clustering in a Sea of Noise," SIGKDD, pp. 1055–1064, 2016.
  • [15] Jain, Anil K and Murty, M Narasimha and Flynn, Patrick J, “Data clustering: a Review,” ACM computing surveys (CSUR), vol.31, pp. 264–323, 1999.
  • [16] Von Luxburg, Ulrike, “A tutorial on spectral clustering,” Statistics and computing, vol.17, pp. 395–416, 2007.
  • [17] Hans-Peter Kriegel, Peer Kröger, Jörg Sander and Arthur Zimek, “Density-based Clustering,” Wiley Interdisc. Rew.: Data Mining and Knowledge Discovery, vol. 1, pp. 231–240, 2011.
  • [18] Jain, Anil K, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, pp. 651–666, 2010.
  • [19] Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu, “A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise,” SIGKDD, pp. 226–231, 1996.
  • [20] Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jörg Sander, “OPTICS: Ordering Points To Identify the Clustering Structure,” SIGMOD, pp. 49–60, 1999.
  • [21] Christian Böhm, Claudia Plant, Junming Shao and Qinli Yang, “Clustering by Synchronization,” SIGKDD, pp. 583–592, 2010.
  • [22] Hartigan, John A and Hartigan, Pamela M, “The dip test of unimodality,” The Annals of Statistics, pp. 70–84, 1985.
  • [23] Steinhaus, H., “Sur la division des corps matériels en parties,” Bull. Acad. Polon. Sci. (in French), vol.4, pp. 801–804, 1957.
  • [24] E.W. Forgy, "Cluster Analysis of Multivariate Data: Efficiency Versus Interpretability of Classifications," Biometrics, vol. 21, pp. 768–769, 1965.
  • [25] Aleksandar Bojchevski, Yves Matkovic and Stephan Günnemann, “Robust Spectral Clustering for Noisy Data: Modeling Sparse Corruptions Improves Latent Embeddings,” SIGKDD, pp. 737–746, 2017.
  • [26] Celeux G and Govaert G, “A classification EM algorithm for clustering and two stochastic versions,” Elsevier Science Publishers B.V, pp. 315–332, 1992.