I Introduction
Clustering is a fundamental data mining task that finds groups of similar objects while keeping dissimilar objects separated in different groups or in the group of noise (noisy points) [15, 17]. The objects can be spatial data points, feature vectors, or patterns. Typical clustering techniques include centroid-based clustering [18][16], density-based clustering [17], etc. These techniques usually perform well on “clean” data. However, they face a big challenge when dealing with real-world applications where patterns are usually mixed with noise. Furthermore, complex and irregularly shaped groups render these once-effective clustering techniques ineffective, because typical clustering approaches either lack a clear concept of “noise” or are limited to specific situations by their sensitivity to cluster shape.
This work addresses the problem of effectively uncovering arbitrarily shaped clusters when the noise is extremely high. Based on the pioneering work of Sheikholeslami that applies wavelet transform, originally used for signal processing, to spatial data clustering [12], we propose a new wavelet-based algorithm called AdaWave that can adaptively and effectively uncover clusters in highly noisy data. To tackle general applications, we assume that the clusters in a dataset do not follow any specific distribution and can be arbitrarily shaped.
To show the hardness of the clustering task, we first design a highly noisy running example with clusters of various types, as illustrated in Fig. 1(a). Fig. 1(b) shows that AdaWave can correctly locate these clusters. Without estimating any explicit model of the data, the proposed AdaWave algorithm exhibits properties that are favorable to large-scale real-world applications.
For comparison, we illustrate the results of several typical clustering methods on the running example: k-means [23, 24] as the representative of centroid-based clustering methods, DBSCAN [19] as the representative of density-based clustering methods [17, 20], and SkinnyDip [14] as a newly proposed method that handles extremely noisy environments. Besides illustration, we also use Adjusted Mutual Information (AMI) [21], a standard metric ranging from 0 at worst to 1 at best, to evaluate the performance.

Centroid-based clustering algorithms tend to lack a clear notion of “noise”, and behave poorly on very noisy data. As shown in Fig. 2(b), the standard k-means yields poor results and the AMI is very low at 0.25.

Density-based clustering algorithms usually perform well when clusters have heterogeneous forms. As a representative, DBSCAN is known to robustly discover clusters of varying shapes in noisy environments. After fine-tuning the parameters, which requires graphical inspection, we obtain the DBSCAN result shown in Fig. 2(c). DBSCAN roughly finds the shapes of the clusters, but there is still much room for improvement. It detects 21 clusters with an AMI of 0.28, primarily because there are various areas in which, through the randomness of noise, the local density threshold is exceeded. We also tested various parameter settings and did not find a combination for DBSCAN that solves the problem.

SkinnyDip is known to robustly cluster spatial data in extremely noisy environments. For the running example, we sample from the nodes of a sparse grid [13], which is regarded as the best mechanism for choosing a starting point for gradient ascent in SkinnyDip. However, the clustering result in Fig. 2(c) shows that SkinnyDip performs poorly because the dataset does not satisfy its assumption that the projection of the clusters onto each dimension is unimodal.
The above methods rely heavily on the estimation of explicit models of the data, and produce high-quality results when the data is organized according to the assumed models. However, when facing more complicated and unknown cluster shapes or highly noisy environments, these methods do not perform as well as expected. By comparison, as shown in Fig. 2(d), the proposed AdaWave, which applies wavelet decomposition for denoising and “elbow theory” for adaptive threshold setting, correctly detects all five clusters and groups all the noisy points as the sixth cluster. AdaWave achieves an AMI value as high as 0.76 and, furthermore, computes deterministically and runs in linear time.
We propose AdaWave (the data and code will be made public after the review) to cluster highly noisy data, which frequently appears in real-world applications where clusters of arbitrary shapes occur in extreme-noise settings. We also demonstrate the efficiency and effectiveness of AdaWave on synthetic as well as real-world data. The main characteristics of AdaWave that distinguish the algorithm from existing clustering methods are as follows:

AdaWave is deterministic, fast, order-insensitive, shape-insensitive, robust in highly noisy environments, and requires no prior knowledge of the data models. To our knowledge, there is no other clustering method that meets all these properties.

We design a new data structure for wavelet transform such that, compared to the classic WaveCluster algorithm, AdaWave is able to handle high-dimensional data while remaining storage-friendly.

We propose a heuristic method based on elbow theory to adaptively estimate the best threshold for noise filtering. By adapting the threshold automatically, AdaWave exhibits high robustness in extremely noisy environments and outperforms the state-of-the-art baselines in our experiments.
II Related Work
Various approaches have been proposed to improve the robustness of clustering algorithms in noisy data [19, 9, 14]. Here we highlight several algorithms most related to this problem and focus on illustrating the preconditions for these clustering methods.
DBSCAN [19] is a typical density-based clustering method designed to reveal clusters of arbitrary shapes in noisy data. When varying the noise level on the running example, we find that DBSCAN performs well only when the noise is kept below 15%. Its performance degrades drastically as we continue to increase the noise percentage. Meanwhile, the overall average time complexity of DBSCAN is $O(n \log n)$ for $n$ data points and, in the worst case, $O(n^2)$. Thus, its running time can also be a limiting factor when dealing with large-scale datasets. Another density-based clustering method, Sync [21], exhibits the same weakness in time complexity (it relies on pairwise distances, which makes its running time at least quadratic in the number of data points).
Regarding data with high noise, as early as 1998, Dasgupta et al. [9] proposed an algorithm to detect minefields and seismic faults from “cluttered” data. Their method is limited to the two-dimensional case, and an extended version for slightly higher dimension () requires significant parameter tuning.
In 2016, Samuel et al. [14] proposed an intriguing method called SkinnyDip. SkinnyDip optimizes DipMeans [7] with an elegant dip test of unimodality [22]. It focuses on the high-noise situation, and yields good results when the clusters take a unimodal form along each coordinate direction. However, this condition is very strict: the projections of the clusters have to be unimodal in every dimension. When this condition does not hold, SkinnyDip cannot uncover the clusters correctly.
A work proposed in 2017 [25] applies a sparse and latent decomposition of the similarity graph used in spectral clustering to jointly learn the spectral embedding and the corrupted data. It proposes a robust spectral clustering (RSC) technique that is not sensitive to noisy data. The experiments demonstrate the robustness of RSC against spectral clustering methods. However, it can only deal with low-noise data (up to and of noise).
Our proposed AdaWave algorithm targets extreme-noise settings as SkinnyDip does. Besides, as a grid-based algorithm, AdaWave shares the common characteristic of STING [10] and CLIQUE [11]: it is fast and independent of the number of data objects. The time complexity of AdaWave is , where is the number of grids. Although this complexity is slightly higher than that of SkinnyDip, AdaWave still runs in linear time and yields good results when the dataset consists of clusters of irregular shapes.
Experiments in Section V show that AdaWave outperforms other algorithms, especially in the following scenarios that happen commonly in large scale real applications, when the data 1) contains clusters of irregular shapes such as rings, 2) is a very large dataset in relatively high dimensions, 3) contains a very high percentage (for example 80%) of noise.
III Wavelet Transform
Wavelet transform is known as an efficient denoising technique. Surpassing its predecessor, the Fourier transform, the wavelet transform can analyze the frequency attributes of a signal while retaining its spatial information.
III-A Principles of Wavelet Transform
In this section, we focus on the discrete wavelet transform (DWT), which is applied in the AdaWave algorithm. The ‘transform’ in DWT separates the original signal into scale space and wavelet space. The scale space stores an approximation of the outline of the signal while the wavelet space stores the detail of the signal. Referring to the Mallat algorithm [1], we can simplify the complicated process of DWT into two filters.
III-A1 1D Discrete Wavelet Transform
As shown in Fig. 3, the signal passes through two filters and is downsampled by 2. According to the Mallat algorithm [12], the signal can be decomposed into scale space and wavelet space by passing through a low-pass filter and a high-pass filter, respectively. For each choice of wavelet function, the corresponding filters can be obtained from a precalculated table.
Just by passing the signal through two filters and downsampling by 2, we are able to decompose the signal into a space that only contains the outline of the signal and another space that only contains the detail. The signal discussed here includes high-dimensional signals: filtering a multidimensional signal simply repeats the 1D process once for each dimension.
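As a concrete illustration of the filter-and-downsample step, the following minimal sketch performs one level of a 1D DWT with the Haar wavelet. The Haar basis, the function name `haar_dwt_1d`, and the padding of odd-length input are assumptions made for illustration; AdaWave itself may use a different wavelet and boundary handling.

```python
import numpy as np


def haar_dwt_1d(signal):
    """One level of a 1D Haar DWT: filter, then downsample by 2."""
    x = np.asarray(signal, dtype=float)
    if x.size % 2:
        x = np.append(x, x[-1])  # pad odd-length input by repeating the last sample
    low = np.array([1.0, 1.0]) / np.sqrt(2.0)    # low-pass (scaling) filter
    high = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-pass (wavelet) filter
    # Grouping samples in non-overlapping pairs applies the two-tap filter
    # and the downsampling by 2 in a single step.
    pairs = x.reshape(-1, 2)
    approx = pairs @ low    # scale space: outline of the signal
    detail = pairs @ high   # wavelet space: detail of the signal
    return approx, detail


approx, detail = haar_dwt_1d([4, 6, 10, 12, 8, 8, 0, 2])
```

Here `approx` holds the scaled pairwise sums (the outline) and `detail` holds the scaled pairwise differences; because the Haar filters are orthonormal, the two outputs together preserve the energy of the input.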
In both data science and information theory, noise is defined as the collection of unstable, meaningless points in a signal (dataset). As shown in Fig. 4, in a denoising task we try to maintain the outline of the signal and amplify the contrast between high and low values, which makes it a perfect stage for DWT.
III-A2 2D Discrete Wavelet Transform
To further show DWT’s denoising ability on spatial data, we apply two separate 1D wavelet transforms to a 2D dataset and filter the coefficients in the transformed feature space. Referring to [3], the 2D feature space is first convolved along the horizontal () dimension, resulting in a low-pass feature space and a high-pass feature space . We then downsample each of the convolved spaces in the dimension by 2. Both and are then convolved along the vertical () dimension, resulting in four subspaces: (average signal), (horizontal features), (vertical features), and (diagonal features). Next, by extracting the signal part () and discarding low-value coefficients, we obtain the transformed feature space, as shown in Fig. 5.
Intuitively, according to Fig. 5, dense regions in the original feature space act as attractors to the nearby points and at the same time as inhibitors to the points that are not close enough. This means that clusters in the data and the clear regions around them automatically stand out and become more distinct. Also, it is evident that the number of points sparsely scattered (outliers) in the transformed feature space is lower than that in the original space. The decrease in outliers reveals the robustness of DWT regarding extreme noise.
As mentioned earlier, the above wavelet model can similarly be generalized to a higher-dimensional feature space, where the one-dimensional wavelet transform is applied once for each dimension.
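The separable 2D procedure described above can be sketched as follows, again with the Haar wavelet as an assumed basis. The function names and the subband labels `ll`, `lh`, `hl`, `hh` (first letter: horizontal filter, second letter: vertical filter) are illustrative choices, and the grid is assumed to have even side lengths.

```python
import numpy as np


def haar_step(a, axis):
    """Haar low-/high-pass filtering plus downsampling by 2 along one axis."""
    a = np.moveaxis(np.asarray(a, dtype=float), axis, 0)
    even, odd = a[0::2], a[1::2]
    low = (even + odd) / np.sqrt(2.0)
    high = (even - odd) / np.sqrt(2.0)
    return np.moveaxis(low, 0, axis), np.moveaxis(high, 0, axis)


def haar_dwt_2d(grid):
    """Return the four subbands (ll, lh, hl, hh) of an even-sized 2D grid."""
    low, high = haar_step(grid, axis=1)  # horizontal pass
    ll, lh = haar_step(low, axis=0)      # vertical pass on the low-pass half
    hl, hh = haar_step(high, axis=0)     # vertical pass on the high-pass half
    return ll, lh, hl, hh


# A dense 2x2 block (a "cluster") and one isolated point (an "outlier").
grid = np.zeros((4, 4))
grid[:2, :2] = 4
grid[3, 3] = 1
ll, lh, hl, hh = haar_dwt_2d(grid)
denoised = np.where(ll > 1.0, ll, 0.0)  # discard low-value coefficients (threshold assumed)
```

In this toy grid the dense block produces a large `ll` coefficient while the isolated point produces a small one, so a simple cut on `ll` keeps the block and drops the outlier.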
III-B Properties of Wavelet Transform
Besides its ability to analyze frequency and spatial information at the same time, other properties of DWT also distinguish it among various denoising methods.
Low entropy. The sparse distribution of wavelet coefficients reduces the entropy after DWT. After the signal decomposition, many wavelet coefficients are close to zero, and these generally correspond to noise. The main component of the signal is concentrated in a few wavelet bases; therefore, removing the low-value coefficients is an effective denoising method that retains the original signal well.
Multiresolution. As shown in Fig. 3, DWT is implemented in a layered structure. In each layer of decomposition, we only decompose which represents the wavelet space, also known as the detailed feature of the signal, and preserve the scale space . Such a layered structure makes it possible to observe the signal in different resolutions of scale space, to , in a single application of DWT.
Decorrelation. As illustrated above, DWT can separate signals into scale space and wavelet space. By such separation, DWT decorrelates the ‘detail’ (the part that oscillates very fast in Fig. 4). With the decorrelation property, DWT works especially well for separating noise from highly noisy data.
Flexibility of choosing basis. Wavelet transform can be applied with a flexible choice of wavelet function. As described above, for each kind of wavelet function, there exists a precalculated filter. According to different requirements, various wavelet families have been designed, including Daubechies, Biorthogonal and so on. The users of AdaWave have the flexibility to choose any kind of wavelet basis, which makes the algorithm universal under various requirements.
IV Wavelet-Based Adaptive Clustering
Given a dataset with data points in dimensional space, and the coordinate of each data is (), labeled in the ground truth as either belonging to a cluster or being noise, the clustering problem is to cluster the data and filter the noise. AdaWave is an efficient algorithm based on wavelet transform. It is divided into four steps, as shown in Algorithm 1.
In the first step, we propose a new data structure for clustering high-dimensional data. With space quantization, we divide the data into grids and project them to a high-scale space. Step 2 is a wavelet transform that preliminarily denoises by removing wavelet coefficients close to zero. Step 3 is threshold filtering, which further eliminates noise; we apply “elbow theory” to set the threshold adaptively. In the last step, we label the grids and build the lookup table, thereby mapping the grids back to the original data.
IV-A Quantization
The first step of AdaWave is to quantize the feature space. Assume that the domain at the dimension in the feature space is divided into intervals. By making such division in each dimension, we separate the feature space into multiple grids. Objects (data points) are allocated into these grids according to the coordinates at each dimension.
In the original wavelet clustering algorithm WaveCluster [12], each grid is the intersection of one interval from each dimension, and each such interval is a right-open interval in the partition of its dimension. An object (data point) is contained in a grid if each of its coordinates lies within the grid’s interval for the corresponding dimension. For each grid, we use the number of objects contained in the grid as the value of grid density to represent its statistical feature. The selection of the number of grids to generate and of the statistical feature to use can significantly affect the performance of AdaWave.
It is easy to fall into the trap of storing all grids in the feature space to keep the quantized data. Even though quantizing the data can be completed in linear time, doing so can lead to memory consumption that is exponential in the dimension. In AdaWave, we store only the grids with nonzero density by “labeling” the grids in the data space. For low-dimensional data, storing only grids with nonzero density offers little advantage because of the high data density in the entire space. In high dimensions, however, the number of grids far exceeds the number of data points. When many grids have zero density, this strategy saves a considerable amount of memory, making it possible to apply AdaWave to high-dimensional clustering problems.
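The idea of storing only the non-empty grids can be sketched with a dictionary keyed by grid index. This is a minimal illustration rather than the authors' data structure; the function name `quantize`, the use of the same number of intervals in every dimension, and the min-max scaling of the domain are assumptions.

```python
from collections import Counter

import numpy as np


def quantize(points, intervals):
    """Map each point to a grid cell and count the points per non-empty cell."""
    pts = np.asarray(points, dtype=float)
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)           # avoid division by zero
    idx = np.floor((pts - lo) / span * intervals).astype(int)
    idx = np.minimum(idx, intervals - 1)             # the maximum falls in the last cell
    cells = [tuple(int(v) for v in row) for row in idx]
    # Only cells that contain at least one point ever become keys,
    # so memory grows with the data, not with intervals ** dimension.
    return Counter(cells), cells


points = [[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [0.9, 0.95]]
density, cells = quantize(points, intervals=4)
```

With four intervals per dimension the full 2D grid has 16 cells, but `density` holds only the two cells that actually contain points; the same scheme in higher dimensions avoids allocating the exponentially many empty cells.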
IV-B Transformation and Coefficient Denoising
At the second step, we apply wavelet transform to the dimensional grid set to transform the grids into a new feature space. According to Eq. (1), the original data can be represented by wavelet coefficients and scaling coefficients in the feature space, determined by orthonormal basis and .
(1) 
We calculate the coefficients by Eq. (2), where scaling coefficients {} represent the signal after lowpass filtering that preserves the smooth shape, while wavelet coefficients represent the signal after highpass filtering and usually correspond to the noise part.
(2) 
After the discrete wavelet transform, we remove the wavelet coefficients and the low-valued scaling coefficients (the noise part of the signal), then reconstruct the new grid set with the remaining coefficients. In this way, noise (outliers) is automatically eliminated.
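For reference, the textbook form of an expansion in an orthonormal scaling basis and wavelet basis, which matches the description of scaling and wavelet coefficients above, is given below. The symbols $f$, $\varphi$, $\psi$, $c$, $d$ and $j_0$ are assumed notation and are not necessarily those used in Eq. (1) and Eq. (2).

```latex
% Textbook orthonormal wavelet expansion (notation assumed, not taken from the paper)
\begin{align}
f(x) &= \sum_{k} c_{j_0,k}\,\varphi_{j_0,k}(x)
      + \sum_{j \ge j_0} \sum_{k} d_{j,k}\,\psi_{j,k}(x), \\
c_{j_0,k} &= \langle f, \varphi_{j_0,k} \rangle, \qquad
d_{j,k} = \langle f, \psi_{j,k} \rangle .
\end{align}
```

In this form the scaling coefficients $c_{j_0,k}$ carry the low-pass (smooth) part of the signal and the wavelet coefficients $d_{j,k}$ carry the high-pass (detail) part.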
IV-C Threshold Filtering
In the third step, a key step of AdaWave, we identify noise by removing the noise grids from the grid set. When the noise percentage is high, it is hard for the original WaveCluster to eliminate the noise merely by applying wavelet transform to the original feature space and relying on the low-pass filters used in the wavelet transform to remove the noise automatically.
After performing wavelet transform and eliminating the low value coefficients, we can preliminarily remove the outliers with the sparse density. In other words, when the noise percentage is below 20%, wavelet transform gains outstanding performance, and the computation complexity is . The automatic and efficient removal of the noise enables AdaWave to outperform many other clustering algorithms.
If 50% of the dataset is noise, many noise grids would also have high density and purely applying wavelet transform cannot distinguish noise from clusters. Therefore, an additional technique is applied to further eliminate the noise grids.
The density chart after sorting is shown in Fig. 6(a). The grid densities take this form because the entire data space is “averaged” during the wavelet transform. The essence of the wavelet transform is a process of “filtering”. Since the filter corresponding to the scale space in wavelet transform is a low-pass filter, the high-frequency part that changes rapidly in the grid data is filtered out, leaving the low-frequency data to represent the signal profile.
After low-pass filtering, the density of the grids in the signal space is roughly divided into three categories: signal data, middle data, and noise data. The sorted densities of these three types of grids are statistically fitted with three line segments. The grid density of signal data should be the largest, represented by the leftmost vertical line in Fig. 6(b). The middle part consists of the grids between clusters and noise. Due to the low-pass filtering in wavelet transform, these grids have densities lower than those of the grids in the class but higher than those of the noise-only grids. The density of these grids decreases with their distance from the class. The noise part also appears relatively smooth due to the low-pass filtering. Since the density of the noise data is much lower than that of the data in the class, the noise data is represented by the third line, which is almost horizontal.
Based on our test experiments on various datasets, the position where the “middle line” and the “noise line” intersect is generally the best threshold. The algorithm below is the adaptive technique to find the statistically best threshold.
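To give a feel for how an elbow in the sorted density curve can be located automatically, the sketch below uses a common stand-in heuristic: it picks the point farthest from the chord that joins the first and last points of the normalized curve. This is not the three-segment fit described above, and the function names `elbow_threshold` and `filter_grids` are hypothetical.

```python
import numpy as np


def elbow_threshold(densities):
    """Return the density at the elbow of the descending-sorted density curve."""
    y = np.sort(np.asarray(densities, dtype=float))[::-1]
    if y.size == 0:
        raise ValueError("densities must not be empty")
    if y.size < 3 or y[0] == y[-1]:
        return float(y[-1])                    # flat or tiny curve: no elbow to find
    x = np.linspace(0.0, 1.0, y.size)          # normalized rank
    y_norm = (y - y[-1]) / (y[0] - y[-1])      # normalized density in [0, 1]
    # The chord joins (0, 1) and (1, 0); a point's distance to it is
    # proportional to |x + y - 1|.
    gap = np.abs(x + y_norm - 1.0)
    return float(y[np.argmax(gap)])


def filter_grids(density, threshold):
    """Keep only the grids whose density exceeds the threshold."""
    return {cell: d for cell, d in density.items() if d > threshold}


density = {(i,): d for i, d in enumerate([2, 50, 2, 40, 2, 30, 2, 20, 2, 10, 2])}
threshold = elbow_threshold(list(density.values()))
kept = filter_grids(density, threshold)
```

On this toy input the curve drops steeply over the five dense grids and is flat over the rest, so the elbow falls on the first low-density grid and only the five dense grids survive the filter. Sorting dominates the cost of this step.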
IV-D Label and Make Lookup Table
Each cluster has a cluster ID. The fourth step of AdaWave labels the grids in the feature space included in a cluster with the cluster ID. The clusters found in the transformed feature space cannot be directly used to define the clusters in the original feature space, because they are only based on wavelet coefficients. AdaWave builds a lookup table to map the grids in the transformed feature space to the grids in the original feature space. Each entry in the table specifies the relationship between one grid in the transformed feature space and the corresponding grid(s) in the original feature space. Therefore, the label of each grid in the original feature space can be easily determined. Finally, AdaWave assigns the label of each grid in the feature space to all objects whose feature vector is inside that grid, and thus determines the clusters.
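The labeling and lookup steps can be sketched as a breadth-first search over the kept grids followed by an index mapping. Several details here are assumptions for illustration: grids that touch diagonally are treated as connected, a single decomposition level is used so that original grid index `v` maps to transformed index `v // 2`, and the function names and the noise label `-1` are hypothetical.

```python
from collections import deque
from itertools import product


def label_grids(kept_cells):
    """Give every connected group of kept cells its own cluster ID."""
    kept = set(kept_cells)
    labels = {}
    next_id = 0
    for start in sorted(kept):               # sorted for a deterministic labeling
        if start in labels:
            continue
        labels[start] = next_id
        queue = deque([start])
        while queue:
            cell = queue.popleft()
            # Visit all neighbors that differ by at most 1 in every coordinate.
            for step in product((-1, 0, 1), repeat=len(cell)):
                neighbor = tuple(c + s for c, s in zip(cell, step))
                if neighbor in kept and neighbor not in labels:
                    labels[neighbor] = next_id
                    queue.append(neighbor)
        next_id += 1
    return labels


def label_points(point_cells, grid_labels, noise=-1):
    """Map each point's original cell to its transformed cell and look up the label."""
    return [grid_labels.get(tuple(v // 2 for v in cell), noise) for cell in point_cells]


grid_labels = label_grids({(0, 0), (0, 1), (1, 1), (5, 5)})
point_labels = label_points([(0, 1), (3, 2), (10, 11), (7, 7)], grid_labels)
```

The neighbor enumeration visits `3 ** d` offsets per cell, which is acceptable for a low-dimensional illustration; in higher dimensions one would restrict the neighborhood, for example to cells that differ in a single coordinate.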
IV-E Time Complexity
Let be the number of objects in the database and be the number of grids. Assuming that the feature vectors of the objects are in dimensions. Here we suppose is large and is comparatively small.
For the first step, the time complexity is , because AdaWave scans all the objects and assigns them to the corresponding grids, where the domain at each dimension in the dimensional feature space will be divided into intervals. Assuming the number of intervals for each dimension of the feature space, there would be grids [4].
For the second step, the complexity of applying wavelet transform to the feature space is , where is a small constant representing the length of the filter used in the wavelet transform. As is small, we regard as a constant and . If we apply wavelet transform to decomposition levels to sample in each level downward, the required time is less than [5]. That is, the cost of wavelet transform is less than , indicating that a multiresolution clustering can be very effective.
The third step aims at finding a suitable threshold for further denoising. The sorting algorithm has a time complexity of , and filtering takes . Therefore, the total complexity of this step is .
To find the connected components in the feature space, the required time will be , where is a small constant. Making the lookup table requires time. After reading data objects, data processing is performed sequentially. Thus the time complexity of processing data (without considering I/O) would be , which is independent of the number of data objects . The time complexity of the last step is .
For very large datasets, , , the overall time complexity of AdaWave is .
IV-F Properties of AdaWave
AdaWave is able to detect clusters of irregular shapes. In the AdaWave algorithm, the spatial relationships in the data are preserved during quantization and wavelet transform. By grouping connected grids into the same cluster, AdaWave makes no assumption on the shape of the clusters. It can find convex, concave, or nested clusters.
AdaWave is noise robust. Wavelet transform is broadly known for its great performance in denoising. AdaWave takes advantage of this property and can automatically remove the noise and outliers from the feature space.
AdaWave is memory efficient. We overcome the problem of exponential memory growth in wavelet transform for high-dimensional data. By using ‘grid labeling’ and the strategy of storing only nonzero grids, AdaWave is able to process data in comparatively high-dimensional space, which drastically extends the limits of the WaveCluster algorithm.
AdaWave is computationally efficient. The time complexity of AdaWave is where denotes the number of data points and denotes the number of grids stored. AdaWave is very efficient for large datasets where , and .
AdaWave is input-order insensitive. When the objects are assigned to the grids in the quantization step, the final content of the grids is independent of the order in which the objects are presented. The following steps of the algorithm are performed only on these grids. Hence, the algorithm produces the same results for any order of the input data.
AdaWave can cluster in multiresolution simultaneously by taking advantage of the multiresolution attribute from the wavelet transform. By tweaking the quantization parameters or the wavelet basis, users can choose different resolution for clustering.
V Experimental Evaluation
In this section, we turn to a practical evaluation of AdaWave. First, following the general classification of established clustering methods, we choose state-of-the-art representatives from different families for comparison. Then, we generate a synthetic dataset that exhibits the challenging properties on which we focus and compare AdaWave with the selected algorithms. Finally, we apply our method to real-world datasets and conduct runtime experiments to evaluate the efficiency of AdaWave.
V-A Baselines
For comparison, we evaluate AdaWave against a set of representative baselines from different clustering families and state-of-the-art algorithms.
We begin with k-means [23, 24], a widely known centroid-based clustering technique. To achieve the best performance of k-means, we set its parameter $k$ to the correct number of clusters. With DBSCAN [19], we have the popular member of the density-based family that is famous for clustering groups of arbitrary shape. EM [26] focuses on probability instead of distance computation; a multivariate Gaussian probability distribution model is used to estimate the probability that a data point belongs to a cluster, with each cluster regarded as a Gaussian model.
Next, we compare AdaWave to advanced clustering methods proposed recently. RIC [8] fine-tunes an initial coarse clustering based on the minimum description length principle (a model-selection strategy that balances accuracy with complexity through data compression). DipMeans [7] is a wrapper for automating EM and k-means respectively. Self-tuning spectral clustering (STSC) [21] is a popular automated approach to spectral clustering. Moreover, we consider SkinnyDip [14] because it is a newly proposed method that performs well in handling high noise. The continuity properties of the dip enable SkinnyDip to exploit multimodal projection pursuit and find an appropriate basis for clustering.
V-B Synthetic datasets
In the synthetic data, we try to mimic the general situation for clustering under very high noise. In the dataset, we simulate different shapes of clusters and various space relations between two clusters. Thus, clusters may not be able to create a uniform shape when projected to a dimension.
By default, we generate five clusters of 5600 objects each in two dimensions, as shown in Fig. 7. There is a typical cluster roughly within an ellipse, with the data following a Gaussian distribution with a standard deviation of 0.005. To increase the difficulty of clustering, the next two clusters are of circular distributions overlapping in the directions of and . The remaining two clusters are in the shape of parallel sloping lines. To evaluate AdaWave against the baselines at different degrees of noise, we systematically vary the noise percentage over {20, 25, …, 90} by sampling from the uniform distribution over the whole square. In Fig. 7, the noise is 50%.
As a parameter-free algorithm, AdaWave uses its default value of with parameters in all cases, and we choose the Cohen-Daubechies-Feauveau (2,2) wavelet for the wavelet transform. We likewise use the default parameter values for the provided implementations of the baseline algorithms, which require neither obscure parameters nor additional processing. To automate DBSCAN, we fix = 8 and run DBSCAN for all , reporting the best AMI result from these parameter combinations in each case. For k-means, we similarly set the correct $k$ to achieve automatic clustering and ensure the best AMI result.
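The relation between the noise percentage and the number of noise points to draw can be made explicit with a short sketch. It assumes that the noise percentage is measured relative to the whole dataset (clusters plus noise) and that the "whole square" is the unit square; the function names and the seed are hypothetical.

```python
import numpy as np


def noise_count(n_cluster_points, noise_fraction):
    """Number of noise points so that noise makes up the given fraction of all data."""
    if not 0.0 <= noise_fraction < 1.0:
        raise ValueError("noise_fraction must be in [0, 1)")
    return round(n_cluster_points * noise_fraction / (1.0 - noise_fraction))


def add_uniform_noise(cluster_points, noise_fraction, seed=0):
    """Append uniform noise from the unit hypercube and return data plus a noise mask."""
    pts = np.asarray(cluster_points, dtype=float)
    rng = np.random.default_rng(seed)
    n_noise = noise_count(len(pts), noise_fraction)
    noise = rng.uniform(0.0, 1.0, size=(n_noise, pts.shape[1]))
    is_noise = np.concatenate([np.zeros(len(pts), dtype=bool),
                               np.ones(n_noise, dtype=bool)])
    return np.vstack([pts, noise]), is_noise


n_clustered = 5 * 5600                      # five clusters of 5600 objects each
n_noise_50 = noise_count(n_clustered, 0.50)
n_noise_90 = noise_count(n_clustered, 0.90)
```

Under these assumptions, the 28000 clustered objects are joined by 28000 noise points at 50% noise and by 252000 noise points at 90% noise, which shows how quickly the noise dominates at the upper end of the range.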
Fig. 8 presents the results of AdaWave and the baselines on the synthetic data. To be fair to the techniques that have no notion of noise (e.g. centroid-based techniques), the AMI only considers the objects which truly belong to a cluster (non-noise points). AdaWave clearly outperforms all the baselines at every noise level, and is less sensitive to the noise increase. Even with 90% noise, AdaWave still has a high AMI of 0.55 while the others are around 0.20, except DBSCAN. EM, SkinnyDip, k-means and WaveCluster behave similarly, with WaveCluster yielding the lowest AMI and k-means the second lowest. DBSCAN can be as good as AdaWave in a low-noise environment (20%), but its performance decays quickly and it cannot find any clusters when the noise is above 60%.
V-C Real-World Datasets
For real-world data, we analyzed nine datasets of varying size from the UCI repository (http://archive.ics.uci.edu/ml/), namely Seeds, Roadmap, Iris, Glass, DUMDH, HTRU2, Dermatology (Derm.), Motor and Wholesale customers (Whol.). These classification-style data are often used to quantitatively evaluate clustering techniques in real-world settings, and the nine datasets we use include 2D map data as well as higher-dimensional datasets. For real data where every point is assigned a semantic class label (none of the data include a noise label), we run the k-means iteration (based on Euclidean distance) on the final AdaWave result to assign every detected noise object to a “true” cluster. Class information is used as ground truth for the validation.
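The reassignment of detected noise objects can be sketched as a single k-means style assignment step: each noise object receives the label of the nearest cluster centroid under Euclidean distance. Whether one or several iterations are run is not stated above, so the single step, the function name `assign_noise_to_clusters`, and the noise label `-1` are assumptions.

```python
import numpy as np


def assign_noise_to_clusters(points, labels, noise=-1):
    """Give every noise point the label of the nearest cluster centroid (Euclidean)."""
    pts = np.asarray(points, dtype=float)
    labels = np.asarray(labels).copy()
    cluster_ids = np.unique(labels[labels != noise])
    if cluster_ids.size == 0:
        return labels                               # no cluster to assign to
    centroids = np.stack([pts[labels == c].mean(axis=0) for c in cluster_ids])
    noisy = labels == noise
    # Pairwise distances between noise points and centroids, shape (n_noise, n_clusters).
    dist = np.linalg.norm(pts[noisy][:, None, :] - centroids[None, :, :], axis=2)
    labels[noisy] = cluster_ids[dist.argmin(axis=1)]
    return labels


points = [[0, 0], [0, 1], [10, 10], [10, 11], [1, 1], [9, 9]]
labels = [0, 0, 1, 1, -1, -1]
full_labels = assign_noise_to_clusters(points, labels)
```

After this step every object carries a cluster label, so the result can be compared against class labels that contain no noise category.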
Table I summarizes the results for different datasets ( denotes the number of data points and the dimension of the data). Quantitatively, AdaWave’s results are promising compared to the baselines. It achieves the highest AMI value on six of the nine datasets, and ranks third on two of the remaining datasets (Iris and Whol.). Only on the Seeds dataset does AdaWave rank fourth among the eight algorithms. Overall, AdaWave performs best with an average AMI of 0.60, followed by k-means, SkinnyDip and STSC with an average AMI of around 0.49.
V-D Case Study
As a qualitative case study, we investigate two of the clustering results in detail.
The Roadmap dataset was constructed by adding elevation information to a 2D road network in North Jutland, Denmark (covering a region of 185 x 135 km²). In this experiment, we choose the original 2D road network as the dataset for clustering. The horizontal and vertical axes in Fig. 9 represent latitude and longitude respectively and every data point represents a road segment. Roadmap is clearly a typical highly noisy dataset because the majority of road segments can be termed as “noise”: long arterials connecting cities, or less dense road networks in the relatively sparsely populated countryside. In our AdaWave algorithm, we apply 2D DWT to the highly noisy Roadmap and further filter the transformed grids, so that dense groups are automatically detected (with the highest AMI value of 0.735). The clusters AdaWave detects are generally highly populated areas (cities like Aalborg, Hjørring and Frederikshavn with populations over 20,000), which also supports the correctness of our result.
The second example is the Glass identification dataset. There are nine attributes: the refractive index and 8 chemical elements (Na, Mg etc.). As the dimension is relatively high and the correlation of some attributes with the class is weak (shown in Table II), most techniques perform poorly on the Glass classification (the AMI value is less than 0.3). Instead of projecting in all directions independently, AdaWave detects the connected 3 grids in the 9-dimensional feature space after the discrete wavelet transform, where a one-dimensional wavelet transform is applied nine times. Though the clustering result of AdaWave, with an AMI value of 0.467, is not particularly good, it is considerably better than the baselines. The results on Glass also reveal the difficulty of clustering high-dimensional data that has weak correlation with the class in each separate dimension.
Attribute    RI      Na      Mg      Al      Si      K       Ca      Ba      Fe
Correlation  0.1642  0.5030  0.7447  0.5988  0.1515  0.0100  0.0007  0.5751  0.1879
V-E Runtime comparison
To further explore the efficiency of AdaWave, we carry out runtime experiments on synthetic data with scale (the number of objects) and compare with several typical clustering algorithms. AdaWave is implemented in Python, SkinnyDip is provided in R, and the remaining algorithms are provided in Java. Due to the difference between the languages (interpreted vs compiled languages), we focus only on the asymptotic trends.
We generate the experimental data by increasing the number of objects in each cluster by the same amount. In this way, we are able to scale the number of objects while the noise percentage stays fixed at 75%. The experiments are conducted on a computer with an Intel Core i5 4GHz CPU, 500GB of memory, and the Windows 10 Professional operating system.
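A scaling setup of this kind can be sketched as below. The generator actually used in the experiments is not specified here, so everything in this block is an assumption made for illustration: Gaussian blobs stand in for the clusters, noise is uniform on the unit square, and the function name `make_noisy_dataset` and its parameters are hypothetical. The only property taken from the text is that each cluster grows by the same amount while noise stays at 75% of all objects.

```python
import random


def make_noisy_dataset(points_per_cluster, centers, noise_fraction=0.75,
                       spread=0.02, seed=0):
    """Return (data, labels): equal-size clusters plus uniform noise (label 0)."""
    rng = random.Random(seed)
    data, labels = [], []
    for label, (cx, cy) in enumerate(centers, start=1):
        for _ in range(points_per_cluster):
            data.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
            labels.append(label)
    # Choose the noise count so that noise makes up `noise_fraction` of the total.
    n_clustered = len(data)
    n_noise = round(n_clustered * noise_fraction / (1.0 - noise_fraction))
    for _ in range(n_noise):
        data.append((rng.random(), rng.random()))
        labels.append(0)
    return data, labels
```

Calling the function with increasing `points_per_cluster` yields datasets of growing size with an unchanged noise percentage, which is the kind of series a runtime comparison needs.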
Fig. 10 compares the running times. AdaWave ranks second among the five algorithms with regard to cost. Compared with methods like k-means and DBSCAN that have a computational complexity of , AdaWave is much faster. Although the cost of AdaWave is slightly higher than that of SkinnyDip, which grows sublinearly, AdaWave achieves a much higher AMI in this setting. In other words, the high efficiency of SkinnyDip comes at the cost of its shape limitation.
In general, AdaWave shows good practical runtime growth. AdaWave is essentially a grid-based algorithm, and the number of grids is much smaller than the number of objects, especially when the dimension is high. Owing to the quantization step and the linear complexity of the wavelet transform, AdaWave can eventually outperform other algorithms in runtime experiments.
VI Discussion
In this section, we discuss AdaWave and the baselines in detail to gain a deeper understanding of these clustering algorithms. We first make observations on the performance of the baselines.

Model-based approaches like EM [26] also fail to perform well when the cluster shapes do not fit a simple model. On the Roadmap dataset, whose clusters are irregularly shaped, EM performs poorly and is the second worst of the 8 methods.

The popular DBSCAN [19] method is known to robustly discover clusters of varying shapes in noisy environments. As the noise increases, however, the limitation of DBSCAN (finding clusters of varying density) is magnified. As shown in Fig. 8, the AMI of DBSCAN declines sharply when the noise percentage exceeds 20%, and DBSCAN performs much worse than the other methods in extremely noisy environments.

RIC [8] is designed to be a wrapper for arbitrary clustering techniques. Given a preliminary clustering, RIC purifies these clusters from noise and adjusts itself based on information-theoretic concepts. Unfortunately, it runs into difficulties when the amount of noise in the dataset is nontrivial: in almost all of our experiments with noisy data, it detects a single cluster, with a corresponding AMI value of zero.

The dip-based technique SkinnyDip [14], which is specialized for highly noisy data, also struggles in the synthetic experiment because the projection of the clusters onto each dimension is usually not unimodal, as the algorithm requires. In general, SkinnyDip is better than all other baselines on both synthetic and real data. However, due to its strict precondition, SkinnyDip does not work as well as AdaWave on real-world data.
In general, mainly owing to the wavelet transform and threshold filtering, AdaWave outperforms the baselines on synthetic and real-world data by a large margin. The margin is more or less maintained as the number of clusters and the level of noise vary. The wavelet filters emphasize regions where points are densely located and simultaneously suppress weaker information on the boundary, thereby automatically removing outliers and making the clusters more salient. Threshold filtering further separates the true clusters from grids of dense noise, which ensures strong robustness to noise. The main reason for the inferiority of the baselines can be seen in the ring-shaped case: the clusters are two overlapping circular distributions surrounded by dense noise, which the comparison methods tend to either group together as one cluster or split into rectangle-style clusters (causing a zero AMI). Applying the DWT to the feature space, followed by coefficient reduction and threshold filtering, proves to be a feasible solution for such problems.
Like other grid-based techniques, AdaWave begins to show its limitations as the dimension increases. In high dimensions, the number of grid cells rises sharply (exponentially) when every dimension is rasterized, and grid-based clustering tends to become ineffective because of the high computational complexity. However, our new data structure provides a storage-friendly solution that can handle high-dimensional data, which is sparse in most real applications.
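The gap between the exponential size of a full grid and the storage that sparse data actually needs can be made concrete with a small sketch. It is not the paper's data structure: a plain dictionary keyed by cell-index tuples is used here as an assumed stand-in, and the names `sparse_quantize` and `dense_cell_count` are hypothetical.

```python
def sparse_quantize(points, bins, lo=0.0, hi=1.0):
    """Count points per occupied cell; empty cells are never stored."""
    cells = {}
    scale = bins / (hi - lo)
    for point in points:
        index = tuple(
            min(max(int((x - lo) * scale), 0), bins - 1) for x in point
        )
        cells[index] = cells.get(index, 0) + 1
    return cells


def dense_cell_count(bins, ndim):
    """Number of cells a fully materialized grid would need."""
    return bins ** ndim
```

With 16 bins per dimension in 9 dimensions, a full grid has 16^9 = 68,719,476,736 cells, whereas the dictionary never holds more entries than there are data points.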
VII Conclusion
In this paper, we propose AdaWave, a novel algorithm for clustering data in extremely noisy environments. AdaWave is a grid-based clustering algorithm that applies a wavelet transform to the quantized feature space. It exhibits strong robustness to noise, outperforming other clustering algorithms, including SkinnyDip, DBSCAN and k-means, on both synthetic and natural data. AdaWave also does not require the computation of pairwise distances, which results in a low complexity. Moreover, by deriving the “grid labeling” method, we drastically reduce the memory consumption of the wavelet transform, so AdaWave can be used to analyze datasets of relatively high dimension. Furthermore, AdaWave is able to detect arbitrarily shaped clusters, which the SkinnyDip algorithm cannot. Through the wavelet transform, AdaWave inherits the ability to analyze data at different resolutions. These properties fundamentally distinguish AdaWave from centroid-, distribution-, density- and connectivity-based approaches.
References
 [1] Stéphane Mallat, “A Theory for Multiresolution Signal Decomposition: The Wavelet Representation”, IEEE Trans. Pattern Anal. Mach. Intell., vol. 11, pp.674–693, 1989.
 [2] Alexander Hinneburg and Daniel A. Keim, “An Efficient Approach to Clustering in Large Multimedia Databases with Noise,” SIGKDD, pp. 58–65, 1998.
 [3] Michael L. Hilton, Bjorn D. Jawerth and Ayan Sengupta, “Compressing Still and Moving Images with Wavelets,” Multimedia Systems, pp. 218–227, 1994.
 [4] Wu, Guowei and Yao, Lin and Yao, Kai, “An Adaptive Clustering Algorithm for Intrusion Detection,” 2006 IEEE International Conference on, Information Acquisition, pp. 1443–1447, 2006.
 [5] Zhao, Mingwei and Liu, Yang and Jiang, Rongan, “Research of WaveCluster Algorithm in Intrusion Detection System,” Computational Intelligence and Security, 2008. International Conference on CIS., vol. 1, pp. 259–263, 2008.
 [6] Maurus, Samuel and Plant, Claudia, “Skinnydip: Clustering in a Sea of Noise,” SIGKDD, pp. 1055–1064, 2016.
 [7] Kalogeratos, Argyris and Likas, Aristidis, “Dip-means: an incremental clustering method for estimating the number of clusters,” NIPS, pp. 2393–2401, 2012.
 [8] Böhm, Christian and Faloutsos, Christos and Pan, Jia-Yu and Plant, Claudia, “Robust information-theoretic clustering,” SIGKDD, pp. 65–75, 2006.
 [9] Dasgupta, Abhijit and Raftery, Adrian E, “Detecting features in spatial point processes with clutter via model-based clustering,” Journal of the American Statistical Association, vol. 93, pp. 294–302, 1998.
 [10] Wang, Wei and Yang, Jiong and Muntz, Richard and others, “STING: A statistical information grid approach to spatial data mining,” VLDB, vol. 97, pp. 186–195, 1997.
 [11] Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos and Prabhakar Raghavan, “Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications,” SIGMOD, pp. 94–105, 1998.
 [12] Gholamhosein Sheikholeslami, Surojit Chatterjee and Aidong Zhang, “WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases,” VLDB, pp. 428–439, 1998.
 [13] Bungartz, Hans-Joachim and Griebel, Michael, “Sparse grids,” Acta Numerica, vol. 13, pp. 147–269, 2004.
 [14] Maurus, Samuel and Plant, Claudia, “Skinnydip: Clustering in a Sea of Noise,” SIGKDD, pp. 1055–1064, 2016.
 [15] Jain, Anil K and Murty, M Narasimha and Flynn, Patrick J, “Data clustering: a Review,” ACM computing surveys (CSUR), vol.31, pp. 264–323, 1999.
 [16] Von Luxburg, Ulrike, “A tutorial on spectral clustering,” Statistics and Computing, vol.17, pp. 395–416, 2007.
 [17] Hans-Peter Kriegel, Peer Kröger, Jörg Sander and Arthur Zimek, “Density-based Clustering,” Wiley Interdisc. Rev.: Data Mining and Knowledge Discovery, vol. 1, pp. 231–240, 2011.
 [18] Jain, Anil K, “Data clustering: 50 years beyond K-means,” Pattern recognition letters, vol.31, pp. 651–666, 2010.
 [19] Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu, “A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise,” SIGKDD, pp. 226–231, 1996.
 [20] Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jörg Sander, “OPTICS: Ordering Points To Identify the Clustering Structure,” SIGMOD, pp. 49–60, 1999.
 [21] Christian Böhm, Claudia Plant, Junming Shao and Qinli Yang, “Clustering by Synchronization,” SIGKDD, pp. 583–592, 2010.
 [22] Hartigan, John A and Hartigan, Pamela M, “The dip test of unimodality,” The Annals of Statistics, pp. 70–84, 1985.
 [23] Steinhaus, H., “Sur la division des corps matériels en parties,” Bull. Acad. Polon. Sci. (in French), vol.4, pp. 801–804, 1957.
 [24] E.W. Forgy, “Cluster Analysis of Multivariate Data: Efficiency Versus Interpretability of Classifications,” Biometrics, vol.21, pp. 768–769, 1965.
 [25] Aleksandar Bojchevski, Yves Matkovic and Stephan Günnemann, “Robust Spectral Clustering for Noisy Data: Modeling Sparse Corruptions Improves Latent Embeddings,” SIGKDD, pp. 737–746, 2017.
 [26] Celeux G and Govaert G, “A classification EM algorithm for clustering and two stochastic versions,” Elsevier Science Publishers B.V, pp. 315–332, 1992.