I Introduction
Clustering is a fundamental data mining task that finds groups of similar objects while keeping dissimilar objects separated in different groups or in a noise group (noisy points) [15, 17]. The objects can be spatial data points, feature vectors, or patterns. Typical clustering techniques include centroid-based clustering [18, 16], density-based clustering [17], etc. These techniques usually perform well on “clean” data. However, they face a big challenge in real-world applications, where patterns usually “swim” in a sea of clutter or noise. Furthermore, complex and irregularly shaped groups render these once-effective clustering techniques intractable, because typical clustering approaches either lack a clear notion of “noise” or are limited to specific situations due to their shape-sensitive properties. This work addresses the problem of effectively uncovering arbitrarily shaped clusters when the noise is extremely high. Based on the pioneering work of Sheikholeslami that applies the wavelet transform, originally used for signal processing, to spatial data clustering [12], we propose a new wavelet-based algorithm called AdaWave that can adaptively and effectively uncover clusters in highly noisy data. To tackle general applications, we assume that the clusters in a dataset do not follow any specific distribution and can be arbitrarily shaped.
To show the hardness of the clustering task, we first design a highly noisy running example with clusters of various types, as illustrated in Fig. 1(a). Fig. 1(b) shows that AdaWave can correctly locate these clusters. Without estimating any explicit model of the data, the proposed AdaWave algorithm exhibits properties that are favorable for large-scale real-world applications.
For comparison, we illustrate the results of several typical clustering methods on the running example: k-means [23, 24] as the representative of centroid-based clustering methods, DBSCAN [19] as the representative of density-based clustering methods [17, 20], and SkinnyDip [14] as a newly proposed method that handles extremely noisy environments. Besides illustration, we also use Adjusted Mutual Information (AMI) [21], a standard metric ranging from 0 at worst to 1 at best, to evaluate their performance.

Centroid-based clustering algorithms tend to lack a clear notion of “noise” and behave poorly on very noisy data. As shown in Fig. 2(b), the standard k-means yields poor results, with a very low AMI of 0.25.

Density-based clustering algorithms usually perform well when clusters have heterogeneous forms. As a representative, DBSCAN is known to robustly discover clusters of varying shapes in noisy environments. After fine-tuning the parameters, which requires graphical inspection, the result of DBSCAN is shown in Fig. 2(c). DBSCAN roughly finds the shapes of the clusters, but there is still much room for improvement. It detects 21 clusters with an AMI of 0.28, primarily because there are various areas in which, through the randomness of noise, the local density threshold is exceeded. We have also tested various parameter settings and did not find a combination for DBSCAN that solves the problem.

SkinnyDip is known to robustly cluster spatial data in extremely noisy environments. For the running example, we choose starting points following the mechanism of [13], which is regarded as the best way to initialize gradient ascent in SkinnyDip. However, the clustering result in Fig. 2(c) shows that SkinnyDip performs poorly, as the dataset does not satisfy its assumption that the projection of the clusters onto each dimension is unimodal.
The above methods, which rely heavily on the estimation of explicit models of the data, produce high-quality results when the data is organized according to the assumed models. However, when facing more complicated and unknown cluster shapes or highly noisy environments, these methods do not perform as well as expected. By comparison, as shown in Fig. 2(d), the proposed AdaWave, which applies wavelet decomposition for denoising and the “elbow theory” for adaptive threshold setting, correctly detects all five clusters and groups all the noisy points as a sixth cluster. AdaWave achieves an AMI as high as 0.76 and, furthermore, computes deterministically and runs in linear time.
We propose AdaWave (the data and code will be made public after the review) to cluster highly noisy data, which frequently appears in real-world applications where clusters sit in extreme-noise settings and have arbitrary shapes. We also demonstrate the efficiency and effectiveness of AdaWave on synthetic as well as real-world data. The main characteristics of AdaWave that distinguish the algorithm from existing clustering methods are as follows:

AdaWave is deterministic, fast, order-insensitive, shape-insensitive, robust in highly noisy environments, and requires no prior knowledge of the data models. To our knowledge, no other clustering method meets all these properties.

We design a new data structure for the wavelet transform such that, compared to the classic WaveCluster algorithm, AdaWave is able to handle high-dimensional data while remaining storage-friendly in that setting.

We propose a heuristic method based on the elbow theory to adaptively estimate the best threshold for noise filtering. By setting the threshold self-adaptively, AdaWave exhibits high robustness in extremely noisy environments.
The rest of this paper is organized as follows. Section II provides a quick review of existing clustering methods. Section III analyzes the properties of the wavelet transform and explains why it exhibits distinctive strength in clustering and automatic denoising. In Section IV we discuss the proposed clustering algorithm AdaWave in detail. Section V presents experiments and comparisons with related clustering methods. The results are discussed in Section VI, and Section VII concludes our work.
II Related Work
Various approaches have been proposed to improve the robustness of clustering algorithms on noisy data [19, 9, 14]. Here we highlight several algorithms most related to this specific problem and focus on the preconditions of these clustering methods.
DBSCAN [19] is a typical density-based clustering method designed to reveal clusters of arbitrary shapes in noisy data. When varying the noise level in the running example, we find that DBSCAN performs well only when the noise is kept below 15%. Its performance degrades drastically as we continue to increase the noise percentage. Meanwhile, the average time complexity of DBSCAN is O(n log n) for n data points and, in the worst case, O(n^2). Thus, its running time can also be a limiting factor when dealing with large-scale datasets. Another density-based clustering method, Sync [21], exhibits the same weakness in time complexity (it relies on pairwise distances, with a time complexity of O(n^2)).
Regarding data with high noise, as early as 1998, Dasgupta et al. [9] proposed an algorithm to detect minefields and seismic faults in “cluttered” data. Their method is limited to the two-dimensional case, and an extended version for slightly higher dimensions requires significant parameter tuning.
In 2016, Samuel et al. [14] proposed an intriguing method called SkinnyDip. SkinnyDip optimizes DipMeans [7] with an elegant dip test of unimodality [22]. It focuses on high-noise situations and yields good results when clusters take a unimodal form in each coordinate direction. However, this condition is very strict: the projections of the clusters have to be unimodal in every dimension. When this condition does not hold, SkinnyDip cannot uncover the clusters.
A work newly proposed in 2017 [25] applies a sparse and latent decomposition of the similarity graph used in spectral clustering to jointly learn the spectral embedding and the corrupted data. It proposes a robust spectral clustering (RSC) technique that is not sensitive to noisy data. Their experiments demonstrated the robustness of RSC against spectral clustering methods. However, it can only deal with low-noise data.
Our proposed AdaWave algorithm targets extreme-noise settings, as SkinnyDip does. Besides, as a grid-based algorithm, AdaWave shares a common characteristic with STING [10] and CLIQUE [11]: it is fast and independent of the number of data objects. Though the time complexity of AdaWave is slightly higher than that of SkinnyDip, it still runs in linear time and yields good results when the dataset consists of irregularly shaped clusters.
Experiments in Section V show that AdaWave outperforms the other algorithms, especially in the following scenarios that commonly arise in large-scale real applications: 1) the data contains clusters of irregular shapes such as rings; 2) the dataset is very large and of relatively high dimension; 3) the dataset contains a very high percentage (for example, 80%) of noise.
III Wavelet Transform
The wavelet transform is known as an efficient denoising technique. Surpassing its predecessor, the Fourier transform, the wavelet transform can analyze the frequency attributes of a signal while retaining its spatial information.
III-A Principles of Wavelet Transform
In this section, we focus on the discrete wavelet transform (DWT), which is applied in the AdaWave algorithm. The ‘transform’ in DWT separates the original signal into a scale space and a wavelet space. The scale space stores an approximation of the outline of the signal, while the wavelet space stores the detail of the signal. Thanks to the Mallat algorithm [1], we can simplify the complicated process of DWT into two filters.
III-A1 1D Discrete Wavelet Transform
As shown in Fig. 3, the signal passes through two filters and is downsampled by 2. According to the Mallat algorithm [12], the signal can be decomposed into the scale space and the wavelet space by passing through a low-pass filter and a high-pass filter correspondingly. Choosing different wavelet functions, we can obtain the related filters by looking up a precalculated table.
Just by passing the signal through the two filters and downsampling by 2, we are able to decompose it into a space that contains only the outline of the signal and another space that contains only the detail. The signal discussed here includes high-dimensional signals: filtering a d-dimensional signal simply repeats the 1D process d times.
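As a concrete illustration, one level of the two-filter decomposition can be sketched with the Haar wavelet, whose precalculated filter pair is the simplest. The filter values and the toy signal below are illustrative and not taken from the paper's implementation:

```python
# Minimal sketch of one level of the 1D discrete wavelet transform in
# Mallat's filter-bank form, using the Haar filter pair.
import math

LOW = [1 / math.sqrt(2), 1 / math.sqrt(2)]    # low-pass -> scale space (outline)
HIGH = [1 / math.sqrt(2), -1 / math.sqrt(2)]  # high-pass -> wavelet space (detail)

def dwt1d(signal):
    """One decomposition level: convolve with each filter, downsample by 2."""
    approx, detail = [], []
    for i in range(0, len(signal) - 1, 2):    # stride 2 implements the downsampling
        pair = signal[i:i + 2]
        approx.append(sum(c * x for c, x in zip(LOW, pair)))
        detail.append(sum(c * x for c, x in zip(HIGH, pair)))
    return approx, detail

# The approximation keeps the outline; the detail keeps fast oscillations.
a, d = dwt1d([2, 4, 6, 8, 9, 7, 5, 3])
```

Each output space has half the length of the input, so repeating the decomposition on the approximation yields the layered multi-resolution structure described below.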
In both data science and information theory, noise is defined as the collection of unstable, meaningless points in a signal (dataset). As shown in Fig. 4, in a denoising task we try to maintain the outline of the signal and amplify the contrast between high and low values, which is a perfect stage for DWT.
III-A2 2D Discrete Wavelet Transform
To further show DWT’s denoising capability on spatial data, we apply two separate 1D wavelet transforms to a 2D dataset and filter the coefficients in the transformed feature space. Referring to [3], the 2D feature space is first convolved along the horizontal dimension, resulting in a low-pass feature space and a high-pass feature space, each of which is then downsampled by 2 along that dimension. Both spaces are then convolved along the vertical dimension, resulting in four subspaces: an average signal, horizontal features, vertical features, and diagonal features. Next, by extracting the average-signal part and discarding low-value coefficients, we obtain the transformed feature space shown in Fig. 5.
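The separable row-then-column procedure above can be sketched as follows, again with the Haar filters; the grid values are illustrative densities, not the paper's data:

```python
# Sketch of one level of the 2D DWT as two separable 1D Haar passes
# (horizontal first, then vertical), producing the four subspaces.
import math

S = 1 / math.sqrt(2)

def haar_pairs(row):
    """Convolve one sequence with the Haar low/high filters, downsample by 2."""
    lo = [(row[i] + row[i + 1]) * S for i in range(0, len(row) - 1, 2)]
    hi = [(row[i] - row[i + 1]) * S for i in range(0, len(row) - 1, 2)]
    return lo, hi

def dwt2d(grid):
    # Horizontal pass: low-pass and high-pass spaces, downsampled in x.
    L = [haar_pairs(r)[0] for r in grid]
    H = [haar_pairs(r)[1] for r in grid]
    # Vertical pass on each space -> four subspaces.
    def vertical(space):
        cols = list(zip(*space))
        lo_hi = [haar_pairs(list(c)) for c in cols]
        lo = list(zip(*[p[0] for p in lo_hi]))
        hi = list(zip(*[p[1] for p in lo_hi]))
        return lo, hi
    average, horizontal = vertical(L)   # outline / horizontal features
    vert, diagonal = vertical(H)        # vertical / diagonal features
    return average, horizontal, vert, diagonal
```

A uniform 2x2 block concentrates entirely in the average subspace, while its three detail subspaces are zero, which is exactly the low-entropy behavior exploited for denoising.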
Intuitively, according to Fig. 5, dense regions in the original feature space act as attractors to nearby points and, at the same time, as inhibitors to points that are not close enough. This means that clusters in the data and the clear regions around them automatically stand out and become more distinct. Also, the number of sparsely scattered points (outliers) in the transformed feature space is evidently lower than in the original space. This decrease in outliers reveals the robustness of DWT against extreme noise.
As mentioned earlier, the above wavelet model can similarly be generalized to a d-dimensional feature space, where the one-dimensional wavelet transform is applied d times.
III-B Properties of Wavelet Transform
Besides its ability to analyze frequency and spatial information at the same time, other properties of DWT also distinguish it among various denoising methods.
Low entropy. The sparse distribution of wavelet coefficients reduces the entropy after DWT. After the signal decomposition, many wavelet coefficients are close to zero; these generally correspond to noise. The main component of the signal is concentrated in a few wavelet bases, so removing the low-value coefficients is an effective denoising method that better retains the original signal.
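A minimal sketch of this coefficient-denoising idea, assuming the Haar wavelet and an illustrative threshold (the paper's own wavelets and thresholds may differ): decompose, zero out near-zero detail coefficients, then reconstruct.

```python
# Coefficient denoising with one Haar level: small detail coefficients
# are treated as noise and dropped before inverse-transforming.
import math

S = 1 / math.sqrt(2)

def decompose(signal):
    a = [(signal[i] + signal[i + 1]) * S for i in range(0, len(signal) - 1, 2)]
    d = [(signal[i] - signal[i + 1]) * S for i in range(0, len(signal) - 1, 2)]
    return a, d

def reconstruct(a, d):
    out = []
    for ai, di in zip(a, d):
        out += [(ai + di) * S, (ai - di) * S]   # exact Haar inverse
    return out

def denoise(signal, threshold=0.5):
    a, d = decompose(signal)
    d = [x if abs(x) > threshold else 0.0 for x in d]  # drop low-value coefficients
    return reconstruct(a, d)

# Small jitter is removed; the outline of the signal is kept.
clean = denoise([5.0, 5.2, 5.1, 4.9, 0.1, -0.1, 0.0, 0.2])
```

With nothing thresholded, `reconstruct(*decompose(x))` returns the original signal, so the only information lost is in the discarded near-zero coefficients.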
Multi-resolution. As shown in Fig. 3, DWT is implemented in a layered structure. In each layer of decomposition, we only further decompose the scale space, which represents the outline of the signal, and preserve the wavelet space, i.e., the detailed features. This layered structure makes it possible to observe the signal at different resolutions of the scale space in a single application of DWT. Even though we did not implement this idea in AdaWave, the property may inspire new algorithm designs.
Decorrelation. As illustrated above, DWT separates the signal into a scale space and a wavelet space. By such separation, DWT decorrelates the ‘detail’ (the part that oscillates very fast in Fig. 4). With this decorrelation property, DWT works especially well at separating noise from highly noisy data.
Flexibility in choosing a basis. The wavelet transform can be applied with a flexible choice of wavelet function. As described above, for each kind of wavelet function there exists a precalculated filter. For different requirements, various wavelet families have been designed, including Daubechies, Biorthogonal, and so on. Users of AdaWave have the flexibility to choose any wavelet basis, which makes the algorithm universal under various requirements.
IV Wavelet-Based Adaptive Clustering
AdaWave is an efficient algorithm based on wavelet decomposition. It is divided into four steps, as shown in Algorithm 1. In the first step, we propose a new data structure for clustering high-dimensional data: with space quantization, we divide the data into grids and project them onto a high-scale space. Step 2 is wavelet decomposition, which preliminarily denoises by removing wavelet coefficients close to zero. Step 3 is threshold filtering, a further step to eliminate noise, in which we apply the “elbow theory” to set the threshold adaptively. In the last step, we label the grids and build the lookup table, thereby mapping the grids back to the original data.
IV-A Quantization
The first step of AdaWave is to quantize the feature space. Each dimension of the feature space is divided into a number of intervals. By making such a division in every dimension, we separate the feature space into multiple grids. Objects are allocated to these grids according to their coordinates in each dimension.
In the original wavelet clustering algorithm [12], a unit (grid) contains an object if, in every dimension, the object’s coordinate falls within the right-open interval that defines the unit in the partitioning of that dimension. For each unit (or grid), we use the number of objects contained in the unit as the grid density, representing the statistical feature of the grid. The choice of the number of grids to generate and of the statistical feature to use can significantly affect the performance of AdaWave.
It is easy to fall into the trap of storing all grids in the feature space to keep the quantized data. Even though quantizing the data can be finished in linear time, storing all grids leads to memory consumption that is exponential in the dimension d. In AdaWave, inspired by the DENCLUE [2] algorithm, we achieve the goal of “only storing grids with nonzero density” by “labeling” the grids in the data space. For low-dimensional data, storing only nonzero-density grids shows little advantage because of the high data density over the entire space. In high dimensions, however, the number of grids far exceeds the number of data points. When many grids have zero density, the above strategy saves a considerable amount of memory, making it possible to apply AdaWave to high-dimensional clustering problems.
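A minimal sketch of this sparse quantization, assuming a dictionary keyed by integer grid coordinates (the function name, bounds, and sample points are illustrative):

```python
# Quantization with "only store nonzero grids": a dict maps each
# occupied cell's integer coordinates to its density (object count).
from collections import defaultdict

def quantize(points, intervals, bounds):
    """Map d-dimensional points to grid cells; store only nonzero cells.

    bounds: per-dimension (low, high); each dimension is split into
    `intervals` right-open intervals, mirroring the paper's partitioning.
    """
    grid = defaultdict(int)
    for p in points:
        cell = tuple(
            min(int((x - lo) / (hi - lo) * intervals), intervals - 1)
            for x, (lo, hi) in zip(p, bounds)
        )
        grid[cell] += 1           # grid density = number of contained objects
    return grid

g = quantize([(0.1, 0.9), (0.15, 0.95), (0.8, 0.2)],
             intervals=4, bounds=[(0.0, 1.0), (0.0, 1.0)])
# Only 2 of the 16 possible cells are stored.
```

The memory cost is proportional to the number of occupied cells rather than to the full grid count, which is what enables the high-dimensional setting.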
IV-B Transformation and Coefficient Denoising
In the second step, the discrete wavelet transform is applied to the quantized feature space, resulting in a new feature space and hence a new set of grids. Given these grids, AdaWave detects the connected components in the transformed feature space. Each connected component is a set of connected grids and is considered a cluster. Corresponding to each resolution of the wavelet transform there is a set of clusters, where, usually at the coarser resolutions, the number of clusters is lower. In our experiments, we applied three-level wavelet transforms with the Daubechies and Cohen-Daubechies-Feauveau ((4,2) and (2,2)) wavelets. The transformed feature spaces give approximations of the original feature space at different scales, allowing clusters to be found at different levels.
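The connected-component detection over surviving grids can be sketched as a breadth-first search over axis-aligned neighbors; the neighbor definition (2d neighbors per cell) is one common choice and an assumption here, not necessarily the paper's exact adjacency:

```python
# Detect clusters as connected components of nonzero grid cells,
# via BFS over axis-aligned neighbors in d dimensions.
from collections import deque

def connected_components(cells):
    """cells: set of integer coordinate tuples of the surviving grids."""
    remaining, clusters = set(cells), []
    while remaining:
        seed = remaining.pop()
        comp, queue = {seed}, deque([seed])
        while queue:
            cell = queue.popleft()
            for dim in range(len(cell)):
                for step in (-1, 1):       # the 2d axis-aligned neighbors
                    nb = cell[:dim] + (cell[dim] + step,) + cell[dim + 1:]
                    if nb in remaining:
                        remaining.remove(nb)
                        comp.add(nb)
                        queue.append(nb)
        clusters.append(comp)
    return clusters

# Two separated groups of grids -> two clusters.
cl = connected_components({(0, 0), (0, 1), (1, 1), (5, 5)})
```

Because each cell is visited once and has 2d neighbors, the scan is linear in the number of stored grids, consistent with the complexity analysis in Section IV-E.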
IV-C Threshold Filtering
The third step, the key step of AdaWave, identifies noise and clusters by removing the noise grids from the grid set. When the noise percentage is high, it is hard for the original WaveCluster to eliminate the noise simply by applying the wavelet transform to the original feature space and relying on its low-pass filters to remove the noise automatically.
After performing the wavelet transform and eliminating the low-value coefficients, we can preliminarily remove the outliers of sparse density. In other words, when the noise percentage is below 20%, the wavelet transform alone attains outstanding performance, and the computation remains linear in the number of grids. This automatic and efficient removal of noise advantageously enables AdaWave to outperform many other clustering algorithms.
However, if 50 percent of the dataset is noise, many noise grids also have high density, and WaveCluster cannot distinguish them from clusters. Therefore, an additional technique is applied to further eliminate the noise grids.
The density chart after sorting is shown in Fig. 6(a); the grid densities take this form because the entire data space is “averaged” during the wavelet transform. The essence of the wavelet transform is a process of “filtering”. Since the filter corresponding to the scale space is a low-pass filter, the high-frequency part of the grid data, which changes rapidly, is filtered out, leaving the low-frequency data to represent the signal profile.
After low-pass filtering, the grids in the signal space fall roughly into three categories: signal data, middle data, and noise data. The sorted density chart of these three kinds of grids can be statistically fitted with three line segments. The grid density of the signal data is the largest, represented by the leftmost vertical line in Fig. 6(b). The middle part consists of the grids between clusters and noise. Due to the low-pass filtering in the wavelet decomposition, these grids are lower in density than the grids inside a cluster but higher than the noise-only grids, and their density decreases with their distance from the cluster. The noise part also appears relatively smooth due to the low-pass filtering. Since the density of the noise data is much lower than that of the data inside clusters, the noise data is represented by the third, almost horizontal line.
According to extensive experiments on various datasets, the position where the “middle line” and the “noise line” intersect is generally the best threshold. The algorithm below is the adaptive technique for finding this statistically best threshold.
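One common way to realize such an elbow criterion on the sorted density curve is to pick the point farthest from the chord joining the curve's endpoints; this is a sketch of that heuristic under our own assumptions, and the paper's exact line-fitting procedure may differ:

```python
# Elbow-based threshold on the sorted grid densities: the elbow is the
# point with maximum perpendicular distance to the endpoint chord.
import math

def elbow_threshold(densities):
    ds = sorted(densities, reverse=True)
    n = len(ds)
    x0, y0, x1, y1 = 0.0, float(ds[0]), float(n - 1), float(ds[-1])
    norm = math.hypot(x1 - x0, y1 - y0)
    best_i, best_dist = 0, -1.0
    for i, y in enumerate(ds):
        # distance from point (i, y) to the line through the endpoints
        dist = abs((y1 - y0) * i - (x1 - x0) * y + x1 * y0 - y1 * x0) / norm
        if dist > best_dist:
            best_i, best_dist = i, dist
    return ds[best_i]   # grids with density below this value are treated as noise

t = elbow_threshold([90, 85, 80, 12, 10, 9, 2, 2, 1, 1])
```

On a curve with a sharp drop between the cluster grids and the middle/noise grids, the maximum-distance point lands at that drop, which approximates the intersection of the fitted line segments described above.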
IV-D Label and Make Lookup Table
Each cluster has a cluster number. The fourth step of the algorithm labels each grid in the feature space that belongs to a cluster with that cluster’s number. The clusters found this way live in the transformed feature space and are based on wavelet coefficients; thus, they cannot be used directly to define the clusters in the original feature space. AdaWave therefore builds a lookup table LT to map the grids in the transformed feature space to the grids in the original feature space. Each entry in the table specifies the relationship between one grid in the transformed feature space and the corresponding grid(s) of the original feature space, so the label of each grid in the original feature space can easily be determined. Finally, AdaWave assigns the label of each grid in the feature space to all objects whose feature vector lies inside that grid, and thus determines the clusters.
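A sketch of this labeling step follows. It assumes a single-level transform where, after downsampling by 2, transformed cell coordinate c corresponds to original coordinates 2c and 2c+1 in each dimension; this mapping, the label -1 for noise, and all names are illustrative assumptions, not the paper's exact construction:

```python
# Map clusters of transformed grids back through a lookup table to
# original grids, then label each object by its original grid.
from itertools import product

def lookup_table(transformed_cells):
    """Map each transformed grid cell to its original grid cells."""
    lt = {}
    for cell in transformed_cells:
        lt[cell] = [tuple(o) for o in
                    product(*[(2 * c, 2 * c + 1) for c in cell])]
    return lt

def label_objects(clusters, object_cells):
    """clusters: list of sets of transformed cells; object_cells: obj -> original cell."""
    cell_label = {}
    for k, comp in enumerate(clusters):
        for orig_cells in lookup_table(comp).values():
            for oc in orig_cells:
                cell_label[oc] = k
    return {obj: cell_label.get(cell, -1)      # -1 marks unlabeled (noise) objects
            for obj, cell in object_cells.items()}

labels = label_objects([{(0, 0)}, {(3, 3)}],
                       {"a": (0, 1), "b": (7, 6), "c": (4, 4)})
```

Each transformed cell expands to 2^d original cells, so the table has size linear in the number of stored grids.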
IV-E Time Complexity
Let n be the number of objects in the database and m be the number of stored grids. Assuming that the feature vectors of the objects are d-dimensional, we have a d-dimensional feature space. Here we suppose n is large and m is comparatively small.
For the first step, the time complexity is O(n), because it scans all database objects and assigns them to the corresponding grids, where each dimension of the d-dimensional feature space is divided into intervals. With g intervals per dimension of the feature space, there would be g^d grids in total [4].
For the second step, the complexity of applying the wavelet transform to the feature space is O(lm), where l is a small constant representing the length of the filter used in the wavelet transform. As l is very small, we regard it as a constant, so the cost is O(m). If we apply the wavelet transform over several decomposition levels, downsampling at each level, the required time remains O(m) [5], since the number of grids shrinks at each level. That is, the total cost of the wavelet transform is bounded by a constant multiple of m, which indicates that using the multi-resolution property for clustering can be very efficient.
The third step aims at finding a suitable threshold for further denoising. The sorting algorithm has a time complexity of O(m log m), and filtering takes O(m). Therefore, the total complexity of this step is O(m log m).
To find the connected components in the feature space, the required time is O(cm), where c is a small constant. Making the lookup table requires O(m) time. After reading the data objects, the processing in steps 2 to 5 operates only on the grids; thus, the time complexity of processing the data (without considering I/O) is in fact independent of the number of data objects n. The time complexity of the last step, which labels the original objects, is O(n).
Since we assume that AdaWave is applied to very large datasets, n is much larger than m. Thus, the overall time complexity of the AdaWave algorithm is O(n).
IV-F Properties of AdaWave
AdaWave is able to detect clusters of irregular shapes. In the AdaWave algorithm, the spatial relationships in the data are preserved during quantization and the wavelet transform. By grouping connected grids into the same cluster, AdaWave makes no assumption on the shape of the clusters: it can find convex, concave, or nested clusters.
AdaWave is noise robust. Wavelet transform is broadly known for its great performance in denoising. AdaWave takes advantage of this property and can automatically remove the noise and outliers from the feature space.
AdaWave is memory-efficient. We overcome the problem of exponential memory growth in the wavelet transform for high-dimensional data. By using ‘grid labeling’ and the strategy of ‘only storing nonzero grids’, AdaWave is able to process data in comparatively high-dimensional spaces, which drastically extends the limits of the WaveCluster algorithm.
AdaWave is computationally efficient. As discussed theoretically above and shown experimentally later, the time complexity of AdaWave is O(n + m log m), where n denotes the number of data points and m the number of stored grids. Because in most cases m is much smaller than n, AdaWave is very efficient for large datasets, and especially so when the number of grids and the dimension of the feature space are low.
AdaWave is input-order insensitive. When the objects are assigned to the grids in the quantization step, the final content of the grids is independent of the order in which the objects are presented. The subsequent steps of the algorithm operate only on these grids. Hence, the algorithm yields the same results for any order of the input data.
AdaWave can cluster at multiple resolutions simultaneously by taking advantage of the multi-resolution attribute of the wavelet transform. By tweaking the quantization parameters or the wavelet basis, users can choose different resolutions for clustering.
V Experimental Evaluation
In this section, we turn to a practical evaluation of AdaWave. First, following the general classification of acknowledged clustering methods, we choose state-of-the-art representatives from different families for comparison. Then, we generate a synthetic dataset that exhibits the challenging properties on which we focus and compare AdaWave with the selected algorithms. We also apply our method to real-world datasets and conduct runtime experiments to evaluate the efficiency of AdaWave.
V-A Competitors
For comparison, we evaluate AdaWave against a set of representative competitors from different clustering families.
We begin with k-means [23, 24], a widely known centroid-based clustering method. To obtain the best performance of k-means, we provide it with the correct parameter settings. With DBSCAN [19], we have the popular member of the density-based family, which is famous for clustering groups of arbitrary shape. EM [26] focuses on probability instead of distance computation; a multivariate Gaussian probability distribution model is used to estimate the probability that a data point belongs to a cluster, and each cluster is regarded as a Gaussian model.
Next, we compare AdaWave to state-of-the-art clustering methods. RIC [8] fine-tunes an initial coarse clustering based on the minimum description length principle (a model-selection strategy that balances accuracy with complexity through data compression). DipMeans [7] is a wrapper for automating k-means. Self-tuning spectral clustering (STSC) [21] is a popular automated approach to spectral clustering. Moreover, we include SkinnyDip [14], a newly proposed method that performs well in handling high noise; the continuity properties of the dip enable SkinnyDip to exploit multimodal projection pursuit and find an appropriate basis for clustering.
V-B Synthetic Datasets
With the synthetic data, we try to mimic the general situation of clustering under very high noise. In the dataset, we emulate different cluster shapes and various spatial relations between pairs of clusters; thus, clusters may not form a uniform shape when projected onto a dimension.
By default, we choose five clusters of 5600 objects each in two dimensions. One typical cluster is an approximately rectangular distribution that does not overlap the others in any dimension and is Gaussian with a standard deviation of 0.005. To increase the difficulty of clustering and fully show AdaWave’s strength in finding arbitrarily shaped clusters, the next two clusters are overlapping circular distributions, and the remaining two are pairs of closely related classes: concentric circular distributions and approximate rectangles that partially overlap. To evaluate AdaWave against the competitors at different levels of noise, we systematically vary the noise percentage over 20, 25, …, 90 by sampling from the uniform distribution over the unit square. One of the datasets we obtain (50% noise) is shown in Fig.
7. As a parameter-free algorithm, AdaWave uses its default settings in all cases. We likewise use the default parameter values for the provided implementations of the baseline algorithms, which require neither obscure parameters nor additional processing. To automate DBSCAN, we fix minPts = 8 and run DBSCAN over a range of radii, reporting the best AMI result from these parameter combinations in each case. For k-means, we similarly set the correct k to achieve automatic clustering and ensure the best AMI result.
Fig. 8 presents the results of AdaWave and the baselines on these data. Each point corresponds to the mean AMI value over all datasets for that parameter combination. To be fair to the techniques that have no concept of noise (e.g., centroid-based techniques), the AMI only considers the objects that truly belong to a cluster (non-noise points). The results are discussed in Section VI.
V-C Real-World Datasets
For real-world data, we analyzed nine datasets of varying size from the UCI repository (http://archive.ics.uci.edu/ml/). Such classification-style data are often used to quantitatively evaluate clustering techniques in real-world settings; the nine datasets we use include 2D map data as well as higher-dimensional data. It should be noted, however, that every point is assigned a semantic class label (none of the data include a noise label). For this reason, we run the k-means iteration (based on Euclidean distance) on the final AdaWave result to assign any detected noise objects to a “true” cluster. The class information is used as ground truth for validation.
Table I summarizes the results on the different datasets (n denotes the total number of objects and d the dimension of each dataset). In a quantitative sense, AdaWave’s results are promising compared to the competitors. It achieves the strongest AMI value on six of the nine datasets and ranks third on two of the remaining ones. On the Seeds dataset it ranks fourth.
As a qualitative supplement, we investigate two of the results in detail. The Roadmap dataset was constructed by adding elevation information to a 2D road network in North Jutland, Denmark (covering a region of 185 x 135 km). In this experiment, we choose the original 2D road network as the dataset for clustering. The horizontal and vertical axes in Fig. 9 represent latitude and longitude respectively, and every data point represents a road segment. Roadmap is clearly a typical highly noisy dataset, because the majority of road segments can be termed “noise”: long arterials connecting cities, or less-dense road networks in the relatively sparsely populated countryside. In AdaWave, we apply the 2D wavelet transform to the highly noisy Roadmap data and further filter the transformed grids, so that dense groups are automatically detected (with an optimal AMI value of 0.735). The clusters AdaWave detects are generally highly populated areas (cities like Aalborg, Hjørring and Frederikshavn, with populations over 20,000), which also verifies the correctness of our results.
For the second example, Glass Identification, there are 9 attributes: the refractive index and 8 chemical elements (Na, Mg, etc.). Due to the high dimensionality and the weak correlation between each single attribute and the class (as shown in Table II), most techniques, according to our results, perform badly on glass classification (with an average AMI value of less than 0.3). Instead of projecting in all directions independently, AdaWave detects the connected grids in the 9D feature space after the discrete wavelet transform, where the one-dimensional wavelet transform is applied nine times. Though the clustering result of AdaWave, with an AMI value of 0.467, is not particularly good, it is nonetheless better than its competitors. The results on Glass also reveal the difficulty of clustering high-dimensional data that is only weakly correlated with the class in any single dimension.
Attribute   | RI     | Na     | Mg     | Al     | Si
Correlation | 0.1642 | 0.5030 | 0.7447 | 0.5988 | 0.1515
Attribute   | K      | Ca     | Ba     | Fe
Correlation | 0.0100 | 0.0007 | 0.5751 | 0.1879
V-D Runtime Comparison
To further explore the efficiency of AdaWave and compare it with commonly used clustering algorithms, we carry out a runtime experiment on synthetic data of varying scale (the number of objects). AdaWave is implemented in Python, SkinnyDip is provided in R, and the remaining algorithms are provided in Java. Due to the difference between interpreted and compiled languages, we focus only on the asymptotic trends.
We generate the experimental data by increasing the number of objects in each cluster in equal steps. In this way, we are able to scale the total number of objects while the noise percentage is fixed at 75%. The experiments are conducted on a computer with an Intel Core i5 4GHz CPU, 500GB of memory, and the Windows 10 Professional operating system.
Fig. 10 illustrates the comparison of running times; AdaWave ranks second among the five algorithms with regard to cost. Compared to methods like k-means and DBSCAN, whose computational complexity grows superlinearly with the number of objects, AdaWave is much faster. Although the cost of AdaWave is slightly higher than that of SkinnyDip, which grows sublinearly, AdaWave attains a much higher AMI in this setting. In other words, the high efficiency of SkinnyDip clustering comes at the cost of shape limitations.
In general, it is not surprising that AdaWave performs well in terms of practical runtime growth. AdaWave is essentially a grid-based algorithm, and the number of grids is much smaller than the number of objects, especially when the dimension is high. Due to the quantization process and the linear complexity of the wavelet transform, AdaWave can eventually outperform the other algorithms in runtime experiments.
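A back-of-the-envelope sketch of why this helps in practice: quantization is a single linear pass over the points, after which all further work operates on the much smaller set of occupied grid cells. The bin counts and synthetic data below are illustrative assumptions, not the paper's benchmark.

```python
# Hedged sketch: quantizing n points onto a sparse grid in one O(n) pass.
import random

random.seed(0)
n, cells_per_dim = 100_000, 32
points = [(random.random(), random.random()) for _ in range(n)]

# Single pass: map each point to its cell and count occupancy sparsely.
grid = {}
for x, y in points:
    key = (int(x * cells_per_dim), int(y * cells_per_dim))
    grid[key] = grid.get(key, 0) + 1

# All subsequent transform/filter work touches at most 32 * 32 = 1024
# cells, however large n grows.
occupied = len(grid)
```

The design choice is the usual grid-based trade-off: per-point work is constant, and the transform cost is bounded by the cell count rather than by pairwise distances over n points.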
VI Discussion
In this section, we turn to a discussion of AdaWave and its results compared with related clustering methods. As in the previous section, the competitors include representatives of the main clustering families and some state-of-the-art automatic clustering methods, which enables us to discuss AdaWave in depth.
We first make observations on the performance of the competitors. The standard k-means [23, 24], which lacks a clear notion of "noise", achieves low AMI on the synthetic data even when the correct k is given. Model-based approaches like EM [26] also fail to perform well when the cluster shapes do not fit a simple model. The popular DBSCAN [19] method is known to robustly discover clusters of varying shapes in noisy environments. As the noise increases, however, the limitation of DBSCAN (its difficulty with clusters of varying density) is magnified when the dataset also contains a large amount of noise. As shown in Fig. 7, the AMI of DBSCAN suffers a sharp decline when the noise percentage reaches 20%, and DBSCAN performs much worse than the others in extremely noisy environments.
STSC [21] and DipMeans [7] are two well-known approaches to automatic clustering, but they are unable to differentiate themselves from the other techniques in the experiments (particularly on the challenging, high-noise synthetic cases). RIC [8] is designed as a wrapper for arbitrary clustering techniques: given a preliminary clustering, RIC purifies the clusters from noise and adjusts itself based on information-theoretic concepts. Unfortunately, it struggles when the amount of noise in the dataset is non-trivial: in almost all our experiments the number of clusters it detects is one (giving an AMI value of zero). The dip-based technique SkinnyDip [14], which is specialized for high-noise data, also struggles in the synthetic experiments because the projection of the clusters onto each dimension is usually not unimodal. Due to this strict precondition, SkinnyDip does not work well on real-world data either.
In general, our results on synthetic and real-world data show that AdaWave outperforms the baselines by a wide margin. The margin by which AdaWave exceeds the baselines is more or less maintained as the number of clusters and the level of noise are varied. This is mainly due to the wavelet transform and the threshold filtering. The hat-shaped part of the wavelet filter emphasizes regions where points cluster and simultaneously suppresses weaker information at the boundaries, thus automatically removing outliers and making clusters more salient. Threshold filtering further uncovers the true clusters among dense-noise grids, ensuring strong robustness to noise. The main reason for the inferiority of the baselines can be seen by investigating the ring-shaped case: the clusters follow two overlapping circular distributions with dense noise around them, which the comparison methods tend to either group together as one cluster or separate into rectangle-style clusters (causing a zero AMI). Applying the DWT to the feature space and then performing coefficient reduction and threshold filtering is a feasible solution for the kinds of problems AdaWave focuses on.
Like those of other grid-based techniques, the limitations of AdaWave begin to emerge as the dimension increases. In high dimensions, the number of grid cells rises sharply (exponentially) when the space is rasterized in every dimension, and grid-based clustering tends to become ineffective due to the high computational complexity. Our new data structure provides a storage-friendly solution; however, it cannot avoid the high-dimensionality problem in a fundamental sense. On the other hand, according to our experimental results, AdaWave performs poorly when the noise percentage is low (less than 10%), even worse than with nearly 30% noise. The main reason is that the "elbow theory" may not work well in low-noise situations: threshold filtering aims at further eliminating dense-noise grids, but most outliers are already removed during coefficient reduction, which influences the determination of the threshold. In practice this rarely matters, because AdaWave focuses on clustering datasets with extremely high noise and clusters of arbitrary shape.
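The exponential blow-up can be checked with simple arithmetic: with m cells per axis, a dense d-dimensional grid requires m ** d cells. The bin count below is an illustrative assumption.

```python
# Dense grid size m**d explodes with dimension d, while a sparse
# "grid labeling" style structure stores at most one entry per data
# point. m = 16 is an illustrative bin count per axis.
m = 16
dense_cells = {d: m ** d for d in (2, 5, 9)}
# In 2D this is a modest 256 cells; in 9D it is 16**9 = 2**36 cells,
# far beyond what can be allocated densely. A sparse dict keyed by
# cell coordinates caps memory at O(n) occupied cells instead.
```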
VII Conclusion
In this paper, we propose AdaWave, a novel algorithm for clustering data in extremely noisy environments. AdaWave is a grid-based clustering algorithm implemented by applying a wavelet transform to the quantized feature space.
AdaWave exhibits strong robustness to noise, outperforming clustering algorithms such as SkinnyDip, DBSCAN and k-means on both synthetic and natural data. Moreover, AdaWave does not require the computation of pairwise distances, resulting in low complexity. By deriving the "grid labeling" method, we drastically reduce the memory consumption of the wavelet transform, so AdaWave can be used to analyze datasets of relatively high dimension. Furthermore, AdaWave is able to detect arbitrarily shaped clusters, which the SkinnyDip algorithm cannot. Through the wavelet transform, AdaWave also inherits the ability to analyze data at different resolutions.
Such properties enable AdaWave to fundamentally distinguish itself from centroid-, distribution-, density- and connectivity-based approaches. AdaWave outperforms competing techniques especially when clusters have irregular shapes and the environment is extremely noisy.
References
 [1] Stéphane Mallat, “A Theory for Multiresolution Signal Decomposition: The Wavelet Representation”, IEEE Trans. Pattern Anal. Mach. Intell., vol. 11, pp.674–693, 1989.
 [2] Alexander Hinneburg and Daniel A. Keim, “An Efficient Approach to Clustering in Large Multimedia Databases with Noise,” SIGKDD, pp. 58–65, 1998.
 [3] Michael L. Hilton, Bjorn D. Jawerth and Ayan Sengupta, "Compressing Still and Moving Images with Wavelets," Multimedia Systems, pp. 218–227, 1994.
 [4] Wu, Guowei and Yao, Lin and Yao, Kai, “An Adaptive Clustering Algorithm for Intrusion Detection,” Information Acquisition, 2006 IEEE International Conference on, pp. 1443–1447, 2006.
 [5] Zhao, Mingwei and Liu, Yang and Jiang, Rongan, "Research of Wave-Cluster Algorithm in Intrusion Detection System," Computational Intelligence and Security, 2008 (CIS '08), International Conference on, vol. 1, pp. 259–263, 2008.
 [6] Maurus, Samuel and Plant, Claudia, “Skinnydip: Clustering in a Sea of Noise,” SIGKDD, pp. 1055–1064, 2016.
 [7] Kalogeratos, Argyris and Likas, Aristidis, "Dip-means: an incremental clustering method for estimating the number of clusters," NIPS, pp. 2393–2401, 2012.
 [8] Böhm, Christian and Faloutsos, Christos and Pan, Jia-Yu and Plant, Claudia, "Robust information-theoretic clustering," SIGKDD, pp. 65–75, 2006.
 [9] Dasgupta, Abhijit and Raftery, Adrian E, "Detecting features in spatial point processes with clutter via model-based clustering," Journal of the American Statistical Association, vol. 93, pp. 294–302, 1998.
 [10] Wang, Wei and Yang, Jiong and Muntz, Richard and others, "STING: A statistical information grid approach to spatial data mining," VLDB, vol. 97, pp. 186–195, 1997.
 [11] Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos and Prabhakar Raghavan, “Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications,” SIGMOD, pp. 94–105, 1998.
 [12] Gholamhosein Sheikholeslami, Surojit Chatterjee and Aidong Zhang, "WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases," VLDB, pp. 428–439, 1998.
 [13] Bungartz, Hans-Joachim and Griebel, Michael, "Sparse grids," Acta Numerica, vol. 13, pp. 147–269, 2004.
 [14] Maurus, Samuel and Plant, Claudia, "Skinny-dip: Clustering in a Sea of Noise," SIGKDD, pp. 1055–1064, 2016.
 [15] Jain, Anil K and Murty, M Narasimha and Flynn, Patrick J, “Data clustering: a Review,” ACM computing surveys (CSUR), vol.31, pp. 264–323, 1999.
 [16] Von Luxburg, Ulrike, “A tutorial on spectral clustering,” Statistics and computing, vol.17, pp. 395–416, 2007.
 [17] Hans-Peter Kriegel, Peer Kröger, Jörg Sander and Arthur Zimek, "Density-based Clustering," Wiley Interdisc. Rev.: Data Mining and Knowledge Discovery, vol. 1, pp. 231–240, 2011.
 [18] Jain, Anil K, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, pp. 651–666, 2010.
 [19] Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu, "A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," SIGKDD, pp. 226–231, 1996.
 [20] Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jörg Sander, "OPTICS: Ordering Points To Identify the Clustering Structure," SIGMOD, pp. 49–60, 1999.
 [21] Christian Böhm, Claudia Plant, Junming Shao and Qinli Yang, “Clustering by Synchronization,” SIGKDD, pp. 583–592, 2010.
 [22] Hartigan, John A and Hartigan, Pamela M, “The dip test of unimodality,” The Annals of Statistics, pp. 70–84, 1985.
 [23] Steinhaus, H., “Sur la division des corps matériels en parties,” Bull. Acad. Polon. Sci. (in French), vol.4, pp. 801–804, 1957.
 [24] E.W. Forgy, "Cluster Analysis of Multivariate Data: Efficiency Versus Interpretability of Classifications," Biometrics, vol. 21, pp. 768–769, 1965.
 [25] Aleksandar Bojchevski, Yves Matkovic and Stephan Günnemann, “Robust Spectral Clustering for Noisy Data: Modeling Sparse Corruptions Improves Latent Embeddings,” SIGKDD, pp. 737–746, 2017.
 [26] Celeux G and Govaert G, “A classification EM algorithm for clustering and two stochastic versions,” Elsevier Science Publishers B.V, pp. 315–332, 1992.