Clustering via Boundary Erosion

04/12/2018
by Cheng-Hao Deng, et al.
Xiamen University

Clustering analysis identifies samples as groups based on either their mutual closeness or homogeneity. In order to detect clusters of arbitrary shapes, a novel and generic solution based on boundary erosion is proposed. The clusters are assumed to be separated by relatively sparse regions. The samples are eroded sequentially according to their dynamic boundary densities. The erosion starts from low-density regions, invading inwards, until all the samples are eroded out. In this manner, the boundaries between different clusters become more and more apparent. It therefore offers a natural and powerful way to separate the clusters when the boundaries between them are hard to draw at once. The sequential order of being eroded produces the sequential boundary levels, following which the clusters of arbitrary shapes are automatically reconstructed. As demonstrated across various clustering tasks, the proposed method outperforms most of the state-of-the-art algorithms and its performance is nearly perfect in some scenarios. Moreover, it is very fast at large scale. We also extend our algorithm to high-dimensional data, boosting the performance of the state-of-the-art method.

1 Introduction

Clustering problems arise from a variety of applications, such as document/web page categorization [45], pattern recognition, biomedical analysis [41], data compression via vector quantization [36], and nearest neighbor search [19, 27]. In general, clustering analysis plays an indispensable role in understanding various phenomena across different contexts. Given a set of samples in d-dimensional space, the task of clustering is to partition the data samples into subsets (called clusters) such that the samples in the same cluster are more homogeneous or closer to each other than those from different subsets (clusters).

Traditionally, this issue has been modeled as a distortion minimization problem in k-means [26]. The clustering procedure is organized into two steps. Firstly, samples are assigned to their closest centers. Secondly, the center of each cluster is updated with the data samples assigned to it. These two steps are repeated until the structure of clusters does not change in two consecutive iterations. This algorithm is simple and efficient; however, it is unable to discover clusters that are not spherical in shape. A minimal sketch of this two-step iteration is given below.
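As a reference point only, the following is a minimal NumPy sketch of the two-step iteration described above; the random initialization, the iteration cap and the function name `kmeans` are illustrative assumptions rather than a specific implementation from the literature.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iters):
        # Step 1: assign every sample to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # cluster structure unchanged between two consecutive iterations
        labels = new_labels
        # Step 2: update each center with the samples assigned to it.
        for c in range(k):
            members = X[labels == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)
    return labels, centers
```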

Aiming to identify clusters of arbitrary shapes, a series of algorithms have been proposed in the last two decades. Most of these algorithms [11, 6, 33, 24, 2, 32, 31] are conceived from the perspective of density distribution. Intuitively, samples within each cluster are concentrated, and clusters are separated by sparse regions. Among these algorithms, samples are either iteratively assigned to [11, 2, 33] or shifted towards the density peaks [6]. The clusters are therefore forged. However, the heuristic rules [11] or kernels [6] employed in these algorithms are unable to deal with the various density distributions encountered in practice. For this reason, the performance of these algorithms turns out to be unstable under different scenarios.

Apart from the density based approaches, graph based algorithms are also able to discover clusters of arbitrary shapes. Representative methods are Chameleon [22] and order-constrained transitive distance clustering [44]. In both of them, the connectivity between samples is carefully considered. According to the strategies presented in those papers, samples that are far away from each other are still clustered together as long as they are reachable from each other via a chain of closely connected bridging samples. Unfortunately, both of them require a matrix that keeps the pairwise distances between samples, which makes them unscalable to large-scale clustering tasks.

As a consequence, despite the numerous efforts taken over the last several decades, two major goals in clustering analysis, namely the ability to identify clusters of arbitrary shapes and the scalability towards large-scale and high-dimensional data, are hardly achieved with one algorithm. In this paper, a simple but effective density-based solution is proposed. The basic idea is inspired by the phenomenon of land erosion by water. The boundaries between clusters are drawn gradually by a boundary erosion process without any heuristic rules or kernels. This is particularly powerful in the case that the boundaries between clusters are obscure at first sight. In addition, the bit-by-bit boundary erosion produces a sequential order following which the potential clusters can be reconstructed with the guidance of an r-NN graph. The boundary erosion is feasible in any metric space as long as the density of data samples can be estimated. Furthermore, we also demonstrate that this algorithm achieves satisfactory performance and high efficiency on large-scale and high-dimensional clustering tasks with the support of efficient k-NN graph construction [8].

2 Related Work

Since the proposal of k-means, a variety of clustering algorithms have been proposed in the past three decades, which are in general categorized into seven groups [43]: agglomerative [23], divisive, partitioning [26], density-based [11, 32, 2, 33, 15, 6], graph-based [22] and neural-network-based algorithms. In the literature, ensembles of several existing algorithms have also been used to boost the performance [47, 46]. For comprehensive surveys, readers are referred to [43, 17]. In this section, our focus will be on the review of several typical algorithms that are able to identify clusters of arbitrary shapes, namely the density-based and graph-based algorithms.

Although the clustering problem has been modeled from different perspectives, people basically agree that clusters are composed of samples that are relatively concentrated and are separated in-between by relatively sparse regions. This perception is made without any specification of the distance measure on the input data. Density-based algorithms are in general designed in line with this perception. Although different in details, the density-based algorithms aim to discover groups of samples that are continuously connected.

In general, two steps are involved in the density-based clustering process. Firstly, the local density surrounding each sample is estimated. Given a sample x_i and radius r, the density ρ(x_i) of sample x_i is defined as the number of samples x_j that fall into x_i's neighborhood of range r (as shown in Eqn. 1).

    ρ(x_i) = |{ x_j : d(x_i, x_j) ≤ r }|        (1)

Function d(·,·) in Eqn. 1 returns the distance between x_i and x_j. In the second step, the clusters are forged in basically two different manners. For instance, in DBSCAN [11], a cluster is formed by expanding it from "core points" (points holding high density) to points with low density, while in mean-shift clustering [6], data samples are shifted iteratively from regions of low density towards the density peaks. In DBSCAN, the expansion process can be very sensitive to the parameters; for instance, two heterogeneous clusters are falsely merged into one as the parameter changes slightly. In mean-shift, the shifting process can easily get stuck in a local optimum if there is no obvious density peak. In the approach of clustering based on density peaks (clusterDP) [33], data samples are directly assigned to the closest density peak, where each density peak is recognized as a cluster center. However, it faces a similar problem as mean-shift, since it is hard to identify the cluster center when there is no obvious density peak. Another pitfall of this approach is that the number of peaks to be selected as cluster centers has to be set manually.
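To make the static density estimate of Eqn. 1 concrete, here is a small Python sketch; the brute-force pairwise distance computation and the helper names `rnn_graph` and `density` are illustrative assumptions, not the implementation of any of the cited algorithms.

```python
import numpy as np

def rnn_graph(X, r):
    """For each sample, the indices of neighbors within distance r,
    sorted by ascending distance (this sorted form is reused later)."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    graph = []
    for i in range(len(X)):
        mask = dists[i] <= r
        mask[i] = False                       # a sample is not its own neighbor
        idx = np.where(mask)[0]
        graph.append(idx[np.argsort(dists[i][idx])].tolist())
    return graph

def density(graph):
    """Eqn. 1: the density of sample i is simply its r-neighborhood size."""
    return [len(nbrs) for nbrs in graph]
```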

Recently, several clustering methods have been proposed to deal with high-dimensional data, which are hardly separable in the original space. These methods typically fall into two categories. One is built upon generative models [42, 21]. The other is sparse subspace clustering (SSC) [10, 20]. SSC first projects the high-dimensional data into a sparse and low-dimensional representation; the projected data are then clustered via spectral clustering. The method in [7] in general follows a similar framework to fulfill clustering on high-dimensional data.

3 Clustering via Boundary Erosion

3.1 Motivation

For all the clustering algorithms discussed, the key step is to partition the data samples into groups. However, this is challenging, particularly when the boundaries between different clusters are not obvious. In this paper, a boundary erosion procedure is proposed, which addresses such ambiguity. The idea is inspired by the natural phenomenon of land erosion by water. An illustration is given in Fig. 1. As shown in the figure, the erosion makes the boundaries between clusters explicit gradually, as water erodes the land bit by bit. More importantly, a sequential order that indicates how the samples are assembled one after another into one cluster is established based on the order in which the samples are eroded. With this sequential order, which is referred to as the sequential boundary levels in this paper, the latent clusters can be easily reconstructed.

(a) gradual erosion (step 1)
(b) gradual erosion (step 2)
(c) gradual erosion (step 3)
(d) trend of erosion
Figure 1: An illustration of boundary erosion. The erosion starts from bottomlands, invading inwards. As more and more lands (from (a) to (c)) are eroded out, the boundaries between lands (clusters) become more and more apparent. The erosion continues until all the lands are eroded out.

Notice that this idea is essentially different from the watershed transform [34] in the sense that the “water level” in our case does not rise up to bury the land; instead, it only erodes it. The land on the outer part is eroded earlier than the inner land instead of being buried at the same time, even if they are at the same altitude.

3.2 Generating Boundary Levels via Erosion

To facilitate the boundary erosion, the density of one sample is estimated in a quite different manner from conventional algorithms. Namely, the density of each sample is dynamically estimated by gradually eroding its neighbors out. In order to do that, a dynamic array Q is maintained, in which the samples along with their dynamic boundary densities δ are kept. The dynamic boundary density is given in Eqn. 2.

    δ(x_i) = |{ x_j ∈ Q : d(x_i, x_j) ≤ r }|        (2)

As shown in Eqn. 2, the major difference from Eqn. 1 is that samples outside of Q are not counted during density estimation. At the beginning, all the samples are put into the dynamic array Q. For this reason, δ(x_i) of each sample is initially the same as ρ(x_i) given in Eqn. 1.

The boundary erosion starts by deleting the sample with the lowest density in Q (which corresponds to the boundaries we are most certain about). Each time, the data sample x_i holding the lowest density is removed from Q (it is possible that several samples holding the same density value are removed at once). Due to the removal of sample x_i, the dynamic boundary densities of its neighbors are influenced according to Eqn. 2. Therefore, the densities of x_i's neighbors in Q are recalculated and updated. Thereafter, the next sample holding the lowest dynamic boundary density is identified and removed from Q. This process continues until Q is empty. At each removal, a sequential boundary level is assigned to the samples being removed. Samples that are removed at the same moment are assigned the same level. This erosion process is summarized in Alg. 1.

Data: Data sample matrix X
Result: Boundary levels: B, r-NN graph G
1 Compute r-NN graph G for X;
2 Sort each G[i] in ascending order;
3 Calculate δ(x_i) (Eqn. 2) based on G;
4 Push all samples x_i into Q;
5 Sort Q in ascending order by δ;
6 l ← 1, B[·] ← 0;
7 while Q ≠ ∅ do
8       Pop x_i with the lowest δ from Q;
9       B[i] ← l;
10      for each x_j that keeps x_i in its neighborhood do
11            Calculate δ(x_j);
12            Update Q with δ(x_j);
13      end for
14      l ← l + 1;
15 end while
Algorithm 1 Produce boundary levels via erosion
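For concreteness, the following is a minimal Python sketch of the erosion loop in Alg. 1, assuming the r-NN graph is given as a list of neighbor-index lists (e.g., the output of the `rnn_graph` sketch above). For readability it rescans the whole array to find the lowest dynamic density at each round; an implementation along the lines of the paper would instead use a heap together with the reverse nearest neighbor graph discussed below.

```python
def boundary_erosion(graph):
    """Return a boundary level per sample; low levels are eroded (outer) samples."""
    n = len(graph)
    delta = [len(nbrs) for nbrs in graph]      # Eqn. 2 equals Eqn. 1 while Q is full
    alive = [True] * n                         # membership in the dynamic array Q
    level = [0] * n
    remaining, current_level = n, 1
    while remaining > 0:
        lowest = min(delta[i] for i in range(n) if alive[i])
        batch = [i for i in range(n) if alive[i] and delta[i] == lowest]
        for i in batch:                        # erode the whole lowest-density batch
            alive[i] = False
            level[i] = current_level
            remaining -= 1
        for i in batch:                        # update dynamic densities of survivors
            for j in graph[i]:
                if alive[j]:
                    delta[j] -= 1
        current_level += 1
    return level
```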

This erosion process invades inwards from the boundaries as more and more samples are eroded out. It is imaginable that samples that are initially not located on the cluster border are gradually exposed to the boundary erosion. The erosion continues until all the samples have been deleted from Q. In the erosion process, a sample living inside automatically ruptures as the start of a new boundary when the current lowest dynamic boundary density equals the density of this sample. An illustration of the erosion process is given by movie S1 in the supplementary materials.

In the above process, samples are removed from Q sequentially according to their dynamic boundary density, from low to high. Based on the order of being removed from Q, sample x_i is assigned a boundary level B[i], which reflects both the original density (Eqn. 1) and the innerness of the sample as a cluster member. It is easy to see that a sample lying on the outside holds a lower boundary level than one lying on the inside, even if they share the same density ρ. This is the essential difference between our approach and the watershed transform [34].

(a) density
(b) boundary levels
(c) 3D view of (a)
(d) 3D view of (b)
Figure 2: The comparison between conventional density estimation and the sequential boundary levels produced by boundary erosion. The darker the red color, the higher the value for both (best viewed in color).

Fig. 2(a) and Fig. 2(b) show the density estimation from Eqn. 1 and the boundary levels produced by the erosion process respectively. Accordingly, the 3D views of Fig. 2(a) and Fig. 2(b) are shown in Fig. 2(c) and Fig. 2(d). As shown in the figure, the density estimated by Eqn. 1 is full of potholes. This is not surprising, since it is not necessarily true that density increases smoothly from the border to the center. The cluster expansion undertaken afterwards is easily trapped in the potholes distributed along the density slope, which is a common issue latent in the traditional approaches. This issue is avoided by the dynamic density estimation, which considers both the density and the innerness of one sample. A clear contrast is seen in the 3D views (Fig. 2(c) and Fig. 2(d)): the boundary levels produced by Alg. 1 turn out to be smooth within each emerging cluster.

In Alg. 1, the first step calculates the r-NN graph G, which keeps the nearest neighbors of each sample within range r. r is the only parameter used to set the scale of the neighborhood of each sample. Entry G[i] keeps a list of the nearest neighbors of sample x_i in its neighborhood of radius r. The nearest neighbor list of each entry is sorted in ascending order according to the distances to sample x_i. This will facilitate the subsequent boundary erosion and labeling step (Alg. 2). The time complexity of building nearest neighbor lists for all the samples is quadratic in the number of input samples, which is on the same level as DBSCAN [11, 15] and the algorithm in [33]. The complexity of computing an approximate r-NN graph can be decreased to around O(n^1.14) [8], which will be discussed in detail in a later section.

In order to support fast updating of δ(x_j) for the samples x_j affected by the removal of x_i (Alg. 1, Lines 10-13), a reverse nearest neighbor graph R [8] is also maintained, in which R[i] keeps the data samples x_j that have x_i in their nearest neighbor lists. Essentially, the reverse nearest neighbor graph is nothing more than a simple reorganization of G.
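A small sketch of this reorganization, assuming the same list-of-lists representation of G as in the sketches above:

```python
def reverse_graph(graph):
    """R[i] lists the samples whose r-NN lists contain i, i.e. the samples whose
    dynamic boundary density must be updated when sample i is eroded."""
    R = [[] for _ in range(len(graph))]
    for j, nbrs in enumerate(graph):
        for i in nbrs:
            R[i].append(j)        # sample j keeps i in its neighbor list
    return R
```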

Discussion

The advantages of boundary erosion are severalfold. Firstly, the erosion always takes place in the region of lowest density. It therefore guarantees that the boundaries between clusters are drawn along the most likely regions. In this sense, the global optimality of this process is reached even though it is a greedy strategy. Secondly, the bit-by-bit erosion allows the boundaries between clusters to be drawn gradually instead of at once, which is appropriate when the boundaries are not clear at the beginning. More importantly, the gradual erosion produces an ordered sequence that sorts samples from boundary to center, the reverse of which regularizes a roadmap for cluster expansion. In the whole process, no kernels or heuristic rules are introduced, which avoids any unnecessary assumptions on the data distribution or metric spaces.

3.3 Label Propagation

Once the sequence of boundary levels is produced, the clustering process becomes natural and can be conveniently undertaken. It is basically a process of cluster expansion that starts from the peaks of boundary levels (given in Alg. 2). The propagation starts from the data sample with the highest boundary level. A data sample x_i is assigned a new cluster label if none of its neighbors in G[i] is labeled. Otherwise, the sample is assigned the same cluster label as its closest neighbor that has been labeled in previous rounds. Likewise, the unlabeled samples are sequentially visited following the boundary levels from high to low. The process continues until all the samples are assigned a label. In this process, the expansion of one cluster stops automatically when it reaches the cluster boundary, where samples from other clusters hold higher boundary levels. An illustration of this propagation procedure is given by movie S2 in the supplementary materials.

Data: Boundary levels: B, r-NN graph G
Result: Cluster labels: L
1 Sort X by B in descending order;
2 C ← 1, L[·] ← 0;
3 while X ≠ ∅ do
4       Pop x_i from X;
5       for each x_j in G[i] do
6             if L[j] ≠ 0 then
7                   L[i] ← L[j]; break;
8             end if
9       end for
10      if L[i] = 0 then
11            L[i] ← C;
12            C ← C + 1;
13      end if
14 end while
Algorithm 2 Label propagation based on r-NN graph
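A minimal Python sketch of Alg. 2 follows, assuming `levels` holds the boundary levels produced by the erosion sketch and `graph` is the same r-NN graph with every neighbor list sorted by ascending distance:

```python
def label_propagation(levels, graph):
    """Assign cluster labels by visiting samples from high boundary level to low."""
    n = len(graph)
    labels = [0] * n                      # 0 means "not yet labeled"
    order = sorted(range(n), key=lambda i: levels[i], reverse=True)
    next_label = 1
    for i in order:
        for j in graph[i]:                # neighbors sorted by distance, so the first
            if labels[j] != 0:            # labeled one is the closest labeled neighbor
                labels[i] = labels[j]
                break
        if labels[i] == 0:                # no labeled neighbor: start a new cluster
            labels[i] = next_label
            next_label += 1
    return labels
```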

As a summary, the proposed clustering process consists of three steps. Firstly, given the radius of neighborhood r, a nearest neighbor graph is built; in the graph, a list of the neighbors falling within range r is kept for each sample. Secondly, with the support of the nearest neighbor graph, the sequential boundary levels are produced by the boundary erosion process. Finally, clusters are produced by propagating cluster labels sequentially from samples holding high boundary levels to those holding lower ones.

Similar to DBSCAN [11], mean-shift [6] and clusterDP [33], our algorithm is able to identify clusters of arbitrary shapes as well as the outliers. However, the proposed approach is more attractive from several points of view. On one hand, unlike DBSCAN, no heuristic rules are introduced, which makes the clustering insensitive to extra parameter settings. On the other hand, unlike mean-shift or clusterDP [33], no kernel is adopted in the density estimation, which makes it feasible for various types of metric spaces. Moreover, unlike DBSCAN or clusterDP [33], no cluster centers or cluster peaks are explicitly defined or specified; instead, similar to affinity propagation [13], the cluster peaks and the clusters emerge gradually. Furthermore, the algorithm places no specification on the distance measure. As a consequence, unlike k-means [26], mean-shift [6] or the recent OCTD [44], it is feasible for various metric spaces as long as the density of samples can be estimated.

The boundary erosion shares a similar motivation with “border-peeling” in [3]; however, they are essentially different in three major aspects. Firstly, no kernel is introduced in our approach. Secondly, all samples will be eroded out after the erosion process; in contrast, core points are reserved for cluster expansion in “border-peeling”. Finally, “border-peeling” relies on DBSCAN to reconstruct the clusters, while clustering in boundary erosion is undertaken via label propagation with the guidance of an r-NN graph.

In the above label propagation process, the same r-NN graph is used as in the boundary erosion process (Alg. 1). Alternatively, it is feasible to use a different r-NN graph in the label propagation. In some cases, the density of a sample is very low. Such samples are usually recognized as outliers by Alg. 1. However, in certain scenarios, we may expect such outliers to be assigned to the clusters that are closest to them. To achieve that, the r-NN graph supplied to the above expansion procedure is revised. In particular, G[i] is augmented to the top-k nearest neighbors when the size of its nearest neighbor list is less than k, where k is another given parameter. In the experiment section, we are going to show that this augmented propagation strategy is meaningful in certain circumstances. A sketch of the augmentation is given below.
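A possible sketch of this augmentation, again assuming brute-force distances purely for illustration:

```python
import numpy as np

def augment_graph(X, graph, k):
    """Top up sparse r-NN lists with the k nearest neighbors regardless of radius,
    so low-density outliers can still be attached to their closest cluster."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    augmented = []
    for i, nbrs in enumerate(graph):
        if len(nbrs) >= k:
            augmented.append(list(nbrs))
        else:
            order = np.argsort(dists[i])          # ascending distance, self comes first
            augmented.append([j for j in order if j != i][:k])
    return augmented
```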

According to our observation, boundary erosion fails only when samples from different clusters are mixed with each other. In this case, the assumption of this algorithm that clusters are separated by sparse regions actually breaks. However, it is possible to address this issue with recent sub-space embedding [20]. The input high-dimensional data are first projected into a lower-dimensional and separable space by DSC-Net-L2 [20]. Boundary erosion is then applied on the projected data, which will be illustrated in the experiment section.

4 Clustering in Large-scale

As presented in Alg. 1, the r-NN graph is required as the prerequisite of the boundary erosion process. The time complexity of calculating the r-NN graph could be as high as O(n²). Moreover, in the worst case, the space complexity of keeping the r-NN graph is close to O(n²), since one cannot assume in advance how many neighbors are located in range r. As a consequence, this algorithm becomes computationally inefficient when both the scale n and the dimension d are large. To address this issue, an approximate solution is presented in this section.

4.1 Clustering with Approximate r-NN Graph

As shown above, it is computationally expensive to calculate an exact r-NN graph, particularly in high-dimensional and large-scale cases. Many attempts have been made to seek approximate solutions for this issue. Thanks to the progress made in recent years, with the NN-Descent algorithm presented in [8], it is possible to construct a k-NN graph of high accuracy under an empirical complexity of around O(n^1.14). More attractively, there is no specification on the distance measure in the algorithm, which is precisely in line with our clustering algorithm.

In our practice for large-scale clustering tasks, the first step of Alg. 1 (i.e., Line 1) is modified. NN-Descent [8] is called to produce an approximate k-NN graph. The k-NN list of each sample is further pruned according to the given parameter r, which results in an approximate r-NN graph. The rest of the clustering process remains unaltered. In the experiment section, the results on large-scale image clustering are illustrated. A sketch of this pruning step is given below.
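The following sketch illustrates the modified first step, using the third-party `pynndescent` package as a stand-in for NN-Descent [8]; the package and the parameter choices are assumptions made purely for illustration and are not the authors' implementation.

```python
import numpy as np
from pynndescent import NNDescent

def approximate_rnn_graph(X, r, k=30):
    """Build an approximate k-NN graph with NN-Descent, then prune each k-NN
    list by the radius r to obtain an approximate r-NN graph."""
    index = NNDescent(X, n_neighbors=k, metric="euclidean")
    neighbors, distances = index.neighbor_graph        # arrays of shape (n, k)
    graph = []
    for i in range(len(X)):
        keep = distances[i] <= r                        # prune the k-NN list by radius r
        row = [int(j) for j, ok in zip(neighbors[i], keep) if ok and j != i]
        graph.append(row)                               # already sorted by distance
    return graph
```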

4.2 Complexity Analysis

As presented in the previous sections, clustering via boundary erosion consists of three steps. In the first step, an r-NN graph is built. The time complexity of building an exact r-NN graph is O(n²). This is feasible for low-dimensional and small-scale cases, while for high-dimensional and large-scale tasks, NN-Descent [8] is adopted for approximate r-NN graph construction, for which the construction complexity is around O(n^1.14) [8]. In the second step, the boundary erosion process operates on a dynamic array Q. Each time, at least one sample with the lowest boundary density is removed from the array, and on average k samples in the array (the average number of samples that fall into a neighborhood of radius r) influenced by the removal are updated. For efficiency, the dynamic array can be implemented with a heap. The removal repeats at most n times. As a result, the time complexity of this step is O(k·n·log n), which is on the n·log(n) level. In the label propagation step, it is clear that the complexity is only O(k·n). Overall, the complexity of the clustering algorithm is O(n²) if one expects an exact solution, which is suitable for small-scale tasks. For large-scale cases, the overall complexity is only on the n·log(n) level with the support of NN-Descent k-NN graph construction, which is even more efficient than the conventional k-means.

5 Experiments

In the following, the performance of the proposed boundary erosion (BE) is studied in comparison with several state-of-the-art approaches on various evaluation benchmarks and tasks, such as synthetic data of different distributions, face image grouping, and clustering on biological as well as large-scale image data. For the large-scale image clustering part, our algorithm is implemented in C++, compiled with GCC 5.4, and run single-threaded on a PC with a 3.6GHz CPU and 16GB of memory.

5.1 Clustering on Synthetic Datasets

Figure 3: Clustering results on six synthetic datasets (best viewed in color). The datasets in Fig. 3(a to f) are namely Aggregation (AGG) [16], S3 [12], Flame [14], Spiral [4], Jain [18] and Pathbased (Path) [5] respectively. The valid ranges of r that allow reproducing the results shown are [1.5, 2.4], [3.5, ], [1.3, 2.6], 2.5 and 3.8 for a, b, c, d, e, and f respectively.
Figure 4: Clustering results on six synthetic datasets (best viewed in color) by boundary erosion with augmented propagation. The valid ranges of r that allow reproducing the results shown are [1.4, 2.4], [2.6, ], [1.0, 2.7], [1.3, 4.1], [2.1, 2.5] and 3.8 for a, b, c, d, e, and f respectively.

The first experiment is conducted on six synthetic datasets, all of which are 2D spatial data points. The data points are drawn from probability distributions with nonspherical shapes. These datasets have been widely adopted to test the robustness of a clustering algorithm. Results produced by our algorithm are presented in Fig. 3. The valid range of the parameter r for each dataset that allows reproducing the same results is accordingly attached. As shown in the figure, the proposed algorithm is able to identify all the clusters as well as the outliers in each case. In particular, satisfactory results are observed on the challenging datasets S3, Path and Jain, on which existing approaches hardly produce decent results.

Fig. 4 shows the results from the augmented propagation strategy. As shown in the figure, the results for datasets a, d and f are the same as in the previous experiment, while for datasets b, c and e, where outliers are present, the augmented propagation assigns the outliers to the clusters that are closest to them. This is meaningful in the case that one prefers producing clusters without isolated outliers. The results in Fig. 4 are also quantitatively shown in Tab. 1 in terms of clustering accuracy [44]. BE is compared to k-means (KMS), spectral clustering (SC) and order-constrained transitive distance clustering (OCTD). Perfect results are achieved on most of the datasets. Compared to the results from the most representative methods (in [44]), BE achieves the best performance in all cases.

In the following experiments, the results are produced by the r-NN graph without augmentation unless otherwise specified.

Mthd.  Agg     S3      Flame   Spiral  Path    Jain
KMS    87.92   85.58   84.17   33.97   74.34   78.28
SC     99.37   8.10    98.75   59.30   97.00   100.00
OCTD   99.87   U.A.    100.00  100.00  96.66   100.00
BE     100.00  95.80   100.00  100.00  100.00  100.00
Table 1: Clustering accuracy (%) on synthetic datasets. For BE, the augmented r-NN graph is adopted in label propagation, which allows isolated samples to be assigned to their closest clusters.
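For reference, the clustering accuracy used in Tab. 1 is commonly computed by matching predicted clusters to ground-truth classes with the Hungarian algorithm and counting the fraction of correctly matched samples. The sketch below follows that standard formulation and is an assumption rather than the exact evaluation code of [44].

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, pred_labels):
    """Best-matching accuracy between predicted clusters and ground-truth classes."""
    t_ids, t = np.unique(np.asarray(true_labels), return_inverse=True)
    p_ids, p = np.unique(np.asarray(pred_labels), return_inverse=True)
    cost = np.zeros((len(p_ids), len(t_ids)), dtype=int)
    for pi, ti in zip(p, t):
        cost[pi, ti] += 1                       # confusion counts
    row, col = linear_sum_assignment(-cost)     # maximize matched samples
    return cost[row, col].sum() / len(t)
```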

5.2 Clustering on Biological Data

Mthd.       F-score  Para. Settings
DIANA       0.991    metric = , k = 26
AGNES       0.987    Complete-link, metric = , k = 25
HC          0.987    Complete-link, k = 25
TC          0.986    T = 48.868
clusterDP   0.975    k = 25, dc = 258.645
clusterONE  0.946    s = 1, d = 0.0
MC          0.923    I = 2.196
k-Medoids   0.912    k = 37
AP          0.910    dampfact = 0.845, preference = 80.827, maxits = 5000, convits = 500
DBSCAN      0.680    eps = 323.306, MinPts = 1
SC          0.656    k = 11
BE          0.998    r = 60
Table 2: Comparisons to state-of-the-art algorithms on the Brown dataset. Results of the state-of-the-art algorithms are cited from [41]

In this part, our algorithm is tested on the Brown dataset [41], which is a biological dataset. In this dataset, the affinity matrix that keeps pairwise distances between DNA sequences is supplied. The entries are pairwise distances of BLASTed sequences of 232 proteins belonging to 29 groups of families. In this case, algorithms such as k-means are not feasible since they only work in l2-space. In this study, BE is compared to DIANA [23], AGNES [23], Hierarchical Clustering (HC) [38], Transitivity Clustering (TC) [40], clusterDP [33], clusterONE [30], Markov Clustering (MC) [9], k-Medoids (PAM) [23], Affinity Propagation (AP) [13], DBSCAN [11] and Spectral Clustering (SC) [35].

Twenty-eight clusters are produced by our algorithm when r is set to 60. The F-score is 0.998, which is nearly perfect. This is also the best performance ever reported according to [41], as shown in Table 2. Affinity propagation [13] only achieves 0.910, which is considerably worse than that of our algorithm. Our algorithm also outperforms clusterDP [33] by more than 2%.

5.3 Clustering on High-dimensional Data

Figure 5: Clustering results on the 40 groups of Olivetti Face Database (best viewed in color). Faces with the same cover color are clustered into the same cluster. The cluster centers are labeled with a white circle.

Our algorithm is also tested on two face datasets, namely the Olivetti Face Database (ORL) [37] and Extended Yale B (EYaleB) [25], and two visual object image datasets, namely COIL20 [29] and COIL100 [28]. These four datasets contain 400, 2432, 1440 and 7400 images from 40, 38, 20 and 100 visual object groups respectively. On these four datasets, clustering algorithms are expected to identify images that are from the same object groups. Since the images are not directly separable by their pixel intensities (i.e., RGB), the images are projected to a low-dimensional feature space by DSC-Net-L2 [20]. Our algorithm (BE) is adopted in the final clustering stage. In the experiments, DSC-Net-L2 in combination with BE (denoted as DSC+BE) is compared to sparse subspace clustering (SSC) [10], DSC-Net-L2 in combination with spectral clustering (denoted as DSC+SC), and the standard configuration of DSC-Net-L2 based clustering [20], in which a discriminative variant of spectral clustering is integrated. The clustering error rates [20] are shown in Table 3. As shown in the table, DSC+BE outperforms or is very close to the best results ever reported on these datasets.

Datasets SSC DSC DSC+SC DSC+BE
ORL 32.50 14.00 15.16 12.23
EYaleB 27.51 2.67 11.92 4.52
COIL20 14.86 5.14 9.00 3.82
COIL100 45.00 30.96 34.99 31.67
Table 3: Clustering error rates on four image datasets. The radius of BE is set to 0.55, 0.55, 0.06 and 0.06 respectively on four datasets

Overall, superior performance is achieved in all experiments and on different categories of data by BE, which is essentially attributed to its extraordinary capability of identifying clusters in arbitrary shapes and the genericness of its model.

5.4 Image Clustering in Large-scale

In this section, the effectiveness of the proposed clustering algorithm is verified on an image clustering/linking task. A subset of YFCC100M [39] is adopted for evaluation. There are 1.1 million images in total. They are represented with deep features from HybridNet [1], which are reduced to 128 dimensions by PCA. In the clustering, NN-Descent is called to build the approximate r-NN graph for the YFCC 1.1 million set. In the experiments, top-k is fixed to 5 for YFCC 1.1 million, while r is set to 0.70. The augmented propagation is adopted in the cluster expansion stage, which avoids isolating similar images that are under severe transformations.

It takes around 20 minutes for r-NN graph construction and 1.2 minutes to produce 474,500 groups. In contrast, the same task would take more than 100 hours with k-means. Most of the clusters produced by our algorithm are meaningful. There are 4,268 clusters that contain more than 3 images. Since no ground-truth is available, only three sample groups are shown in Fig. 6. As shown in the figure, the algorithm performs reasonably well even with the support of an approximate r-NN graph. According to our observation, the small clusters (whose size is less than 10) are comprised of near-duplicate images, which is highly helpful for large-scale image linking tasks.

Figure 6: Sample image groups that have been successfully identified by our algorithm on YFCC 1.1 million.

6 Conclusion

Boundary erosion is a process of uncovering the natural structure of potential clusters. The erosion starts from the boundaries of the clusters, invading inwards, until it reaches all the density peaks. It therefore produces a sequential order following which the clusters are naturally reconstructed. In the whole process, only one parameter, namely the radius of neighborhood r, is involved. The density peaks, the corresponding clusters and the cluster boundaries emerge automatically. The effectiveness of the algorithm has been verified on various clustering tasks and at different scales. Due to its simplicity, genericness, speed, and superior performance across various datasets, this algorithm will find its value in various science and engineering tasks.

7 Acknowledgments

This work is supported by the National Natural Science Foundation of China under grant 61572408.

References

  • [1] G. Amato, F. Falchi, C. Gennaro, and F. Rabitti. YFCC100M HybridNet fc6 deep features for content-based image retrieval. In The 2016 ACM Workshop on Multimedia COMMONS, pages 11–18, 2016.
  • [2] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: ordering points to identify the clustering structure. In ACM Sigmod record, pages 49–60. ACM, 1999.
  • [3] N. Bar, H. Averbuch-Elor, and D. Cohen-Or. Border-peeling clustering. CoRR, abs/1612.04869, 2016.
  • [4] H. Chang and D.-Y. Yeung. Robust path-based spectral clustering. Pattern Recognition, 41(1):191–203, January 2008.
  • [5] H. Chang and D.-Y. Yeung. Robust path-based spectral clustering. Pattern Recognition, 41(1):191–203, 2008.
  • [6] Y. Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):790–799, August 1995.
  • [7] K. G. Dizaji, A. Herandi, C. Deng, W. Cai, and H. Huang. Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 5747–5756. IEEE, 2017.
  • [8] W. Dong, C. Moses, and K. Li. Efficient k-nearest neighbor graph construction for generic similarity measures. In International Conference on World Wide Web, pages 577–586, Mar. 2011.
  • [9] S. Dongen. A cluster algorithm for graphs. Technical report, CWI (Centre for Mathematics and Computer Science), Amsterdam, The Netherlands, 2000.
  • [10] E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE transactions on pattern analysis and machine intelligence, 35(11):2765–2781, 2013.
  • [11] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD), pages 226–231, 1996.
  • [12] P. Fränti and O. Virmajoki. Iterative shrinking method for clustering problems. Pattern Recognition, 39(5):761–775, May 2006.
  • [13] B. J. Frey and D. Dueck. Clustering by passing messages between data points. Science, 315:972–976, February 2007.
  • [14] L. Fu and E. Medico. FLAME, a novel fuzzy clustering method for the analysis of dna microarray data. BMC Bioinformatics, 8(3), January 2007.
  • [15] J. Gan and Y. Tao. Dbscan revisited: Mis-claim, un-fixability, and approximation. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 519–530, 2015.
  • [16] A. Gionis, H. Mannila, and P. Tsaparas. Clustering aggregation. ACM Transactions on Knowledge Discovery from Data, 1(1), March 2007.
  • [17] R. Greenlaw and S. Kantabutra. Survey of clustering: Algorithms and applications. International Journal of Information Retrieval and Resouces, 3(2):1–29, April 2013.
  • [18] A. K. Jain and M. H. Law. Data clustering: A user’s dilemma. PReMI, 3776:1–10, 2005.
  • [19] H. Jégou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, January 2011.
  • [20] P. Ji, T. Zhang, H. Li, M. Salzmann, and I. Reid. Deep subspace clustering networks. In Advances in Neural Information Processing Systems, pages 23–32, 2017.
  • [21] Z. Jiang, Y. Zheng, H. Tan, B. Tang, and H. Zhou. Variational deep embedding: An unsupervised and generative approach to clustering. In International Joint Conference on Artificial Intelligence, 2017.
  • [22] G. Karypis, E.-H. S. Han, and V. Kumar. Chameleon: Hierarchical clustering using dynamic modeling. Computer, 32(8):68–75, August 1999.
  • [23] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
  • [24] H.-P. Kriegel, P. Kröger, J. Sander, and A. Zimek. Density-based clustering. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(3):231–240, 2011.
  • [25] K.-C. Lee, J. Ho, and D. J. Kriegman. Acquiring linear subspaces for face recognition under variable lighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5):684–698, 2005.
  • [26] S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28:129–137, March 1982.
  • [27] M. Muja and D. G. Lowe. Scalable nearest neighbor algorithms for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36:2227–2240, 2014.
  • [28] S. A. Nene, S. K. Nayar, and H. Murase. Columbia Object Image Library (COIL-100). Technical Report CUCS-006-96, 1996.
  • [29] S. A. Nene, S. K. Nayar, H. Murase, et al. Columbia object image library (coil-20). Technical report CUCS-005-96, 1996.
  • [30] T. Nepusz, H. Yu, and A. Paccanaro. Detecting overlapping protein complexes in protein-protein interaction networks. Nature methods, 9(5):471–472, 2012.
  • [31] C. Otto, D. Wang, and A. Jain. Clustering millions of faces by identity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  • [32] T. Pei, A. Jasra, D. J. Hand, A.-X. Zhu, and C. Zhou. DECODE: a new method for discovering clusters of different densities in spatial data. Data Mining and Knowledge Discovery, 18(3):337–369, 2009.
  • [33] A. Rodriguez and A. Laio. Clustering by fast search and find of density peaks. Science, 344(6191):1492–1496, June 2014.
  • [34] J. B. Roerdink and A. Meijster. The watershed transform: Definitions, algorithms and parallelization strategies. Journal Fundamenta Informaticae, 41(2):187–228, April 2000.
  • [35] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, August 2000.
  • [36] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In IEEE International Conference on Computer Vision, pages 1470–1477, October 2003.
  • [37] F. S. Samaria and A. C. Harter. Parameterisation of a stochastic model for human face identification. In IEEE Workshop on Applications of Computer Vision, pages 138–142, 1994.
  • [38] R. C. Team. R: A language and environment for statistical computing. Technical report, R Foundation for Statistical Computing., 2012.
  • [39] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64–73, Feb. 2016.
  • [40] T. Wittkop, D. Emig, S. Lange, S. Rahmann, M. Albrecht, J. H. Morris, S. Böcker, J. Stoye, and J. Baumbach. Partitioning biological data with transitivity clustering. Nature methods, 7(6):419–420, 2010.
  • [41] C. Wiwie, J. Baumbach, and R. Röttger. Comparing the performance of biomedical clustering methods. Nature methods, 12(11):1033–1038, 2015.
  • [42] J. Xie, R. Girshick, and A. Farhadi. Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning, pages 478–487, 2016.
  • [43] R. Xu and D. I. Wunsch. Survey of clustering algorithms. Transactions on Neural Networks, 16(3):645–678, May 2005.
  • [44] Z. Yu, W. Liu, W. Liu, Y. Yang, M. Li, and B. V. K. V. Kumar. On order-constrained transitive distance clustering. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pages 2293–2299. AAAI Press, 2016.
  • [45] Y. Zhao and G. Karypis. Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning, 55:311–331, 2004.
  • [46] L. Zheng, T. Li, and C. Ding. A framework for hierarchical ensemble clustering. ACM Transactions on Knowledge Discovery from Data, 9(2):9:1–9:23, September 2014.
  • [47] Z.-H. Zhou. Ensemble Methods: Foundations and Algorithms. Chapman & Hall/CRC, 1st edition, 2012.