Effective Deterministic Initialization for k-Means-Like Methods via Local Density Peaks Searching

11/21/2016
by   Fengfu Li, et al.

The k-means clustering algorithm is popular but has the following main drawbacks: 1) the number of clusters, k, needs to be provided by the user in advance, 2) it can easily reach local minima with randomly selected initial centers, 3) it is sensitive to outliers, and 4) it can only deal with well separated hyperspherical clusters. In this paper, we propose a Local Density Peaks Searching (LDPS) initialization framework to address these issues. The LDPS framework includes two basic components: one of them is the local density that characterizes the density distribution of a data set, and the other is the local distinctiveness index (LDI), which we introduce to characterize how distinctive a data point is compared with its neighbors. Based on these two components, we search for the local density peaks, which are characterized by high local densities and high LDIs, to deal with 1) and 2). Moreover, we detect outliers, characterized by low local densities but high LDIs, and exclude them before clustering begins. Finally, we apply the LDPS initialization framework to k-medoids, which is a variant of k-means that chooses data samples as centers, with diverse similarity measures other than the Euclidean distance to fix the last drawback of k-means. Combining the LDPS initialization framework with k-means and k-medoids, we obtain two novel clustering methods called LDPS-means and LDPS-medoids, respectively. Experiments on synthetic data sets verify the effectiveness of the proposed methods, especially when the ground truth of the cluster number k is large. Further, experiments on several real world data sets, Handwritten Pendigits, Coil-20, Coil-100 and the Olivetti Face Database, illustrate that our methods achieve superior performance compared with analogous approaches on both estimating k and unsupervised object categorization.


1 Introduction

Clustering methods are important techniques for exploratory data analysis, with wide applications ranging from data mining [1], vector quantization [2] and dimension reduction [3] to manifold learning [4]. The aim of these methods is to partition data points into clusters so that data in the same cluster are similar to each other while data in different clusters are dissimilar. The approaches to achieving this aim include methods based on density estimation such as DBSCAN [5], mean-shift clustering [6] and clustering by fast search and find of density peaks [7], methods that recursively find nested clusters [8], and partitional methods based on minimizing objective functions such as k-means [9], k-medoids [10], the EM algorithm [11] and ISODATA [12]. For more information about clustering methods, see [13, 9, 14, 1].

Among partitional clustering methods, the k-means algorithm is probably the most popular and most widely studied one [15]. Given a set of n data points in the d-dimensional Euclidean space and the number of clusters, k, the partitional clustering problem aims to determine a set of k points (centers) in the same space so as to minimize the mean squared distance from each data point to its nearest center. The k-means algorithm solves this problem by a simple iterative scheme for finding a locally minimal solution. It has many advantages: it is conceptually simple and easy to implement, it has linear time complexity per iteration, and it is guaranteed to converge to a local minimum (see, e.g., [16, 17, 18, 19]). However, it also has the following four main drawbacks:

  1. it requires the user to provide the cluster number in advance;

  2. it is highly sensitive to the selection of initial seeds due to its gradient descent nature;

  3. it is sensitive to outliers due to the use of the squared Euclidean distance;

  4. it is limited to detecting compact, hyperspherical clusters that are well separated.

Fig. 1(a)-(d) shows the above issues on toy data sets.

Fig. 1: Illustration of the drawbacks of k-means ((a)-(d)) and the solutions given by our methods ((e)-(h)). The first row shows cases in which k-means fails: (a) k is unreasonably given, (b) the initial seeds (marked in blue) are improperly selected, (c) the effect of outliers is strong, and (d) the data distribution is non-spherical. The second row shows our solutions: (e) an appropriate k is automatically estimated, (f) initial seeds that are geometrically close to the centers of the clusters are selected, (g) outliers are detected and removed before clustering, and (h) a manifold-based distance is used as the dissimilarity measure instead of the squared Euclidean distance to deal with manifold-distributed clusters.

Many approaches have been proposed to deal with these issues individually. The first issue can be partially remedied by extending the k-means algorithm with an estimation of the number of clusters. x-means [20] is one of the first such attempts; it uses splitting and merging rules that let the number of centers increase and decrease as the algorithm proceeds. g-means [21] works similarly to x-means except that it assumes the clusters are generated from Gaussian distributions. dip-means [22], however, only assumes that each cluster admits a unimodal distribution and verifies this with Hartigan's dip test [23].

To address the second issue of k-means, a general approach is to run k-means repeatedly and then to choose the model with the smallest mean squared error. However, this can be time-consuming when k is relatively large. To overcome the adverse effect of randomly selected initial seeds, many adaptive initialization methods have been proposed. The k-means++ algorithm [24] aims to avoid poor quality data partitioning during the restarts and achieves O(log k)-competitive results with the optimal clustering. The Min-Max k-means algorithm [25] deals with the initialization problem of k-means by altering the objective function so that each cluster is weighted by its variance. Methods such as PCA-Part and Var-Part [26] use a deterministic approach based on PCA and the variance of the data to hierarchically split the data set into parts from which initial seeds are selected. For many other deterministic initialization methods, see [27, 18, 19] and the references quoted there.

The third drawback of k-means, its sensitivity to outliers, can be addressed by using a more robust proximity measure [28], such as the Mahalanobis distance or the l1 distance, rather than the Euclidean distance. Another remedy for this issue is to detect the outliers and remove them before the clustering begins. The outlier removal clustering algorithm [29] uses this idea and achieves a better performance than the original k-means method when dealing with overlapping clusters. Other outlier-removing clustering algorithms can be found in [30] and the references quoted there.

The last drawback of k-means that we are concerned with is its inability to separate non-spherical clusters. This issue can be partially remedied by using the Mahalanobis distance to detect hyperellipsoidal clusters. However, it is difficult to optimize the objective function of k-means with non-Euclidean distances. k-medoids [10], as a variant of k-means, overcomes this difficulty by restricting the centers to be data samples themselves. It can be solved effectively (but slowly) by partitioning around medoids (PAM) [31], or efficiently (but only approximately) by CLARA [31]. In addition, the k-medoids algorithm makes it possible to deal with manifold-distributed data by using neighborhood-based (dis)similarity measures. However, k-medoids shares the first two drawbacks of k-means.

In this paper, we propose a novel method named Local Density Peaks Searching (LDPS) to estimate the number of clusters and to select high quality initial seeds for both k-means and k-medoids. A novel measure named the local distinctiveness index (LDI) is proposed to characterize how distinctive a data point is compared with its neighbors: the larger the LDI, the more locally distinguishable the data point. Based on the LDI, we characterize local density peaks by high local densities and high LDIs. A score function based on these two measures is then given to quantitatively evaluate the potential of a data point to be a local density peak. Data points with high scores are found and regarded as local density peaks. By counting the number of local density peaks, a reasonable number of clusters can be obtained for further clustering with k-means or k-medoids. In addition, the local density peaks can also serve as good initial seeds for the k-means or k-medoids clustering algorithm. As a result, the first two drawbacks of k-means and k-medoids are remedied.

In analogy with the search for local density peaks, we characterize outliers by low local densities but high LDIs. Another score function is given to quantitatively evaluate the potential of a data point to be an outlier. Based on these scores, outliers can be effectively detected. To minimize their effect, we remove them before clustering begins; thus, the third issue of k-means is remedied. Fig. 1(e)-(h) shows the clustering results of our methods compared with the original k-means algorithm.

The remainder of the paper is organized as follows. Section 2 briefly reviews some related works. In Section 3, we give a step by step introduction of our initialization framework for k-means and k-medoids. Two novel clustering algorithms, called LDPS-means and LDPS-medoids, are proposed in Section 4. They are based on the LDPS algorithm together with the k-means and k-medoids clustering procedures. Section 5 gives a theoretical analysis of the performance of the proposed methods. Experiments on both synthetic and real data sets are conducted in Section 6 to evaluate the effectiveness of the proposed methods. Final conclusions and discussions are given in Section 7.

Fig. 2: The clustering framework as a combination of the local density and local distinctiveness index based initialization and the k-means type clustering. The number of clusters k is an optional input. The main part of the framework is the initialization stage, which estimates k, selects initial seeds and detects/removes outliers for the clustering stage. The clustering stage then finds an optimized solution. Note that Y = yes, N = no.

2 Related works

2.1 The k-means algorithm

Given a data set X = {x_1, ..., x_n}, the k-means algorithm aims to minimize the Sum of Squared Error (SSE) with k cluster centers C = {c_1, ..., c_k}:

SSE(C, R) = \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \, d(x_i, c_j),   (1)

where R = {r_{ij}} is the assignment index set, with r_{ij} = 1 if x_i is assigned to the j-th cluster and r_{ij} = 0 otherwise, and d(x_i, c_j) = \|x_i - c_j\|^2 is a dissimilarity measure defined by the squared Euclidean distance. To solve (1), the k-means algorithm starts from a set of randomly selected initial seeds and iteratively updates the assignment index set with

r_{ij} = 1 if j = \arg\min_{l} d(x_i, c_l), and r_{ij} = 0 otherwise,   (2)

and the cluster centers with

c_j = \frac{1}{n_j} \sum_{i=1}^{n} r_{ij} x_i.   (3)

Here, n_j = \sum_{i} r_{ij} is the number of points assigned to the j-th cluster. The update procedure stops when R no longer changes or the SSE changes very little between consecutive iterations. The algorithm is guaranteed to converge [16] at a quadratic rate [17] to a local minimum of the SSE, which we denote by SSE*.
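To make the update rules (2) and (3) concrete, the following minimal Python sketch implements the basic k-means iteration for a given set of initial seeds; the function and variable names are ours, not taken from the paper.

import numpy as np

def kmeans(X, seeds, max_iter=100, tol=1e-8):
    """Basic k-means iteration: alternate assignments (2) and center updates (3).
    X is an (n, d) array; seeds is a (k, d) float array of initial centers."""
    centers = seeds.copy()
    prev_sse = np.inf
    for _ in range(max_iter):
        # squared Euclidean distance from every point to every center, shape (n, k)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)                       # assignment step (2)
        sse = d2[np.arange(len(X)), labels].sum()
        for j in range(len(centers)):                    # center update (3)
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
        if prev_sse - sse < tol:                         # stop when the SSE barely changes
            break
        prev_sse = sse
    return centers, labels, sse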

2.2 Variants of the k-means algorithm

2.2.1 k-medoids

k-medoids [10] has the same objective function (1) as k-means. However, it selects data samples as centers (also called medoids), and the pairwise dissimilarity measure is no longer restricted to the square of the Euclidean distance. This leads to a slightly different but more general procedure for updating the centers. Specifically, if we denote the indices of the points in the j-th cluster by I_j, then k-medoids updates the center indices as

m_j = \arg\min_{i \in I_j} \sum_{i' \in I_j} d(x_i, x_{i'}),   (4)

where m_j denotes the index of the j-th center, that is, c_j = x_{m_j}. The assignments in k-medoids are updated in the same way as in k-means.

Compared with k-means, k-medoids can deal with diverse data distributions owing to its greater flexibility in choosing dissimilarity measures. For example, k-means fails to discover the underlying manifold structure of manifold-distributed data, while k-medoids may be able to do so with manifold-based dissimilarity measures.
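As an illustration of the medoid update (4), the sketch below implements one assignment pass and one medoid update from a precomputed dissimilarity matrix; the helper names are ours, and the matrix can come from any dissimilarity measure, not only the Euclidean distance.

import numpy as np

def update_medoids(D, labels, k):
    """Medoid update (4): for each cluster, pick the member that minimizes the
    sum of dissimilarities to the other members. D is an (n, n) dissimilarity matrix."""
    medoids = np.empty(k, dtype=int)
    for j in range(k):
        idx = np.where(labels == j)[0]              # indices of cluster j
        within = D[np.ix_(idx, idx)].sum(axis=1)    # total dissimilarity of each candidate
        medoids[j] = idx[within.argmin()]
    return medoids

def assign(D, medoids):
    """Assignment step: each point joins the cluster of its nearest medoid."""
    return D[:, medoids].argmin(axis=1)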

2.2.2 x-means and dip-means

x-means [20] is an extension of k-means with an estimation of the number of clusters. It uses splitting and merging rules that let the number of centers increase and decrease as the algorithm proceeds. During the process, the Bayesian Information Criterion (BIC) [32] is applied to score each model. The BIC score of a model M is defined as

BIC(M) = l(X) - \frac{p}{2} \log n,

where l(X) is the log-likelihood of the data set according to the model and p is the number of parameters of a model with dimensionality d and k cluster centers. x-means chooses the model with the best BIC score on the data. The BIC criterion works well only when there are plenty of data and well-separated spherical clusters.

dip-means [22] is another extension of k-means with an estimation of the number of clusters. It assumes that each cluster admits a unimodal distribution, which is verified by Hartigan's dip test. The dip test is applied to each split-candidate cluster to compute a score based on its split viewers, the cluster members for which the dip test rejects unimodality at a given significance level. The candidate with the maximum score is split in each iteration. dip-means works well when the data set has various structural types. However, it tends to underestimate k when the clusters are closely adjacent.

2.2.3 k-means++

k-means++ is a popular variant of k-means with adaptive initialization. It selects the first center uniformly at random and then sequentially chooses a data point x to be the j-th (j = 2, ..., k) center with probability proportional to D(x)^2, where D(x) is the minimum distance from x to the closest center already chosen. The method is reported to be O(log k)-competitive with the optimal clustering.

2.3 Clustering by fast search and find of density peaks

Clustering by fast search and find of density peaks (CFSFDP) [7] is a novel clustering method. It characterizes a cluster center as having a higher density than its neighbors and a relatively large distance from other points with high densities. The density of x_i is defined as

\rho_i = \sum_{j \ne i} \chi(d_{ij} - d_c),   (5)

where \chi(x) = 1 if x < 0 and \chi(x) = 0 otherwise, d_{ij} is the distance between x_i and x_j, and d_c is a cutoff distance. Intuitively, \rho_i equals the number of points whose distance from x_i is less than d_c. Another measure \delta_i, which we call the global distinctiveness index (GDI), is the minimum distance between x_i and any other point with a higher density:

\delta_i = \min_{j:\, \rho_j > \rho_i} d_{ij}.   (6)

For the point with the peak density, its GDI is defined as \delta_i = \max_j d_{ij}. Only points with relatively high local densities and high GDIs are considered as cluster centers. A possible way of choosing the cluster centers is to combine the local density and the GDI into a single quantity, for example their product, and then to choose the points with high values of this quantity as the centers. After the cluster centers are determined, each remaining point is assigned to the same cluster as its nearest neighbor of higher density.

Though the CFSFDP method is simple, it is powerful in distinguishing clusters with distinct shapes. In addition, it is insensitive to outliers due to a cluster halo detection procedure. However, CFSFDP also has some disadvantages: 1) it does not give a quantitative rule for choosing the cluster centers automatically, 2) its assignment step is not as clear as those of the k-means and k-medoids algorithms, 3) it is sensitive to the cutoff distance d_c, and 4) it cannot serve as an initialization method for k-means or k-medoids when the number of clusters is given in advance.
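For reference, a minimal sketch of the CFSFDP measures (5) and (6) computed from a pairwise distance matrix is given below; it follows the description above, and the names are ours.

import numpy as np

def cfsfdp_measures(D, d_c):
    """Compute the CFSFDP density (5) and GDI/delta (6) from an (n, n) distance matrix D."""
    n = D.shape[0]
    rho = (D < d_c).sum(axis=1) - 1          # cutoff density, excluding the point itself
    delta = np.empty(n)
    order = np.argsort(-rho)                 # point indices by decreasing density
    delta[order[0]] = D[order[0]].max()      # densest point: distance to the farthest point
    for pos in range(1, n):
        i = order[pos]
        delta[i] = D[i, order[:pos]].min()   # minimum distance to any denser point
    return rho, delta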

Fig. 3: Comparison of GDI and LDI on the R15 data set. (a) The R15 data set, which contains 15 clusters; points marked with a red "+" have the highest local densities within their neighborhoods. (b) The density-GDI graph of R15, illustrating the dominant effect of GDI: although the 15 labeled points have similarly high local densities within their neighborhoods, their GDIs vary widely, from about 0.1 to 0.4; the GDIs of points 1, 2, 4, 6 and 8 are dominated by points 3, 5 and 7, which have relatively larger densities. (c) The density-LDI graph of R15: with a properly chosen neighborhood size, the LDI removes the dominant effect, and the labeled points almost all have the largest LDIs, so they are quantitatively more distinctive than the unlabeled points.

3 Local density and LDI based initialization for k-means-like methods

In this section, we propose an initialization framework for the k-means and k-medoids algorithms. Fig. 2 illustrates the overall diagram of the clustering framework, where Part I is the initialization framework. We first introduce two basic measures: one is the local density and the other is the local distinctiveness index (LDI). Based on these two measures, we propose a local density peaks searching algorithm to find the local density peaks, which can be used to estimate the number of clusters as well as to serve as initial seeds for k-means or k-medoids. In addition, outliers can be detected and removed with the help of these measures. Below is a detailed description of the proposed method.

3.1 Local density and local distinctiveness index

Local density characterizes the density distribution of a data set. We use kernel density estimation (KDE) [33] to compute the local densities. Suppose the samples are generated from a random distribution with an unknown density f. Then the KDE of f at the point x_i is given by

\rho_i = \frac{1}{n h} \sum_{j=1}^{n} K\!\left(\frac{d_{ij}}{h}\right),   (7)

where h is a smoothing parameter called the bandwidth, d_{ij} is the distance between x_i and x_j, and K(\cdot) is a kernel function that is 1) nonnegative, 2) symmetric and 3) normalized to integrate to one. A popular choice for the kernel function is the standard Gaussian kernel:

K(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}.   (8)

The Gaussian kernel is smooth. In addition, compared with the uniform (cutoff) kernel used in [7] (see (5)), the Gaussian kernel has a relatively higher value when the distance is small and thus keeps more local information near zero. In what follows, we will use the Gaussian kernel.

Based on the local density, we propose a new measure called the local distinctiveness index (LDI) to evaluate the distinctiveness of a point x_i compared with the points in its r-neighborhood, where r is a local parameter. We first define the local dominating index set (LDIS) of x_i as the set of indices of its r-neighbors that have a higher local density:

LDIS(x_i) = \{ j : d_{ij} \le r,\ \rho_j > \rho_i \}.

Intuitively, LDIS(x_i) indicates which of the r-neighbors of x_i dominate x_i with respect to the local density measure. Based on the LDIS, we define the LDI as

\lambda_i = \begin{cases} 1, & \text{if } LDIS(x_i) = \emptyset, \\ \min_{j \in LDIS(x_i)} d_{ij} / r, & \text{otherwise}, \end{cases}   (9)

where \emptyset denotes the empty set.

With the definition (9), the LDI lies in [0, 1]. A point has the largest LDI if its LDIS is empty, which means that it is not dominated by any other point: either its r-neighborhood is empty or no neighbor has a higher local density. For any other point, its LDI is computed as the minimal distance between the point and its dominating points, divided by the local parameter r. When r is set larger than the largest pairwise distance, the LDI degenerates to the GDI, since every point then lies in the r-neighborhood of every other point. Thus, the LDI is a generalization of the GDI. However, the LDI characterizes the local property of the data distribution, whereas the GDI does not give us any local information about it. Fig. 3 shows the difference between the GDI and the LDI. The GDI of the point with the highest density is the maximum distance between that point and all other points, and the GDIs of the other points are their minimum distances to points with higher densities. Thus, even though two points may have similarly high local densities within their neighborhoods, their GDIs can differ greatly. We call this phenomenon the dominant effect of the GDI (see Fig. 3(a)-(b)). Fortunately, this effect can be eliminated by using the LDI with an appropriate choice of r, since the LDI of a point is only affected by the points within its r-neighborhood (see Fig. 3(c)). Consequently, the LDI is quantitatively more distinctive than the GDI when the number of clusters is large.
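The following sketch computes the Gaussian-kernel local density of (7)-(8) and a local distinctiveness index in the spirit of (9) from a pairwise distance matrix; the bandwidth h and the neighborhood radius r follow the notation above, and the exact normalization may differ from the paper's.

import numpy as np

def local_density(D, h):
    """Gaussian-kernel density estimate (7)-(8) from a pairwise distance matrix D."""
    K = np.exp(-0.5 * (D / h) ** 2) / np.sqrt(2.0 * np.pi)
    return K.sum(axis=1) / (len(D) * h)

def local_distinctiveness(D, rho, r):
    """LDI sketch (9): minimum distance to a denser point inside the r-neighborhood,
    divided by r; equal to 1 if no neighbor dominates the point (empty LDIS)."""
    n = D.shape[0]
    ldi = np.ones(n)
    for i in range(n):
        dominating = np.where((D[i] <= r) & (rho > rho[i]))[0]   # LDIS of point i
        if dominating.size > 0:
            ldi[i] = D[i, dominating].min() / r
    return ldi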

3.2 Local density peaks searching

In the CFSFDP algorithm, a density peak is characterized by both a higher density than its neighbors and a relatively large distance from the points with higher densities. We use a similar idea to search for the local density peaks by requiring that a local density peak should have a high local density and a large local distinctiveness index. Fig. 4 gives an intuitive explanation of how this works.

To find the local density peaks, we introduce a quantitative score to evaluate the potential of a point to be a local density peak. The score is defined as

(10)

Here, the local density is first normalized before being combined with the LDI.

By definition (10), the score lies in [0, 1] and is an increasing function of both the normalized local density and the LDI. The first term in (10) selects points with both a high local density and a high LDI: a point with a high density and a high LDI obtains a score close to 1, whereas a point with a low local density and a low LDI obtains a score close to 0. The second term balances the influence of the local density and the LDI, so that points whose local density and LDI are both reasonably large are preferred over points for which one measure is large and the other is small.


Fig. 4: Local density and LDI distribution. Point 1 is the unique density peak which has the highest local density and highest LDI among all of the points. Points 2 and 3 have relatively large densities, but their LDIs are very small since their LDISs include Point 1, and their distances to Point 1 are very small. Points 4-7 all have relatively low densities, but their LDIs are different. Points 4 and 5 are relatively close to the center and thus their LDIs are small. Points 6 and 7, however, are far away from the cluster, and as a result, their LDIs are relatively high.

We now analyze the local density and LDI distribution in Fig. 4. First, the point that lies at the centroid of a cluster has the highest local density and LDI and is thus regarded as a local density peak. Secondly, points that are close to the centroid have high local densities but low LDIs, since their LDISs all include the local density peak and their distances to the local density peak are small. Finally, points that are far away from the centroid have relatively low local densities. Quantitatively, the local density peaks in a local area receive high scores, while the other points receive much smaller scores by the definition (10). Thus, there should be a big gap in score between the local density peaks and the other points around them. By observing this gap, the local density peaks can be found. Based on the above discussion, we propose the Local Density Peaks Searching (LDPS) algorithm, stated in Algorithm 1.

Input:  dissimilarity matrix D, bandwidth h and local parameter r
Output:  the number of local density peaks, their indices, and the maximum score gap
1:  Compute the local densities with (7)
2:  Compute the local distinctiveness indices by (9)
3:  Compute the scores from the local densities and LDIs by (10)
4:  Sort the scores in descending order
5:  Compute the gaps (negative differences of consecutive sorted scores)
6:  Locate the biggest gap and set the number of local density peaks accordingly
7:  Take the points with the leading scores as the local density peaks and record their indices
8:  Compute the maximum score gap between the local density peaks and the other points
9:  return the number of local density peaks, their indices and the maximum gap
Algorithm 1 The LDPS Algorithm

In Algorithm 1, the scores are sorted in descending order together with their original indices, and the gap at each position is the (negative) difference between consecutive sorted scores.
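A compact sketch of Algorithm 1 is given below. Since the exact score of (10) is not reproduced here, the sketch uses the product of the normalized local density and the LDI as a stand-in, which preserves the "high density and high LDI" behaviour described above; it reuses the local_density and local_distinctiveness helpers sketched in Section 3.1.

import numpy as np

def ldps(D, h, r):
    """LDPS sketch (Algorithm 1): score each point, sort the scores, and place the
    number of peaks at the largest drop between consecutive sorted scores."""
    rho = local_density(D, h)
    rho_n = rho / rho.max()                      # normalized local density
    lam = local_distinctiveness(D, rho, r)
    score = rho_n * lam                          # stand-in for the score in (10)
    order = np.argsort(-score)                   # descending order of scores
    gaps = -np.diff(score[order])                # drop between consecutive scores
    k_est = int(np.argmax(gaps)) + 1             # number of local density peaks
    peaks = order[:k_est]                        # their indices (initial seeds, see (11))
    return k_est, peaks, gaps[k_est - 1]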

3.3 Estimating k via local density peaks searching

k-means and k-medoids require the number of clusters k as an input. However, it is not always easy to determine the best value of k [20]. Thus, learning k is a fundamental issue for these clustering algorithms. Here, we use the LDPS algorithm to estimate k as the number of local density peaks.

The gap returned by Algorithm 1 is the minimum score gap between the selected local density peaks and the other points and is thus treated as a measure of how distinctive the local density peaks are. The larger this gap, the more reliable the estimated k. If the resulting gap is too small, the procedure for estimating k is considered to have failed; in this case, we fall back to a default estimate.

Compared with x-means and dip-means, which are incremental methods that use splitting/merging rules to estimate k, our method does not have to split the data set into subsets and is thus faster and more stable. Further, CFSFDP uses the two-dimensional density-GDI decision graph (see Fig. 3(b)) together with manual inspection to select the cluster centers, whereas our method estimates k quantitatively and automatically without any manual help.

3.4 Selecting initial seeds with local density peaks

Choosing appropriate initial seeds for the cluster centers is a very important initialization step and plays an essential role in making k-means and k-medoids work properly. Here, we assume that the number of clusters is already known (either given by the user or estimated by the LDPS algorithm) and denote by k the number of clusters used for the clustering algorithms.

If the true number of clusters is not given in advance by the user, we use the estimate produced by the LDPS algorithm as k. In addition, we take the local density peaks obtained by the LDPS algorithm as the initial seeds. In fact, we select the k elements with the leading scores as the initial seeds, that is,

(11)

where the selected index set identifies the initial seeds.

Geometrically, the initial seeds found by (11) have relatively high local densities as well as high LDIs. Thus, they avoid being outliers (due to their high local densities) and avoid lying too close to each other (due to their high LDIs). As a result, these initial seeds can lead to very fast convergence of the k-means algorithm when the clusters are separable. This advantage will be verified by the experiments in Section 6.

3.5 Outlier detection and removal

Outlier detection and removal can be very useful for making k-means work stably. Here, we develop a simple algorithm to detect and remove outliers based on the local density and the LDI. First, we define an outlier score as follows:

(12)

This definition is very similar to that of the peak score (10), except that the outlier score is a decreasing function of the local density and an increasing function of the LDI. Points with low local densities but high LDIs therefore obtain high outlier scores and are regarded as outliers.

Secondly, we use a threshold on the outlier score to detect the outliers, with the principle that the outlier score of an outlier should be greater than the threshold. Points with higher densities or lower LDIs obtain relatively smaller outlier scores and are treated as normal samples.

Finally, we remove the detected outliers from the data set before k-means proceeds.
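A minimal sketch of the outlier detection step is shown below. The outlier score used here, (1 - normalized density) times the LDI, is a stand-in for (12) rather than the paper's exact formula; it is decreasing in the local density and increasing in the LDI, as required above.

import numpy as np

def detect_outliers(rho, lam, threshold):
    """Outlier detection sketch (Section 3.5): points with low density but high LDI
    get a high outlier score; those above the threshold are flagged for removal."""
    rho_n = rho / rho.max()
    beta = (1.0 - rho_n) * lam          # decreasing in density, increasing in LDI
    return np.where(beta > threshold)[0]

# usage: keep = np.setdiff1d(np.arange(len(X)), outlier_indices); X_clean = X[keep]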

3.6 Model selection for the LDPS algorithm

The accuracy of the estimate of k obtained by the LDPS algorithm depends heavily on the bandwidth h used for the local density estimation and the neighborhood size r used for the computation of the LDI. In practice we work with normalized versions of these two parameters. Fig. 5(a) shows the results of the LDPS algorithm with different parameters on the R15 data set. As seen in Fig. 5(a), the estimated k is equal to the ground truth only when the parameters are properly selected.

Fig. 5: Effects of different parameters on the LDPS algorithm for the R15 data set, whose ground truth is k = 15. (a) The estimated k over the (bandwidth, neighborhood size) plane: it equals the ground truth only within a certain region of parameter values. (b) The score gap over the same plane: the maximum gap identifies the parameter pair that is then used for estimating k.

3.6.1 Parameter selection by grid search

In many real applications, we do not know beforehand what the true number of the clusters is. Therefore, we need to define certain criteria to evaluate the estimated number of clusters and do model selection to optimize the criteria.

Here, we use the score gap returned by Algorithm 1 as a criterion to assess how good the estimated k will be. As discussed in Section 3.3, the gap indicates how well the selected local density peaks are separated, in terms of the score, from the other points. Mathematically, it can be written as a function of the dissimilarity matrix and the parameters h and r (see Algorithm 1). The parameters that maximize the gap result in the most distinctive local density peaks. Thus, we choose the parameters by solving the optimization problem:

(13)

Fig. 5(b) shows the gap on the R15 data set for different parameter values, where the dissimilarity measure is the square of the Euclidean distance.

The optimization problem (13) has no explicit solution. A practical way of solving it approximately is the grid search method [34], in which various parameter pairs are tried and the one that yields the maximum gap is picked. Due to the local nature of the density estimation and the LDI, h and r are generally restricted to small ranges. Taking the R15 data set as an example, we split the ranges of h and r into equal fractions and then use a grid search procedure to maximize the gap.
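A grid search for (13) can be sketched as follows, reusing the ldps function sketched in Section 3.2; the parameter grids are user-supplied ranges and are placeholders rather than the values used in the paper.

import numpy as np

def select_parameters(D, h_grid, r_grid):
    """Grid search for (13): try every (bandwidth, radius) pair and keep the one
    whose LDPS run yields the largest score gap."""
    best = (-np.inf, None, None)
    for h in h_grid:
        for r in r_grid:
            _, _, gap = ldps(D, h, r)        # LDPS sketch from Section 3.2
            if gap > best[0]:
                best = (gap, h, r)
    return best  # (maximum gap, best bandwidth, best radius)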

4 LDPS-means and LDPS-medoids

In the previous section, we proposed the LDPS algorithm for initializing k-means and k-medoids. For k-means, the input dissimilarity matrix is the square of the Euclidean distance. For k-medoids, any kind of dissimilarity matrix can be used as input. In view of this difference, the two methods use different procedures for updating the cluster centers.

In this section, we propose two novel clustering algorithms, LDPS-means (Algorithm 2) and LDPS-medoids (Algorithm 3), as combinations of the LDPS initialization algorithm (Algorithm 1) with the clustering procedures of k-means and k-medoids, respectively. Their clustering framework is implemented as in Fig. 2.

Input:  data set X, number of clusters k (optional), bandwidth h, local parameter r, outlier threshold
Output:  cluster centers, assignments and detected outliers
1:  Run the LDPS algorithm (Algorithm 1) to obtain the local densities, LDIs, scores and local density peaks
2:  if k is not given then
3:     Set k to the estimated number of local density peaks
4:  end if
5:  Select the initial seeds with the indices given by (11)
6:  Compute the outlier scores by (12)
7:  Detect the outliers whose outlier scores exceed the threshold
8:  Remove the outliers from the data set
9:  while not converging do
10:     Compute the assignments by (2)
11:     Update the centers by (3)
12:  end while
13:  return the centers, assignments and outliers
Algorithm 2 The LDPS-means Algorithm

Input:  dissimilarity matrix D, number of clusters k (optional), bandwidth h, local parameter r, outlier threshold
Output:  medoids, assignments and detected outliers
1:  Run the LDPS algorithm (Algorithm 1) to obtain the local densities, LDIs, scores and local density peaks
2:  if k is not given then
3:     Set k to the estimated number of local density peaks
4:  end if
5:  Select the initial seeds with the indices given by (11)
6:  Compute the outlier scores by (12)
7:  Detect the outliers whose outlier scores exceed the threshold
8:  Remove the outliers from the data set
9:  while not converging do
10:     Compute the assignments by (2)
11:     Update the medoids by (4)
12:  end while
13:  return the medoids, assignments and outliers
Algorithm 3 The LDPS-medoids Algorithm

LDPS-means is a powerful method to deal with spherically distributed data. However, it is unable to separate non-spherically distributed clusters. LDPS-medoids can deal with this issue by choosing appropriate dissimilarity measures. In the next subsection, we will discuss how to choose an appropriate dissimilarity measure for the LDPS-medoids algorithm.

4.1 Dissimilarity measures for LDPS-medoids

A dissimilarity measure is the inverse of a similarity measure [35], which is a real-valued function that quantifies the similarity between two objects. It can be viewed as a kind of distance that need not satisfy the distance axioms. It assesses the dissimilarity between data samples: the larger it is, the less similar the samples are.

Choosing an appropriate dissimilarity measure for a clustering method is crucial and task-specific. The (square of the) Euclidean distance is the most commonly used dissimilarity measure and is suitable for spherically distributed data. The Mahalanobis distance is a generalization of the Euclidean distance and can deal with hyper-ellipsoidal clusters. If the dissimilarity measure is the l1 distance, k-medoids gives the same result as k-median [36]. For manifold-distributed data, the best choice of dissimilarity measure is the manifold distance [37], which is usually approximated by the graph distance based on an epsilon-neighborhood or a k-nearest-neighbor (k-nn) graph. Graph-based k-means [38] uses this measure. For images, one of the most effective similarity measures is the complex wavelet structural similarity (CW-SSIM) index [39], which is robust to small rotations and translations of images. In [40], a combination of the manifold assumption and the CW-SSIM index is used to construct a new manifold distance named the geometric CW-SSIM distance, which shows a superior performance for visual object categorization tasks. Other examples include the cosine similarity, commonly used in information retrieval and defined on vectors arising from the bag-of-words model. In many machine learning applications, kernel functions such as the radial basis function (RBF) kernel can also be viewed as similarity functions.

In the experiments, whenever manifold-based dissimilarity measures are needed, we always use the k-nn graph as the neighborhood constructor for approximating the manifold distance, with the number of neighbors set to a small value.
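As an example of a manifold-based dissimilarity measure, the sketch below approximates the manifold distance by all-pairs shortest paths on a symmetrized k-nn graph built from Euclidean distances; the function name and the default number of neighbors are ours.

import numpy as np
from scipy.sparse.csgraph import shortest_path

def manifold_distance(X, n_neighbors=5):
    """Approximate the manifold distance by shortest paths on a symmetrized
    k-nn graph built from Euclidean distances."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    n = len(X)
    W = np.full((n, n), np.inf)                        # inf marks "no edge"
    nn = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]   # skip self at position 0
    for i in range(n):
        W[i, nn[i]] = D[i, nn[i]]
    W = np.minimum(W, W.T)                             # symmetrize the graph
    return shortest_path(W, method="D")                # all-pairs graph distances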

5 Performance analysis

In this section, we analyze the performance of the local density and LDI based clustering methods. To simplify the analysis, we make the following assumptions:

  1. the clusters are spherically distributed,

  2. each cluster has a constant number of data points,

  3. the clusters are non-overlapping and can be separated by k-means with appropriate initial seeds.

We use both the final SSE and the number of iterations (updates) as criteria to assess the performance of k-means and LDPS-means. Under the above assumptions, we have the following results.

Theorem 1.

Under Assumptions 1)-3) above, the average number of repeats that k-means needs to achieve the competitive performance of LDPS-means is .

Here, competitive means both a comparably good final SSE and a smaller number of iterations. See the Appendix for the proof.

We now analyze the time complexity that k-means requires to achieve the competitive performance of LDPS-means and compare it with the time complexity of LDPS-means itself. This is summarized in the following theorem.

Theorem 2.

Under Assumptions 1)-3) above, the time complexity of k-means to achieve the competitive performance of LDPS-means is . The relative time complexity of k-means with respect to LDPS-means is .

Note that the two quantities involved do not depend on each other. Thus, compared with k-means, LDPS-means is superior in time complexity when the former is much larger than the latter.

The above theorems also hold for k-medoids and LDPS-medoids, in which case assumption 1) is not needed.

6 Experiments

In this section, we conduct experiments to evaluate the effectiveness of the proposed methods. The experiments consist mainly of two parts: one evaluates the performance of the LDPS algorithm in estimating k, and the other evaluates the clustering performance of LDPS-means and LDPS-medoids obtained with the deterministic LDPS initialization algorithm (Algorithm 1). We also evaluate the effect of the outlier detection and removal procedure on the performance of the clustering algorithms.

All experiments are conducted on a single PC with an Intel i7-4770 CPU (4 cores) and 16 GB RAM.

6.1 The compared methods

In the first part, on estimating k, we compare the LDPS method with x-means, dip-means and CFSFDP on the estimation of the cluster number k. x-means is parameter-free. For dip-means, we set the significance level of the dip test and the voting percentage as in [22]. For CFSFDP, we follow the suggestion in [7] and choose the cutoff distance so that the average number of neighbors is around 1% to 2% of the total number of data points in the data set. Formula (13) is used to estimate the parameters of the LDPS algorithm. We denote LDPS with the square of the Euclidean distance and with the manifold-based dissimilarity measure as LDPS(E) and LDPS(M), respectively.

In the clustering part, we compare the clustering performance of LDPS-means and LDPS-medoids with k-means and k-medoids, respectively.

Data Sets      x-means   dip-means   CFSFDP   LDPS(E)   LDPS(M)

(a) A-sets
A              1         5           5        5         5
A              2         14          20       20        20
A              1         30          35       35        35
A              1         47          47       50        50

(b) S-sets
S              1         47          50       50        50
S              1         43          50       50        50
S              1         33          42       50        50
S              1         30          35       50        50

(c) Dim-sets
D              1         11          41       50        50
D              1         1           37       47        50
D              1         1           30       48        50
D              50        1           25       47        50

(d) Shape-sets
Crescent       8         2           3        3         2
Flame          1         1           7        4         2
Path-based     44        5           2        2         3
Spiral         1         2           3        3         3
Compound       3         3           9        10        7
Aggregation    1         4           10       9         6
R15            1         8           15       15        15
D31            1         22          31       31        31

TABLE I: Results of the estimated k on the synthetic data sets. (a) The A-sets have different numbers of clusters. (b) The S-sets vary in the complexity of the data distribution. (c) The Dim-sets lie in Euclidean spaces of varying dimensionality. (d) The Shape-sets have different shapes.

6.2 The data sets

6.2.1 Overview of the data sets

We use both synthetic data sets and real world data sets for evaluation. Four different kinds of synthetic data sets are used: the A-sets [41] have different numbers of clusters, the S-sets [42] have different data distributions, the Dim-sets vary in dimensionality, and the Shape-sets [7] have different shapes. They can be downloaded from the clustering datasets website (http://cs.joensuu.fi/sipu/datasets/). We made certain modifications to the S-sets and Dim-sets since the original sets are easy to separate; more details can be found in Section 6.4. The real world data sets include Handwritten Pendigits [43], Coil-20 [44], Coil-100 [45] and the Olivetti Face Database [46].

6.2.2 Attribute normalization

In clustering tasks, attribute normalization is an important preprocessing step that prevents attributes with large ranges from dominating the distance computation. In addition, it helps to obtain more accurate numerical computations. In our experiments, the attributes are generally normalized into the interval [0, 1] using min-max normalization. Specifically,

\tilde{x}_{ij} = \frac{x_{ij} - \min_{i} x_{ij}}{\max_{i} x_{ij} - \min_{i} x_{ij}},

where x_{ij} is the j-th attribute of the data point x_i, and the minimum and maximum are taken over all data points. For gray images, whose pixel values lie in [0, 255], we simply normalize the pixels by dividing by 255.
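A minimal sketch of the min-max normalization described above:

import numpy as np

def min_max_normalize(X):
    """Min-max normalization of each attribute into [0, 1]."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    rng = np.where(mx > mn, mx - mn, 1.0)   # avoid division by zero for constant attributes
    return (X - mn) / rng

# gray images with pixel values in [0, 255] can instead be scaled with X / 255.0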

6.3 Performance criteria

For estimating k, we use the simple criterion that better performance is achieved when the estimated number of clusters is closer to the ground truth.

For the task of clustering the synthetic data sets, three criteria are used to evaluate the quality of the initial seeds: 1) the total CPU time needed to reach the target SSE, 2) the number of repeats (#repe) needed to reach it, and 3) the number of assignment iterations (#iter) in the repeat in which it is reached. We first run LDPS-means to obtain an upper bound on the SSE. We then run k-means repeatedly and sequentially, up to a maximum number of repeats, and record the minimal SSE. During this process, once the SSE falls below the upper bound, we record the current number of repeats as #repe and the number of iterations in that repeat as #iter; otherwise, #repe and #iter are recorded when the minimal SSE is achieved within the whole set of repeats. The same strategy is used to record the results of k-means++. The records of CPU time, #iter, #repe and SSE are averaged over duplicate tests to reduce randomness.

On the real world data sets, we consider the unsupervised classification task [47]. Three criteria are used to evaluate the clustering performance by comparing the learned categories with the true categories. First, each learned category is associated with the true category that accounts for the largest number of training cases in that learned category, from which the error rate is computed. The second criterion is the rate of true association, the fraction of pairs of images from the same true category that were correctly placed in the same learned category. The third criterion is the rate of false association, the fraction of pairs of images from different true categories that were erroneously placed in the same learned category. Better clustering performance is characterized by lower error and false-association rates and a higher true-association rate. To compare the performance of LDPS-means with k-means fairly, we make their running times the same by controlling the number of repeats of k-means; the same strategy is applied to LDPS-medoids and k-medoids. The results of k-means and k-medoids are recorded for the repeat with the smallest SSE.
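The three criteria can be computed as in the following sketch; the function name and the majority-mapping implementation of the error rate are ours, following the description above.

import numpy as np
from itertools import combinations

def clustering_metrics(true_labels, learned_labels):
    """Error rate plus rates of true/false association (Section 6.3).
    True labels are assumed to be nonnegative integers."""
    true_labels = np.asarray(true_labels)
    learned_labels = np.asarray(learned_labels)
    n = len(true_labels)
    # error rate: map each learned category to its majority true category
    errors = 0
    for c in np.unique(learned_labels):
        members = true_labels[learned_labels == c]
        errors += len(members) - np.bincount(members).max()
    # pair-counting association rates (quadratic in n; fine for small benchmarks)
    same_true = diff_true = true_assoc = false_assoc = 0
    for i, j in combinations(range(n), 2):
        same_learned = learned_labels[i] == learned_labels[j]
        if true_labels[i] == true_labels[j]:
            same_true += 1
            true_assoc += same_learned
        else:
            diff_true += 1
            false_assoc += same_learned
    return errors / n, true_assoc / same_true, false_assoc / diff_true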

Data Sets   k-means                          k-means++                        LDPS-means
            time (s)  #iter  #repe  SSE      time (s)  #iter  #repe  SSE      time (s)  #iter  SSE

A           0.005     9.8    1.4    7.61     0.05      7.8    1.2    7.61     0.17      4      7.61
A           0.9       15.7   76.2   6.75     1.48      14.4   23.9   6.75     1.36      2      6.75
A           101.6     15.6   4053   7.70     42.9      12.1   219    7.54     3.49      2      7.54
A           275.5     13.7   7411   7.78     2036      10     4612   7.08     6.94      2      6.99
TABLE II: Clustering performance comparison on the A-sets with different numbers of clusters, where "time" is the total CPU time cost in seconds, #iter is the number of iterations when the optimal SSE is achieved, #repe is the number of repeats needed to achieve the optimal SSE, and SSE is the optimal SSE achieved.
Data Sets   k-means                          k-means++                        LDPS-means
            time (s)  #iter  #repe  SSE      time (s)  #iter  #repe  SSE      time (s)  #iter  SSE

 

S   69.7 9.2 3972 3.21   728 4.6 2590 1.98   3.09 2 1.98
S   129 15 5486 4.61   1148 8.6 3967 4.03   3.39 2 3.97
S   142 16.7 5149 6.19   1027 11.8 3482 5.80   3.26 3 5.73
S   186 17.3 6038 7.25   654 16.5 2207 7.20   3.29 4 7.20

 

D   132 11.7 5653 16.5   1675 7.2 5789 15.1   3.63 3 14.6
D   111 16.3 3688 221   276 14 923 219   3.15 6 221
D   140 13 5418 334   1245 11.3 4105 323   3.03 15 312
D   121 12.5 4566 650   291 10.2 953 630   3.13 7 639
TABLE III: Clustering performance on the S-sets and Dim-sets with varying complexity of the data distribution and varying dimensionality, respectively. The meanings of the criteria are the same as in TABLE II.
Fig. 6: Visualization of the clustering results on the Shape-sets by LDPS-medoids: (a) Crescent, (b) Flame, (c) Path-based, (d) Spiral, (e) Compound, (f) Aggregation, (g) R15, (h) D31. Initial seeds, final centroids and detected outliers are marked with distinct symbols. LDPS-medoids correctly detected the outliers in the Flame set and most of the outliers in the Compound set.

6.4 Experiments on synthetic data sets

We use four kinds of synthetic data sets. The A-sets contain three two-dimensional sets with different numbers of circular clusters, each cluster containing the same number of data points. We generate an additional set by selecting five clusters from one of them.

The S-sets are composed of data points sampled from two-dimensional Gaussian clusters with an equal number of points in each cluster. Their centers are the same as those of the min-max normalized set described above, and their covariances are scaled identity matrices in the two-dimensional space, with the scale increasing from one set to the next so that the sets vary in the complexity of the data distribution.

We also generate four Dim-sets of varying dimensionality. The Dim-sets are Gaussian clusters distributed in multi-dimensional spaces, each set having the same number of clusters and the same number of samples per cluster. The first two-dimensional projection of their cluster centers is the same as that of the min-max normalized set described above, and the coordinates of the cluster centers in the remaining dimensions are randomly distributed. Their covariances are scaled identity matrices in the corresponding spaces.

The Shape-sets consist of sets with different shapes (see Fig. 6). Six of them are the Flame set (k = 2), the Spiral set (k = 3), the Compound set (k = 6), the Aggregation set (k = 7), the R15 set (k = 15) and the D31 set (k = 31). We also generate two new Shape-sets, the Crescent set and the Path-based set.

TABLE IV: Summary of the ability of x-means, dip-means, CFSFDP, LDPS(E) and LDPS(M) to learn k on data with a large number of clusters, high-complexity distributions, large dimensionality and varied shapes, rated on a scale from good to general to bad.

6.4.1 Performance on the estimation of k

The results of the estimated k are summarized in TABLE I. From the table it is seen that x-means fails to split most of the data sets, although it obtains the correct result on one of the Dim-sets. dip-means gets better results than x-means in most cases, but it underestimates k on most of the data sets; in particular, it fails to detect any valid cluster in three of the Dim-sets due to their relatively high dimensionality. CFSFDP gets better results than dip-means on most of the data sets. Though CFSFDP underestimates k on some of the data sets and on all of the Dim-sets, it gives a very good estimate of k (very close to the true cluster number) on the Shape-sets. Note that CFSFDP fails on the Flame set, the Compound set and the Aggregation set. This is slightly different from the results reported in [7]. It should be pointed out that the results reported in [7] can be achieved with a very careful selection of parameters and with prior knowledge of the data distribution, which we did not assume in this paper.

Data Sets      x-means   dip-means   CFSFDP (E)   CFSFDP (M)   LDPS (E)   LDPS (M)

 

PD   280   4   3   5   3   3  
PD   453   2   4   4   5   5  
PD   764   7   7   6   8   8  
PD   942   9   7   8   8   11  

 

PD   142   4   6   3   3   3  
PD   265   4   5   5   5   5  
PD   427   3   7   6   7   8  
PD   520   7   8   6   8   9  
TABLE V: Results of the estimated k on the Handwritten Pendigits data set and its subsets.
Data Sets   k-means                  LDPS-means               k-medoids                LDPS-medoids
            err    TA     FA         err    TA     FA         err    TA     FA         err    TA     FA

PD          0.165  0.764  0.171      0.165  0.764  0.171      0.142  0.787  0.142      0.142  0.787  0.142
PD          0.220  0.826  0.111      0.079  0.856  0.037      0.054  0.906  0.029      0.046  0.923  0.023
PD          0.230  0.736  0.072      0.141  0.772  0.039      0.180  0.737  0.052      0.116  0.816  0.031
PD          0.290  0.664  0.066      0.256  0.641  0.051      0.207  0.757  0.041      0.157  0.769  0.034

PD          0.360  0.660  0.448      0.360  0.660  0.448      0.359  0.660  0.446      0.359  0.660  0.446
PD          0.144  0.752  0.066      0.142  0.756  0.066      0.038  0.934  0.019      0.038  0.934  0.019
PD          0.285  0.706  0.083      0.232  0.629  0.061      0.247  0.772  0.082      0.150  0.806  0.042
PD          0.311  0.644  0.067      0.286  0.640  0.049      0.205  0.764  0.043      0.147  0.801  0.031
TABLE VI: Clustering performance comparison on the Handwritten Pendigits data set and its subsets, where err is the error rate, TA is the rate of true association (the fraction of pairs of images from the same true category that were correctly placed in the same learned category), and FA is the rate of false association (the fraction of pairs of images from different true categories that were erroneously placed in the same learned category).

LDPS(E) works very well on most of the data sets. Compared with CFSFDP, it obtains the correct k on several additional sets and a k very close to the ground truth on the Dim-sets. Compared with the other methods, LDPS(M) obtains the best results due to its use of the LDI and the manifold-based dissimilarity measure. Compared with LDPS(E), LDPS(M) shows its superiority in learning k when dealing with the Shape-sets.

Based on the above analysis, we summarize the ability of the comparing methods to estimate k on the synthetic data sets in TABLE IV.

6.4.2 Clustering performance

We first compare the clustering performance of LDPS-means with k-means and k-means++ on the A-sets to verify the result of Theorem 1. The experimental results are listed in TABLE II. As shown in TABLE II, the clustering performance of LDPS-means becomes much better as k increases. On the sets with larger numbers of clusters, LDPS-means greatly outperforms k-means and k-means++. This is consistent with Theorem 1.

We then conduct experiments on the S-sets and Dim-sets to show the capability of LDPS-means in separating clusters with a varying complexity of data distributions and a varying dimensionality, respectively. The experimental results are listed in TABLE III. From the table it is seen that, compared with k-means and k-means++, LDPS-means takes much less time and far fewer iterations to reach the target SSE. Note that the SSE achieved by LDPS-means is smaller than that of k-means and k-means++ on most of the data sets.

Finally, the variants of k-means (including k-means, k-means++ and LDPS-means) fail on most of the Shape-sets. However, LDPS-medoids with the manifold-based dissimilarity measure obtains satisfactory results. Fig. 6 shows the clustering results on the Shape-sets by LDPS-medoids with an appropriate dissimilarity measure and the estimated parameters.

6.5 Experiments on Handwritten Pendigits

We now carry out experiments on a real world data set, Handwritten Pendigits, to evaluate the performance of LDPS-means and LDPS-medoids on general purpose clustering. This data set can be downloaded from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/). The Handwritten Pendigits data set contains 10992 data points with 16-dimensional features. Each of them represents a digit from 0 to 9 written by a human subject. The data set consists of a training set and a testing set with 7494 and 3498 samples, respectively. Apart from the full data set, we also consider three subsets that contain the digits {1,3,5}, {0,2,4,6,7} and {0,1,2,3,4,5,6,7}. On these sets, the manifold distance is approximated by the graph distance, which is the shortest path distance on the graph constructed by k-nn.

6.5.1 Performance on the estimation of k

TABLE V presents the results of estimating k on the Handwritten Pendigits data sets. x-means fails on all of these data sets. dip-means also fails on all of them, although on one of the subsets it comes closer to the true cluster number than all of the other comparing methods. CFSFDP(E) and CFSFDP(M) underestimate k in most cases. Compared with the CFSFDP methods, the LDPS methods obtain the correct k on most of the data sets owing to the LDI. LDPS(M) obtains better results than LDPS(E) on two of the sets due to the use of the manifold distance.

6.5.2 Clustering performance

We now compare the unsupervised object classification performance of LDPS-means and LDPS-medoids with k-means and k-medoids. The results are shown in TABLE VI. As seen in the table, LDPS-means obtains better results than k-means on most of the data sets, except on a few sets where LDPS(E) fails to estimate the correct k. k-medoids obtains better results than the k-means variants due to the use of the manifold distance as the dissimilarity measure. However, LDPS-medoids obtains the best results on all the data sets under all the criteria.

Data Sets      x-means   dip-means   CFSFDP (E)   CFSFDP (M)   LDPS (E)   LDPS (M)

 

Coil-5   74   7   2   5   4   5
Coil-10   150   8   3   11   2   10
Coil-15   215   7   4   14   6   15
Coil-20   296   3   4   14   4   17
TABLE VII: Results of the estimated k obtained by the comparing methods on the Coil-sets.
Data Sets   k-means                  LDPS-means               k-medoids                LDPS-medoids
            err    TA     FA         err    TA     FA         err    TA     FA         err    TA     FA

 

Coil-5   0.317 0.582 0.135   0.478 0.426 0.158   0.025 0.956 0.013   0.025 0.956 0.013
Coil-10   0.169 0.823 0.039   0.226 0.809 0.050   0.114 0.932 0.028   0.030 0.957 0.007
Coil-15   0.246 0.778 0.027   0.237 0.786 0.032   0.221 0.879 0.064   0.029 0.956 0.004
Coil-20   0.336 0.664 0.028   0.353 0.609 0.028   0.281 0.821 0.039   0.171 0.846 0.016
TABLE VIII: Clustering performance comparison of the comparing methods on the Coil-sets, where err is the error rate, TA is the rate of true association (the fraction of pairs of images from the same true category that were correctly placed in the same learned category), and FA is the rate of false association (the fraction of pairs of images from different true categories that were erroneously placed in the same learned category).
Data Sets      x-means   dip-means   CFSFDP (E)   CFSFDP (M)   LDPS (E)   LDPS (M)

 

Coil-25   369   8   5   14   5   24
Coil-50   691   7   13   36   7   49
Coil-75   1069   15   18   45   4   74
Coil-100   1349   20   24   56   19   97
TABLE IX: Results of the estimated k obtained by the comparing methods on the large Coil-sets.
Data Sets   k-means   LDPS-means   k-medoids   LDPS-medoids