An efficient K-means clustering algorithm for massive data

01/09/2018
by Marco Capó, et al.
UPV/EHU
BCAM

The analysis of continuously larger datasets is a task of major importance in a wide variety of scientific fields. In this sense, cluster analysis algorithms are a key element of exploratory data analysis, due to their ease of implementation and relatively low computational cost. Among these algorithms, the K-means algorithm stands out as the most popular approach, despite its high dependency on the initial conditions, as well as the fact that it might not scale well on massive datasets. In this article, we propose a recursive and parallel approximation to the K-means algorithm that scales well on both the number of instances and the dimensionality of the problem, without affecting the quality of the approximation. In order to achieve this, instead of analyzing the entire dataset, we work on small weighted sets of points that mostly intend to extract information from those regions where it is harder to determine the correct cluster assignment of the original instances. In addition to different theoretical properties, which explain the reasoning behind the algorithm, experimental results indicate that our method outperforms the state-of-the-art in terms of the trade-off between the number of distance computations and the quality of the solution obtained.

1 Introduction

Partitional clustering is an unsupervised data analysis technique that intends to unveil the inherent structure of a set of points by partitioning it into a number of disjoint groups, called clusters. This is done in such a way that intra-cluster similarity is high and inter-cluster similarity is low. Furthermore, clustering is a basic task in many areas, such as artificial intelligence, machine learning and pattern recognition [12, 17, 21].

Even though there exists a wide variety of clustering methods, the K-means algorithm remains one of the most popular [6, 18]. In fact, it has been identified as one of the top algorithms in data mining [34].

1.1 K-means Problem

Given a set of $n$ data points (instances) $X = \{x_1, \ldots, x_n\}$ in $\mathbb{R}^d$ and an integer $K$, the K-means problem is to determine a set of $K$ centroids $C = \{c_1, \ldots, c_K\}$ in $\mathbb{R}^d$, so as to minimize the following error function:

$E(C) = \sum_{x \in X} \min_{c \in C} \|x - c\|^2$    (1)

This is a combinatorial optimization problem, since it is equivalent to finding the partition of the $n$ instances into $K$ groups whose associated set of centers of mass minimizes Eq. 1. The number of possible partitions is a Stirling number of the second kind, $S(n, K)$ [19].

Since finding the globally optimal partition is known to be NP-hard [1], even for instances in the plane [25], and exhaustive search methods are not useful under this setting, iterative refinement based algorithms are commonly used to approximate the solution of the K-means and similar problems [19, 22, 24]. These algorithms iteratively relocate the data points between clusters until a locally optimal partition is attained. Among these methods, the most popular is the K-means algorithm [18, 24].

1.2 K-means Algorithm

The K-means algorithm is an iterative refinement method that consists of two stages: an initialization, in which we set the starting set of centroids, and an iterative stage, called Lloyd's algorithm [24]. In the first step of Lloyd's algorithm, each instance is assigned to its closest centroid (assignment step), then the set of centroids is updated as the centers of mass of the instances assigned to the same centroid in the previous step (update step). Finally, a stopping criterion is verified. The most common criterion involves the computation of the error function (Eq. 1): if the error does not change significantly with respect to the previous iteration, the algorithm stops [27]. That is, if $C$ and $C'$ are the sets of centroids obtained at consecutive Lloyd's iterations, then the algorithm stops when

$|E(C) - E(C')| \le \varepsilon$    (2)

for some small threshold $\varepsilon > 0$.

Conveniently, every step of the K-means algorithm can be easily parallelized [35], which is key to meeting the scalability requirements of the algorithm [34].

The time needed for the assignment step is $O(nKd)$, while updating the set of centroids requires $O(nd)$ computations and the stopping criterion, based on the computation of the error function, is $O(nd)$. Hence, the assignment step is the most computationally demanding, due to the number of distance computations that need to be done at this step. Taking this into account, the main objective of our proposal is to define a variant of the K-means algorithm that controls the trade-off between the number of distance computations and the quality of the solution obtained, oriented to problems with high volumes of data. Lately, this problem has gained special attention due to the exponential increase of the data volumes that scientists, from different backgrounds, face on a daily basis, which hinders the analysis and characterization of such information [9].
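To make the two stages concrete, the following minimal NumPy sketch implements Lloyd's algorithm as described above (assignment, update and the error-based stop of Eq. 2). The function and variable names are ours, not from the paper, and a Forgy-style random initialization is used only for the example.

import numpy as np

def lloyd(X, C, tol=1e-4, max_iter=100):
    """Minimal Lloyd's algorithm: assignment + update + error-based stop."""
    prev_error = np.inf
    for _ in range(max_iter):
        # Assignment step: O(n*K*d) distance computations.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)   # (n, K)
        labels = d2.argmin(1)
        error = d2[np.arange(len(X)), labels].sum()           # Eq. (1)
        # Stopping criterion (Eq. (2)): the error barely changed.
        if abs(prev_error - error) <= tol:
            break
        prev_error = error
        # Update step: each centroid becomes the center of mass of its cluster.
        for k in range(C.shape[0]):
            if np.any(labels == k):
                C[k] = X[labels == k].mean(0)
    return C, labels

# Example: 1,000 random 2-D points, K = 3 centroids chosen uniformly at random (Forgy).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
C0 = X[rng.choice(len(X), size=3, replace=False)].copy()
C, labels = lloyd(X, C0)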

1.2.1 Common initializations

It is widely reported in the literature that the performance of Lloyd’s algorithm highly depends upon the initialization stage, in terms of the quality of the solution obtained and the running time [29]. A poor initialization, for instance, could lead to an exponential running time in the worst case scenario [33].

Ideally, the selected seeding/initialization strategy should deal with different problems, such as outlier detection and cluster oversampling. A lot of research has been done on this topic; a detailed review of seeding strategies can be found in [30, 32].

The standard initialization procedure consists of performing several re-initializations via Forgy's method [14] and keeping the set of centroids with the smallest error [30, 32]. Forgy's technique defines the initial set of centroids as $K$ instances selected uniformly at random from the dataset. The intuition behind this approach is that, by choosing the centroids uniformly at random, we are more likely to choose a point near an optimal cluster center, since such points tend to be located in the highest density regions. Besides the fact that computing the error of each candidate set of centroids is $O(nKd)$ (due to the assignment step), the main disadvantage of this approach is that there is no guarantee that two, or more, of the selected seeds will not be near the center of the same cluster, especially when dealing with unbalanced clusters [30].

More recently, simple probability-based seeding techniques have been developed and, due to their simplicity and strong theoretical guarantees, they have become quite popular. Among these, the most relevant is the K-means++ algorithm proposed by Arthur and Vassilvitskii in [2]. K-means++ selects only the first centroid uniformly at random from the dataset; each subsequent initial centroid is chosen with a probability proportional to its squared distance with respect to the previously selected set of centroids.

The key idea of this cluster initialization technique is to preserve the diversity of seeds while being robust to outliers. The K-means++ algorithm leads to an $O(\log K)$ factor approximation of the optimal error after the initialization [2] (an algorithm $A$ is an $\alpha$ factor approximation of the K-means problem if $E(C) \le \alpha \cdot E(C_{OPT})$ for any output $C$ of $A$). The main drawbacks of this approach refer to its sequential nature, which hinders its parallelization, as well as to the fact that it requires $K$ full scans of the entire dataset, which leads to a complexity of $O(nKd)$.
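The seeding just described can be sketched as follows; this is an illustrative reconstruction (names and defaults are our own), not the reference implementation of [2].

import numpy as np

def kmeans_pp_init(X, K, rng=np.random.default_rng(0)):
    """K-means++ seeding: first seed uniform, next seeds ~ squared distance."""
    n = len(X)
    centroids = [X[rng.integers(n)]]                  # first centroid: uniform
    d2 = ((X - centroids[0]) ** 2).sum(1)             # squared distance to closest seed
    for _ in range(K - 1):
        probs = d2 / d2.sum()                         # D^2 sampling distribution
        centroids.append(X[rng.choice(n, p=probs)])
        d2 = np.minimum(d2, ((X - centroids[-1]) ** 2).sum(1))
    return np.stack(centroids)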

In order to alleviate such drawbacks, different variants of K-means++ have been studied. In particular, in [4], a parallel K-means++ type algorithm is presented. This parallel variant achieves a constant factor approximation to the optimal solution after a logarithmic number of passes over the dataset. Furthermore, in [3], an approximation to K-means++ that has a sublinear time complexity, with respect to the number of data points, is proposed. Such an approximation is obtained via a Markov chain Monte Carlo sampling based approximation of the K-means++ probability function. The proposed algorithm generates solutions of similar quality to those of K-means++, at a fraction of its cost.

1.2.2 Alternatives to Lloyd’s algorithm

Regardless of the initialization, a large amount of work has also been done on reducing the overall computational complexity of Lloyd’s algorithm. Mainly, two approaches can be distinguished:


  • The use of distance pruning techniques: Lloyd's algorithm can be accelerated by avoiding unnecessary distance calculations, i.e., when it can be verified in advance that no cluster re-assignment is possible for a certain instance. As presented in [11, 13, 15], this can be done by constructing different pairwise distance bounds between the set of points and the centroids, together with additional information, such as the displacement of every centroid after a Lloyd's iteration. In particular, in [15], reductions of over 80% in the number of distance computations are observed.

  • Applying Lloyd’s algorithm over a smaller (weighted) set of points: As previously commented, one of the main drawbacks of Lloyd’s algorithm is that its complexity is proportional to the size of the dataset, meaning that it may not scale well for massive data applications. One way of dealing with this is to apply the algorithm over a smaller set of points rather than over the entire dataset. Such smaller sets of points are commonly extracted in two different ways:


    1. Via dataset sampling: In [5, 7, 10, 31], different statistical techniques are used with the same purpose of reducing the size of the dataset. Among these algorithms, we have the Mini-batch K-means proposed by Sculley in [31]. Mini-batch K-means is a very popular scalable variant of Lloyd’s algorithm that proceeds as follows: given an initial set of centroids obtained via Forgy’s algorithm, at every iteration, a small fixed amount of samples is selected uniformly at random and assigned to their corresponding clusters. Afterwards, the cluster centroids are updated as the average of all samples ever assigned to them. This process continues until convergence. Empirical results, in a range of large web based applications, corroborate that a substantial saving of computational time can be obtained at the expense of some loss of cluster quality [31]. Moreover, very recently, in [28], an accelerated Mini-batch K-means algorithm via the distance pruning approach of [13] was presented.

    2. Via dataset partition: The reduction of the dataset can also be generated as sets of representatives induced by partitions of the dataset. In particular, there have been a number of recent papers that describe factor approximation algorithms and/or $(K, \varepsilon)$-coresets for the K-means problem [16, 23, 26] (a weighted set of points $S$ is a $(K, \varepsilon)$-coreset if, for all sets of $K$ centroids $C$, $|E_S(C) - E(C)| \le \varepsilon \cdot E(C)$, where $E_S(C) = \sum_{p \in S} w(p) \min_{c \in C} \|p - c\|^2$ and $w(p)$ is the weight associated to a representative $p$). However, these variants tend to be exponential in $K$ and are not at all viable in practice [2]. Moreover, the approximation algorithm for the K-means problem proposed by Kanungo et al. [20] has a computational cost that makes it unsuitable for massive data applications. In particular, for this kind of applications, another approach has been very recently proposed in [8]: the Recursive Partition based K-means algorithm.

1.2.2.1 Recursive Partition based K-means algorithm

The Recursive Partition based K-means algorithm (RPM) is a technique that approximates the solution of the K-means problem through a recursive application of a weighted version of Lloyd's algorithm over a sequence of spatial-based thinner partitions of the dataset:

Definition 1 (Dataset partition induced by a spatial partition).

Given a dataset $X$ and a spatial partition $B$ of its smallest bounding box, the partition of the dataset induced by $B$ is defined as $P = \{P_b\}_{b \in B}$, where $P_b = \{x \in X : x \in b\}$ (from now on, we will refer to each $b \in B$ as a block of the spatial partition $B$).

Applying a weighted version of the K-means algorithm over the dataset partition $P$ consists of executing Lloyd's algorithm (Section 1.2) over the set of centers of mass (representatives) $\bar{P}_b$ of the subsets $P_b \in P$, considering their corresponding cardinalities (weights) $|P_b|$ when updating the set of centroids. That is, we seek to minimize the weighted error function $E_P(C) = \sum_{P_b \in P} |P_b| \cdot \min_{c \in C} \|\bar{P}_b - c\|^2$, where $\bar{P}_b = \frac{1}{|P_b|} \sum_{x \in P_b} x$.

Afterwards, the same process is repeated over a thinner partition $P'$ of the dataset (a partition $P'$ of the dataset is thinner than $P$ if each subset of $P$ can be written as the union of subsets of $P'$), using as initialization the set of centroids obtained for $P$. In Algorithm 1, we present the pseudo-code of an RPM type algorithm:

Input: Dataset $X$ and number of clusters $K$.
Output: Set of centroids $C$.
Step 1: Construct an initial partition $P$ of $X$ and define an initial set of centroids $C$.
Step 2: $C$ = WeightedLloyd($P$, $C$, $K$).
while not Stopping Criterion do
       Step 3: Construct a dataset partition $P'$, thinner than $P$. Set $P = P'$.
       Step 4: $C$ = WeightedLloyd($P$, $C$, $K$).
end while
return $C$
Algorithm 1 RPM algorithm pseudo-code

In general, the RPM algorithm can be divided into three tasks: the construction of an initial partition of the dataset and an initial set of centroids (Step 1), the update of the corresponding set of centroids via the weighted Lloyd's algorithm (Step 2 and Step 4) and the construction of the sequence of thinner partitions (Step 3). Experimental results have shown a reduction of several orders of magnitude in the number of computations for RPM with respect to both K-means++ and Mini-batch K-means, while obtaining competitive approximations to the solution of the K-means problem [8]. A sketch of the weighted Lloyd step used in Step 2 and Step 4 is given below.
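The following sketch illustrates the weighted Lloyd step of Algorithm 1: Lloyd's algorithm applied to the block representatives, with each representative weighted by its block cardinality. It is a simplified illustration under the notation introduced above; the helper name and data layout are assumptions of ours.

import numpy as np

def weighted_lloyd(reps, weights, C, tol=1e-4, max_iter=100):
    """Lloyd's algorithm over block representatives, weighted by block sizes."""
    prev_error = np.inf
    for _ in range(max_iter):
        d2 = ((reps[:, None, :] - C[None, :, :]) ** 2).sum(-1)   # (|P|, K)
        labels = d2.argmin(1)
        # Weighted error E_P(C): each representative counts |P_b| times.
        error = (weights * d2[np.arange(len(reps)), labels]).sum()
        if abs(prev_error - error) <= tol:
            break
        prev_error = error
        for k in range(C.shape[0]):
            mask = labels == k
            if mask.any():
                # Weighted center of mass of the representatives assigned to centroid k.
                C[k] = np.average(reps[mask], axis=0, weights=weights[mask])
    return C, labels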

1.3 Motivation and contribution

In spite of the quality of the practical results presented in [8], due to the strategy followed in the construction of the sequence of thinner partitions, there is still large room for improvement. The results presented in [8] refer to an RPM variant called grid based RPM. In the case of the grid based RPM, the initial spatial partition is defined by the grid obtained after dividing each side of the smallest bounding box of $X$ in half, i.e., a grid with $2^d$ equally sized blocks. In the same fashion, at the $i$-th grid based RPM iteration, the corresponding spatial partition is updated by dividing each of its blocks into $2^d$ new blocks, i.e., the induced dataset partition can have up to $2^{i \cdot d}$ representatives. It can be shown that this approach produces a $(K, \varepsilon)$-coreset with $\varepsilon$ decreasing exponentially with respect to the number of iterations (see Theorem 1 in Appendix 4).

Taking this into consideration, three main problems arise for the grid based RPM:


  • Problem 1. It does not scale well with the dimension $d$: Observe that, for a relatively low number of iterations $i$ and/or dimensionality $d$, applying this RPM version can be similar to applying Lloyd's algorithm over the entire dataset, i.e., no reduction in the number of distance computations might be observed, as the number of representatives, $2^{i \cdot d}$, rapidly approaches $n$. In fact, the experiments in [8] are restricted to low-dimensional problems.

  • Problem 2. It is independent of the dataset $X$: As we noticed before, regardless of the analyzed dataset $X$, the sequence of partitions of the grid based RPM is induced by an equally sized spatial partition of the smallest bounding box containing $X$. In this sense, the induced partition does not consider features of the dataset, such as its density, to construct the sequence of partitions: a large amount of computational resources might be spent on regions whose misclassification does not add a significant error to our approximation. Moreover, the construction of every partition of the sequence has a cost proportional to the size of the dataset, which is particularly expensive for massive data applications, as $n$ can be huge.

  • Problem 3. It is independent of the problem: The partition strategy of the grid based RPM does not explicitly consider the optimization problem that K-means seeks to minimize. Instead, it offers a simple, but inefficient, way of generating a sequence of spatial thinner partitions.

    The reader should note that each block of the spatial partition can be seen as a restriction over the K-means optimization problem that enforces all the instances contained in it to belong to the same cluster. Therefore, it is in our interest to design smarter spatial partitions oriented to focus most of the computational resources on those regions where the correct cluster affiliation is not clear. By doing this, not only can a large amount of computational resources be saved, but some additional theoretical properties can also be deduced.

    Among other properties that we discuss in Section 2, at first glance it can be observed that if all the instances in a set of points are correctly assigned for two sets of centroids, $C$ and $C'$, then the difference between the error of both sets of centroids is equivalent to the difference of their weighted errors, i.e., $E(C) - E(C') = E_P(C) - E_P(C')$ (see Lemma 1 in Appendix 4). Moreover, if this occurs for each subset of a dataset partition and for the centroids generated after two consecutive weighted K-means iterations, then we can guarantee a monotone decrease of the error for the entire dataset (see Theorem 2 in Appendix 4). Likewise, we can actually compute the reduction of the error for the newly obtained set of centroids without computing the error function for the entire dataset, as in this case it equals the reduction of the weighted error. Last but not least, when every block contains instances belonging to the same cluster, the solution obtained by our weighted approximation is actually a local optimum of Eq. 1 (see Theorem 3 in Appendix 4).

In any case, independently of the partition strategy, the RPM algorithm offers some interesting properties, such as no clustering repetition. That is, none of the obtained groupings of the instances into $K$ groups can be repeated at the current RPM iteration or for any partition thinner than the current one. This is a useful property, since it guarantees that the algorithm discards many possible clusterings at each RPM iteration while using a much smaller set of points than the entire dataset. Furthermore, this fact enforces a decrease of the maximum number of Lloyd iterations that we can have for a given partition. In practice, it is also common to observe a monotone decrease of the error for the entire dataset [8].

Bearing all these facts in mind, we propose an RPM type approach called the Boundary Weighted K-means algorithm (BWM). The name of our proposal summarizes the main intuition behind it: to generate competitive approximations to the K-means problem by dividing those blocks that may not be well assigned, which constitute the current cluster boundaries of our weighted approximation.

Definition 2 (Well assigned blocks).

Let $C$ be a set of centroids and $X$ a given dataset. We say that a block $b$ is well assigned with respect to $C$ and $X$ if every point $x \in P_b$ is assigned to the same centroid $c \in C$.

The notion of well assigned blocks is of interest because RPM assigns all the instances contained in a certain block to the same cluster, namely the one that its center of mass belongs to. Hence, our goal is to divide those blocks that are not well assigned. Moreover, in order to control the growth of the set of representatives and to avoid unnecessary distance computations, we have developed an inexpensive partition criterion that allows us to detect blocks that may not be well assigned. Our main proposal can be divided into three tasks:


  • Task 1: Design of a partition criterion that decides whether or not to divide a certain block, using only information obtained from the weighted Lloyd’s algorithm.

  • Task 2: Construct an initial partition of the dataset given a fixed number of blocks, which are mostly placed on the cluster boundaries.

  • Task 3: Once it has been decided that a certain block must be cut, guarantee a low increase in the number of representatives without affecting, if possible, the quality of the approximation. In particular, we propose a criterion that, in the worst case, implies a linear growth in the number of representatives after an iteration.

Observe that both Task 2 and Task 3 ease the scalability of the algorithm with respect to the dimensionality of the problem, $d$ (Problem 1). Furthermore, the goal of Task 1 and Task 2 is to generate partitions of the dataset that, ideally, contain only well assigned subsets, i.e., all the instances contained in a certain subset of the partition belong to the same cluster (Problem 2 and Problem 3). As we previously commented, this fact implies additional theoretical properties in terms of the quality of our approximation.

The rest of this article is organized as follows: In Section 2, we describe the proposed algorithm, introduce some notation and discuss some theoretical properties of our proposal. In Section 3, we present a set of experiments in which we analyze the effect of different factors, such as the size of the dataset and the dimension of the instances, on the performance of our algorithm. Additionally, we compare these results with those obtained by the state-of-the-art. Finally, in Section 4, we define the next steps and possible improvements to our current work.

2 BWM algorithm

In this section, we present the Boundary Weighted K-means algorithm. As already commented, BWM is a scalable improvement of the grid based RPM algorithm (from now on, we assume each block to be a hyperrectangle) that generates competitive approximations to the K-means problem, while reducing the amount of computations that the state-of-the-art algorithms require for the same task. BWM reuses all the information generated at each weighted Lloyd run to construct a sequence of thinner partitions that alleviates Problem 1, Problem 2 and Problem 3.

Our new approach makes major changes in all the steps of Algorithm 1 except Step 2 and Step 4. In these steps, a weighted version of Lloyd's algorithm is applied over the set of representatives and weights of the current dataset partition $P$. This process has a cost proportional to the number of representatives, $|P|$, hence it is in our interest to control the growth of $|P|$, which is highlighted in both Task 2 and Task 3.

In the following sections, we describe in detail each step of BWM: in Section 2.1, Section 2.2 and Section 2.3, we elaborate on Task 1, Task 2 and Task 3, respectively.

2.1 A cheap criterion for detecting well assigned blocks

BWM tries to efficiently determine the set of well assigned blocks in order to update the dataset partition. In the following definition, we introduce a criterion that helps us verify this using, mostly, information generated by our weighted approximation:

Definition 3.

Given a set of centroids $C$, a set of points $X$ and a block $b$ with induced subset of points $P_b$, we define the misassignment function for $b$, given $C$ and $X$, as:

$\varepsilon_b(C) = \max\{0,\ 2 \cdot l_b - (\|\bar{P}_b - c_2\| - \|\bar{P}_b - c_1\|)\}$    (3)

where $c_1$ and $c_2$ are the closest and second closest centroids in $C$ to the representative $\bar{P}_b$, and $l_b$ is the length of the diagonal of $b$. In the case $P_b = \emptyset$, we set $\varepsilon_b(C) = 0$.

The following result is used in the construction of both the initial partition and the sequence of thinner partitions:

Theorem 1.

Given a set of centroids $C$, a dataset $X$ and a block $b$, if $\varepsilon_b(C) = 0$, then every point $x \in P_b$ is assigned to the same centroid as $\bar{P}_b$, i.e., $b$ is well assigned with respect to $C$ and $X$ (the proof of Theorem 1 is in Appendix 4).

In other words, if the misassignment function of a block is zero, then the block is well assigned; otherwise, the block may not be well assigned. Even though the condition in Theorem 1 is only a sufficient condition, we will use the following heuristic rule during the development of the algorithm: the larger the misassignment function of a certain block is, the more likely it is to contain instances with different cluster memberships.

In particular, Theorem 1 offers an efficient and effective way of verifying that all the instances contained in a block belong to the same cluster, using only information related to the structure of the block and the set of centroids $C$. Observe that we do not need any information associated to the individual instances in the dataset. The criterion just requires some distance computations with respect to the representative of the block, $\bar{P}_b$, which are already available from the weighted Lloyd's algorithm.

Definition 4.

Let $X$ be a dataset, $C$ a set of centroids and $B$ a spatial partition. We define the boundary of $B$ given $C$ and $X$ as

$F(B, C) = \{ b \in B : \varepsilon_b(C) > 0 \}$    (4)

The boundary of a spatial partition is simply the subset of blocks with a positive misassignment function value, that is, the blocks that may not be well assigned. In order to control the size of the spatial partition and the number of distance computations, BWM only splits blocks from the boundary. A sketch of how the misassignment function and the boundary can be computed from the output of the weighted Lloyd step is shown below.
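A possible implementation of the cutting criterion: the sketch below evaluates the misassignment function of Eq. 3 for every block from its representative, its diagonal length and the current centroids, and then extracts the boundary of Definition 4. The data layout (arrays of representatives and diagonals) is our own simplification, not code from the paper.

import numpy as np

def misassignment(reps, diags, C):
    """Eq. (3): eps_b = max(0, 2*l_b - (d_2nd - d_1st)) for each block b."""
    d = np.sqrt(((reps[:, None, :] - C[None, :, :]) ** 2).sum(-1))  # (|P|, K) distances
    two_closest = np.sort(d, axis=1)[:, :2]                         # d_1st, d_2nd per block
    margin = two_closest[:, 1] - two_closest[:, 0]
    return np.maximum(0.0, 2.0 * diags - margin)

def boundary(reps, diags, C):
    """Definition 4: indices of the blocks that may not be well assigned."""
    return np.flatnonzero(misassignment(reps, diags, C) > 0.0)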

In Fig. 1, we illustrate the information needed for a certain block of the spatial partition (the one marked out in black) to verify the criterion presented in Theorem 1. In this example, there are only two cluster centroids (blue stars) and the representative of the instances in the block is given by the purple diamond. In order to compute the misassignment function of the block, we require the lengths of three segments: the distances between the representative and its two closest centroids in $C$ (blue dotted lines) and the diagonal of the block (purple dotted line). If the misassignment function is zero, then we know that all the instances contained in the block belong to the same cluster. Observe that, in this example, the block contains instances of both the red and the blue cluster, so the misassignment function is positive and, thus, the block is included in the boundary.

Figure 1: Information required for computing the misassignment function of a block.
Theorem 2.

Given a dataset $X$, a set of centroids $C$ and a spatial partition $B$ of the dataset, the difference between the K-means error and the weighted error, $|E(C) - E_P(C)|$, is bounded from above by a quantity, made up of two additive terms, that decreases as more blocks become well assigned and as the diagonals of the blocks become shorter (the proof of Theorem 2 is in Appendix 4).

According to this result, we must increase the number of well assigned blocks and/or reduce the diagonal lengths of the blocks of the spatial partition, so that our weighted error function better approximates the K-means error function, Eq. 1. Observe that, by reducing the diagonal of the blocks, not only is the condition of Theorem 1 more likely to be satisfied, but we are also directly reducing both additive terms of the bound in Theorem 2. This last point gives the intuition for our new partition strategy: i) split only those blocks in the boundary and ii) split them on their longest side. A sketch of such a split is shown below.
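As an illustration of point ii), the following helper (ours, not from the paper) splits a hyperrectangular block at the midpoint of its longest side, producing two equally shaped sub-blocks with strictly shorter diagonals.

import numpy as np

def split_block(lo, hi):
    """Split the hyperrectangle [lo, hi] at the midpoint of its longest side."""
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    axis = np.argmax(hi - lo)                 # longest side of the block
    mid = 0.5 * (lo[axis] + hi[axis])
    left_hi, right_lo = hi.copy(), lo.copy()
    left_hi[axis] = mid
    right_lo[axis] = mid
    return (lo, left_hi), (right_lo, hi)      # two equally shaped sub-blocks

# Example: a 2-D block that is longer along the first axis.
(b1_lo, b1_hi), (b2_lo, b2_hi) = split_block([0.0, 0.0], [4.0, 1.0])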

2.2 Initial Partition

In this section, we elaborate on the construction of the initial dataset partition used by the BWM algorithm (see Step 1 of Algorithm 5, which contains the main pseudo-code of BWM). Starting with the smallest bounding box of the dataset, the proposed procedure iteratively divides subsets of blocks of the spatial partition with high probabilities of not being well assigned. In order to determine these blocks, we develop a probabilistic heuristic based on the misassignment function, Eq. 3.

As our new cutting criterion is mostly based on the evaluation of the misassignment function associated to a certain block, we first need to construct a starting spatial partition of a predefined size, from which we can select the sets of centroids with respect to which the function is computed (Step 1). From then on, multiple sets of centroids are selected via weighted K-means++ runs over the set of representatives of the dataset partition, for different subsamplings. This allows us to estimate a probability distribution that quantifies the chances of each block not being well assigned (Step 2). Then, according to this distribution, we randomly select the most promising blocks to be cut (Step 3), and divide them until reaching the desired number of blocks (Step 4). In Algorithm 2, we show the pseudo-code of the procedure proposed for generating the initial spatial partition.

Input: Dataset $X$, number of clusters $K$, an integer parameter and the desired size of the initial spatial partition.
Output: Initial spatial partition $B$ and its induced dataset partition $P$.
Step 1: Obtain a starting spatial partition $B$ of a smaller predefined size (Algorithm 3).
while the size of $B$ is lower than the desired size of the initial spatial partition do
       Step 2: Compute the cutting probability of each block $b \in B$ (Algorithm 4).
       Step 3: Sample blocks from $B$, with replacement, according to the cutting probabilities, to determine a subset of blocks to be cut.
       Step 4: Split each selected block and update $B$.
end while
Step 5: Construct the induced dataset partition $P$.
return $B$ and $P$.
Algorithm 2 Construction of the initial partition

In Step 1, a partition of the smallest bounding box containing the dataset $X$ is obtained by recursively splitting blocks according to the pseudo-code shown in Algorithm 3 (see the comments below). Once we have this starting spatial partition, we iteratively produce thinner partitions of the space as long as the number of blocks is lower than the desired size of the initial spatial partition. At each iteration, the process is divided into three steps: in Step 2, we estimate the cutting probability for each block in the current spatial partition using Algorithm 4 (see the comments below). Then, in Step 3, we randomly sample (with replacement) blocks from the current partition according to these probabilities to construct the subset of blocks to be cut. Afterwards, each of the selected blocks is replaced by the two smaller blocks obtained by splitting it at the middle point of its longest side. Finally, the obtained spatial partition and the induced dataset partition are returned.

Input: Dataset $X$, desired partition size and sample size.
Output: A spatial partition $B$ of the desired size.
- Set $B$ to contain only the smallest bounding box of $X$.
while the size of $B$ is lower than the desired partition size do
       - Take a random sample of the dataset of the given sample size.
       - Obtain a subset of blocks by sampling, with replacement, blocks of $B$ according to a probability proportional to the product of the diagonal length of each block and the number of sampled points it contains.
       - Split the selected blocks and update $B$.
end while
return $B$.
Algorithm 3 Step 1 of Algorithm 2

Algorithm 3 generates the starting spatial partition of the dataset $X$. This procedure recursively obtains thinner partitions by splitting a subset of blocks selected by random sampling with replacement, according to a probability proportional to the product of the diagonal of each block, $l_b$, and its weight, i.e., the number of (sampled) points it contains. At this step, since we cannot yet estimate how likely it is for a given block to be well assigned with respect to a set of representatives, the goal is to control both the weight and the size of the generated spatial partition, i.e., to reduce the possible number of cluster misassignments, as this cutting procedure prioritizes those blocks that might be large and dense. Ultimately, as we reduce this factor, we improve the accuracy of our weighted approximation (see Theorem 2).

This process is repeated until a spatial partition with the desired number of blocks is obtained. Such a partition is later used to determine the sets of centroids with which we verify how likely it is for a certain block to be well assigned. A sketch of this diagonal-times-weight sampling step is given below.
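The sampling step of Algorithm 3 can be sketched as follows, assuming that for each block we already know its diagonal length and the number of sampled points it contains; the function is an illustrative reconstruction with names of our own.

import numpy as np

def sample_blocks_to_split(diags, counts, n_draws, rng=np.random.default_rng(0)):
    """Pick blocks to split with probability proportional to diagonal * weight.

    diags  : diagonal length of each block.
    counts : number of (sampled) points falling in each block, i.e., its weight.
    """
    scores = diags * counts
    if scores.sum() == 0.0:
        return np.array([], dtype=int)
    probs = scores / scores.sum()
    # Sampling with replacement; duplicates are dropped so each block is split once.
    drawn = rng.choice(len(diags), size=n_draws, replace=True, p=probs)
    return np.unique(drawn)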

Input: A spatial partition $B$ (of size larger than $K$), dataset $X$, number of clusters $K$, sample size and number of repetitions.
Output: Cutting probability of each block $b \in B$.
for each repetition do
       - Take a subsample of $X$ of the given sample size and construct its induced dataset partition.
       - Obtain a set of centroids by applying K-means++ over the representatives of this partition.
       - Compute the misassignment function $\varepsilon_b(\cdot)$ for all $b \in B$ (Eq. 3).
end for
- Compute the cutting probability of every $b \in B$ from the misassignment values of all repetitions (Eq. 5).
return the cutting probabilities.
Algorithm 4 Step 2 of Algorithm 2

In Algorithm 4, we show the pseudo-code used in Step 2 of Algorithm 2 for computing the cutting probability associated to each block $b \in B$. This probability depends on the misassignment function of each block with respect to multiple K-means++ based sets of centroids. To generate these sets of centroids, subsamples of a given size are extracted, with replacement, from the dataset $X$. In particular, the cutting probability $\pi(b)$ of a block $b$ is expressed as follows:

$\pi(b) = \sum_{r} \varepsilon_b(C_r) \Big/ \sum_{b' \in B} \sum_{r} \varepsilon_{b'}(C_r)$    (5)

where $C_r$ is the set of centroids obtained via K-means++ for the $r$-th subsample. As we commented before, the larger the misassignment function is, the more likely it is for the corresponding block to contain instances that belong to different clusters. It should be highlighted that a block with a cutting probability of zero is well assigned with respect to every sampled set of centroids $C_r$.

Even when cheaper seeding procedures, such as a Forgy type initialization, could be used, K-means++ avoids cluster oversampling, and so one would expect the corresponding boundaries not to divide subsets of points that are supposed to have the same cluster affiliation. Additionally, as previously commented, this initialization also tends to lead to competitive solutions. Later on, in Section 2.4.1, we comment on the selection of the different parameters used in the initialization (initial partition size, sample size and number of repetitions). A sketch of the cutting-probability computation is shown below.
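Combining the misassignment helper above with several K-means++ based sets of centroids, the cutting probabilities can be estimated as sketched below. Note that the normalization of Eq. 5 is our reconstruction, so the exact formula should be checked against the original source.

import numpy as np

def cutting_probabilities(reps, diags, centroid_sets):
    """Estimate Eq. (5): normalized misassignment accumulated over several
    K-means++ based sets of centroids (one per subsample repetition)."""
    scores = np.zeros(len(reps))
    for C in centroid_sets:
        scores += misassignment(reps, diags, C)   # helper defined in an earlier sketch
    total = scores.sum()
    if total == 0.0:
        # Every block looks well assigned for every sampled set of centroids.
        return np.zeros_like(scores)
    return scores / total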

2.3 Construction of the sequence of thinner partitions

In this section, we provide the pseudo-code of the BWM algorithm and give a detailed description of the construction of the sequence of thinner partitions, which is the basis of BWM. In general, once the initial partition is constructed via Algorithm 2, BWM progresses iteratively by alternating i) a run of the weighted Lloyd's algorithm over the current partition and ii) the creation of a thinner partition using the information provided by the weighted Lloyd's algorithm. The pseudo-code of the BWM algorithm can be seen in Algorithm 5.

Input: Dataset $X$, number of clusters $K$ and the initialization parameters.
Output: Set of centroids $C$.
Step 1: Initialize $B$ and $P$ via Algorithm 2 and obtain $C$ by applying a weighted K-means++ run over the set of representatives of $P$.
Step 2: $C$ = WeightedLloyd($P$, $C$, $K$).
while not Stopping Criterion do
       Step 3: Update the dataset partition $P$:
       - Compute $\varepsilon_b(C)$ for all $b \in B$.
       - Select a subset of blocks by sampling, with replacement, blocks according to probabilities proportional to $\varepsilon_b(C)$.
       - Cut each selected block and update $B$ and $P$.
       Step 4: $C$ = WeightedLloyd($P$, $C$, $K$).
end while
return $C$
Algorithm 5 BWM Algorithm

In Step 1, the initial spatial partition $B$ and the induced dataset partition $P$ are generated via Algorithm 2. Afterwards, the initial set of centroids is obtained through a weighted version of K-means++ over the set of representatives of $P$.

Given the current set of centroids $C$ and the partition of the dataset $P$, the set of centroids is updated in Step 2 and Step 4 by applying the weighted Lloyd's algorithm. It must be noted that the only difference between these two tasks is that Step 2 is initialized with a set of centroids obtained via a weighted K-means++ run, while Step 4 uses the set of centroids generated by the weighted Lloyd's algorithm over the previous dataset partition. In addition, in order to compute the misassignment function of every block in Step 3 (see Eq. 3), we store the following information provided by the last iteration of the weighted Lloyd's algorithm: for each block, the two centroids closest to its representative are saved (see Figure 1).

In Step 3, a spatial partition thinner than $B$ and its induced dataset partition are generated. For this purpose, the misassignment function of every block is computed and the boundary $F(B, C)$ is determined using the information stored at the last iteration of Step 2. Next, as the misassignment criterion in Theorem 1 is just a sufficient condition, instead of dividing all the blocks that do not satisfy it, we prioritize those blocks that are less likely to be well assigned: a set of blocks is selected by sampling with replacement from the boundary with a (cutting) probability proportional to $\varepsilon_b(C)$. Note that the number of selected blocks is at most the size of the boundary. Afterwards, in order to reduce as much as possible the length of the diagonal of the newly generated blocks and to control the size of the thinner partition, each selected block is divided at the middle point of its longest side: each block is split once into two equally shaped hyperrectangles, which replace it in $B$ to produce the new thinner spatial partition. Finally, given the new spatial partition $B$, its induced dataset partition $P$ is obtained.

It should be noted that the cutting criterion, Eq. 3, is more accurate, i.e., it detects more well assigned blocks, when we evaluate it over the smallest bounding box of each block of the spatial partition, since this minimizes the maximum distance (diagonal) between any two points in the block. Therefore, when updating the dataset partition in Step 3, we also recompute the diagonal of the smallest bounding box of each subset.

Step 2 and Step 3 are then repeated until a certain stopping criterion is satisfied (for details on different stopping criteria, see Section 2.4.2). A simplified sketch of one such refinement iteration is shown below.
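Putting the pieces together, the following simplified sketch performs one refinement iteration of BWM (Step 3 followed by Step 4), reusing the misassignment, split_block and weighted_lloyd helpers defined in the previous sketches. The block bookkeeping (a list of bounding boxes plus an integer block index per instance) is our own simplification and assumes that every incoming block contains at least one point.

import numpy as np

def bwm_iteration(X, block_of, blocks, C, rng=np.random.default_rng(0)):
    # block_of: numpy int array with the index of the block containing each instance.
    # blocks  : list of (lo, hi) bounding boxes, one per block.
    reps = np.stack([X[block_of == j].mean(0) for j in range(len(blocks))])
    diags = np.array([np.linalg.norm(np.asarray(hi) - np.asarray(lo)) for lo, hi in blocks])

    eps = misassignment(reps, diags, C)              # Eq. (3), earlier sketch
    boundary_idx = np.flatnonzero(eps > 0.0)
    if len(boundary_idx) == 0:
        return blocks, block_of, C                   # empty boundary: stop (Theorem 3)

    # Step 3: sample boundary blocks with probability proportional to eps and split them.
    probs = eps[boundary_idx] / eps[boundary_idx].sum()
    chosen = np.unique(rng.choice(boundary_idx, size=len(boundary_idx), p=probs))
    for j in chosen:
        (lo1, hi1), (lo2, hi2) = split_block(*blocks[j])   # earlier sketch
        blocks[j] = (lo1, hi1)
        blocks.append((lo2, hi2))
        in_j = np.flatnonzero(block_of == j)
        moved = in_j[(X[in_j] > hi1).any(axis=1)]    # points falling in the second half
        block_of[moved] = len(blocks) - 1

    # Drop any block left empty by the split and rebuild representatives and weights.
    keep = [j for j in range(len(blocks)) if np.any(block_of == j)]
    remap = {j: i for i, j in enumerate(keep)}
    blocks = [blocks[j] for j in keep]
    block_of = np.array([remap[j] for j in block_of])
    reps = np.stack([X[block_of == i].mean(0) for i in range(len(blocks))])
    weights = np.bincount(block_of, minlength=len(blocks)).astype(float)

    # Step 4: weighted Lloyd over the new, thinner partition (earlier sketch).
    C, _ = weighted_lloyd(reps, weights, C)
    return blocks, block_of, C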

2.3.1 Computational complexity of the BWM algorithm

In this section, we provide the computational complexity of each step of BWM, in the worst case.

The construction of the initial spatial partition, the corresponding induced dataset partition and the initial set of centroids of BWM (Step 1) has a computational cost given by the sum of the costs of Step 1, Step 2 and Step 5 of Algorithm 2, which are the most computationally demanding procedures of the initialization. Even though these costs are deduced from the worst possible scenario, which is overwhelmingly improbable, in Section 2.4.1 we comment on the selection of the initialization parameters in such a way that the cost of this step is not higher than that of the K-means algorithm, i.e., $O(nKd)$.

As mentioned at the beginning of Section 1.2.2, Step 2 of Algorithm 5 (the weighted Lloyd's algorithm) has a computational complexity proportional to the number of representatives, $|P|$. In addition, Step 3 executes $O(|P|)$ computations to verify the cutting criterion, since all the required distances are obtained from the previous weighted Lloyd iteration. Moreover, assigning each instance to its corresponding block and updating the bounding box of each subset of the partition is $O(nd)$. In summary, since $|P| \le n$, the BWM algorithm has an overall computational complexity of $O(nKd)$ in the worst case.

2.4 Additional Remarks

In this section, we discuss additional features of the BWM algorithm, such as the selection of its initialization parameters, and we also comment on different possible stopping criteria, with their corresponding computational costs and theoretical guarantees.

2.4.1 Parameter selection

The construction of the initial spatial partition and the corresponding induced dataset partition of BWM (see Algorithm 2 and Step 1 of Algorithm 5) depends on the initialization parameters (the size of the initial spatial partition, the sample size and the number of repetitions), while the core of BWM (Step 2 and Step 3) only depends on the dataset and the number of clusters. In this section, we propose how to select these parameters, keeping in mind the following objectives: i) to guarantee that BWM has a computational complexity equal to or lower than $O(nKd)$, which corresponds to the cost of Lloyd's algorithm, and ii) to obtain an initial spatial partition with a large number of well assigned blocks.

In order to ensure that the computational complexity of BWM's initialization is, even in the worst case, $O(nKd)$, we must choose the initialization parameters so that each of the dominant terms of the initialization cost is $O(nKd)$. On the other hand, as we want the initial partition to minimize the number of blocks that may not be well assigned, we must consider the following facts: i) the larger the diagonal of a certain block is, the more likely it is not to be well assigned, ii) as the number of clusters increases, any block has more chances of containing instances with different cluster affiliations, and iii) as the sample size increases, the cutting probabilities become better indicators for detecting those blocks that are not well assigned.

Taking these observations into consideration, we propose choices of the parameters for which the size of the initial partition increases with both the dimensionality of the problem and the number of clusters, and the sample size increases with the size of the dataset. Not only does such a choice satisfy the complexity constraints that we just mentioned (see Theorem 3 in Appendix 4), but also, since at each iteration we divide a block only on one of its sides, as we increase the dimensionality we need more cuts (more blocks) to achieve a sufficient reduction of its diagonal (observation i)). Analogously, the number of blocks and the size of the sampling increase with the number of clusters and the actual size of the dataset, respectively (observations ii) and iii)). The specific parameter values used in the experimental section, Section 3, follow this choice.

2.4.2 Stopping Criterion

As we commented in Section 1.3, one of the advantages of constructing spatial partitions with only well assigned blocks is that, under this setting, our algorithm converges to a local minimum of the K-means problem over the entire dataset and, therefore, there is no need to execute any further run of the BWM algorithm, as the set of centroids will remain the same for any thinner partition of the dataset:

Theorem 3.

If $C$ is a fixed point of the weighted K-means algorithm for a spatial partition $B$ for which all the blocks are well assigned, then $C$ is a fixed point of the K-means algorithm on the entire dataset $X$ (the proof of Theorem 3 is in Appendix 4).

To verify this criterion, we can make use of the concept of the boundary of a spatial partition (Definition 4). In particular, observe that if $F(B, C) = \emptyset$, then one can guarantee that all the blocks of $B$ are well assigned with respect to $C$ and $X$. To check this, we just need to scan the misassignment function value of each block, i.e., it is just $O(|B|)$. In addition to this criterion, we propose three other stopping criteria:


  • A practical computational criterion: We could set, in advance, the amount of computational resources that we are willing to use and stop when we exceed it. In particular, as the computation of distances is the most expensive step of the algorithm, we could set a maximum number of distance computations as a stopping criterion.

  • A Lloyd’s algorithm type criterion: As we mentioned in Section 1.2, the common practice is to run Lloyd’s algorithm until the reduction of the error after a certain iteration is small, see Eq. 2. As in our weighted approximation we do not have access to the error over the entire dataset, a similar approach is to stop the algorithm when the displacement of the set of centroids obtained in consecutive iterations is smaller than a fixed threshold. We can actually set this threshold in such a way that the stopping criterion of Lloyd’s algorithm is satisfied (see Theorem 4 in Appendix 4). However, this would imply additional computations at each iteration.

  • A criterion based on the accuracy of the weighted error: We could also consider the bound obtained in Theorem 2 and stop when it is lower than a predefined threshold. This lets us know how accurate our current weighted error is with respect to the error over the entire dataset. All the information needed for this bound is obtained from the weighted Lloyd iteration and from the blocks themselves, and its computation is just $O(|B|)$.

3 Experiments

In this section, we perform a set of experiments in order to analyze the relation between the number of distances computed and the quality of the approximation for the BWM algorithm proposed in Section 2. In particular, we compare the performance of BWM with respect to different methods known for the quality of their approximations: Lloyd's algorithm initialized via i) Forgy (FM), ii) K-means++ (M++) and iii) the Markov chain Monte Carlo sampling based approximation of K-means++ (MC2). From now on, we refer to these approaches as Lloyd's algorithm based methods. We also consider Mini-batch K-means with different batch sizes b (MB b), similar to the values used in the original paper [31]; this method is particularly known for its efficiency due to the small amount of resources needed to generate its approximation. Additionally, we also present the results associated to the K-means++ initialization on its own (M++init).

To have a better understanding of BWM, we analyze its performance on a wide variety of well known real datasets (see Table I), covering different scenarios of the clustering problem in terms of the size of the dataset, $n$, the dimension of the instances, $d$, and the number of clusters, $K$.

Table I: Information of the datasets: Corel Image Features (CIF), 3D Road Network (3RN), Gas Sensor (GS), SUSY and Web Users Yahoo! (WUY).

The considered datasets have different features, ranging from small datasets with large dimensions (CIF) to large datasets with small dimensions (WUY). For each dataset, we have considered different numbers of clusters, $K$. Given the random nature of the algorithms, each experiment has been repeated 40 times for each dataset and each value of $K$.

Figure 2: Distance computations vs relative error on the CIF dataset
Figure 3: Distance computations vs relative error on the 3RN dataset

In order to illustrate the competitiveness of our proposal, for each domain we have limited its maximum number of distance computations to the minimum number required by the set of selected benchmark algorithms over all 40 runs. Note that this distance bound is expected to be set by MB 100, since, among the methods that we consider, it is the one that uses the lowest number of representatives.

However, BWM can converge before reaching such a distance bound when the corresponding boundary is empty. In this case, we can guarantee that the obtained set of centroids is a fixed point of the weighted Lloyd’s algorithm for any thinner partition of the dataset and, therefore, it is also a fixed point of Lloyd’s algorithm on the entire dataset (see Theorem 3).

The K-means error function (Eq. 1) strongly depends on the different characteristics of the clustering problem: $n$, $d$, $K$ and the dataset itself. Thus, in order to compare the performance of the algorithms on different problems, we use the average of the relative error with respect to the best solution found at each repetition of the experiment:

$\hat{E}_A = \big(E_A - \min_{A' \in \mathcal{A}} E_{A'}\big) \big/ \min_{A' \in \mathcal{A}} E_{A'}$    (6)

where $\mathcal{A}$ is the set of algorithms being compared and $E_A$ stands for the K-means error obtained by method $A \in \mathcal{A}$. That is, $\hat{E}_A$ indicates how much worse, in relative terms, the approximation obtained by algorithm $A$ is than the best solution found by the set of algorithms considered.
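For completeness, the relative error of Eq. 6, as reconstructed here, can be computed as follows (the exact normalization used in the original may differ).

import numpy as np

def relative_errors(errors):
    """Relative error of each method w.r.t. the best solution found (Eq. (6))."""
    errors = np.asarray(errors, dtype=float)
    best = errors.min()
    return (errors - best) / best

# Example: three methods with K-means errors 105, 100 and 120.
print(relative_errors([105.0, 100.0, 120.0]))   # -> [0.05, 0.0, 0.2]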

In Fig. 2-6, we show the trade-off between the average number of distances computed and the average relative error for all the algorithms. Observe that a single symbol is used for each algorithm, except for BWM, for which we compute the trade-off at each iteration so as to observe the evolution of the quality of its approximation as the number of computed distances increases. Since the number of BWM iterations required to reach the stopping criteria may differ at each execution, we plot the average of the most significant ones, i.e., those that do not exceed the upper limit of the confidence interval of the total number of BWM iterations over the runs.

Figure 4: Distance computations vs relative error on the GS dataset
Figure 5: Distance computations vs relative error on the SUSY dataset
Figure 6: Distance computations vs relative error on the WUY dataset

In order to ease the visualization of the results, both axes of each figure are in logarithmic scale. Moreover, we delimit with a horizontal dashed black line the region where the relative error of BWM, with respect to the best solution found by the competition (Lloyd's algorithm based methods and MB), falls below a fixed threshold. On one hand, the vertical green dashed line indicates the number of distance computations required by BWM to achieve such an error, when that happens; otherwise it shows the number of distance computations at its last iteration. On the other hand, the blue and red vertical dashed lines indicate, among the Lloyd's algorithm based methods and the MB variants respectively, the algorithm that computed the smallest number of distances.

At first glance, we observe that, in 7 out of 15 different configurations of datasets and $K$ values, BWM obtains the best (average) solution among the considered methods. It must be highlighted that such a clustering is achieved while computing a much smaller number of distances: up to 2 and 4 orders of magnitude fewer distances than MB and the Lloyd's based methods, respectively. Moreover, BWM quite frequently (in 12 out of 15 cases) generated solutions that reached, at least, the reference relative error with respect to the best solution found among the competitors (black dashed line). In particular, and as expected, the best performance of BWM seems to occur on large datasets with small dimensions (WUY). On one hand, the decrease in the number of distances computed is mainly due to the reduction in the number of representatives that BWM uses in comparison to the actual size of the dataset. On the other hand, given a set of points, as the dimension decreases, the number of blocks required to obtain a completely well assigned partition tends to decrease (WUY and 3RN).

Regardless of this, even in the most unfavorable setting considered for BWM (small dataset size and large dimensions, e.g., CIF), for small $K$ values our proposal still managed to converge to competitive solutions at a fast rate. Note that, for small $K$ values, since the number of centroids is small, one may not need to reduce the diagonal of the blocks so abruptly to verify the well assignment criterion.

Next, we discuss in detail the results obtained for each of the considered datasets.

In the case of CIF, which is the smallest dataset and has a high dimensionality, BWM behaves similarly to MB. It obtains its best results for certain $K$ values, where it reaches a low relative error with respect to the best solution found among the competitors, while reducing the number of distance computations by over 2 orders of magnitude with respect to the Lloyd's based methods. In the remaining configurations, BWM improves the results of M++init using a much lower number of distance computations.

In the case of small datasets with low dimensionality (3RN), BWM performs much better than in the previous case: for some $K$ values, it actually generates the most competitive solutions. Moreover, in order to achieve a low relative error with respect to the best solution found by the benchmark algorithms, our algorithm computes between 1 and 2 orders of magnitude fewer distances than MB, and around 3 orders of magnitude fewer than the Lloyd's based methods.

If we consider the medium to large datasets with high dimensionality (GS and SUSY), in order to reach a low relative error, BWM needs up to several orders of magnitude fewer distance computations than MB and even more with respect to the Lloyd's based methods. Moreover, BWM obtains the best results in several of these configurations, while requiring orders of magnitude fewer computations than the Lloyd's based algorithms.

For the largest dataset with low dimensionality (WUY), BWM achieves its best performance: regardless of the number of clusters $K$, BWM generated the most competitive solutions. Furthermore, in order to reach a solution within a small relative error of the best one obtained by the Lloyd's algorithm based methods, BWM requires a number of distance computations several orders of magnitude lower than MB and from 4 to over 5 orders of magnitude lower than the Lloyd's based algorithms.

Finally, we would like to highlight that BWM, already at its first iterations, reaches a relative error much lower than M++init in all the configurations, while requiring a number of distance computations that is orders of magnitude lower. This fact strongly motivates the use of BWM as a competitive initialization strategy for Lloyd's algorithm.

4 Conclusions

In this work, we have presented an alternative to the K-means algorithm, oriented to massive data problems, called the Boundary Weighted K-means algorithm (BWM). This approach recursively applies a weighted version of the K-means algorithm over a sequence of spatial-based partitions of the dataset that ideally contain a large number of well assigned blocks, i.e., cells of the spatial partition that only contain instances with the same cluster affiliation. It can be shown that our weighted error approximates the K-means error function better as we increase the number of well assigned blocks, see Theorem 2. Ultimately, if all the blocks of a spatial partition are well assigned at the end of a BWM step, then the obtained clustering is actually a fixed point of the K-means algorithm, which is generated using only a small number of representatives in comparison to the actual size of the dataset (Theorem 3). Furthermore, if, for a certain step of BWM, this property can be verified at consecutive weighted Lloyd's iterations, then the error of our approximation also decreases monotonically (Theorem 2).

In order to achieve this, in Section 2.1 we designed a criterion to determine those blocks that may not be well assigned. One of the major advantages of the criterion is its low computational cost: it only uses information generated by the weighted K-means algorithm (the distances between the center of mass of each block and the set of centroids) and a feature of the corresponding spatial partition (the diagonal length of each block). This allows us to guarantee that, even in the worst possible case, BWM does not have a computational cost higher than that of the K-means algorithm. In particular, the criterion is presented in Theorem 1 and states that, if the diagonal of a certain block is smaller than half the difference of the two smallest distances between its center of mass and the set of centroids, then the block is well assigned.

In addition to all the theoretical guarantees that motivate and justify our algorithm (see Section 2 and Appendix 4), in practice we have also observed its competitiveness with respect to the state-of-the-art (Section 3). BWM has been compared to techniques known for the quality of their approximations (Lloyd's algorithm initialized with Forgy's approach, with K-means++ and via an approximation of the K-means++ probability function based on Markov chain Monte Carlo sampling). Besides, it has been compared to Mini-batch K-means, a method known for the small amount of computational resources that it needs to approximate the solution of the K-means problem.

The results, on different well known real datasets, show that BWM in several cases (7 out of 15 configurations) generated the most competitive solutions. Furthermore, in 12 out of 15 cases, BWM converged to solutions with a small relative error with respect to the best solution found by the state-of-the-art, while using a much smaller number of distance computations (from 2 to 6 orders of magnitude fewer).

As for the next steps, we plan to exploit different benefits of BWM. First of all, observe that the proposed algorithm is embarrassingly parallel up to the K-means++ seeding of the initial partition (which is run over a very small number of representatives when compared to the dataset size), hence we could implement this approach on a platform more appropriate for this kind of problem, such as Apache Spark. Moreover, we must point out that BWM is also compatible with the distance pruning techniques presented in [11, 13, 15]; therefore, we could also implement these techniques within the weighted Lloyd framework of BWM and reduce, even more, the number of distance computations.

Appendix

In the first result, we present a complementary property of the grid based RPM proposed in [8]: each iteration of the RPM can be proved to produce a coreset with an exponential decrease in the error with respect to the number of iterations. This result could actually be used to bound the BWM error, if we fix the minimum number of cuts that any block of a partition generated by BWM has received.

Theorem 1.

Given a set of points $X$ in $\mathbb{R}^d$, the $i$-th iteration of the grid based RPM produces a $(K, \varepsilon_i)$-coreset, where $\varepsilon_i$ decreases exponentially with $i$ and depends on the length of the diagonal of the smallest bounding box containing $X$.

Proof.

Firstly, we denote by $\bar{x}$ the representative of $x$ at the $i$-th grid based RPM iteration, i.e., if $x \in P_b$ then $\bar{x} = \bar{P}_b$, where $b$ is a block of the corresponding dataset partition. Observe that, at the $i$-th grid based RPM iteration, the length of the diagonal of each cell is the diagonal of the smallest bounding box divided by $2^i$. Since both $x$ and $\bar{x}$ must be located in the same cell, their distance is bounded by this diagonal. By the triangle inequality, the difference between the weighted error and the K-means error of any set of centroids can then be bounded by a quantity proportional to this diagonal length, which decreases exponentially with $i$. In other words, the $i$-th RPM iteration is a $(K, \varepsilon_i)$-coreset with $\varepsilon_i$ decreasing exponentially with respect to $i$. ∎

The following two results show some properties of the error function when all the blocks are well assigned.

Lemma 1.

If every block of the dataset partition $P$ is well assigned with respect to two sets of centroids $C$ and $C'$, then $E(C) - E(C') = E_P(C) - E_P(C')$.

Proof.

From Lemma 1 in [8], the function $E(C) - E_P(C)$ is constant with respect to $C$ whenever all the blocks are well assigned with respect to $C$. In particular, we can express the weighted error of a dataset partition $P$ as follows:

$E_P(C) = E(C) - \sum_{P_b \in P} \sum_{x \in P_b} \|x - \bar{P}_b\|^2$    (7)

In particular, applying Eq. 7 to both $C$ and $C'$ and subtracting, the constant term cancels out and we obtain $E(C) - E(C') = E_P(C) - E_P(C')$. ∎

In the previous result we observe that, if all the instances are correctly assigned in each block, then the difference between the weighted error and the error over the entire dataset is the same for both sets of centroids. In other words, if all the blocks of a given partition are correctly assigned, not only can we guarantee a monotone descent of the error function over the entire dataset for our approximation, a property that cannot be guaranteed for the typical coreset type approximations of K-means, but we also know exactly the reduction of this error after a weighted Lloyd iteration.

Theorem 2.

Given two sets of centroids $C$ and $C'$, where $C'$ is obtained after a weighted Lloyd's iteration (on a partition $P$) initialized at $C$, if every block of $P$ is well assigned with respect to both $C$ and $C'$, then $E(C') \le E(C)$.

Proof.

Using Lemma 1 over all the subsets of $P$, we know that $E(C) - E(C') = E_P(C) - E_P(C')$. Moreover, from the chain of inequalities in [8], we know that $E_P(C') \le E_P(C)$ for any weighted Lloyd iteration over a given partition $P$, thus $E(C') \le E(C)$. ∎

In Theorem 1, we prove the cutting criterion that we use in BWM. It consists of an inequality that, using only information referring to the partition of the dataset and to the weighted Lloyd's algorithm, guarantees that a block is well assigned.

Theorem 1 (restated in Section 2.1).

Proof.

Take any $x \in P_b$ and let $c_1$ and $c_2$ be the closest and second closest centroids in $C$ to the representative $\bar{P}_b$. Since the block $b$ is convex, $\bar{P}_b$ is contained in $b$, and so $\|x - \bar{P}_b\| \le l_b$ by the definition of the diagonal. From the triangle inequality, we know that $\|x - c_1\| \le \|\bar{P}_b - c_1\| + l_b$ and that $\|x - c\| \ge \|\bar{P}_b - c\| - l_b$ for any other centroid $c \in C$.

For this reason, if $\varepsilon_b(C) = 0$, then $2 \cdot l_b \le \|\bar{P}_b - c_2\| - \|\bar{P}_b - c_1\| \le \|\bar{P}_b - c\| - \|\bar{P}_b - c_1\|$ for all $c \in C \setminus \{c_1\}$ and, therefore, $\|x - c_1\| \le \|\bar{P}_b - c_1\| + l_b \le \|\bar{P}_b - c\| - l_b \le \|x - c\|$ for all $c \in C \setminus \{c_1\}$. In other words, every $x \in P_b$ is assigned to $c_1$, i.e., the block is well assigned. ∎

As can be seen in Section 2.2, there are different parameters that must be tuned. In the following result, we give a criterion to choose the initialization parameters of Algorithm 2 in such a way that its complexity, even in the worst case scenario, is still the same as that of Lloyd's algorithm.

Theorem 3.

If the initialization parameters are chosen as proposed in Section 2.4.1, then Algorithm 2 is $O(nKd)$.

Proof.

It is enough to verify the conditions presented before: with the proposed choice of parameters, each of the dominant terms in the cost of Algorithm 2 (Step 1, Step 2 and Step 5) is $O(nKd)$. ∎

Up to this point, most of the quality results assume that all the blocks are well assigned. However, in order to achieve this, many BWM iterations might be required. In the following result, we provide a bound on the weighted error with respect to the error over the entire dataset. This result shows that our weighted representation improves as more blocks of our partition satisfy the criterion of Theorem 1 and/or as the diagonals of the blocks become smaller.

Theorem 2 (restated in Section 2.1).

Proof.

Using Eq. 7 in Lemma 1, we know that the difference $E(C) - E_P(C)$ can be decomposed block by block. Observe that, for a certain instance $x \in P_b$, if the block $b$ satisfies $\varepsilon_b(C) = 0$, then $x$ is assigned to the same centroid as $\bar{P}_b$ by Theorem 1. On the other hand, if $\varepsilon_b(C) > 0$, we have the following inequalities: