implementation of recombinator-k-means
We present a heuristic algorithm, called recombinator-k-means, that can substantially improve the results of k-means optimization. Instead of using simple independent restarts and returning the best result, our scheme performs restarts in batches, using the results of a previous batch as a reservoir of candidates for the new initial starting values (seeds), exploiting the popular k-means++ seeding algorithm to piece them together into new promising initial configurations. Our scheme is general (it only affects the seeding part of the optimization, thus it could be applied even to k-medians or k-medoids, for example), it has no additional costs and it is trivially parallelizable across the restarts of each batch. In some circumstances, it can systematically find better configurations than the best one obtained after 10^4 restarts of a standard scheme. Our implementation is publicly available at https://github.com/carlobaldassi/RecombinatorKMeans.jl.READ FULL TEXT VIEW PDF
Selection of initial seeds greatly affects the quality of the clusters a...
This paper proposes the use of an optimization algorithm, namely PSO to
The initial centroid is a fairly challenging problem in the k-means meth...
Ocean dynamics constitute a source of incertitude in determining the oce...
In this paper we present an algorithm for finding a minimum dominator
While recent NAS algorithms are thousands of times faster than the pione...
DeepPrior is a simple approach based on Deep Learning that predicts the ...
implementation of recombinator-k-means
Clustering is a central problem of data science. As a general problem, it is however essentially ill-defined, and thus hundreds of different criteria have been proposed. Of these,
-means is probably the most well-known and widely used(wu2008top, )
, mainly due to its conceptual simplicity and generality; it is typically one of the first algorithms applied by data analysts when confronted with a new dataset, either on the raw data or after any amount of manipulations (e.g. normalization, or dimensionality reduction by principal component analysis or spectral analysis, etc.)(berkhin2006survey, ). It is defined as follows: given a metric space endowed with a distance function , and given points in that space , and given a positive integer , we wish to find points
that minimize the loss function:
The points are called “centroids” and the goal is to partition the set into clusters by associating each point to its nearest centroid, such that the overall squared-distance of the points from their associated centroids is minimized.
In most cases the space is and the distance is just the euclidean distance, so that the loss can be written as:
This function is, in general, highly non-convex, and no closed-form solution for the global minimum is available. In fact, the optimization problem is NP-hard (aloise2009np, ). The standard algorithm used for optimization, known as Lloyd’s algorithm (lloyd1982least, ), uses an alternating iterative strategy: given some initial guesses for the centroids, the algorithm alternates a step in which points are partitioned into clusters by assigning each point to their nearest centroid, and a step in which new centroids are computed by taking the baricenter of each cluster. This algorithm has the nice property that each iteration is guaranteed to monotonically decrease the loss. Thus, since there is a finite number of distinct clusterings, this algorithm always reaches a local optimum of the loss in a finite number of steps (the algorithm stops when it reaches a fixed point).
However, no guarantee is offered about reaching the global optimum of the loss function. Indeed, in many circumstances of practical interest the loss has a very large number of local optima, and Lloyd’s algorithm gets easily trapped in one of them unless it is initialized close to the global optimum. One strategy for dealing with this issue is then to initialize the centroids at random (usually, by random sampling from the input set ) and perform many restarts of Lloyd’s algorithm. In this setting, one simply returns the best result obtained across the restarts. This procedure is sometimes called repeated--means or multi-start--means. The procedure used for the choice of the initial values of the centroids is commonly called “seeding”.
While the most obvious and least expensive seeding choice for the centroids is that of sampling them uniformly at random from (known as the Forgy method), other more computationally expensive choices may often produce drastically better results on average, thereby requiring fewer restarts for the optimization and an overall improvement in performance.
Arguably, the most popular seeding method of this sort is currently the so-called -means++ method, originally proposed in ref. (arthur2007k, ), due to its simplicity (both conceptually and in terms of implementation), versatility, availability, relatively low computational cost , and generally good performances, especially if restarts are a viable option (celebi2013comparative, ). It can be regarded as a randomized variant of another popular (but more expensive) technique called “furthest point heuristic”, or “maxmin” (gonzalez1985clustering, ). The general underlying heuristic idea is to choose centroids such that the initial configuration resembles in some sense what we expect the optimum to look like, and therefore such that the initial loss is already rather good.111In fact, the authors prove that their initialization is -competitive with the optimal clustering, and Lloyd’s algorithm can only improve it from there. The basic scheme to achieve this goal is to sample the centroids from
progressively: the first one is sampled uniformly at random, while each new one is sampled with a probability distribution that favors points that are far from the centroids already chosen. More precisely, the basic-means++ seeding procedure is:
In the present work, however, we refer to a slightly refined version of this procedure, which we learned about while inspecting the code of the scikit-learn Python library (scikit-learn, ; scikit-learn-site, ).222This algorithm was added to the scikit-learn codebase in 2011 by Jan Schlueter, in this commit: https://github.com/scikit-learn/scikit-learn/commit/9c3bbb01d54dc631c7b9bfb77bd70d0a7ca7750b. A comment in the code affirms that this is a port from the implementation of the original authors of -means++, and that it was used for their tests. An accompanying URL address points to the personal page of one of the original authors, but it is now inactive. We have actually found no trace of this algorithm in any of the versions of ref. (arthur2007k, ) available online; moreover, the results reported in the paper on the same datasets appear to be worse than those that we found with the improved version (cf. our results below). In the literature only the basic variant is generally reported; the only mention of this algorithm that we could find in the literature was as “greedy” -means++, in ref. (celebi2013comparative, ), where however the reference is again to the original -means++ publication. The refinement consists in adding an extra sampling step: whenever a new centroid needs to be selected (except for the first one), a number of candidates are sampled, and the “best” of those is kept. In this context, the best candidate is the one that minimizes the loss function computed with the current number of centroids. The details are provided in the following algorithm:333The pseudocode description provided here is quite inefficient for the sake of clarity, but in practice the computations can be carried out rather more efficiently in by employing appropriate data structures, at a cost of additional memory. See the actual code at https://github.com/carlobaldassi/RecombinatorKMeans.jl for the details.
The new parameter determines the amount of extra sampling. Clearly, for we recover the basic -means++ algorithm. In all our tests, we have used the default value used by the scikit-learn library, i.e. , which seems to provide a good trade-off between the improvement in the initial configuration and the extra computational time required.
In the following, we will refer to this improved -means++ seeding followed by the Lloyd’s algorithm as simple-kmeans:
As mentioned above, in moderately complex situations this algorithm produces initial configurations which may lead to much better results of the Lloyd’s algorithm on average. However, it still offers no guarantees not to get trapped in local optima, and it is standard practice when possible to still keep the best final result across a few restarts of the procedure. An example case is shown in figure 1. Here we analyzed the A3 synthetic dataset from ref. (ClusteringDatasets, ), which is a moderately large two-dimensional dataset (, ) consisting of fairly well-distinct homogeneous clusters, and therefore for which optimizing the -means loss should produce a near-perfect clustering (see also fig. 2). For this dataset, the ground truth (the centroids used for generating the data) is known, and the globally optimal configuration of -means is indeed extremely close to it. As shown in the figure, across restarts of each algorithm the uniform-seeding version never reached the level of the ground truth loss, whereas (improved) -means++ seeding led to finding a loss comparable to that of the ground truth in of the cases, and on average to reach much better configurations. With this observed percentage of “success”, one needs about runs to reach a confidence level of of finding a configuration comparable to the ground truth, and runs to reach a confidence of .
In what follows we always use simple-kmeans as a reference, and we refer to the algorithm that returns the best out of repetitions of that as repeated-kmeans:
The contribution of this paper is to build on this scheme to further improve the chances of finding a good configuration. The starting observation is depicted in the four left panels of figure 2. Looking at different runs of simple-kmeans on a variety of test cases (especially artificial ones in 2 dimensions where there is a clear solution that is nonetheless hard to find except after several reruns of the algorithm), one can observe that, in most runs, most centroids end up being near their optimal position, while only a minority are stuck in a local minimum. If we then superimpose the final centroids obtained from different runs, we see how they form dense clusters around the optimal centroids, as shown in the right panel of figure 2.
Given this observation, the main idea of our method is rather simple, and it basically consists in using those distributions of centroids from repeated runs as a reservoir of potential starting points in a new batch of reruns of the algorithm. Clearly, if this reservoir is already well clustered as in the example of fig. 2, running the -means++ seeding will immediately “recombine” the previous results and produce an essentially optimal configuration with high probability. In more detail, this is a high-level description of our method: run the -means++ seeding followed by Lloyd’s algorithm a few times, say , as described above. However, instead of just keeping the best result, collect all the final centroids of each run in a new list, call it . Next, run the -means++ seeding algorithm (with a few adjustments, detailed below) on this set . In this way, the initial proposal for the centroids is already considering as candidates those points which were chosen as optimal in a previous run. Repeating this step a few times (e.g. times again) can lead to a new set , and the process can then be iterated as many times as needed. Heuristically, a simple and reasonable (although rather strict) stopping criterion is to proceed until all the losses of a batch are essentially the same (up to some tolerance), which normally implies that the configurations are all equivalent (i.e. they have collapsed). Cf. fig. 3. We call this scheme “recombinator--means”, and provide a detailed description in the following.
First, we need to modify the seeding algorithm to accept a reservoir argument , i.e. a list of points from which to sample the seeds, distinct from the data points . Note however that we still use the original data in the loss function to determine the best candidate among the
samples at each step. We also introduce an extra argument, discussed below: a vector of weightsof the same size as , that can be used as a “prior” over the candidates in the reservoir, in order to favor some elements of over others. The revised reservoir-kmeans++ algorithm is as follows (we have highlighted the differences with respect to improved-kmeans++ in blue):
Note that the new initialization method based on the reservoir sets is no more computationally expensive than standard -means++; in fact, using as the reservoir and uniform weights we recover the previous algorithm. The additional memory required is and thus it is also normally negligible compared to the data, which is at least for dense data. It is also worth noting that the restarts at each iteration can be trivially parallelized, just as one would parallelize the restarts of a standard scheme; thus, if is larger or equal than the number of available CPU cores this scheme is also no more time-consuming than the standard one. It is thus reasonable to compare the performances of the two methods (repeated--means with restarts vs. recombinator--means with the same overall amount of restarts).
The final algorithm is as follows:
The use of an approximate comparison in the stopping criterion is intended to account for small (arguably irrelevant) differences. In our tests, we used a relative tolerance of .
The function weights determines whether to favor some of the centroids in over others, based on their associated losses. The simplest choice is to provide uniform weights. In our tests, this already generally produces a drastic improvement in the results in some synthetic datasets such as A3. However, we have found that favoring the centroids belonging to configurations of smaller loss can help in general, and can be crucial in particularly difficult circumstances. It also plays a central role in ensuring that the batches eventually collapse onto some “consensus” configuration. We have used the following heuristic. Given an array of losses , we compute the best and the mean , and use the following formula to determine the weight of the centroids in a sample:
with some parameter . This formula gives the largest weight to configurations close to the best one, using the difference between the mean and the best as a scale. The parameter
determines the amount of skew in the weights, withcorresponding to the flat (unweighted) case, and leading to only choosing the candidates from the best configurations. We have found empirically that in most circumstances seems to be a suitable choice.
The literature on seeding strategies for -means is vast, and our recombinator--means scheme shares some similarities with previously proposed techniques. To the best of our knowledge, and according to the comprehensive reviews of prior works on this subject that can be found in refs. (celebi2013comparative, ; franti2019much, ), the two most similar approaches are found in ref. (bradley1998refining, ) and ref. (bahmani2012scalable, ).
In ref. (bradley1998refining, ), the authors propose a seeding method that they call refine, which can be briefly described as follows: 1) split the dataset in sub-datasets, which are then clustered independently producing groups of centroids , also collected together in a new dataset ; 2) cluster by -means times, each time using as the starting configuration and obtaining a result ; 3) use the with the minimum loss with respect to as the starting point for Lloyd’s algorithm on the complete dataset. Indeed, this shares some features of our proposed algorithm, but there are important differences in the following respects: 1) We don’t split the dataset to produce the initial configurations (although systematic tests would be needed to determine the effect of this choice); 2) We use the improved seeding mechanism of -means++ (which wasn’t available when ref. (bradley1998refining, ) was written), leading to much better initial performances than uniform seeding in many cases; 3) In each iteration, in order to obtain the new seeds, we don’t cluster the centroids of the previous iterations, but rather use them as candidates for -means++ seeding; 4) We weigh the centroids according to their quality when recombining them; 5) We don’t produce only one new seed, but of them. The third point is significant since our reservoir-kmeans++ algorithm aims at stochastically recombining the previous solutions, whereas the refine procedure clusters them (which is more similar to a smoothing, as the authors note, rather than a recombination), and uses each previous result as-is (making the mixing more indirect). Also, reservoir-kmeans++ does not only take into account the structure of , but also of the original data via the evaluation of candidates. The weighting of the candidates also appears, from our tests, to be crucial for boosting performance in the most difficult cases, and to induce convergence in the iterated procedure. Finally, the fact that we produce new seeds rather than just one means that our scheme lends itself naturally to be iterated, with significant results in some cases (cf. fig. 3).
In ref. (bahmani2012scalable, ) the authors propose an alternative to -means++ seeding, called -means, whose goal is to permit to parallelize the choice of the centroids as opposed to proceeding purely sequentially. The similarity to our approach relies in the fact that their method adopts the strategy of oversampling potential centroids at first, and finally produces a seed by clustering the result. However, the oversampling scheme is entirely different from our procedure, reflecting the different focus for their strategy. Indeed, their results indicate that -means matches the performance of (basic) -means++ in single runs (each run being parallelizable), while recombinator--means aims at beating the performance of (improved) -means++ in repeated runs (possibly parallelizing over the repetitions).
We performed a series of tests comparing recombinator-kmeans with repeated-kmeans. In order to make the comparison fair, we used the following procedure: for each dataset tested, and each value of , we ran simple-kmeans times and collected the losses in . Then we ran recombinator-kmeans with different combinations of parameters and , times for each combination. For each run, we collected the final loss, storing it in , and measured the total number of restarts (i.e. times the number of batch iterations). We then sampled (with replacement) results from the values and took the minimum, thereby ending up with new losses that we stored in . These quantities allow to directly compare the recombinator-kmeans scheme with repeated-kmeans, given that the two strategies should have a comparable computational cost.444Actually, starting from the second batch iteration of recombinator-kmeans, Lloyd’s algorithm takes a considerably shorter time than starting from the improved-kmeans++ seeds since it’s generally closer to a minimum, and thus this analysis is favoring simple-kmeans. However, since we performed our tests on different machines, in parallel and under varying load conditions, we did not record running times. Our main metric is the mean value obtained in these runs, but we also report the minimum, median and maximum values found. We also report the minimum overall value found in : as we shall show, there are situations in which even repetitions of simple-kmeans are insufficient to match even the worst results of recombinator-kmeans . Finally, we report the median of , simply to provide a scale for the losses found in individual runs.
All the tests were performed with the same code for both algorithms, and the implementation is publicly available at https://github.com/carlobaldassi/RecombinatorKMeans.jl. The implementation is essentially identical to that described in the previous sections, except for an additional “failsafe” stopping criterion added to avoid exceedingly long (potentially infinite) loops which might occur if is too small and the batches of recombinator-kmeans fail to collapse onto a single solution. In the settings of our reported tests was large enough that this failsafe mechanism, which simply detects cases when both the minimum and the average loss of a batch are not decreasing with respect to the previous ones, was triggered rarely, and we don’t consider that it significantly affected the results.
We present the results of the detailed analysis mentioned above on three datasets, a synthetic one and two real-world ones:555While developing the algorithm, we cursorily tested a number of additional datasets, in particular the A-sets, S-sets and Birch-sets datasets of ref. (ClusteringDatasets, ), with results qualitatively very similar to those reported here. the A3 dataset mentioned above (, ); the first part of the Cloud dataset (,
) from the UC-Irvine Machine Learning Laboratory(UCI-MachineLearningRepository, ), which has also been used as a benchmark in (arthur2007k, ); the Spambase dataset (, ) also available from the same repository, which has also been used as a benchmark in (arthur2007k, ; bahmani2012scalable, ). Following the previous cited works, we performed no manipulations or normalizations on the real-world datasets. The A3 dataset was uniformly scaled to make it fit in a square.
For the A3 dataset, the ground truth (the centroids used to generate the data) is available, the true value of is and its corresponding loss is . The global minimum appears to be . As shown in fig. 1, simple-kmeans has a probability of about of finding a solution of this approximate value; above this there is a clear gap up to values of about . Under these circumstances, it makes sense to consider any configuration below this gap as a “success”666This definition ignores the fact that local optima very close to the global optimum exist, since for practical purposes those would not matter in this kind of problem., and thus to measure the success rate of an algorithm with this criterion.
All our tests on this dataset were performed with the true . We tested recombinator-kmeans with different values of and . In table 1 we report the results. Overall, recombinator-kmeans outperforms repeated-kmeans in all cases, even at very small , in terms of the means and of the success rates. Also notice how the medians are very close to the minima. With the success rate is essentially 100%, achieved always with , while (as mentioned above) around repetitions or more would be required for repeated-kmeans to operate at that accuracy level. The medians show that, on top of this, the global minimum was found more than of the times (in fact, of the times in the best setting , to be compared with of the times or less for and of the times for ). For this dataset, appears to be generally slightly better than .
For the Spambase dataset we tested three values of (the same ones used in ref. (bahmani2012scalable, )) and used in all cases. The results are summarized in table 2. We used two values of , obtaining similar results but with a slight advantage for . Under these circumstances, repeated-kmeans is very slightly better on average than recombinator-kmeans at , slightly worse at and noticeably worse at . All of the reported loss values are scaled down by .
The case appears to be relatively easy, and the improved-kmeans++ seeding scheme can already do a good job given a few hundred of restarts;777For reference, note however that the median values of single runs reported in ref. (bahmani2012scalable, ) were around , , for basic -means++ and -means. This shows that the oversampling of improved-kmeans++ produces significant gains. in fact, the distributions are so tight that it is likely that corresponds to the global minimum of the problem. Although it would probably not be of much consequence in a practical setting, it is still interesting to note that recombinator-kmeans found this precise value in more than of the cases, while repeated-kmeans only achieved it with a rate. This suggests that recombinator-kmeans still has an advantage for finding the global minimum (as was the case for A3, but more dramatically so here).
The case shows a small advantage of recombinator-kmeans; it also clearly shows a case in which simple-kmeans was surely unable to find the global minimum even in restarts, since recombinator-kmeans managed to do better ( vs ).
The case shows a clear advantage of recombinator-kmeans. In the case, even the worse run is comparable to the best out of runs of simple-kmeans.
For the Cloud dataset we tested four values of (the same ones used in ref. (arthur2007k, ) plus ) and used in all cases. The results are summarized in table 3. We used two values of , obtaining similar results but overall a little better ones for . Under these circumstances, repeated-kmeans is very slightly worse than recombinator-kmeans at , and noticeably worse at . All of the reported loss values are scaled down by .
Similarly to the Spambase case, the smallest case appears to not be too difficult, and both algorithms, given a few hundreds of repetitions , get very close to what is likely the global optimum (and both actually achieve it at more than an rate).
Staring from , however, recombinator-kmeans has a clear advantage; at even the worse values found at are better than the best out of runs of simple-kmeans. At , the mean value of recombinator-kmeans is smaller than that of repeated-kmeans and smaller than the minimum of .
In figure 3, we show the progress of recombinator-kmeans from one batch iteration to the next, for the , case. For this graph, we have simply lumped together all the losses found by the samples after the first iteration ( data points in total), then all the second iterations, etc. Comparing the first batch (whose values are distributed like for simple-kmeans) with the following ones, it’s clear that the recombination process is achieving results which would be unattainable by simple restarts, and also that even a single additional batch gives a large improvement, such that even just performing two batches of, say, each would provide a significant advantage. Furthermore, it’s apparent that each successive batch produces diminishing returns, and therefore that a less strict stopping criterion for the algorithm than the simple one that we used could improve the trade-off between the quality of the optimization and the computational time. Our observations indicate that this process by which both the mean and the width of the distributions monotonically decrease as the iterations progress is rather general to most circumstances, with a large enough (indeed, in the limit of it would be trivially true).
We have presented a scheme that can improve substantially the results
of (repeated) -means optimization by using an iterative seeding
procedure based on the results of previous runs. The results indicate
that this scheme is particularly advantageous at larger values of
. The scheme depends crucially on the property of the -means++
seeding algorithm to produce an essentially near-optimal starting
configuration when the reservoir of candidates is well-clustered,
due to the fact that whenever an element in a cluster is chosen the
probability of choosing another member of the same cluster drops to
near zero and any new choice will likely belong to some other cluster.
The improvements of improved-kmeans++ over basic-kmeans++
however are helpful but possibly not essential. Moreover, approximated
versions of the -means++ procedure that can produce considerable
speed-ups in the seeding step, such as that of ref. (bachem2016fast, ),
can also be applied as drop-in replacements.888In the particular case of ref. (bachem2016fast, ) , the speed-up
is based on using a Monte Carlo Markov Chain in order to perform the
sampling, therefore neither the additional sampling of the
, the speed-up is based on using a Monte Carlo Markov Chain in order to perform the sampling, therefore neither the additional sampling of theimproved-kmeans++ code nor the weighting introduced by reservoir-means++ should pose any difficulty. Also, any optimization or modification that can be applied to standard -means solvers (e.g. to deal with sparse data, or to weigh the data points non uniformly, or to work with minibatches, etc.), should be straightforwardly transferable to this scheme, since at its core recombinator-kmeans only affects the seeding mechanism. Finally, just like -means++, this scheme can in principle be used as-is with any distance function (e.g. to perform -medians optimization) or even when only the distance matrix between data points is available and no overall metric is defined (e.g. to perform -medoids optimization).
The scheme that we have presented is only the simplest and most straightforward that exploits the starting observation of section III and gives good results across the board. It is in fact relatively easy to imagine further refinements, such as: having a variable number of repetitions per iteration (possibly determined dynamically); using a more refined stopping criterion; keeping a reservoir that includes the best configurations found so far instead of the last ones; performing the same scheme in a hierarchical fashion (i.e. pooling together solutions of reservoir-kmeans and applying the same machinery, possibly more than once); etc. Such explorations are left for future work.
More generally, this scheme demonstrates the capability to exploit the information of previous runs, piecing together different locally-optimal configurations, when performing an optimization in a highly non-convex landscape, reaching values that were basically unattainable by simple restarts (as in the cases of the Spambase and Cloud datasets at ). We relied on the ability of -means++ to guess a good starting point to achieve this, and that in turn exploits an a priori knowledge on the kind of solutions that are likely to be good for the -means problem. It would hten be of great interest, if possible, to be able to generalize the scheme and export it to other non-convex optimization problems.