Broadly defined, clustering is the problem of organizing a collection of elements into coherent groups in such a way that similar elements are in the same cluster and different elements are in different clusters. Of the models and formulations for this problem, the Euclidean minimum sum-of-squares clustering (MSSC) is prominent in the literature. MSSC can be formulated as an optimization problem in which the objective is to minimize the sum-of-squares of the Euclidean distances of the samples to their cluster means. This problem has been extensively studied over the last 50 years, as highlighted by various surveys and books (see, e.g., Han et al., 2011; Hruschka et al., 2009; Jain, 2010).
The NP-hardness of MSSC (Aloise et al., 2009) and the size of practical datasets explain why most MSSC algorithms are heuristics, designed to produce an approximate solution in a reasonable computational time. K-means (Hartigan and Wong, 1979) (also called Lloyd’s algorithm (Lloyd, 1982)) and K-means++ (Arthur and Vassilvitskii, 2007) are two popular local search algorithms for MSSC that differ in the construction of their initial solutions. Their simplicity and low computational complexity explain their extensive use in practice. However, these methods have two significant disadvantages: (i) their solutions can be distant from the global optimum, especially in the presence of a large number of clusters and dimensions, and (ii) their performance is sensitive to the initial conditions of the search.
To circumvent these issues, a variety of heuristics and metaheuristics have been proposed with the aim of better escaping from shallow local minima (i.e., poor solutions in terms of the MSSC objective). Nearly all the classical metaheuristic frameworks have been applied, including simulated annealing, tabu search, variable neighborhood search, iterated local search, evolutionary algorithms(Al-Sultan, 1995; Hansen and Mladenović, 2001; Ismkhan, 2018; Maulik and Bandyopadhyay, 2000; Sarkar et al., 1997; Selim and Alsultan, 1991), as well as more recent incremental methods and convex optimization techniques (An et al., 2014; Bagirov et al., 2016, 2011; Karmitsa et al., 2017)
. However, these sophisticated methods have not been predominantly used in machine learning applications. This may be explained by three main factors: 1) data size and computational time restrictions, 2) the limited availability of implementations, or 3) the belief that a near-optimal solution of the MSSC model has little impact on clustering validity.
To help remove these barriers, we introduce a simple and efficient hybrid genetic search for the MSSC called HG-means, and conduct extensive computational analyses to measure the correlation between solution quality (in terms of the MSSC objective) and clustering performance (based on external measures). Our method combines the improvement capabilities of the K-means algorithm with a problem-tailored crossover, an adaptive mutation scheme, and population-diversity management strategies. The overall method can be seen as a multi-start K-means algorithm, in which the initial center positions are sampled by the genetic operators based on the search history. HG-means’ crossover uses a minimum-cost matching algorithm as a subroutine, with the aim of inheriting genetic material from both parents without excessive perturbation and creating child
solutions that can be improved in a limited number of iterations. The adaptive mutation operator has been designed to help cover distant samples without being excessively attracted by outliers. Finally, the population is managed so as to prohibitclones and favor the discovery of diverse solutions, a feature that helps to avoid premature convergence toward low-quality local minima.
As demonstrated by our experiments on a variety of datasets, HG-means produces MSSC solutions of significantly higher quality than those provided by previous algorithms. Its computational time is also lower than that of recent state-of-the-art optimization approaches, and it grows linearly with the number of samples and dimension. Moreover, when considering the reconstruction of a mixture of Gaussians, we observe that the standard repeated K-means and K-means++ approaches remain trapped in shallow local minima which can be very far from the ground truth, whereas HG-means
consistently attains better local optima and finds more accurate clusters. The performance gains are especially pronounced on datasets with a larger number of clusters and a feature space of higher dimension, in which more independent information is available, but also in which pairwise distances are known to become more uniform and less meaningful. Therefore, some key challenges associated to high-dimensional data clustering may be overcome by improving the optimization algorithms, before even considering a change of clustering model or paradigm.
2 Problem Statement
In a clustering problem, we are given a set of samples, where each sample is represented as a point in with coordinates , and we seek to partition into disjoint clusters so as to minimize a criterion . There is no universal objective suitable for all applications, but should generally promote homogeneity (similar samples should be in the same cluster) and separation (different samples should be in different clusters). MSSC corresponds to a specific choice of objective function, in which one aims to form the clusters and find a center position for each cluster, in such a way that the sum of the squared Euclidean distances of each point to the center of its associated cluster is minimized. This problem has been the focus of extensive research: there are many applications (Jain, 2010), and it is the natural problem for which K-means finds a local minimum.
). For each sample and cluster, the binary variabletakes the value if sample is assigned to cluster , and otherwise. The variables represent the positions of the centers.
In the objective, represents the Euclidean norm. Equation (2) forces each sample to be associated with a cluster, and Equations (3)–(4) define the domains of the variables. Note that in this model, and in the remainder of this paper, we consider a fixed number of clusters . Indeed, from the MSSC objective viewpoint, it is always beneficial to use the maximum number of available clusters. For some applications such as color quantization and data compression (Scheunders, 1997), the number of clusters is known in advance (desired number of colors or compression factor). Analytical techniques have been developed to find a suitable number of clusters (Sugar and James, 2003) when this information is not available. Finally, it is common to solve MSSC for a range of values of and select the most relevant result a-posteriori.
Regarding computational complexity, MSSC can be solved in time when using dynamic programming. For general and , MSSC is NP-hard (Aloise et al., 2009, 2012). Optimal MSSC solutions are known to satisfy at least two necessary conditions:
In any optimal MSSC solution, for each , the position of the center coincides with the centroid of the points belonging to :
In any optimal MSSC solution, for each , the sample is associated with the closest cluster such that:
These two properties are fundamental to understand the behavior of various MSSC algorithms. The K-means algorithm, in particular, iteratively modifies an incumbent solution to satisfy first Property 1 and then Property 2, until both are satisfied simultaneously. Various studies have proposed more efficient data structures and speed-up techniques for this method. For example, Hamerly (2010) provides an efficient K-means algorithm that has a complexity of per iteration. This algorithm is faster in practice than its theoretical worst case, since it avoids many of the innermost loops of K-means.
Other improvements of K-means have focused on improving the choice of initial centers (Steinley, 2006). K-means++ is one such method. This algorithm places the first center
in the location of a random sample selected with uniform probability. Then, each subsequent centeris randomly placed in the location of a sample , with a probability proportional to the distance of to its closest center in . Finally, K-means is executed from this starting point. With this rule, the expected solution quality is within a factor of the global optimum.
Numerous other solution techniques have been proposed for MSSC. These algorithms can generally be classified according to whether they are exact or heuristic, deterministic or probabilistic, and hierarchical or partitional. Some process complete solutions whereas others construct solutions during the search, and some maintain a single candidate solution whereas others work with population of solutions(Jain, 2010). The range of methods includes construction methods and local searches, metaheuristics, and mathematical programming techniques. Since this work focuses on the development of a hybrid genetic algorithm with population management, the remainder of this review focuses on other evolutionary methods for this problem, adaptations of K-means, as well as algorithms that currently report the best known solutions for the MSSC objective since these methods will be used for comparison purposes in Section 4.
Hruschka et al. (2009)
provide a comprehensive review and analysis of evolutionary algorithms for MSSC, comparing different solution encoding, crossover, and mutation strategies. As is the case for other combinatorial optimization problems, many genetic algorithms do not rely on random mutation and crossover only, but also integrate a local search to stimulate the discovery of high-quality solutions. Such algorithms are usually calledhybrid genetic or memetic algorithms (Blum et al., 2011). The algorithms of Fränti et al. (1997); Kivijärvi et al. (2003); Krishna and Murty (1999); Lu et al. (2004); Merz and Zell (2002) are representative of this type of strategy and exploit K-means for solution improvement. In particular, Fränti et al. (1997) and Kivijärvi et al. (2003) propose a hybrid genetic algorithm based on a portfolio of six crossover operators. One of these, which inspired the crossover of the present work, pairs the centroids of two solutions via a greedy nearest-neighbor algorithm and randomly selects one center from each pair. The mutation operator relocates a random centroid in the location of a random sample, with a small probability. Although this method has some common mechanisms with HG-means, it also misses other key components: an exact matching-based crossover, population-management mechanisms favoring the removal of clones, and an adaptive parameter to control the attractiveness of outliers in the mutation. The variation (crossover and mutation) operators of Krishna and Murty (1999); Lu et al. (2004); Merz and Zell (2002) are also different from those of HG-means. In particular, Krishna and Murty (1999); Lu et al. (2004) do not rely on crossover but exploit random mutation to reassign data points to clusters. Finally, Merz and Zell (2002) considers an application of clustering for gene expression data, using K-means as a local search along with a crossover operator that relies on distance proximity to exchange centers between solutions.
Besides evolutionary algorithms and metaheuristics, substantial research has been conducted on incremental variants of the K-means algorithm (Bagirov, 2008; Bagirov et al., 2016; Karmitsa et al., 2017; Likas et al., 2003; Ordin and Bagirov, 2015), leading to the current state-of-the-art results for large-scale datasets. Incremental clustering algorithms construct a solution of MSSC iteratively, adding one center at a time. The global K-means algorithm (Likas et al., 2003) is such a method. Starting from a solution with centers, the complete algorithm performs runs of K-means, one from each initial solution containing the existing centers plus sample . The best solution with centers is stored, and the process is repeated until a desired number of clusters is attained. Faster versions of this algorithm can be designed, by greedily selecting a smaller subset of solutions for improvement at each step. For example, the modified global K-means (MGKM) of (Bagirov, 2008) solves an auxiliary clustering problem to select one good initial solution at each step instead of considering all possibilities. This algorithm was improved in Ordin and Bagirov (2015) into a multi-start modified global K-means (MS-MGKM) algorithm, which generates several candidates at each step. Experiments on 16 real-world datasets show that MS-MGKM produces more accurate solutions than MGKM and the global K-means algorithm. These methods were also extended in Bagirov et al. (2016) and Karmitsa et al. (2017), by solving an optimization problem over a difference of convex (DC) functions in order to choose candidate initial solutions. Finally, Karmitsa et al. (2018) introduced an incremental nonsmooth optimization algorithm based on a limited memory bundle method, which produces solutions in a short time. To this date, the MS-MGKM, DCClust and DCD-Bundle algorithms represent the current state-of-the-art in terms of solution quality.
Despite this extensive research, producing high-quality MSSC solutions in a consistent manner remains challenging for large datasets. Our algorithm, presented in the next section, helps to fill this gap.
3 Proposed Methodology
HG-means is a hybrid metaheuristic that combines the exploration capabilities of a genetic algorithm and the improvement capabilities of a local search, along with general population-management strategies to preserve the diversity of the genetic information. Similarly to Kivijärvi et al. (2003); Krishna and Murty (1999) and some subsequent studies, the K-means algorithm is used as a local search. Moreover, the proposed method differs from previous work in its main variation operators: it relies on an exact bipartite matching crossover, uses a sophisticated adaptive mechanism in the mutation operator, as well as population-diversity management techniques.
The general scheme is given in Algorithm 1. Each individual in the population is represented as a triplet containing a membership chromosome and a coordinate chromosome to represent the solution and a mutation parameter to help balance the influence of outliers. The algorithm first generates a randomized initial population (Section 3.1) and then iteratively applies variation operators (selection, recombination, mutation) and local search (K-means) to evolve this population. At each iteration, two parents and are selected and crossed (Section 3.2), yielding the coordinates and mutation parameter of an offspring . A mutation operator is then applied to (Section 3.3), leading to an individual that is improved using the K-means algorithm (Section 3.4) and then included in the population.
Finally, each time the population exceeds a prescribed size , a survivor selection phase is triggered (Section 3.5), to retain only a diverse subset of good individuals. The algorithm terminates after consecutive iterations (generation of new individuals) without improvement of the best solution or a total of iterations have been performed. The remainder of this section describes each component of the method, in more detail.
3.1 Solution Representation and Initial Population
Each individual contains two chromosomes encoding the solution: a membership chromosome with integers, specifying for each sample the index of the cluster with which it is associated; and a coordinate chromosome with
real vectors in, representing the coordinates of the center of each cluster. The individual is completed with a mutation parameter, . Figure 1 illustrates this solution representation for a simple two-dimensional example with three centers.
Observe that either of the two chromosomes is sufficient to characterize an MSSC solution. If only the membership chromosome is known, then Property 1 states that the center of each cluster should be located at the centroid of the points associated with it, and a trivial calculation gives the coordinates of each centroid in . If only the coordinate chromosome is known, then Property 2 states that each sample should be associated with its closest center, and a simple linear search in , by calculating the distances of all the centers from each sample, gives the membership chromosome. Finally, note that the iterative decoding of one chromosome into the other, until convergence, is equivalent to the K-means algorithm.
3.2 Selection and Crossover
The generation of each new individual begins with the selection of two parents and . The parent selection is done by binary tournament. A binary tournament selects two random solutions in the population with uniform probability and retains the one with the best fitness. The fitness considered in HG-means is simply the value of the objective function (MSSC cost).
Then, the coordinate chromosomes and of the two parents serve as input to the matching crossover (MX), which generates the coordinate chromosome of an offspring in two steps:
Step 1) The MX solves a bipartite matching problem to pair-up the centers of the two parents. Let be a complete bipartite graph, where the vertex set represents the centers of parent , and the vertex set represents the centers of parent . Each edge , for and represents a possible association of center from parent with center from parent . Its cost is calculated as the Euclidean distance between the two centers. A minimum-cost bipartite matching problem is solved in the graph using an efficient implementation of the Hungarian algorithm (Kuhn, 1955), returning pairs of centers in time.
Step 2) For each pair obtained at the previous step, the MX randomly selects one of the two centers with equal probability, leading to a new coordinate chromosome with centers and inherited characteristics from both parents.
Finally, the mutation parameter of the offspring is obtained as a simple average of the parent values: .
The MX is illustrated in Figure 2 on the same example as before. This method can be viewed as an extension of the third crossover of Fränti et al. (1997), using an exact matching algorithm instead of a greedy heuristic. The MX has several important properties. First, each center in belongs to at least one parent, therefore promoting the transmission of good building blocks (Holland, 1975). Second, although any MSSC solution admits symmetrical representations, obtained by reindexing its clusters or reordering its centers, the coordinate chromosome generated by the MX will contain the same centers, regardless of this order. In combination with population management mechanisms (Section 3.5), this helps to avoid the propagation of similar solutions in the population and prevents premature convergence due to a loss of diversity.
The MX is deterministic and strictly inherits existing solution features from both parents. In contrast, our mutation operator aims to introduce new randomized solution characteristics. It receives as input the coordinate chromosome of the offspring and its mutation parameter . It has two steps:
Step 1) It “mutates” the mutation parameter as follows:
where is a random number selected with uniform probability in . Such a mechanism is typical in evolution strategies and allows an efficient parameter adaptation during the search.
Step 2) It uses the newly generated to mutate the coordinate chromosome by the biased relocation of a center:
Select a random center with uniform probability for removal.
Re-assign each sample to the closest remaining center (but do not modify the positions of the centers). Let be the distance of each sample from its closest center.
Select a random sample and add a new center in its position, leading to the new coordinate chromosome . The selection of
is done by roulette wheel, based on the following mixture of two probability distributions:
The value of in Equation (8) drives the behavior of the mutation operator. In the case where , all the samples have an equal chance of selection, independently of their distance from the closest center. When , the selection probability is proportional to the distance, similarly to the initialization phase of K-means++ (Arthur and Vassilvitskii, 2007)
. Considering sample-center distances increases the odds of selecting a good position for the new center. However, this could also drive the center positions toward outliers and reduce the solution diversity. For these reasons,is adapted instead of remaining fixed.
The mutation operator requires time and is illustrated in Figure 3. The top-right center is selected for removal, and a new center is reinserted on the bottom-right of the figure. These new center coordinates constitute a starting point for the K-means local search discussed in the following section. The first step of K-means will be to reassign each sample to its closest center (Figure 3(c)), which is equivalent to recovering the membership chromosome . After a few iterations, K-means converges to a new local minimum (Figure 3(d)), giving a solution that is added to the population.
3.4 Local Search
Each coordinate chromosome generated through selection, crossover, and mutation serves as a starting point for a local search based on K-means. This algorithm iteratively 1) reassigns each sample to its closest center and 2) moves each center position to the centroid of the samples to which it is assigned. These two steps are iterated until convergence to a local optimum. For this purpose, we use the fast K-means of (Hamerly, 2010). This algorithm has a worst-case complexity of per loop, and it exploits lower bounds on the distances to eliminate many distance evaluations while retaining the same results as the classical K-means. Moreover, in the exceptional case in which some clusters are left empty after the application of K-means (i.e., one center has no allocated sample), the algorithm selects new center locations using Equation (8) and restarts the K-means algorithm to ensure that all the solutions have exactly clusters.
3.5 Population Management
One critical challenge for population-based algorithms is to avoid the premature convergence of the population. Indeed, the elitist selection of parents favors good individuals and tends to reduce the population diversity. This effect is exacerbated by the application of K-means as a local search, since it concentrates the population on a smaller subset of local minima. Once diversity is lost, the chances of generating improving solutions via crossover and mutation are greatly reduced. To overcome this issue, HG-means relies on diversity management operators that find a good balance between elitism and diversification and that allow progress toward unexplored regions of the search space without excluding promising individuals. Similar techniques have been shown to be essential to progress toward high-quality solutions for difficult combinatorial optimization problems (Sörensen and Sevaux, 2006; Vidal et al., 2012).
Population management. The initial population of HG-means is generated by runs of the standard K-means algorithm, in which the initial center locations are randomly selected from the set of samples. Moreover, the mutation parameter of each individual is randomly initialized in
with an uniform distribution. Subsequently, each new individual is directly included in the population and thus hassome chance to be selected as a parent by the binary tournament operator, regardless of its quality. Whenever the population reaches a maximum size , a survivor selection is triggered to retain a subset of individuals.
Survivors selection. This mechanism selects individuals in the population for removal.
To promote population diversity, the removal first focuses on clone individuals. Two individuals and are clones if they have the same center positions.
When a pair of clones is detected, one of the two individuals is randomly eliminated.
When no more clones remain, the survivors selection proceeds with the elimination of the worst individuals, until a population of size is obtained.
Complexity analysis. HG-means has an overall worst-case complexity of , where , , and represent the time spent in the crossover, mutation, and K-means procedures. The mutation and crossover methods are much faster than the K-means local search in practice, so HG-means’ CPU time is proportional to the product of with the time of K-means. Under strict CPU time limits, and could be set to small constants to obtain fast results (see Section 4.4). Moreover, since HG-means maintains a population of complete solutions with clusters, it has good “anytime behavior” since it can be interrupted whenever necessary to return the current best solution.
4 Experimental Analysis
We conducted extensive computational experiments to evaluate the performance of HG-means. After a description of the datasets and a preliminary parameter calibration (Sections 4.1–4.2), our first analysis focuses on solution quality from the perspective of the MSSC optimization problem (Section 4.3). We compare the solution quality obtained by HG-means with that of the current state-of-the-art algorithms, in terms of objective function value and computational time, and we study the sensitivity of the method to changes in the parameters. Our second analysis concentrates on the scalability of the method, studying how the computational effort grows with the dimensionality of the data and the number of clusters (Section 4.4). Finally, our third analysis evaluates the correlation between solution quality from an optimization viewpoint and clustering performance via external cluster validity measures. It compares the performance of HG-means, K-means, and K-means++ on a fundamental task, that of recovering the parameters of a non separable mixture of Gaussians (Section 4.5). Yet, we consider feature spaces of medium to high dimensionality (20 to 500) in the presence of a medium to large number of Gaussians (50 to 1,000). We show that the improved solution quality of HG-means directly translates into better clustering performance and more accurate information retrieval.
The algorithms of this paper were implemented in C++ and compiled with G++ 4.8.4. The source code is available at https://github.com/danielgribel/hg-means. The experiments were conducted on a single thread of an Intel Xeon X5675 3.07 GHz processor with 8 GB of RAM.
We evaluated classical and recent state-of-the-art algorithms (Bagirov, 2008; Bagirov et al., 2016; Karmitsa et al., 2017, 2018; Kivijärvi et al., 2003; Ordin and Bagirov, 2015), in terms of solution quality for the MSSC objective, on a subset of datasets from the UCI machine learning repository (http://archive.ics.uci.edu/ml/). We collected all the recent datasets used in these studies for a thorough experimental comparison. The resulting 29 datasets, listed in Table 1, arise from a large variety of applications (e.g. cognitive psychology, genomics, and particle physics) and contain numeric features without missing values.
|Bavaria Postal 1||89||3||267|
|Bavaria Postal 2||89||4||356|
|Fisher’s Iris Plant||150||4||600|
|Pima Indians Diabetes||768||8||6k|
|EEG Eye State||14,980||14||210k|
|KEGG Metabolic Relation||53,413||20||1M|
|3D Road Network||434,874||3||1M|
|Online News Popularity||39,644||58||2M|
|Sensorless Drive Diagnosis||58,509||48||3M|
Their dimensions vary significantly, from 59 to 434,874 samples, and from 2 to 5,000 features. Each dataset has been considered with different values of (number of clusters), leading to a variety of MSSC test set-ups. These datasets are grouped into four classes. Classes A1 and A2 have small datasets with 59 to 768 samples, while Class B has medium datasets with 1,060 to 20,000 samples. These three classes were considered in Ordin and Bagirov (2015). Class C has larger datasets collected in Karmitsa et al. (2018), with 13,910 to 434,874 samples and sometimes a large number of features (e.g., Isolet and Gisette).
4.2 Parameter Calibration
HG-means is driven by four parameters: two controlling the population size ( and ) and two controlling the algorithm termination (maximum number of consecutive iterations without improvement , and overall maximum number of iterations ). Changing the termination criteria leads to different nondominated trade-offs between solution quality and CPU time. We therefore set these parameters to consume a smaller CPU time than most existing algorithms on large datasets, while allowing enough iterations to profit from the enhanced exploration abilities of HG-means, leading to . Subsequently, we calibrated the population size to establish a good balance between exploitation (number of generations before termination) and exploration (number of individuals). We compared the algorithm versions with and on medium-sized datasets of class A2 and B with . The setting had the best performance and forms our baseline configuration. The impact of deviations from this parameter setting will be studied in the next section.
4.3 Performance on the MSSC Optimization Problem
GKM (Likas et al., 2003) – global K-means;
SAGA (Kivijärvi et al., 2003) – self-adaptive genetic algorithm;
MGKM (Bagirov, 2008) – modified global K-means;
MS-MGKM (Ordin and Bagirov, 2015) – multi-start modified global K-means;
DCClust and MS-DCA (Bagirov et al., 2016) – clustering based on a difference of convex functions;
DCD-Bundle (Karmitsa et al., 2017) – diagonal bundle method;
LMBM-Clust (Karmitsa et al., 2018) – nonsmooth optimization based on a limited memory bundle method.
We also report the results of the classical K-means and K-means++ algorithms (efficient implementation of Hamerly (2010)) for both a single run and the best solution of 5,000 repeated runs with different initial solutions.
The datasets and numbers of clusters indicated in Table 1 lead to 235 test set-ups. For ease of presentation, each line of Tables 3 and 3 is associated with one dataset and displays averaged results over all values of . The detailed results of HG-means are available at https://w1.cirrelt.ca/~vidalt/en/research-data.html. For each dataset and value of , the solution quality is measured as the percentage gap from the best-known solution (BKS) value reported in all previous articles (from multiple methods, runs, and parameter settings).
|Single Run||5000 Runs||Single Run||5000 Runs|
|CPU||Xe 3.07 GHz||Xe 3.07 GHz||Core2 2.5 GHz||Xe 3.07 GHz||Core2 2.5 GHz||Core2 2.5 GHz||Xe 3.07 GHz|
|Passmark||1403 (1.00)||1403 (1.00)||976 (0.70)||1403 (1.00)||976 (0.70)||976 (0.70)||1403 (1.00)|
|Single Run||5000 Runs||Single Run||5000 Runs|
|CPU||Xe 3.07 GHz||Xe 3.07 GHz||I5 2.9 GHz||Xe 3.07 GHz||I7 4.0 GHz||I7 4.0 GHz||I7 4.0 GHz||I5 2.9 GHz||I5 1.6/2.7 GHz||Xe 3.07 GHz|
|Passmark||1403 (1.00)||1403 (1.00)||1859 (1.32)||1403 (1.00)||2352 (1.68)||2352 (1.68)||2352 (1.68)||1859 (1.32)||1432 (1.02)||1403 (1.00)|
|* Considering the subset of 8 instances which is common to all methods|
This gap is expressed as Gap() , where represents the solution value of the method considered, and is the BKS value. A negative gap means that the solutions for this dataset are better than the best solutions found previously. Finally, the last two lines indicate the CPU model used in each study, along with the time-scaling factor (based on the Passmark benchmark) representing the ratio between its speed and that of our processor. All time values in this article have been multiplied by these scaling factors to account for processor differences.
HG-means produces solutions of remarkable quality, with an average gap of and on the small and large datasets, respectively. This means that its solutions are better, on average, than the best solutions ever found. For all datasets, HG-means achieved the best gap value. The statistical significance of these improvements is confirmed by pairwise Wilcoxon tests between the results of HG-means and those of other methods (with p-values ). Over all 235 test set-ups (dataset number of cluster combinations), HG-means found 113 solutions better than the BKS, 116 solutions of equal quality, and only five solutions of lower quality. We observe that the improvements are also correlated with the size of the datasets. For the smallest ones, all methods perform relatively well. However, for more complex applications involving a larger number of samples, a feature space of higher dimension, and more clusters, striking differences in solution quality can be observed between the state-of-the-art methods.
These experiments also confirm the known fact that a single run of K-means or K-means++ does not usually find a good local minimum of the MSSC problem, as shown by gap values that can become arbitrarily high. For the Eye and Miniboone datasets, in particular, a misplaced center can have large consequences in terms of objective value. The probability of falling into such a case is high in a single run, but it can be reduced by performing repeated runs and retaining the best solution. Nevertheless, even 5,000 independent runs of K-means or K-means++ are insufficient to achieve solutions of a quality comparable to that of HG-means.
In terms of computational effort, HG-means is generally faster than SAGA, MS-MGKM, DCClust, MS-DCA, and DCD-Bundle (the current best methods in terms of solution quality), but slower than LMBM-Clust, since this method is specifically designed and calibrated to return quick solutions. It is also faster than a repeated K-means or K-means++ algorithm with 5000 restarts, i.e., a number of restarts equal to the maximum number of iterations of the algorithm. This can be partly explained by the fact that the solutions generated by the exact matching crossover require less time to converge via K-means than initial sample points selected according to some probability distributions. Moreover, a careful exploitation of the search history, represented by the genetic material of high-quality parent solutions, makes the method more efficient and accurate.
Finally, we measured the sensitivity of HG-means to changes in its parameters: defining the population-size limits, and defining the termination criterion. In Table 4, we fix the termination criterion to and consider a range of population-size parameters, reporting the average gap and median time over all datasets for each configuration. The choice of and appears to have only a limited impact on solution quality and CPU time: regardless of the parameter setting, HG-means returns better average solutions than all individual best known solutions collected from the literature. Some differences can still be noted between configurations: as highlighted by pairwise Wilcoxon tests, every configuration underlined in the table performs better than every non-underlined one (with p-values 0.018). Letting the population rise to the double of the minimum population size () before survivors selection is generally a good policy. Moreover, we observe that smaller populations trigger a faster convergence but at the risk of reducing diversity, whereas excessively large populations (i.e., ) unnecessarily spread the search effort, with an adverse impact on solution quality.
|Average Gap (%)||Median Time (s)|
|Average Gap (%)||Median Time (s)|
In Table 5, we retain two of the best population-size configurations and vary the termination criterion. Naturally, the quality of the solutions improves with longer runs, but even a short termination criterion such as already gives good solutions, with an average gap of . Finally, reducing the population size to for short runs allows us to better exploit a limited number of iterations, whereas the baseline setting performs slightly better for longer runs.
The solution quality and computational efficiency of most clustering algorithms is known to deteriorate as the number of clusters grows, since this leads to more complex combinatorial problems with numerous local minima. To evaluate how HG-means behaves in these circumstances, we conduct additional experiments focused on the large datasets of class C. Table 6 reports the solution quality and CPU time of HG-means for each dataset as a function of . Moreover, to explore the case where the CPU time is more restricted, Table 7 reports the same information for a fast HG-means configuration where and .
|Gap (%)||Time (s)|
|Gap (%)||Time (s)|
As observed in Table 6, HG-means retrieves or improves the BKS for all datasets and values of . Significant improvements are more frequently observed for larger values of . A likely explanation is that the global minimum has already been found for most datasets with a limited number of clusters, whereas previous methods did not succeed in finding the global optimum for larger values of . In terms of computational effort, there is a visible correlation between the number of clusters and the CPU time. Power law regressions of the form indicate that the computational effort of HG-means grows as for Eye State, for Miniboone, and for in all other cases. Similarly, for , fitting the CPU time of the method as a power law of the form indicates that the measured CPU time of HG-means grows as , i.e., linearly with the number of samples and the dimension of the feature space.
We observe a significant reduction in CPU time when comparing the results of the fast HG-means in Table 7 with those of the standard version in Table 6. Considering the speed ratio between methods for each dataset, the fast configuration is in average seven times faster than the standard HG-means and over 10 times faster than SAGA, MS-MGKM, DCClust, DCD-Bundle and MS-DCA. Figure 4 also displays the CPU time of HG-means, its fast configuration, and the other algorithms listed in Section 4.3 as a function of . Surprisingly, the solution quality did not deteriorate much by reducing the termination criterion for these large datasets: with a percentage gap of , the solutions found by the fast HG-means are close to those of the standard version (gap of ) and still much better than all solutions found in previous studies. Therefore, researchers interested in using HG-means can easily adapt the termination criterion of the method, so as to obtain significant gains of solution quality within their own computational budget.
4.5 Solution Quality and Clustering Performance
The previous section has established that HG-means finds better MSSC local minima than other state-of-the-art algorithms and exemplified the known fact that K-means or K-means++ solutions can be arbitrarily far from the global minimum. In most situations, using the most accurate method for a problem should be the best option. However, HG-means is slower than a few runs of K-means or K-means++
. To determine whether it is worth investing this additional effort, we must determine whether a better solution quality for the MSSC problem effectively translates into better clustering performance. There have been similar investigations in other machine learning subfields, e.g., to choose the amount of effort dedicated to training deep neural networks (see, e.g.(Hoffer et al., 2017; Keskar et al., 2017)).
To explore this, we conduct an experiment in which we compare the ability of HG-means, K-means and K-means++
to classify 50,000 samples issued from a mixture of spherical Gaussian distributions:with . For each , and are uniformly selected in and , respectively. This is a fundamental setting, without any hidden structure, in which we expect the MSSC model and the associated K-means variants to be a good choice since these methods promote spherical and balanced clusters. To increase the challenge, we consider a medium to large number of Gaussians, with , in feature spaces of medium to high dimensions . For each combination of and , we repeat the generation process until we obtain a mixture that is not -separated and in which at least 99% of the pairs of Gaussians are -separated (Dasgupta, 1999). Such a mixture corresponds to Gaussians that significantly overlap. These datasets can be accessed at https://w1.cirrelt.ca/~vidalt/en/research-data.html.
Tables 9 and 9 compare the results of HG-means with those of K-means and K-means++ over a single run or 500 repeated runs, in terms of MSSC solution quality (as represented by the percentage gap) and cluster validity in relation to the ground truth. We use three external measures of cluster validity: the adjusted Rand index (CRand – Hubert and Arabie (1985)), the normalized mutual information (NMI – Kvalseth (1987)), and the centroid index (CI – Fränti et al. (2014)). CRand and NMI take continuous values in and , respectively. They converge toward as the clustering solution becomes closer to the ground truth. CI takes integer values and measures the number of fundamental cluster differences between solutions, with a value of indicating that the solution has the same cluster-level structure as the ground truth.
As in the previous experiments, a single run of K-means or K-means++ visibly leads to shallow local minima that can be improved with multiple runs from different starting points. However, even 500 repetitions of these algorithms are insufficient to reach the solution quality of HG-means, despite the similar CPU time. K-means performs better than K-means++ for these datasets, most likely because it is more robust to outliers when selecting initial center locations. A pairwise Wilcoxon test highlights significant differences between HG-means and all other methods (p-values ). With a Pearson coefficient , the dimension of the feature space is correlated to the inaccuracy (percentage gap) of the repeated K-means and K-means++ algorithms, which appear to be more easily trapped in low-quality local minima for feature spaces of larger dimension.
Comparing the external clustering performance of these methods (via CRand, NMI, and CI) leads to additional insights. For all three metrics, pairwise Wilcoxon tests highlight significant performance differences between HG-means and repeated K-means variants (with p-values ). We also observe a correlation between the solution quality (percentage gap) and the three external measures. Although the differences in the MSSC objective function values appear small at first glance (e.g., average gap of 4.87% for repeated K-means), these inaccuracies have a large effect on cluster validity, especially for datasets with feature spaces of higher dimension. When , HG-means is able to exploit the increased amount of available independent information to find a close approximation to the ground truth (average CRand of 0.99, NMI of 1.00, and CI of 1.75) whereas repeated K-means and K-means++
reach shallow local optima and obtain inaccurate results (average CRand and NMI below 0.65 and 0.94, respectively, and average CI above 23.75). Classical distance metrics are known to become more uniform as the feature-space dimension grows, and the number of local minima of the MSSC quickly increases, so feature-reduction or subspace-clustering techniques are often recommended for high-dimensional datasets. In these experiments, however, the inaccuracy of repeatedK-means (or K-means++) appears to be a direct consequence of its inability to find good-quality local minima, rather than a consequence of the MSSC model itself, since near-optimal solutions of the MSSC translate into accurate results.
Overall, we conclude that even for simple Gaussian-based distribution mixtures, finding good local minima of the MSSC problem is essential for an accurate information retrieval. This is a major difference with studies on, for example, deep neural networks, where it is conjectured that most local minima have similar objective values, and where more intensive training (e.g., stochastic gradient descent with large batches) have adverse effects on generalizationLecun et al. (2015). For clustering problems, it is essential to keep progressing toward faster and more accurate MSSC solvers, and to recognize when to use these high-performance methods for large and high-dimensional datasets.
|BKS||Gap (%)||Time (s)|
|m||d||Value||1 Run||500 Runs||1 Run||500 Runs||1 Run||500 Runs||1 Run||500 Runs|
|m||d||1 Run||500 Runs||1 Run||500 Runs||1 Run||500 Runs||1 Run||500 Runs||1 Run||500 Runs||1 Run||500 Runs|
5 Conclusions and Perspectives
In this article, we have studied the MSSC problem, a classical clustering model of which the popular K-means algorithm constitutes a local minimizer. We have proposed a hybrid genetic algorithm, HG-means, that combines the improvement capabilities of K-means as a local search with the diversification capabilities of problem-tailored genetic operators. The algorithm uses an exact minimum-cost matching crossover operator and an adaptive mutation procedure to generate strategic initial center positions for K-means and to promote a thorough exploration of the search space. Moreover, it uses population diversity management strategies to prevent premature convergence to shallow local minima.
We conducted extensive computational experiments to evaluate the performance of the method in terms of MSSC solution quality, computational effort and scalability, and external cluster validity. Our results indicate that HG-means attains better local minima than all recent state-of-the-art algorithms. Large solution improvements are usually observed for large datasets with a medium-to-high number of clusters , since these characteristics lead to MSSC problems that have not been effectively solved by previous approaches. The CPU time of HG-means is directly proportional to that of the K-means local-improvement procedure and to the number of iterations allowed (the termination criterion). It appears to grow linearly with the number of samples and feature-space dimensions, and the termination criterion can be adjusted to achieve solutions in a shorter time without a large impact on solution accuracy.
Through additional tests conducted on datasets generated via Gaussian mixtures, we observed a strong correlation between MSSC solution quality and cluster validity measures. A repeated K-means algorithm, for example, obtains solution inaccuracies (percentage gap to the best known local minima) that are small at first glance but highly detrimental for the outcome of the clustering task. In particular, a gap as small as 5% in the objective space can make the difference between accurate clustering and failure. This effect was observed in all Gaussian datasets studied, and it became more prominent in feature spaces of higher dimension. In those situations, the inability of K-means or K-means++ to provide satisfactory results seems to be tied to its inability to find good-quality local minima of the MSSC model, rather than to an inadequacy of the model itself.
Overall, beyond the immediate gains in terms of clustering performance, research into efficient optimization algorithms for MSSC remains linked to important methodological stakes. Indeed, a number of studies aim to find adequate models (e.g., MSSC) for different tasks and datasets. With that goal in mind, it is essential to differentiate the limitations of the model themselves (inadequacy for a given task or data type), and those of algorithms used to solve such models (shallow local optima). While an analysis using external measures (e.g., CRand, NMI or CI) allows a general evaluation of error (due to both sources of inaccuracy), only a precise investigation of a method’s performance in the objective space can help evaluating the magnitude of each imprecision, and only accurate or exact optimization methods can give meaningful conclusions regarding model suitability. In future research, we plan to continue progressing in this direction, generalizing the proposed solution method to other clustering models, possibly considering the use of kernel transformations, integrating semi-supervised information in the form of must-link or cannot-link constraints, and pursuing the development of high-performance optimization algorithms for other classes of applications.
The authors thank the four anonymous referees for their detailed comments, which significantly contributed to improving this paper. This research is partially supported by CAPES, CNPq [grant number 308498/2015-1] and FAPERJ [grant number E-26/203.310/2016] in Brazil. This support is gratefully acknowledged.
- Al-Sultan (1995) Al-Sultan, K. 1995. A tabu search approach to the clustering problem. Pattern Recognition 28(9) 1443–1451.
- Aloise et al. (2009) Aloise, D., A. Deshpande, P. Hansen, P. Popat. 2009. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning 75(2) 245–248.
- Aloise et al. (2012) Aloise, D., P. Hansen, L. Liberti. 2012. An improved column generation algorithm for minimum sum-of-squares clustering. Mathematical Programming 131(1) 195–220.
- An et al. (2014) An, L.T.H., L.H. Minh, P.D. Tao. 2014. New and efficient DCA based algorithms for minimum sum-of-squares clustering. Pattern Recognition 47(1) 388–401.
- Arthur and Vassilvitskii (2007) Arthur, D., S. Vassilvitskii. 2007. K-Means++: The advantages of careful seeding. SODA’07. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, New Orleans, Louisiana, USA, 1027–1035.
- Bagirov (2008) Bagirov, A.M. 2008. Modified global k-means algorithm for minimum sum-of-squares clustering problems. Pattern Recognition 41(10) 3192–3199.
- Bagirov et al. (2016) Bagirov, A.M., S. Taheri, J. Ugon. 2016. Nonsmooth DC programming approach to the minimum sum-of-squares clustering problems. Pattern Recognition 53 12–24.
- Bagirov et al. (2011) Bagirov, A.M., J. Ugon, D. Webb. 2011. Fast modified global k-means algorithm for incremental cluster construction. Pattern Recognition 44(4) 866–876.
- Blum et al. (2011) Blum, C., J. Puchinger, G. Raidl, A. Roli. 2011. Hybrid metaheuristics in combinatorial optimization: A survey. Applied Soft Computing 11(6) 4135–4151.
- Dasgupta (1999) Dasgupta, S. 1999. Learning mixtures of Gaussians. 40th Annual Symposium on Foundations of Computer Science 1 634–644.
- Fränti et al. (1997) Fränti, P., J. Kivijärvi, T. Kaukoranta, O. Nevalainen. 1997. Genetic algorithms for large-scale clustering problems. The Computer Journal 40(9) 547–554.
- Fränti et al. (2014) Fränti, P., M. Rezaei, Q. Zhao. 2014. Centroid index: Cluster level similarity measure. Pattern Recognition 47(9) 3034–3045.
- Hamerly (2010) Hamerly, G. 2010. Making k-means even faster. SDM’10, SIAM International Conference on Data Mining. 130–140.
- Han et al. (2011) Han, J., M. Kamber, J. Pei. 2011. Data mining: concepts and techniques. 3rd ed. Morgan Kaufmann.
- Hansen and Mladenović (2001) Hansen, P., N. Mladenović. 2001. J-Means: A new local search heuristic for minimum sum of squares clustering. Pattern Recognition 34(2) 405–413.
- Hartigan and Wong (1979) Hartigan, J.A., M.A. Wong. 1979. Algorithm AS 136: A k-means clustering algorithm. Applied Statistics 28(1) 100–108.
- Hoffer et al. (2017) Hoffer, E., I. Hubara, D. Soudry. 2017. Train longer, generalize better: Closing the generalization gap in large batch training of neural networks. Advances in Neural Information Processing Systems. 1729–1739.
- Holland (1975) Holland, J.H. 1975. Adaptation in Natural and Artificial Systems. The University of Michigan Press, Ann Arbor, MI.
- Hruschka et al. (2009) Hruschka, E.R., R.J.G.B. Campello, A.A. Freitas, A.C.P.L.F. de Carvalho. 2009. A survey of evolutionary algorithms for clustering. IEEE Transactions on Systems, Man and Cybernetics Part C: Applications and Reviews 39(2) 133–155.
- Hubert and Arabie (1985) Hubert, L., P. Arabie. 1985. Comparing partitions. Journal of Classification 2(1) 193–218.
- Ismkhan (2018) Ismkhan, H. 2018. I-k-means-+: An iterative clustering algorithm based on an enhanced version of the k-means. Pattern Recognition 79(1) 402–413.
- Jain (2010) Jain, A.K. 2010. Data clustering: 50 years beyond K-means. Pattern Recognition Letters 31(8) 651–666.
- Karmitsa et al. (2017) Karmitsa, N., A.M. Bagirov, S. Taheri. 2017. New diagonal bundle method for clustering problems in large data sets. European Journal of Operational Research 263(2) 367–379.
- Karmitsa et al. (2018) Karmitsa, N., A.M. Bagirov, S. Taheri. 2018. Clustering in large data sets with the limited memory bundle method. Pattern Recognition 83 245–259.
Keskar et al. (2017)
Keskar, N.S., D. Mudigere, J. Nocedal, M. Smelyanskiy, P.T.P. Tang. 2017.
On large-batch training for deep learning: Generalization gap and sharp minima.ICLR’17, International Conference on Learning Representations.
- Kivijärvi et al. (2003) Kivijärvi, J., P. Fränti, O. Nevalainen. 2003. Self-adaptive genetic algorithm for clustering. Journal of Heuristics 9(2) 113–129.
- Krishna and Murty (1999) Krishna, K., M.N. Murty. 1999. Genetic k-means algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 29(3) 433–439.
- Kuhn (1955) Kuhn, H.W. 1955. The Hungarian method for the assignment problem. Naval Research Logistics 2(1-2) 83–97.
- Kvalseth (1987) Kvalseth, T.O. 1987. Entropy and correlation: Some comments. IEEE Transactions on Systems, Man and Cybernetics 17(3) 517–519.
- Lecun et al. (2015) Lecun, Y., Y. Bengio, G. Hinton. 2015. Deep learning. Nature 521 436–444.
- Likas et al. (2003) Likas, A., N. Vlassis, J.J. Verbeek. 2003. The global k-means clustering algorithm. Pattern Recognition 36(2) 451–461.
- Lloyd (1982) Lloyd, S.P. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2) 129–137.
- Lu et al. (2004) Lu, Y., S. Lu, F. Fotouhi, Y. Deng, S.J. Brown. 2004. FGKA: A fast genetic K-means clustering algorithm. Proceedings of the 2004 ACM Symposium on Applied Computing 622–623.
- Maulik and Bandyopadhyay (2000) Maulik, U., S. Bandyopadhyay. 2000. Genetic algorithm-based clustering technique. Pattern Recognition 33 1455–1465.
- Merz and Zell (2002) Merz, P., A. Zell. 2002. Clustering gene expression profiles with memetic algorithms. Parallel Problem Solving from Nature. Springer, 811–820.
- Ordin and Bagirov (2015) Ordin, B., A.M. Bagirov. 2015. A heuristic algorithm for solving the minimum sum-of-squares clustering problems. Journal of Global Optimization 61(2) 341–361.
- Sarkar et al. (1997) Sarkar, M., B. Yegnanarayana, D. Khemani. 1997. A clustering algorithm using an evolutionary programming-based approach. Pattern Recognition Letters 18(10) 975–986.
- Scheunders (1997) Scheunders, P. 1997. A comparison of clustering algorithms applied to color image quantization. Pattern Recognition Letters 18 1379–1384.
- Selim and Alsultan (1991) Selim, S.Z., K. Alsultan. 1991. A simulated annealing algorithm for the clustering problem. Pattern Recognition 24(10) 1003–1008.
- Sörensen and Sevaux (2006) Sörensen, K., M. Sevaux. 2006. MAPM: Memetic algorithms with population management. Computers & Operations Research 33(5) 1214–1225.
- Steinley (2006) Steinley, D. 2006. K-means clustering: A half-century synthesis. British Journal of Mathematical and Statistical Psychology 59(1) 1–34.
- Sugar and James (2003) Sugar, C.A., G.M. James. 2003. Finding the number of clusters in a dataset. Journal of the American Statistical Association 98(463) 750–763.
- Vidal et al. (2012) Vidal, T., T.G. Crainic, M. Gendreau, N. Lahrichi, W. Rei. 2012. A hybrid genetic algorithm for multidepot and periodic vehicle routing problems. Operations Research 60(3) 611–624.