1 Introduction
Broadly defined, clustering is the problem of organizing a collection of elements into coherent groups in such a way that similar elements are in the same cluster and different elements are in different clusters. Among the models and formulations for this problem, the Euclidean minimum sum-of-squares clustering (MSSC) is prominent in the literature. MSSC can be formulated as an optimization problem in which the objective is to minimize the sum of the squared Euclidean distances of the samples to their cluster means. This problem has been extensively studied over the last 50 years, as highlighted by various surveys and books (see, e.g., Han et al., 2011; Hruschka et al., 2009; Jain, 2010).
The NP-hardness of MSSC (Aloise et al., 2009) and the size of practical datasets explain why most MSSC algorithms are heuristics, designed to produce an approximate solution in a reasonable computational time. K-means (Hartigan and Wong, 1979) (also called Lloyd’s algorithm (Lloyd, 1982)) and K-means++ (Arthur and Vassilvitskii, 2007) are two popular local search algorithms for MSSC that differ in the construction of their initial solutions. Their simplicity and low computational complexity explain their extensive use in practice. However, these methods have two significant disadvantages: (i) their solutions can be distant from the global optimum, especially in the presence of a large number of clusters and dimensions, and (ii) their performance is sensitive to the initial conditions of the search.
To circumvent these issues, a variety of heuristics and metaheuristics have been proposed with the aim of better escaping from shallow local minima (i.e., poor solutions in terms of the MSSC objective). Nearly all the classical metaheuristic frameworks have been applied, including simulated annealing, tabu search, variable neighborhood search, iterated local search, evolutionary algorithms
(Al-Sultan, 1995; Hansen and Mladenović, 2001; Ismkhan, 2018; Maulik and Bandyopadhyay, 2000; Sarkar et al., 1997; Selim and Alsultan, 1991), as well as more recent incremental methods and convex optimization techniques (An et al., 2014; Bagirov et al., 2016, 2011; Karmitsa et al., 2017). However, these sophisticated methods have not been predominantly used in machine learning applications. This may be explained by three main factors: 1) data size and computational time restrictions, 2) the limited availability of implementations, or 3) the belief that a near-optimal solution of the MSSC model has little impact on clustering validity.
To help remove these barriers, we introduce a simple and efficient hybrid genetic search for the MSSC called HG-means, and conduct extensive computational analyses to measure the correlation between solution quality (in terms of the MSSC objective) and clustering performance (based on external measures). Our method combines the improvement capabilities of the K-means algorithm with a problem-tailored crossover, an adaptive mutation scheme, and population-diversity management strategies. The overall method can be seen as a multi-start K-means algorithm in which the initial center positions are sampled by the genetic operators based on the search history. The crossover of HG-means uses a minimum-cost matching algorithm as a subroutine, with the aim of inheriting genetic material from both parents without excessive perturbation and creating child
solutions that can be improved in a limited number of iterations. The adaptive mutation operator has been designed to help cover distant samples without being excessively attracted by outliers. Finally, the population is managed so as to prohibit
clones and favor the discovery of diverse solutions, a feature that helps to avoid premature convergence toward low-quality local minima.

As demonstrated by our experiments on a variety of datasets, HG-means produces MSSC solutions of significantly higher quality than those provided by previous algorithms. Its computational time is also lower than that of recent state-of-the-art optimization approaches, and it grows linearly with the number of samples and the dimension. Moreover, when considering the reconstruction of a mixture of Gaussians, we observe that the standard repeated K-means and K-means++ approaches remain trapped in shallow local minima that can be very far from the ground truth, whereas HG-means consistently attains better local optima and finds more accurate clusters. The performance gains are especially pronounced on datasets with a larger number of clusters and a feature space of higher dimension, in which more independent information is available, but also in which pairwise distances are known to become more uniform and less meaningful. Therefore, some key challenges associated with high-dimensional data clustering may be overcome by improving the optimization algorithms, before even considering a change of clustering model or paradigm.
2 Problem Statement
In a clustering problem, we are given a set of n samples, each represented as a point p_i in R^d, and we seek to partition this set into K disjoint clusters so as to minimize a given criterion f. There is no universal objective suitable for all applications, but f should generally promote homogeneity (similar samples should be in the same cluster) and separation (different samples should be in different clusters). MSSC corresponds to a specific choice of objective function, in which one aims to form the clusters and find a center position for each cluster, in such a way that the sum of the squared Euclidean distances from each point to the center of its associated cluster is minimized. This problem has been the focus of extensive research: there are many applications (Jain, 2010), and it is the natural problem for which K-means finds a local minimum.
A compact mathematical formulation of MSSC is presented in Equations (1)–(4). For each sample p_i and cluster k, the binary variable x_{ik} takes the value 1 if sample i is assigned to cluster k, and 0 otherwise. The variables c_k represent the positions of the centers.

\min \sum_{i=1}^{n} \sum_{k=1}^{K} x_{ik} \lVert p_i - c_k \rVert^2    (1)

s.t. \sum_{k=1}^{K} x_{ik} = 1,    i = 1, \dots, n    (2)

x_{ik} \in \{0, 1\},    i = 1, \dots, n, \; k = 1, \dots, K    (3)

c_k \in \mathbb{R}^d,    k = 1, \dots, K    (4)
In the objective, \lVert \cdot \rVert represents the Euclidean norm. Equation (2) forces each sample to be associated with exactly one cluster, and Equations (3)–(4) define the domains of the variables. Note that in this model, and in the remainder of this paper, we consider a fixed number of clusters K. Indeed, from the MSSC objective viewpoint, it is always beneficial to use the maximum number of available clusters. For some applications such as color quantization and data compression (Scheunders, 1997), the number of clusters is known in advance (desired number of colors or compression factor). Analytical techniques have been developed to find a suitable number of clusters (Sugar and James, 2003) when this information is not available. Finally, it is common to solve MSSC for a range of values of K and select the most relevant result a posteriori.
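Given an assignment and center positions, the objective (1) reduces to a direct summation. A minimal sketch in Python (the data and function names below are illustrative, not part of the original formulation):

```python
def mssc_cost(points, centers, assignment):
    """MSSC objective (1): sum of squared Euclidean distances of
    each sample to the center of the cluster it is assigned to."""
    total = 0.0
    for p, k in zip(points, assignment):
        total += sum((pi - ci) ** 2 for pi, ci in zip(p, centers[k]))
    return total

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
centers = [(0.5, 0.0), (10.0, 10.0)]
assignment = [0, 0, 1]   # sample i belongs to cluster assignment[i]
print(mssc_cost(points, centers, assignment))  # 0.25 + 0.25 + 0.0 = 0.5
```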
Regarding computational complexity, MSSC can be solved in polynomial time by dynamic programming when the data are one-dimensional. For general K and d, MSSC is NP-hard (Aloise et al., 2009, 2012). Optimal MSSC solutions are known to satisfy at least two necessary conditions:
Property 1
In any optimal MSSC solution, for each k ∈ {1, …, K}, the position c_k of the center coincides with the centroid of the points belonging to cluster C_k:

c_k = \frac{1}{|C_k|} \sum_{p_i \in C_k} p_i    (5)
Property 2
In any optimal MSSC solution, for each i ∈ {1, …, n}, the sample p_i is associated with a closest cluster k(i) such that:

\lVert p_i - c_{k(i)} \rVert \le \lVert p_i - c_k \rVert,    k = 1, \dots, K    (6)
These two properties are fundamental to understanding the behavior of various MSSC algorithms. The K-means algorithm, in particular, iteratively modifies an incumbent solution to satisfy first Property 1 and then Property 2, until both are satisfied simultaneously. Various studies have proposed more efficient data structures and speed-up techniques for this method. For example, Hamerly (2010) provides an efficient K-means algorithm whose worst-case complexity per iteration, O(nKd), matches that of the classical method. This algorithm is faster in practice than its theoretical worst case, since it avoids many of the innermost loops of K-means.
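The alternation between the two properties can be sketched as a plain Lloyd-type iteration (not Hamerly's accelerated variant; the data and starting centers below are illustrative):

```python
def sq_dist(p, c):
    return sum((a - b) ** 2 for a, b in zip(p, c))

def kmeans(points, centers, max_iter=100):
    """Plain Lloyd iteration: alternate assignment (Property 2) and
    centroid recomputation (Property 1) until both hold simultaneously."""
    centers = [tuple(c) for c in centers]
    assign = []
    for _ in range(max_iter):
        # Property 2: assign each sample to its closest center.
        assign = [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
                  for p in points]
        # Property 1: move each center to the centroid of its cluster.
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:
                dim = len(members[0])
                new_centers.append(tuple(sum(p[j] for p in members) / len(members)
                                         for j in range(dim)))
            else:
                new_centers.append(centers[k])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
centers, assign = kmeans(points, [(0.0, 0.0), (5.0, 5.0)])
print(centers)  # [(0.0, 0.5), (5.5, 5.0)]
print(assign)   # [0, 0, 1, 1]
```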
Other improvements of K-means have focused on the choice of initial centers (Steinley, 2006). K-means++ is one such method. This algorithm places the first center at the location of a random sample selected with uniform probability. Then, each subsequent center is randomly placed at the location of a sample, with a probability proportional to the distance from that sample to its closest already-placed center. Finally, K-means is executed from this starting point. With this rule, the expected solution quality is within a factor O(log K) of the global optimum.

Numerous other solution techniques have been proposed for MSSC. These algorithms can generally be classified according to whether they are exact or heuristic, deterministic or probabilistic, and hierarchical or partitional. Some process complete solutions whereas others construct solutions during the search, and some maintain a single candidate solution whereas others work with a population of solutions
(Jain, 2010). The range of methods includes construction methods and local searches, metaheuristics, and mathematical programming techniques. Since this work focuses on the development of a hybrid genetic algorithm with population management, the remainder of this review concentrates on other evolutionary methods for this problem, adaptations of K-means, and the algorithms that currently report the best known solutions for the MSSC objective, since these methods will be used for comparison purposes in Section 4.

Hruschka et al. (2009)
provide a comprehensive review and analysis of evolutionary algorithms for MSSC, comparing different solution encodings, crossover, and mutation strategies. As is the case for other combinatorial optimization problems, many genetic algorithms do not rely on random mutation and crossover alone, but also integrate a local search to stimulate the discovery of high-quality solutions. Such algorithms are usually called hybrid genetic or memetic algorithms (Blum et al., 2011). The algorithms of Fränti et al. (1997); Kivijärvi et al. (2003); Krishna and Murty (1999); Lu et al. (2004); Merz and Zell (2002) are representative of this type of strategy and exploit K-means for solution improvement. In particular, Fränti et al. (1997) and Kivijärvi et al. (2003) propose a hybrid genetic algorithm based on a portfolio of six crossover operators. One of these, which inspired the crossover of the present work, pairs the centroids of two solutions via a greedy nearest-neighbor algorithm and randomly selects one center from each pair. The mutation operator relocates a random centroid to the location of a random sample, with a small probability. Although this method shares some mechanisms with HG-means, it lacks other key components: an exact matching-based crossover, population-management mechanisms favoring the removal of clones, and an adaptive parameter to control the attractiveness of outliers in the mutation. The variation (crossover and mutation) operators of Krishna and Murty (1999); Lu et al. (2004); Merz and Zell (2002) are also different from those of HG-means. In particular, Krishna and Murty (1999) and Lu et al. (2004) do not rely on crossover but exploit random mutation to reassign data points to clusters. Finally, Merz and Zell (2002) consider an application of clustering to gene expression data, using K-means as a local search along with a crossover operator that relies on distance proximity to exchange centers between solutions.

Besides evolutionary algorithms and metaheuristics, substantial research has been conducted on incremental variants of the K-means algorithm (Bagirov, 2008; Bagirov et al., 2016; Karmitsa et al., 2017; Likas et al., 2003; Ordin and Bagirov, 2015), leading to the current state-of-the-art results for large-scale datasets. Incremental clustering algorithms construct a solution of MSSC iteratively, adding one center at a time.
The global K-means algorithm (Likas et al., 2003) is such a method. Starting from a solution with k − 1 centers, the complete algorithm performs n runs of K-means, one from each initial solution containing the k − 1 existing centers plus one sample. The best solution with k centers is stored, and the process is repeated until the desired number of clusters is attained. Faster versions of this algorithm can be designed by greedily selecting a smaller subset of solutions for improvement at each step. For example, the modified global K-means (MGKM) of Bagirov (2008) solves an auxiliary clustering problem to select one good initial solution at each step instead of considering all possibilities. This algorithm was improved in Ordin and Bagirov (2015) into a multi-start modified global K-means (MS-MGKM) algorithm, which generates several candidates at each step. Experiments on 16 real-world datasets show that MS-MGKM produces more accurate solutions than MGKM and the global K-means algorithm. These methods were also extended in Bagirov et al. (2016) and Karmitsa et al. (2017), by solving an optimization problem over a difference of convex (DC) functions in order to choose candidate initial solutions. Finally, Karmitsa et al. (2018) introduced an incremental nonsmooth optimization algorithm based on a limited-memory bundle method, which produces solutions in a short time. To date, the MS-MGKM, DC-Clust and DCD-Bundle algorithms represent the state of the art in terms of solution quality.
Despite this extensive research, producing high-quality MSSC solutions in a consistent manner remains challenging for large datasets. Our algorithm, presented in the next section, helps to fill this gap.
3 Proposed Methodology
HG-means is a hybrid metaheuristic that combines the exploration capabilities of a genetic algorithm and the improvement capabilities of a local search, along with general population-management strategies to preserve the diversity of the genetic information. Similarly to Kivijärvi et al. (2003); Krishna and Murty (1999) and some subsequent studies, the K-means algorithm is used as a local search. However, the proposed method differs from previous work in its main variation operators: it relies on an exact bipartite matching crossover, a sophisticated adaptive mutation mechanism, and population-diversity management techniques.
The general scheme is given in Algorithm 1. Each individual in the population is represented as a triplet containing a membership chromosome and a coordinate chromosome, which jointly represent the solution, along with a mutation parameter that helps to balance the influence of outliers. The algorithm first generates a randomized initial population (Section 3.1) and then iteratively applies variation operators (selection, recombination, mutation) and local search (K-means) to evolve this population. At each iteration, two parents are selected and crossed (Section 3.2), yielding the coordinates and mutation parameter of an offspring. A mutation operator is then applied to the offspring (Section 3.3), leading to an individual that is improved with the K-means algorithm (Section 3.4) and then included in the population.
Finally, each time the population exceeds a prescribed maximum size, a survivor selection phase is triggered (Section 3.5) to retain only a diverse subset of good individuals. The algorithm terminates after a given number of consecutive iterations (generations of new individuals) without improvement of the best solution, or when a total iteration limit is reached. The remainder of this section describes each component of the method in more detail.
3.1 Solution Representation and Initial Population
Each individual contains two chromosomes encoding the solution: a membership chromosome of n integers, specifying for each sample the index of the cluster to which it is associated; and a coordinate chromosome of K real vectors in R^d, representing the coordinates of the center of each cluster. The individual is completed with a mutation parameter α. Figure 1 illustrates this solution representation for a simple two-dimensional example with three centers.

Observe that either of the two chromosomes is sufficient to characterize an MSSC solution. If only the membership chromosome is known, then Property 1 states that the center of each cluster should be located at the centroid of the points associated with it, and a trivial calculation gives the coordinates of each centroid in O(nd) time. If only the coordinate chromosome is known, then Property 2 states that each sample should be associated with its closest center, and a simple linear search, calculating the distance from each sample to every center, gives the membership chromosome in O(nKd) time. Finally, note that the iterative decoding of one chromosome into the other, until convergence, is equivalent to the K-means algorithm.
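The two decodings can be sketched as follows (illustrative data; the function names are ours, not from the original):

```python
def decode_membership(points, membership, K):
    """Membership -> coordinates (Property 1): place each center at the
    centroid of its cluster (assumes no cluster is empty)."""
    dim = len(points[0])
    centers = []
    for k in range(K):
        members = [p for p, m in zip(points, membership) if m == k]
        centers.append(tuple(sum(p[j] for p in members) / len(members)
                             for j in range(dim)))
    return centers

def decode_coordinates(points, centers):
    """Coordinates -> membership (Property 2): attach each sample
    to its closest center."""
    def sq(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))
    return [min(range(len(centers)), key=lambda k: sq(p, centers[k]))
            for p in points]

points = [(0.0, 0.0), (0.0, 2.0), (4.0, 4.0)]
centers = decode_membership(points, [0, 0, 1], K=2)
print(centers)                              # [(0.0, 1.0), (4.0, 4.0)]
print(decode_coordinates(points, centers))  # [0, 0, 1]
```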
3.2 Selection and Crossover
The generation of each new individual begins with the selection of two parents. The parent selection is done by binary tournament: two random solutions are drawn from the population with uniform probability, and the one with the best fitness is retained. The fitness considered in HG-means is simply the value of the objective function (MSSC cost).
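A minimal sketch of the binary tournament (the individuals are hypothetical, identified here by string ids with precomputed MSSC costs; the draw is with replacement for simplicity):

```python
import random

def binary_tournament(population, cost, rng=random):
    """Draw two solutions uniformly at random (with replacement)
    and keep the one with the lowest MSSC cost."""
    a = rng.choice(population)
    b = rng.choice(population)
    return a if cost[a] <= cost[b] else b

cost = {"s1": 120.0, "s2": 95.5, "s3": 103.2}  # hypothetical fitness values
winner = binary_tournament(list(cost), cost)
print(winner in cost)  # True; lower-cost individuals win more often
```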
Then, the coordinate chromosomes of the two parents serve as input to the matching crossover (MX), which generates the coordinate chromosome of an offspring in two steps:

Step 1) The MX solves a bipartite matching problem to pair up the centers of the two parents. Let G be a complete bipartite graph in which one vertex set represents the K centers of the first parent and the other vertex set represents the K centers of the second parent. Each edge of G represents a possible association of a center from the first parent with a center from the second parent, and its cost is the Euclidean distance between the two centers. A minimum-cost bipartite matching problem is solved in G using an efficient implementation of the Hungarian algorithm (Kuhn, 1955), returning K pairs of centers in O(K³) time.

Step 2) For each pair obtained in the previous step, the MX randomly selects one of the two centers with equal probability, leading to a new coordinate chromosome with K centers and inherited characteristics from both parents.
Finally, the mutation parameter of the offspring is obtained as the simple average of the two parent values.
The MX is illustrated in Figure 2 on the same example as before. This method can be viewed as an extension of the third crossover of Fränti et al. (1997), using an exact matching algorithm instead of a greedy heuristic. The MX has several important properties. First, each center of the offspring belongs to at least one parent, thereby promoting the transmission of good building blocks (Holland, 1975). Second, although any MSSC solution admits symmetrical representations, obtained by re-indexing its clusters or reordering its centers, the coordinate chromosome generated by the MX contains the same centers regardless of this order. In combination with the population-management mechanisms (Section 3.5), this helps to avoid the propagation of similar solutions in the population and prevents premature convergence due to a loss of diversity.
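To make the two steps concrete, here is a small sketch of the MX. For brevity, the minimum-cost matching is found by brute force over permutations, which is only viable for tiny K; the actual operator uses the Hungarian algorithm in O(K³). All data are illustrative:

```python
import itertools, math, random

def center_dist(c1, c2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def matching_crossover(parent1, parent2, rng=random):
    """Step 1: pair the centers of the two parents by a minimum-cost matching.
    Step 2: inherit one center of each pair with equal probability."""
    K = len(parent1)
    # Brute-force exact matching, O(K!) -- stands in for the Hungarian algorithm.
    best = min(itertools.permutations(range(K)),
               key=lambda perm: sum(center_dist(parent1[i], parent2[perm[i]])
                                    for i in range(K)))
    return [parent1[i] if rng.random() < 0.5 else parent2[best[i]]
            for i in range(K)]

p1 = [(0.0, 0.0), (5.0, 5.0), (9.0, 0.0)]
p2 = [(9.2, 0.1), (0.3, 0.2), (5.1, 4.8)]
child = matching_crossover(p1, p2)
print(len(child))  # 3: one center inherited per matched pair
```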
3.3 Mutation
The MX is deterministic and strictly inherits existing solution features from both parents. In contrast, our mutation operator aims to introduce new randomized solution characteristics. It receives as input the coordinate chromosome of the offspring and its mutation parameter α. It proceeds in two steps:

Step 1) It “mutates” the mutation parameter α as follows:

α ← α + u    (7)

where u is a random number selected with uniform probability. Such a mechanism is typical in evolution strategies and allows an efficient parameter adaptation during the search.

Step 2) It uses the newly generated α to mutate the coordinate chromosome by the biased relocation of a center:

Select a random center with uniform probability for removal.

Reassign each sample to the closest remaining center (but do not modify the positions of the centers). Let d_i be the distance of sample p_i from its closest remaining center.

Select a random sample p_i and add a new center at its position, leading to the new coordinate chromosome. The selection of p_i is done by roulette wheel, based on the following mixture of two probability distributions:

P(i) = \alpha \frac{d_i}{\sum_{j=1}^{n} d_j} + (1 - \alpha) \frac{1}{n}    (8)

The value of α in Equation (8) drives the behavior of the mutation operator. In the case where α = 0, all the samples have an equal chance of selection, independently of their distance from the closest center. When α = 1, the selection probability is proportional to the distance, similarly to the initialization phase of K-means++ (Arthur and Vassilvitskii, 2007). Considering sample-center distances increases the odds of selecting a good position for the new center. However, this could also drive the center positions toward outliers and reduce the solution diversity. For these reasons, α is adapted instead of remaining fixed.

The mutation operator requires O(nKd) time in the worst case and is illustrated in Figure 3. The top-right center is selected for removal, and a new center is reinserted at the bottom right of the figure. These new center coordinates constitute a starting point for the K-means local search discussed in the following section. The first step of K-means is to reassign each sample to its closest center (Figure 3(c)), which is equivalent to recovering the membership chromosome. After a few iterations, K-means converges to a new local minimum (Figure 3(d)), giving a solution that is added to the population.
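The biased relocation of Step 2 can be sketched as follows; the roulette wheel draws a sample index from the mixture (8). Data and parameter values are illustrative:

```python
import math, random

def mutate_centers(points, centers, alpha, rng=random):
    """Remove a random center, then add a new one at the position of a sample
    drawn from mixture (8): weight alpha on the distance-proportional term,
    weight 1 - alpha on the uniform term."""
    centers = list(centers)
    centers.pop(rng.randrange(len(centers)))          # remove a random center
    # d_i: distance of each sample to its closest remaining center
    d = [min(math.dist(p, c) for c in centers) for p in points]
    n, total = len(points), sum(d)                    # assumes total > 0
    probs = [alpha * d_i / total + (1 - alpha) / n for d_i in d]
    i = rng.choices(range(n), weights=probs, k=1)[0]  # roulette wheel
    centers.append(points[i])
    return centers

points = [(0.0, 0.0), (1.0, 0.0), (8.0, 8.0)]
new_centers = mutate_centers(points, [(0.5, 0.0), (8.0, 8.0)], alpha=0.5)
print(len(new_centers))  # still 2 centers after remove + reinsert
```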
3.4 Local Search
Each coordinate chromosome generated through selection, crossover, and mutation serves as a starting point for a local search based on K-means. This algorithm iteratively 1) reassigns each sample to its closest center and 2) moves each center to the centroid of the samples assigned to it. These two steps are repeated until convergence to a local optimum. For this purpose, we use the fast K-means of Hamerly (2010). This algorithm has the same O(nKd) worst-case complexity per loop as the classical K-means, but it exploits lower bounds on the distances to eliminate many distance evaluations while returning the same results. Moreover, in the exceptional case in which some clusters are left empty after the application of K-means (i.e., one center has no allocated sample), the algorithm selects new center locations using Equation (8) and restarts K-means to ensure that all solutions have exactly K clusters.
3.5 Population Management
One critical challenge for population-based algorithms is to avoid the premature convergence of the population. Indeed, the elitist selection of parents favors good individuals and tends to reduce the population diversity. This effect is exacerbated by the application of K-means as a local search, since it concentrates the population on a smaller subset of local minima. Once diversity is lost, the chances of generating improving solutions via crossover and mutation are greatly reduced. To overcome this issue, HG-means relies on diversity-management operators that find a good balance between elitism and diversification and that allow progress toward unexplored regions of the search space without excluding promising individuals. Similar techniques have been shown to be essential for progress toward high-quality solutions for difficult combinatorial optimization problems (Sörensen and Sevaux, 2006; Vidal et al., 2012).
Population management. The initial population of HG-means is generated by repeated runs of the standard K-means algorithm, in which the initial center locations are randomly selected from the set of samples. Moreover, the mutation parameter of each individual is randomly initialized in [0, 1] with a uniform distribution. Subsequently, each new individual is directly included in the population and thus has some chance of being selected as a parent by the binary tournament operator, regardless of its quality. Whenever the population reaches its maximum size, a survivor selection is triggered to retain a subset of individuals.

Survivor selection. This mechanism selects individuals in the population for removal.
To promote population diversity, the removal first focuses on clone individuals. Two individuals are clones if they have the same center positions. When a pair of clones is detected, one of the two individuals is randomly eliminated. When no more clones remain, the survivor selection proceeds with the elimination of the worst individuals, until the minimum population size is reached.
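A compact sketch of the survivor selection (individuals are represented as tuples of centers; for determinism this sketch drops the later copy of each clone pair, whereas the actual mechanism picks one at random):

```python
def select_survivors(individuals, cost, pi_min):
    """Reduce the population to pi_min individuals: first eliminate clones
    (same center set, in any order), then eliminate the worst individuals."""
    pop = list(individuals)
    while len(pop) > pi_min:
        seen, clone = set(), None
        for ind in pop:
            key = frozenset(ind)          # order-insensitive set of centers
            if key in seen:
                clone = ind               # later copy of a clone pair
                break
            seen.add(key)
        if clone is None:
            break                         # no clones left
        pop.remove(clone)
    pop.sort(key=lambda ind: cost[ind])   # drop the worst until pi_min remain
    return pop[:pi_min]

inds = [((0.0, 0.0), (1.0, 1.0)), ((1.0, 1.0), (0.0, 0.0)),
        ((2.0, 2.0), (3.0, 3.0)), ((5.0, 5.0), (6.0, 6.0))]
cost = {inds[0]: 10.0, inds[1]: 10.0, inds[2]: 5.0, inds[3]: 20.0}
print(select_survivors(inds, cost, pi_min=2))  # the two cheapest distinct solutions
```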
Complexity analysis. The overall worst-case running time of HG-means is proportional to the number of iterations multiplied by the combined time of the crossover, mutation, and K-means procedures. The mutation and crossover methods are much faster than the K-means local search in practice, so HG-means' CPU time is essentially proportional to the number of iterations multiplied by the time of K-means. Under strict CPU time limits, the termination parameters could be set to small constants to obtain fast results (see Section 4.4). Moreover, since HG-means maintains a population of complete solutions with K clusters, it has good “anytime behavior”: it can be interrupted whenever necessary to return the current best solution.
4 Experimental Analysis
We conducted extensive computational experiments to evaluate the performance of HG-means. After a description of the datasets and a preliminary parameter calibration (Sections 4.1–4.2), our first analysis focuses on solution quality from the perspective of the MSSC optimization problem (Section 4.3). We compare the solution quality obtained by HG-means with that of the current state-of-the-art algorithms, in terms of objective function value and computational time, and we study the sensitivity of the method to changes in its parameters. Our second analysis concentrates on the scalability of the method, studying how the computational effort grows with the dimensionality of the data and the number of clusters (Section 4.4). Finally, our third analysis evaluates the correlation between solution quality from an optimization viewpoint and clustering performance measured via external cluster validity measures. It compares the performance of HG-means, K-means, and K-means++ on a fundamental task, that of recovering the parameters of a non-separable mixture of Gaussians (Section 4.5). Notably, we consider feature spaces of medium to high dimensionality (20 to 500) in the presence of a medium to large number of Gaussians (50 to 1,000). We show that the improved solution quality of HG-means directly translates into better clustering performance and more accurate information retrieval.
The algorithms of this paper were implemented in C++ and compiled with G++ 4.8.4. The source code is available at https://github.com/danielgribel/hgmeans. The experiments were conducted on a single thread of an Intel Xeon X5675 3.07 GHz processor with 8 GB of RAM.
4.1 Datasets
We evaluated classical and recent state-of-the-art algorithms (Bagirov, 2008; Bagirov et al., 2016; Karmitsa et al., 2017, 2018; Kivijärvi et al., 2003; Ordin and Bagirov, 2015), in terms of solution quality for the MSSC objective, on a subset of datasets from the UCI machine learning repository (http://archive.ics.uci.edu/ml/). We collected all the recent datasets used in these studies for a thorough experimental comparison. The resulting 29 datasets, listed in Table 1, arise from a large variety of applications (e.g., cognitive psychology, genomics, and particle physics) and contain numeric features without missing values.
Group  Dataset  Samples (n)  Features (d)  Entries (n × d)  
A1  German Towns  59  2  118  
Bavaria Postal 1  89  3  267  
Bavaria Postal 2  89  4  356  
Fisher’s Iris Plant  150  4  600  
A2  Liver Disorders  345  6  2k  
Heart Disease  297  13  4k  
Breast Cancer  683  9  6k  
Pima Indians Diabetes  768  8  6k  
Congressional Voting  435  16  7k  
Ionosphere  351  34  12k  
B  TSPLib1060  1,060  2  2k  
TSPLib3038  3,038  2  6k  
Image Segmentation  2,310  19  44k  
Page Blocks  5,473  10  55k  
Pendigit  10,992  16  176k  
Letters  20,000  16  320k  
C  D15112  15,112  2  30k  
Pla85900  85,900  2  172k  
EEG Eye State  14,980  14  210k  
Shuttle Control  58,000  9  522k  
Skin Segmentation  245,057  3  735k  
KEGG Metabolic Relation  53,413  20  1M  
3D Road Network  434,874  3  1M  
Gas Sensor  13,910  128  2M  
Online News Popularity  39,644  58  2M  
Sensorless Drive Diagnosis  58,509  48  3M  
Isolet  7,797  617  5M  
MiniBooNE  130,064  50  7M  
Gisette  13,500  5,000  68M 
Their dimensions vary significantly, from 59 to 434,874 samples, and from 2 to 5,000 features. Each dataset has been considered with different values of K (number of clusters), leading to a variety of MSSC test setups. These datasets are grouped into four classes. Classes A1 and A2 contain small datasets with 59 to 768 samples, while class B contains medium datasets with 1,060 to 20,000 samples. These three classes were considered in Ordin and Bagirov (2015). Class C contains larger datasets collected in Karmitsa et al. (2018), with 13,910 to 434,874 samples and sometimes a large number of features (e.g., Isolet and Gisette).
4.2 Parameter Calibration
HG-means is driven by four parameters: two controlling the population size and two controlling the termination of the algorithm (a maximum number of consecutive iterations without improvement, and an overall maximum number of iterations). Changing the termination criteria leads to different non-dominated trade-offs between solution quality and CPU time. We therefore set these parameters so as to consume less CPU time than most existing algorithms on large datasets, while allowing enough iterations to profit from the enhanced exploration abilities of HG-means. Subsequently, we calibrated the population size to establish a good balance between exploitation (number of generations before termination) and exploration (number of individuals), by comparing different population-size settings on medium-sized datasets of classes A2 and B. The best-performing setting forms our baseline configuration. The impact of deviations from this parameter setting will be studied in the next section.
4.3 Performance on the MSSC Optimization Problem
We tested HG-means on each MSSC dataset and each number of clusters K. Tables 2 and 3 compare its solution quality and CPU time with those of classical and recent state-of-the-art algorithms:

GKM (Likas et al., 2003) – global K-means;

SAGA (Kivijärvi et al., 2003) – self-adaptive genetic algorithm;

MGKM (Bagirov, 2008) – modified global K-means;

MS-MGKM (Ordin and Bagirov, 2015) – multi-start modified global K-means;

DC-Clust and MS-DCA (Bagirov et al., 2016) – clustering based on a difference of convex functions;

DCD-Bundle (Karmitsa et al., 2017) – diagonal bundle method;

LMBM-Clust (Karmitsa et al., 2018) – nonsmooth optimization based on a limited-memory bundle method.
We also report the results of the classical K-means and K-means++ algorithms (efficient implementation of Hamerly (2010)), for both a single run and the best solution of 5,000 repeated runs with different initial solutions.
The datasets and numbers of clusters indicated in Table 1 lead to 235 test setups. For ease of presentation, each line of Tables 2 and 3 is associated with one dataset and displays averaged results over all values of K. The detailed results of HG-means are available at https://w1.cirrelt.ca/~vidalt/en/researchdata.html. For each dataset and value of K, the solution quality is measured as the percentage gap from the best-known solution (BKS) value reported in all previous articles (from multiple methods, runs, and parameter settings).
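Assuming the usual definition of the gap relative to the BKS value, the measure can be computed as (values hypothetical):

```python
def percent_gap(value, best_known):
    """Percentage gap of a solution value from the best-known solution (BKS)."""
    return 100.0 * (value - best_known) / best_known

print(percent_gap(103.5, 100.0))  # 3.5
```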
K-means  K-means++  GKM  SAGA  MGKM  MS-MGKM  HG-means  
Single Run  5000 Runs  Single Run  5000 Runs  
Gap  T(s)  Gap  T(s)  Gap  T(s)  Gap  T(s)  Gap  T(s)  Gap  T(s)  Gap  T(s)  Gap  T(s)  Gap  T(s)  
German  20.34  0.00  0.00  0.15  15.12  0.00  0.08  0.14  0.76  0.00  0.08  0.22  1.00  0.00  0.47  0.01  0.08  0.02 
Bavaria1  789.66  0.00  8.28  0.36  10.34  0.00  0.00  0.19  1.03  0.00  0.00  0.26  0.17  0.01  0.07  0.00  0.00  0.02 
Bavaria2  738.96  0.00  5.63  0.49  37.14  0.00  0.05  0.25  1.37  0.00  0.05  0.29  1.37  0.01  0.14  0.00  0.05  0.03 
Iris  20.47  0.00  0.00  0.48  10.92  0.00  0.00  0.49  1.48  0.01  0.00  0.53  1.48  0.01  0.09  0.03  0.00  0.09 
Liver  26.34  0.00  6.90  11.54  8.85  0.00  1.22  9.00  13.44  0.08  0.06  5.92  12.15  0.21  0.00  1.59  0.94  1.82 
Heart  15.24  0.00  3.48  11.08  7.39  0.00  1.52  11.12  1.64  0.07  0.55  16.66  1.63  0.24  0.04  1.51  0.55  2.17 
Breast  20.31  0.01  4.95  27.63  5.43  0.00  1.61  22.22  3.29  0.24  1.15  19.21  1.30  0.52  0.00  2.02  0.45  5.56 
Pima  21.75  0.01  2.60  35.56  6.12  0.01  0.82  24.72  1.01  0.32  0.87  18.58  0.95  0.64  0.00  3.70  0.11  5.60 
Congressional  7.53  0.00  2.34  20.93  5.89  0.01  1.64  23.71  2.61  0.14  0.74  16.34  0.93  0.42  0.00  2.77  0.88  3.91 
Ionosphere  14.60  0.01  5.40  30.77  15.56  0.01  3.24  27.97  4.96  0.11  1.05  21.08  0.47  1.43  0.13  1.70  1.59  5.54 
TSPLib1060  16.83  0.00  4.04  20.98  11.03  0.00  2.35  18.20  2.63  1.06  0.81  30.83  2.33  1.07  0.20  6.53  0.15  4.15 
TSPLib3038  5.54  0.02  1.02  89.16  4.54  0.02  0.74  80.96  1.43  25.32  0.00  35.77  1.07  8.16  0.23  46.31  0.24  16.66 
Image  41.67  0.08  16.45  369.38  11.01  0.06  1.63  259.30  1.26  11.67  1.54  143.90  1.31  24.75  0.24  35.34  0.01  57.51 
Page  2960.65  0.80  911.35  4538.52  14.13  0.08  1.11  340.92  1.25  129.99  61.58  134.88  0.93  97.26  0.07  31.72  0.96  143.67 
Pendigit  3.31  0.50  0.36  2504.67  2.50  0.50  0.22  2275.14  0.24  263.04  0.74  607.68  0.16  434.83  0.04  352.36  0.18  461.13 
Letters  1.98  1.26  0.14  6209.46  1.35  1.35  0.10  6751.95  0.35  1102.38  0.42  1114.36  0.13  1859.64  0.01  908.70  0.18  1326.05 
Avg. Gap  294.07  60.81  10.46  1.00  2.42  4.33  1.71  0.11  0.40  
CPU  Xe 3.07 GHz  Xe 3.07 GHz  Core2 2.5 GHz  Xe 3.07 GHz  Core2 2.5 GHz  Core2 2.5 GHz  Xe 3.07 GHz  
Passmark  1403 (1.00)  1403 (1.00)  976 (0.70)  1403 (1.00)  976 (0.70)  976 (0.70)  1403 (1.00) 
Kmeans  Kmeans++  GKM  SAGA  LMBM  MSMGKM  DCClust  MSDCA  DCDBundle  HGmeans  
Single Run  5000 Runs  Single Run  5000 Runs  
Gap  T(s)  Gap  T(s)  Gap  T(s)  Gap  T(s)  Gap  T(s)  Gap  T(s)  Gap  T(s)  Gap  T(s)  Gap  T(s)  Gap  T(s)  Gap  T(s)  Gap  T(s)  
D15112  1.60  0.04  0.00  160.69  1.18  0.03  0.00  146.53  0.34  43.25  0.19  47.29  0.34  4.58  0.13  11.28  0.12  16.26  0.13  35.73  0.13  9.60  0.00  17.52 
Pla85900  0.46  0.31  0.02  1115.56  0.79  0.26  0.02  1060.07  0.25  2023.84  0.15  260.18  0.95  22.36  0.10  2094.58  0.14  200.30  0.09  1416.24  0.15  185.61  0.02  198.14 
Eye  880402.99  0.39  0.00  1522.87  48.95  0.23  0.58  961.63  0.81  161.18  0.04  212.00  0.75  6.62  0.74  17.59  0.89  41.23  0.75  121.77  0.98  19.45  0.02  196.43 
Shuttle  181.15  0.81  121.42  2832.02  22.49  0.25  0.66  911.64  0.35  1954.72  0.90  475.43  0.10  4.55  0.41  89.27  0.46  227.47  0.41  4722.90  1.71  312.10  0.91  97.84 
Skin  9.63  0.60  0.05  2304.23  7.71  0.43  0.38  1728.25  0.28  22518.92  0.13  844.14  3.93  14.41  0.63  3021.33  0.32  1233.00  0.33  8774.03  0.32  1259.01  0.41  230.25 
Kegg  94.45  4.21  76.24  17988.47  6.96  0.58  0.49  1204.77  1.85  4147.46  3.54  686.18  1.52  10.78  1.52  445.03  1.18  488.31  1.02  3442.98  0.89  576.76  0.51  244.37 
3Droad  0.23  5.55  0.00  35237.41  0.28  2.99  0.00  13437.58  0.00  42431.72  0.02  1233.00  0.00  63.21  0.54  60180.29  0.56  4924.72  0.44  31325.93  0.01  4872.71  0.00  2862.09 
Gas  21.42  1.57  1.87  5182.01  7.27  1.12  0.17  2482.24  0.24  1550.29  0.20  983.33  0.86  86.16  0.02  256.74  0.24  814.34  0.05  2404.14  0.55  627.13  0.22  521.57 
Online  18.65  1.57  0.51  5245.66  19.74  1.33  0.17  2653.76  –  –  0.15  1160.02  4.54  96.18  0.12  795.22  0.00  1509.81  –  –  0.26  1600.50  0.17  473.23 
Sensorless  155.50  4.91  46.02  18798.70  19.06  2.50  0.44  8237.59  –  –  1.16  2004.10  1.18  25.24  0.27  1249.96  2.58  2133.54  –  –  –  –  0.41  1077.67 
Isolet  1.93  3.61  0.21  11082.28  1.33  3.74  0.21  12587.68  –  –  0.21  3961.39  0.49  97.59  0.39  677.14  0.32  1672.82  –  –  –  –  0.21  1846.70 
Miniboone  40992.86  15.63  0.07  57148.73  1.75  9.63  0.10  27611.08  –  –  0.07  4883.13  3.50  88.41  0.29  7559.34  0.24  9656.14  –  –  0.23  9291.40  0.10  2941.52 
Gisette  0.47  77.17  –  –  0.47  96.63  –  –  –  –  –  –  0.03  1871.67  0.01  39504.13  0.00  49847.13  –  –  –  –  0.52  22279.47 
Avg. Gap  110088.99  24.94  11.96  0.14  0.52  0.37  1.05  0.51  0.49  0.40  0.59  0.26  
CPU  Xe 3.07 GHz  Xe 3.07 GHz  I5 2.9 GHz  Xe 3.07 GHz  I7 4.0 GHz  I7 4.0 GHz  I7 4.0 GHz  I5 2.9 GHz  I5 1.6/2.7 GHz  Xe 3.07 GHz  
Passmark  1403 (1.00)  1403 (1.00)  1859 (1.32)  1403 (1.00)  2352 (1.68)  2352 (1.68)  2352 (1.68)  1859 (1.32)  1432 (1.02)  1403 (1.00)  
* Considering the subset of 8 instances which is common to all methods  
This gap is expressed as Gap(%) = 100 (z − z_BKS) / z_BKS, where z represents the solution value of the method considered, and z_BKS is the BKS value. A negative gap means that the solutions for this dataset are better than the best solutions found previously. Finally, the last two lines indicate the CPU model used in each study, along with the time-scaling factor (based on the Passmark benchmark) representing the ratio between its speed and that of our processor. All time values in this article have been multiplied by these scaling factors to account for processor differences.
HGmeans produces solutions of remarkable quality, with average gaps of −0.40% and −0.26% on the small and large datasets, respectively. This means that its solutions are better, on average, than the best solutions ever found. For all datasets, HGmeans achieved the best gap value. The statistical significance of these improvements is confirmed by pairwise Wilcoxon tests between the results of HGmeans and those of the other methods. Over all 235 test setups (dataset × number-of-clusters combinations), HGmeans found 113 solutions better than the BKS, 116 solutions of equal quality, and only five solutions of lower quality. We observe that the improvements are also correlated with the size of the datasets. For the smallest ones, all methods perform relatively well. However, for more complex applications involving a larger number of samples, a feature space of higher dimension, and more clusters, striking differences in solution quality can be observed between the state-of-the-art methods.
These experiments also confirm the known fact that a single run of Kmeans or Kmeans++ does not usually find a good local minimum of the MSSC problem, as shown by gap values that can become arbitrarily high. For the Eye and Miniboone datasets, in particular, a misplaced center can have large consequences in terms of objective value. The probability of falling into such a case is high in a single run, but it can be reduced by performing repeated runs and retaining the best solution. Nevertheless, even 5,000 independent runs of Kmeans or Kmeans++ are insufficient to achieve solutions of a quality comparable to that of HGmeans.
In terms of computational effort, HGmeans is generally faster than SAGA, MSMGKM, DCClust, MSDCA, and DCDBundle (the current best methods in terms of solution quality), but slower than LMBMClust, since this method is specifically designed and calibrated to return quick solutions. It is also faster than a repeated Kmeans or Kmeans++ algorithm with 5000 restarts, i.e., a number of restarts equal to the maximum number of iterations of the algorithm. This can be partly explained by the fact that the solutions generated by the exact matching crossover require less time to converge via Kmeans than initial sample points selected according to some probability distributions. Moreover, a careful exploitation of the search history, represented by the genetic material of highquality parent solutions, makes the method more efficient and accurate.
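The exact matching crossover mentioned above relies on a minimum-cost assignment between the two parents' center sets, so that corresponding clusters (rather than arbitrary ones) are recombined. The following sketch is our own simplification: it finds the assignment by brute force, whereas an exact polynomial algorithm such as the Hungarian method (Kuhn, 1955) would be used in practice.

```python
from itertools import permutations

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def min_cost_matching(centers1, centers2):
    """Return the permutation `perm` minimizing the total squared distance
    between centers1[i] and centers2[perm[i]] (brute force, O(k!))."""
    k = len(centers1)
    return min(permutations(range(k)),
               key=lambda perm: sum(squared_distance(centers1[i],
                                                     centers2[perm[i]])
                                    for i in range(k)))
```

Once matched, an offspring can inherit each matched pair of centers from either parent, yielding initial center positions that Kmeans can refine quickly.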
Finally, we measured the sensitivity of HGmeans to changes in its parameters: (Πmin, Πmax), defining the population-size limits, and N2, defining the termination criterion. In Table 4, we fix the termination criterion to N2 = 500 and consider a range of population-size parameters, reporting the average gap and median time over all datasets for each configuration. The choice of Πmin and Πmax appears to have only a limited impact on solution quality and CPU time: regardless of the parameter setting, HGmeans returns better average solutions than all individual best known solutions collected from the literature. Some differences can still be noted between configurations: as highlighted by pairwise Wilcoxon tests, every configuration underlined in the table performs better than every non-underlined one (with p-values ≤ 0.018). Letting the population rise to double the minimum population size (Πmax = 2Πmin) before the survivor selection is generally a good policy. Moreover, we observe that smaller populations trigger a faster convergence but at the risk of reducing diversity, whereas excessively large populations unnecessarily spread the search effort, with an adverse impact on solution quality.
Average Gap (%)  Median Time (s)

Πmax = 10  20  50  100  200  10  20  50  100  200

Πmin = 5  0.31  0.31  0.28  0.24  0.20  12.03  11.81  13.54  14.73  17.79
Πmin = 10  0.32  0.28  0.26  0.23  12.03  11.53  14.29  16.76
Πmin = 20  0.31  0.25  0.22  15.27  15.94  13.44
Πmin = 40  0.25  0.14  13.35  15.19
Πmin = 80  0.11  19.48

(Only configurations with Πmax ≥ 2Πmin are considered; the values in shorter rows correspond to the rightmost Πmax columns.)
Average Gap (%)  Median Time (s)

N2 = 50  100  250  500  1000  50  100  250  500  1000

(Πmin, Πmax) = (5,10)  0.20  0.07  0.25  0.31  0.35  1.87  3.31  7.40  12.03  29.10
(Πmin, Πmax) = (10,20)  0.52  0.02  0.22  0.32  0.36  1.54  4.13  8.25  12.03  29.90
In Table 5, we retain two of the best population-size configurations and vary the termination criterion N2. Naturally, the quality of the solutions improves with longer runs, but even the shortest termination criterion (N2 = 50) already gives good solutions. Finally, reducing the population size to (Πmin, Πmax) = (5, 10) for short runs allows us to better exploit a limited number of iterations, whereas the baseline setting (10, 20) performs slightly better for longer runs.
4.4 Scalability
The solution quality and computational efficiency of most clustering algorithms are known to deteriorate as the number of clusters grows, since this leads to more complex combinatorial problems with numerous local minima. To evaluate how HGmeans behaves in these circumstances, we conduct additional experiments focused on the large datasets of class C. Table 6 reports the solution quality and CPU time of HGmeans for each dataset as a function of m. Moreover, to explore the case where the CPU time is more restricted, Table 7 reports the same information for a fast HGmeans configuration with reduced population-size and termination parameters.
Gap (%)  Time (s)  

m = 2  3  5  10  15  20  25  2  3  5  10  15  20  25
D15112  0.00  0.00  0.00  0.00  0.00  0.00  0.03  2.80  5.23  4.97  9.36  37.12  22.45  40.74 
Pla85900  0.00  0.00  0.00  0.00  0.00  0.02  0.13  23.94  34.96  95.62  135.77  131.92  349.98  614.79 
Eye  0.00  0.00  0.00  0.01  0.00  0.00  0.16  4.38  5.22  13.60  91.01  121.20  509.55  630.08 
Shuttle  0.00  0.00  0.00  0.02  0.00  3.67  2.68  17.53  18.93  22.91  45.74  63.10  175.70  340.98 
Skin  0.00  0.00  0.00  0.00  1.63  0.89  0.38  66.23  90.92  96.21  176.57  336.70  213.78  631.37 
Kegg  0.00  0.00  0.00  0.00  1.29  1.26  1.03  41.45  66.68  90.63  117.53  226.37  424.89  743.02 
3Droad  0.00  0.00  0.00  0.00  0.00  0.00  0.00  444.22  535.59  498.42  2824.24  2582.08  5888.17  7261.91 
Gas  0.00  0.00  0.00  0.18  0.94  0.21  0.18  93.20  87.71  156.56  222.77  827.35  920.49  1342.93 
Online  0.00  0.00  0.00  0.00  0.00  0.01  1.20  109.80  87.27  190.75  333.71  285.21  1154.91  1150.95 
Sensorless  0.00  0.00  0.00  2.42  0.00  0.63  0.17  88.43  256.11  236.20  646.31  1004.29  2179.93  3132.41 
Isolet  0.00  0.00  0.00  0.00  0.15  0.39  0.96  255.04  322.27  751.86  748.94  1992.24  2521.89  6334.68 
Miniboone  0.00  0.00  0.00  0.00  0.00  0.12  0.57  209.08  565.88  585.91  1329.57  4758.04  5061.36  8080.78 
Gisette  0.00  0.00  0.02  0.00  0.51  1.85  1.28  2304.41  3896.54  10964.95  20617.57  31767.12  39283.79  47121.92 
Gap (%)  Time (s)  

m = 2  3  5  10  15  20  25  2  3  5  10  15  20  25
D15112  0.00  0.00  0.00  0.00  0.00  0.00  0.03  0.29  0.95  0.54  1.77  1.93  4.53  13.84 
Pla85900  0.00  0.00  0.00  0.00  0.00  0.02  0.13  3.21  3.86  5.22  11.84  29.58  42.44  53.23 
Eye  0.00  0.00  0.00  0.01  0.00  0.00  0.16  0.68  1.08  1.10  12.28  43.89  31.88  92.41 
Shuttle  0.00  0.00  0.00  0.46  0.02  3.55  2.68  2.03  2.36  6.50  12.40  35.60  57.68  67.04 
Skin  0.00  0.00  0.00  0.00  1.63  0.89  0.19  5.99  9.24  14.37  14.45  35.76  25.49  96.52 
Kegg  0.00  0.00  0.00  0.00  1.29  1.25  0.93  4.05  6.17  5.91  25.21  42.94  115.85  73.93 
3Droad  0.00  0.00  0.00  0.00  0.00  0.00  0.00  18.37  30.66  38.11  394.55  166.58  288.66  619.88 
Gas  0.00  0.00  0.00  0.18  0.94  0.21  0.18  4.98  11.07  13.74  69.15  94.68  106.94  182.29 
Online  0.00  0.00  0.00  0.00  0.00  0.01  1.20  8.61  18.12  23.07  39.39  59.10  62.54  218.82 
Sensorless  0.00  0.00  0.00  2.42  0.00  0.63  0.00  13.92  16.43  22.92  102.44  217.94  397.28  429.3 
Isolet  0.00  0.00  0.00  0.00  0.08  0.39  0.79  29.37  43.21  91.75  256.56  415.56  443.05  199.26 
Miniboone  0.00  0.00  0.00  0.00  0.00  0.12  0.57  17.46  37.82  39.95  246.58  647.88  731.73  1469.62 
Gisette  0.00  0.00  0.02  0.53  0.16  1.62  1.14  232.87  1067.71  1252.21  3190.44  5671.95  13542.21  12568.59 
As observed in Table 6, HGmeans retrieves or improves the BKS for all datasets and values of m. Significant improvements are more frequently observed for larger values of m. A likely explanation is that the global minimum has already been found for most datasets with a limited number of clusters, whereas previous methods did not succeed in finding the global optimum for larger values of m. In terms of computational effort, there is a visible correlation between the number of clusters and the CPU time. Power-law regressions of the form T(m) = α·m^β indicate that the computational effort of HGmeans grows polynomially with m, with exponents β that vary by dataset (Eye State and Miniboone behaving differently from the remaining cases). Similarly, for a fixed number of clusters, fitting the CPU time of the method as a power law of the form T(nd) = α·(nd)^β indicates that the measured CPU time of HGmeans grows as O(nd), i.e., linearly with the number of samples and the dimension of the feature space.
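Power-law fits of this kind can be reproduced by ordinary least squares in log–log space. The helper below is our own sketch, not the authors' code.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = alpha * x**beta by linear least squares on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    # Slope of the log-log regression line is the exponent beta.
    beta = (sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
            / sum((a - mean_x) ** 2 for a in lx))
    # Intercept gives log(alpha).
    alpha = math.exp(mean_y - beta * mean_x)
    return alpha, beta
```

Feeding it the (m, CPU time) pairs of one dataset recovers the exponent reported for that dataset.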
We observe a significant reduction in CPU time when comparing the results of the fast HGmeans in Table 7 with those of the standard version in Table 6. Considering the speed ratio between methods for each dataset, the fast configuration is on average seven times faster than the standard HGmeans and over 10 times faster than SAGA, MSMGKM, DCClust, DCDBundle and MSDCA. Figure 4 also displays the CPU time of HGmeans, its fast configuration, and the other algorithms listed in Section 4.3 as a function of m. Surprisingly, the solution quality did not deteriorate much when the termination criterion was reduced for these large datasets: the average percentage gap of the fast HGmeans remains close to that of the standard version, and its solutions are still much better than all solutions found in previous studies. Therefore, researchers interested in using HGmeans can easily adapt the termination criterion of the method, so as to obtain significant gains of solution quality within their own computational budget.
4.5 Solution Quality and Clustering Performance
The previous section has established that HGmeans finds better MSSC local minima than other state-of-the-art algorithms and exemplified the known fact that Kmeans or Kmeans++ solutions can be arbitrarily far from the global minimum. In most situations, using the most accurate method for a problem should be the best option. However, HGmeans is slower than a few runs of Kmeans or Kmeans++. To determine whether it is worth investing this additional effort, we must determine whether a better solution quality for the MSSC problem effectively translates into better clustering performance. There have been similar investigations in other machine learning subfields, e.g., to choose the amount of effort dedicated to training deep neural networks (see, e.g., Hoffer et al., 2017; Keskar et al., 2017).
To explore this, we conduct an experiment in which we compare the ability of HGmeans, Kmeans and Kmeans++ to classify 50,000 samples issued from a mixture of spherical Gaussian distributions N(μi, σi²I). For each i, the mean μi and standard deviation σi are uniformly selected in fixed intervals. This is a fundamental setting, without any hidden structure, in which we expect the MSSC model and the associated Kmeans variants to be a good choice, since these methods promote spherical and balanced clusters. To increase the challenge, we consider a medium to large number of Gaussians, with m ∈ {20, 50, 100, 200}, in feature spaces of medium to high dimensions d ∈ {20, 50, 100, 200, 500}. For each combination of m and d, we repeat the generation process until we obtain a mixture that is not fully separated but in which at least 99% of the pairs of Gaussians are separated in the sense of Dasgupta (1999). Such a mixture corresponds to Gaussians that significantly overlap. These datasets can be accessed at https://w1.cirrelt.ca/~vidalt/en/researchdata.html.
Tables 8 and 9 compare the results of HGmeans with those of Kmeans and Kmeans++ over a single run or 500 repeated runs, in terms of MSSC solution quality (as represented by the percentage gap) and cluster validity in relation to the ground truth. We use three external measures of cluster validity: the adjusted Rand index (CRand – Hubert and Arabie (1985)), the normalized mutual information (NMI – Kvalseth (1987)), and the centroid index (CI – Fränti et al. (2014)). CRand and NMI take continuous values in [−1, 1] and [0, 1], respectively. They converge toward 1 as the clustering solution becomes closer to the ground truth. CI takes integer values and measures the number of fundamental cluster differences between solutions, with a value of 0 indicating that the solution has the same cluster-level structure as the ground truth.
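The generation procedure described above can be sketched as follows. The sampling intervals in this snippet are illustrative placeholders of our own choosing, not the paper's values, and the separation test implements Dasgupta's (1999) c-separation criterion for spherical Gaussians.

```python
import math
import random

def gaussian_mixture(m, d, n, mu_range=(0.0, 10.0), sigma_range=(0.1, 1.0), seed=0):
    """Draw n samples from m spherical Gaussians N(mu_i, sigma_i^2 I).
    Means and standard deviations are uniformly selected in the given ranges."""
    rng = random.Random(seed)
    mus = [[rng.uniform(*mu_range) for _ in range(d)] for _ in range(m)]
    sigmas = [rng.uniform(*sigma_range) for _ in range(m)]
    samples, labels = [], []
    for _ in range(n):
        i = rng.randrange(m)  # equal mixing proportions
        samples.append([rng.gauss(mu, sigmas[i]) for mu in mus[i]])
        labels.append(i)
    return samples, labels, mus, sigmas

def c_separated(mu_i, sigma_i, mu_j, sigma_j, c=1.0):
    """Dasgupta's c-separation for spherical Gaussians in dimension d:
    ||mu_i - mu_j|| >= c * sqrt(d) * max(sigma_i, sigma_j)."""
    d = len(mu_i)
    return math.dist(mu_i, mu_j) >= c * math.sqrt(d) * max(sigma_i, sigma_j)
```

Rejecting mixtures until at least 99% of the Gaussian pairs pass this test, while the mixture as a whole is not fully separated, yields the overlapping benchmarks used above.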
As in the previous experiments, a single run of Kmeans or Kmeans++ visibly leads to shallow local minima that can be improved with multiple runs from different starting points. However, even 500 repetitions of these algorithms are insufficient to reach the solution quality of HGmeans, despite the similar CPU time. Kmeans performs better than Kmeans++ for these datasets, most likely because it is more robust to outliers when selecting initial center locations. A pairwise Wilcoxon test highlights significant differences between HGmeans and all other methods. Moreover, the dimension of the feature space is positively correlated with the inaccuracy (percentage gap) of the repeated Kmeans and Kmeans++ algorithms, which appear to be more easily trapped in low-quality local minima for feature spaces of larger dimension.
Comparing the external clustering performance of these methods (via CRand, NMI, and CI) leads to additional insights. For all three metrics, pairwise Wilcoxon tests highlight significant performance differences between HGmeans and the repeated Kmeans variants. We also observe a correlation between the solution quality (percentage gap) and the three external measures. Although the differences in the MSSC objective function values appear small at first glance (e.g., average gap of 4.87% for repeated Kmeans), these inaccuracies have a large effect on cluster validity, especially for datasets with feature spaces of higher dimension. When d = 500, HGmeans is able to exploit the increased amount of available independent information to find a close approximation to the ground truth (average CRand of 0.99, NMI of 1.00, and CI of 1.75), whereas repeated Kmeans and Kmeans++ reach shallow local optima and obtain inaccurate results (average CRand and NMI below 0.65 and 0.94, respectively, and average CI above 23.75). Classical distance metrics are known to become more uniform as the feature-space dimension grows, and the number of local minima of the MSSC quickly increases, so feature-reduction or subspace-clustering techniques are often recommended for high-dimensional datasets. In these experiments, however, the inaccuracy of repeated Kmeans (or Kmeans++) appears to be a direct consequence of its inability to find good-quality local minima, rather than a consequence of the MSSC model itself, since near-optimal solutions of the MSSC translate into accurate results.
Overall, we conclude that even for simple Gaussian-based distribution mixtures, finding good local minima of the MSSC problem is essential for accurate information retrieval. This is a major difference with studies on, for example, deep neural networks, where it is conjectured that most local minima have similar objective values, and where more intensive training (e.g., stochastic gradient descent with large batches) has adverse effects on generalization (Lecun et al., 2015). For clustering problems, it is essential to keep progressing toward faster and more accurate MSSC solvers, and to recognize when to use these high-performance methods for large and high-dimensional datasets.
BKS  Gap (%)  Time (s)

Objective  Kmeans  Kmeans++  HGmeans  Kmeans  Kmeans++  HGmeans  
m  d  Value  1 Run  500 Runs  1 Run  500 Runs  1 Run  500 Runs  1 Run  500 Runs  
20  20  5432601.91  0.73  0.0  1.15  0.0  0.0  2.40  668.93  3.00  764.76  1085.40 
20  50  12815114.52  6.19  0.0  3.75  1.15  0.0  2.86  860.95  3.17  1171.09  1308.96 
20  100  24266784.28  14.84  0.0  5.01  4.83  0.0  5.38  1243.53  4.75  1958.25  553.25 
20  200  59340268.17  17.70  2.57  11.79  7.00  0.0  14.90  2677.57  12.43  3938.29  1505.16 
20  500  125359202.26  16.53  8.06  25.35  8.00  0.0  30.13  6118.59  25.17  8389.50  2563.73 
50  20  5305274.24  0.47  0.0  0.43  0.0  0.0  5.03  2599.11  4.84  2755.17  3189.56 
50  50  13864882.54  2.10  0.0  3.22  0.72  0.0  7.28  2695.69  8.23  3258.11  4307.12 
50  100  25645070.92  8.86  3.70  12.04  5.76  0.0  10.78  4226.64  14.33  5871.70  2934.41 
50  200  52561077.57  14.62  7.76  19.90  9.92  0.0  22.98  7837.70  37.60  11063.60  9629.09 
50  500  143469250.17  16.92  9.79  20.0  11.11  0.0  38.89  14778.04  58.13  19077.48  18360.24 
100  20  5027688.54  0.34  0.12  0.54  0.04  0.0  19.79  7281.83  18.89  8435.48  13529.09 
100  50  12897680.57  3.07  1.17  4.81  2.25  0.0  12.07  6612.89  15.27  7962.07  10344.57 
100  100  27284752.32  6.30  4.67  10.58  6.89  0.0  24.43  11864.87  30.54  14991.49  6728.71 
100  200  51552765.51  14.03  7.97  15.78  11.13  0.0  34.63  14537.27  52.73  20128.89  20499.22 
100  500  130903680.95  18.90  15.61  22.71  15.69  0.0  61.45  25313.95  86.04  34062.29  38217.57 
200  20  4774890.45  0.72  0.45  1.24  0.53  0.0  42.85  18861.45  38.91  19896.36  38126.21 
200  50  13490838.00  1.97  1.16  2.88  1.86  0.0  34.49  18792.14  39.88  21036.63  28513.22 
200  100  27337380.17  8.08  5.29  9.68  7.56  0.0  70.30  30880.39  82.03  36219.66  39980.98 
200  200  52946223.09  15.77  11.70  19.67  14.45  0.0  74.33  37459.76  139.62  46365.94  67745.79 
200  500  135201463.76  20.97  17.32  23.83  19.28  0.0  142.85  63202.41  210.16  92765.62  93444.51 
CRand  NMI  CI  

Kmeans  Kmeans++  HGmeans  Kmeans  Kmeans++  HGmeans  Kmeans  Kmeans++  HGmeans  
m  d  1 Run  500 Runs  1 Run  500 Runs  1 Run  500 Runs  1 Run  500 Runs  1 Run  500 Runs  1 Run  500 Runs  
20  20  0.69  0.72  0.67  0.72  0.72  0.73  0.75  0.73  0.75  0.75  1  0  1  0  0 
20  50  0.76  0.98  0.86  0.92  0.98  0.91  0.98  0.94  0.96  0.98  3  0  2  1  0 
20  100  0.63  1.00  0.89  0.89  1.00  0.89  1.00  0.97  0.97  1.00  5  0  2  2  0 
20  200  0.47  0.94  0.61  0.83  1.00  0.84  0.98  0.89  0.95  1.00  7  1  5  3  0 
20  500  0.55  0.81  0.32  0.81  1.00  0.88  0.95  0.79  0.95  1.00  6  2  9  3  0 
50  20  0.58  0.59  0.57  0.59  0.59  0.67  0.68  0.67  0.68  0.68  1  0  2  0  0 
50  50  0.87  0.94  0.82  0.92  0.94  0.93  0.95  0.92  0.94  0.95  3  0  5  1  0 
50  100  0.76  0.90  0.59  0.85  1.00  0.95  0.98  0.92  0.96  1.00  9  4  12  6  0 
50  200  0.52  0.80  0.34  0.72  1.00  0.90  0.96  0.85  0.94  1.00  14  8  19  10  0 
50  500  0.41  0.69  0.24  0.39  1.00  0.88  0.94  0.83  0.91  1.00  16  9  16  10  0 
100  20  0.48  0.48  0.47  0.49  0.49  0.62  0.63  0.62  0.63  0.63  4  2  5  1  0 
100  50  0.80  0.86  0.78  0.84  0.91  0.91  0.93  0.90  0.92  0.94  9  4  13  6  0 
100  100  0.80  0.86  0.68  0.74  0.99  0.96  0.97  0.93  0.94  1.00  15  11  23  16  1 
100  200  0.63  0.79  0.53  0.74  0.99  0.93  0.96  0.92  0.95  1.00  27  16  30  20  1 
100  500  0.40  0.60  0.23  0.35  0.98  0.89  0.93  0.84  0.90  1.00  33  27  37  29  2 
200  20  0.39  0.40  0.38  0.39  0.41  0.59  0.59  0.58  0.59  0.60  22  14  25  20  6 
200  50  0.81  0.82  0.78  0.80  0.87  0.91  0.90  0.90  0.89  0.92  12  10  18  13  0 
200  100  0.71  0.81  0.66  0.73  0.96  0.94  0.95  0.94  0.94  0.99  38  27  49  38  5 
200  200  0.51  0.64  0.31  0.56  0.99  0.92  0.94  0.87  0.93  1.00  61  45  71  53  3 
200  500  0.41  0.50  0.26  0.33  0.98  0.90  0.92  0.85  0.89  1.00  65  57  74  60  5 
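For reference, the centroid index reported in Table 9 admits a compact implementation: each center of one solution is mapped to its nearest center in the other, CI counts the centers left unmapped, and the maximum over both directions is taken. The sketch below is our own rendering of the measure of Fränti et al. (2014).

```python
def centroid_index(centers_a, centers_b):
    """CI: number of cluster-level differences between two center sets.
    CI = 0 means the two solutions share the same cluster-level structure."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    def orphans(src, dst):
        # Indices of dst that receive at least one nearest-neighbor mapping.
        mapped = {min(range(len(dst)),
                      key=lambda j: squared_distance(c, dst[j]))
                  for c in src}
        return len(dst) - len(mapped)
    return max(orphans(centers_a, centers_b), orphans(centers_b, centers_a))
```

Comparing a solution's centers against the ground-truth Gaussian means in this way yields the CI values of Table 9.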
5 Conclusions and Perspectives
In this article, we have studied the MSSC problem, a classical clustering model of which the popular Kmeans algorithm constitutes a local minimizer. We have proposed a hybrid genetic algorithm, HGmeans, that combines the improvement capabilities of Kmeans as a local search with the diversification capabilities of problem-tailored genetic operators. The algorithm uses an exact minimum-cost matching crossover operator and an adaptive mutation procedure to generate strategic initial center positions for Kmeans and to promote a thorough exploration of the search space. Moreover, it uses population diversity management strategies to prevent premature convergence to shallow local minima.
We conducted extensive computational experiments to evaluate the performance of the method in terms of MSSC solution quality, computational effort and scalability, and external cluster validity. Our results indicate that HGmeans attains better local minima than all recent state-of-the-art algorithms. Large solution improvements are usually observed for large datasets with a medium-to-high number of clusters, since these characteristics lead to MSSC problems that have not been effectively solved by previous approaches. The CPU time of HGmeans is directly proportional to that of the Kmeans local-improvement procedure and to the number of iterations allowed (the termination criterion). It appears to grow linearly with the number of samples and feature-space dimensions, and the termination criterion can be adjusted to achieve solutions in a shorter time without a large impact on solution accuracy.
Through additional tests conducted on datasets generated via Gaussian mixtures, we observed a strong correlation between MSSC solution quality and cluster validity measures. A repeated Kmeans algorithm, for example, obtains solution inaccuracies (percentage gap to the best known local minima) that are small at first glance but highly detrimental for the outcome of the clustering task. In particular, a gap as small as 5% in the objective space can make the difference between accurate clustering and failure. This effect was observed in all Gaussian datasets studied, and it became more prominent in feature spaces of higher dimension. In those situations, the inability of Kmeans or Kmeans++ to provide satisfactory results seems to be tied to its inability to find goodquality local minima of the MSSC model, rather than to an inadequacy of the model itself.
Overall, beyond the immediate gains in terms of clustering performance, research into efficient optimization algorithms for MSSC remains linked to important methodological stakes. Indeed, a number of studies aim to find adequate models (e.g., MSSC) for different tasks and datasets. With that goal in mind, it is essential to differentiate between the limitations of the models themselves (inadequacy for a given task or data type) and those of the algorithms used to solve these models (shallow local optima). While an analysis using external measures (e.g., CRand, NMI or CI) allows a general evaluation of error (due to both sources of inaccuracy), only a precise investigation of a method's performance in the objective space can help evaluate the magnitude of each imprecision, and only accurate or exact optimization methods can give meaningful conclusions regarding model suitability. In future research, we plan to continue progressing in this direction, generalizing the proposed solution method to other clustering models, possibly considering the use of kernel transformations, integrating semi-supervised information in the form of must-link or cannot-link constraints, and pursuing the development of high-performance optimization algorithms for other classes of applications.
6 Acknowledgments
The authors thank the four anonymous referees for their detailed comments, which significantly contributed to improving this paper. This research is partially supported by CAPES, CNPq [grant number 308498/2015-1] and FAPERJ [grant number E-26/203.310/2016] in Brazil. This support is gratefully acknowledged.
References
 AlSultan (1995) AlSultan, K. 1995. A tabu search approach to the clustering problem. Pattern Recognition 28(9) 1443–1451.
 Aloise et al. (2009) Aloise, D., A. Deshpande, P. Hansen, P. Popat. 2009. NPhardness of Euclidean sumofsquares clustering. Machine Learning 75(2) 245–248.
 Aloise et al. (2012) Aloise, D., P. Hansen, L. Liberti. 2012. An improved column generation algorithm for minimum sumofsquares clustering. Mathematical Programming 131(1) 195–220.
 An et al. (2014) An, L.T.H., L.H. Minh, P.D. Tao. 2014. New and efficient DCA based algorithms for minimum sumofsquares clustering. Pattern Recognition 47(1) 388–401.
 Arthur and Vassilvitskii (2007) Arthur, D., S. Vassilvitskii. 2007. KMeans++: The advantages of careful seeding. SODA’07. Proceedings of the Eighteenth Annual ACMSIAM Symposium on Discrete Algorithms. SIAM, New Orleans, Louisiana, USA, 1027–1035.
 Bagirov (2008) Bagirov, A.M. 2008. Modified global kmeans algorithm for minimum sumofsquares clustering problems. Pattern Recognition 41(10) 3192–3199.
 Bagirov et al. (2016) Bagirov, A.M., S. Taheri, J. Ugon. 2016. Nonsmooth DC programming approach to the minimum sumofsquares clustering problems. Pattern Recognition 53 12–24.
 Bagirov et al. (2011) Bagirov, A.M., J. Ugon, D. Webb. 2011. Fast modified global kmeans algorithm for incremental cluster construction. Pattern Recognition 44(4) 866–876.
 Blum et al. (2011) Blum, C., J. Puchinger, G. Raidl, A. Roli. 2011. Hybrid metaheuristics in combinatorial optimization: A survey. Applied Soft Computing 11(6) 4135–4151.
 Dasgupta (1999) Dasgupta, S. 1999. Learning mixtures of Gaussians. 40th Annual Symposium on Foundations of Computer Science 1 634–644.
 Fränti et al. (1997) Fränti, P., J. Kivijärvi, T. Kaukoranta, O. Nevalainen. 1997. Genetic algorithms for largescale clustering problems. The Computer Journal 40(9) 547–554.
 Fränti et al. (2014) Fränti, P., M. Rezaei, Q. Zhao. 2014. Centroid index: Cluster level similarity measure. Pattern Recognition 47(9) 3034–3045.
 Hamerly (2010) Hamerly, G. 2010. Making kmeans even faster. SDM’10, SIAM International Conference on Data Mining. 130–140.
 Han et al. (2011) Han, J., M. Kamber, J. Pei. 2011. Data mining: concepts and techniques. 3rd ed. Morgan Kaufmann.
 Hansen and Mladenović (2001) Hansen, P., N. Mladenović. 2001. JMeans: A new local search heuristic for minimum sum of squares clustering. Pattern Recognition 34(2) 405–413.
 Hartigan and Wong (1979) Hartigan, J.A., M.A. Wong. 1979. Algorithm AS 136: A kmeans clustering algorithm. Applied Statistics 28(1) 100–108.
 Hoffer et al. (2017) Hoffer, E., I. Hubara, D. Soudry. 2017. Train longer, generalize better: Closing the generalization gap in large batch training of neural networks. Advances in Neural Information Processing Systems. 1729–1739.
 Holland (1975) Holland, J.H. 1975. Adaptation in Natural and Artificial Systems. The University of Michigan Press, Ann Arbor, MI.
 Hruschka et al. (2009) Hruschka, E.R., R.J.G.B. Campello, A.A. Freitas, A.C.P.L.F. de Carvalho. 2009. A survey of evolutionary algorithms for clustering. IEEE Transactions on Systems, Man and Cybernetics Part C: Applications and Reviews 39(2) 133–155.
 Hubert and Arabie (1985) Hubert, L., P. Arabie. 1985. Comparing partitions. Journal of Classification 2(1) 193–218.
 Ismkhan (2018) Ismkhan, H. 2018. I-k-means-+: An iterative clustering algorithm based on an enhanced version of the k-means. Pattern Recognition 79(1) 402–413.
 Jain (2010) Jain, A.K. 2010. Data clustering: 50 years beyond K-means. Pattern Recognition Letters 31(8) 651–666.
 Karmitsa et al. (2017) Karmitsa, N., A.M. Bagirov, S. Taheri. 2017. New diagonal bundle method for clustering problems in large data sets. European Journal of Operational Research 263(2) 367–379.
 Karmitsa et al. (2018) Karmitsa, N., A.M. Bagirov, S. Taheri. 2018. Clustering in large data sets with the limited memory bundle method. Pattern Recognition 83 245–259.
 Keskar et al. (2017) Keskar, N.S., D. Mudigere, J. Nocedal, M. Smelyanskiy, P.T.P. Tang. 2017. On large-batch training for deep learning: Generalization gap and sharp minima. ICLR’17, International Conference on Learning Representations.
 Kivijärvi et al. (2003) Kivijärvi, J., P. Fränti, O. Nevalainen. 2003. Self-adaptive genetic algorithm for clustering. Journal of Heuristics 9(2) 113–129.
 Krishna and Murty (1999) Krishna, K., M.N. Murty. 1999. Genetic k-means algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 29(3) 433–439.
 Kuhn (1955) Kuhn, H.W. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2(1–2) 83–97.
 Kvalseth (1987) Kvalseth, T.O. 1987. Entropy and correlation: Some comments. IEEE Transactions on Systems, Man and Cybernetics 17(3) 517–519.
 LeCun et al. (2015) LeCun, Y., Y. Bengio, G. Hinton. 2015. Deep learning. Nature 521 436–444.
 Likas et al. (2003) Likas, A., N. Vlassis, J.J. Verbeek. 2003. The global k-means clustering algorithm. Pattern Recognition 36(2) 451–461.
 Lloyd (1982) Lloyd, S.P. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2) 129–137.
 Lu et al. (2004) Lu, Y., S. Lu, F. Fotouhi, Y. Deng, S.J. Brown. 2004. FGKA: A fast genetic K-means clustering algorithm. Proceedings of the 2004 ACM Symposium on Applied Computing 622–623.
 Maulik and Bandyopadhyay (2000) Maulik, U., S. Bandyopadhyay. 2000. Genetic algorithm-based clustering technique. Pattern Recognition 33 1455–1465.
 Merz and Zell (2002) Merz, P., A. Zell. 2002. Clustering gene expression profiles with memetic algorithms. Parallel Problem Solving from Nature. Springer, 811–820.
 Ordin and Bagirov (2015) Ordin, B., A.M. Bagirov. 2015. A heuristic algorithm for solving the minimum sum-of-squares clustering problems. Journal of Global Optimization 61(2) 341–361.
 Sarkar et al. (1997) Sarkar, M., B. Yegnanarayana, D. Khemani. 1997. A clustering algorithm using an evolutionary programming-based approach. Pattern Recognition Letters 18(10) 975–986.
 Scheunders (1997) Scheunders, P. 1997. A comparison of clustering algorithms applied to color image quantization. Pattern Recognition Letters 18 1379–1384.
 Selim and Alsultan (1991) Selim, S.Z., K. Alsultan. 1991. A simulated annealing algorithm for the clustering problem. Pattern Recognition 24(10) 1003–1008.
 Sörensen and Sevaux (2006) Sörensen, K., M. Sevaux. 2006. MA|PM: Memetic algorithms with population management. Computers & Operations Research 33(5) 1214–1225.
 Steinley (2006) Steinley, D. 2006. K-means clustering: A half-century synthesis. British Journal of Mathematical and Statistical Psychology 59(1) 1–34.
 Sugar and James (2003) Sugar, C.A., G.M. James. 2003. Finding the number of clusters in a dataset. Journal of the American Statistical Association 98(463) 750–763.
 Vidal et al. (2012) Vidal, T., T.G. Crainic, M. Gendreau, N. Lahrichi, W. Rei. 2012. A hybrid genetic algorithm for multi-depot and periodic vehicle routing problems. Operations Research 60(3) 611–624.