1 Introduction
Grouping problems generally involve partitioning a set of items into mutually disjoint groups or clusters according to some guiding decision criteria and imperative constraints [falkenauer1998genetic]. For example, given a graph $G = (V, E)$ with node set $V$ and edge set $E$, the popular NP-hard graph coloring problem is to color the nodes of $G$ with $k$ available colors in such a way that two nodes linked by an edge of $E$ receive two different colors. Equivalently, this problem can be stated as a grouping problem, i.e., to partition the nodes of $V$ into $k$ color groups such that each color group gathers the nodes receiving the same color. Additional examples of grouping problems include other coloring problems (e.g., equitable coloring and sum coloring) [Lewis16], and well-known NP-hard problems such as graph partitioning [BenlicH13, BulucMSS016], bin packing [Falkenauer96] and maximally diverse grouping [BrimbergMU15, LaiH16]. Grouping problems are also useful models for formulating a number of relevant applications related to, e.g., assembly line balancing [Falkenauer97], timetabling [LewisP07], single batch-machine scheduling [KashanKK13] and routing [AbbasiPooyaK17]. More illustrations of grouping problems and their applications can be found in [falkenauer1998genetic, KashanKK13, zhou2016reinforcement].
From the perspective of computational complexity, many grouping problems, including those mentioned above, are difficult search problems. As a result, finding satisfactory solutions to such problems represents a real challenge from an algorithmic perspective. Given the importance of grouping problems, various solution methods have been investigated in the literature. However, a majority of existing studies focus on particular problems (e.g., the above-mentioned ones) and it is difficult to generalize their results to other settings. Among the scarce general studies on grouping problems, grouping genetic algorithms (GGA) introduced in
[falkenauer1998genetic] are certainly the most representative and significant. As a specialization of the general framework of genetic algorithms, GGA defines a special group encoding scheme and associated genetic operators that manipulate groups instead of group members. Following the idea of GGA, a grouping version of the particle swarm optimization algorithm (PSO) was adapted to grouping problems in
[KashanKK13]. In the adapted method (called GPSO), the particle position and velocity updating equations of PSO are modified to preserve the major characteristics of the original PSO equations while taking into account the structure of grouping problems. Contrary to GGA, which examines discrete spaces only, GPSO works in both continuous and discrete spaces. Recently, an original reinforcement learning based local search approach (RLS) was proposed in
[zhou2016reinforcement]. RLS introduces the idea of using a probability matrix to specify an item-to-group assignment and uses information learned from improved solutions given by local search to update the probability matrix. As such, the method learns a generative model of solutions that is used to iteratively seed a local search procedure. RLS provides one of the main inspirations for our work.
In this work we are interested in solving grouping problems from a general perspective by proposing a generic method that can be applied to different grouping problems. The contributions of this work can be summarized as follows.
First, we introduce a weight learning based solution framework, called GpGrad, for solving grouping problems. We adopt the idea of using a weight matrix to specify a relative confidence score for each item to belong to each group. This weight matrix is updated step by step by gradient descent based learning in order to reach a legal solution. This approach benefits from GPU-accelerated training and can handle multiple solution evaluations in parallel. As the first study using gradient descent based weight learning to solve grouping problems, this work enriches the still limited toolkit of general solution methods for this important class of problems.
Second, we demonstrate the usefulness of the proposed framework by applying it to two well-known NP-hard grouping problems: the conventional Graph Coloring Problem (GCP) and the Equitable graph Coloring Problem (ECP), which have been intensively studied in the literature (see e.g., [galinier1999hybrid, lu2010memetic, titiloye2011graph, moalic2018variations, zhou2018improving] for the GCP and [diaz2014tabu, lai2015backtracking, sun2017feasible, wang2018tabu] for the ECP). For both problems, we present large computational experiments on well-known benchmark graphs and show the competitiveness of our approach compared to the best performing state-of-the-art algorithms. In particular, we report improved best results for several large geometric graphs and large random graphs for the ECP.
Third, this work shows the viability of formulating a discrete grouping problem as a real-valued weight learning problem. The competitive results on two challenging, representative grouping problems invite further investigation of the proposed approach on other grouping problems and applications.
The remainder of the paper is organized as follows. Section 2 describes the proposed framework for solving grouping problems. Sections 3 and 4 are devoted to the application of the framework to the GCP and ECP respectively. Section 5 provides illustrative toy examples on the gradient descent learning. Section 6 reports on computational results of our algorithms compared to the state of the art. Section 7 reviews related heuristics of the literature for the GCP and ECP. Section 8 discusses the contributions and presents perspectives for future work.
2 Gradient descent based learning for grouping problems
Given a set $V$ of $n$ distinct items, the task of a grouping problem is to partition the items of $V$ into $k$ different groups $V_1, \dots, V_k$, such that $\bigcup_{i=1}^{k} V_i = V$ and $V_i \cap V_j = \emptyset$ for $i \neq j$, while taking into account some specific constraints and optimization objectives. We denote by $\Omega$ the search space of all possible grouping assignments with $k$ groups (candidate solutions) for a given problem instance.
In this work, we assume that the number of groups $k$ is fixed. For grouping problems where the task is to find a minimum number of groups, such as the bin packing problem and the graph coloring problem, we can successively solve the problem with decreasing values of $k$. Each time a solution is found for a given $k$, it gives an upper bound of the optimal number of groups. For instance, for the general graph coloring problem, which aims to determine the chromatic number of a graph (i.e., the smallest number of colors for which a legal coloring exists), we typically solve a series of coloring problems with decreasing $k$ [galinier2013recent].
According to this definition of a grouping problem, a candidate solution in the search space $\Omega$ can be represented as a collection of groups: $S = \{V_1, \dots, V_k\}$. In this work, we use an equivalent matrix formulation of a candidate solution: for a fixed number of groups $k$, a candidate solution is represented as a matrix $S$ in $\{0, 1\}^{n \times k}$, with $s_{ij} = 1$ and $s_{il} = 0$ for $l \neq j$, if the $i$th item belongs to the group $V_j$. The $i$th row of the candidate solution matrix defines the group assignment vector of the $i$th item and is denoted by $S_i$. We assume that the constraints and the optimization objective of the grouping problem can be formulated with a single evaluation function $f$, where $f(S)$ is the evaluation (fitness) of the candidate solution $S$. The goal of the problem is to find a solution $S^*$ such that $f(S^*)$ is minimum.
2.1 Weight formulation of the problem
Inspired by the work of [zhou2016reinforcement], we express a (discrete) grouping (a candidate solution) as a continuous grouping by defining a weight matrix $W$ in $\mathbb{R}^{n \times k}$, composed of $n$ real-valued vectors of size $k$ (note that the weight matrix can contain any real number, while the probability matrix of [zhou2016reinforcement] is composed of nonnegative numbers). This weight matrix corresponds to a learned, soft assignment of each item to each group. The weight matrix representation has the advantage of being more flexible and more informative than the binary group assignment. Indeed, during the search, for each item, the algorithm can learn to reject some group assignments by decreasing the associated weights, confirm some group assignments by increasing their weights, or even set almost equal weights for uncertain group assignments. The $i$th row of the weight matrix defines the weight vector of the $i$th item and is denoted by $W_i$.
Item-to-group assignment. Given a weight matrix $W$, the associated solution (grouping) $S$ can be derived using the following procedure: for $i = 1, \dots, n$, $s_{ij} = 1$ if $j = \arg\max_{l \in \{1, \dots, k\}} w_{il}$ and $s_{ij} = 0$ otherwise. Each item selects the group with the maximum weight in its weight vector $W_i$. By abuse of notation and for simplicity, we rewrite this group selection for all items as a single function:

S = \mathrm{one\_hot}(\mathrm{argmax}(W)) \qquad (1)

where the argmax and one_hot operations are applied along the last axis of the matrix (group axis). Figure 1 shows an example with $n$ items and $k$ groups.
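Equation (1) amounts to a row-wise argmax followed by one-hot encoding. As an illustration (not part of the original algorithm's implementation), here is a minimal numpy sketch with hypothetical weight values:

```python
import numpy as np

def one_hot_argmax(W):
    """S = one_hot(argmax(W)): each item (row) selects the group (column)
    holding its maximum weight."""
    n, k = W.shape
    S = np.zeros((n, k), dtype=int)
    S[np.arange(n), W.argmax(axis=1)] = 1   # argmax along the group axis
    return S

W = np.array([[ 0.2, 1.5, -0.3],    # item 0 -> group 1
              [-1.0, 0.1,  0.4],    # item 1 -> group 2
              [ 2.0, 0.0,  0.0],    # item 2 -> group 0
              [ 0.3, 0.3,  0.9]])   # item 3 -> group 2
S = one_hot_argmax(W)
# S == [[0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 0, 1]]
```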
Fitness evaluation. Using the current solution $S$, we can compute the fitness function $f(S)$ of the problem. As $S$ derives from $W$, the given grouping problem then becomes that of finding a weight matrix $W$ such that $f(S)$ is minimum. As we explain below, we use first order gradient descent to solve this continuous optimization problem.
Gradient evaluation. To tackle this real-valued optimization problem by first order gradient descent over $W$, we need to compute the gradient of the evaluation function with respect to $W$. This is a real matrix of size $n \times k$ whose element $(i, j)$ is $\partial f(S) / \partial w_{ij}$. By applying the chain rule [rumelhart1985learning], it gives for $i = 1, \dots, n$ and $j = 1, \dots, k$:

\frac{\partial f(S)}{\partial w_{ij}} = \sum_{u=1}^{n} \sum_{v=1}^{k} \frac{\partial f(S)}{\partial s_{uv}} \, \frac{\partial s_{uv}}{\partial w_{ij}} \qquad (2)
2.2 Softmax approximation
The function $S = \mathrm{one\_hot}(\mathrm{argmax}(W))$ is differentiable almost everywhere with respect to $W$, but each term $\partial s_{uv} / \partial w_{ij}$ entering equation (2) is always equal to zero wherever it is defined. Therefore, we use the softmax function as a continuous, differentiable approximation of argmax, with an informative gradient.
$S$ can be approximated by a matrix $P$ of size $n \times k$, where each coefficient $p_{ij}$ is computed from the elements of the weight vector $W_i$ using the softmax function [bridle1990training] as:

p_{ij} = \frac{e^{\beta w_{ij}}}{\sum_{l=1}^{k} e^{\beta w_{il}}} \qquad (3)
Each coefficient $p_{ij}$ of the matrix $P$ denotes the probability that the $i$th item selects the $j$th group as its group, and $\beta$ is a control parameter in $(0, +\infty)$. In the following we rewrite this probability evaluation for all items as a single matrix equation:

P = \mathrm{softmax}(\beta W) \qquad (4)

where the softmax function is applied along the last axis (group axis) for each item.
For any fixed $W$, as $\beta$ goes toward infinity, $P$ converges toward $S$: for each item $i$, the probability mass concentrates on the group with the maximum weight in $W_i$, and the probability vector $P_i$ tends to the one-hot vector $S_i$. Figure 2 shows this smooth approximation for values of the parameter $\beta$ ranging from 0.1 to 100. When the value of $\beta$ increases, the probability vector goes toward $S_i$.
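The effect of the control parameter $\beta$ can be checked numerically. A small sketch with a hypothetical weight vector (the values are illustrative only):

```python
import numpy as np

def softmax(x, beta=1.0):
    """Numerically stable softmax along the last axis (group axis)."""
    z = beta * x - (beta * x).max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

W_i = np.array([0.3, 1.1, 0.5])      # hypothetical weight vector of one item
for beta in (0.1, 1.0, 100.0):
    print(np.round(softmax(W_i, beta), 3))
# As beta grows, the distribution concentrates on the argmax group (here the
# second one) and P_i converges toward the one-hot vector S_i.
```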
This kind of softmax approximation has notably been used to solve assignment problems in statistical physics frameworks [peterson1989new, gold1996softmax], where at each iteration step of the optimization process the control parameter $\beta$ is increased in a deterministic annealing scheme, such that $P$ goes toward $S$. More recently, this kind of softmax relaxation has been employed to learn probabilistic sampling of categorical variables in stochastic computational graphs [jang2016categorical], or to apply parameter regularization for automatic learning of neural network structures [louizos2017learning, kalainathan2018sam].
2.3 Gradient descent learning of the weights
For $u, i \in \{1, \dots, n\}$ and $v, j \in \{1, \dots, k\}$, we approximate $\partial s_{uv} / \partial w_{ij}$ by $\delta_{iu} \, p_{iv} (\delta_{jv} - p_{ij})$, with the Kronecker symbol $\delta_{ab}$ equal to 1 if $a = b$ and 0 otherwise (details given in Appendix A). Using this Straight-Through (ST) gradient estimator [bengio2013estimating] of $\partial s_{uv} / \partial w_{ij}$ in equation (2), we obtain for $i = 1, \dots, n$ and $j = 1, \dots, k$:

\frac{\partial f(S)}{\partial w_{ij}} \approx \sum_{u=1}^{n} \sum_{v=1}^{k} \frac{\partial f(S)}{\partial s_{uv}} \, \delta_{iu} \, p_{iv} (\delta_{jv} - p_{ij}) \qquad (5)

= \sum_{v=1}^{k} \frac{\partial f(S)}{\partial s_{iv}} \, p_{iv} (\delta_{jv} - p_{ij}) \qquad (6)

= p_{ij} \left( \frac{\partial f(S)}{\partial s_{ij}} - \sum_{v=1}^{k} p_{iv} \, \frac{\partial f(S)}{\partial s_{iv}} \right) \qquad (7)
In our framework however, we do not use any sequential loop to compute the gradient of each parameter. Instead, we simply compute a tensor product to get the full gradient matrix, an operation easily parallelized on GPU devices. With tensor operations, the equations (7) for $i = 1, \dots, n$ and $j = 1, \dots, k$ then become a single equation:

\nabla_W f(S) = P \odot \left( \nabla_S f(S) - \left( \nabla_S f(S) \odot P \right) \cdot J \right) \qquad (8)

where $\nabla_W f(S)$ is the gradient matrix of size $n \times k$ of $f$ with respect to $W$, $J$ is a matrix of 1's of size $k \times k$, $\odot$ is the Hadamard product (elementwise product) and $\cdot$ is the dot product between two matrices. This gradient depends on the grouping assignment problem. In general, as we will see later with practical implementations (Section 3), this gradient is well defined and can easily be derived with tensor calculation.
Once the gradient with respect to the matrix of parameters $W$ is computed according to equation (8), we use first order gradient descent to update the parameters at each iteration:

W \leftarrow W - \eta \, \nabla_W f(S) \qquad (9)

where $\eta$ is a fixed learning rate. This kind of first order optimization is classically employed to train neural networks (in its stochastic version) and is known to be very efficient for learning models with a huge number of parameters. Alternative gradient updates could be applied, such as second order gradient descent with the Hessian matrix, an adaptive learning rate or momentum [kingma2014adam]. However, for simplicity, we employ a plain and fast first order gradient descent procedure in this work.
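The ST update of equations (8) and (9) can be written with a few array operations. The numpy sketch below is an illustration, not the paper's implementation; the helper names and numerical values are hypothetical, and the update formula is the one spelled out in the docstring:

```python
import numpy as np

def st_gradient(grad_S, P):
    """ST estimate of the gradient w.r.t. W (cf. equation (8)), computed as
    grad_W = P * (grad_S - (grad_S * P) @ J), with J the k-by-k all-ones matrix."""
    k = P.shape[-1]
    J = np.ones((k, k))
    return P * (grad_S - (grad_S * P) @ J)

def gradient_step(W, grad_S, P, eta=0.1):
    """First order gradient descent update of the weights (equation (9))."""
    return W - eta * st_gradient(grad_S, P)

# One item, three groups: weight moves toward groups with low fitness gradient.
P = np.array([[0.2, 0.5, 0.3]])
g = np.array([[1.0, 2.0, 3.0]])      # hypothetical values of the gradient w.r.t. S
# st_gradient(g, P) is approximately [[-0.22, -0.05, 0.27]]
```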
2.4 Weight smoothing
Inspired by the Reinforcement Learning Search (RLS) framework [zhou2016reinforcement], we apply a weight smoothing procedure in order to give the method the possibility to forget old decisions made long ago that are no longer helpful: every $R$ iterations, each parameter of the weight matrix is divided by a constant $\rho > 1$, which can be done with a single tensor computation according to:

W \leftarrow W / \rho \qquad (10)

One notices that if $\rho$ goes toward infinity, $W$ goes toward zero and the probability matrix $P$ goes toward a matrix whose elements are all equal to $1/k$ (corresponding to forgetting everything).
2.5 General GpGrad framework
At initialization, each weight of the matrix $W$ is sampled according to a normal distribution with zero mean and a given standard deviation. Then, at each iteration, the proposed GpGrad framework involves four steps summarized in Algorithm 1: group selection, fitness evaluation, weight update by gradient descent, and weight smoothing. We highlight in blue the steps of the framework that are problem dependent, i.e., the fitness and gradient evaluations. All other steps can be used as they stand for different grouping problems.
The only stochastic part in GpGrad concerns the initialization of the weights, which are drawn randomly according to a normal law. This random initialization has an impact on the search. Therefore, for practical performance evaluation, a GpGrad algorithm is run multiple times with different random seeds for the pseudo-random generator of the normal law.
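The four steps of Algorithm 1 can be sketched as a short numpy loop. This is a single-solution illustration under simplifying assumptions (in particular, the gradient of $f$ with respect to $S$ is applied directly to $W$, the simplest straight-through variant); the function name and parameter values are hypothetical:

```python
import numpy as np

def gp_grad(fitness, grad_S, n, k, eta=0.1, rho=1.1, R=50,
            max_iter=10_000, sigma=1.0, seed=0):
    """Minimal single-solution sketch of the GpGrad loop (cf. Algorithm 1).
    `fitness(S)` and `grad_S(S)` are the two problem dependent callbacks."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, sigma, size=(n, k))    # random normal initialization
    for t in range(1, max_iter + 1):
        S = np.zeros((n, k), dtype=int)        # group selection: one_hot(argmax(W))
        S[np.arange(n), W.argmax(axis=1)] = 1
        if fitness(S) == 0:                    # legal solution found
            return S
        W -= eta * grad_S(S)                   # gradient descent on the weights
        if t % R == 0:                         # weight smoothing
            W /= rho
    return S

# Demo on the smallest GCP instance: a single edge colored with 2 colors.
A = np.array([[0, 1],
              [1, 0]])
f = lambda S: int((A * (S @ S.T)).sum() // 2)  # GCP fitness, cf. Section 3
S = gp_grad(f, lambda S: A @ S, n=2, k=2)      # GCP gradient, cf. Section 3
```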
3 TensCol: an application of the GpGrad framework for graph coloring
In this section we present TensCol, an implementation of the GpGrad framework for the Graph Coloring Problem, which is a well-known grouping problem. Specifically, we consider the coloring problem with a given number of colors $k$, which aims to find a legal $k$-coloring of a given graph $G = (V, E)$, that is, a partition of the nodes (i.e., the items) of $V$ into $k$ color groups such that no two nodes of a color group are linked by an edge of $E$ (such a color group is also called an independent set). In other words, in a legal coloring, two nodes linked by an edge necessarily receive two different colors and thus belong to different color groups.
For the purpose of optimization, we define an evaluation (fitness) function that counts the number of color conflicts. Formally, for a given graph $G = (V, E)$ with $k$ available colors, a candidate solution is a partition of the vertices of $V$ into $k$ color groups $S = \{V_1, \dots, V_k\}$ (in other words, $V_i$ is the group of vertices receiving color $i$). The evaluation function counts the conflicting edges induced by $S$:

f(S) = \sum_{\{u, v\} \in E} \delta(u, v) \qquad (11)

where $\delta(u, v) = 1$ if $u \in V_i$ and $v \in V_i$ for some $i \in \{1, \dots, k\}$, and $\delta(u, v) = 0$ otherwise. A candidate solution $S$ is a legal coloring when $f(S) = 0$. The objective of the GCP is then to minimize $f$, i.e., the number of conflicting edges, in order to find a legal coloring in the search space.
The GCP has the following property concerning symmetric solutions. Let $\sigma$ be an arbitrary permutation of the set of colors $\{1, \dots, k\}$ and let $S = \{V_1, \dots, V_k\}$ be a $k$-coloring. Then $\sigma(S) = \{V_{\sigma(1)}, \dots, V_{\sigma(k)}\}$ is also a $k$-coloring, strictly equivalent to $S$. In other words, $S$ and $\sigma(S)$ are two symmetric solutions representing exactly the same coloring. This property implies that the names of the color groups in a coloring are irrelevant and interchangeable. Symmetric solutions are known to be a source of difficulties for many search algorithms applied to the GCP [ramani2006breaking]. In this work, we show how this problem can be easily avoided within the proposed GpGrad approach.
3.1 Matrix formulation of the graph coloring problem
For a graph $G$ with $n$ nodes, we note $A$ its symmetric adjacency matrix: there is an edge between two nodes $u$ and $v$ in $G$ if and only if $a_{uv} = 1$. For this problem, $S$ in $\{0, 1\}^{n \times k}$ is the matrix corresponding to the color assignment (or grouping assignment) of each node $v$ to a color $i$ (or group $V_i$). For $v = 1, \dots, n$, $\sum_{i=1}^{k} s_{vi} = 1$, as one and only one color has to be assigned to each node.
If we compute $M = S \cdot S^{T}$, we obtain a matrix in $\{0, 1\}^{n \times n}$ with coefficients:

m_{uv} = \sum_{j=1}^{k} s_{uj} \, s_{vj} \qquad (12)

One notices that $m_{uv} = 1$ if and only if the two nodes $u$ and $v$ have the same color (or belong to the same group). Interestingly enough, this association matrix $M$ is invariant under permutation of the colors (corresponding to a permutation of the columns of $S$).
We also introduce the wrong grouping assignment matrix, or conflict matrix, $C = A \odot M$ in $\{0, 1\}^{n \times n}$, where $\odot$ refers to the elementwise product (Hadamard product). $C$ has coefficients $c_{uv} = a_{uv} \, m_{uv}$, such that $c_{uv} = 1$ if and only if the two nodes $u$ and $v$ are in conflict, meaning that they are in the same group ($m_{uv} = 1$) and there exists an edge between them in $E$ ($a_{uv} = 1$). The total number of conflicts, or score, of a candidate solution is then:

f(S) = \frac{1}{2} \sum_{u=1}^{n} \sum_{v=1}^{n} c_{uv} \qquad (13)

For simplicity we rewrite this equation as:

f(S) = \frac{1}{2} \, \Sigma(C) \qquad (14)

= \frac{1}{2} \, \Sigma\big(A \odot (S \cdot S^{T})\big) \qquad (15)

with $\Sigma$ corresponding to the summation of all matrix elements.
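As an illustration, the computation of equations (12)-(15) takes a few lines of numpy. The instance below (a 4-cycle) is a hypothetical toy graph, not one of the paper's benchmarks:

```python
import numpy as np

# Adjacency matrix of a 4-cycle: edges 0-1, 1-2, 2-3, 3-0
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])

S_good = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])  # alternating 2-coloring
S_bad  = np.array([[1, 0], [1, 0], [1, 0], [1, 0]])  # every node in group 0

def score(S):
    M = S @ S.T          # association matrix, equation (12)
    C = A * M            # conflict matrix (Hadamard product)
    return C.sum() // 2  # each conflicting edge is counted twice, eqs (13)-(15)

# score(S_good) == 0 (legal coloring); score(S_bad) == 4 (all four edges conflict)
```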
3.2 Gradient evaluation
Using the matrix formulation of , we can now compute the gradient of the GCP problem that we will use in the GpGrad framework of Algorithm 1 in order to learn the group assignments.
As $f(S) = \frac{1}{2} \Sigma(A \odot (S \cdot S^{T}))$, for $i = 1, \dots, n$ and $j = 1, \dots, k$, the partial derivative of $f$ with respect to an element $s_{ij}$ of $S$ is (details given in Appendix B):

\frac{\partial f(S)}{\partial s_{ij}} = \sum_{u=1}^{n} a_{ui} \, s_{uj} \qquad (16)
The interpretation of equation (16) is clear: when $s_{ij}$ goes from the value 0 to 1, we expect the cost to increase by the value $\sum_{u=1}^{n} a_{ui} s_{uj}$, which is effectively the number of nodes adjacent to $i$ having the color $j$, and thus equal to the number of new conflicts created when the color (or group) $j$ is assigned to the node $i$.
Using matrix computation, equation (16) for $i = 1, \dots, n$ and $j = 1, \dots, k$ then becomes:

\nabla_S f(S) = A \cdot S \qquad (17)

where the gradient $\nabla_S f(S)$ is a real matrix of size $n \times k$ whose elements are equal to $(A \cdot S)_{ij}$. (Notice that the term $A \cdot S$ corresponds to the matrix used in the efficient implementation, proposed in [fleurent1996genetic], of the popular TabuCol algorithm for the GCP [hertz1987using].)
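The gradient of equation (17) can be read directly as a table of conflict counts. A small sketch on a hypothetical coloring of the same kind of toy 4-cycle:

```python
import numpy as np

A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])            # 4-cycle

S = np.array([[1, 0],
              [1, 0],
              [1, 0],
              [0, 1]])                  # nodes 0, 1, 2 share color 0

G = A @ S                               # gradient matrix, equation (17)
# G[v, j] = number of neighbors of node v currently holding color j,
# i.e., the number of conflicts created if color j is assigned to v:
# G == [[1, 1], [2, 0], [1, 1], [2, 0]]
```

Gradient descent thus pushes nodes 1 and 3 away from the conflicted color 0, which matches the interpretation of equation (16).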
3.3 Algorithm
By incorporating equations (15) and (17) in the GpGrad framework of Algorithm 1, we obtain a first coloring algorithm whose pseudo-code is displayed in Algorithm 2. The algorithm runs until a stopping condition is reached: when $f(S) = 0$ (i.e., a legal coloring is found) or when the number of iterations exceeds a given parameter.
One can notice that, contrary to local search based coloring algorithms where only one node changes its color at each iteration, this algorithm may change the colors of many nodes at each iteration. We will highlight this property in Section 5 on a toy example. It proves to be quite useful for dealing with large graphs.
However, this single-solution update can easily get trapped in local optima and does not enable a large exploration of the search space for difficult instances. Therefore, we propose an extension of this algorithm, called TensCol, which runs multiple candidate solutions in parallel on a CUDA device to significantly increase search diversification and reinforce search intensification as well.
3.4 TensCol, a population-based implementation for the GCP
To improve the search capacity of the algorithm presented in the last section, we introduce below a population-based extension of this algorithm.
3.4.1 Population-based implementation with multiple parallel candidate solutions
The first idea is to simultaneously minimize a set of $D$ candidate solutions with CUDA parallel computation. Each solution corresponds to a candidate coloring that we note $S^{(d)}$, with $d = 1, \dots, D$. The number $D$ of parallel candidate solutions is a parameter of the method for the graph coloring problem. We use the same GpGrad framework, but instead of matrices, we use three-dimensional tensors, to take advantage of CUDA computation on GPU hardware.
We use the same notation as in Section 2, except that we write the different variables in bold in order to indicate that we use three-dimensional tensors instead of matrices. Each of these three-dimensional tensors has a first axis of size $D$; the other axes are of size $n$ and $k$.
The tensor of weights then becomes $\mathbf{W}$ in $\mathbb{R}^{D \times n \times k}$. The probability tensor $\mathbf{P}$ for the $D$ solutions can be computed in parallel from $\mathbf{W}$ with $\mathbf{P} = \mathrm{softmax}(\beta \mathbf{W})$, where softmax denotes the softmax tensor computation along the last axis of the three-dimensional tensor $\mathbf{W}$. The global tensor of the $D$ candidate solutions, $\mathbf{S}$ in $\{0, 1\}^{D \times n \times k}$, is defined as $\mathbf{S} = \mathrm{one\_hot}(\mathrm{argmax}(\mathbf{W}))$, where the argmax and one_hot operators are applied along the last axis (group axis).
We note $\mathbf{S}^{T}$ the transposed three-dimensional tensor of solutions obtained from $\mathbf{S}$ by swapping its second and third axes. For the $D$ candidate solutions, the association tensor is then $\mathbf{M} = \mathbf{S} \cdot \mathbf{S}^{T}$, where $\cdot$ is the dot product between two tensors, corresponding to the sum product over the last axis of $\mathbf{S}$ and the second-to-last axis of $\mathbf{S}^{T}$. For $d = 1, \dots, D$ and $u, v = 1, \dots, n$, an element of $\mathbf{M}$ is given by:

m^{(d)}_{uv} = \big( S^{(d)} \cdot S^{(d)T} \big)_{uv} \qquad (18)

= \sum_{j=1}^{k} s^{(d)}_{uj} \, s^{(d)}_{vj} \qquad (19)
We note $\mathbf{A}$ the three-dimensional tensor of size $D \times n \times n$ such that, for $d = 1, \dots, D$, each of its elements is $a^{(d)}_{uv} = a_{uv}$. It corresponds to $D$ duplications of the same adjacency matrix $A$.
By performing the elementwise product between the tensors $\mathbf{A}$ and $\mathbf{M}$, we obtain the global conflict tensor for the $D$ solutions as $\mathbf{C} = \mathbf{A} \odot \mathbf{M}$, where each element of $\mathbf{C}$ is given by:

c^{(d)}_{uv} = a_{uv} \, m^{(d)}_{uv} \qquad (20)

= a_{uv} \sum_{j=1}^{k} s^{(d)}_{uj} \, s^{(d)}_{vj} \qquad (21)
The vector of fitnesses for the $D$ solutions can be computed with a single tensor operation from $\mathbf{C}$ as $\frac{1}{2} \Sigma_{2,3}(\mathbf{C})$, with $\Sigma_{2,3}$ corresponding to the summation operation along the second and third axes. For $d = 1, \dots, D$:

f(S^{(d)}) = \frac{1}{2} \sum_{u=1}^{n} \sum_{v=1}^{n} c^{(d)}_{uv} \qquad (22)

= \frac{1}{2} \, \Sigma\big(A \odot (S^{(d)} \cdot S^{(d)T})\big) \qquad (23)
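These batched computations map directly onto stacked array operations. A numpy sketch with hypothetical sizes ($D = 8$, $n = 6$, $k = 3$) and a random instance; a GPU framework such as PyTorch would run the same expressions on device:

```python
import numpy as np

D, n, k = 8, 6, 3
rng = np.random.default_rng(0)

# Random symmetric adjacency matrix (hypothetical instance)
A = np.zeros((n, n), dtype=int)
iu = np.triu_indices(n, 1)
mask = rng.random(len(iu[0])) < 0.5
A[iu[0][mask], iu[1][mask]] = 1
A = A + A.T

W = rng.normal(size=(D, n, k))             # one weight matrix per solution
S = np.zeros_like(W, dtype=int)            # S = one_hot(argmax(W)), per solution
np.put_along_axis(S, W.argmax(axis=-1)[..., None], 1, axis=-1)

M = S @ S.swapaxes(1, 2)                   # association tensor, eqs (18)-(19)
C = A[None, :, :] * M                      # conflict tensor, eqs (20)-(21)
f = C.sum(axis=(1, 2)) // 2                # fitness vector, eqs (22)-(23)

grad = A @ S                               # gradient A·S^(d), broadcast over d
```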
3.4.2 Penalization for diversity and bonus for intensification
Minimizing $D$ candidate solutions independently in parallel accelerates the algorithm, but each candidate solution can still easily get trapped in a poor local optimum. The idea that we employ in the following is to introduce two dependency terms linking the different solutions, for two purposes: to prevent them from repeating the same mistakes by creating conflicts for the same pairs of nodes (diversity of errors), and to encourage them to make the same correct group assignments for each pair of nodes not connected by an edge (intensification of good parts of the solutions).
Group concentration matrix
To do this, we first perform a summation of the three-dimensional association tensor $\mathbf{M}$ along its first axis. We obtain the group concentration matrix $\bar{M}$ of size $n \times n$:

\bar{M} = \sum_{d=1}^{D} M^{(d)} \qquad (24)

with each element of $\bar{M}$ given by:

\bar{m}_{uv} = \sum_{d=1}^{D} m^{(d)}_{uv} \qquad (25)

$\bar{m}_{uv}$ is equal to the number of candidate solutions assigning the nodes $u$ and $v$ to the same group. It corresponds to a measure of similarity between the $D$ candidate solutions. The construction of this group concentration matrix is invariant under permutation of the colors (or groups) of each candidate solution $S^{(d)}$.
Diversity penalization term for identical and incorrect group assignments
Using this group concentration matrix $\bar{M}$, we compute the diversity penalization term $f_{\pi}$, where $\Sigma$ corresponds to the sum of all matrix elements and $^{\circ \alpha}$ designates the elementwise power operation:

f_{\pi} = \Sigma\big(A \odot \bar{M}^{\circ \alpha}\big) \qquad (26)

= \sum_{u=1}^{n} \sum_{v=1}^{n} a_{uv} \, (\bar{m}_{uv})^{\alpha} \qquad (27)

where $\alpha$ is a parameter greater than 1, set in order to penalize similar conflicts of different candidate solutions. By minimizing this term, the solutions are discouraged from making the same mistakes for the same pairs of connected nodes.
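A tiny numpy illustration of the concentration matrix and the penalization, on a hypothetical single-edge graph with three candidate solutions (the exponent value 2 is an arbitrary choice):

```python
import numpy as np

A = np.array([[0, 1],
              [1, 0]])                # a single edge between nodes 0 and 1

S = np.array([[[1, 0], [1, 0]],       # solution 0: both nodes in group 0 (conflict)
              [[0, 1], [0, 1]],       # solution 1: both nodes in group 1 (conflict)
              [[1, 0], [0, 1]]])      # solution 2: a legal coloring

M_bar = (S @ S.swapaxes(1, 2)).sum(axis=0)   # concentration matrix, eqs (24)-(25)
# M_bar[0, 1] == 2: two of the three solutions put nodes 0 and 1 together.

alpha = 2
f_pi = (A * M_bar ** alpha).sum()     # diversity penalization, eqs (26)-(27)
# f_pi == 8: with an exponent greater than 1 the repeated conflict counts
# quadratically (2**2 for each direction of the edge), so two solutions
# repeating the SAME conflict cost more than two distinct conflicts would.
```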
Bonus term for identical and correct group assignments
Then, using the same idea, we compute the bonus term $f_{b}$, where $\bar{A}$ is the adjacency matrix of the complement graph of $G$:

f_{b} = \Sigma\big(\bar{A} \odot \bar{M}^{\circ \alpha'}\big) \qquad (28)

= \sum_{u=1}^{n} \sum_{v=1}^{n} \bar{a}_{uv} \, (\bar{m}_{uv})^{\alpha'} \qquad (29)

with $\alpha'$ greater than 1. By maximizing this bonus term, the candidate solutions are encouraged to make the same grouping assignments for any non-connected pair of nodes.
Global loss objective
From now on, we denote by $t$ the iteration step of the algorithm. Indeed, we empirically noticed that introducing the bonus and diversity penalization terms with factors depending linearly on the iteration step $t$ during the optimization process improved the results. The intuitive idea is that by modifying the fitness landscape step by step, the algorithm gets out of local minimum traps.
The global loss objective of the model consists in a weighted aggregation of the following three criteria: (i) the sum of the fitnesses of the $D$ candidate solutions (equation (23)), (ii) the diversity penalization term (equation (27)) and (iii) the bonus term (equation (29)), subtracted in order to maximize it:

L(t) = \sum_{d=1}^{D} f(S^{(d)}) + t \lambda \, f_{\pi} - t \mu \, f_{b} \qquad (30)

with $\lambda$ and $\mu$ two weighting parameters.
Gradient of the loss
At iteration $t$, the gradient of this loss with respect to the tensor of solutions $\mathbf{S}$ is given, for $d = 1, \dots, D$, by:

\nabla_{S^{(d)}} L(t) = A \cdot S^{(d)} + 2 t \lambda \alpha \big( A \odot \bar{M}^{\circ (\alpha - 1)} \big) \cdot S^{(d)} - 2 t \mu \alpha' \big( \bar{A} \odot \bar{M}^{\circ (\alpha' - 1)} \big) \cdot S^{(d)} \qquad (31)
Details are given in Appendix C.
3.4.3 TensCol algorithm
By incorporating equations (30) and (31) in the GpGrad framework of Algorithm 1, we obtain the TensCol algorithm whose pseudocode is displayed in Algorithm 3.
One can notice from Algorithm 3 that we do not need to evaluate the penalization and bonus terms themselves at each iteration: computing only their gradients with respect to $\mathbf{S}$ saves computational and memory resources. The algorithm runs until a stopping condition is reached: when the fitness of one of the candidate solutions reaches 0 (i.e., a legal coloring is found) or when the number of iterations exceeds a given parameter.
4 TensCol for equitable graph coloring
We now present an application of the TensCol algorithm of the previous section to the Equitable graph Coloring Problem (ECP), a variant of the graph coloring problem (GCP). The ECP involves finding a minimum $k$ for which an equitable legal $k$-coloring exists.
An equitable legal $k$-coloring of $G$ is a partition of the vertex set of graph $G$ into $k$ disjoint independent sets $\{V_1, \dots, V_k\}$ satisfying the equity constraint, i.e., $\big| |V_i| - |V_j| \big| \leq 1$ for all $i \neq j$, where $|V_i|$ denotes the cardinality of the group $V_i$. In other words, in an equitable legal coloring, the color groups have the same or similar sizes (with a size difference of at most one).
In previous heuristic algorithms proposed for the ECP, the search is often made by alternating feasible and infeasible phases, respectively in the spaces of equitable and non-equitable colorings, using very specific operators such as two- or three-swap operators preserving the equity constraint [sun2017feasible, wang2018tabu] (see Section 7).
With the GpGrad framework we show that we can naturally handle this problem by only modifying the gradient formulation given by equation (31) to take into account this equitable constraint.
4.1 TensCol with equitable coloring constraint
For a coloring to be equitable, the number of nodes assigned to each color group must be equal to $\lfloor n/k \rfloor$ or $\lceil n/k \rceil$, where $\lfloor x \rfloor$ denotes the integer part of a positive real number $x$ and $n$ is the number of vertices of $G$, with the particular case $\lfloor n/k \rfloor = \lceil n/k \rceil = n/k$ when $n$ is divisible by $k$.
We define the equity fitness of a candidate solution $S$ as:

f_e(S) = \sum_{j=1}^{k} \Big[ \max\big(0, \, n_j - \lceil n/k \rceil\big) + \max\big(0, \, \lfloor n/k \rfloor - n_j\big) \Big] \qquad (32)

where $n_j = \sum_{v=1}^{n} s_{vj}$ is the number of nodes in group $V_j$. It corresponds to the total surplus or deficit, in terms of number of nodes in the groups, with respect to the two admissible group sizes $\lfloor n/k \rfloor$ and $\lceil n/k \rceil$. A legal solution of the ECP must satisfy both $f(S) = 0$ and $f_e(S) = 0$.
We denote by $\mathbf{f}_e$ the global equity fitness for the $D$ candidate solutions, with components $f_e(S^{(d)})$ for $d = 1, \dots, D$, given by equation (32).
By convention for this problem, we set the derivative of $\max(0, x)$ at $x = 0$ to 0. Therefore the partial derivative of $f_e(S)$ with respect to an element $s_{ij}$ is:

\frac{\partial f_e(S)}{\partial s_{ij}} = \mathbb{1}\big[ n_j > \lceil n/k \rceil \big] - \mathbb{1}\big[ n_j < \lfloor n/k \rfloor \big] \qquad (33)

with $n_j$ the total number of vertices with color $j$ for the candidate solution $S$.
Knowing that $n_j$ is an integer, equation (33) can be rewritten in a simpler form as:

\frac{\partial f_e(S)}{\partial s_{ij}} =
\begin{cases}
+1 & \text{if } n_j \geq \lceil n/k \rceil + 1 \\
-1 & \text{if } n_j \leq \lfloor n/k \rfloor - 1 \\
0 & \text{otherwise}
\end{cases} \qquad (34)
One can notice that this partial derivative term does not depend on the index $i$ of the vertex: the equitable constraint is a group constraint, so the same gradient is applied to all nodes of the same group. According to equation (34), this gradient is equal to:
- 0 if the group already satisfies the constraint, with $n_j = \lfloor n/k \rfloor$ or $n_j = \lceil n/k \rceil$ (no need to remove or add any node in this group);
- +1 if the group is composed of more items than required; this gradient will put an equivalent pressure on all nodes to leave the group (if already in it) or not to join it (if not in it);
- −1 if the group is not composed of enough nodes, which encourages the nodes to stay in this group (if already in it) or to join it (if not).
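The three cases above can be sketched as a vectorized gradient computation (numpy; `equity_gradient` is a hypothetical helper name, and the 5-node instance is illustrative only):

```python
import numpy as np

def equity_gradient(S):
    """Per-element gradient of the equity fitness (cf. equation (34)):
    the same value is applied to every node of a given color group."""
    n, k = S.shape
    n_j = S.sum(axis=0)                # current size of each color group
    lo, hi = n // k, -(-n // k)        # floor(n/k) and ceil(n/k)
    g = np.where(n_j > hi, 1, np.where(n_j < lo, -1, 0))
    return np.broadcast_to(g, (n, k))  # same row of gradients for every node

# 5 nodes, 2 groups: group 0 holds 4 nodes (surplus), group 1 holds 1 (deficit)
S = np.array([[1, 0], [1, 0], [1, 0], [1, 0], [0, 1]])
G = equity_gradient(S)
# Every node is pushed out of group 0 (+1) and attracted toward group 1 (-1).
```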
For the TensCol algorithm with the ECP constraint, we employ the same GpGrad framework as presented in Algorithm 3, except that we add the ECP gradient of equation (34). This ECP constraint is introduced progressively, with a weight increased at each iteration by a constant value. We also remove the bonus term, as we empirically noticed that encouraging the different candidate solutions to make the same group assignments for the same pairs of nodes degrades the performance of the algorithm for the ECP (it is better that the different candidate solutions propose different grouping assignments for each given pair of nodes, in order to explore a wider space of equitable colorings). However, we keep the diversity penalization term with its associated gradient. The TensCol algorithm for the ECP is displayed in Algorithm 4.
5 Toy examples of grouping problems learned with gradient descent
In this section, we present illustrative toy examples for the GCP and ECP to give intuitive insights on the gradient descent learning.
5.1 GCP toy example
In this section we use the simple implementation of the GpGrad framework for the GCP corresponding to Algorithm 2, where only one candidate solution is evaluated at each step. We apply it to an easy instance named myciel4.col of the COLOR02 benchmark, with 21 nodes and known chromatic number equal to 5. Figure 3 displays the last 3 iterations (over 8) of the algorithm needed to color myciel4.col with 5 colors for a fixed random seed. The number written on each node indicates the gradient of the global score with respect to the selected color for this node. It corresponds to the total number of conflicts, i.e., adjacent nodes with the same color (cf. equation (16)). Red edges correspond to conflicting edges. Blue circles highlight the nodes changing their color from one step to the next. As we can see in this figure, the algorithm tends to change the color of the nodes with the highest gradient values. One can also notice that it can change the color of more than one node at each iteration, as the update is done on the full candidate solution matrix at each step. A legal coloring solution with $f(S) = 0$ is shown in Figure 3(d).
5.2 ECP toy example
The equitable coloring of this same graph (myciel4.col) is now considered, again with 5 colors. We use the same algorithm and the same parameters, but the equitable constraint is taken into account in the gradient evaluation, which becomes $A \cdot S + \nabla_S f_e(S)$, with for $i = 1, \dots, n$ and $j = 1, \dots, k$:

\frac{\partial f_e(S)}{\partial s_{ij}} =
\begin{cases}
+1 & \text{if } n_j \geq \lceil n/k \rceil + 1 \\
-1 & \text{if } n_j \leq \lfloor n/k \rfloor - 1 \\
0 & \text{otherwise}
\end{cases} \qquad (35)

where $n_j$ is the total number of nodes with color $j$.
Figure 4 displays the last three steps (over 19) needed to find an equitable coloring of myciel4.col with 5 colors for a fixed random seed. The number on each node indicates the gradient of the global ECP score with respect to the selected color for this node. This number is therefore equal, for each node, to the total number of conflicting edges related to this node (red edges), plus one if its group is in surplus and minus one if its group is in deficit. Blue circles highlight the nodes changing their color from one step to the next. First, we see that the algorithm alternates between legal colorings without any conflicting edge (Figures 4(a) and 4(c)) and illegal colorings with red conflicting edges (Figure 4(b)).
One can notice that many nodes (even more than required) change from pink to red from Figure 4(a) to Figure 4(b), and the other way around, from red to pink (and blue), from Figure 4(b) to Figure 4(c). This is due to the fact that the color updates are done independently for each node when using first order gradient descent. It highlights that other update rules, able to learn dependencies between nodes, could be employed to coordinate the multiple node changes and avoid this kind of oscillation phenomenon. This is out of the scope of this work and constitutes an interesting research perspective.
6 Computational results on GCP and ECP benchmark instances
This section is dedicated to an assessment of the TensCol algorithm for solving the GCP and ECP. We first present the experimental setting of TensCol for both problems. Then, we show the computational results obtained on the GCP, followed by the results on the ECP.
6.1 Benchmark instances and comparison criterion
For the GCP we test our algorithm on the 36 most widely used benchmark instances from the second DIMACS competition (publicly available at ftp://dimacs.rutgers.edu/pub/challenge/graph/benchmarks/color/), which are used for assessing the performance of graph coloring algorithms in recent studies [zhou2018improving, moalic2018variations]. For the ECP we test our algorithm on the same set of 73 benchmark instances used in [sun2017feasible, wang2018tabu], from the second DIMACS and COLOR02 competitions (publicly available at https://mat.gsia.cmu.edu/COLOR02/).
Following the common practice for reporting comparative results in the coloring literature, we focus on the best solution found by each algorithm corresponding to the smallest number of colors needed to color a graph.
It is worth mentioning that for the GCP, no single algorithm, including the most recent ones, attains the best-known results for all 36 tested DIMACS instances. Indeed, even the best performing algorithms miss at least two best-known results. This is understandable given that these instances have been studied for a long time (some for 30 years) and some best-known results have been achieved under specific and relaxed conditions (e.g., extended run times of up to one month) and only by very few algorithms. Moreover, one notices that for these benchmark graphs, when the number of colors is close to the chromatic number of the given graph or to the current best-known (upper) bound, finding a legal coloring is a difficult task. As such, an algorithm able to attain the best-known results for a majority of the benchmark instances can be qualified as competitive with respect to the current state of the art on graph coloring.
These comments remain valid for the ECP. Meanwhile, given that the ECP has been studied less intensively than the GCP, one could still hope to find improved best-known results (i.e., equitable colorings with fewer colors than the current best bound). As we show in Section 6.4, this is indeed possible with the proposed TensCol algorithm, which achieves 8 new record results.
6.2 Experimental setting
The TensCol algorithm was implemented in Python 3.5 with the PyTorch 1.1 library for tensor calculation with CUDA 10.0 (the source code of our algorithm will be made publicly available). It is specifically designed to run on GPU devices. In this work we used an Nvidia RTX 2080Ti graphics card with 12 GB of memory. An Nvidia Tesla P100 with 16 GB of memory was used only for the two very large graphs (the C2000.5.col and C2000.9.col instances). For a small graph such as r250.5.col colored with 65 colors for an equitable coloring, running TensCol with 200 candidate solutions in parallel means that millions of weights in the model are updated at each iteration. On the Nvidia RTX 2080Ti, this takes 0.002 seconds per iteration (for the 200 candidate solutions evaluated). For a large graph such as C2000.9.col colored with 431 colors for an equitable coloring, millions of weights are again updated at each iteration; on the Nvidia Tesla P100, this takes 0.37 seconds per iteration.
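The batched evaluation behind these timings can be sketched as follows (an illustration only, not the TensCol code; the real implementation works on PyTorch tensors on the GPU, but the same contraction pattern applies): all candidate colorings are one-hot encoded, and the conflict counts for the whole batch are obtained with two tensor contractions.

```python
import numpy as np

# Sketch of batched conflict evaluation for D candidate colorings at
# once (assumption: TensCol performs the equivalent computation with
# PyTorch tensors on the GPU).

def batch_conflicts(adj, colorings, k):
    """adj: (n, n) symmetric 0/1 matrix; colorings: (D, n) int array."""
    one_hot = np.eye(k)[colorings]                       # (D, n, k)
    # same[d, i, j] = 1 iff nodes i and j share a color in candidate d.
    same = np.einsum('dik,djk->dij', one_hot, one_hot)   # (D, n, n)
    # Each conflicting edge is counted twice in the symmetric sum.
    return 0.5 * np.einsum('ij,dij->d', adj, same)       # (D,)

# Toy batch on a 4-cycle: a proper 2-coloring and a monochromatic one.
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]])
colorings = np.array([[0, 1, 0, 1],
                      [0, 0, 0, 0]])
print(batch_conflicts(adj, colorings, 2))  # [0. 4.]
```

Evaluating all candidates with a single pair of contractions is what makes the per-iteration cost essentially independent of the number of parallel candidates on a GPU.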
For our experiments, each instance was solved 10 times independently (with 10 different random seeds) following the common practice in the literature for graph coloring.
We used a baseline set of parameters, given in Table 1, for both the GCP and the ECP, except for the smoothing parameter, which is very sensitive to the different structures of the GCP and ECP instances. Indeed, we noticed that it is a key parameter depending on the fitness landscape of the coloring problem, and we therefore chose it among a set of 5 different values. For random graphs such as DSJC500.5.col or DSJC1000.5.col with rough fitness landscapes, we had to set it to 100 or 200 in order to frequently reset the learned tensor of weights according to equation (10). However, for geometric graphs such as R1000.1.col, or instances based on register allocation for variables in real codes such as fpsol2.i.1.col or mulsol.i.1.col, we set it so that no reset is applied during the optimization.
Parameter  Description  Value
  number of parallel candidate solutions  200
  standard deviation of initial parameters  0.01
  exponential base in softmax approximation  1
  learning rate  0.001
  number of iterations between two smoothing procedures  5
  smoothing parameter  {1, 2, 10, 100, 200}
  exponent for diversity penalization  2.5
  increase factor for diversity penalization  
(only for GCP)  exponent for similarity bonus  1.2
(only for GCP)  increase factor for similarity bonus  
(only for ECP)  increase factor for equitable constraint  
For some particular cases we had to deviate from this baseline set of parameters. For random graphs with high density, we used a different value of the exponent for diversity penalization in order to prevent the diversity penalization term from becoming too large (cf. equation (27)). Conversely, for random graphs with low density, we adjusted the exponent for the similarity bonus on the GCP to prevent the bonus term (cf. equation (29)) from dominating the global loss evaluation. For families of instances with a huge number of nodes (more than 2000), such as wap02a.col and wap03a.col of the COLOR02 challenge, we had to decrease the number of candidate solutions evaluated in parallel in order to limit the GPU memory required. For the very difficult graph R1000.5.col on the GCP, we achieved a best coloring with 235 colors with the baseline set of parameters, while a better coloring with 234 colors can be obtained by TensCol with a specific fine-tuning of the parameters.
For all these instances, we used as the stopping criterion a maximum allowed number of iterations, which corresponds to a maximum number of different candidate solutions evaluated (since 200 solutions are evaluated at each iteration).
6.3 Computational results on the GCP
This section is dedicated to an extensive experimental evaluation of TensCol on the GCP. We compare TensCol with five state-of-the-art coloring algorithms from the literature:

the probability learning based local search algorithm (PLSCOL) [zhou2018improving], an improved version of the reinforcement learning search (RLS) algorithm [zhou2016reinforcement];

the two-phase evolutionary algorithm (MMT) [malaguti2008metaheuristic];
the distributed evolutionary quantum annealing algorithm (QA) [titiloye2011graph];

the memetic algorithm (MA) [lu2010memetic];

the recent implementation of the memetic approach with two elite solutions in the population (HEAD) [moalic2018variations].
A more detailed presentation of each reference algorithm is given in Section 7.1.
The results of TensCol and the reference algorithms are reported in Table 2: column 2 indicates the number of vertices in each instance, and column 3 shows the chromatic number (if known). For some instances such as flat1000_76_0, the chromatic number is known by construction of the graph (and equal to 76 in this case), but no heuristic has yet found a legal coloring with this chromatic number. Column 4 gives the best-known result (best upper bound) ever reported by an algorithm. Columns 5 to 9 show the best results of PLSCOL, MMT, MA, QA and HEAD, respectively. The remaining columns report the results obtained by our TensCol algorithm: the best result, the number of successful runs out of 10 during which TensCol attained its best result, and, for indicative purposes, the average computation time (in seconds) of TensCol for a successful run.
As we observe in Table 2, TensCol always finds the best-known coloring in a small amount of time on the instances with fewer than 200 nodes, such as DSJC125.1.col or R125.1. On medium graphs with up to 500 nodes, such as DSJC250.5.col or DSJC500.5.col, TensCol is very competitive with all the other algorithms.
For the large DSJC1000.5.col and DSJC1000.9.col instances, TensCol performs worse than MA and HEAD, the two best-performing memetic algorithms, which rely on recombining large independent sets between solutions, a technique that has proven to be very powerful for this kind of large random graph. However, this mechanism can be a drawback for large geometric graphs, as shown by the poor results obtained by MA and HEAD on the R1000.5.col instance. Indeed, as noticed in [wu2012coloring], for this R1000.5.col instance some large independent sets are not part of any optimal 234-coloring, and extracting such independent sets cannot help to decrease the number of colors needed for the whole graph. TensCol, however, recovers the optimal coloring for the whole family of geometric graphs, and in particular for this difficult R1000.5.col instance with an optimal 234-coloring. It is notable that this optimal coloring was previously found only by the MMT algorithm [malaguti2008metaheuristic], a complex two-phase method mixing an effective tabu search, a specialized grouping crossover operator, and a post-optimization phase based on a set covering formulation of the GCP.
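The independent-set extraction idea underlying these memetic algorithms can be illustrated with a minimal greedy sketch (illustrative only; MA and HEAD use far more sophisticated recombination operators): repeatedly pick a minimum-degree remaining node, add it to the set, and delete it together with its neighbors. Each extracted set can then serve as one color class.

```python
# Minimal greedy independent-set extraction (illustration of the idea
# only, not the recombination operators of MA or HEAD): repeatedly take
# a minimum-degree remaining node and remove it and its neighbors.

def greedy_independent_set(neighbors):
    """neighbors: dict node -> set of adjacent nodes."""
    remaining = set(neighbors)
    indep = set()
    while remaining:
        # Pick a node with the fewest neighbors still remaining.
        v = min(remaining, key=lambda u: len(neighbors[u] & remaining))
        indep.add(v)
        remaining -= neighbors[v] | {v}
    return indep

# On a 4-cycle, the greedy procedure extracts one of the two
# diagonally opposite pairs, i.e., a maximum independent set of size 2.
cycle = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(greedy_independent_set(cycle))
```

As discussed above, the limitation on geometric graphs is that such large extracted sets need not belong to any optimal coloring, so fixing them as color classes can block further progress.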
TensCol also obtains good results for the large latin square graph (latin_square_10), which is difficult in particular for the MMT algorithm, but also for the MA and HEAD memetic algorithms.
Finally, it is worth noting that given the particularity of the GCP benchmark instances as discussed in Section 6.1, it is not suitable to apply statistical tests.
PLSCOL  MMT  MA  QA  HEAD  TensCol  

Instance  SR  t(s)  
DSJC125.1  125  ?  5  5  5  5  5  5  5  10/10  40 
DSJC125.5  125  ?  17  17  17  17  17  17  17  10/10  68 
DSJC125.9  125  ?  44  44  44  44  44  44  44  10/10  22 
DSJC250.1  250  ?  8  8  8  8  8  8  8  10/10  95 
DSJC250.5  250  ?  28  28  28  28  28  28  28  10/10  199 
DSJC250.9  250  ?  72  72  72  72  72  72  72  10/10  87 
DSJC500.1  500  ?  12  12  12  12  12  12  12  10/10  1098 
DSJC500.5  500  ?  47  48  48  48  48  47  48  5/10  7807 
DSJC500.9  500  ?  126  126  127  126  126  126  126  6/10  18433 
DSJC1000.1  1000  ?  20  20  20  20  20  20  20  10/10  10225 
DSJC1000.5  1000  ?  82  87  84  83  83  82  84  9/10  32495 
DSJC1000.9  1000  ?  222  223  225  223  222  222  224  6/10  58084 
DSJR500.1  500  ?  12  12  12  12< 