1 Introduction
A graph coloring problem is to assign colors to the vertices of a graph subject to certain constraints. One of the most common graph coloring problem is the vertex coloring problem. Given an undirected graph with a set of vertices and a set of edges , the problem is to color the vertices of the graph such that two adjacent vertices receive different colors. This problem can also be seen as finding a partition of the vertex set into different color groups (also called independent sets or color classes) such that two vertices linked by an edge belong to different color groups. In some variants of graph coloring, the problem is to find such legal coloring of the graph while considering additional objective function to minimize.
The search space of a graph coloring problem is composed of the partitions of vertices into color groups:
(1) 
where , . This search space is in general huge and finding an optimal solution is in general intractable unless P=NP, as most of the graph coloring problems are NPhard.
Graph coloring problems have been studied very intensively in the past decades. A first category of methods proposed in the literature are local search procedures. Starting from a solution constructed using greedy methods, local search improves the current solution by considering best moves in a given neighborhood. To be effective, local search heuristics usually incorporate mechanisms to escape local optima based on tabu lists [blochliger2008reactive, hertz1987using] or perturbation strategies [jin2019solving, nogueira2021iterated]. However for very difficult instances of graph coloring, this is often not enough to find the global optimum as the search may be restricted to a single region of the search space. To overcome these difficulties, hybrid algorithms have been proposed, in particular based on the memetic framework that combines local searches and crossovers [MABook2012]. The memetic framework has been very successful in solving several graph colorig problems [jin2014memetic, lu2010memetic, moalic2018variations, porumbel2010evolutionary]
. These hybrid algorithms combine the benefits of local search for intensification with a population of highquality solutions offering diversification possibilities. The memetic algorithms proposed in the literature for graph coloring typically use a small population with no more than 100 individuals. At each generation, one offspring solution is usually created by randomly selecting two or more individuals in the population and applying a crossover operator. One of the most popular crossovers used for graph coloring problems is the Greedy Partition Crossover (GPX) introduced in the hybrid evolutionary algorithm (HEA)
[galinier1999hybrid]. The GPX produces offspring by choosing alternatively the largest color class in the parent solutions. The offspring that is generated from this crossover is then improved by a local search procedure.These crossover operators of the memetic algorithm allow to produce new restarting points for the local search procedure, which are expected to be better than a pure random initialization. However, when using such mechanisms, there is usually no way of knowing in advance whether the new restarting point indicates a promising area that is really worth being explored by the local search procedure. Indeed, sometimes, the use of a crossover can bring back to an already visited region of the search space without any chance of further improvement, or to a new region far from the global optimum. Moreover, hybrid algorithms do not have a specific memory to store information about past trajectories done in the search space that could be used to learn useful patterns to solve the problem (although sharing groups with crossovers can be seen as some sort of ”learning” of good patterns).
On the other hand, numerous algorithms have been proposed since decades from the machine learning community to leverage statistical learning methods for solving difficult combinatorial search problems. We refer the reader to the recent survey of
[bengio2020machine]on this topic. These attempts have been given a new lease of life, with the emergence of deep learning techniques for combinatorial optimization problems
[bello2016neural, dai2017learning], inspired from the great success of the AlphaZero algorithm for combinatorial games [silver2018general]. In particular, some recent works using reinforcement learning and deep learning have been applied to solve graph coloring problems
[huang2019coloring, lemos2019graph]. Nevertheless, these studies rarely exploit specific knowledge of the problem, which greatly limits the interest and performance of these approaches. Indeed, the results obtained by this type of approach are for the moment far from the results obtained by state of the art algorithms on graph coloring problems such as hybrid algorithms
[lu2010memetic, malaguti2008metaheuristic, moalic2018variations, porumbel2010evolutionary] and simulated annealing algorithms [titiloye2011quantum]. We can mention however new works which are trying to pair efficient local search algorithms and machine learning techniques [goudet2021population, zhou2016reinforcement, zhou2018improving] with promising results for graph coloring problems.In this paper, we aim to push further the integration of machine learning and combinatorial optimization, by proposing a new framework which combines deep neural networks with the best tools of ”classical” metaheuristics for graph coloring (local search well adapted with tabu search [hertz1987using] and use of recombination of solutions like those used in memetic algorithms [galinier1999hybrid, lu2010memetic, moalic2018variations, porumbel2010evolutionary]), so as to solve very difficult graph coloring problems which still resist the best current methods. In order to achieve this integration, we propose to revisit an idea proposed by Boyan and Moore twenty years ago. In [boyan2000learning], the authors remarked that the performance of a local search procedure depends on the state from which the search starts and therefore proposed to use a regression algorithm to predict the results of a local search algorithm. Once learned this predictive model can help to select new good starting points for the local search and the authors show on some examples that it can help to find quicker the optimal solution in the search space. We propose to exploit this idea with the use of modern deep learning techniques, in order to help in the selection of promising new crossovers among those possible ones in each generation. We design a specific neural network architecture for graph coloring problems inspired by deep set networks [lucas2018mixed, zaheer2017deep, zhang2019deep], in order to make it invariant by permutation of the color classes. Furthermore, as training a deep learning algorithm requires a large amount of data, we implement a memetic algorithm with a very large population (), whose individuals evolve in parallel in the search space, adapting the framework recently proposed in [goudet2021massively] for latin squares completion. In order to learn the neural network and to compute all the local searches in parallel for all the individuals of the population, we leverage on GPU (Graphic Processing Units) computation.
As a proof of concept, we apply this approach to solve the weighted vertex graph coloring problem (WVCP) which has recently attracted a lot of interest in the literature [nogueira2021iterated, wang2020reduction]. In the WVCP, a strictly positive weight is associated to each vertex . The goal of the weighted vertex coloring problem (WVCP) is to find a legal coloring minimizing the global score:
(2) 
This problem is well suited for our framework as predicting this global score, instead of more sophisticated constraints, is well adapted when using regression machine learning techniques.
The WVCP has a number of practical applications in different fields such as matrix decomposition problems [prais2000reactive], batch scheduling [gavranovic2000graph] and manufacturing [hochbaum1997scheduling]. It has been addressed by exact methods [cornaz2017solving, furini2012exact, malaguti2009models] and heuristics [prais2000reactive, sun2018adaptive, wang2020reduction]. The current most performing heuristic for this problem is an iterated local search algorithm exploring two different neighborhoods and based on tabu search [nogueira2021iterated].
2 General framework  revisiting the STAGE algorithm with deep learning and memetic algorithm
Given a problem whose the goal is to find an optimal solution, minimizing an objective function , the expected search results of stochastic local search algorithm A can be defined as:
(3) 
where
is the probability that a search starting from
will terminate in state . evaluates ’s promise as a starting state for the algorithm A.The main idea of the STAGE algorithm [boyan2000learning] was to approximate this expectancy by a regression approximation model
, taking as input the encoded realvalued feature vector
(with features) of a state . This functioncan be a regression model such as a linear regression model or a more complex non linear model such as a neural network.
After initializing a first random candidate solution , and the function approximator , the STAGE algorithm evolves in three steps:

Optimize f using A. From , it runs the local search algorithm A, producing a search trajectory that ends at a local optimum .

Train . For each point on the search trajectory, use as a new training pair for the function approximator.

Optimize using hillclimbing. Continuing from , perform a hillclimbing search on the learned objective function . This results in a new state which should be a new good starting point for A.
We propose to revisit this idea with an adaptation for each of the three points of the STAGE algorithm:

First, regarding the first step of the STAGE algorithm, we propose to run in parallel local searches with algorithm A starting from different states, instead of only one. It allows to produce different search trajectories allowing to build a training dataset with a great diversity of examples.

Secondly, regarding the step 2, we do not use any prior mapping from states to features. Following the current trend in deep learning, the embedding of the state can be directly learned in a end to end pipeline with a deep neural network, denoted as . We make this neural network invariant by permutation of the group of colors in the coloring which is a very important feature of all graph coloring problems, by adapting the deep set network architecture proposed in [zaheer2017deep] (see Section 2.3).

Thirdly, in the step 3 of the original STAGE algorithm, a hillclimbing algorithm is used to optimize the current solution guided by the objective function . However, as we address a very complex problem and we use a complex non convex and nonlinear function (deep neural network), it is difficult to optimize it using a hillclimbing algorithm. We tried to use more complex algorithms such as tabu search to optimize it, but there is a deeper problem, which is the question of generalization. Indeed, if a state is too different from the states already seen before in the training dataset, and in particular if the color groups that composed it are too different from the color groups already seen before by the neural network, we expect that
can be very inaccurate for the estimation of
. Therefore, we propose to replace this hillclimbing procedure by a crossover operation between different members of a population of candidate solutions. By recombining the different color groups already seen before by the learning algorithm we expect the approximation of , given by , to be more precise.
The pseudocode of the proposed new deep learning guided memetic framework for graph coloring (DLMCOL) is shown in Algorithm 1.
The algorithm takes a graph as input and tries to find a legal coloring with the minimum score . At the beginning, all the individuals of the population are initialized in parallel using a greedy random algorithm (cf. Section 2.1) and the neural network is initialized with random weights. Then, the algorithm repeats a loop (generation) until a stopping criterion (e.g., a cutoff time limit or a maximum of generations) is met. Each generation involves the execution of five components:

The individuals of the current population are simultaneously improved by running in parallel local searches on the GPU to find new legal solutions with a minimum score (cf. Section 2.2). For each of the improved individuals from step 1, we record , the legal state with the lowest score obtained during each local search trajectory.

The distances between all pairs of the existing individuals and new individuals are computed in parallel (cf. Section 2.4).

Then the population updating procedure (cf. Section 2.5) merges the existing and new individuals to create a new population of individuals, by taking into account the fitness of each individual (score of the WVCP) and the distances between individuals in order to maintain some diversity in the population.

Finally each individual is matched with its nearest neighbors in the population. For each individual, offsprings are generated and the one with the best expected score evaluated with the neural network is selected (cf. Section 2.6). After this selection procedure, offspring individuals are selected and become the new starting points which are improved by the parallel by the local search procedure during the next generation ().
The algorithm stops when a predefined condition is reached and return the , the best legal solution found so far. The subsequent subsections present the five components of this deep learning guided memetic framework applied on the WVCP.
2.1 Initialization with a greedy random algorithm for the WVCP
In order to initialize the individuals of the population, we use a greedy random procedure which has already been seen as very effective for the WVCP in [nogueira2021iterated, sun2018adaptive].
First all the nodes are sorted by descending order of weight and by degree, then a color is assigned to each node without creating conflicts by randomly choosing a color in the set of already used color. If no color is available for the node with weight a new color is created (and the score of the current solution is increased by ).
Notice that for this problem the number of colors that can be used in order to find a legal coloring minimizing the global score is not known in advance. It is at least strictly greater than the chromatic number of the graph .
However, for our algorithm a predefined maximum allowed number of colors need to be set in order to specify the size of the layers of the neural network and to allocate memory for the local searches on the GPU. We evaluate this number as the maximum number of colors used in these greedy random procedures applied to initialize the whole population.
The new search space restricted with maximum available colors and composed of the partitions of vertices into color groups is defined as:
(4) 
Even if it can be seen as quite restrictive to limit the search in instead of , we empirically notice that the best solution found for each of the different instances of the WVCP presented in the experimental section of this paper (cf. Section 3) still used significantly less than color.
2.2 Parallel iterated Tabu Search with feasible and infeasible searches
For local optimization, we employ a parallel iterated tabu search algorithm to simultaneously improve the individuals of the current population. It relies on the adaptive feasible and infeasible tabu search procedure for weighted vertex coloring (AFISA) proposed in [sun2018adaptive] for the WVCP, with some slight modifications. This feasible and infeasible tabu search uses a sequential procedure that improves a starting legal or illegal coloring by optimizing the fitness function given by:
(5) 
where with:
(6) 
and where is a penalty coefficient for the number of conflicts in the solution.
The procedure improves the current coloring by successively changing the color of a vertex in the search space (with a maximum of colors). Such a change is called an onemove. To prevent the search from revisiting already visited colorings, a node cannot changes its color for the next (called tabu tenure) iterations^{1}^{1}1In the original AFISA algorithm, the tabu tenure is concerning past moves (as in the original TabuCol algorithm [hertz1987using]) instead of completely freezing a node, but we empirically observed that it seems more effective for the WVCP to freeze a node which has just changed color in order to avoid to much color changes of the same node without any improvement of the score (plateau).. The tabu tenure is set to be , where is a random integer from , is parameter set to 0.2, and is the number of nodes in the graph^{2}^{2}2In the original AFISA algorithm, the tabu tenure was set to where corresponds to the extended evaluation function given by equation 5, but we remark that it seems more appropriate to relate this tabu tenure to the number of nodes in the graph as it is directly proportional to the number of available moves at each step. It seems preferable that the global extended score , which depends on the magnitude of the weights for each specific instance, does not have such a direct impact on the tabu tenure..
Like in the AFISA algorithm, we perform successive searches by changing dynamically the value of in order to navigate in the space of legal and illegal colorings. The number of successive searches is set to . At the beginning the parameter is set to the value for each individual of the population, then at the end of each successive local search, if the best current solution found by the tabu search procedure is legal () then the value of is divided by 2 (in order to increase the chance of visiting infeasible solutions), otherwise if the solution is illegal () then is multiplied by (in order to guide the search toward feasible regions)^{3}^{3}3In the original AFISA algorithm is initially set to 1, cannot be lower than 1 and the adaptive mechanism only increase or decrease its value by 1. We have found empirically more efficient to initialize to a value depending on the density of the graph (via the ratio ) and the amplitude of the weights a we must balance a total weight score with a number of conflicts. Additionally, dividing or multiplying its value by 2 appears to be empirically more effective for faster adjustments, especially for instances with heavy weights than simply changing its value by 1..
During the last iteration of this iterative local search algorithm, we set in order to be sure to guide at least one time each local search toward a legal solution.
As shown in Algorithm 2, our parallel iterative tabu search runs in parallel on the GPU to raise the quality of the current population in parallel.
All the data structures required during the search are stored in each local thread memory running tabu search except the information of the graph which is stored in the global memory and shared.
2.3 Deep neural network training
Once all the local searches are done in parallel, we collect the starting states and the best score found on each local thread. It allows to build a supervised training dataset with examples whose entries are the ’s in and the corresponding targets are real values .
A neural network , parametrized by a vector of parameters (initialized at random at the beginning), will be successively trained on each new dataset produced at each generation (online training) in order to be able to be more and more accurate at predicting the expected score obtained after the local search procedure for any new starting point .
This neural network takes directly as input a coloring as a set of vectors , , where each is a binary vector of size indicating if the vertex belongs to the color group . For such an entry , the neural network outputs a real value noted .
One important characteristic of our neural network is that it be invariant by permutation of the group of colors of any solution given as input. It should be a function from to such that for any permutation of the input color groups,
(7) 
As indicated in [lucas2018mixed, zaheer2017deep], such permutation invariant functions can be obtained by combining the treatments of each color group vector with an additional ”coloraveraging” operation that performs an average of the features across the different color groups. In this deep set architecture, this averaging is the only form of interaction between the different color groups. It has notably been shown in [lucas2018mixed] that such operations are sufficient to recover all invariant functions from to .
Using the notations proposed in [lucas2018mixed], for a coloring , the color group invariant network is defined as:
(8) 
where each is a permutation invariant function from to , where ’s are the layer sizes.
Each equivariant layer operation with input features and ouput features includes a weight matrix that treats each color group independently, a colormixing weight matrix
and a bias vector
.As in classical multilayer feedforward neural network,
processes each color group vector of the solution independently. Then, the weight matrix computes an average across the different color groups for each feature given by:(9)  
(10) 
The output of permutationequivariant layer is a matrix in , which is the concatenation of output vectors of size :
(11) 
where for ,
(12) 
where
is a non linear activation function.
After each local search procedure the neural network is trained during epochs on the new dataset using Adam optimizer [kingma2014adam] with initial learning rate
and batches of size 100 in order to minimize the mean square error loss (MSE) between the outputs and the targets. In order to speed up the training, prior to the non linearity we apply a batch normalization layer
[ioffe2015batch], adapted to keep the invariant property of the network.Once learned this neural network will be used to select new crossovers for the next generation (see Section 2.6 below), but before performing crossovers, we must decide if the new legal colorings obtained after the parallel local search procedure can be inserted into the population. In order to do this, a distanceandquality based pool update strategy is used to create a new population satisfying a minimum spacing among the individuals to ensure population diversity [PorumbelHK11]. Maintaining this minimum spacing requires the computation of pairwise distances between the solutions, which is presented the next subsection.
2.4 Distance computation
Following [goudet2021population, lu2010memetic, porumbel2010evolutionary], for population updating, we use a matrix to record all the distances between any two solutions of the population. This symmetric matrix is initialized with the pairwise distances computed for each pair of individuals in the initial population, and then updated each time a new individual is inserted in the population.
To merge the new solutions and the existing solutions, we need to evaluate (i) distances between each individual in the population and each improved individual in and (ii) distances between all the pairs of individuals in . All the distance computations are independent from one another, and are performed in parallel on the GPU (one computation per thread).
Given two colorings and , we use the settheoretic partition distance to measure the dissimilarity between and , which corresponds to the minimum number of vertices that need to be displaced between color classes of to transform to [porumbel2011efficient]. The exact partition distance between two solutions can be calculated with the Hungarian algorithm [kuhn1955hungarian] in time. However, given that we need to compute millions of distances at each generation with the large population, we instead adopt the efficient approximation algorithm presented in [porumbel2011efficient], which scales in .
2.5 Population update
According to [porumbel2010evolutionary, PorumbelHK11], the population update procedure aims to keep the best individuals, but also to ensure a minimum spacing distance between the individuals. The update procedure is sequential, as we need to compare one by one existing individuals in the population at generation and the local search improved offspring solutions in the population .
We use the population update procedure proposed by [goudet2021population]. This procedure greedily adds one by one the best individuals of in the population of the next generation until reaches individuals, such that ( is the number of vertices), for any , . Each corresponds to the approximation of the settheoretic partition distance which was precomputed in the last step of the algorithm.
2.6 Parent matching and selection of crossovers with the neural network
At each generation, each individual of the population is matched with its nearest neighbors in the population (in the sense of the distance evaluated in subsection 2.4).
For each individual , offspring solutions () are generated using the wellknown GPX crossover [galinier1999hybrid, moalic2018variations], where the individual is taken as the first parent and its neighbor is the second parent (the GPX crossover is not symmetric).
For each individual , among these crossovers, we select the one with the best expected score evaluated with the neural network:
(13) 
After this selection procedure, offspring solutions are selected and serve as the new starting points that are further improved in parallel by the local search procedure during the next generation ().
3 Experimental results
This section is dedicated to a computational assessment of the proposed deep learning memetic framework for solving the weighted vertex coloring problem, by making comparisons with the stateoftheart methods.
3.1 Implementation on graphic processing units
The DLMCOL algorithm was programmed in Python with the Numba 0.53.1 library for CUDA kernel implementation (local search, distance computation, crossovers). The neural network is implemented in Pytorch 1.8.1. DLMCOL is specifically designed to run on GPUs. In this work we used a V100 Nvidia graphic card with 32 GB memory.
3.2 Benchmark instances
We carried out extensive experiments on the benchmark used in the recent papers on the WVCP [nogueira2021iterated, sun2017feasible, wang2020reduction]: the pxx, rxx, DIMACS/COLOR small, and DIMACS/COLOR large instances. The pxx and rxx instances are based on matrixdecomposition problems [prais2000reactive], while DIMACS/COLOR small [cornaz2017solving, furini2012exact] and DIMACS/COLOR large [sun2017feasible] are based on DIMACS and COLOR competitions.
As indicated in [nogueira2021iterated, wang2020reduction], a preprocessing procedure can be applied to reduce a graph with the set of weight for the WVCP. For each clique with nodes, if we note the the smallest weight of this set, all the nodes in the graph with a degree equal to and a weight can be removed from the graph without changing the optimal WVCP score that can be found for this instance. Enumerating all the cliques of the graph is a challenging problem. We used the igraph python package^{4}^{4}4https://igraph.org/python/ with a ten seconds timeout for all instances. For small instances it is enough to enumerate all the cliques of a graph.
3.3 Parameter setting
The population size of DLMCOL is set to , which is chosen as a multiple of the number of 64 threads per block. This large population size offers a good performance ratio on the Nvidia V100 graphics cards that we used in our experiments, while remaining reasonable for pairwise distance calculations in the population, as well as the memory occupation on the GPU for medium instances (). However for large instances (), we set in order to limit the memory occupation on the device.
In addition to the population size, the parameter of the tabu search is set to 0.2 and the number of tabu iterations depends on the size of the graph. The maximum number of iterated local searches launched at each generation, , is set to 10. The minimum spacing distance used for pool update is set to .
For the neural network we implement an architecture with 4 hidden layers of size , , and
. The non linear activation function is a Leaky Relu function defined as
, with . The neural network is trained during epochs at each generation with Adam optimizer and initial learning rate .Table 1 summarizes the parameter setting, which can be considered as the default and is used for all our experiments.
Parameter  Description  Value 

Population size  20480 (8192)  
Number of iterations local search  10  
Number of iterations tabu search  
Tabu tenure parameter  0.2  
Minimum spacing between two individuals  
Learning rate of the neural network  0.001  
Number of epochs of the training  20  
Number of considered neighbors for crossover selection  32 
3.4 Comparative results on WVCP bechmarks
This section shows a comparative analysis on the pxx, rxx, DIMACS/COLOR small, and DIMACS/COLOR large instances with respect to the stateoftheart methods [nogueira2021iterated, sun2017feasible, wang2020reduction]. The reference methods include the three best recent heuristics: AFISA [sun2017feasible], RedLS [wang2020reduction] and ILSTS [nogueira2021iterated]. When they are available, we also include the optimal scores obtained with the MWSS exact algorithm [cornaz2017solving] and reported in [nogueira2021iterated].
Given the stochastic nature of the DLMCOL algorithm, each instance was independently solved 10 times. For small instances presented in Tables 2 and 3, a time limit of 1 hour was used. However for medium and large instance in Tables 4 and 5, as training the neural network and performing all the local searches for the large population is time consuming, a cutoff limit of 48 hours is retained.
For a fair comparison, we also launched the local search methods RedLS and ILSTS during 48 hours, until no improvements is observed. It was done on a computer with an Intel Xeon E52670 processor (2.5 GHz and 2 GB RAM). As the available AFISA binary code does not allow setting a cutoff time, we only report its results mentioned in the original article [sun2017feasible].
However, we acknowledge that the comparison remains difficult in term of computational time between DLMCOL and the competitors, as DLMCOL uses a GPU while the other algorithms, AFISA, RedLS and ILSTS use a CPU. Therefore the computation times are given for indicative purposes only.
Columns 1, 2, and 3 of Tables 2–5 show the characteristics of each instance (i.e., name of the instance, number of vertices , and optimal result reported in the literature). Columns 49 present the best and average scores obtained by the reference algorithms for the difference instances, as well as the average time in second required to obtain the best results. The results of the proposed DLMCOL algorithm are reported in columns 10 and 11. Bold numbers show the dominating values while a star indicates a new upper bound^{5}^{5}5The certificates of the new bestknown solutions from DLMCOL are available at https://github.com/GoudetOlivier/DLMCOL..
Due to memory limit on each thread of the GPU for the local searches, the DLMCOL algorithm was not runned on the biggest instances of the DIMACS/COLOR benchmarks: C2000.5, C2000.9, DSJC1000.9 and wap014a.
As can be seen from Tables 2–4, for all the small DIMACS/COLOR, pxx and rxx instances, DLMCOL can attain the known optimality or best result reported in the literature with almost 100 % success rate over 10 runs. Hoverer, the computational time required to achieve these results is in general higher in particular when compared with ILTTS. It may be explained by the fact that DLMCOL can only obtain a first result after 10 iterations of local search in parallel of the whole population.
For the larger instances reported in Table 5, DLMCOL obtains excellent results by reaching the bestknown score for 31 over 49 instances. For 11 of them, DLMCOL even finds new upper bounds that were never been reported before. In particular, we notice important improvements in comparison with the best methods of the literature for 3 instances, by significant reducing their bestknown scores: DSJC500.5 from 707 to 686, flat1000_50_0 from 1184 to 924 and latin_square_10 from 1542 to 1483.
However, DLMCOL does not work in general for large graphs with low density of edges such as DSJC1000.1, inithhx.i.2, inithhx.i.3 and wapXXa. In fact, for these instances, it is very hard for the neural network to learn a common backbone of good solutions. Indeed, the good solutions for these instances are characterized by a low ratio of the number of color groups over the total number of vertices. As for the WVCP, only the maximum weight of each color group has an impact on the score, for these instances many different groupings of vertices are possible without impacting the score. It results in a very high average distance between the best solutions of the population. Therefore, the neural network fails to learn relevant patterns in this too diversified population.
For the largest graphs, we notice that the convergence of the algorithm is quite slow. Even after 48 hours, there is still room for further improvements. For the DJSC1000.5, flat1000_60_0 and flat1000_76_0 instances, it is possible to obtain still better new upper bounds respectively of 1185, 1162, 1165, after 138, 98 and 95 hours, respectively.
Instance  AFISA  RedLS  ILSTS  DLMCOL  
Graph name  Opti.  Best (Avg.)  t (s)  Best (Avg.)  t (s)  Best (Avg.)  t (s)  Best (Avg.)  t (s)  
DSJC125.1g  124    23 (24)  3016  23  0.01  23  3  23  22 
DSJC125.1gb  124    90 (92.5)  402  91 (91.7)  696  90  15  90  21 
DSJC125.5g  125    71 (72.3)  216  72  32895  71  77  71  416 
DSJC125.5gb  125    243 (250.2)  369  241 (241.3)  13528  240  219  240  38 
DSJC125.9g  125  169  169 (169.9)  16  169  3493  169  1.27  169  120.8 
DSJC125.9gb  125  604  604 (605.5)  444  604  17.62  604  59.8  604  81 
GEOM100  100  65  65 (65.0)  0.81  65 (67.5)  0.01  65  0.03  65  14 
GEOM100a  100  89  89 (89.5)  110  90 (93.3)  0.01  89  0.65  89  19 
GEOM100b  100  32  32 (33.1)  59  32  0.6  32  0.02  32  17 
GEOM110  110  68  68 (68.0)  34  68 (69.9)  0.03  68  0.05  68  18 
GEOM110a  110  97  97 (97.8)  177  97 (99.4)  0.04  97  0.58  97  22 
GEOM110b  110  37  37 (37.9)  131  37 (37.71)  22.1  37  0.25  37  56 
GEOM120  120  72  72  33  72 (73.1)  9.0  72  0.03  72  17 
GEOM120a  120  105  105 (106.3)  156  105 (105.9)  1.96  105  0.77  105  27 
GEOM120b  120  35  35 (37.3)  67.7  35 (35.25)  14.38  35  0.81  35  36 
GEOM30b  30  12  12  0.02  12  0.01  12  0.01  12  6 
GEOM40b  40  16  16  0.03  16 (16.6)  0.01  16  0.01  16  6 
GEOM50b  50  18  18  0.02  18 (18.2)  62.02  18  0.01  18  7 
GEOM60b  60  23  23  0.22  23  0.01  23  0.01  23  8 
GEOM70  70  47  47  5  47 (48.6)  0.01  47  0.02  47  9 
GEOM70a  70  73  73  4  73 (73.6)  0.25  73  0.03  73  10 
GEOM70b  70  24  24  12  24  15.1  24  0.01  24  11 
GEOM80  80  66  66  2  67 (67.4)  0.01  66  0.01  66  9 
GEOM80a  80  76  76 (76.1)  137  76 (78.2)  1.3  76  0.04  76  12 
GEOM80b  80  27  27 (27.8)  67  27  24.9  27  0.06  27  12 
GEOM90  90  61  61 (61.2)  89  61 (63.5)  1.74  61  0.15  61  12 
GEOM90a  90  73  73 (74)  512  73 (74.1)  3.65  73  0.55  73  14 
GEOM90b  90  30  30 (30.1)  67  30 (30.1)  0.17  30  0.02  30  16 
R100_1g  100  21  21 (22)  114  21 (21.8)  508.0  21  7.14  21  300 
R100_1gb  100  81  81 (83.8)  3  81 (81.4)  1279.7  81  1.98  81  15 
R100_5g  100    59 (60.1)  7  59  193  59  0.2  59  23 
R100_5gb  100    221 (224.1)  187  220 (222)  687  220  4  220  23 
R100_9g  100  141  141 (141.3)  21  141  0.62  141  40.8  141  42 
R100_9gb  100  518  518 (5449.3)  1152  518  8.17  518 (518.3)  1066.1  518  41 
R50_1g  50  14  14  0.14  14 (14.1)  0.01  14  0.01  14  7 
R50_1gb  50  53  53 (53.0)  0.24  53 (53.1)  0.07  53  0.01  53  7 
R50_5g  50  37  37 (37.0)  0.95  37  0.01  37  0.02  37  9 
R50_5gb  50  135  135 (135.3)  4  135  0.09  135  0.21  135  9 
R50_9g  50  74  74  1  74  0.02  74  0.01  74  11 
R50_9gb  50  262  262  13  262  0.01  262  1.4  262  11 
R75_1g  75  18  18 (18.4)  11  18  2.62  18  0.28  18  9 
R75_1gb  75  70  70 (70.1)  2  70 (72.4)  0..35  70  0.23  70  9 
R75_5g  75  51  51 (51.4)  01  51 (51.2)  598.6  51  0.39  51  14 
R75_5gb  75  186  186 (189)  19  186  51.1  186  2  186  14 
R75_9g  75  110  110  3  110  0.08  110  0.1  110  22 
R75_9gb  75  396  396 (396.4)  146  396  0.42  396  3.9  396  22 
Best rate  95%  89%  100%  100% 
Instance  AFISA  RedLS  ILSTS  DLMCOL  

Graph name  Opti.  Best (Avg.)  t (s)  Best (Avg.)  t (s)  Best (Avg.)  t (s)  Best (Avg.)  t (s)  
P06  16  565  565  0.0  565  0.01  565  0.01  565  7 
P07  24  3771  3771  0.0  3771 (3773.5)  0.01  3771  0.01  3771  6 
P08  24  4049  4049  0.2  4049  0.03  4049  0.21  4049  6 
P09  25  3388  3388 (3388.2)  1  3388  0.01  3388  0.01  3388  6 
P10  16  3983  3983  0.7  3983  0.01  3983  0.01  3983  5 
P11  18  3380  3380  0.0  3380  0.01  3380  0.01  3380  5 
P12  26  657  657  0.0  657  0.01  657  0.01  657  6 
P13  26  3220  3220 (3221.1)  0.7  3220 (3229)  0.01  3220  0.09  3220  6 
P14  31  3157  3157  0.0  3157  0.01  3157  0.01  3157  6 
P15  34  341  341  1.8  341 (343.1)  0.01  341  0.01  341  6 
P16  34  2343  2343  0.7  2343 (2383.3)  0.01  2343  0.01  2343  6 
P17  37  3281  3281 (3322.2)  2.7  3281 (3282.6)  0.05  3281  0.01  3281  7 
P18  35  3228  3228  0.1  3228  0.01  3228  0.01  3228  6 
P19  36  3710  3710  0.4  3710  0.01  3710  0.01  3710  6 
P20  37  1830  1830 (1841)  4.9  1830 (1844)  0.05  1830  0.38  1830  6 
P21  38  3660  3660 (3660.5)  0.8  3660 (3707.0)  0.01  3660  0.01  3660  7 
P22  38  1912  1912 (1912.2)  0.3  1912 (1946)  0.21  1912  0.01  1912  7 
P23  44  3770  3770 (3793.0)  0.3  3770 (3804)  0.04  3770  0.01  3770  7 
P24  34  661  661  0.0  661 (667.1)  0.19  661  0.01  661  6 
P25  36  504  504  0.3  504  0.01  504  0.01  504  6 
P26  37  520  520  0.1  520  0.01  520  0.01  520  7 
P27  44  216  216  0.1  216 (219)  0.01  216  0.07  216  7 
P28  44  1729  1729 (1735.1)  2.6  1729  0.01  1729  0.01  1729  7 
P29  53  3470  3470  0.1  3470  0.01  3470  0.01  3470  8 
P30  60  4891  4891  54  4891 (4901)  0.01  4891  0.01  4891  10 
P31  47  620  620  3.7  620  0.01  620  0.01  620  6 
P32  51  2480  2480  0.4  2480  0.01  2480  0.01  2480  8 
P33  56  3018  3018 (3029.7)  0.4  3018 (3096)  0.02  3018  0.01  3018  9 
P34  74  1980  1980 (1980.5)  3.0  1980 (1994)  0.01  1980  0.03  1980  14 
P35  86  2140  2140 (2145.0)  4.5  2140 (2161)  0.01  2140  0.02  2140  18 
P36  101  7210  7210 (7385)  0.1  7210  0.01  7210  0.01  7210  24 
P38  87  2130  2130 (2139.5)  9.5  2140 (2161)  0.01  2130  0.29  2130  18 
P40  86  4984  4984 (5016.6)  5.1  5005 (5082.7)  0.01  4984  0.33  4984  18 
P41  116  2688  2688 ( 2688.1)  0.1  2688 (2785.8)  0.04  2688  0.32  2688  453 
P42  138  2466  2466 (2671.2)  931.0  2482 (2539.9)  0.02  2466  9.02  2466  27 
Best rate  100%  91%  100 %  100% 
Instance  AFISA  RedLS  ILSTS  DLMCOL  

Graph name  Opti.  Best (Avg.)  t (s)  Best (Avg.)  t (s)  Best (Avg.)  t (s)  Best (Avg.)  t (s)  
r01  144  6724  6724 (6727.8)  49.5  6732 (6769.2)  0.01  6724  0.96  6724  31 
r02  142  6771  6771 (6780.6)  85.3  6774 (6818.6)  0.01  6771  0.25  6771  30 
r03  139  6473  6473 (6490.8)  190.2  6505 (6597.7)  233.5  6473  4.53  6473  51 
r04  151  6342  6342 (6403.2)  467.3  6349 (6427.7)  0.42  6342  0.83  6342  81 
r05  142  6408  6408 (6466.3)  71.7  6411 (65010.3)  0.42  6408  2.57  6408  30 
r06  148  7550  7550 (7555.9)  29.2  7550 (7558.9)  0.01  7550  0.01  7550  31 
r07  141  6889  6889 (7555.9)  34.8  6910 (6974.2)  954  6889  7.29  6889  29 
r08  138  6057  6057 (6080.3)  311.7  6071 (6147.4)  0.04  6057  1.6  6057  234 
r09  129  6358  6358 (6393.8)  395.2  6390 (6451.9)  66.13  6358  1.84  6358  26 
r10  150  6508  6508 (6519.3)  461.1  6518 (65078.6)  0.05  6508  2.46  6508  33 
r11  208  7654  7654 (7710.6)  9542.2  7691 (7739.5)  489.45  7654  5.27  7654  432 
r12  199  7690  7691 (7710.4)  9542.2  7694 (7730.2)  2.61  7690  4.33  7690  58 
r13  217  7500  7521 (7558.3)  619.5  7524 (7566.7)  0.04  7500  5.8  7500  71 
r14  214  8254  8254 (8283.9)  8044.1  8288 (8371.4)  0.87  8254  4.78  8254  68 
r15  198  8021  8021 (8126.8)  2559.1  8021 (8024.0)  0.01  8021  0.01  8021  172 
r16  188  7755  7755 (7789.2)  195.5  7764 (7809.4)  0.01  7755  11.22  7755  51 
r17  213  7979  7979 (8030.3)  855.4  8011 (8064.3)  0.86  7979  4.39  7979  242 
r18  200  7232  7232 (7278.9)  868.2  7240 (7295.3)  11.53  7232  26.4  7232  4374 
r19  185  6826  6840 (6868.1)  395.5  6826 (6850.5)  39.15  6826  2.09  6826  189 
r20  217  8023  8023 (8102.0)  1028.5  8031 (8138.3)  1.68  8023  13.08  8023  3027 
r21  281  9284  9284 (9384.5)  4588.7  9294 (9320.1)  0.01  9284  9.15  9284  6103 
r22  285  8887  8887 (8959.3)  12911  8924 (9030.6)  0.01  8887  63.42  8887  1521 
r23  288  9136  9136 (9267.9)  3252.0  9145 (9222.0)  0.05  9136  42.77  9136 (9137.7)  25716 
r24  269  8464  8464 (8572.9)  13142.6  8468 (8534.5)  0.01  8464  0.51  8464  797 
r25  266  8426  8468 (8560.8)  874.8  8579 (8649.6)  0.01  8426  36.03  8426  7088 
r26  284  8819  8819 (8927.9)  14225.1  8937 (9035.3)  0.01  8819  82.2  8819  22861 
r27  259  7975  7975 (8019.7)  14074.9  7975 (7997.3)  1.71  7975  9.86  7975  102 
r28  288  9407  9407 (9599.4)  8691.0  9409 (9475.4)  0.01  9407  0.44  9407  1891 
r29  281  8693  8693 (8743.7)  7613.1  8701 (8743.7)  0.03  8693  4.54  8693  3429 
r30  301  9816  9816 (10003.2)  8838.6  9820 (9877.1)  0.01  9816  1.36  9816  147 
Best rate  87%  13.3%  100 %  100% 
Instance  AFISA  RedLS  ILSTS  DLMCOL  
Graph name  Opti.  Best (Avg.)  t (s)  Best (Avg.)  t (s)  Best (Avg.)  t (s)  Best (Avg.)  t (s)  
C2000.5  2000    2400 (2425.1)  3134.0  2151 (2162.4)  11827  2250 (2266.2)  11030     
C2000.9  2000    6228 (6284.0)  2798.3  5486 (5507.9)  166246  5808 (5849.3)  161980     
DSJC1000.1  1000    359 (362.9)  430.5  300 (302.6)  98115  305 (305.9)  97025  342.0 (344.5)  1105 
DSJC1000.5  1000    1357 (1430.9)  371.7  1220 (1228.4)  234  1242 (1270.4)  1929  1230.0 (1260.0)  167639 
DSJC1000.9  1000    3166 (3231.0)  490.2  2864 (2875.7)  48298  2975 (2997.8)  101238     
DSJC250.1  250    140 (141.9)  48.9  130 (132)  1  127 (127)  2576  127 (127)  1353 
DSJC250.5  250    415 (428.1)  269.2  404 (407.7)  43579  393 (393.3)  58615  392 (392)  9226 
DSJC250.9  250  934  939 (943.2)  926  940 (943.11)  801.22  934 (936.4)  7670.05  934  4722 
DSJC500.1  500    210 (215.6)  426  188 (189.8)  50  184 (185.4)  72845  184 (184.9)  23049 
DSJC500.5  500    778 (845.1)  159.3  726 (729.0)  2466  707 (716.9)  51407  686* (688.4)  47064 
DSJC500.9  500    1790 (1854.5)  831.1  1681 (1685.3)  148914  1692 (1702.5)  136637  1662* (1664)  121518 
DSJR500.1  500    169 (175.4)  458.9  171 (173.4)  1  169 (169)  0.56  169 (171.7)  35389 
flat1000_50_0  1000    1289 (1315.7)  981.8  1184 (1186.7)  93711  1218 (1226.4)  141135  924*  82068 
flat1000_60_0  1000    1338 (1354)  201.9  1220 (1230.6)  264  1242 (1258.0)  60631  1224 (1231.6)  168937 
flat1000_76_0  1000    1314 (1337.6 )  2396.6  1200 (1210.5)  316  1240 (1246.7)  141292  1210.0 (1220.6)  164934 
inithx.i.1  864    587 (587.9 )  527.5  569 (571.4)  2  569  ¡ 0.01  569  214 
inithx.i.2  645    341 (341.6)  0.01  329 (331.5)  1328  329  184  336 (337.0)  127 
inithx.i.3  621    352 (355.6 )  0.01  337 (337.4)  5  337  1  350 (352.2)  142 
latin_square_10  900    1690 (1900.0)  780.3  1548 (1561.8)  36451  1542 (1557.0)  142925  1483* (1486.2)  149656 
le450_15a  450    241 (247.1)  288.4  217 (218.7)  1  213 (213.8)  83054  213 (216.5)  43854 
le450_15b  450    239(245.1)  368.3  219 (225.3)  1  217 (218.2)  2505  217 (219.6)  71563 
le450_15c  450    313 (320.8)  432.9  288 (292.2)  2  277 (280.3)  17392  275* (277.4)  46789 
le450_15d  450    306 (314.1)  113.7  285 (288.9)  979  274 (275.8)  155735  272* (272.4)  22917 
le450_25a  450    317 (329.9)  362.3  308 (310.4)  45732  306 (306.0)  715  307 (308.6)  26228 
le450_25b  450    318 (325.8)  285.9  308 (310.4)  45732  307 (307.0)  18  307 (308.8)  43456 
le450_25c  450    378 (387.9 )  359.4  360 (364.1)  2  349 (352.1)  82371  342* (343)  57924 
le450_25d  450    375 (385.3)  254.8  342 (350.2)  4  339 (342.4)  16881  329* (330.1)  45128 
miles1000  128  431  432 (444.7 )  480.0  431  6.06  431  18.81  431  581 
miles1500  128  797  797  1802  797 (797.3)  0.03  797  1.68  797  76 
miles250  128  102  102 (102.7)  56.6  102 (103.6)  0.7  102  0.18  102  21 
miles500  128  260  260 (261.3)  48.4  260 (260.4)  1.39  260  0.18  260  30 
queen10_10  100    166 (169.2 )  68.4  162 (165.3)  10400  162 (162.0)  24  162 (162)  19 
queen11_11  121    178 (182.3)  55.2  179 (180.1)  909  172 (172.7)  68208  172 (172)  1753 
queen12_12  144    194 (198.6)  92.7  188 (189.4)  133726  185 (185.2)  23709  185 (185)  1826 
queen13_13  169    204 (207.5)  199.8  197 (200.9)  8100  194 (194.8)  20802  194 (194)  1150 
queen14_14  196    224 (227.4)  360.1  217 (219.7)  2  216 (217.1)  469  215* (215.2)  16621 
queen15_15  225    237 (241.2)  183.4  230 (233.5)  1  224 (226.7)  2685  223* (224.1)  9276 
queen16_16  256    253 (256.3)  300.9  242 (246.0)  579  238 (239.0)  60187  234* (234.8)  14751 
wap01a  2368    638 (653.1)  1133.5  545 (558.6)  89184  549 (550.5)  62497     
wap02a  2464    637 (638.1)  3270.46  538 (547.1)  40143  541 (543.1)  46848     
wap03a  4730    687 (707.5)  2901.5  562 (566.9)  134923  577 (579.7)  55554     
wap04a  5231    698 (709.0 )  4.79  563 (583.0)  52833  570 (573.0)  538455     
wap05a  905    598 (610.9)  1574.5  542 (548.1)  1348  542 (542.8)  62323  587 (588.0)  1139 
wap06a  947    599 (607.6)  65.3  519 (529.4)  134  519 (520.8)  46077  574 (587.0)  1244 
wap07a  1809    680 (692.5)  384.82  561 (563.8)  3433  567 (571.0)  11943  685 (686.0)  8294 
wap08a  1870    663 (673.4)  2627.2  529 (540.3)  37505  546 (549.7)  67847  663 (665)  7397 
zeroin.i.1  211  511  518  0.01  511  0.08  511  1.7  511  38 
zeroin.i.2  211  336  336 (337.6)  440.8  336  0.22  336  0.01  336  33 
zeroin.i.3  206  298  299 (301.7)  139.6  298 (298.7)  2.25  298  10.67  298  31 
Best rate  10.2%  53.1%  48.9 %  63.3% 
3.5 Impact of the selection selection strategy driven by deep learning
In this section we aim to analyze the benefits of the neural network based crossover selection strategy (see Section 2.6), which is one of the key components of the proposed DLMCOL framework.
We launched 10 replications of DLMCOL on 4 instances (DSJC500.1, DSJC500.5, le450_25c and le450_25d) with a cutoff time of 20 hours, with and without the neural network crossover selection. In the version without neural network, a second parent is randomly chosen among the nearest neighbors of each individual in order to perform a crossover.
The average best score obtained at each generation is displayed on Figure 1. Green curve corresponds to the standard DLMCOL algorithm while the red curve corresponds to the version without deep learning. One first notices that the version without deep learning can perform more generations in the same amount of time because there is no time spent in the neural network training and offspring evaluations.
However we observe that the green curve is always below the red curve and that better results can be achieved in the same amount of time. This highlights that the neural network can really help in the selection of promising crossovers for the memetic algorithm.
4 Conclusion
A deep learning guide memetic framework for graph coloring problems was presented in this paper, as well as an implementation on GPU devices to solve the typical weighted graph coloring problem. This approach uses the deep set architecture to learn an invariant by color permutation regression model, useful to select the most promising crossovers at each generation. It can take advantage of GPU computations to perform massively parallel local search computations of a large population.
The proposed approach was assessed on popular DIMACS and COLOR challenge benchmarks of the weighted graph coloring problem. The computational results show that the algorithm competes globally well with the best algorithms on this problem. It can find 11 new upper bounds for very difficult instances and significantly improve the previous best results for three graphs. The same framework with the same type of neural network architecture could be applied to solve other graph coloring problems.
The achieved results reveals however three main limitations of the proposed approach. First, due to the memory capacity on the GPU devices we used, the DLMCOL algorithm has trouble to deal with very large instances. In particular, for the parallel local searches, the memory available on each thread of the GPU can be a huge limitation. Secondly, the algorithm can be quite slow to converge in comparison with sequential local search algorithms, due to the large population and the time spent to train the neural network at each generation. Thirdly, the algorithm fails for large instances with a low density (sparse graphs), as for these instance and for the studied coloring problem, the neural network has trouble to learn good patterns to effectively guide the selection of promising crossovers.
Other future works could be done. In particular, it could be interesting to use deep learning techniques to learn a specific crossover for this weighted graph coloring problem instead of the classical GPX crossover used in this algorithm. Other neural network structures could be investigated to overcome the difficulty encountered on sparse graphs.
Acknowledgment
We would like to thank Dr. Wen Sun for sharing the binary code of their AFISA algorithm [sun2017feasible], Dr. Yiyuan Wang for sharing the source code of their RedLS algorithm [wang2020reduction] and Pr. Bruno Nogueira for sharing the source code of their ILSTS algorithm [nogueira2021iterated].
This work was granted access to the HPC resources of IDRIS (Grant No. 2020A0090611887) from GENCI.