I Introduction
The Traveling Salesman Problem (TSP) is one of the most famous combinatorial optimization problems. Given the coordinates of some points, the goal is to find a shortest tour that visits each point exactly once and returns to the starting point. TSP is NP-hard [PAPADIMITRIOU1977237], even in its symmetric 2D Euclidean version, which is this paper’s focus. Traditional approaches to solving TSP can be classified as exact or heuristic. Exact solvers, such as Concorde [CC] or solvers based on integer linear programming, can find an optimal solution. However, since TSP is NP-hard, the computational time of such algorithms increases exponentially with the size of the TSP instance. In contrast, heuristic approaches provide a TSP solution in a much shorter computational time than exact solvers, but do not guarantee optimality. These approaches are either constructive (e.g., farthest insertion [attention]), perturbative (e.g., 2-opt [2opt], LKH [LKH]), or hybrid. However, they may not provide any good performance guarantee and are still computationally costly. Indeed, even a quadratic computational complexity may become prohibitive when dealing with large TSP instances (e.g., 1000 cities).
Thus, recent research work has focused on using Deep Learning (DL) to design faster heuristics for TSP. Since training on large TSP instances is costly, generalization is a key factor in such DL-based approaches. They are based on either Supervised Learning (SL) [45283, joshi2019efficient, Fu] or Reinforcement Learning (RL) [DBLP:journals/corr/BelloPLNB16, attention, GPN, DBLP:journals/corr/abs200401608]. These different approaches, which are either constructive, perturbative, or hybrid, have different pros and cons. For example, Fu’s model, which combines DL with Monte Carlo Tree Search (MCTS) [MCTS], has great generalization capabilities: it can be trained on small TSP instances and perform well on larger instances. However, the computational cost of Fu’s model is high due to MCTS. In contrast, other models (e.g., [joshi2019efficient, attention]) can solve small TSP instances quickly and with great performance, but they lack generalizability.
In this paper, we propose a novel deep RL approach that achieves excellent performance with good generalizability at a reasonable computational cost. The contributions of this paper can be summarized as follows. Our approach is based on an encoder-decoder model (using a Graph Neural Network (GNN) [GNN] and a Multilayer Perceptron (MLP) [MLP] as the encoder and an attention mechanism [attention_mechanism] as the decoder), which is trained with a new deep RL method that interleaves policy gradient updates (with a simple baseline called the policy rollout baseline) and local search (with a novel combined local search technique). Moreover, curriculum learning is applied to help with training and generalization. Reflecting the techniques used, we name our model MAGIC (MLP for M, Attention for A, GNN for G, Interleaved local search for I, and Curriculum Learning for C). Finally, we empirically show that MAGIC is a state-of-the-art deep RL solver for TSP, which offers a good trade-off in terms of performance, generalizability, and computational time.
This paper is structured as follows. Section II overviews related work. Section III recalls the necessary background. Section IV introduces our model architecture. Section V describes our novel training technique by explaining how we apply local search, the policy rollout baseline, and curriculum learning during training. Section VI presents the experimental results and Section VII concludes.
II Related Work
RL can be used as a constructive heuristic to generate a tour or as a machine learning method integrated in a traditional method, such as [DBLP:journals/corr/abs200401608], which learns to apply 2-opt. For space reasons, we mainly discuss deep RL work following the constructive approach (see [Bai] for a more comprehensive survey), since it is the most related to our work. Besides, a recent work [joshi_learning_2020] suggests that RL training may lead to better generalization than supervised learning.
Such deep RL work started with the Pointer Network [45283], which was proposed as a general model that could solve an entire class of TSP instances. It has an encoder-decoder architecture, with both parts based on recurrent neural networks, combined with an attention mechanism [Bahdanau2015]. The model is trained in a supervised way using solutions generated by Concorde [CC]. The results are promising, but the authors focused only on small-scale TSP instances (with up to 50 cities) and did not deal with generalization.
This approach was extended to the RL setting [DBLP:journals/corr/BelloPLNB16] and shown to scale to TSP with up to 100 cities. The RL training is based on an actor-critic scheme using tour lengths as unbiased estimates of the value of a policy.
In contrast to [DBLP:journals/corr/BelloPLNB16], a value-based deep RL approach [dai_learning_2017] was also investigated to solve graph combinatorial optimization problems in general and TSP in particular. The approach uses graph embeddings to represent partial solutions and RL to learn a greedy policy.
The Attention Model [attention] improves on the Pointer Network [45283], notably by replacing the recurrent neural networks with attention models [attention_mechanism] and using RL training with a simple greedy rollout baseline. These changes allowed the authors to achieve better results on small-scale TSP instances, as well as to generalize to 100-city TSP instances. However, their model fails to generalize well to large-scale TSP (e.g., with 1000 cities) and their algorithm does not scale well in terms of memory usage.
A similar, although slightly more complex, approach is proposed in [deudon_learning_2018], which also suggests improving the tour returned by the deep RL policy with a 2-opt local search, which makes the overall combination a hybrid heuristic. In contrast to that work, we not only apply local search as a final improvement step, but also integrate local search in the training of our deep RL model. Moreover, we use a more sophisticated local search.
Finally, the Graph Pointer Network (GPN) model [GPN] was proposed to improve over previous models by exploiting graph neural networks [GNN] and using a central self-critic baseline, which is a centered greedy rollout baseline. Like [deudon_learning_2018], the authors also consider 2-opt. As a result, they report good results when generalizing to large-scale TSP instances. Our simpler model and new training method outperform GPN on both small and larger TSP instances.
III Background
This section provides the necessary information to understand our model architecture (Section IV) and our training method (Section V). For any , denotes
. Vectors and matrices are denoted in bold.
III-A Traveling Salesman Problem
A Traveling Salesperson Problem (TSP) can informally be stated as follows. Given $N$ cities, the goal in a TSP instance is to find a shortest tour that visits each city exactly once. Formally, the set of cities can be identified with the set $[N] = \{1, \ldots, N\}$. In the symmetric 2D Euclidean version of TSP, each city $i \in [N]$ is characterized by its 2D coordinates $\mathbf{x}_i \in \mathbb{R}^2$. Let $X = \{\mathbf{x}_i\}_{i \in [N]}$ denote the set of city coordinates and $\mathbf{X} \in \mathbb{R}^{N \times 2}$ the matrix containing all these coordinates. The distance between two cities $i, j \in [N]$ is usually measured in terms of the L2-norm $\|\cdot\|_2$:
$$d(i, j) = \|\mathbf{x}_i - \mathbf{x}_j\|_2 \tag{1}$$
A feasible TSP solution, called a tour, corresponds to a permutation $\sigma$ over $[N]$. Its length is defined as:
$$L_\sigma(\mathbf{X}) = \sum_{t=1}^{N} \|\mathbf{x}_{\sigma(t)} - \mathbf{x}_{\sigma(t+1)}\|_2 \tag{2}$$
where for $t \in [N]$, $\sigma(t) \in [N]$ is the $t$-th city visited in the tour defined by $\sigma$, and by abuse of notation, $\sigma(N+1) = \sigma(1)$. Therefore, TSP can be viewed as the following optimization problem:
$$\min_{\sigma} L_\sigma(\mathbf{X}) \tag{3}$$
Since scaling the city positions does not change the TSP solution, we assume in the remainder of the paper that the coordinates of all cities are in the unit square $[0, 1]^2$, as done in previous work [DBLP:journals/corr/BelloPLNB16, DBLP:journals/corr/abs200401608, attention, GPN].
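As a concrete illustration of the tour length in (2), the following minimal sketch computes the length of a closed tour. The function name `tour_length` and the representation of an instance as a list of coordinate pairs with 0-based city indices are assumptions made for this example, not part of the paper.

```python
import math

def tour_length(coords, tour):
    """Length of the closed tour visiting coords[tour[0]], coords[tour[1]], ... and back."""
    n = len(tour)
    # The modulo wraps the last city back to the first one, closing the tour.
    return sum(math.dist(coords[tour[t]], coords[tour[(t + 1) % n]]) for t in range(n))

# Four corners of the unit square (illustrative data).
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
perimeter = tour_length(square, [0, 1, 2, 3])  # follows the sides of the square
crossing = tour_length(square, [0, 2, 1, 3])   # uses both diagonals, hence longer
```

Solving (3) amounts to finding the visiting order with the smallest such value; here the perimeter order is shorter than the crossing one.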
III-B Insertion heuristics and k-opt optimization for TSP
Since TSP is an NP-hard problem [PAPADIMITRIOU1977237], various heuristic techniques have been proposed to quickly compute a solution, which may however be suboptimal. We recall two families of heuristics: insertion heuristics [attention] and k-opt [kopt].
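Insertion heuristics (including nearest, farthest, and random insertion) are constructive, i.e., they iteratively build a solution. They work as follows. They first randomly choose a starting city and repeatedly insert one new city at a time until obtaining a complete tour. Let $\hat{\sigma}$ denote a partial tour, i.e., a partial list of all cities. Different insertion heuristics follow different rules to choose a new city $i^*$: random insertion chooses a new city randomly; nearest insertion chooses $i^*$ according to:
$$i^* = \arg\min_{i \notin \hat{\sigma}} \min_{j \in \hat{\sigma}} d(i, j) \tag{4}$$
and farthest insertion chooses $i^*$ according to the following rule:
$$i^* = \arg\max_{i \notin \hat{\sigma}} \min_{j \in \hat{\sigma}} d(i, j) \tag{5}$$
where $d(i, j)$ is the distance between cities $i$ and $j$ as in (1), $i \notin \hat{\sigma}$ means city $i$ is not in the partial tour, and $j \in \hat{\sigma}$ means city $j$ is in the partial tour. The position where city $i^*$ is inserted into $\hat{\sigma}$ is determined such that the added length $d(\hat{\sigma}(t), i^*) + d(i^*, \hat{\sigma}(t+1)) - d(\hat{\sigma}(t), \hat{\sigma}(t+1))$ is minimized.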
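The farthest insertion rule recalled above can be sketched as follows. This is a minimal, unoptimized illustration, not the implementation used in the experiments: the names `farthest_insertion` and `added_length` are hypothetical, cities are 0-indexed, and the starting city is passed as a parameter rather than drawn at random so that the example is reproducible.

```python
import math

def tour_length(coords, tour):
    n = len(tour)
    return sum(math.dist(coords[tour[t]], coords[tour[(t + 1) % n]]) for t in range(n))

def farthest_insertion(coords, start=0):
    """Build a tour by repeatedly inserting the city farthest from the partial tour."""
    def d(i, j):
        return math.dist(coords[i], coords[j])

    tour = [start]
    remaining = set(range(len(coords))) - {start}
    while remaining:
        # Selection rule: the unvisited city whose nearest tour city is farthest away.
        city = max(remaining, key=lambda i: min(d(i, j) for j in tour))

        def added_length(t):
            # Extra length incurred by inserting `city` between positions t and t + 1.
            a, b = tour[t], tour[(t + 1) % len(tour)]
            return d(a, city) + d(city, b) - d(a, b)

        best_t = min(range(len(tour)), key=added_length)
        tour.insert(best_t + 1, city)
        remaining.remove(city)
    return tour

# Corners of the unit square plus its center (illustrative data).
points = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.5, 0.5)]
tour = farthest_insertion(points)
```

Replacing `max` with `min` in the selection rule gives a sketch of nearest insertion, and replacing it with a random choice gives random insertion.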
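A classic local search heuristic is $k$-opt, which aims to improve an existing tour by swapping $k$ chosen edges at each iteration. The simplest one is 2-opt, which replaces a tour $\sigma$ by $\sigma'$, where $\sigma'$ is obtained from $\sigma$ by reversing the segment between two positions $i < j$, i.e., $\sigma' = (\sigma(1), \ldots, \sigma(i-1), \sigma(j), \sigma(j-1), \ldots, \sigma(i), \sigma(j+1), \ldots)$, if $\sigma'$ is shorter than $\sigma$. This kind of improvement can be found in different ways. For instance, traditional 2-opt may examine all pairs of edges, while random 2-opt examines randomly selected pairs. LKH [LKH] is one algorithm that applies $k$-opt and achieves nearly optimal results. However, LKH has a long run time, especially for large-scale TSP problems.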
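To make the 2-opt move concrete, here is a minimal sketch of a random 2-opt loop. The names `two_opt_move` and `random_two_opt` and the trial budget `n_trials` are hypothetical. For simplicity the sketch recomputes the full tour length for every candidate, whereas an efficient implementation would only compare the two removed and the two added edges.

```python
import math
import random

def tour_length(coords, tour):
    n = len(tour)
    return sum(math.dist(coords[tour[t]], coords[tour[(t + 1) % n]]) for t in range(n))

def two_opt_move(tour, i, j):
    """Reverse the segment tour[i..j], which swaps two edges of the tour for two others."""
    return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]

def random_two_opt(coords, tour, n_trials, rng):
    """Try n_trials random 2-opt moves and keep only those that shorten the tour."""
    best, best_len = list(tour), tour_length(coords, tour)
    for _ in range(n_trials):
        i, j = sorted(rng.sample(range(len(best)), 2))
        candidate = two_opt_move(best, i, j)
        candidate_len = tour_length(coords, candidate)
        if candidate_len < best_len:
            best, best_len = candidate, candidate_len
    return best

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
crossing = [0, 2, 1, 3]  # a tour that uses both diagonals
untangled = random_two_opt(square, crossing, n_trials=200, rng=random.Random(0))
```

On this toy instance, the only improving move is the one that reverses the middle segment and removes the crossing.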
IV Model and Architecture
RL can be used as a constructive method to iteratively generate a complete tour: at each iteration , a new city with coordinates is selected based on the list of previously selected cities and the description of the TSP instance. Formally, this RL model is defined as follows. A state is composed of the TSP description and the sequence of already visited cities at time step . State denotes the initial state where no city has been selected yet and state represents the state where the whole tour has already been constructed. An action corresponds to the next city to be visited, i.e., . This RL problem corresponds to a repeated horizon sequential decisionmaking problem where the action set for any time step depends on the current state and only contains the cities that have not been visited yet. The immediate reward for performing an action in a state is given as the negative length between the last visited city and the next chosen one:
(6) 
After choosing the first city, no reward can be computed yet. After the last city, a final additional reward is provided given by . Thus, a complete trajectory corresponds to a tour and the return of a trajectory is equal to the negative of the length of that tour. Most RL-based constructive solvers are based on this RL formulation. In Section V, we change the return provided to the RL agent to improve its performance using local search.
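The reward structure described above can be checked with a short sketch: each step is rewarded with the negative distance to the newly chosen city, a final reward closes the tour, and the rewards sum to the negative tour length. The function name `step_rewards` and the list-based representation are assumptions for this example.

```python
import math

def step_rewards(coords, tour):
    """Per-step rewards of a constructive episode that visits cities in the order `tour`."""
    # Reward for each newly chosen city: negative distance from the previously visited city.
    rewards = [-math.dist(coords[a], coords[b]) for a, b in zip(tour, tour[1:])]
    # Final additional reward: negative distance from the last city back to the first one.
    rewards.append(-math.dist(coords[tour[-1]], coords[tour[0]]))
    return rewards

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
rewards = step_rewards(square, [0, 1, 2, 3])
episode_return = sum(rewards)  # equals the negative length of the tour
```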
To perform the selection of the next city, we propose the MAGIC architecture (see Fig. 1), which corresponds to a stochastic policy (see Section V for more details). It is composed of three parts: (A) an encoder implemented with a graph neural network (GNN) [GNN] and a multilayer perceptron (MLP), (B) a decoder based on an attention mechanism [attention_mechanism], and (C) a sampling process.
IV-A Encoder
When solving a TSP problem, not only should the last selected city be considered, but also the whole city list should be taken into account as background information. Since the information contained in 2D coordinates is limited and does not include the topology of the cities, we leverage GNN and MLP to encode city coordinates into a higher dimensional space, depicted in Fig. 1. The GNN is used to encode the city coordinates into where is the dimension of the embedding space. The MLP is used to encode the last selected city at iteration into . Therefore, generally speaking, the GNN and MLP in MAGIC can be viewed as two functions:
(7) 
IV-A1 GNN
GNN is a technique which can embed all nodes in a graph together. Similarly to the GPN model [GPN], we use a GNN to encode the whole city list of a TSP instance. Fig. 2 shows the detailed architecture of the GNN used in MAGIC. After is transformed into a vector , will go through layers of GNN. Each layer of GNN can be expressed as
(8) 
where is the input of the layer of the GNN for , , is an learnable matrix, which is represented by a neural network, is the aggregation function [GNN] , and is a trainable parameter.
IV-A2 MLP
While the GNN provides us with general information about the whole city list , we also need to encode the last selected city . In contrast to previous work using complex architectures like GNN or LSTM [LSTM], we simply use an MLP. Using a GNN would make the embedding of the last selected city depend on the whole city list, including the already-visited cities, while using an LSTM would make the embedding depend on the order of visited cities, which is in fact irrelevant.
IV-B Decoder
The decoder of the MAGIC model is based on an attention mechanism, which was also used in several previous studies [DBLP:journals/corr/BelloPLNB16, attention, GPN, Fu]. The output of the decoder is a pointer vector [DBLP:journals/corr/BelloPLNB16], which can be expressed as:
(9) 
where is the entry of the vector , is the row of the matrix , and are trainable matrices with shape , is a trainable weight vector. For the definitions of and , please refer to Fig. 1.
A softmax transformation is used to turn into a probability distribution over cities:
(10) 
where is the entry of the probability distribution and is the entry of the vector . Notice that if the city is visited, then due to (9). Under this circumstance, according to (10). That is to say, no visited city can be visited again.
IV-C Sampling Process
After we obtain the probability distribution , it is trivial to select the next city. Indeed, corresponds to the RL policy at time step :
(11) 
where (resp. ) is the state (resp. action) at time step and is the probability of choosing as the city. Therefore, we just need to sample the next city according to the probability distribution .
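The masking and sampling steps can be sketched as follows. Here `scores` stands in for the pointer vector produced by the decoder, and the entries of visited cities are set to negative infinity so that the softmax assigns them zero probability. The function names and the boolean `visited` mask are assumptions for this example, and at least one city is assumed to be unvisited.

```python
import math
import random

def masked_softmax(scores, visited):
    """Softmax over city scores in which already-visited cities get probability zero."""
    masked = [-math.inf if v else s for s, v in zip(scores, visited)]
    top = max(masked)  # subtracting the maximum keeps the exponentials numerically stable
    exps = [math.exp(s - top) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_city(scores, visited, rng):
    """Draw the next city according to the masked probability distribution."""
    probs = masked_softmax(scores, visited)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = masked_softmax([1.0, 2.0, 3.0], [False, True, False])
rng = random.Random(0)
draws = [sample_next_city([1.0, 2.0, 3.0], [False, True, False], rng) for _ in range(200)]
```

Taking the index of the largest probability instead of sampling would give a greedy decoding of the same policy.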
V Algorithm and Training
For the training of MAGIC, we propose to interleave standard policy gradient updates with local search. In contrast to previous work, our idea is to learn a policy that can generate tours that can be easily improved with local search. Next, we explain our local search technique, which includes a novel local insertion-based heuristic. Then, we present how policy gradient with a simple policy rollout baseline can be applied. Finally, we motivate the use of a stochastic curriculum learning method in our setting.
V-A Local search
We describe the local search technique that we use for training our model and to improve the tour output by the RL policy. Our technique uses two local search heuristics in combination: random 2-opt and a local insertion heuristic, which is novel to the best of our knowledge. The two heuristics have been chosen and designed to be computationally efficient, which is important since we apply them during RL training. The motivation for combining two heuristics is that when one method gets stuck in some local minimum, the other method may help escape from it.
For random 2-opt, we randomly pick 2 arcs for improvement and repeat for , where is the number of the cities and and are two hyperparameters. We set and here to have a flexible control of the strength of this local search and make it stronger if needed for larger TSP problems. With this procedure, random 2-opt can be much faster than traditional 2-opt.
Inspired by the insertion heuristics, we propose local insertion optimization. Let be the current tour and if and . This method (see Algorithm 1) first iterates through all indices , and for each index , we let , where is a hyperparameter, and then replace by . The rationale for restricting the optimization with hyperparameter is as follows. For a good suboptimal tour , cities that are close in terms of visit order in are usually also close in terms of distance. In that case, is unlikely to improve over when and are far apart. Thus, we set to limit the search range to increase the computational efficiency of this heuristic.
We call our local search technique combined local search (see Algorithm 2), which applies random 2opt followed by local insertion optimization repeatedly for times, where is a hyperparameter.
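The combined local search can be sketched as below. This is one reading of the description above rather than a transcription of Algorithms 1 and 2. The parameter names `n_trials`, `window`, and `rounds` are hypothetical stand-ins for the paper's hyperparameters, the window is not wrapped around the end of the tour, and tour lengths are recomputed in full for clarity rather than speed.

```python
import math
import random

def tour_length(coords, tour):
    n = len(tour)
    return sum(math.dist(coords[tour[t]], coords[tour[(t + 1) % n]]) for t in range(n))

def random_two_opt(coords, tour, n_trials, rng):
    """Keep random segment reversals that shorten the tour."""
    best, best_len = list(tour), tour_length(coords, tour)
    for _ in range(n_trials):
        i, j = sorted(rng.sample(range(len(best)), 2))
        cand = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
        cand_len = tour_length(coords, cand)
        if cand_len < best_len:
            best, best_len = cand, cand_len
    return best

def move_city(tour, t, t_new):
    """Remove the city at position t and reinsert it at position t_new."""
    out = tour[:t] + tour[t + 1:]
    out.insert(t_new, tour[t])
    return out

def local_insertion(coords, tour, window):
    """For each position, move its city to the best position at most `window` steps away."""
    tour = list(tour)
    n = len(tour)
    for t in range(n):
        candidates = range(max(0, t - window), min(n - 1, t + window) + 1)
        # Position t itself is a candidate, so this step never lengthens the tour.
        best = min(candidates, key=lambda p: tour_length(coords, move_city(tour, t, p)))
        tour = move_city(tour, t, best)
    return tour

def combined_local_search(coords, tour, rounds, n_trials, window, rng):
    """Alternate random 2-opt and local insertion for a fixed number of rounds."""
    for _ in range(rounds):
        tour = random_two_opt(coords, tour, n_trials, rng)
        tour = local_insertion(coords, tour, window)
    return tour

rng = random.Random(0)
cities = [(rng.random(), rng.random()) for _ in range(30)]
initial = list(range(30))
improved = combined_local_search(cities, initial, rounds=3, n_trials=200, window=5, rng=rng)
```

Because both steps only accept changes that do not lengthen the tour, the output is never worse than the input, whatever the values of the three parameters.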
V-B Interleaved RL training with the policy rollout baseline
Our model is trained with the REINFORCE [Williams92simplestatistical] algorithm. The novelty is that we interleave local search with the policy gradient updates. When the current policy outputs a tour , this solution is further improved with our combined local search technique to obtain a new tour . In contrast to previous work, this tour instead of is used to evaluate policy . The rationale for this procedure is to make the RL policy and local search work in synergy by favoring learning policies that generate tours that can be easily improved by the combined local search. If the RL training and local search are not coordinated, as done in previous work, then a trained policy may generate tours that are hard to improve by local search.
V-B1 Policy Gradient
We recall first standard policy gradient and then explain how we modify it. With the reward function in (6), the RL goal would be to find such that
(12) 
where for , , is a trajectory, and is the probability distribution over tours induced by policy . Recall the gradient of [Williams92simplestatistical] is:
(13) 
where stands for . For a large enough batch of trajectories, (13) is approximated with the empirical mean:
(14) 
where , (resp. ) is the state (resp. action) at time step of the th trajectory generated by , and denotes the empirical mean operation. Then the policy gradient in (13) can be approximated by:
(15) 
Instead of updating with this policy gradient, in our interleaved training, we use:
(16) 
where is the improved tour obtained from our combined local search from , the tour induced by trajectory . By construction, .
V-B2 Policy rollout baseline
In order to reduce the variance of the policy gradient estimate (16), we use a simple baseline and update in the following direction:
(17) 
where is the baseline, which we call the policy rollout baseline. Such a baseline gives more weight in the policy gradient when local search can make more improvement. In our experiments, our baseline performs better than the previous greedy baselines [attention, GPN] in our training process. One other nice feature of our baseline is that it does not incur any extra computation since is already computed when policy generates .
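The role of such a baseline can be illustrated with a small sketch that computes, for each trajectory in a batch, the scalar weight that multiplies the sum of its log-probability gradients. It assumes, as a labeled guess, that the baseline is the batch mean of the tour lengths produced by the policy before local search; the exact expression used in (17) may differ. The function name and the list-based inputs are hypothetical.

```python
def rollout_baseline_weights(policy_lengths, improved_lengths):
    """Per-trajectory weights for a REINFORCE update with a rollout-style baseline.

    policy_lengths:   tour lengths produced directly by the policy
    improved_lengths: lengths of the same tours after local search
    """
    # Assumption: the baseline is the batch mean of the lengths before local search.
    # These lengths are already available, so the baseline needs no extra rollout.
    baseline = sum(policy_lengths) / len(policy_lengths)
    # A tour that ends up shorter than the baseline after local search gets a positive weight.
    return [baseline - improved for improved in improved_lengths]

weights = rollout_baseline_weights([5.0, 6.0], [4.5, 5.0])
```

In this example the baseline is 5.5, so the first trajectory, whose improved tour has length 4.5, receives the larger weight.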
V-C Stochastic Curriculum Learning
Curriculum Learning (CL) is a widely used technique in machine learning (and RL) [CL_survey], which can facilitate learning and improve generalization. Although it can be implemented in various ways, its basic principle is to control the increasing difficulty of the training instances.
To train MAGIC, we propose a stochastic CL technique where the probability of choosing harder instances increases over training steps. We choose the number of cities as a measure of difficulty for a TSP instance, which is assumed to be in $\{10, 11, \ldots, 50\}$ in our experiments. We explain next how this selection probability is defined.
For epoch , we define the vector (since there are 41 integers between 10 and 50) to be
(18) 
where represents the entry of , and is a hyperparameter which represents the standard deviation of the normal distribution. Then, we use a softmax to formulate the probability distribution of this epoch:
(19) 
where , and the entry of represents the probability of choosing TSP of cities at epoch .
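A stochastic curriculum of this kind can be sketched as follows. The sketch scores each size with a normal density and turns the scores into a distribution with a softmax, as described above. Several details are assumptions made for illustration only: the center of the density moves linearly from 10 to 50 over the epochs, and a hypothetical `scale` factor sharpens the softmax. The exact form of (18) and (19) may differ.

```python
import math
import random

SIZES = list(range(10, 51))  # the 41 candidate numbers of cities

def size_distribution(epoch, n_epochs, sigma, scale=10.0):
    """Probability of training on each TSP size at a given epoch."""
    # Assumption: the center of the density moves linearly from 10 to 50 during training.
    center = SIZES[0] + (SIZES[-1] - SIZES[0]) * epoch / max(1, n_epochs - 1)
    scores = [
        math.exp(-0.5 * ((n - center) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
        for n in SIZES
    ]
    # Softmax over the (scaled) scores; `scale` is a hypothetical sharpening factor.
    top = max(scores)
    exps = [math.exp(scale * (s - top)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_size(epoch, n_epochs, sigma, rng):
    """Draw the TSP size used for one epoch."""
    probs = size_distribution(epoch, n_epochs, sigma)
    return rng.choices(SIZES, weights=probs, k=1)[0]

early = size_distribution(0, 200, sigma=3.0)
late = size_distribution(199, 200, sigma=3.0)
size = sample_size(100, 200, sigma=3.0, rng=random.Random(0))
```

Every size keeps a nonzero probability at every epoch, so easy instances can still be drawn late in training and hard ones early, which is what makes the curriculum stochastic.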
V-D Overall training process
In this part, we summarize our training process by providing the corresponding pseudocode in Algorithm 3.
Notice that line 7 can be replaced by any model that can be used to generate a tour, showing that a variety of models can fit in our training process to improve their performance for TSP problems.
VI Experiments
Table I: Results on small-scale TSP instances (TSP20, TSP50, TSP100).

| Method | Type | TSP20 Length | TSP20 Gap | TSP20 Time (s) | TSP50 Length | TSP50 Gap | TSP50 Time (s) | TSP100 Length | TSP100 Gap | TSP100 Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| Concorde | Exact Solver | 3.830 | 0.00% | 138.6 | 5.691 | 0.00% | 820.8 | 7.761 | 0.00% | 3744 |
| Gurobi | Exact Solver | 3.830 | 0.00% | 139.8 | 5.691 | 0.00% | 1572 | 7.761 | 0.00% | 12852 |
| 2-opt | Heuristic | 4.082 | 6.56% | 0.33 | 6.444 | 13.24% | 2.25 | 9.100 | 17.26% | 9.32 |
| Random Insertion | Heuristic | 4.005 | 4.57% | 196 | 6.128 | 7.69% | 502.2 | 8.511 | 9.66% | 1039 |
| Nearest Insertion | Heuristic | 4.332 | 13.10% | 229.8 | 6.780 | 19.14% | 633 | 9.462 | 21.92% | 1289 |
| Farthest Insertion | Heuristic | 3.932 | 2.64% | 239.8 | 6.010 | 5.62% | 617 | 8.360 | 7.71% | 1261 |
| GCN (Joshi et al.) | SL (Greedy) | 3.855 | 0.65% | 19.4 | 5.893 | 3.56% | 120 | 8.413 | 8.40% | 664.8 |
| AttGCRN+MCTS (Fu et al.) | SL+MCTS | 3.830 | 0.00% | 98.3 | 5.691 | 0.01% | 475.2 | 7.764 | 0.04% | 873.6 |
| GAT (Deudon et al.) | RL (Sampling) | 3.874 | 1.14% | 618 | 6.109 | 7.34% | 1171 | 8.837 | 13.87% | 2867 |
| GAT (Kool et al.) | RL (Greedy) | 3.841 | 0.29% | 6.03 | 5.785 | 1.66% | 34.9 | 8.101 | 4.38% | 109.8 |
| GAT (Kool et al.) | RL (Sampling) | 3.832 | 0.05% | 988.2 | 5.719 | 0.49% | 1371 | 7.974 | 2.74% | 4428 |
| GPN (Ma et al.) | RL | 4.074 | 6.35% | 0.77 | 6.059 | 6.47% | 2.50 | 8.885 | 14.49% | 6.23 |
| MAGIC (Ours) | RL (Local Search) | 3.870 | 1.09% | 3.06 | 5.918 | 4.00% | 14.8 | 8.256 | 6.39% | 50.4 |

* refers to methods whose results we directly use from others’ papers.
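Table II: Results on large-scale TSP instances (TSP200, TSP500, TSP1000). A dash marks a value that is not available.

| Method | Type | TSP200 Length | TSP200 Gap | TSP200 Time (s) | TSP500 Length | TSP500 Gap | TSP500 Time (s) | TSP1000 Length | TSP1000 Gap | TSP1000 Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| Concorde | Solver | 10.719 | 0.00% | 206.4 | 16.546 | 0.00% | 2260 | 23.118 | 0.00% | 23940 |
| Gurobi | Solver | - | - | - | - | - | - | - | - | - |
| 2-opt | Heuristic | 12.841 | 19.80% | 34.0 | 20.436 | 23.51% | 201.7 | 28.950 | 25.23% | 826.2 |
| Random Insertion | Heuristic | 11.842 | 10.47% | 27.1 | 18.588 | 12.34% | 68.3 | 26.118 | 12.98% | 137.0 |
| Nearest Insertion | Heuristic | 13.188 | 23.03% | 28.8 | 20.614 | 24.59% | 79.8 | 28.971 | 25.32% | 176.6 |
| Farthest Insertion | Heuristic | 11.644 | 8.63% | 33.0 | 18.306 | 10.64% | 84.0 | 25.743 | 11.35% | 175.5 |
| GCN (Joshi et al.) | SL (Greedy) | 17.014 | 58.73% | 59.1 | 29.717 | 79.61% | 400.2 | 48.615 | 110.29% | 1711 |
| AttGCRN+MCTS (Fu et al.) | SL+MCTS | 10.814 | 0.88% | 149.6 | 16.966 | 2.54% | 354.6 | 23.863 | 3.22% | 748.3 |
| GAT (Deudon et al.) | RL (Sampling) | 13.175 | 22.91% | 290.4 | 28.629 | 73.03% | 1211 | 50.302 | 117.59% | 2262 |
| GAT (Kool et al.) | RL (Greedy) | 11.610 | 8.31% | 5.03 | 20.019 | 20.99% | 90.6 | 31.153 | 34.75% | 190.8 |
| GAT (Kool et al.) | RL (Sampling) | 11.450 | 6.82% | 269.4 | 22.641 | 36.84% | 938.4 | 42.804 | 85.15% | 3838 |
| GPN (Ma et al.) | RL | - | - | - | 19.605 | 18.49% | - | 28.471 | 23.15% | - |
| GPN+2opt (Ma et al.) | RL+2opt | - | - | - | 18.358 | 10.95% | - | 26.129 | 13.02% | - |
| GPN (Ma et al.) | RL | 13.278 | 23.87% | 2.5 | 23.639 | 42.87% | 7.13 | 37.849 | 63.72% | 18.35 |
| MAGIC (Ours) | RL (Local Search) | 11.539 | 7.65% | 69.9 | 18.098 | 9.38% | 207.8 | 25.542 | 10.49% | 487.8 |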

* refers to methods whose results we directly use from others’ papers.
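Table III: Ablation study.

| Instances | Full version Length | Full version Gap | w/o RL Length | w/o RL Gap | w/o CL Length | w/o CL Gap | w/o baseline Length | w/o baseline Gap | w/o local search Length | w/o local search Gap |
|---|---|---|---|---|---|---|---|---|---|---|
| TSP20 | 3.871 | 1.07% | 3.9556 | 3.27% | 3.917 | 2.27% | 3.911 | 2.21% | 3.988 | 4.10% |
| TSP50 | 5.957 | 4.69% | 6.1391 | 7.88% | 5.959 | 4.72% | 5.983 | 5.15% | 5.962 | 4.76% |
| TSP100 | 8.302 | 6.97% | 8.5419 | 10.06% | 8.343 | 7.51% | 8.395 | 8.17% | 8.331 | 7.35% |
| TSP200 | 11.567 | 7.91% | 11.9299 | 11.30% | 11.682 | 8.99% | 11.842 | 10.48% | 11.631 | 8.50% |
| TSP500 | 18.321 | 10.73% | 18.9036 | 14.25% | 18.332 | 10.80% | 18.516 | 11.91% | 18.526 | 11.97% |
| TSP1000 | 25.854 | 11.84% | 26.9936 | 16.76% | 25.954 | 12.27% | 26.188 | 13.28% | 26.505 | 14.65% |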
To demonstrate the performance and generalization ability of our proposed training process, we evaluate our method on randomly generated TSP instances and compare it with existing algorithms, which cover different kinds of methods for completeness and include two exact solvers, four traditional heuristics, and seven learning-based ones. If the hardware and experimental setting of another paper are the same as ours, we directly use its reported results on time and performance. To ensure a fair comparison of the runtimes and performances, all algorithms are executed on a computer with an Intel(R) Xeon(R) CPU E5-2678 v3 and a single 1080Ti GPU, and parallel computation is utilized as much as possible for all the algorithms. Moreover, to show the power of RL, CL, our policy rollout baseline, and the combined local search, we also carry out an ablation study of those four components.
VI-A Data sets and hyperparameters
We denote TSP the set of random TSP instances with cities. For the sake of consistency with other research work, the coordinates of cities are generated randomly from a uniform distribution over the unit square $[0, 1]^2$. For training, the TSP size varies from 10 to 50, as decided by CL in every epoch. After training the model according to our training process, we test it on both small and large TSP instances. For small TSP problems, we test on 10,000 instances each for TSP 20, 50, and 100. For large TSP problems, we test on 128 instances each for TSP 200, 500, and 1000 to evaluate the generalization ability.
For the hyperparameters of the training, we train for 200 epochs, and we process 1000 batches of 128 instances every epoch. The learning rate is initially set to 0.001 with a learning rate decay of 0.96. For the hyperparameters of local search, we set , , and after a quick hyperparameter search. Those settings aim to train a model with fast speed and generalization ability.
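The data generation and learning rate schedule can be sketched as follows. The sketch assumes that the decay factor is applied once per epoch, which the text does not state explicitly, and all function names are hypothetical.

```python
import random

def random_instance(n_cities, rng):
    """One TSP instance with city coordinates drawn uniformly from the unit square."""
    return [(rng.random(), rng.random()) for _ in range(n_cities)]

def random_batch(n_cities, batch_size, rng):
    """A batch of independent instances that all have the same number of cities."""
    return [random_instance(n_cities, rng) for _ in range(batch_size)]

def learning_rate(epoch, initial=1e-3, decay=0.96):
    """Exponentially decayed learning rate (assumption: one decay step per epoch)."""
    return initial * decay ** epoch

rng = random.Random(0)
batch = random_batch(n_cities=20, batch_size=128, rng=rng)
rates = [learning_rate(e) for e in range(3)]
```

Under this per-epoch assumption, the learning rate after 200 epochs would be a small fraction of its initial value; a different decay interval would change that.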
For the model architecture, the aggregation function used in GNN is represented by a neural network followed by a ReLU function on each entry of the output. Our MLP has an input layer with dimension 2, two hidden layers with dimension and respectively, and an output layer with dimension . Layers are fully connected and we use ReLU as the activation function. Finally, we set , .
VI-B Performance on small-scale and large-scale TSP instances
The results are shown in Tables I and II. Columns 1 and 2 respectively specify the method and its type, where SL refers to supervised learning, Greedy means a greedy construction from the probability given by the policy, and Sampling refers to sampling multiple solutions from the probability given by the policy and choosing the best one. Column 3 indicates the average tour length, Column 4 provides the gap to Concorde’s performance, which corresponds to the optimal solution for the small-scale TSP problems and a nearly optimal solution for the large-scale TSP problems, and Column 5 lists the total runtime. For comparison, we list 12 other methods, ranging from exact solvers and heuristics to learning-based algorithms.
As shown in Table I, the exact solvers have the best performance but relatively long runtimes; most of the learning-based methods, including ours, obtain shorter tours than the heuristics. Among the learning-based methods, most methods are not more than better than ours. For TSP 100, only Fu’s model and the methods that apply sampling, all of which search for the best tour, find better tours than ours. In terms of speed, our method is fast, and this is more prominent when the size is bigger. The learning-based methods with better solutions than ours all run slower, and only GPN is faster, but with a significantly worse solution than ours. For the results in Table II, the exact solver again gives the best solution while being relatively slow; many learning-based methods now expose their poor generalization ability and give worse results than heuristics. Among the heuristics, insertion methods show good performance on large TSP problems. Our method outperforms all the learning-based models except Fu’s for TSP 500 and 1000, showing very good generalization ability. In addition, our runtime is generally short, and in particular shorter than Fu’s.
Notice that heuristics have good generalization ability and previous learning-based algorithms do well on small TSP problems. Our learning-based method combined with local search, which is inspired by the heuristics, tends to inherit the advantages of both learning-based methods and heuristics. It is also fast in terms of runtime, making it a well-rounded method.
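The gap column can be read as a relative difference to the reference tour length. The sketch below assumes the usual definition, (length - reference) / reference, applied to average lengths; the function name is hypothetical. Values computed this way from the rounded averages in the tables can differ slightly from the reported gaps, for example if those were averaged per instance.

```python
def optimality_gap(length, reference_length):
    """Relative gap to a reference tour length, in percent."""
    return 100.0 * (length - reference_length) / reference_length

# Average lengths on TSP100 taken from Table I: MAGIC versus Concorde.
magic_gap = optimality_gap(8.256, 7.761)
```

For this pair the computed value is close to the reported one, and a method that matches the reference exactly has a gap of zero.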
VI-C The ablation study
To demonstrate the importance of RL, CL, the policy rollout baseline, and the combined local search in the training process, we perform an ablation study on them. For this study, since we only need to show the importance of each technique we apply, we turn down the hyperparameter to have a shorter runtime. For the ablation study on RL, we do not perform any learning and directly apply the combined local search on randomly generated tours. For the ablation study on CL, we follow the same settings except that the TSP size is fixed to 50 for all epochs. For the ablation study on the policy rollout baseline, we use instead the central self-critic baseline [GPN], which is inspired by self-critic training [selfcritics] and the greedy self-critic baseline [attention]. For the ablation study on the combined local search, we do not perform any local search during training, but we still apply it in testing in order to show that interleaving local search with policy gradient updates outperforms only doing post-optimization. Note that the policy rollout baseline depends on the combined local search. Therefore, since the local search is ablated for this study, the policy rollout baseline also needs to be changed, and here we replace it by the central self-critic baseline [GPN]. The results of the ablation study presented in Table III demonstrate that all the components used in our method contribute to its good performance.
VII Conclusion
We introduced a novel deep RL approach for solving TSP that achieves state-of-the-art results in terms of gap to optimality and runtime compared to previous deep RL methods. The results are particularly promising in terms of generalizability. Our proposition consists of a simple deep learning architecture combining a graph neural network, a multilayer perceptron, and attention, in addition to a novel RL training procedure, which interleaves local search and policy gradient updates, uses a novel return evaluation, and exploits curriculum learning.
As future work, we plan to evaluate our novel training procedure with other deep learning architectures proposed for TSP, but also other combinatorial optimization problems. Another research direction to improve further our results is to optimize the deep learning architecture in order to improve the encoding of the TSP problem.
Acknowledgment
This work is supported in part by the Innovative Practice Program of Shanghai Jiao Tong University (IPP21141), the program of National Natural Science Foundation of China (No. 61872238), and the program of the Shanghai NSF (No. 19ZR1426700). Moreover, we thank the authors of GPN for sharing their source code, which served as an initial basis for our work.