Improving Generalization of Deep Reinforcement Learning-based TSP Solvers

10/06/2021 ∙ by Wenbin Ouyang, et al. ∙ 0

Recent work applying deep reinforcement learning (DRL) to solve traveling salesman problems (TSP) has shown that DRL-based solvers can be fast and competitive with TSP heuristics for small instances, but do not generalize well to larger instances. In this work, we propose a novel approach named MAGIC that includes a deep learning architecture and a DRL training method. Our architecture, which integrates a multilayer perceptron, a graph neural network, and an attention model, defines a stochastic policy that sequentially generates a TSP solution. Our training method includes several innovations: (1) we interleave DRL policy gradient updates with local search (using a new local search technique), (2) we use a novel simple baseline, and (3) we apply curriculum learning. Finally, we empirically demonstrate that MAGIC is superior to other DRL-based methods on random TSP instances, both in terms of performance and generalizability. Moreover, our method compares favorably against TSP heuristics and other state-of-the-art approaches in terms of performance and computational time.




I Introduction

The Traveling Salesman Problem (TSP) is one of the most famous combinatorial optimization problems. Given the coordinates of some points, the goal in the TSP is to find a shortest tour that visits each point exactly once and returns to the starting point. TSP is an NP-hard problem [PAPADIMITRIOU1977237], even in its symmetric 2D Euclidean version, which is this paper’s focus. Traditional approaches to solve TSP can be classified as exact or heuristic. Exact solvers, such as Concorde [CC] or solvers based on integer linear programming, can find an optimal solution. However, since TSP is NP-hard, such algorithms have computational times that increase exponentially with the size of a TSP instance. In contrast, heuristic approaches provide a TSP solution with a much shorter computational time than exact solvers, but do not guarantee optimality. These approaches are either constructive (e.g., farthest insertion [attention]), perturbative (e.g., 2-opt [2opt], LKH [LKH]), or hybrid. However, they may not provide any good performance guarantee and are still computationally costly. Indeed, even a quadratic computational complexity may become prohibitive when dealing with large TSP instances (e.g., 1000 cities).

Thus, recent research work has focused on using Deep Learning (DL) to design faster heuristics to solve TSP problems. Since training on large TSP instances is costly, generalization is a key factor in such DL-based approaches. They are either based on Supervised Learning (SL) [45283, joshi2019efficient, Fu] or Reinforcement Learning (RL) [DBLP:journals/corr/BelloPLNB16, attention, GPN, DBLP:journals/corr/abs-2004-01608]. These different approaches, which are either constructive, perturbative, or hybrid, have different pros and cons. For example, Fu’s model, which combines DL with Monte Carlo Tree Search (MCTS) [MCTS], has great generalization capabilities: it can be trained on small TSP instances and perform well on larger instances. However, the computational cost of Fu’s model is high due to MCTS. In contrast, other models (e.g., [joshi2019efficient, attention]) can solve small TSP instances quickly and with great performance, but they lack generalizability.

In this paper, we propose a novel deep RL approach that can achieve excellent performance with good generalizability for a reasonable computational cost. The contributions of this paper can be summarized as follows. Our approach is based on an encoder-decoder model (using a Graph Neural Network (GNN) [GNN] and a Multilayer Perceptron (MLP) [MLP] as the encoder and an attention mechanism [attention_mechanism] as the decoder), which is trained with a new deep RL method that interleaves policy gradient updates (with a simple baseline called the policy rollout baseline) and local search (with a novel combined local search technique). Moreover, curriculum learning is applied to help with training and generalization. We name our model MAGIC after the techniques it combines (MLP for M, Attention for A, GNN for G, Interleaved local search for I, and Curriculum Learning for C). Finally, we empirically show that MAGIC is a state-of-the-art deep RL solver for TSP, which offers a good trade-off in terms of performance, generalizability, and computational time.

This paper is structured as follows. Section II overviews related work. Section III recalls the necessary background. Section IV introduces our model architecture. Section V describes our novel training technique by explaining how we apply local search, the policy rollout baseline, and curriculum learning during training. Section VI presents the experimental results and Section VII concludes.

II Related Work

RL can be used as a constructive heuristic to generate a tour or as a machine learning method integrated in a traditional method, such as [DBLP:journals/corr/abs-2004-01608], which learns to apply 2-opt. For space reasons, we mainly discuss deep RL work in the constructive approach (see [Bai] for a more comprehensive survey), since it is the most related to our work. Besides, a recent work [joshi_learning_2020] suggests that RL training may lead to better generalization than supervised learning.

Such deep RL work started with the Pointer Network [45283], which was proposed as a general model that could solve an entire class of TSP instances. It has an encoder-decoder architecture, both based on recurrent neural networks, combined with an attention mechanism [Bahdanau2015]. The model is trained in a supervised way using solutions generated by Concorde [CC]. The results are promising, but the authors focused only on small-scale TSP instances (with up to 50 cities) and did not deal with generalization.

This approach was extended to the RL setting [DBLP:journals/corr/BelloPLNB16] and shown to scale to TSP with up to 100 cities. The RL training is based on an actor-critic scheme using tour lengths as unbiased estimates of the value of a policy.

In contrast to [DBLP:journals/corr/BelloPLNB16], a value-based deep RL [dai_learning_2017] was also investigated to solve graph combinatorial optimization problems in general and TSP in particular. The approach uses graph embeddings to represent partial solutions and RL to learn a greedy policy.

The Attention Model [attention] improves the Pointer Network [45283] notably by replacing the recurrent neural networks by attention models [attention_mechanism] and using RL training with a simple greedy rollout baseline. These changes allowed them to achieve better results on small-scale TSP instances, as well as to generalize to 100-city TSP instances. However, their model fails to generalize well to large-scale TSP (e.g., with 1000 cities) and their algorithm does not scale well in terms of memory usage.

A similar, although slightly more complex, approach is proposed in [deudon_learning_2018], which also suggests improving the tour returned by the deep RL policy with a 2-opt local search, making the overall combination a hybrid heuristic. In contrast to that work, we not only apply local search as a final improvement step, but also integrate local search in the training of our deep RL model. Moreover, we use a more sophisticated local search.

Moreover, the Graph Pointer Network (GPN) model [GPN] was proposed to improve over previous models by exploiting graph neural networks [GNN] and using a central self-critic baseline, which is a centered greedy rollout baseline. As in [deudon_learning_2018], 2-opt is also considered. As a result, they report good results when generalizing to large-scale TSP instances. Our simpler model and new training method outperform GPN on both small and larger TSP instances.

III Background

This section provides the necessary information to understand our model architecture (Section IV) and our training method (Section V). For any positive integer $N$, $[N]$ denotes the set $\{1, 2, \ldots, N\}$. Vectors and matrices are denoted in bold.

III-A Traveling Salesman Problem

A Traveling Salesperson Problem (TSP) can informally be stated as follows. Given $N$ cities, the goal in a TSP instance is to find a shortest tour that visits each city exactly once. Formally, the set of $N$ cities can be identified with the set $[N]$. In the symmetric 2D Euclidean version of the TSP problem, each city $i \in [N]$ is characterized by its 2D coordinates $\mathbf{x}_i \in \mathbb{R}^2$. Let $\mathcal{X} = \{\mathbf{x}_i \mid i \in [N]\}$ denote the set of city coordinates and $\mathbf{X} \in \mathbb{R}^{N \times 2}$ the matrix containing all these coordinates. The distance between two cities $i, j \in [N]$ is usually measured in terms of the L2-norm $\|\cdot\|_2$:

$$d(i, j) = \|\mathbf{x}_i - \mathbf{x}_j\|_2. \tag{1}$$

A feasible TSP solution, called a tour, corresponds to a permutation $\sigma$ over $[N]$. Its length is defined as:

$$L_\sigma(\mathbf{X}) = \sum_{t=1}^{N} \|\mathbf{x}_{\sigma(t)} - \mathbf{x}_{\sigma(t+1)}\|_2, \tag{2}$$

where for $t \in [N]$, $\sigma(t) \in [N]$ is the $t$-th city visited in the tour defined by $\sigma$, and by abuse of notation, $\sigma(N+1) = \sigma(1)$. Therefore, the TSP problem can be viewed as the following optimization problem:

$$\min_{\sigma} L_\sigma(\mathbf{X}). \tag{3}$$

Since scaling the city positions does not change the TSP solution, we assume in the remainder of the paper that the coordinates of all cities are in the square $[0, 1]^2$, as done in previous work [DBLP:journals/corr/BelloPLNB16, DBLP:journals/corr/abs-2004-01608, attention, GPN].
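As a small illustration of the tour length in (2), the following sketch computes the length of a closed tour from a list of 2D coordinates. The function name `tour_length`, the 0-based city indices, and the four-city example are assumptions made for illustration and do not come from the paper.

```python
import math

def tour_length(coords, tour):
    """Length of the closed tour: sum of Euclidean distances between
    consecutive cities, including the edge back to the first city."""
    n = len(tour)
    return sum(
        math.dist(coords[tour[t]], coords[tour[(t + 1) % n]])
        for t in range(n)
    )

# Four cities on the corners of the unit square (illustrative data).
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
perimeter_tour = [0, 1, 2, 3]  # visits the corners in order
crossing_tour = [0, 2, 1, 3]   # uses both diagonals

print(tour_length(square, perimeter_tour))
print(tour_length(square, crossing_tour))
```

The two tours visit the same cities, but the one that crosses the square along both diagonals is longer, which is the kind of difference the optimization problem (3) is about.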

Fig. 1: Architecture of the model used in MAGIC.

III-B Insertion heuristics and k-opt optimization for TSP

Since TSP is an NP-hard problem [PAPADIMITRIOU1977237], various heuristic techniques have been proposed to quickly compute a solution, which may however be sub-optimal. We recall two families of heuristics: insertion heuristics [attention] and k-opt [kopt].

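Insertion heuristics (including nearest, farthest, and random insertion) are constructive, i.e., they iteratively build a solution. They work as follows. They first randomly choose a starting city and repeatedly insert one new city at a time until obtaining a complete tour. Let $\hat{\sigma}$ denote a partial tour, i.e., a partial list of all cities. Different insertion heuristics follow different rules to choose a new city $c^*$: random insertion chooses a new city randomly; nearest insertion chooses according to:

$$c^* = \arg\min_{c \notin \hat{\sigma}} \min_{c' \in \hat{\sigma}} \|\mathbf{x}_c - \mathbf{x}_{c'}\|_2, \tag{4}$$

and farthest insertion chooses according to the following rule:

$$c^* = \arg\max_{c \notin \hat{\sigma}} \min_{c' \in \hat{\sigma}} \|\mathbf{x}_c - \mathbf{x}_{c'}\|_2, \tag{5}$$

where $c \notin \hat{\sigma}$ means city $c$ is not in the partial tour and $c' \in \hat{\sigma}$ means city $c'$ is in the partial tour. The position where city $c^*$ is inserted into $\hat{\sigma}$ is determined such that the length of the resulting partial tour is minimized.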
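The selection rules (4) and (5) and the insertion step can be sketched as follows. This is a minimal, unoptimized illustration: the function names, the `rule` and `start` parameters, and the fixed starting city (used here instead of a random one so that the example is deterministic) are assumptions, not part of the paper.

```python
import math

def closed_length(coords, tour):
    """Length of the closed (partial) tour, including the edge back to the start."""
    n = len(tour)
    return sum(math.dist(coords[tour[t]], coords[tour[(t + 1) % n]]) for t in range(n))

def insertion_tour(coords, rule="farthest", start=0):
    """Build a tour by nearest or farthest insertion."""
    pick = max if rule == "farthest" else min
    tour = [start]
    remaining = set(range(len(coords))) - {start}
    while remaining:
        # Score each unvisited city by its distance to the closest city
        # already in the partial tour, then take the max (farthest) or min (nearest).
        city = pick(
            sorted(remaining),
            key=lambda c: min(math.dist(coords[c], coords[v]) for v in tour),
        )
        # Try every insertion position and keep the shortest resulting partial tour.
        tour = min(
            (tour[:pos] + [city] + tour[pos:] for pos in range(1, len(tour) + 1)),
            key=lambda cand: closed_length(coords, cand),
        )
        remaining.remove(city)
    return tour

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(insertion_tour(square, rule="farthest"))
print(insertion_tour(square, rule="nearest"))
```

Recomputing the full tour length for every candidate position keeps the sketch short; a practical implementation would only evaluate the change in length caused by the insertion.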

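A classic local search heuristic is $k$-opt, which aims to improve an existing tour by swapping $k$ chosen edges at each iteration. The simplest one is 2-opt, which can replace $\sigma = (\sigma(1), \ldots, \sigma(i), \ldots, \sigma(j), \ldots, \sigma(N))$ by $\sigma' = (\sigma(1), \ldots, \sigma(i-1), \sigma(j), \sigma(j-1), \ldots, \sigma(i), \sigma(j+1), \ldots, \sigma(N))$ where $i < j$ if $L_{\sigma'}(\mathbf{X}) < L_\sigma(\mathbf{X})$. This kind of improvement can be found in different ways. For instance, traditional 2-opt may examine all pairs of edges, while random 2-opt examines randomly selected pairs. LKH [LKH] is one algorithm that applies $k$-opt and achieves nearly optimal results. However, LKH has a long run time, especially for large-scale TSP problems.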
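A random 2-opt pass as described above can be sketched as follows. The function names, the `num_trials` parameter, and the use of a seeded `random.Random` instance are assumptions for illustration; the sketch recomputes full tour lengths for clarity rather than speed.

```python
import math
import random

def tour_length(coords, tour):
    n = len(tour)
    return sum(math.dist(coords[tour[t]], coords[tour[(t + 1) % n]]) for t in range(n))

def random_two_opt(coords, tour, num_trials, rng):
    """Try `num_trials` random 2-opt moves; keep a move only if it shortens the tour."""
    tour = list(tour)
    n = len(tour)
    for _ in range(num_trials):
        i, j = sorted(rng.sample(range(n), 2))
        # Reversing the segment tour[i..j] replaces two edges of the tour by two others.
        candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
        if tour_length(coords, candidate) < tour_length(coords, tour):
            tour = candidate
    return tour

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
start = [0, 2, 1, 3]  # crosses the square along both diagonals
improved = random_two_opt(square, start, num_trials=200, rng=random.Random(0))
print(tour_length(square, start), tour_length(square, improved))
```

Because a move is accepted only when it strictly shortens the tour, the returned tour is never longer than the input.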

IV Model and Architecture

RL can be used as a constructive method to iteratively generate a complete tour: at each iteration $t \in [N]$, a new city $\sigma(t)$ with coordinates $\mathbf{x}_{\sigma(t)}$ is selected based on the list of previously selected cities and the description of the TSP instance. Formally, this RL model is defined as follows. A state $s_t$ is composed of the TSP description and the sequence of already visited cities $(\sigma(1), \ldots, \sigma(t-1))$ at time step $t$. State $s_1$ denotes the initial state where no city has been selected yet and state $s_{N+1}$ represents the state where the whole tour has already been constructed. An action $a_t \in [N]$ corresponds to the next city to be visited, i.e., $a_t = \sigma(t)$. This RL problem corresponds to a repeated $N$-horizon sequential decision-making problem where the action set for any time step depends on the current state and only contains the cities that have not been visited yet. The immediate reward for performing an action $a_t$ in a state $s_t$ is given as the negative length between the last visited city and the next chosen one:

$$r(s_t, a_t) = -\|\mathbf{x}_{\sigma(t)} - \mathbf{x}_{\sigma(t-1)}\|_2. \tag{6}$$

After choosing the first city, no reward can be computed yet. After the last city, a final additional reward is provided, given by $-\|\mathbf{x}_{\sigma(N)} - \mathbf{x}_{\sigma(1)}\|_2$. Thus, a complete trajectory corresponds to a tour and the return of a trajectory is equal to the negative of the length of that tour. Most RL-based constructive solvers are based on this RL formulation. In Section V, we change the return provided to the RL agent to improve its performance using local search.
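To make the link between step rewards and tour length concrete, the sketch below lists the rewards collected while a tour is built city by city and sums them into the return. The function name `step_rewards`, the 0-based indices, and the unit-square example are illustrative assumptions.

```python
import math

def step_rewards(coords, tour):
    """Rewards collected while building `tour` one city at a time."""
    rewards = [0.0]  # after the first city, no reward can be computed yet
    for t in range(1, len(tour)):
        # Negative distance between the newly chosen city and the previous one.
        rewards.append(-math.dist(coords[tour[t]], coords[tour[t - 1]]))
    # Final additional reward: closing edge from the last city back to the first.
    rewards.append(-math.dist(coords[tour[-1]], coords[tour[0]]))
    return rewards

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
rewards = step_rewards(square, [0, 1, 2, 3])
episode_return = sum(rewards)
print(rewards, episode_return)
```

For this example the return is the negative of the perimeter of the unit square, i.e., the negative tour length.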

To perform the selection of the next city, we propose the MAGIC architecture (see Fig. 1), which corresponds to a stochastic policy (see Section V for more details). It is composed of three parts: (A) an encoder implemented with a graph neural network (GNN) [GNN] and a multilayer perceptron (MLP), (B) a decoder based on an attention mechanism [attention_mechanism], and (C) a sampling process.

IV-A Encoder

When solving a TSP problem, not only should the last selected city be considered, but the whole city list should also be taken into account as background information. Since the information contained in 2D coordinates is limited and does not include the topology of the cities, we leverage a GNN and an MLP to encode city coordinates into a higher dimensional space, as depicted in Fig. 1. The GNN is used to encode the city coordinates $\mathbf{X} \in \mathbb{R}^{N \times 2}$ into an embedding in $\mathbb{R}^{N \times H}$, where $H$ is the dimension of the embedding space. The MLP is used to encode the last selected city $\mathbf{x}_{\sigma(t-1)} \in \mathbb{R}^2$ at iteration $t$ into an embedding in $\mathbb{R}^{H}$. Therefore, generally speaking, the GNN and MLP in MAGIC can be viewed as two functions:

$$\mathrm{GNN}: \mathbb{R}^{N \times 2} \to \mathbb{R}^{N \times H}, \qquad \mathrm{MLP}: \mathbb{R}^{2} \to \mathbb{R}^{H}. \tag{7}$$

Fig. 2: Detailed Architecture of GNN.

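IV-A1 GNN

GNN is a technique that can embed all nodes in a graph together. Similarly to the GPN model [GPN], we use a GNN to encode the whole city list of a TSP instance. Fig. 2 shows the detailed architecture of the GNN used in MAGIC. After $\mathbf{X}$ is transformed into an embedding $\mathbf{X}^{(0)} \in \mathbb{R}^{N \times H}$, $\mathbf{X}^{(0)}$ goes through $n_{GNN}$ layers of GNN. Each layer of GNN can be expressed as

$$\mathbf{X}^{(l)} = \gamma\, \mathbf{X}^{(l-1)} \mathbf{\Theta}^{(l)} + (1 - \gamma)\, F^{(l)}\big(\mathbf{X}^{(l-1)}\big), \tag{8}$$

where $\mathbf{X}^{(l-1)}$ is the input of the $l$-th layer of the GNN for $l \in [n_{GNN}]$, $\mathbf{\Theta}^{(l)}$ is an $H \times H$ learnable matrix, which is represented by a neural network, $F^{(l)}$ is the aggregation function [GNN], and $\gamma$ is a trainable parameter.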

IV-A2 MLP

While the GNN provides us with general information about the whole city list $\mathbf{X}$, we also need to encode the last selected city $\mathbf{x}_{\sigma(t-1)}$. In contrast to previous work using complex architectures like GNN or LSTM [LSTM], we simply use an MLP. Using a GNN would make the embedding of the last selected city depend on the whole city list, including the already-visited cities, while using an LSTM would make the embedding depend on the order of visited cities, which is in fact irrelevant.

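IV-B Decoder

The decoder of the MAGIC model is based on an attention mechanism, which was also used in several previous studies [DBLP:journals/corr/BelloPLNB16, attention, GPN, Fu]. The output of the decoder is a pointer vector $\mathbf{u}^{(t)} \in \mathbb{R}^{N}$ [DBLP:journals/corr/BelloPLNB16], which can be expressed as:

$$u_j^{(t)} = \begin{cases} \tanh\big(\mathbf{R}_j \mathbf{W}_r + \mathbf{q} \mathbf{W}_q\big)\, \mathbf{w} & \text{if city } j \text{ has not been visited}, \\ -\infty & \text{otherwise}, \end{cases} \tag{9}$$

where $u_j^{(t)}$ is the $j$-th entry of the vector $\mathbf{u}^{(t)}$, $\mathbf{R}_j$ is the $j$-th row of the matrix $\mathbf{R}$, $\mathbf{W}_r$ and $\mathbf{W}_q$ are trainable matrices with shape $H \times H$, and $\mathbf{w} \in \mathbb{R}^{H}$ is a trainable weight vector. Here $\mathbf{R} \in \mathbb{R}^{N \times H}$ and $\mathbf{q} \in \mathbb{R}^{H}$ denote the outputs of the GNN and of the MLP, respectively; please refer to Fig. 1.

A softmax transformation is used to turn $\mathbf{u}^{(t)}$ into a probability distribution $\mathbf{p}^{(t)}$ over cities:

$$p_j^{(t)} = \frac{\exp\big(u_j^{(t)}\big)}{\sum_{k=1}^{N} \exp\big(u_k^{(t)}\big)}, \tag{10}$$

where $p_j^{(t)}$ is the $j$-th entry of the probability distribution $\mathbf{p}^{(t)}$ and $u_j^{(t)}$ is the $j$-th entry of the vector $\mathbf{u}^{(t)}$. Notice that if city $j$ is visited, then $u_j^{(t)} = -\infty$ due to (9). Under this circumstance, $p_j^{(t)} = 0$ according to (10). That is to say, visited cities cannot be visited again.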

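IV-C Sampling Process

After we obtain the probability distribution $\mathbf{p}^{(t)}$, it is trivial to select the next city. Indeed, $\mathbf{p}^{(t)}$ corresponds to the RL policy $\pi_{\boldsymbol{\theta}}$ at time step $t$:

$$\pi_{\boldsymbol{\theta}}(a_t = j \mid s_t) = p_j^{(t)}, \tag{11}$$

where $s_t$ (resp. $a_t$) is the state (resp. action) at time step $t$ and $p_j^{(t)}$ is the probability of choosing $j$ as the $t$-th city. Therefore, we just need to sample the next city according to the probability distribution $\mathbf{p}^{(t)}$.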
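The masking and sampling steps described in the decoder and sampling subsections can be sketched as below: visited cities receive a score of minus infinity, the softmax then gives them probability zero, and the next city is sampled from the resulting distribution. The function names and the toy scores are assumptions, and the sketch assumes that at least one city is still unvisited.

```python
import math
import random

def masked_softmax(scores, visited):
    """Softmax over pointer scores in which visited cities are masked with -inf."""
    masked = [-math.inf if j in visited else s for j, s in enumerate(scores)]
    top = max(masked)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(weights)
    return [w / total for w in weights]

def sample_next_city(scores, visited, rng):
    """Sample the next city index according to the masked softmax distribution."""
    probs = masked_softmax(scores, visited)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

scores = [0.0, 1.0, 2.0, 3.0]  # toy pointer scores for four cities
visited = {1, 3}
probs = masked_softmax(scores, visited)
print(probs)
print(sample_next_city(scores, visited, random.Random(0)))
```

Since visited cities have probability exactly zero, sampling can only return a city that has not been visited yet.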

V Algorithm and Training

For the training of MAGIC, we propose to interleave standard policy gradient updates with local search. In contrast to previous work, our idea is to learn a policy that generates tours that can be easily improved with local search. Next, we explain our local search technique, which includes a novel local insertion-based heuristic. Then, we present how policy gradient with a simple policy rollout baseline can be applied. Finally, we motivate the use of a stochastic curriculum learning method in our setting.

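Input: A set of city coordinates $\mathbf{X}$, current tour $\sigma$, and the search range hyperparameter
Output: An improved tour $\sigma$
for $t = 1$ to $N$ do
    $\sigma' \leftarrow$ the shortest tour obtained by moving the city at position $t$ of $\sigma$ to a position within the search range of $t$
    $\sigma \leftarrow \sigma'$
end for
return $\sigma$
Algorithm 1 Local Insertion Optimization Algorithm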

V-A Local search

We describe the local search technique that we use for training our model and to improve the tour output by the RL policy. Our technique uses two local search heuristics in combination: random 2-opt and a local insertion heuristic, which is novel to the best of our knowledge. The two heuristics have been chosen and designed to be computationally efficient, which is important since we apply them during RL training. The motivation for combining two heuristics is that when one method gets stuck in some local minimum, the other method may help escape from it.

For random 2-opt, we randomly pick 2 arcs for improvement and repeat this a number of times that depends on the number of cities and on two hyperparameters. These two hyperparameters give flexible control over the strength of this local search, which can be made stronger if needed for larger TSP problems. With this procedure, random 2-opt can be much faster than traditional 2-opt.

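Inspired by the insertion heuristics, we propose local insertion optimization. Let $\sigma$ be the current tour and, for $t, t' \in [N]$, let $\sigma_{t,t'}$ be the tour obtained from $\sigma$ by moving the city at position $t$ to position $t'$ if $t \neq t'$, and $\sigma_{t,t'} = \sigma$ otherwise. This method (see Algorithm 1) iterates through all indices $t \in [N]$ and, for each index $t$, finds the shortest tour $\sigma'$ among the tours $\sigma_{t,t'}$ whose position $t'$ lies within a search range of $t$ set by a hyperparameter, and then replaces $\sigma$ by $\sigma'$. The rationale for restricting the optimization with this hyperparameter is as follows. For a good suboptimal tour $\sigma$, cities that are close in terms of visit order in $\sigma$ are usually also close in terms of distance. In that case, $\sigma_{t,t'}$ is unlikely to improve over $\sigma$ when $t$ and $t'$ are far apart. Thus, we limit the search range to increase the computational efficiency of this heuristic.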
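The local insertion idea can be sketched as follows, under the reading that each city is moved to the best position at most `window` places away from its current position. The name `window` stands in for the search range hyperparameter and, like the function names and the 0-based indices, is an assumption made for illustration.

```python
import math

def tour_length(coords, tour):
    n = len(tour)
    return sum(math.dist(coords[tour[t]], coords[tour[(t + 1) % n]]) for t in range(n))

def local_insertion(coords, tour, window):
    """For each position t, move the city at t to the best position within `window` of t."""
    tour = list(tour)
    n = len(tour)
    for t in range(n):
        best, best_len = tour, tour_length(coords, tour)
        for t_new in range(max(0, t - window), min(n - 1, t + window) + 1):
            if t_new == t:
                continue
            # Remove the city at position t and reinsert it at position t_new.
            candidate = tour[:t] + tour[t + 1:]
            candidate.insert(t_new, tour[t])
            cand_len = tour_length(coords, candidate)
            if cand_len < best_len:
                best, best_len = candidate, cand_len
        tour = best  # keep the best tour found for this position
    return tour

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
improved = local_insertion(square, [0, 2, 1, 3], window=1)
print(improved, tour_length(square, improved))
```

With a window of size `w`, each of the `N` positions examines at most `2w` candidate moves, so the number of candidates grows linearly in `N` rather than quadratically.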

We call our local search technique combined local search (see Algorithm 2), which applies random 2-opt followed by local insertion optimization repeatedly, for a number of rounds set by a hyperparameter.

Input: A set of city coordinates $\mathbf{X}$, current tour $\sigma$, and the hyperparameters for local search
Output: An improved tour $\sigma$
for each local search round do
    for each random 2-opt repetition do
        $\sigma \leftarrow$ apply random 2-opt on $\sigma$
    end for
    $\sigma \leftarrow$ apply Local Insertion Optimization on $\sigma$
end for
return $\sigma$
Algorithm 2 Combined Local Search Algorithm

V-B Interleaved RL training with the policy rollout baseline

Our model is trained with the REINFORCE [Williams92simplestatistical] algorithm. The novelty is that we interleave local search with the policy gradient updates. When the current policy $\pi_{\boldsymbol{\theta}}$ outputs a tour $\sigma$, this solution is further improved with our combined local search technique to obtain a new tour $\sigma_+$. In contrast to previous work, this tour $\sigma_+$ instead of $\sigma$ is used to evaluate policy $\pi_{\boldsymbol{\theta}}$. The rationale for this procedure is to make the RL policy and local search work in synergy by favoring learning policies that generate tours that can be easily improved by the combined local search. If the RL training and local search are not coordinated, as in previous work, then a trained policy may generate tours that are hard to improve by local search.

V-B1 Policy Gradient

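We first recall standard policy gradient and then explain how we modify it. With the reward function in (6), the RL goal would be to find $\boldsymbol{\theta}^*$ such that

$$\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \arg\max_{\boldsymbol{\theta}} \mathbb{E}_{\tau \sim p_{\boldsymbol{\theta}}}\Big[\sum_{t} r_t\Big], \tag{12}$$

where for $t \in [N]$, $r_t = r(s_t, a_t)$, $\tau = (s_1, a_1, s_2, a_2, \ldots)$ is a trajectory, and $p_{\boldsymbol{\theta}}$ is the probability distribution over tours induced by policy $\pi_{\boldsymbol{\theta}}$. Recall the gradient of $J(\boldsymbol{\theta})$ [Williams92simplestatistical] is:

$$\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \mathbb{E}_{\tau}\Big[\Big(\sum_{t} \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a_t \mid s_t)\Big)\Big(\sum_{t} r_t\Big)\Big], \tag{13}$$

where $\mathbb{E}_{\tau}$ stands for $\mathbb{E}_{\tau \sim p_{\boldsymbol{\theta}}}$. For a large enough batch of trajectories, (13) is approximated with the empirical mean:

$$\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) \approx \hat{\mathbb{E}}_{B}\Big[\Big(\sum_{t} \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a_t^{(b)} \mid s_t^{(b)})\Big)\Big(\sum_{t} r_t^{(b)}\Big)\Big], \tag{14}$$

where $B$ is the batch size, $s_t^{(b)}$ (resp. $a_t^{(b)}$) is the state (resp. action) at time step $t$ of the $b$-th trajectory generated by $\pi_{\boldsymbol{\theta}}$, and $\hat{\mathbb{E}}_{B}$ denotes the empirical mean operation. Then the policy gradient in (13) can be approximated by:

$$\hat{\mathbb{E}}_{B}\Big[\Big(\sum_{t} \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a_t^{(b)} \mid s_t^{(b)})\Big)\Big(-L_{\sigma^{(b)}}(\mathbf{X}^{(b)})\Big)\Big]. \tag{15}$$

Instead of updating $\boldsymbol{\theta}$ with this policy gradient, in our interleaved training, we use:

$$\hat{\mathbb{E}}_{B}\Big[\Big(\sum_{t} \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a_t^{(b)} \mid s_t^{(b)})\Big)\Big(-L_{\sigma_+^{(b)}}(\mathbf{X}^{(b)})\Big)\Big], \tag{16}$$

where $\sigma_+^{(b)}$ is the improved tour obtained by applying our combined local search to $\sigma^{(b)}$, the tour induced by trajectory $\tau^{(b)}$. By construction, $L_{\sigma_+^{(b)}}(\mathbf{X}^{(b)}) \le L_{\sigma^{(b)}}(\mathbf{X}^{(b)})$.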

V-B2 Policy rollout baseline

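In order to reduce the variance of the policy gradient estimate (16), we use a simple baseline and update $\boldsymbol{\theta}$ in the following direction:

$$\hat{\mathbb{E}}_{B}\Big[\Big(\sum_{t} \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a_t^{(b)} \mid s_t^{(b)})\Big)\Big(-\big(L_{\sigma_+^{(b)}}(\mathbf{X}^{(b)}) - L_{\sigma^{(b)}}(\mathbf{X}^{(b)})\big)\Big)\Big], \tag{17}$$

where $\sigma^{(b)}$ is the tour generated by the policy, $\sigma_+^{(b)}$ is the tour improved by local search, and $L_{\sigma^{(b)}}(\mathbf{X}^{(b)})$ is the baseline, which we call the policy rollout baseline. Such a baseline gives more weight in the policy gradient when local search can make more improvement. In our experiments, our baseline performs better than the previous greedy baselines [attention, GPN] in our training process. One other nice feature of our baseline is that it does not incur any extra computation since $L_{\sigma^{(b)}}(\mathbf{X}^{(b)})$ is already computed when policy $\pi_{\boldsymbol{\theta}}$ generates $\sigma^{(b)}$.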
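The effect of the policy rollout baseline on the per-trajectory weights can be sketched numerically. In this illustration, each trajectory's weight is the return of the improved tour minus the return of the tour before local search; the function name and the example tour lengths are made up for illustration and are not results from the paper.

```python
def policy_rollout_weights(lengths_before, lengths_after):
    """Weight multiplying the summed grad-log-probabilities of each trajectory."""
    weights = []
    for before, after in zip(lengths_before, lengths_after):
        improved_return = -after  # return evaluated on the tour after local search
        baseline = -before        # policy rollout baseline: return of the raw policy tour
        weights.append(improved_return - baseline)
    return weights

# Illustrative tour lengths before and after local search for a batch of three tours.
lengths_before = [5.0, 6.0, 4.5]
lengths_after = [4.5, 6.0, 4.0]
weights = policy_rollout_weights(lengths_before, lengths_after)
print(weights)  # [0.5, 0.0, 0.5]
```

A tour that local search cannot improve gets weight zero, while tours that local search shortens get a positive weight proportional to the improvement.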

V-C Stochastic Curriculum Learning

Curriculum Learning (CL) is a widely-used technique in machine learning (and RL) [CL_survey], which can facilitate learning and improve generalization. Although it can be implemented in various ways, its basic principle is to control the increasing difficulty of the training instances.

To train MAGIC, we propose a stochastic CL technique where the probability of choosing harder instances increases over training steps. We choose the number of cities as a measure of difficulty for a TSP instance, which is assumed to be in $\{10, 11, \ldots, 50\}$ in our experiments. We explain next how this selection probability is defined.

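For epoch $e$, we define the vector $\mathbf{g}^{(e)} \in \mathbb{R}^{41}$ (since there are 41 integers between 10 and 50), whose entries are given by a normal density whose standard deviation is a hyperparameter. Then, we use a softmax to formulate the probability distribution $\mathbf{p}^{(e)}$ of this epoch:

$$\mathbf{p}^{(e)} = \mathrm{softmax}\big(\mathbf{g}^{(e)}\big), \tag{19}$$

where $\mathbf{p}^{(e)} \in \mathbb{R}^{41}$, and each entry of $\mathbf{p}^{(e)}$ represents the probability of choosing TSP instances with the corresponding number of cities at epoch $e$.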
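A stochastic curriculum of this kind can be sketched as follows. The scores are values of a normal density over the 41 candidate sizes, turned into probabilities with a softmax. How the centre of the density depends on the epoch is not recoverable from the text here, so the linear schedule below, together with the function names, the 0-based epoch index, and the value of `sigma`, is purely an assumption for illustration.

```python
import math
import random

SIZES = list(range(10, 51))  # the 41 candidate numbers of cities

def curriculum_probs(epoch, num_epochs, sigma):
    """Softmax of normal-density scores over the candidate TSP sizes."""
    # ASSUMPTION: the density is centred on a size that grows linearly with the epoch.
    center = SIZES[0] + (SIZES[-1] - SIZES[0]) * epoch / max(1, num_epochs - 1)
    scores = [
        math.exp(-0.5 * ((n - center) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
        for n in SIZES
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def sample_size(epoch, num_epochs, sigma, rng):
    """Sample the number of cities used for training at a given epoch."""
    probs = curriculum_probs(epoch, num_epochs, sigma)
    return rng.choices(SIZES, weights=probs, k=1)[0]

first = curriculum_probs(0, 200, sigma=3.0)
last = curriculum_probs(199, 200, sigma=3.0)
print(sample_size(0, 200, 3.0, random.Random(0)))
```

Under this assumed schedule, the most likely size moves from 10 cities in the first epoch to 50 cities in the last one, while every size keeps a nonzero probability at every epoch.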

V-D Overall training process

In this part, we summarize our training process by providing the corresponding pseudo code in Algorithm 3.

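1:  Input: Total number of epochs $E$, training steps per epoch $S$, batch size $B$, and the hyperparameters for local search
2:  Initialize $\boldsymbol{\theta}$
3:  for $e = 1$ to $E$ do
4:      $N \leftarrow$ Sample from $\mathbf{p}^{(e)}$ according to Stochastic CL
5:     for $s = 1$ to $S$ do
6:         $\mathbf{X} \leftarrow$ Random TSP instances with $N$ cities
7:         $\sigma \leftarrow$ Apply $\pi_{\boldsymbol{\theta}}$ on $\mathbf{X}$
8:         $\sigma_+ \leftarrow$ Apply the combined local search on $\sigma$
9:        Compute gradient in (17) using $\sigma$ and $\sigma_+$
10:         Update $\boldsymbol{\theta}$ in the direction of this gradient
11:     end for
12:  end for
Algorithm 3 REINFORCE with Stochastic CL, Policy Rollout Baseline and Combined Local Search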

Notice that line 7 can be replaced by any model that can be used to generate a tour, showing that a variety of models can fit in our training process to improve their performance for TSP problems.

VI Experiments

| Method | Type | TSP20 Length | TSP20 Gap | TSP20 Time (s) | TSP50 Length | TSP50 Gap | TSP50 Time (s) | TSP100 Length | TSP100 Gap | TSP100 Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| Concorde | Exact Solver | 3.830 | 0.00% | 138.6 | 5.691 | 0.00% | 820.8 | 7.761 | 0.00% | 3744 |
| Gurobi | Exact Solver | 3.830 | 0.00% | 139.8 | 5.691 | 0.00% | 1572 | 7.761 | 0.00% | 12852 |
| 2-opt | Heuristic | 4.082 | 6.56% | 0.33 | 6.444 | 13.24% | 2.25 | 9.100 | 17.26% | 9.32 |
| Random Insertion | Heuristic | 4.005 | 4.57% | 196 | 6.128 | 7.69% | 502.2 | 8.511 | 9.66% | 1039 |
| Nearest Insertion | Heuristic | 4.332 | 13.10% | 229.8 | 6.780 | 19.14% | 633 | 9.462 | 21.92% | 1289 |
| Farthest Insertion | Heuristic | 3.932 | 2.64% | 239.8 | 6.010 | 5.62% | 617 | 8.360 | 7.71% | 1261 |
| GCN (Joshi et al.) | SL (Greedy) | 3.855 | 0.65% | 19.4 | 5.893 | 3.56% | 120 | 8.413 | 8.40% | 664.8 |
| Att-GCRN+MCTS (Fu et al.) | SL+MCTS | 3.830 | 0.00% | 98.3 | 5.691 | 0.01% | 475.2 | 7.764 | 0.04% | 873.6 |
| GAT (Deudon et al.) | RL (Sampling) | 3.874 | 1.14% | 618 | 6.109 | 7.34% | 1171 | 8.837 | 13.87% | 2867 |
| GAT (Kool et al.) | RL (Greedy) | 3.841 | 0.29% | 6.03 | 5.785 | 1.66% | 34.9 | 8.101 | 4.38% | 109.8 |
| GAT (Kool et al.) | RL (Sampling) | 3.832 | 0.05% | 988.2 | 5.719 | 0.49% | 1371 | 7.974 | 2.74% | 4428 |
| GPN (Ma et al.) | RL | 4.074 | 6.35% | 0.77 | 6.059 | 6.47% | 2.50 | 8.885 | 14.49% | 6.23 |
| MAGIC (Ours) | RL (Local Search) | 3.870 | 1.09% | 3.06 | 5.918 | 4.00% | 14.8 | 8.256 | 6.39% | 50.4 |

  • * refers to methods whose results we directly use from others’ papers.

TABLE I: Results and comparisons on small TSP cases, obtained by testing on 10,000 instances for TSP 20, 50 and 100

| Method | Type | TSP200 Length | TSP200 Gap | TSP200 Time (s) | TSP500 Length | TSP500 Gap | TSP500 Time (s) | TSP1000 Length | TSP1000 Gap | TSP1000 Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| Concorde | Solver | 10.719 | 0.00% | 206.4 | 16.546 | 0.00% | 2260 | 23.118 | 0.00% | 23940 |
| Gurobi | Solver | - | - | - | - | - | - | - | - | - |
| 2-opt | Heuristic | 12.841 | 19.80% | 34.0 | 20.436 | 23.51% | 201.7 | 28.950 | 25.23% | 826.2 |
| Random Insertion | Heuristic | 11.842 | 10.47% | 27.1 | 18.588 | 12.34% | 68.3 | 26.118 | 12.98% | 137.0 |
| Nearest Insertion | Heuristic | 13.188 | 23.03% | 28.8 | 20.614 | 24.59% | 79.8 | 28.971 | 25.32% | 176.6 |
| Farthest Insertion | Heuristic | 11.644 | 8.63% | 33.0 | 18.306 | 10.64% | 84.0 | 25.743 | 11.35% | 175.5 |
| GCN (Joshi et al.) | SL (Greedy) | 17.014 | 58.73% | 59.1 | 29.717 | 79.61% | 400.2 | 48.615 | 110.29% | 1711 |
| Att-GCRN+MCTS (Fu et al.) | SL+MCTS | 10.814 | 0.88% | 149.6 | 16.966 | 2.54% | 354.6 | 23.863 | 3.22% | 748.3 |
| GAT (Deudon et al.) | RL (Sampling) | 13.175 | 22.91% | 290.4 | 28.629 | 73.03% | 1211 | 50.302 | 117.59% | 2262 |
| GAT (Kool et al.) | RL (Greedy) | 11.610 | 8.31% | 5.03 | 20.019 | 20.99% | 90.6 | 31.153 | 34.75% | 190.8 |
| GAT (Kool et al.) | RL (Sampling) | 11.450 | 6.82% | 269.4 | 22.641 | 36.84% | 938.4 | 42.804 | 85.15% | 3838 |
| GPN (Ma et al.) | RL | - | - | - | 19.605 | 18.49% | - | 28.471 | 23.15% | - |
| GPN+2opt (Ma et al.) | RL+2opt | - | - | - | 18.358 | 10.95% | - | 26.129 | 13.02% | - |
| GPN (Ma et al.) | RL | 13.278 | 23.87% | 2.5 | 23.639 | 42.87% | 7.13 | 37.849 | 63.72% | 18.35 |
| MAGIC (Ours) | RL (Local Search) | 11.539 | 7.65% | 69.9 | 18.098 | 9.38% | 207.8 | 25.542 | 10.49% | 487.8 |

  • * refers to methods whose results we directly use from others’ papers.

TABLE II: Results and comparisons on large TSP cases, obtained by testing on 128 instances for TSP 200, 500 and 1000

| Instances | Full version Length | Full version Gap | w/o RL Length | w/o RL Gap | w/o CL Length | w/o CL Gap | w/o baseline Length | w/o baseline Gap | w/o local search Length | w/o local search Gap |
|---|---|---|---|---|---|---|---|---|---|---|
| TSP20 | 3.871 | 1.07% | 3.9556 | 3.27% | 3.917 | 2.27% | 3.911 | 2.21% | 3.988 | 4.10% |
| TSP50 | 5.957 | 4.69% | 6.1391 | 7.88% | 5.959 | 4.72% | 5.983 | 5.15% | 5.962 | 4.76% |
| TSP100 | 8.302 | 6.97% | 8.5419 | 10.06% | 8.343 | 7.51% | 8.395 | 8.17% | 8.331 | 7.35% |
| TSP200 | 11.567 | 7.91% | 11.9299 | 11.30% | 11.682 | 8.99% | 11.842 | 10.48% | 11.631 | 8.50% |
| TSP500 | 18.321 | 10.73% | 18.9036 | 14.25% | 18.332 | 10.80% | 18.516 | 11.91% | 18.526 | 11.97% |
| TSP1000 | 25.854 | 11.84% | 26.9936 | 16.76% | 25.954 | 12.27% | 26.188 | 13.28% | 26.505 | 14.65% |

TABLE III: Ablation study on RL, CL, the policy rollout baseline and the combined local search. Testing on 10,000 instances for TSP 20, 50 and 100, and 128 instances for TSP 200, 500 and 1000.

To demonstrate the performance and generalization ability of our proposed training process, we evaluate our method on randomly generated TSP instances and compare it with other existing algorithms, which cover different kinds of methods for completeness and include two exact solvers, four traditional heuristics, and seven learning-based ones. If the hardware and experiment settings of other papers are the same as ours, we directly use their reported results on time and performance. To ensure a fair comparison of the runtimes and performances, all algorithms are executed on a computer with an Intel(R) Xeon(R) CPU E5-2678 v3 and a single 1080Ti GPU, and parallel computation is utilized as much as possible for all the algorithms. Moreover, to show the power of RL, CL, our policy rollout baseline, and the combined local search, we also carry out an ablation study of those four components.

VI-A Data sets and hyperparameters

We denote by TSP$N$ the set of random TSP instances with $N$ cities. For the sake of consistency with other research work, the coordinates of cities are sampled uniformly at random from $[0, 1]^2$. For training, the TSP size varies from 10 to 50, as decided by CL in every epoch. After training the model according to our training process, we test it on both small and large TSP instances. For small TSP problems, we test on 10,000 instances each for TSP 20, 50 and 100. For large TSP problems, we test on 128 instances each for TSP 200, 500 and 1000 to assess the generalization ability.

For the hyperparameters of the training, we train for 200 epochs, and we process 1000 batches of 128 instances every epoch. The learning rate is initially set to 0.001 with a learning rate decay of 0.96. The hyperparameters of local search were set after a quick hyperparameter search. Those settings aim to train a model with fast speed and generalization ability.

For the model architecture, the aggregation function used in the GNN is represented by a neural network followed by a ReLU function on each entry of the output. Our MLP has an input layer with dimension 2, two hidden layers, and an output layer with dimension $H$. Layers are fully connected and we use ReLU as the activation function.

VI-B Performance on small-scale and large-scale TSP instances

The results are shown in Tables I and II. Columns 1 and 2 respectively specify the method and its type, where SL refers to supervised learning, Greedy means a greedy construction from the probability given by the policy, and Sampling refers to sampling multiple solutions from the probability given by the policy and choosing the best one. Column 3 indicates the average tour length, Column 4 provides the gap to Concorde’s performance, which corresponds to the optimal solution for the small-scale TSP problems and a nearly optimal solution for the large-scale TSP problems, and Column 5 lists the total runtime. For comparison, we list 12 other methods, ranging from exact solvers and heuristics to learning-based algorithms.

As shown in Table I, the exact solvers have the best performance but relatively long runtimes; most of the learning-based methods, including ours, obtain shorter tours than the heuristics. Within the learning based methods, most methods are not more than better than ours. For TSP 100, only Fu’s model and the methods that apply sampling, all of which search for the best tour, obtain better tours than ours. In terms of speed, our method is fast, and this is more prominent as the size grows. The learning-based methods with better solutions than ours all run slower, and only GPN is faster, but with a significantly worse solution than ours. For the results in Table II, the solver again gives the best solution while its speed is relatively slow; many learning-based methods now expose their poor generalization ability and give worse results than heuristics. Among the heuristics, insertion methods show good performance on large TSP problems. Our method outperforms all the learning-based models except for Fu’s on TSP 500 and 1000, showing a very good generalization ability. In addition, our runtime is generally short, and in particular shorter than Fu’s.

Notice that heuristics have a good generalization ability and previous learning-based algorithms do well on small TSP problems. Our learning-based method combined with local search, which is inspired by the heuristics, tends to inherit the advantages of both learning-based methods and heuristics. It is also fast in terms of runtime, making it a well-rounded method.

VI-C The ablation study

To demonstrate the importance of RL, CL, the policy rollout baseline, and the combined local search in the training process, we perform an ablation study on them. For this study, since we only need to show the importance of each technique we apply, we turn down the hyper-parameter to have a shorter runtime. For the ablation study on RL, we do not perform any learning and directly apply the combined local search on randomly generated tours. For the ablation study on CL, we follow the same settings except that the TSP size is fixed to be 50 for all epochs. For the ablation study on the policy rollout baseline, we use instead the central self-critic baseline [GPN], which is inspired by the self-critic training [self-critics] and the greedy self-critic baseline [attention]. For the ablation study on the combined local search, we do not perform any local search during training, but we still apply it in testing in order to show that interleaving local search with policy gradient updates outperforms only doing post-optimization. Note that the policy rollout baseline depends on the combined local search. Therefore, since the local search is ablated for this study, the policy rollout baseline also needs to be changed, and here we replace it by the central self-critic baseline [GPN]. The results of the ablation study presented in Table III demonstrate that all the components used in our method contribute to its good performance.

VII Conclusion

We introduced a novel deep RL approach for solving TSP achieving state-of-the-art results in terms of gap to optimality and runtime compared to previous deep RL methods. The results are particularly promising in terms of generalizability. Our proposition consists of a simple deep learning architecture combining graph neural network, multi-layer perceptron, and attention in addition to a novel RL training procedure, which interleaves local search and policy gradient updates, uses a novel return evaluation, and exploits curriculum learning.

As future work, we plan to evaluate our novel training procedure with other deep learning architectures proposed for TSP, as well as on other combinatorial optimization problems. Another research direction to further improve our results is to optimize the deep learning architecture in order to improve the encoding of the TSP problem.


This work is supported in part by the Innovative Practice Program of Shanghai Jiao Tong University (IPP21141), the program of National Natural Science Foundation of China (No. 61872238), and the program of the Shanghai NSF (No. 19ZR1426700). Moreover, we thank GPN for sharing their source code, which served as initial basis for our work.