Solve Traveling Salesman Problem by Monte Carlo Tree Search and Deep Neural Network

by   Zhihao Xing, et al.

We present a self-learning approach that combines deep reinforcement learning and Monte Carlo tree search to solve the traveling salesman problem. The proposed approach has two advantages. First, it adopts deep reinforcement learning to compute the value functions used for decisions, which removes the need for hand-crafted features and labelled data. Second, it uses Monte Carlo tree search to select the best policy by comparing different value functions, which increases its generalization ability. Experimental results show that the proposed method performs favorably against other methods in small-to-medium problem settings and shows performance comparable to the state of the art in large problem settings.





Introduction

The travelling salesman problem (TSP) has a long history and many practical applications. Its goal is to find the shortest route that visits each city exactly once and returns to the origin city. Despite its importance, the problem is well known to be NP-hard [papadimitriou1977euclidean].

Traditional methods for solving TSP fall into three categories. First, exact methods traverse all permutations to find the optimal solution, which is feasible only for small-scale problems. Second, approximation algorithms can be applied, but they do not guarantee the optimal solution. Third, heuristic algorithms can find a satisfactory solution within a reasonable time, but they require well-designed heuristics to assist the search.
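The contrast between the first and third categories can be illustrated with a small sketch. The instance, the function names and the choice of nearest neighbor as the heuristic are assumptions made for illustration only; they are not part of the proposed method.

```python
import itertools
import math

def tour_length(points, order):
    """Length of the closed tour that visits points in the given order."""
    return sum(
        math.dist(points[order[i]], points[order[(i + 1) % len(order)]])
        for i in range(len(order))
    )

def brute_force_tour(points):
    """Exact search: try every permutation that starts at city 0."""
    rest = range(1, len(points))
    candidates = ((0,) + perm for perm in itertools.permutations(rest))
    return min(candidates, key=lambda order: tour_length(points, order))

def nearest_neighbor_tour(points):
    """Heuristic: always move to the closest unvisited city."""
    order, unvisited = [0], set(range(1, len(points)))
    while unvisited:
        last = points[order[-1]]
        nxt = min(unvisited, key=lambda c: math.dist(last, points[c]))
        order.append(nxt)
        unvisited.remove(nxt)
    return tuple(order)

# Hypothetical instance: corners of a 3 x 1 rectangle plus one interior point.
points = [(0, 0), (3, 0), (3, 1), (0, 1), (1, 0.5)]
exact = brute_force_tour(points)
greedy = nearest_neighbor_tour(points)
print(tour_length(points, exact), tour_length(points, greedy))
```

The exact search examines (n-1)! orderings, so it stops being practical after a handful of cities, while the heuristic runs quickly but gives no guarantee on the quality of its tour.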

Recent advances in deep learning have achieved breakthroughs in many fields [Krizhevsky2012ImageNet, Graves2013Speech]. Most of these achievements benefit from supervised learning, for which various neural network architectures have been proposed, including multi-layer perceptrons [Rosenblatt1960Perceptrons], convolutional networks [Lecun1989Backpropagation] and so on. However, training a deep neural network requires a huge amount of data. For example, the most famous dataset for image classification [Deng2009ImageNet] has about 3.2 million images. For TSP, we cannot easily obtain so much ground-truth data. Therefore, researchers have adopted reinforcement learning, which allows the network to learn from rewards and punishments.

Monte Carlo tree search (MCTS) has become a popular approach to two-player games since the appearance of AlphaGo Zero [Silver2016Mastering]. With the help of a deep neural network, MCTS can solve problems with a tremendously large solution space. Researchers have also applied MCTS to other problems similar to TSP [rimmel2011optimization, bnaya2011repeated].

In this paper, we present a new self-learning approach that combines deep reinforcement learning and Monte Carlo tree search to solve the travelling salesman problem. On 2D Euclidean graphs with up to 100 nodes, the proposed method significantly outperforms the supervised-learning approach [Vinyals2015Pointer] and obtains performance close to that of the reinforcement learning approach [Dai2017Learning].

The remainder of the paper is organized as follows. Section 2 reviews related work, Section 3 introduces the proposed DMR framework, Section 4 presents experimental results, and Section 5 concludes and discusses future work.

Related Work

In this section, we review three lines of work on solving TSP.

Shallow Neural Network

In 1985, Hopfield et al. proposed a neural network to solve TSP [hopfield1985neural]. This was the first attempt to use a neural network to solve a combinatorial optimization problem. Following the impressive results produced by this approach, many researchers worked on improving its performance [den1988traveling, alan1988alternative]. Many shallow network architectures were also proposed to solve combinatorial optimization problems [favata1991study, fort1988solving, angeniol1988self, kohonen1982self].

Deep Neural Network

In recent years, deep neural networks have been adopted to solve TSP. Vinyals et al. introduced a neural architecture called the Pointer Network (Ptr-Net) [Vinyals2015Pointer]. Ptr-Net is a simple model based on the sequence-to-sequence model. Compared with the sequence-to-sequence model, Ptr-Net uses an attention mechanism to produce an output dictionary whose size is proportional to the length of the input sequence. The network has two flaws. First, Ptr-Net can only be applied to small-scale problems: when the number of cities reaches 40, its performance suffers greatly. Second, the approach might generate invalid routes, for example a route in which a city is repeated.

Deep Reinforcement Learning

With the use of deep reinforcement learning, the deep Q-network [Mnih2013Playing] has become a general framework that is applied in many different methods, including [Bello2016Neural, Dai2017Learning].

Bello et al. proposed Neural Combinatorial Optimization [Bello2016Neural], which combines neural networks and reinforcement learning to deal with combinatorial optimization problems. This framework consists of two stages, the RL pretraining stage and the active search stage. The first stage is responsible for optimizing a recurrent neural network (RNN), and the second stage iteratively optimizes the RNN with the expected reward objective.

Dai et al. proposed a method called S2V-DQN [Dai2017Learning], which combines graph embedding and reinforcement learning. The method can extract topological information between different nodes in a graph. As a result, with the help of graph embedding, the approach can generalize to large-scale graphs even when trained on small-scale instances.

Kool et al. also combined deep neural networks and reinforcement learning to solve TSP [Kool2018Attention]. They integrated the attention mechanism [vaswani2017attention] into their framework, in which the encoder and decoder are both based entirely on attention. They improved the state-of-the-art performance on instances with 20, 50 and 100 cities. However, the pretrained network has to precisely match the problem scale, which weakens the generalization ability of their framework.

Proposed Approach

This section describes our approach to solving combinatorial optimization problems, which, as shown in Fig.xx, consists of three modules: a deep neural network, Monte Carlo tree search, and reinforcement learning. In our framework, the original problem of finding an optimal solution in a graph is converted into searching for the least-cost path in a tree. The deep neural network extracts topological information from the graph as node features, as an alternative to designing features manually. Monte Carlo tree search narrows the search space with the help of the value function provided by the deep neural network. We follow the reinforcement learning paradigm to generate the experience used to train the deep neural network. We empirically demonstrate that our approach can start from random initial choices and converge to the optimal solution.

Problem-solving tasks are typically carried out in a large number of steps. At each step, there are a number of branches, one of which is selected. The traveling salesman problem can be solved by the same process.
We use a node to represent a city. An instance of TSP can then be described by an undirected weighted graph $G = (V, E, w)$, where $V$ is the finite set of nodes, $e_{ij} \in E$ is the edge between nodes $v_i$ and $v_j$, and $w(e_{ij})$ is the weight of edge $e_{ij}$. Given a set of cities, we are concerned with finding the path that traverses each city exactly once, called a tour, and has the shortest length.

We convert the original problem of finding the shortest tour in a graph into searching for the least-cost path in a tree.

Tree Search

Tree search methods aim to find the optimal path in a tree. We use $S = (v_1, \dots, v_i)$ to represent a path that starts at $v_1$ and ends at $v_i$, so $S$ is an ordered sequence of traversed cities. We use $\bar{S}$ to denote the set of non-traversed cities, that is, the cities not in $S$.

In tree search, the traversed path $S$ corresponds to a path in the tree, where $v_1$ is the root of the tree and the last node of the path corresponds to $v_i$. Tree search selects the best node from the candidate set step by step according to the current state. Two traditional methods are breadth-first search (BFS) and depth-first search (DFS), but the complexity of both is of the same prohibitive order in the average case as in the worst case.
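The conversion from a graph to a tree can be made concrete with a short sketch. It assumes that every tree node is a partial path, that the root is the path containing only city 0, and that the children of a node extend its path by one unvisited city; the function names are hypothetical.

```python
import math

def children(path, n):
    """Child nodes of a tree node: the path extended by one unvisited city."""
    visited = set(path)
    return [path + (city,) for city in range(n) if city not in visited]

def count_leaves(path, n):
    """Number of complete tours in the subtree rooted at `path`."""
    if len(path) == n:
        return 1
    return sum(count_leaves(child, n) for child in children(path, n))

n = 6
root = (0,)  # assumption: the tour starts at city 0
print(children(root, n))
print(count_leaves(root, n), math.factorial(n - 1))
```

With the start city fixed, the tree has (n-1)! leaves, which is why exhaustive traversal by BFS or DFS does not scale.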

Monte Carlo tree search [Pearl1984Heuristics, Kocsis2006Bandit] is a heuristic search algorithm for certain kinds of decision processes, most notably those in games such as Total War and Go. Unlike DFS and BFS, Monte Carlo tree search focuses on the most promising moves and relies on random sampling of the search space. Before making a decision, MCTS repeats a search iteration many times, and each iteration consists of four steps, as illustrated in Figure 1.
Selection: Start from the root node R and select a child node of R according to a default policy. The newly selected node is treated as the new root, and the process is repeated until a leaf node L is reached.
Expansion: Unless the game ends at L, create one or more child nodes of L and select one of them, C.
Simulation: Start from node C and play with a random strategy, such as uniformly random moves, until the game is over.
Backpropagation: Update the node information on the path from node C to node R using the result of the random game.

Figure 1: Monte Carlo tree search
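The four steps can be sketched for TSP as follows. This is plain MCTS with random playouts and the standard UCT rule, not the adapted version proposed in this paper; the class and function names, the exploration constant 1.4 and the use of the negative tour length as reward are assumptions.

```python
import math
import random

class Node:
    def __init__(self, path, parent=None):
        self.path = path          # cities visited so far (tuple)
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_reward = 0.0   # sum of playout rewards seen below this node

def tour_length(points, order):
    return sum(math.dist(points[order[i]], points[order[(i + 1) % len(order)]])
               for i in range(len(order)))

def uct_child(node, c=1.4):
    # Standard UCT: average reward plus an exploration bonus.
    return max(node.children, key=lambda ch: ch.total_reward / ch.visits
               + c * math.sqrt(math.log(node.visits) / ch.visits))

def mcts_iteration(root, points, rng):
    n = len(points)
    node = root
    # 1. Selection: descend while the node is fully expanded and not terminal.
    while len(node.path) < n and len(node.children) == n - len(node.path):
        node = uct_child(node)
    # 2. Expansion: add one untried child, unless the tour is complete.
    if len(node.path) < n:
        tried = {ch.path[-1] for ch in node.children}
        free = [city for city in range(n)
                if city not in node.path and city not in tried]
        child = Node(node.path + (rng.choice(free),), parent=node)
        node.children.append(child)
        node = child
    # 3. Simulation: finish the tour with uniformly random moves.
    rest = [city for city in range(n) if city not in node.path]
    rng.shuffle(rest)
    reward = -tour_length(points, node.path + tuple(rest))  # shorter is better
    # 4. Backpropagation: update statistics from the new node up to the root.
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent

rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(7)]
root = Node((0,))
for _ in range(500):
    mcts_iteration(root, points, rng)
best = max(root.children, key=lambda ch: ch.visits)
print(best.path, best.visits)
```

In this plain version the average reward is not confined to a fixed interval, so the balance between the two terms of the UCT score depends on the scale of the tour lengths.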

For the traveling salesman problem, we propose an adapted version of MCTS. The details of the four phases are as follows:
Selection Strategy. [Kocsis2006Bandit] proposed a selection strategy called Upper Confidence bounds applied to Trees (UCT), which has achieved great success in game playing. Game playing and combinatorial optimization differ in two respects. First, game playing prefers the branch with the highest average winning rate, whereas combinatorial optimization aims to find an extreme value, which may lie in a direction whose average value is not good. So, given a node $p$, we modify the UCT policy to select the child $i$ of $p$ that maximizes the sum of an exploitation term and an exploration term. The exploitation term is the normalized best reward found in the subtree of node $i$, rather than the average reward. The exploration term depends on $n_p$ and $n_i$, the numbers of visits of node $p$ and node $i$ respectively, and is scaled by a parameter $C$ that balances exploitation and exploration.

The reward of a node combines two quantities, $g$ and $h$. Here $g$ is known and represents the actual length of the ordered sequence from the first node to the last node, while $h$ is unknown and is supposed to be the optimal length from the last node to the goal. In our framework, $h$ is evaluated by a deep neural network, which is described in the next section.

Second, the range of values differs between game playing and combinatorial optimization. In game playing, the result of a game is a loss, a draw, or a win, so the average reward of a node always stays within a fixed interval. In combinatorial optimization, an arbitrary returned reward may not fall in a predefined interval. Thus, for each node whose parent is node $p$, we normalize its best reward to [0,1] by min-max scaling with the maximum and minimum rewards among all children of node $p$.
Expansion Strategy. When a leaf node is reached, we do not expand it until its visit count reaches a preset threshold (we set this threshold to 40). This avoids generating too many branches, which would distract the search, and saves computational resources. Similar to the A* algorithm [hart1968formal], we expand all children of the leaf node at the same time.
Simulation Strategy. We use the value function in Equation 5 to evaluate all the children nodes created in the expansion stage.
Back-Propagation Strategy. Instead of propagating a child node's simulation reward, we propagate the best reward among all children nodes back to the root.
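The selection and back-propagation strategies can be sketched as follows. The exact formulas are assumptions: the sketch takes the reward of a node to be the negative of g + h, uses the common exploration term sqrt(ln n_p / n_i), and returns a neutral value when all siblings tie. The names `Node`, `normalized_best`, `select_child` and `backpropagate` are hypothetical, and the default c=0.5 is only an illustrative choice.

```python
import math

class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.visits = 0
        self.best_reward = -math.inf  # best reward seen in this subtree

def normalized_best(child, siblings):
    """Min-max scale a child's best reward to [0, 1] among its siblings."""
    rewards = [s.best_reward for s in siblings]
    low, high = min(rewards), max(rewards)
    if high == low:
        return 0.5  # assumption: all children tie, so use a neutral value
    return (child.best_reward - low) / (high - low)

def select_child(parent, c=0.5):
    """UCT-style choice using the normalized best reward instead of the mean."""
    def score(child):
        explore = math.sqrt(math.log(parent.visits) / child.visits)
        return normalized_best(child, parent.children) + c * explore
    return max(parent.children, key=score)

def backpropagate(node, reward):
    """Push the best reward found (not an average) up to the root."""
    while node is not None:
        node.visits += 1
        node.best_reward = max(node.best_reward, reward)
        node = node.parent

# Made-up values: each reward is -(g + h) for one evaluated child.
root = Node()
a, b = Node(root), Node(root)
root.children = [a, b]
backpropagate(a, -(2.0 + 5.0))
backpropagate(b, -(3.0 + 3.5))
backpropagate(b, -(3.0 + 6.0))
chosen = select_child(root)
```

Keeping the maximum rather than the mean means that one poor evaluation below a node does not hide a good tour already found there, which matches the goal of finding an extreme value.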

Neural Network Architecture

Inspired by the graph embedding network [dai2016discriminative, Dai2017Learning], we propose to use graph convolutions to extract features from the graph. Each node in the graph is represented by a feature vector and merges its neighbors' information recursively according to the graph topology. The feature of each node is a 9-dimensional vector. A binary element (0 or 1) indicates whether a node has been traversed. Besides the current node's information (traversal state, x-coordinate, y-coordinate), we pay particular attention to the first and the last node of the traversed path, because the solution path is a Hamiltonian path. In addition, we use the edge weight as a supplementary feature.

We now describe the parameterization of the graph convolutions using graph embedding. We map the features of each node in the graph to a hidden space by an update rule that combines the node's own features, the embeddings of its neighbor nodes, and the distances between the node and its neighbors. The weights of this rule are the learnable parameters, and the nonlinearity is the rectified linear unit (relu). The distance between two nodes is the Euclidean distance: for two points $(x_1, y_1)$ and $(x_2, y_2)$ in the two-dimensional plane, it is $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$.

After $T$ iterations, each node in the graph has an embedding, and we use this embedding information to define $h$ mentioned in Equation (2). Similar to [Dai2017Learning], $h$ is computed from the embeddings by fully connected layers applied to a concatenation of embedded features. As suggested by [dai2016discriminative], the number of iterations $T$ for graph embedding is 4. The architecture of the neural network is illustrated in Figure 2.

Figure 2: Neural network architecture. Each node in the graph is embedded into a vector by the graph convolution layers. The first fully connected layer (green) integrates the embedded features of all nodes in the graph. The last fully connected layer (gray) predicts the value of the selected node (orange).
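For concreteness, the embedding update and the value head can be sketched in the form used by S2V-DQN [Dai2017Learning], which this section follows. All symbols below are assumptions introduced for illustration: $\mu_v^{(t)}$ is the embedding of node $v$ after $t$ iterations, $x_v$ its feature vector, $d(v,u)$ the distance between nodes $v$ and $u$, $N(v)$ the neighbors of $v$, $v_t$ the selected node, and $\theta_1, \dots, \theta_7$ the parameters. The arrangement of terms is that of S2V-DQN and may differ from the parameterization used here.

```latex
\begin{align}
\mu_v^{(t+1)} &= \mathrm{relu}\Big(\theta_1 x_v
  + \theta_2 \sum_{u \in N(v)} \mu_u^{(t)}
  + \theta_3 \sum_{u \in N(v)} \mathrm{relu}\big(\theta_4\, d(v,u)\big)\Big),
  \qquad \mu_v^{(0)} = 0, \\
h &= \theta_5^{\top}\, \mathrm{relu}\Big(\Big[\theta_6 \sum_{u \in V} \mu_u^{(T)},\;
  \theta_7\, \mu_{v_t}^{(T)}\Big]\Big),
\end{align}
```

where $[\cdot,\cdot]$ denotes concatenation. The first line is applied $T$ times to every node, and the second line pools all node embeddings and combines them with the embedding of the selected node.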


[Vinyals2015Pointer] proposed the Pointer Network, which is trained with supervised learning. For combinatorial optimization problems, however, training a model in this way has two issues: (1) the performance of the model depends on the quality of the labeled data, and (2) obtaining high-quality labeled data is infeasible or costly for some combinatorial optimization problems. By contrast, we believe that reinforcement learning, which requires little supervision, is a natural framework for learning the value function in Equation 5.

Reinforcement learning formulation

We define the states, transitions, actions, rewards and policy in the reinforcement learning framework as follows:

  • States: a state $S$ is an ordered sequence of traversed nodes on a graph $G$. We use graph embedding to encode each state as a vector in the embedding space. The terminal state is reached when all the nodes have been traversed.

  • Transition: the transition is deterministic in the traveling salesman problem and corresponds to adding the selected node to $S$ and removing it from $\bar{S}$, where $S$ and $\bar{S}$ are the traversed sequence and the non-traversed sequence respectively.

  • Actions: an action is a node of $G$ in the non-traversed sequence $\bar{S}$.

  • Rewards: when all nodes in $G$ have been traversed, the length of the ordered sequence is obtained by summing the weights of the edges along it. The length of a partial sequence can be calculated in the same way each time a node is added to $S$. We define the reward function at a state as the length of the partial ordered sequence whose starting node is the current node of that state.

  • Policy: based on the value function given by the neural network, we use Monte Carlo tree search as the default policy to select the next action. After the search has been repeated a fixed number of times, we choose an action among all valid actions of the root state according to the reward of the state obtained by taking that action from the root state.
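The formulation above can be sketched as a small immutable state class. The class name `TSPState` and its methods are hypothetical, and the sketch assumes that the length of a complete tour includes the edge back to the starting city.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class TSPState:
    points: tuple      # city coordinates, fixed for the episode
    visited: tuple     # ordered sequence of traversed cities

    def actions(self):
        """Valid actions: cities that have not been traversed yet."""
        return [c for c in range(len(self.points)) if c not in self.visited]

    def step(self, city):
        """Deterministic transition: append the chosen city."""
        if city not in self.actions():
            raise ValueError("city already visited or out of range")
        return TSPState(self.points, self.visited + (city,))

    def is_terminal(self):
        return len(self.visited) == len(self.points)

    def partial_length(self):
        """Length of the traversed path (no closing edge)."""
        return sum(math.dist(self.points[a], self.points[b])
                   for a, b in zip(self.visited, self.visited[1:]))

    def tour_length(self):
        """Length of the closed tour; only valid at a terminal state."""
        assert self.is_terminal()
        first, last = self.visited[0], self.visited[-1]
        return self.partial_length() + math.dist(self.points[last], self.points[first])

# Usage on a unit square, visiting the corners in order.
state = TSPState(((0, 0), (1, 0), (1, 1), (0, 1)), (0,))
for city in (1, 2, 3):
    state = state.step(city)
print(state.partial_length(), state.tour_length())
```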

Learning algorithm

Similar to [Silver2016Mastering], we perform end-to-end learning of the neural network. First, the parameters of the neural network are initialized to random weights. When an episode ends, that is, when all the nodes have been traversed, the data for each time-step is stored as a pair consisting of the state and its reward, where the reward can be calculated according to Equation 8. The neural network is trained on samples drawn uniformly among all stored time-steps. Specifically, the parameters are learned by gradient descent on a loss function that combines the mean-squared error between the predicted value and the reward with an L2 weight regularization term, where c is a parameter that controls the level of L2 weight regularization.
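A loss of this kind can be written as follows. The symbols are assumptions introduced for illustration: $s$ is a stored state, $z$ its reward, $h_\theta(s)$ the value predicted by the network with parameters $\theta$, and $c$ the regularization parameter mentioned above.

```latex
\ell(\theta) = \big(z - h_\theta(s)\big)^2 + c\,\lVert \theta \rVert^2
```

The first term is the squared error on one sample, which is averaged over the sampled batch during training, and the second term penalizes large weights.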

Our training procedure is described in Algorithm 1.

Initialize experience replay memory M to capacity N
for each episode do
     Draw graph G from distribution D
     Initialize the state to empty S=()
     for each step until all nodes of G are traversed do
         Select a node by Monte Carlo tree search guided by the value function
         Add the selected node to the partial solution S
     end for
     Add the tuple of each time-step to M
     Sample a random batch B from M
     Update the network parameters by Adam over (10) for B
end for
Algorithm 1 Training Algorithm
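The control flow of the training procedure can be sketched as below. Several parts are stand-ins and should be read as assumptions: `nearest_unvisited` replaces the MCTS move selection, the `update` callback replaces the Adam step on the network, random points in the unit square replace the instance distribution, and the target stored for each time-step is assumed to be the length of the rest of the tour back to the start.

```python
import math
import random
from collections import deque

def remaining_length(points, tour, t):
    """Assumed target: length of the tour from its t-th city back to the start."""
    rest = tour[t - 1:] + tour[:1]
    return sum(math.dist(points[a], points[b]) for a, b in zip(rest, rest[1:]))

def nearest_unvisited(points, state):
    """Stand-in for the MCTS move selection."""
    if not state:
        return 0
    free = [c for c in range(len(points)) if c not in state]
    return min(free, key=lambda c: math.dist(points[state[-1]], points[c]))

def train(num_episodes, num_cities, capacity, batch_size, update, seed=0):
    rng = random.Random(seed)
    memory = deque(maxlen=capacity)            # replay memory M with capacity N
    for _ in range(num_episodes):
        points = [(rng.random(), rng.random()) for _ in range(num_cities)]
        state = ()                             # empty partial solution S
        while len(state) < num_cities:
            state += (nearest_unvisited(points, state),)
        for t in range(1, num_cities + 1):     # one tuple per time-step
            memory.append((points, state[:t], remaining_length(points, state, t)))
        batch = rng.sample(list(memory), min(batch_size, len(memory)))
        update(batch)                          # stand-in for an Adam step
    return memory

seen = []
memory = train(num_episodes=5, num_cities=6, capacity=20, batch_size=8,
               update=lambda batch: seen.append(len(batch)))
print(len(memory), seen)
```

A bounded `deque` drops the oldest experience once the capacity is reached, so later batches are drawn from the most recent episodes.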

Experimental Evaluation

Instance generation.

To evaluate the proposed approach against other deep learning approaches, we generate graph instances with the instance generator from the DIMACS TSP Challenge [johnson2007experimental]. We produce two types of graphs: random instances, in which points are scattered uniformly at random in a square, and clustered instances, in which points are clustered into four groups. We use a state-of-the-art solver, Gurobi, to compute optimal solutions.

Experimental Details.

For our approach, the graph representations and hyper-parameters are as follows. We embed the nodes' features into 64-dimensional vectors. We train our method using the Adam optimizer [kingma2014adam] and use the learning rate of . We use 400 simulations to select each move in the Monte Carlo tree search during both training and testing. We use Bayesian optimization to find the best value of the parameter that balances exploitation and exploration, and obtain the best performance when setting it to 0.5.

Details on Training and Testing

We train separate models for TSP20 and TSP50, each using 40 graphs randomly selected from the dataset. During testing, we use the model trained on TSP20 to evaluate performance on TSP20 and the model trained on TSP50 to evaluate performance on TSP50. For TSP100, we use the same model that was trained on TSP50. We use 100 graphs to test each of the three problems. Instead of using Active Search as in [Bello2016Neural], we use the pre-trained model directly and select the best solution among the results obtained by starting from different nodes.

Results and Analyses

We compare our approach with three strong baselines: Pointer Network [Vinyals2015Pointer], S2V-DQN [Dai2017Learning] and AttentionTSP [Kool2018Attention]. We use a machine with a Titan Xp GPU for training and testing the above three methods. For Pointer Network, we could not reproduce the results reported in the paper. We keep the original experimental setup for training S2V-DQN and AttentionTSP. Before testing on the new instances that we generated, we verified that S2V-DQN and AttentionTSP achieved the results reported in their papers. We then fine-tuned the parameters of these two methods using our generated data. Rather than reporting the approximation ratio, we report the average optimality gap mentioned in [Kool2018Attention].

We report the average optimality gap of the above approaches on random graphs in Table 1. Each approach is trained on random graphs and then tested on random graphs. Our approach performs favorably against Pointer Network and is comparable with S2V-DQN.

Approach          TSP20   TSP50   TSP100
Pointer Network   1.102   1.128   -
AttentionTSP      1.003   1.017   1.045
S2V-DQN           1.019   1.062   1.081
Our               1.010   1.063   1.095
Table 1: Average optimality gap of different models on random instances. We directly use the results reported in the Pointer Network paper.

Table 2 reports the average optimality gap of the above approaches on clustered graphs. Each approach is trained on random graphs and then tested on clustered graphs. Our approach obtains a better result than S2V-DQN on TSP20. When the number of nodes increases from 50 to 100, our approach is more stable than S2V-DQN. Moreover, the performance of AttentionTSP is poor on TSP100. Our approach generalizes across different kinds of graphs better than AttentionTSP.

Approach          TSP20   TSP50   TSP100
Pointer Network   -       -       -
S2V-DQN           1.027   1.061   1.082
AttentionTSP      1.017   1.101   1.685
Our               1.025   1.106   1.109
Table 2: Average optimality gap of different models on clustered instances. We exclude Pointer Network because it was not tested on clustered graphs in the original paper.

Besides the experiments on synthetic data, we evaluate our approach on the real-world dataset TSPLIB. Due to limited computing resources, we only test the instances with fewer than 100 nodes. Our approach achieves performance comparable to that of S2V-DQN.

Instance                 OPT      Our      S2V-DQN
eil51                    426      442      439
berlin52                 7542     7598     7542
st70                     675      695      696
eil76                    538      545      564
pr76                     108159   108576   108446
average optimality gap   1        1.003    1.002
Table 3: Best solutions of different models on real-world instances. We also evaluate AttentionTSP on those instances, but its performance is very poor. For example, the best solutions for eil51 and berlin52 are 1733 and 28233 respectively.
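As a worked example of the metric, the per-instance gap can be computed from the tour lengths in Table 3. The sketch assumes that the optimality gap is the ratio of the obtained tour length to the optimal length, which is consistent with the values close to 1 in the tables; the function name is hypothetical.

```python
def optimality_gap(length, optimal):
    """Ratio of a tour length to the optimal length (1.0 means optimal)."""
    return length / optimal

# (instance, optimal, ours, S2V-DQN), copied from Table 3.
rows = [
    ("eil51", 426, 442, 439),
    ("berlin52", 7542, 7598, 7542),
    ("st70", 675, 695, 696),
    ("eil76", 538, 545, 564),
    ("pr76", 108159, 108576, 108446),
]

ours = [optimality_gap(our, opt) for _, opt, our, _ in rows]
s2v = [optimality_gap(other, opt) for _, opt, _, other in rows]
mean_ours = sum(ours) / len(ours)
mean_s2v = sum(s2v) / len(s2v)
print(round(mean_ours, 3), round(mean_s2v, 3))
```

Under this assumption the means of the per-instance ratios are about 1.018 for our approach and 1.023 for S2V-DQN. These differ from the last row of Table 3, so that row was presumably aggregated in another way.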


Conclusion

We proposed a new framework for solving the traveling salesman problem that combines Monte Carlo tree search and deep reinforcement learning. Unlike previous works, in which labeled data or hand-crafted features may play an important role, our framework is completely unsupervised and can learn from samples generated by itself. The core idea of our approach is to convert TSP into a tree search problem. To the best of our knowledge, our framework is the first method in combinatorial optimization to combine tree search with a deep neural network. We have demonstrated that the proposed framework performs favorably against other methods in small-to-medium problem settings and shows performance comparable to the state of the art in large problem settings.