DAN: Decentralized Attention-based Neural Network to Solve the MinMax Multiple Traveling Salesman Problem

09/09/2021 ∙ by Yuhong Cao, et al. ∙ National University of Singapore

The multiple traveling salesman problem (mTSP) is a well-known NP-hard problem with numerous real-world applications. In particular, this work addresses MinMax mTSP, where the objective is to minimize the max tour length (sum of Euclidean distances) among all agents. The mTSP is normally considered as a combinatorial optimization problem, but due to its computational complexity, search-based exact and heuristic algorithms become inefficient as the number of cities increases. Encouraged by the recent developments in deep reinforcement learning (dRL), this work considers the mTSP as a cooperative task and introduces a decentralized attention-based neural network method to solve the MinMax mTSP, named DAN. In DAN, agents learn fully decentralized policies to collaboratively construct a tour, by predicting the future decisions of other agents. Our model relies on the Transformer architecture, and is trained using multi-agent RL with parameter sharing, which provides natural scalability to the numbers of agents and cities. We experimentally demonstrate our model on small- to large-scale mTSP instances, which involve 50 to 1000 cities and 5 to 20 agents, and compare against state-of-the-art baselines. For small-scale problems (fewer than 100 cities), DAN is able to closely match the performance of the best solver available (OR Tools, a meta-heuristic solver) given the same computation time budget. In larger-scale instances, DAN outperforms both conventional and dRL-based solvers, while keeping computation times low, and exhibits enhanced collaboration among agents.




I Introduction

The traveling salesman problem (TSP) is a challenging NP-hard problem, where given a group of cities (i.e., nodes) of a given graph (often complete), an agent needs to find a complete tour of this graph, i.e., a closed path from a given starting node that visits all other nodes exactly once with minimal path length. TSP can be further extended to the multiple traveling salesman problem (mTSP), where multiple agents collaborate with each other to visit all cities from a common starting node. Compared to TSP, mTSP has more general real-world applications such as last-mile delivery, UAV patrolling and transportation planning [bektas_multiple_2006]. As classical combinatorial optimization problems, TSP and mTSP are commonly solved using exact or heuristic algorithms. Exact algorithms can theoretically guarantee optimal solutions [bektas_multiple_2006, CPLEX], but rely on centralized, exhaustive planning, and thus do not scale well with the number of agents and cities. On the other hand, heuristic algorithms [bektas_multiple_2006, ORtools] only find suboptimal solutions but are significantly faster than exact algorithms. In recent years, encouraged by the development of deep reinforcement learning (dRL), neural-based methods have been developed to solve TSP instances [vinyals2015pointer, bello2016neural, kool2018attention] and showed promising advantages over heuristic algorithms. However, neural-based methods for mTSP are scarce [park_schedulenet_2021, hu2020reinforcement], due to the high dimensionality of this problem. In this work, we introduce DAN, a decentralized attention-based neural network to solve the MinMax mTSP. The MinMax objective is a standard metric for mTSP, which aims to minimize the max tour length among the agents, i.e., the time needed for the whole team to visit all cities in a distributed manner and return to the depot (i.e., the makespan).

Instead of solving mTSP as a combinatorial optimization problem, we focus on solving it as a decentralized cooperation problem, where agents each construct their own tour towards a common objective. To this end, we rely on a threefold approach. First, we formulate mTSP as a sequential decision-making problem and introduce a decision time gap that allows agents to make decisions asynchronously for enhanced collaboration. Second, we propose an attention-based neural network to allow agents to make individual decisions according to their own observations, which provides agents with the ability to implicitly predict other agents’ future decisions, by modeling the dependencies of all the agents and cities. Third, we train our model using multi-agent reinforcement learning with parameter sharing, which provides our model with natural scalability with the number of agents. We note that these tools are more general than mTSP, and could extend to other robotic problems that need to address agent allocation and/or distribution, such as multi-robot patrolling, distributed search/coverage, or collaborative manufacturing.

Figure 1: DAN’s final solution to an example mTSP problem.

We present test results on randomized mTSP instances involving 50 to 1000 cities and 5 to 20 agents, and compare DAN’s performance with that of meta-heuristic and dRL methods. There, we experimentally demonstrate that our model achieves performance close to OR Tools, a highly optimized meta-heuristic baseline [ORtools], in relatively small-scale mTSP (fewer than 100 cities). In relatively large-scale mTSP, our model is able to significantly outperform OR Tools both in terms of solution quality and computing time. DAN also outperforms two recent dRL-based methods in terms of solution quality for nearly all instances, while keeping computation times low.

The paper is structured as follows: Section II discusses existing TSP solvers and mTSP solvers. Section III formulates the specific mTSP considered. Section IV casts the mTSP into the RL framework. Section V introduces DAN, our decentralized attention-based neural network solution to the mTSP, and Section VI describes how training is carried out. Finally, Section VII presents and discusses our simulation results, while Section VIII contains closing remarks.

II Prior Works

II-A Single TSP

There are three kinds of methods to solve TSP instances: exact algorithms, heuristic algorithms, and neural-based approaches. Exact algorithms like dynamic programming and integer programming can theoretically guarantee optimal solutions. However, these algorithms do not scale well with the number of cities $n$: the complexity of the best known exact algorithm is $O(2^n n^2)$. Meta-heuristic algorithms like genetic algorithms [ea] and ant colony optimization [aco] have also been proposed to balance solution quality and computing time. Nevertheless, exact algorithms with handcrafted heuristics (e.g., CPLEX [CPLEX]) remain state-of-the-art, since they can reduce the search space efficiently.

Neural network methods for TSP became competitive after the recent advancements in dRL. Vinyals et al. [vinyals2015pointer] first built the connection between deep learning and TSP by proposing the Pointer Network, a sequence-to-sequence model with a modified attention mechanism, which allows one neural network to solve TSP instances composed of an arbitrary number of cities. Bello et al. [bello2016neural] presented a framework based on the Pointer Network and reinforcement learning to provide an appropriate paradigm for neural network training. Building on this research, Kool et al. [kool2018attention] replaced the recurrent unit in the Pointer Network and proposed a model based on a Transformer unit [vaswani2017attention]. Taking advantage of self-attention to model the dependencies among cities, Kool et al.’s model achieved a significant improvement in terms of solution quality. Although neural network methods for the TSP are still slightly worse than the state-of-the-art, they show significant advantages in computation time and have the potential for further improvements in solution quality.

II-B Multiple TSP

As a natural extension to the TSP, mTSP can also be considered as an optimization problem and solved by TSP solvers based on exact algorithms and heuristic algorithms. However, in mTSP, cities need to be partitioned and allocated to each agent in addition to finding optimal sequences/tours. This added complexity makes state-of-the-art TSP solvers impractical for larger mTSP instances. For popular exact solvers like CPLEX [CPLEX], since there is no good handcrafted heuristic for mTSP, hundreds of hours are required to find an exact solution to even a “small” 50-city MinMax mTSP instance [mtsplib]. As a result, meta-heuristic algorithms are the most popular method to solve mTSP instances. OR Tools [ORtools], developed by Google, is a highly optimized meta-heuristic solver. Although it does not guarantee optimal solutions, OR Tools is one of the most popular mTSP solvers because it effectively balances the trade-off between solution quality and computing time. However, it still suffers from the explosion of the search space with respect to the number of agents and cities.

While these conventional methods all rely on centralized combinatorial optimization, a few recent neural-based methods have also approached the mTSP in a decentralized manner. Notably, Hu et al. [hu2020reinforcement] proposed a model based on a shared graph neural network and distributed attention mechanism networks to first allocate cities to agents. OR Tools can then be used to quickly generate the tour associated with each agent. A shared graph neural network is used to extract features of each city for all agents. Based on these city features, each agent uses its own network to output a policy, which is a probability distribution over all cities. An auction-based mechanism is then run over these policies to allocate cities to agents. Although Hu et al.’s model achieved better performance than OR Tools in large-scale mTSP and showed good scalability with respect to the number of cities, their distributed networks assume a fixed number of agents, so their model cannot generalize to arbitrary team sizes without retraining. Most relevant to our work, Park et al. [park_schedulenet_2021] presented an end-to-end decentralized model based on a graph attention network, which could solve mTSP instances with arbitrary numbers of agents and cities. They used a type-aware graph attention mechanism (i.e., one city node only connects to other city nodes in the graph, and likewise for agent nodes) to learn the dependencies between cities and agents, where the extracted agent features and city features are then concatenated during the embedding procedure before outputting the final policy. However, the performance of Park et al.’s work remains lower than that of Hu et al.’s method.

III Problem Formulation

The mTSP is defined on a graph $G = (V, E)$, where $V = \{1, \dots, n\}$ is a set of $n$ nodes (cities), and $E$ is a set of edges. In this work, we consider $G$ to be complete, i.e., $(i, j) \in E$ for all $i \neq j$. Node $1$ is defined as the depot, where all $m$ agents are initially placed, and the remaining nodes as cities to be visited. Each city must be visited exactly once, by any agent. After all cities are visited, all agents return to the depot to finish their tour. The agent tours are evaluated via cost, which could be time, energy expenditure, or length of the tour. Following the usual mTSP statement in robotics [bektas_multiple_2006], we use the Euclidean distance between cities as edge weights, i.e., this work addresses the Euclidean mTSP. We note that, since a polynomial time approximation scheme was proposed for the Euclidean TSP to devise a high-quality approximate algorithm [arora1996polynomial], Euclidean TSP is generally regarded as a simpler variant of the TSP. However, mTSP requires us to optimize the allocation of cities (in addition to constructing individual tours), which cannot be done in Euclidean space. Therefore, such approximate algorithms cannot extend to Euclidean mTSP, making this problem generally much harder to approach than its single-agent counterpart.

We define a solution to mTSP $\pi = \{\pi^1, \dots, \pi^m\}$ as a set of $m$ agent tours. Each agent tour $\pi^i = (\pi^i_1, \pi^i_2, \dots, \pi^i_{n_i})$ is an ordered set of the cities visited by this agent, where $\pi^i_t \in V$ and $\pi^i_1 = \pi^i_{n_i}$ is the depot. $n_i$ denotes the number of cities in this agent tour, so $\sum_{i=1}^{m} n_i = n + 2m - 1$ since all agent tours involve the depot twice. Denoting the Euclidean distance between cities $i$ and $j$ as $c_{ij}$, the cost of agent $i$’s tour reads:

$$L(\pi^i) = \sum_{j=1}^{n_i - 1} c_{\pi^i_j \pi^i_{j+1}}. \tag{1}$$

MinMax (minimizing the max cost among agents) and MinSum (minimizing the total tour length) are two common objectives for mTSP [kaempfer2018learning, hu2020reinforcement, park_schedulenet_2021]. In this paper, we consider MinMax as the objective of our model (i.e., makespan):

$$\min_{\pi} \; \max_{i \in \{1, \dots, m\}} L(\pi^i). \tag{2}$$

This choice of objective encourages agents to complete the task (i.e., the full tour) as soon as possible, which is more aligned with real-world applications.
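As a small illustration of the tour cost and the MinMax objective described above, the following sketch evaluates a hand-made solution. The coordinates and tours are invented for the example, the function names are hypothetical, and the depot is stored at index 0 (Python's 0-based indexing) rather than as node 1.

```python
import math

def tour_length(tour, coords):
    """Sum of Euclidean edge lengths along one agent's tour (list of node indices)."""
    return sum(math.dist(coords[a], coords[b]) for a, b in zip(tour, tour[1:]))

def minmax_cost(tours, coords):
    """MinMax objective: length of the longest tour among all agents (makespan)."""
    return max(tour_length(t, coords) for t in tours)

# Toy instance (assumed for illustration): index 0 is the depot.
coords = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0), (0.0, 1.0)]
tours = [[0, 1, 2, 0], [0, 3, 0]]  # each tour starts and ends at the depot

print([tour_length(t, coords) for t in tours])  # [12.0, 2.0]
print(minmax_cost(tours, coords))               # 12.0
```

Here the tour lengths sum to 14, but the MinMax cost is 12: only the longest tour matters, which is why a balanced allocation of cities is as important as short individual tours.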


IV mTSP as an RL Problem

In this section, we cast mTSP into a decentralized multi-agent reinforcement learning (MARL) framework. In our work, we consider mTSP as a cooperative task instead of an optimization task. We first formulate mTSP as a sequential decision-making problem, which allows RL to tackle mTSP. We then detail the agents’ state and action spaces, and the reward structure used in our work.

IV-A Sequential Decision Making

Building upon recent works on neural-based TSP solvers, we consider mTSP as a sequential decision-making problem. That is, we let agents interact with the environment by performing a sequence of decisions, which in mTSP are to select the next city to visit. These decisions are made sequentially and asynchronously by each agent based on their own observation, upon arriving at the next city along their tour, thus constructing a global solution collaboratively. Each decision (i.e., RL action) will transfer the agent from the current city to the next city it selects. Here we introduce a decision time gap between two decision-making steps, to account for the time needed for the agent to transit to the next city (i.e., enacting a form of event-based system). At the time step one agent selects a city, the decision time gap for this agent is initialized as the Euclidean distance between the current city the agent occupies and the next city of that agent’s tour. This decision time gap then decreases by one unit every time step. The agent can only make its next decision when the decision time gap has a value of 0, i.e., when it reaches the next city along its tour. This helps us avoid potential conflicts in the city selection process, by endowing agents with asynchronous decentralized decision-making abilities. We empirically found that this assumption significantly improves collaboration.

It should be noted that we allow agents to return to the depot at any time during their tour, although this goes against the usual mTSP constraint that agents only return to the depot after all cities are visited. We observe that, if agents are not allowed to return to the depot in advance, at the end of the tour one agent might be forced to visit one of the remaining unvisited cities even if it is far from this agent but very close to other agents (which are not currently available since they are still in transit to their next selected city). This hurts exploration and the overall learning process, leading to poor final policies. Since returning to the depot during the tour always increases the tour length, removing this constraint actually makes mTSP more difficult, so our method does not gain any advantage over existing methods. Furthermore, we empirically observe that agents learn to return to the depot only at the end of their portion of the tour, thus actually satisfying the mTSP constraints in practice.

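Algorithm 1 Sequential decision making to solve mTSP.
Input: number of agents $m$, graph $G = (V, E)$
Initialize mask $M = \{0\}$, decision gap $g = \{0\}$, and
empty tours $\pi^i$ starting at the depot ($1 \le i \le m$).
while $M$ contains unvisited cities do
     for $i = 1, \dots, m$ do
         if $g_i$ has reached 0 then
              Observe cities state $s^c_i$ and agents state $s^a_i$ of agent $i$, and output policy $p$
              Select next city $\pi^i_t$ from $p$
              Append $\pi^i_t$ to $\pi^i$, $M[\pi^i_t] \leftarrow 1$, $g_i \leftarrow c_{\pi^i_{t-1} \pi^i_t}$
         end if
         Decrease $g_i$ by one time-step decrement
     end for
end while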
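The event-based loop of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `solve_sequential` and `nearest_unvisited` are hypothetical names, the depot is stored at index 0, the decrement `dt=0.1` mirrors the value reported for training, and the nearest-neighbor rule is a stand-in for the learned DAN policy, not the method itself. For brevity, agents return to the depot only once all cities are visited.

```python
import math

def nearest_unvisited(here, visited, coords):
    """Stand-in policy (NOT the learned DAN policy): greedy nearest neighbor."""
    candidates = [j for j in range(len(coords)) if not visited[j]]
    return min(candidates, key=lambda j: math.dist(coords[here], coords[j]))

def solve_sequential(coords, m, choose, dt=0.1):
    """Each agent selects a new city only when its decision time gap is 0."""
    n = len(coords)
    visited = [False] * n
    visited[0] = True                  # index 0 is the depot in this sketch
    tours = [[0] for _ in range(m)]    # every tour starts at the depot
    gap = [0.0] * m                    # decision time gap of each agent
    while not all(visited):
        for i in range(m):             # simultaneous deciders act one after another
            if gap[i] <= 0.0 and not all(visited):
                here = tours[i][-1]
                nxt = choose(here, visited, coords)
                tours[i].append(nxt)
                visited[nxt] = True
                gap[i] = math.dist(coords[here], coords[nxt])  # travel time
            gap[i] = max(0.0, gap[i] - dt)                     # one time step passes
    for t in tours:
        t.append(0)                    # all agents finally return to the depot
    return tours

coords = [(0.5, 0.5), (0.1, 0.1), (0.9, 0.9), (0.1, 0.9), (0.9, 0.1), (0.5, 0.9)]
print(solve_sequential(coords, 2, nearest_unvisited))
```

Because an agent that picked a far-away city stays "busy" for longer, nearby agents get to choose again first, which is the asynchronous behavior the decision time gap is meant to produce.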

IV-B Observation

We consider a fully observable world where one agent can access the states of all cities and all agents. Although partial observation is more common in decentralized MARL [zhang_multi-agent_2021], a global observation is necessary to make our model comparable to baseline algorithms, and partial observability will be considered in future works. The observation of each agent consists of three parts: the cities state, the agents state, and a global mask.

The cities state $s^c_i$ contains the Euclidean coordinates of all cities relative to the observing agent $i$. Compared to absolute coordinates, we empirically found that relative coordinates prevent premature convergence and lead to a better final policy.

The agents state $s^a_i$ contains the Euclidean coordinates of all agents relative to the observing agent $i$, and the agents’ decision time gaps $g$. As mTSP is a cooperative task, one agent can benefit from observing other agents’ states, e.g., to predict other agents’ future decisions.

Finally, agents can observe a global mask $M$: an $n$-dimensional binary vector containing the visit history for all $n$ cities. Each entry of $M$ is initially $0$, and then set to 1 after any agent has visited the corresponding city.
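To make the three parts of the observation concrete, here is a minimal sketch of how one agent's observation could be assembled from global information. The function name, argument names, and data layout are assumptions made for this example and are not taken from the authors' code.

```python
def build_observation(i, agent_pos, agent_gap, city_pos, visited):
    """Assemble agent i's observation from the fully observable world state."""
    ax, ay = agent_pos[i]
    # Cities state: coordinates of every city relative to the observing agent.
    cities_state = [(x - ax, y - ay) for x, y in city_pos]
    # Agents state: relative coordinates plus each agent's decision time gap
    # (three values per agent).
    agents_state = [(x - ax, y - ay, g) for (x, y), g in zip(agent_pos, agent_gap)]
    # Global mask: 1 for visited cities, 0 otherwise.
    mask = [1 if v else 0 for v in visited]
    return cities_state, agents_state, mask

cities, agents, mask = build_observation(
    0,
    agent_pos=[(0.5, 0.5), (0.25, 0.75)],
    agent_gap=[0.0, 0.3],
    city_pos=[(0.5, 0.5), (1.0, 0.0)],
    visited=[True, False],
)
print(agents[0])  # (0.0, 0.0, 0.0): the observing agent always sits at the origin
```

The last line shows why the observing agent's own position carries no information in relative coordinates: it is always the origin.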

IV-C Action

At each decision step $t$ of agent $i$, based on its current observation $o^i_t$, our decentralized attention-based neural network outputs a stochastic policy $p_\theta$, parameterized by the set of weights $\theta$:

$$p_\theta(\pi^i_t = j \mid o^i_t), \tag{3}$$

where $j$ denotes an unvisited city. Agent $i$ takes an action based on this policy to select the next city $\pi^i_t$. By performing such actions iteratively, agent $i$ constructs its tour $\pi^i$.

IV-D Reward Structure

To show the advantage of reinforcement learning, we try to minimize the amount of domain knowledge introduced into our approach. In this work, the reward is simply the negative of the max tour length among agents, and all agents share it as a global reward. This reward structure is sparse, i.e., agents can only receive this reward after all agents finish their tours. The reward is formulated as:

$$R(\pi) = -\max_{i \in \{1, \dots, m\}} L(\pi^i), \tag{4}$$

where $L(\pi^i)$ is the length of agent $i$’s tour and $m$ is the number of agents.

Algorithm 1 shows how to solve mTSP as an RL problem. There, selecting the next city is performed in an entirely decentralized (and asynchronous) manner by each agent in our model, as detailed in the next section. It should be noted that, although the agents make decisions at different time steps from each other (due to the decision time gap), there are still times where multiple agents need to make a decision at the same time step, such as at the first time step. In those cases, we make agents select cities sequentially, where subsequent agents know the decisions of previous ones. We note that simply using a smaller decrement can make synchronous decision-making steps happen arbitrarily rarely after the first time step.

V Decentralized Attention-Based Neural Network

We propose a decentralized attention-based neural network (DAN), through which agents can select the next city according to their own observations. It consists of a city encoder, an agent encoder, a city-agent encoder, and a decoder. Its structure is used to model three kinds of dependencies in mTSP, i.e., the agent-agent dependencies, the city-city dependencies, and the agent-city dependencies. To achieve good collaboration in mTSP, it is important for agents to learn all of these dependencies to make decisions that benefit the whole team. Each agent uses its local DAN network to select the next city based on its own observation. Compared to existing attention-based TSP solvers, which only learn dependencies among cities and find good individual tours, DAN further endows agents with the ability to predict each other’s future decisions to improve agent-city allocation, by adding the agent encoder and the city-agent encoder.

The overall model is shown in Fig. 2. In general, based on the observations of the deciding agent, we first use the city encoder and the agent encoder to model the dependencies among cities and among agents respectively. Then in the city-agent encoder we update the city features by considering other agents’ potential decisions according to their features. Finally, in the decoder, based on the deciding agent’s current state and the updated city features, we allocate attention weights to each city, which we directly use as its policy. We detail this process in this section.

V-A Attention Layer

The Transformer attention layer [vaswani2017attention] is used as the fundamental building block in our model. The input of such an attention layer consists of the query source $h^q$ and the key-and-value source $h^{k,v}$, which are both composed of vectors with the same dimension $d$. The attention layer updates the query source using the weighted sum of the values, where the attention weight depends on the similarity between query and key. We compute the query $q_i$, key $k_i$ and value $v_i$ as:

$$q_i = W^Q h^q_i, \quad k_i = W^K h^{k,v}_i, \quad v_i = W^V h^{k,v}_i, \tag{5}$$

where $W^Q, W^K, W^V$ are all learnable matrices with size $d \times d$. Next, we compute the similarity $u_{ij}$ between the query $q_i$ and the key $k_j$ using a scaled dot product:

$$u_{ij} = \frac{q_i^\top k_j}{\sqrt{d}}. \tag{6}$$

Then we calculate the attention weights $a_{ij}$ using a softmax:

$$a_{ij} = \frac{e^{u_{ij}}}{\sum_{j'} e^{u_{ij'}}}. \tag{7}$$

Finally, we compute a weighted sum of these values as the output embedding from this attention layer:

$$h'_i = \sum_{j} a_{ij} v_j. \tag{8}$$

The embedding content is then passed through the feed-forward sublayer, which contains two linear layers with hidden dimension $d_f$ and a ReLU activation:

$$\mathrm{FF}(h'_i) = W^{f}_{1} \, \mathrm{ReLU}\left(W^{f}_{0} h'_i + b^{f}_{0}\right) + b^{f}_{1}. \tag{9}$$

Note that layer normalization [ba_layer_2016] and residual connections [he_deep_2016] are used within these two sublayers as in [vaswani2017attention].
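The attention computation described in this subsection (projection, scaled dot-product similarity, softmax, weighted sum) can be sketched in plain Python as follows. This is a single-head illustration with hypothetical helper names. It deliberately omits the feed-forward sublayer, layer normalization, and residual connections, and it uses identity weight matrices in the usage example purely so the numbers are easy to follow.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(u):
    """Numerically stable softmax over a list of similarities."""
    mx = max(u)
    e = [math.exp(v - mx) for v in u]
    s = sum(e)
    return [v / s for v in e]

def attention(h_q, h_kv, Wq, Wk, Wv):
    """Single-head scaled dot-product attention; returns one output per query."""
    d = len(Wq)
    keys = [matvec(Wk, h) for h in h_kv]
    vals = [matvec(Wv, h) for h in h_kv]
    out = []
    for h in h_q:
        q = matvec(Wq, h)
        # Similarity between this query and every key, scaled by sqrt(d).
        u = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        a = softmax(u)
        # Output embedding: attention-weighted sum of the values.
        out.append([sum(a_j * v[c] for a_j, v in zip(a, vals)) for c in range(d)])
    return out

identity = [[1.0, 0.0], [0.0, 1.0]]
print(attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], identity, identity, identity))
```

Passing the same list as `h_q` and `h_kv` gives self-attention (as in the city and agent encoders), while passing two different lists gives cross-attention (as in the city-agent encoder).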

Figure 2: DAN consists of a city encoder, an agent encoder, a city-agent encoder and a final decoder, which allows each agent to process its inputs (the cities states, and the agents states) in a decentralized manner, to finally obtain its own city selection policy. In particular, the agent and city-agent encoders are introduced in this work to endow agents with the ability to predict each other’s future decisions and improve the decentralized distribution of agents.

V-B City Encoder

The city encoder is used to extract features from the cities state $s^c_i$ and model their dependencies. The city encoder first embeds the relative Euclidean coordinates of each city into a $d$-dimensional initial city embedding $h^c$ using a linear layer. Similarly, the depot’s Euclidean coordinates are embedded by a separate linear layer. The initial city embedding is then passed through an attention layer. Here the query source and the key-and-value source are both the initial city embedding $h^c$, as is commonly done in self-attention mechanisms. Self-attention achieved good performance in modeling the dependencies among cities in single TSP approaches [kool2018attention], and we propose to rely on the same fundamental idea to model the dependencies in mTSP.

We term the output of the city encoder, $e^c$, the city embedding. Due to the self-attention mechanism used, $e^c$ contains the dependencies between each city and all other cities.

V-C Agent Encoder

The agent encoder is used to extract features from the agents state $s^a_i$ and model their dependencies. A linear layer is used to separately embed each (3-dimensional) component of $s^a_i$ into the initial agent embedding $h^a$. This embedding is then passed through an attention layer, as both the query source and the key-and-value source (i.e., self-attention again).

We term the output of the agent encoder, $e^a$, the agent embedding. It contains the dependencies between each agent and all other agents.

V-D City-agent Encoder

The city-agent encoder is used to model the dependencies between cities and agents. The city-agent encoder applies an attention layer with cross-attention, where the query source is the city embedding, and the key-and-value is the agent embedding.

We term the output of this encoder, $e^{ca}$, the city-agent embedding. It contains the relationship between each city $j$ and each agent $i$. That is, it implicitly predicts whether city $j$ is likely to be selected by another agent $i$, which is one of the keys to the improved performance of our model.

V-E Decoder

The decoder is used to decode the different embeddings into a policy for selecting the next city to visit. The decoder starts with encoding the deciding agent’s current state. Since we use positions relative to the deciding agent, the agent’s own position coordinate is always $(0, 0)$. To avoid meaningless inputs, we instead choose to express the current agent state implicitly by computing an aggregated embedding $e^s$, which is the mean of the city embedding. This operation is similar to the graph embedding used in [kool2018attention].

The first attention layer then adds the agent embedding to the aggregated embedding. In doing so, it relates the state of the deciding agent to that of all other agents. Here the query source is the aggregated embedding $e^s$ and the key-and-value source is the agent embedding $e^a$. This layer outputs the current state embedding $e^{s'}$.

After that, a second attention layer is used to compute the final candidate embedding $e^f$, where the query source is the current state embedding $e^{s'}$, and the key-and-value source is the city-agent embedding $e^{ca}$. This layer serves as a glimpse, which is commonly used to improve attention mechanisms [bello2016neural].

There, when computing the similarity $u_j$ between the query and the key of city $j$, we rely on the global mask $M$ to manually set the similarity to $-\infty$ if the corresponding city has already been visited:

$$u_j = \begin{cases} -\infty & \text{if } M_j = 1, \\ u_j & \text{otherwise,} \end{cases} \tag{10}$$

to ensure the attention weights of visited cities are $0$.

The final candidate embedding $e^f$ then passes through a third and final attention layer. The query source is the current state embedding $e^{s'}$, and the key source is the final candidate embedding $e^f$. For this final layer only, following [vinyals2015pointer], we directly use the vector of attention weights as the final policy for the deciding agent.

The same masking operation Eq.(10) is also applied in this layer to satisfy the mTSP constraint. Besides, following [bello2016neural], the similarity is clipped within $[-C, C]$ to encourage exploration and improve the final policy by preventing premature convergence:

$$u_j \leftarrow C \cdot \tanh(u_j). \tag{11}$$

These similarities are normalized using a softmax operation, to finally yield the probability distribution $p$ for the next city to visit:

$$p_j = \frac{e^{u_j}}{\sum_{j'} e^{u_{j'}}}. \tag{12}$$

The deciding agent can select the next city to visit either by greedily selecting the city with the highest probability or by sampling based on the probability distribution $p$.
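The last decoder step (clipping, masking, softmax, then greedy or sampled selection) can be illustrated with the short sketch below. The function names are hypothetical, `C=10.0` is an assumed placeholder value, and the sketch assumes at least one city is still unvisited. Note that masking is applied after the tanh clipping here, since clipping a similarity of minus infinity would map it back to the finite value `-C`.

```python
import math
import random

def policy_from_similarities(u, mask, C=10.0):
    """Clip similarities to [-C, C], mask visited cities, and apply a softmax."""
    logits = [-math.inf if m == 1 else C * math.tanh(v) for v, m in zip(u, mask)]
    mx = max(logits)                          # finite if any city is unvisited
    e = [math.exp(l - mx) for l in logits]    # exp(-inf) evaluates to 0.0
    s = sum(e)
    return [v / s for v in e]

def select_city(p, greedy=True, rng=random):
    """Greedy: most probable city. Sampling: draw a city according to p."""
    if greedy:
        return max(range(len(p)), key=p.__getitem__)
    return rng.choices(range(len(p)), weights=p, k=1)[0]

p = policy_from_similarities([0.2, 5.0, -1.0], mask=[0, 1, 0])
print(p[1])                                   # 0.0: the visited city is never chosen
print(select_city(p))                         # 0
print(select_city(p, greedy=False, rng=random.Random(0)))
```

Greedy selection is deterministic and cheap, while sampling yields different tours on repeated runs, which is what makes a best-of-many evaluation possible.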

VI Training

In this section, we describe how DAN is trained, including the choice of hyperparameters and hardware used.

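VI-A REINFORCE with Rollout Baseline

In order to train our model, we define the policy loss:

$$\mathcal{L}(\theta) = -\mathbb{E}_{p_\theta(\pi)}\left[R(\pi)\right], \tag{13}$$

where $p_\theta(\pi)$ is the probability of the solution $\pi$ under the policy, i.e., the product of the probabilities of all individual city selections. Since the reward $R(\pi)$ is the negative of the max length among the tours of agents, minimizing this loss minimizes the expected max tour length. The loss is optimized by gradient descent using the REINFORCE algorithm with greedy rollout baseline [kool2018attention]. That is, we re-run the exact same episode from the start a second time, and let all agents take decisions by greedily exploiting the best policy so far (i.e., the “baseline model” explained in Section VI-C below). The cumulative reward $R(\pi^b)$ of this baseline episode is then used to estimate the advantage function:

$$A(\pi) = R(\pi) - R(\pi^b) \tag{14}$$

(with $R(\pi)$ the cumulative reward at each state of the RL episode). This helps reduce the gradient variance and avoids the burden of training the model to explicitly estimate the state value, as in traditional actor-critic algorithms. The final gradient estimator for the policy loss reads:

$$\nabla_\theta \mathcal{L} = -\mathbb{E}_{p_\theta(\pi)}\left[\left(R(\pi) - R(\pi^b)\right) \nabla_\theta \log p_\theta(\pi)\right]. \tag{15}$$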
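To show how a greedy rollout baseline enters the REINFORCE update, here is a deliberately tiny toy: a softmax policy over three "solutions" with fixed costs, rather than a full mTSP model. Everything in it is an assumption made for illustration, including the names `train_toy` and `softmax`, the step size, and the use of the current policy's greedy choice as the baseline (the text above instead uses a separately stored baseline model).

```python
import math
import random

def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    mx = max(theta)
    e = [math.exp(t - mx) for t in theta]
    s = sum(e)
    return [v / s for v in e]

def train_toy(costs, steps=500, lr=0.5, seed=0):
    """REINFORCE with a greedy rollout baseline on a 1-step toy problem."""
    rng = random.Random(seed)
    theta = [0.0] * len(costs)          # policy parameters (one logit per solution)
    for _ in range(steps):
        p = softmax(theta)
        a = rng.choices(range(len(costs)), weights=p, k=1)[0]  # sampled rollout
        b = max(range(len(costs)), key=theta.__getitem__)      # greedy rollout
        advantage = (-costs[a]) - (-costs[b])                  # R(pi) - R(pi^b)
        for k in range(len(theta)):
            # Gradient of log p(a) w.r.t. theta_k for a softmax policy.
            grad_logp = (1.0 if k == a else 0.0) - p[k]
            # Gradient descent on the surrogate loss -(advantage * log p(a)).
            theta[k] += lr * advantage * grad_logp
    return softmax(theta)

print(train_toy([2.5, 2.0, 3.0]))  # probability mass shifts toward the cheapest solution
```

A sampled solution that beats the greedy rollout gets a positive advantage and becomes more likely, while a worse one is pushed down. When both rollouts coincide the advantage is zero and nothing changes, which is how the baseline reduces variance without a learned critic.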


VI-B Parameter Sharing

As agents in mTSP are homogeneous, we train our model using parameter sharing, a general method for MARL [gupta_cooperative_nodate]. That is, we allow agents to share the parameters of a common neural network, thus making the training more efficient by relying on the sum of experience from all agents. Meanwhile, parameter sharing provides our model with natural scalability to the number of agents so it can handle arbitrary-scale mTSP instances (as shown in our results).

VI-C Distributed Training

Our model is trained on a workstation equipped with an i9-10980XE CPU and four NVIDIA GeForce RTX 3090 GPUs. We train our model utilizing Ray, a distributed framework for machine learning [moritz2018ray], to accelerate training by parallelizing the code. With Ray, we run 8 mTSP instances in parallel and pass gradients to be applied to the global network under the A2C structure, a synchronous variant of A3C [noauthor_openai_2017].

At each training episode, the positions of cities are generated uniformly at random in the unit square $[0,1]^2$ and the decision time gap decreases by 0.1 every time step (defining the agent velocity). The number of agents and the number of cities are each randomized within a fixed range during early training. After initial convergence of the policy, the number of cities is randomized within a second range for further refinement.

We formulate one training batch after 8 mTSP instances are solved, and perform one gradient update for each agent. We train the model with the Adam optimizer [kingma_adam_2017], decaying the learning rate from its initial value by a factor of 0.96 every 1024 steps. Every 2048 steps we compare the current training model with the baseline model, and replace the baseline model if the improvement is significant according to a paired t-test on an mTSP test set with 2048 instances. Our full training and testing code is publicly available.
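The two bookkeeping rules in this paragraph, the stepwise learning-rate decay and the paired comparison against the baseline model, can be sketched as follows. The initial learning rate `lr0=1e-4` is an assumed placeholder, the function names are hypothetical, and the second function only returns the t statistic of a paired t-test. Turning it into an accept/reject decision additionally requires a significance level, which is not specified here.

```python
import math

def learning_rate(step, lr0=1e-4, decay=0.96, every=1024):
    """Staircase decay: multiply the rate by `decay` once per `every` steps."""
    return lr0 * decay ** (step // every)

def paired_t_statistic(current_costs, baseline_costs):
    """t statistic of the per-instance differences (baseline - current).

    Positive values mean the current model yields shorter max tours on average.
    Assumes at least two instances and non-identical differences.
    """
    diffs = [b - c for c, b in zip(current_costs, baseline_costs)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((x - mean) ** 2 for x in diffs) / (n - 1)   # sample variance
    return mean / math.sqrt(var / n)

print(learning_rate(0), learning_rate(1024), learning_rate(2048))
print(paired_t_statistic([2.0, 2.1, 1.9, 2.0], [2.2, 2.2, 2.1, 2.3]))
```

The test is paired because both models are evaluated on the same instances, so instance difficulty cancels out in the per-instance differences.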


VII Experiments

We test our decentralized attention-based neural network (DAN) on numerous sets of 500 mTSP instances each, generated uniformly at random in the unit square $[0,1]^2$.

We test two different variants of our model, since it is possible to construct the solution with two different strategies:

  • Greedy: each agent always selects the action with highest activation in its policy.

  • Sampling: each agent selects the city stochastically according to its policy.

For the sampling strategy, we run our model multiple times on the same instance and report the solution with the highest quality. While [kool2018attention] sample 1280 solutions for single TSP, we only sample 64 solutions (denoted as s.64) for each mTSP instance to balance the trade-off between computing time and solution quality. In our tests, the decision time gap decreases by 0.01 for each time step (i.e., agents move slower than during training), which slightly increases the computing time for each solution, but further improves the performance of our model by allowing more finely-grained asynchronous action selection.
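The sampling strategy amounts to a best-of-k loop around a stochastic solver. The sketch below shows that loop on a stand-in problem (randomly splitting job lengths between two workers and scoring the larger load), because a trained policy is out of scope here. The names `best_of_k`, `random_split`, and `makespan` are all hypothetical.

```python
import random

def best_of_k(sample_solution, cost, k=64, seed=0):
    """Draw k stochastic solutions and keep the one with the lowest cost."""
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(k):
        solution = sample_solution(rng)
        c = cost(solution)
        if c < best_cost:
            best, best_cost = solution, c
    return best, best_cost

# Stand-in problem (NOT mTSP): assign each load to one of two workers at random.
loads = [3, 1, 4, 1, 5]

def random_split(rng):
    return [rng.randrange(2) for _ in loads]

def makespan(assignment):
    totals = [0, 0]
    for load, worker in zip(loads, assignment):
        totals[worker] += load
    return max(totals)

print(best_of_k(random_split, makespan, k=1))
print(best_of_k(random_split, makespan, k=64))
```

With a fixed seed, the 64-sample result can never be worse than the single-sample result, since the first sample is shared. The price is roughly k times the computing time of one rollout.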

VII-A Results

We report the average MinMax (lower is better) for small-scale mTSP instances (from 50 to 200 cities) in Table I, which also shows the average computing time per instance for each of the considered solvers. The performance of our model is compared with both conventional methods and neural-based methods.

For conventional methods, we test OR Tools, an evolutionary algorithm (EA), and self-organizing maps (SOM) [lupoaie_som-guided_2019] on the same test set. Similar to our model, OR Tools can obtain solutions using two different strategies: it can initially get a solution using meta-heuristic algorithms (denoted as OR); this solution can then be further improved by local search [gendreau_guided_2010] (denoted as OR+s.). This requires additional computing time, similar to our sampling strategy. We allow OR Tools to perform local search for a similar amount of time as our sampling strategy, for fair comparison.

Figure 3: Planning time for the different solvers from mTSP20 to mTSP140 where the number of agents is fixed to 5. The computing time of our model only increases linearly with respect to the number of cities, while the computing time of OR Tools increases exponentially.

For neural-based methods, we report Park et al.’s results [park_schedulenet_2021] and Hu et al.’s results [hu2020reinforcement] from their papers, since they did not make their code available publicly. Since Park et al.’s paper does not report the computing time of their approach, we leave the corresponding cells blank in Table I. Similarly, since Hu et al. did not provide any results for cases involving more than 100 cities or more than 10 agents, these cells are also left blank. Note that the test sets used by Park et al. and Hu et al. are likely different from ours, since they have not been made public. However, the values reported here from their paper are also averaged over 500 instances under the same exact conditions as the ones used in this work (random uniform placement of the cities in the unit square), which we believe allows for comparable results.

We then report the average MinMax for large-scale mTSP instances (from 400 to 1000 cities) in Table II, where the number of agents is fixed to 10 (due to the limitation of Hu et al.’s model). When testing the sampling strategy, we set in the third decoder layer for efficient exploration (since the tour is much longer). Apart from DAN and Hu et al.’s model, the other methods cannot handle such large-scale mTSP instances. For completeness, we still report the OR Tools results from Hu et al.’s paper, as well as SOM results as the best-performing meta-heuristic algorithm.

VII-B Discussion

We first notice that DAN significantly outperforms OR Tools in larger-scale mTSP instances with relatively more agents, but is outperformed by OR Tools in smaller-scale mTSP instances (as can be expected). In smaller-scale mTSP, OR Tools can explore the search space sufficiently and produce near-optimal solutions. In this situation, our decentralized model finds it difficult to achieve the same level of performance. Note that this holds for all decentralized methods considered.

Method n=50 m=5 n=50 m=7 n=50 m=10
Max. T(s) Max. T(s) Max. T(s)
EA 2.35 7.82 2.08 9.58 1.96 11.50
SOM 2.57 0.76 2.30 0.78 2.16 0.76
OR 2.11 1.07 2.05 1.06 2.05 1.06
OR+s. 2.04 12.00 1.96 12.00 1.96 12.00
Park et al. 2.37 2.18 2.10
Hu et al. 2.12 0.01 1.95 0.02
DAN(g.) 2.29 0.25 2.11 0.26 2.03 0.30
DAN(s.64) 2.12 7.87 1.99 9.38 1.95 11.26
Method n=100 m=5 n=100 m=10 n=100 m=15
Max. T(s) Max. T(s) Max. T(s)
EA 3.55 12.80 2.75 17.52 2.51 21.63
SOM 3.10 1.58 2.41 1.58 2.22 1.57
OR 2.41 10.79 2.29 11.09 2.27 11.32
OR+s. 2.36 18.00 2.29 18.00 2.25 18.00
Park et al. 2.88 2.23 2.16
Hu et al. 2.48 0.04 2.09 0.04
DAN(g.) 2.72 0.43 2.17 0.48 2.09 0.58
DAN(s.64) 2.55 12.18 2.05 14.81 2.00 19.13
Method n=200 m=10 n=200 m=15 n=200 m=20
Max. T(s) Max. T(s) Max. T(s)
EA 4.07 29.91 3.62 34.33 3.37 39.34
SOM 2.81 3.01 2.50 3.04 2.34 3.04
OR 2.57 63.70 2.59 60.29 2.59 61.74
Park et al. 2.50 2.38 2.44
DAN(g.) 2.40 0.93 2.20 0.98 2.15 1.07
DAN(s.64) 2.29 23.49 2.13 26.27 2.07 29.83
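Table I: Results on the random mTSP set (500 instances each). n denotes the number of cities and m denotes the number of agents. Max. is the average MinMax tour length and T(s) is the average computing time per instance in seconds.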
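| Method | n=400 Max. | n=400 T(s) | n=500 Max. | n=500 T(s) | n=600 Max. | n=600 T(s) | n=700 Max. | n=700 T(s) | n=800 Max. | n=800 T(s) | n=900 Max. | n=900 T(s) | n=1000 Max. | n=1000 T(s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OR [hu2020reinforcement] | 4.59 | 1705 | 7.75 | 1800 | 9.64 | 1800 | 11.24 | 1800 | 12.34 | 1800 | 13.71 | 1800 | 14.84 | 1800 |
| SOM [lupoaie_som-guided_2019] | 3.53 | 6.10 | 3.86 | 7.63 | 4.24 | 9.39 | 4.54 | 10.86 | 4.93 | 14.28 | 5.21 | 16.65 | 5.53 | 17.89 |
| Hu et al. [hu2020reinforcement] | 2.99 | 0.32 | 3.32 | 0.56 | 3.65 | 0.81 | 3.95 | 1.22 | 4.20 | 1.69 | 4.59 | 2.21 | 4.81 | 2.87 |
| DAN(g.) | 2.98 | 1.76 | 3.29 | 2.15 | 3.60 | 2.58 | 3.91 | 3.03 | 4.23 | 3.36 | 4.55 | 3.81 | 4.84 | 4.21 |
| DAN(s.64) | 2.83 | 40.04 | 3.14 | 48.91 | 3.46 | 57.81 | 3.75 | 67.69 | 4.10 | 77.08 | 4.42 | 87.03 | 4.75 | 97.26 |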
Table II: Results on the large-scale mTSP set (500 instances each) where the number of agents is fixed to 10.
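As a quick sanity check of the scaling trend, a straight line can be fitted to the DAN(g.) computing times listed in Table II. The sketch below is an ordinary least-squares fit written with the Python standard library only; the helper name `linear_fit` is an assumption made for illustration, and the fit covers only the seven Table II points (10 agents, 400 to 1000 cities).

```python
# DAN(g.) average computing time per instance, copied from Table II.
num_cities = [400, 500, 600, 700, 800, 900, 1000]
dan_greedy_time = [1.76, 2.15, 2.58, 3.03, 3.36, 3.81, 4.21]

def linear_fit(xs, ys):
    """Ordinary least squares for y = slope * x + intercept, plus R^2."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

slope, intercept, r_squared = linear_fit(num_cities, dan_greedy_time)
print(f"slope = {slope * 100:.2f} s per 100 cities, R^2 = {r_squared:.4f}")
```

On these seven points the fitted slope is about 0.41 s per additional 100 cities, with a coefficient of determination above 0.99, which is consistent with a linear growth of the greedy computing time over this range.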

Upon closer inspection, we note that our model seems very good at distributing cities to agents, but performs less well at finding the optimal visiting sequence. This is likely due to its decentralized, iterative nature (see Fig. 1, where the path of one agent occasionally crosses itself). However, our sampling strategy still achieves performance close to OR Tools in smaller-scale mTSP instances (within 4% of OR Tools with local search on mTSP50, see Table I), which is very promising. Meanwhile, in larger-scale mTSP, OR Tools performs poorly, as the size of the search space grows exponentially with the number of cities and agents. Thanks to its decentralized nature, our model surpasses OR Tools in larger-scale mTSP instances, where OR Tools can only yield sub-optimal solutions. In mTSP100 with 10 or 15 agents, our sampling strategy is about 10% better than OR Tools with local search (2.05 vs. 2.29 and 2.00 vs. 2.25 in Table I). The advantage of our model becomes more significant as the scale of the instance grows, even when using our greedy strategy. For instances involving 400 cities or more, OR Tools becomes impractical even when allowing up to 1800s per instance, while our greedy approach still outputs good-quality solutions in a matter of seconds. In general, the computing time of our model increases linearly with the scale of the mTSP instance, as shown in Fig. 3.
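The relative gaps discussed above follow directly from the Table I entries. The short Python sketch below recomputes them for the 50- and 100-city instances, where both DAN(s.64) and OR+s. are reported; the helper name `relative_gap` and the dictionary layout are illustrative assumptions.

```python
def relative_gap(ours, baseline):
    """Relative difference in MinMax; positive means `ours` is worse than `baseline`."""
    return (ours - baseline) / baseline

# (n, m) -> (DAN(s.64), OR+s.) average MinMax, copied from Table I.
table_i = {
    (50, 5): (2.12, 2.04),
    (50, 7): (1.99, 1.96),
    (50, 10): (1.95, 1.96),
    (100, 5): (2.55, 2.36),
    (100, 10): (2.05, 2.29),
    (100, 15): (2.00, 2.25),
}

gaps = {key: relative_gap(dan, ortools) for key, (dan, ortools) in table_i.items()}
for (n, m), gap in gaps.items():
    print(f"n={n:3d} m={m:2d} gap={gap:+.1%}")
```

The gap stays below 4% for all 50-city instances, is largest (about 8%) for 100 cities with 5 agents, and turns negative, meaning DAN is ahead by roughly 10 to 11%, for 100 cities with 10 or 15 agents.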

Second, we notice that DAN’s structure helps achieve better agent collaboration than the other two decentralized dRL methods, thus yielding better overall results. Compared to Park et al.’s model, our model achieves better performance across all experiments, even when using our greedy strategy. Their model uses a graph attention network, which is similar to self-attention, to extract agents’ and cities’ features separately. However, although Park et al. designed a complex procedure to aggregate these features, their model seems to lack an efficient mechanism like our city-agent encoder to explicitly learn the dependencies between cities and agents. From our results, it appears that such a mechanism plays a key role in implicitly predicting the future decisions of other agents. We believe this city-agent encoder leads to better collaboration, resulting in increased performance.

Regarding Hu et al.’s model, we remind the reader that DAN can handle mTSP instances with arbitrary team sizes, whereas Hu et al.’s model needs to be retrained if the number of agents changes. Despite this key difference, our model still produces (at least marginally) higher-quality solutions, especially in large-scale instances (400 to 1000 cities). Since Hu et al. only trained their model to allocate cities to the agents and then proposed to solve the resulting single-agent TSP using OR Tools for each agent, their method guarantees near-optimal routes for the allocated cities. Therefore, we conclude from our overall better performance that DAN is likely better at allocating cities. We believe this might be because their model runs an auction-based mechanism without explicitly considering agent collaboration/interactions. Thanks to our city-agent encoder, DAN agents can predict the decisions of others to reach improved collaboration, which results in better overall performance.
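The allocate-then-route structure described above can be illustrated with a small Python sketch. Everything in it is a stand-in chosen for illustration: `allocate_by_angle` is a hypothetical geometric rule rather than Hu et al.'s learned allocation, `two_opt` is a simple local search used in place of OR Tools, and a common depot is assumed.

```python
import math

def tour_length(depot, cities):
    """Length of a closed tour: depot -> cities in the given order -> depot."""
    stops = [depot, *cities, depot]
    return sum(math.dist(a, b) for a, b in zip(stops, stops[1:]))

def allocate_by_angle(depot, cities, num_agents):
    """Stage 1 (hypothetical rule): sort cities by angle around the depot, cut into chunks."""
    ordered = sorted(cities, key=lambda c: math.atan2(c[1] - depot[1], c[0] - depot[0]))
    size = math.ceil(len(ordered) / num_agents)
    return [ordered[i * size:(i + 1) * size] for i in range(num_agents)]

def two_opt(depot, cities):
    """Stage 2 (stand-in for a TSP solver): reverse segments while this shortens the tour."""
    route = list(cities)
    improved = True
    while improved:
        improved = False
        for i in range(len(route) - 1):
            for j in range(i + 1, len(route)):
                candidate = route[:i] + route[i:j + 1][::-1] + route[j + 1:]
                if tour_length(depot, candidate) < tour_length(depot, route) - 1e-12:
                    route, improved = candidate, True
    return route

def solve_two_stage(depot, cities, num_agents):
    """Allocate cities to agents first, then route each agent independently."""
    return [two_opt(depot, group) for group in allocate_by_angle(depot, cities, num_agents)]

depot = (0.0, 0.0)

# A route whose path crosses itself, and the same cities after 2-opt.
crossing = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
fixed = two_opt(depot, crossing)
print(tour_length(depot, crossing), tour_length(depot, fixed))

cities = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.5, 0.1)]
routes = solve_two_stage(depot, cities, num_agents=2)
print(routes)
```

In such a two-stage scheme, the routing stage can only repair the visiting order within each agent's set of cities, so the final MinMax value is largely determined by the quality of the allocation. This is the reasoning behind attributing the difference in results to allocation quality.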

VIII Conclusion

This work introduced DAN, a decentralized attention-based neural network to solve the MinMax multiple traveling salesman problem. We consider mTSP as a cooperative task and formulate it as a sequential decision-making problem, where agents iteratively and asynchronously build a collaborative mTSP solution. In doing so, our attention-based neural model allows agents to achieve implicit coordination and solve the mTSP instance together in a decentralized manner. Through our results, we showed that our model exhibits excellent performance for small- to large-scale mTSP instances, which involve 50 to 1000 cities and 5 to 20 agents. Compared to a state-of-the-art conventional baseline, our model achieves better performance in terms of both solution quality and computing time in large-scale mTSP instances, while achieving comparable performance in small-scale mTSP instances. Compared to two state-of-the-art decentralized dRL methods, our model exhibits better collaboration between agents, which leads to improved solutions. We believe that the developments made in the design of DAN can extend to more general robotic problems where agent allocation/distribution is key, such as multi-robot patrolling, distributed search/coverage, or collaborative manufacturing.

In addition to extending our approach to these robotic tasks, future work will focus on adapting DAN to dynamic mTSP, where new cities might appear and/or existing cities might move or disappear. Such problems are common in certain multi-robot applications, e.g., frontier-based exploration, yet few methods have been developed to date. There, we hypothesize that the reactive nature of our approach might give it an edge over centralized planners.