I Introduction
The traveling salesman problem (TSP) is a challenging NP-hard problem, where given a group of cities (i.e., nodes) of a given graph (often complete), an agent needs to find a complete tour of this graph, i.e., a closed path from a given starting node that visits all other nodes exactly once with minimal path length. TSP can be further extended to the multiple traveling salesman problem (mTSP), where multiple agents collaborate with each other to visit all cities from a common starting node. Compared to TSP, mTSP has more general real-world applications such as last-mile delivery, UAV patrolling and transportation planning [bektas_multiple_2006]. As classical combinatorial optimization problems, TSP and mTSP are commonly solved using exact or heuristic algorithms. Exact algorithms can theoretically guarantee optimal solutions [bektas_multiple_2006, CPLEX], but rely on centralized, exhaustive planning, and thus do not scale well with the number of agents and cities. On the other hand, heuristic algorithms [bektas_multiple_2006, ORtools] only find suboptimal solutions but are significantly faster than exact algorithms. In recent years, encouraged by the development of deep reinforcement learning (dRL), neural-based methods have been developed to solve TSP instances [vinyals2015pointer, bello2016neural, kool2018attention] and showed promising advantages over heuristic algorithms. However, neural-based methods for mTSP are scarce [park_schedulenet_2021, hu2020reinforcement], due to the high dimensionality of this problem. In this work, we introduce DAN, a decentralized attention-based neural network to solve the MinMax mTSP. The MinMax objective is a standard metric for mTSP, which aims to minimize the max tour length among the agents, i.e., the time needed for the whole team to visit all cities in a distributed manner and return to the depot (i.e., the makespan).
Instead of solving mTSP as a combinatorial optimization problem, we focus on solving it as a decentralized cooperation problem, where agents each construct their own tour towards a common objective. To this end, we rely on a threefold approach: first, we formulate mTSP as a sequential decision-making problem and introduce a decision time gap that allows agents to make decisions asynchronously for enhanced collaboration. Second, we propose an attention-based neural network to allow agents to make individual decisions according to their own observations, which provides agents with the ability to implicitly predict other agents’ future decisions, by modeling the dependencies of all the agents and cities. Third, we train our model using multi-agent reinforcement learning with parameter sharing, which provides our model with natural scalability with the number of agents. We note that these tools are more general than mTSP, and could extend to other robotic problems that need to address agent allocation and/or distribution, such as multi-robot patrolling, distributed search/coverage, or collaborative manufacturing.
We present test results on randomized mTSP instances involving 50 to 1000 cities and 5 to 20 agents, and compare DAN’s performance with that of metaheuristic and dRL methods. There, we experimentally demonstrate that our model achieves performance close to OR Tools, a highly optimized metaheuristic baseline [ORtools], in relatively small-scale mTSP (fewer than 100 cities). In relatively large-scale mTSP, our model is able to significantly outperform OR Tools both in terms of solution quality and computing time. DAN also outperforms two recent dRL-based methods in terms of solution quality for nearly all instances, while keeping computation times low.
The paper is structured as follows: Section II discusses existing TSP solvers and mTSP solvers. Section III formulates the specific mTSP considered. Section IV casts the mTSP into the RL framework. Section V introduces DAN, our decentralized attention-based neural network solution to the mTSP, and Section VI describes how training is carried out. Finally, Section VII presents and discusses our simulation results, while Section VIII contains closing remarks.
II Prior Works
II-A Single TSP
There are three kinds of methods to solve TSP instances: exact algorithms, heuristic algorithms, and neural-based approaches. Exact algorithms like dynamic programming and integer programming can theoretically guarantee optimal solutions. However, these algorithms do not scale well with the number of cities $n$: the complexity of the best known exact algorithm is $\mathcal{O}(2^n n^2)$. Metaheuristic algorithms like genetic algorithms [ea] and ant colony optimization [aco] have also been proposed to balance solution quality and computing time. Nevertheless, exact algorithms with handcrafted heuristics (e.g., CPLEX [CPLEX]) remain state-of-the-art, since they can reduce the search space efficiently.
Neural network methods for TSP became competitive after the recent advancements in dRL. Vinyals et al. [vinyals2015pointer] first built the connection between deep learning and TSP by proposing the Pointer Network, a sequence-to-sequence model with a modified attention mechanism, which allows one neural network to solve TSP instances composed of an arbitrary number of cities. Bello et al. [bello2016neural] presented a framework based on the Pointer Network and reinforcement learning to provide an appropriate paradigm for neural network training. Building on this research, Kool et al. [kool2018attention] replaced the recurrent unit in the Pointer Network and proposed a model based on a Transformer unit [vaswani2017attention]. Taking advantage of self-attention to model the dependencies between cities, Kool et al.’s model achieved a significant improvement in terms of solution quality. Although neural network methods for the TSP are still slightly worse than the state of the art, they show significant advantages in computation time and have the potential for further improvements in solution quality.
II-B Multiple TSP
As a natural extension to the TSP, mTSP can also be considered as an optimization problem and solved by TSP solvers based on exact algorithms and heuristic algorithms. However, in mTSP, cities need to be partitioned and allocated to each agent in addition to finding optimal sequences/tours. This added complexity makes state-of-the-art TSP solvers impractical for larger mTSP instances. For popular exact solvers like CPLEX [CPLEX], since there is no good handcrafted heuristic for mTSP, hundreds of hours are required to find an exact solution to even a “small” 50-city MinMax mTSP instance [mtsplib]. As a result, metaheuristic algorithms are the most popular method to solve mTSP instances. OR Tools [ORtools], developed by Google, is a highly optimized metaheuristic algorithm. Although it does not guarantee optimal solutions, OR Tools is one of the most popular mTSP solvers because it effectively balances the trade-off between solution quality and computing time. However, it still suffers from the explosion of the search space with respect to the number of agents and cities.
While these conventional methods all rely on centralized combinatorial optimization, a few recent neural-based methods have also approached the mTSP in a decentralized manner. Notably, Hu et al. [hu2020reinforcement] proposed a model based on a shared graph neural network and distributed attention mechanism networks to first allocate cities to agents. OR Tools can then be used to quickly generate the tour associated with each agent. A shared graph neural network is used to extract features of each city for all agents. Based on these city features, each agent uses its own network to output a policy, which is a probability distribution over all cities. An auction-based mechanism is then run over these policies to allocate cities to agents. Although Hu et al.’s model achieved better performance than OR Tools in large-scale mTSP and showed good scalability with respect to the number of cities, due to the fixed number of agents of the distributed networks, their model cannot generalize to arbitrary team sizes (thus requiring retraining). Most relevant to our work, Park et al. [park_schedulenet_2021] presented an end-to-end decentralized model based on a graph attention network, which could solve mTSP instances with arbitrary numbers of agents and cities. They used a type-aware graph attention mechanism (i.e., one city node only connects to other city nodes in the graph, and so do agent nodes) to learn the dependencies between cities and agents, where the extracted agents’ features and cities’ features are then concatenated during the embedding procedure before outputting the final policy. However, the performance of Park et al.’s work remains lower than that of Hu et al.’s method.
III Problem Formulation
The mTSP is defined on a graph $G=(V,E)$, where $V=\{1,\dots,n\}$ is a set of $n$ nodes (cities), and $E$ is a set of edges. In this work, we consider that $G$ is complete, i.e., $(i,j)\in E$ for all $i\neq j$. Node $1$ is defined as the depot, where all $m$ agents are initially placed, and the remaining nodes as cities to be visited. Each city must be visited exactly once by any agent. After all cities are visited, all agents return to the depot to finish their tour. The agent tours are evaluated via cost, which could be time, energy expenditure, or length of the tour. Following the usual mTSP statement in robotics [bektas_multiple_2006], we use the Euclidean distance between cities as edge weights, i.e., this work addresses the Euclidean mTSP. We note that, since a polynomial time approximation scheme was proposed for the Euclidean TSP to devise a high-quality approximate algorithm [arora1996polynomial], Euclidean TSP is generally regarded as a simpler variant of the TSP. However, mTSP requires us to optimize the city allocation (in addition to constructing individual tours), which cannot be done in Euclidean space. Therefore, such approximate algorithms cannot extend to Euclidean mTSP, making this problem generally much harder to approach than its single-agent counterpart.
We define a solution to mTSP $\pi=\{\pi^1,\dots,\pi^m\}$ as a set of $m$ agent tours. Each agent tour $\pi^i=(\pi^i_1,\pi^i_2,\dots,\pi^i_{n_i})$ is an ordered set of the cities visited by this agent, where $\pi^i_t\in V$ and $\pi^i_1=\pi^i_{n_i}$ is the depot. $n_i$ denotes the number of cities in this agent tour, so $\sum_{i=1}^{m} n_i = n+2m-1$ since all agent tours involve the depot twice. Denoting the Euclidean distance between cities $i$ and $j$ as $c_{ij}$, the cost of agent $i$’s tour reads:
$$L(\pi^i)=\sum_{j=1}^{n_i-1} c_{\pi^i_j\,\pi^i_{j+1}}. \tag{1}$$
MinMax (minimizing the max cost among agents) and MinSum (minimizing the total tour length) are two common objectives for mTSP [kaempfer2018learning, hu2020reinforcement, park_schedulenet_2021]. In this paper, we consider MinMax as the objective of our model (i.e., makespan):
$$\text{minimize}\ \max_{i\in\{1,\dots,m\}} L(\pi^i). \tag{2}$$
This choice of objective encourages agents to complete the task (i.e., the full tour) as soon as possible, which is more aligned with real-world applications.
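As a minimal illustration of the tour cost in Eq. (1) and the MinMax objective in Eq. (2), the sketch below evaluates a hand-made solution. The instance, the function names `tour_length` and `minmax_cost`, and the use of index 0 for the depot (a 0-based indexing choice made only for this example) are assumptions, not part of the original work.

```python
import math

def tour_length(tour, coords):
    """Eq. (1): sum of Euclidean distances between consecutive nodes of one tour."""
    return sum(math.dist(coords[a], coords[b]) for a, b in zip(tour, tour[1:]))

def minmax_cost(tours, coords):
    """Value of the MinMax objective in Eq. (2): the longest tour among all agents."""
    return max(tour_length(t, coords) for t in tours)

# Hypothetical 5-node instance: node 0 is the depot, nodes 1-4 are cities.
coords = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0), (0.5, -1.0)]
# One tour per agent; each tour starts and ends at the depot.
tours = [[0, 1, 2, 3, 0], [0, 4, 0]]

print(tour_length(tours[0], coords))  # 4.0 (four unit-length edges)
print(minmax_cost(tours, coords))     # 4.0 (the first agent has the longest tour)
```

Here the second agent's tour has length $2\sqrt{1.25}\approx 2.24$, so the makespan is set by the first agent; a better MinMax solution would shift a city from the first agent to the second.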
IV mTSP as an RL Problem
In this section, we cast mTSP into a decentralized multi-agent reinforcement learning (MARL) framework. In our work, we consider mTSP as a cooperative task instead of an optimization task. We first formulate mTSP as a sequential decision-making problem, which allows RL to tackle mTSP. We then detail the agents’ state and action spaces, and the reward structure used in our work.
IV-A Sequential Decision Making
Building upon recent works on neural-based TSP solvers, we consider mTSP as a sequential decision-making problem. That is, we let agents interact with the environment by performing a sequence of decisions, which in mTSP are to select the next city to visit. These decisions are made sequentially and asynchronously by each agent based on their own observation, upon arriving at the next city along their tour, thus constructing a global solution collaboratively. Each decision (i.e., RL action) will transfer the agent from the current city to the next city it selects. Here we introduce a decision time gap between two decision-making steps, to account for the time needed for the agent to transit to the next city (i.e., enacting a form of event-based system). At the time step one agent selects a city, the decision time gap for this agent is initialized as the Euclidean distance between the current city the agent occupies and the next city of that agent’s tour. This decision time gap then decreases by one unit every time step. The agent can only make its next decision when the decision time gap has a value of 0, i.e., when it reaches the next city along its tour. This helps us avoid potential conflicts in the city selection process, by endowing agents with asynchronous decentralized decision-making abilities. We empirically found that this assumption significantly improves collaboration.
It should be noted that we allow agents to return to the depot at any time during their tour, although this goes against the usual mTSP constraint that agents only return to the depot after all cities are visited. We observe that, if agents are not allowed to return to the depot in advance, at the end of the tour one agent might be forced to visit one of the remaining unvisited cities even if it is far from this agent but very close to other agents (which are not currently available since they are still in transit to their next selected city). This harms exploration and the overall learning process, leading to poor final policies. Since returning to the depot during the tour always increases the tour length, removing this constraint actually makes mTSP more difficult, so we note that our method does not gain any advantage over existing methods. Furthermore, we empirically observe that agents are able to learn to return to the depot only at the end of their portion of the tour, thus actually satisfying the mTSP constraints in practice.
IV-B Observation
We consider a fully observable world where one agent can access the states of all cities and all agents. Although a partial observation is more common in decentralized MARL [zhang_multiagent_2021], a global observation is necessary to make our model comparable to baseline algorithms, and partial observability will be considered in future works. The observation of each agent consists of three parts: the cities state, the agents state, and a global mask.
The cities state $s^c$ contains the Euclidean coordinates of all cities relative to the observing agent. Compared to absolute information, we empirically found that relative coordinates prevent premature convergence and lead to a better final policy.
The agents state $s^a$ contains the Euclidean coordinates of all agents relative to the observing agent, and the agents’ decision time gaps $g$. As mTSP is a cooperative task, one agent can benefit from observing other agents’ states, e.g., to predict other agents’ future decisions.
Finally, agents can observe a global mask $M$: an $n$-dimensional binary vector containing the visit history for all $n$ cities. Each entry of $M$ is initially $0$, and then set to $1$ after any agent has visited the corresponding city.
IV-C Action
At each decision step $t$ of agent $i$, based on its current observation $o^i_t$, our decentralized attention-based neural network outputs a stochastic policy $p_\theta(a_t \mid o^i_t)$, parameterized by the set of weights $\theta$:
$$p_\theta(a_t \mid o^i_t) = p_\theta(\pi^i_t = j \mid o^i_t), \tag{3}$$
where $j$ denotes an unvisited city. Agent $i$ takes an action $a_t$ based on this policy to select the next city $\pi^i_t$. By performing such actions iteratively, agent $i$ constructs its tour $\pi^i$.
IV-D Reward Structure
To show the advantage of reinforcement learning, we try to minimize the amount of domain knowledge introduced into our approach. In this work, the reward is simply the negative of the max tour length among agents, and all agents share it as a global reward. This reward structure is sparse, i.e., agents can only receive this reward after all agents finish their tours. The reward is formulated as:
$$R(\pi) = -\max_{i\in\{1,\dots,m\}} L(\pi^i). \tag{4}$$
Algorithm 1 shows how to solve mTSP as an RL problem. There, selecting the next city is performed in an entirely decentralized (and asynchronous) manner by each agent in our model, as detailed in the next section. It should be noted that, although the agents make decisions at different time steps from each other (due to the decision time gap), there are still times where multiple agents need to make a decision at the same time step, such as at the first time step. In those cases, we make agents select cities sequentially, where subsequent agents know the decision of previous ones. We note that simply using a smaller decrement can make synchronous decision-making steps happen arbitrarily rarely after the first time step.
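The event-based rollout described in this section can be sketched as follows. This is only an illustration under stated assumptions: the function names, the 0-based depot index, and in particular the `nearest_city` policy are placeholders standing in for the learned network, and the optional early return to the depot is not exercised by this placeholder policy.

```python
import math

DEPOT = 0  # assumption of this sketch: node 0 is the depot

def nearest_city(agent, position, unvisited, coords):
    """Placeholder policy (NOT the learned network): pick the closest unvisited city."""
    return min(unvisited, key=lambda c: math.dist(coords[position], coords[c]))

def rollout(coords, num_agents, policy=nearest_city, step=0.1):
    """Event-based rollout: an agent decides only when its decision time gap is 0."""
    unvisited = set(range(1, len(coords)))   # plays the role of the global mask
    position = [DEPOT] * num_agents          # node each agent occupies or heads to
    gap = [0.0] * num_agents                 # decision time gap of each agent
    tours = [[DEPOT] for _ in range(num_agents)]
    done = [False] * num_agents
    while not all(done):
        # Agents whose gap is 0 decide one after another, so a later agent
        # already sees the cities claimed by earlier ones at this time step.
        for i in range(num_agents):
            if done[i] or gap[i] > 0.0:
                continue
            if unvisited:
                nxt = policy(i, position[i], unvisited, coords)
                unvisited.discard(nxt)
            else:
                nxt = DEPOT                  # nothing left to visit: close the tour
                done[i] = True
            # The gap is initialized to the travel distance to the chosen node.
            gap[i] = math.dist(coords[position[i]], coords[nxt])
            position[i] = nxt
            tours[i].append(nxt)
        # Time advances by one step: every gap shrinks, but never below 0.
        gap = [max(0.0, g - step) for g in gap]
    return tours

# Hypothetical instance: depot in the center of the unit square, five cities.
coords = [(0.5, 0.5), (0.1, 0.1), (0.9, 0.1), (0.9, 0.9), (0.1, 0.9), (0.5, 0.1)]
tours = rollout(coords, num_agents=2)
print(tours)
```

A smaller `step` makes it rarer for two agents to reach a gap of 0 at the same time step, at the price of more simulation steps per episode.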
V Decentralized Attention-Based Neural Network
We propose a decentralized attention-based neural network (DAN), through which agents can select the next city according to their own observations. It consists of a city encoder, an agent encoder, a city-agent encoder, and a decoder. Its structure is used to model three kinds of dependencies in mTSP, i.e., the agent-agent dependencies, the city-city dependencies, and the agent-city dependencies. To achieve good collaboration in mTSP, it is important for agents to learn all of these dependencies to make decisions that benefit the whole team. Each agent uses its local DAN network to select the next city based on its own observation. Compared to existing attention-based TSP solvers, which only learn dependencies among cities and find good individual tours, DAN further endows agents with the ability to predict each other’s future decisions to improve agent-city allocation, by adding the agent encoder and the city-agent encoder.
The overall model is shown in Fig. 2. In general, based on the observations of the deciding agent, we first use the city encoder and the agent encoder to model the dependencies among cities and among agents, respectively. Then, in the city-agent encoder, we update the city features by considering other agents’ potential decisions according to their features. Finally, in the decoder, based on the deciding agent’s current state and the updated city features, we allocate attention weights to each city, which we directly use as its policy. We detail this process in this section.
V-A Attention Layer
The Transformer attention layer [vaswani2017attention] is used as the fundamental building block in our model. The input of such an attention layer consists of the query source $h^q$ and the key-and-value source $h^{k,v}$, which are both vectors with the same dimension. The attention layer updates the query source using the weighted sum of the values, where the attention weight depends on the similarity between query and key. We compute the query $q_i$, key $k_i$ and value $v_i$ as:
$$q_i = W^Q h^q_i,\quad k_i = W^K h^{k,v}_i,\quad v_i = W^V h^{k,v}_i, \tag{5}$$
where $W^Q, W^K, W^V$ are all learnable matrices with size $d\times d$. Next, we compute the similarity $u_{ij}$ between the query $q_i$ and the key $k_j$ using a scaled dot product:
$$u_{ij} = \frac{q_i^\top k_j}{\sqrt{d}}. \tag{6}$$
Then we calculate the attention weights $a_{ij}$ using a softmax:
$$a_{ij} = \frac{e^{u_{ij}}}{\sum_{j'} e^{u_{ij'}}}. \tag{7}$$
Finally, we compute a weighted sum of these values as the output embedding $h'_i$ from this attention layer:
$$h'_i = \sum_{j} a_{ij} v_j. \tag{8}$$
The embedding content is then passed through the feed-forward sublayer, which contains two linear layers with dimension (in practice ) and a ReLU activation:
$$\mathrm{FF}(h'_i) = W^1\,\mathrm{ReLU}\left(W^0 h'_i + b^0\right) + b^1. \tag{9}$$
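Eqs. (5)-(9) can be sketched in plain Python as below. This is a simplified, single-head illustration: it leaves out the layer normalization and any other components of the full Transformer layer, and the function names, the square weight matrices and the toy inputs are assumptions made for the example.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attention(h_q, h_kv, W_Q, W_K, W_V):
    """Single-head attention; h_q and h_kv are lists of d-dimensional vectors."""
    d = len(W_Q)
    Q = [matvec(W_Q, h) for h in h_q]       # Eq. (5): queries from the query source
    K = [matvec(W_K, h) for h in h_kv]      # Eq. (5): keys from the key-and-value source
    V = [matvec(W_V, h) for h in h_kv]      # Eq. (5): values from the key-and-value source
    outputs = []
    for q in Q:
        # Eq. (6): scaled dot-product similarity between this query and every key.
        u = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        # Eq. (7): softmax (the max is subtracted for numerical stability only).
        u_max = max(u)
        e = [math.exp(x - u_max) for x in u]
        total = sum(e)
        a = [x / total for x in e]
        # Eq. (8): attention-weighted sum of the values.
        outputs.append([sum(a_j * v[c] for a_j, v in zip(a, V)) for c in range(d)])
    return outputs

def feed_forward(h, W0, b0, W1, b1):
    """Eq. (9): linear layer, ReLU, linear layer."""
    hidden = [max(0.0, z + b) for z, b in zip(matvec(W0, h), b0)]
    return [z + b for z, b in zip(matvec(W1, hidden), b1)]

# Toy check with identity weights: two identical value vectors average to themselves.
identity = [[1.0, 0.0], [0.0, 1.0]]
out = attention([[1.0, 0.0]], [[0.5, 0.5], [0.5, 0.5]], identity, identity, identity)
print(out)  # [[0.5, 0.5]]
```

Passing the same list as `h_q` and `h_kv` gives self-attention, while passing two different lists gives cross-attention.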
Note that layer normalization [ba_layer_2016] and residual connections [he_deep_2016] are used within these two sublayers as in [vaswani2017attention].
V-B City Encoder
The city encoder is used to extract features from the cities state $s^c$ and model their dependencies. The city encoder first embeds the relative Euclidean coordinates $x^c_i$ of each city $i\in\{2,\dots,n\}$ into a $d$-dimensional ( in practice) initial city embedding $h^c_i$ using a linear layer. Similarly, the depot’s Euclidean coordinates $x^c_1$ are embedded by another linear layer to $h^c_1$. The initial city embedding is then passed through an attention layer. Here the query source and the key-and-value source are both the initial city embedding $h^c$, as is commonly done in self-attention mechanisms. Self-attention achieved good performance in modeling the dependencies of cities in single TSP approaches [kool2018attention], and we propose to rely on the same fundamental idea to model the dependencies in mTSP.
We term the output of the city encoder, $e^c$, the city embedding. Due to the self-attention mechanism used, $e^c$ contains the dependencies between each city and all other cities.
V-C Agent Encoder
The agent encoder is used to extract features from the agents state $s^a$ and model their dependencies. A linear layer is used to separately embed each (3-dimensional) component of $s^a$ into the initial agent embedding $h^a$. This embedding is then passed through an attention layer, as both the query source and the key-and-value source (i.e., self-attention again).
We term the output of the agent encoder, $e^a$, the agent embedding. It contains the dependencies between each agent and all other agents.
V-D City-agent Encoder
The city-agent encoder is used to model the dependencies between cities and agents. The city-agent encoder applies an attention layer with cross-attention, where the query source is the city embedding $e^c$, and the key-and-value source is the agent embedding $e^a$.
We term the output of this encoder, $e^{ca}$, the city-agent embedding. It contains the relationship between each city $j$ and each agent $i$. That is, it implicitly predicts whether city $j$ is likely to be selected by another agent $i$, which is one of the keys to the improved performance of our model.
V-E Decoder
The decoder is used to decode the different embeddings into a policy for selecting the next city to visit. The decoder starts with encoding the deciding agent’s current state. Since we use positions relative to the agent position, the agent position coordinate is always $(0,0)$. To avoid meaningless inputs, we instead choose to express the current agent state implicitly by computing an aggregated embedding $h^s$, which is the mean of the city embedding $e^c$. This operation is similar to the graph embedding used in [kool2018attention].
The first attention layer then adds the agent embedding to the aggregated embedding. In doing so, it relates the state of the deciding agent to that of all other agents. Here the query source is $h^s$ and the key-and-value source is $e^a$. This layer outputs the current state embedding $e^s$.
After that, a second attention layer is used to compute the final candidate embedding $e^f$, where the query source is the current state embedding $e^s$, and the key-and-value source is the city-agent embedding $e^{ca}$. This layer serves as a glimpse, which is commonly used to improve attention mechanisms [bello2016neural].
There, when computing the similarity, we rely on the global mask $M$ to manually set the similarity $u_j=-\infty$ if the corresponding city $j$ has already been visited:
$$u_j = \begin{cases} -\infty & \text{if } M_j = 1,\\ \dfrac{q^\top k_j}{\sqrt{d}} & \text{otherwise,} \end{cases} \tag{10}$$
to ensure the attention weights of visited cities are $0$.
The final candidate embedding $e^f$ then passes through a third and final attention layer. The query source is the current state embedding $e^s$, and the key source is the final candidate embedding $e^f$. For this final layer only, following [vinyals2015pointer], we directly use the vector of attention weights as the final policy for the deciding agent.
The same masking operation, Eq. (10), is also applied in this layer to satisfy the mTSP constraint. Besides, following [bello2016neural], the similarity is clipped within $[-C, C]$ (in practice, ) to encourage exploration and improve the final policy by preventing premature convergence:
$$u_j = C\cdot\tanh\left(\frac{q^\top k_j}{\sqrt{d}}\right). \tag{11}$$
These similarities are normalized using a softmax operation, to finally yield the probability distribution $p$ for the next city to visit:
$$p_j = p_\theta(\pi^i_t = j \mid o^i_t) = \frac{e^{u_j}}{\sum_{j'=1}^{n} e^{u_{j'}}}. \tag{12}$$
The deciding agent can select the next city to visit either by greedily selecting the city with the highest probability or by sampling based on the probability distribution $p$.
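The last decoder step (masking, clipping, softmax, then greedy or sampled selection) can be sketched as follows. The function names are hypothetical, the raw similarities are passed in directly instead of being computed from embeddings, and the value `C=10.0` used in the example is an arbitrary illustrative choice rather than a setting taken from the text.

```python
import math
import random

def policy_from_similarities(raw_u, mask, C):
    """Clip raw similarities with C*tanh, mask visited cities, then apply a softmax."""
    if all(mask):
        raise ValueError("at least one city must be unvisited")
    # Visited cities get -inf so that their probability is exactly 0.
    u = [-math.inf if visited else C * math.tanh(x) for x, visited in zip(raw_u, mask)]
    u_max = max(u)
    e = [math.exp(x - u_max) for x in u]   # math.exp(-inf) evaluates to 0.0
    total = sum(e)
    return [x / total for x in e]

def select_greedy(p):
    """Greedy strategy: index of the most probable city."""
    return max(range(len(p)), key=p.__getitem__)

def select_sampling(p, rng):
    """Sampling strategy: draw a city index according to the policy."""
    return rng.choices(range(len(p)), weights=p)[0]

# Toy example with three cities; the second one has already been visited.
p = policy_from_similarities([0.0, 0.0, 5.0], [False, True, False], C=10.0)
print(p[1])                                       # 0.0
print(select_greedy(p))                           # 2
print(select_sampling(p, random.Random(0)) != 1)  # True: a masked city is never drawn
```

A larger `C` lets the clipped similarities spread further apart, which makes the resulting distribution more peaked; a smaller `C` keeps it flatter and therefore more exploratory under sampling.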
VI Training
In this section, we describe how DAN is trained, including the choice of hyperparameters and hardware used.
VI-A REINFORCE with Rollout Baseline
In order to train our model, we define the policy loss:
$$\mathcal{L}(\theta) = -\mathbb{E}_{p_\theta(\pi^i)}\left[R(\pi)\right], \tag{13}$$
where $p_\theta(\pi^i)=\prod_{t=1}^{n_i} p_\theta(\pi^i_t \mid o^i_t)$. The policy loss is the negative of the expected reward, i.e., the expectation of the max length among the tours of agents. The loss is optimized by gradient descent using the REINFORCE algorithm with greedy rollout baseline [kool2018attention]. That is, we rerun the exact same episode from the start a second time, and let all agents take decisions by greedily exploiting the best policy so far (i.e., the “baseline model” explained in Section VI-C below). The cumulative reward $R(\pi^b)$ of this baseline episode is then used to estimate the advantage function: $A(\pi)=R(\pi)-R(\pi^b)$ (with $R(\pi)$ the cumulative reward at each state of the RL episode). This helps reduce the gradient variance and avoids the burden of training the model to explicitly estimate the state value, as in traditional actor-critic algorithms. The final gradient estimator for the policy loss reads:
$$\nabla_\theta \mathcal{L} = -\mathbb{E}_{p_\theta(\pi^i)}\left[\left(R(\pi)-R(\pi^b)\right)\nabla_\theta \log p_\theta(\pi^i)\right]. \tag{14}$$
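For a single sampled episode, the estimator in Eq. (14) is usually obtained by differentiating a surrogate loss in which the advantage is treated as a constant. The sketch below only computes that scalar for one agent; the function name and the numbers are made up for illustration, and in an actual training loop an automatic differentiation framework would supply the gradient.

```python
import math

def reinforce_surrogate(log_probs, reward, baseline_reward):
    """Single-sample surrogate: -(R - R_b) * sum_t log p(a_t | o_t).

    Its gradient w.r.t. the policy weights matches one sample of Eq. (14),
    provided the advantage (R - R_b) is treated as a constant.
    """
    advantage = reward - baseline_reward
    return -advantage * sum(log_probs)

# Hypothetical episode: the sampled rollout has a shorter max tour (2.3)
# than the greedy baseline rollout (2.5), so its advantage is positive.
R = -2.3      # reward of the sampled episode, Eq. (4)
R_b = -2.5    # reward of the greedy baseline episode
log_probs = [math.log(0.5), math.log(0.25)]  # log-probabilities of the chosen cities

loss = reinforce_surrogate(log_probs, R, R_b)
print(loss > 0)  # True: descending this loss raises the probability of these choices
```

When the sampled episode is worse than the baseline, the advantage is negative and the same update instead lowers the probability of the sampled choices.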
VI-B Parameter Sharing
As agents in mTSP are homogeneous, we train our model using parameter sharing, a general method for MARL [gupta_cooperative_nodate]. That is, we allow agents to share the parameters of a common neural network, thus making the training more efficient by relying on the sum of experience from all agents. Meanwhile, parameter sharing provides our model with natural scalability to the number of agents so it can handle arbitrary-scale mTSP instances (as shown in our results).
VI-C Distributed Training
Our model is trained on a workstation equipped with an i9-10980XE CPU and four NVIDIA GeForce RTX 3090 GPUs. We train our model utilizing Ray, a distributed framework for machine learning [moritz2018ray], to accelerate training by parallelizing the code. With Ray, we run 8 mTSP instances in parallel and pass gradients to be applied to the global network under the A2C structure, a synchronous variant of A3C [noauthor_openai_2017].
At each training episode, the positions of cities are generated uniformly at random in the unit square and the decision time gap decreases by 0.1 every time step (defining the agent velocity). The number of agents is randomized within and the number of cities is randomized within during early training. After initial convergence of the policy, the number of cities is randomized within for further refinement.
We formulate one training batch after 8 mTSP instances are solved, and perform one gradient update for each agent. We train the model with the Adam optimizer [kingma_adam_2017] and use an initial learning rate of and decay every 1024 steps by a factor of 0.96. Every 2048 steps we compare the current training model with the baseline model, and replace the baseline model if the improvement is significant according to a paired t-test on an mTSP test set with 2048 instances. Our full training and testing code is available at https://bit.ly/DAN_mTSP.
VII Experiments
We test our decentralized attention-based neural network (DAN) on numerous sets of 500 mTSP instances each, generated uniformly at random in the unit square $[0,1]^2$.
We test two different variants of our model, since it is possible to construct the solution with two different strategies:

Greedy: each agent always selects the action with highest activation in its policy.

Sampling: each agent selects the city stochastically according to its policy.
For the sampling strategy, we run our model multiple times on the same instance and report the solution with the highest quality. While [kool2018attention] sample 1280 solutions for single TSP, we only sample 64 solutions (denoted as s.64) for each mTSP instance to balance the trade-off between computing time and solution quality. In our tests, the decision time gap decreases by 0.01 for each time step (i.e., agents move slower than during training), which slightly increases the computing time for each solution, but further improves the performance of our model by allowing more fine-grained asynchronous action selection.
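The sampling strategy amounts to a best-of-k selection over independent stochastic rollouts. A minimal sketch is given below; `best_of_k` and `fake_rollout` are hypothetical names, and `fake_rollout` merely returns a random cost in place of an actual run of the model.

```python
import random

def best_of_k(sample_solution, k=64):
    """Draw k candidate solutions and keep the one with the smallest max tour length.

    `sample_solution` is any callable returning a (max_tour_length, tours) pair.
    """
    return min((sample_solution() for _ in range(k)), key=lambda sol: sol[0])

# Hypothetical stand-in for one stochastic rollout of the model.
rng = random.Random(0)

def fake_rollout():
    max_len = 2.0 + rng.random()          # pretend MinMax cost in [2, 3)
    return max_len, "tours would go here"

best_cost, best_tours = best_of_k(fake_rollout, k=64)
print(2.0 <= best_cost < 3.0)  # True
```

Since the k rollouts are independent, computing time grows roughly linearly with k, which is the trade-off behind sampling 64 rather than 1280 solutions.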
VII-A Results
We report the average MinMax (lower is better) for small-scale mTSP instances (from 50 to 200 cities) in Table I, which also shows the average computing time per instance for each of the considered solvers. The performance of our model is compared with both conventional methods and neural-based methods.
For conventional methods, we test OR Tools, an evolutionary algorithm (EA), and self-organizing maps (SOM) [lupoaie_somguided_2019] on the same test set. Similar to our model, OR Tools can obtain solutions using two different strategies: it can initially get a solution using metaheuristic algorithms (denoted as OR); this solution can then be further improved by local search [gendreau_guided_2010] (denoted as OR+s.). This requires additional computing time, similar to our sampling strategy. We allow OR Tools to perform local search for a similar amount of time as our sampling strategy, for fair comparison.
For neural-based methods, we report Park et al.’s results [park_schedulenet_2021] and Hu et al.’s results [hu2020reinforcement] from their papers, since they did not make their code publicly available. Since Park et al.’s paper does not report the computing time of their approach, we leave the corresponding cells blank in Table I. Similarly, since Hu et al. did not provide any results for cases involving more than 100 cities or more than 10 agents, these cells are also left blank. Note that the test sets used by Park et al. and Hu et al. are likely different from ours, since they have not been made public. However, the values reported here from their papers are also averaged over 500 instances under the same exact conditions as the ones used in this work (random uniform placement of the cities in the unit square), which we believe allows for comparable results.
We then report the average MinMax for large-scale (from 400 to 1000 cities) mTSP in Table II, where the number of agents is fixed to (due to the limitation of Hu et al.’s model). When testing the sampling strategy, we set in the third decoder layer for efficient exploration (since the tour is much longer). Except for DAN and Hu et al.’s model, other methods cannot handle such large-scale mTSP, but we still report the results of OR Tools from Hu et al.’s paper, as well as SOM results as the best-performing metaheuristic algorithm, for completeness.
ViiB Discussion
We first notice that DAN significantly outperforms OR Tools in larger-scale mTSP instances with relatively more agents (, ), but is outperformed by OR Tools in smaller-scale mTSP instances (as can be expected). In smaller-scale mTSP, OR Tools can explore the search space sufficiently and produce near-optimal solutions. In this situation, our decentralized model finds it difficult to achieve the same level of performance. Note that this holds for all decentralized methods considered.
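Table I: Average MinMax (Max.) and average computing time per instance in seconds (T(s)) on small-scale mTSP instances with n cities and m agents. A dash marks a cell left blank.

| Method | n=50 m=5 Max. | T(s) | n=50 m=7 Max. | T(s) | n=50 m=10 Max. | T(s) |
|---|---|---|---|---|---|---|
| EA | 2.35 | 7.82 | 2.08 | 9.58 | 1.96 | 11.50 |
| SOM | 2.57 | 0.76 | 2.30 | 0.78 | 2.16 | 0.76 |
| OR | 2.11 | 1.07 | 2.05 | 1.06 | 2.05 | 1.06 |
| OR+s. | 2.04 | 12.00 | 1.96 | 12.00 | 1.96 | 12.00 |
| Park et al. | 2.37 | - | 2.18 | - | 2.10 | - |
| Hu et al. | 2.12 | 0.01 | - | - | 1.95 | 0.02 |
| DAN(g.) | 2.29 | 0.25 | 2.11 | 0.26 | 2.03 | 0.30 |
| DAN(s.64) | 2.12 | 7.87 | 1.99 | 9.38 | 1.95 | 11.26 |

| Method | n=100 m=5 Max. | T(s) | n=100 m=10 Max. | T(s) | n=100 m=15 Max. | T(s) |
|---|---|---|---|---|---|---|
| EA | 3.55 | 12.80 | 2.75 | 17.52 | 2.51 | 21.63 |
| SOM | 3.10 | 1.58 | 2.41 | 1.58 | 2.22 | 1.57 |
| OR | 2.41 | 10.79 | 2.29 | 11.09 | 2.27 | 11.32 |
| OR+s. | 2.36 | 18.00 | 2.29 | 18.00 | 2.25 | 18.00 |
| Park et al. | 2.88 | - | 2.23 | - | 2.16 | - |
| Hu et al. | 2.48 | 0.04 | 2.09 | 0.04 | - | - |
| DAN(g.) | 2.72 | 0.43 | 2.17 | 0.48 | 2.09 | 0.58 |
| DAN(s.64) | 2.55 | 12.18 | 2.05 | 14.81 | 2.00 | 19.13 |

| Method | n=200 m=10 Max. | T(s) | n=200 m=15 Max. | T(s) | n=200 m=20 Max. | T(s) |
|---|---|---|---|---|---|---|
| EA | 4.07 | 29.91 | 3.62 | 34.33 | 3.37 | 39.34 |
| SOM | 2.81 | 3.01 | 2.50 | 3.04 | 2.34 | 3.04 |
| OR | 2.57 | 63.70 | 2.59 | 60.29 | 2.59 | 61.74 |
| Park et al. | 2.50 | - | 2.38 | - | 2.44 | - |
| DAN(g.) | 2.40 | 0.93 | 2.20 | 0.98 | 2.15 | 1.07 |
| DAN(s.64) | 2.29 | 23.49 | 2.13 | 26.27 | 2.07 | 29.83 |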
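
Table II: Average MinMax (Max.) and average computing time per instance in seconds (T(s)) on large-scale mTSP instances with n cities.

| Method | n=400 Max. | T(s) | n=500 Max. | T(s) | n=600 Max. | T(s) | n=700 Max. | T(s) | n=800 Max. | T(s) | n=900 Max. | T(s) | n=1000 Max. | T(s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OR [hu2020reinforcement] | 4.59 | 1705 | 7.75 | 1800 | 9.64 | 1800 | 11.24 | 1800 | 12.34 | 1800 | 13.71 | 1800 | 14.84 | 1800 |
| SOM [lupoaie_somguided_2019] | 3.53 | 6.10 | 3.86 | 7.63 | 4.24 | 9.39 | 4.54 | 10.86 | 4.93 | 14.28 | 5.21 | 16.65 | 5.53 | 17.89 |
| Hu et al. [hu2020reinforcement] | 2.99 | 0.32 | 3.32 | 0.56 | 3.65 | 0.81 | 3.95 | 1.22 | 4.20 | 1.69 | 4.59 | 2.21 | 4.81 | 2.87 |
| DAN(g.) | 2.98 | 1.76 | 3.29 | 2.15 | 3.60 | 2.58 | 3.91 | 3.03 | 4.23 | 3.36 | 4.55 | 3.81 | 4.84 | 4.21 |
| DAN(s.64) | 2.83 | 40.04 | 3.14 | 48.91 | 3.46 | 57.81 | 3.75 | 67.69 | 4.10 | 77.08 | 4.42 | 87.03 | 4.75 | 97.26 |
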
Upon closer inspection, we note that our model seems very good at distributing cities to agents, but performs worse at finding the optimal visiting sequence. This is likely due to its decentralized, iterative nature (see Fig. 1, where one agent’s path occasionally crosses itself). However, our sampling strategy still achieves performance close to OR Tools ( gap) in smaller-scale mTSP instances, which is very promising. Meanwhile, in larger-scale mTSP, OR Tools performs poorly because the search space grows exponentially with the number of cities and agents. Thanks to its decentralized nature, our model surpasses OR Tools in larger-scale mTSP instances, where OR Tools can only yield suboptimal solutions. In mTSP100(), our sampling strategy is 10% better than OR Tools with local search. The advantage of our model becomes more significant as the scale of the instance grows, even when using our greedy strategy. For instances involving more than 400 cities, OR Tools becomes impractical even when allowed up to 1800s per instance, while our greedy approach still outputs good-quality solutions in a matter of seconds. In general, the computing time of our model increases linearly with the scale of the mTSP instance, as shown in Fig. 3.
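The linear trend can also be checked against the tabulated DAN(g.) timings for the 400- to 1000-city instances. The sketch below fits an ordinary least-squares line to those seven points in plain Python. It only uses the reported table values (it does not reproduce Fig. 3), and the function name `fit_line` is a hypothetical helper introduced here for illustration.

```python
# Number of cities and DAN(g.) computing time in seconds, copied from the table above.
n_cities = [400, 500, 600, 700, 800, 900, 1000]
seconds = [1.76, 2.15, 2.58, 3.03, 3.36, 3.81, 4.21]


def fit_line(xs, ys):
    """Ordinary least-squares fit y ~ slope * x + intercept; also returns R^2."""
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Coefficient of determination: 1 - residual sum of squares / total sum of squares.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


slope, intercept, r_squared = fit_line(n_cities, seconds)
print(f"slope = {1000 * slope:.2f} ms per city, "
      f"intercept = {intercept:.2f} s, R^2 = {r_squared:.4f}")
```

On these seven points the fitted slope is roughly 4 ms per additional city with R^2 above 0.99, which is consistent with the linear growth described above. The fit says nothing about how the time varies with the number of agents.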
Second, we notice that DAN’s structure helps achieve better agent collaboration than the other two decentralized dRL methods, thus yielding better overall results. Compared to Park et al.’s model, our model achieves better performance across all experiments, even when using our greedy strategy. Their model uses a graph attention network, which is similar to self-attention, to extract agents’ and cities’ features separately. However, although Park et al. designed a complex procedure to aggregate these features, their model seems to lack an efficient mechanism like our city-agent encoder to explicitly learn the dependencies between cities and agents. From our results, it appears that such a mechanism plays a key role in implicitly predicting the future decisions of other agents. We believe this city-agent encoder leads to better collaboration, resulting in increased performance.
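To make the idea of explicitly relating agents and cities more concrete, the sketch below shows a generic single-head, scaled dot-product cross-attention step in plain Python, in which each agent feature attends over all city features. It is not DAN’s city-agent encoder: the identity projections, the choice of agents as queries and cities as keys/values, the 2-D features, and every identifier are assumptions made purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(agent_feats, city_feats):
    """Each agent feature (query) attends over all city features (keys and values).

    Projections are the identity to keep the sketch short; a trained encoder
    would use learned query/key/value matrices and typically several heads.
    """
    dim = len(city_feats[0])
    outputs, all_weights = [], []
    for query in agent_feats:
        # Scaled dot-product score between this agent and every city.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in city_feats]
        weights = softmax(scores)
        # Attention-weighted mixture of city features for this agent.
        mixed = [sum(w * city[d] for w, city in zip(weights, city_feats))
                 for d in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical 2-D features: two agents, three cities.
agents = [[1.0, 0.0], [0.0, 1.0]]
cities = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
agent_context, attention = cross_attention(agents, cities)
```

Here each row of `attention` sums to one, and each agent places the largest weight on the city whose feature vector is most aligned with its own, so the resulting `agent_context` depends jointly on agent and city information.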
Compared to Hu et al.’s model, we remind the reader that DAN can handle mTSP instances with arbitrary team sizes, whereas Hu et al.’s model needs to be retrained if the number of agents changes. Despite this key difference, our model still produces (at least marginally) higher-quality solutions, especially in large-scale instances (400 to 1000 cities). Since Hu et al. only trained their model to allocate cities to the agents, and then solve the resulting single-TSP instance of each agent using OR Tools, their method guarantees near-optimal routes for the allocated cities. Therefore, we conclude from our overall better performance that DAN is likely better at allocating cities. We believe this might be because their model runs an auction-based mechanism without explicitly considering agent collaboration/interactions. Thanks to our city-agent encoder, DAN agents can predict others’ decisions to reach improved collaboration, which results in better overall performance.
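The reasoning above rests on the fact that, once each agent’s route is (near-)optimal for its allocated cities, the makespan is determined by the allocation. The toy Python sketch below illustrates this allocate-then-route evaluation. It is only a sketch under stated assumptions: a nearest-neighbor heuristic stands in for the per-agent TSP solver (it is not OR Tools and gives no optimality guarantee), and the depot, cities, allocations, and function names are all hypothetical.

```python
import math


def tour_length(depot, cities):
    """Length of the closed tour depot -> cities (in the given order) -> depot."""
    stops = [depot] + cities + [depot]
    return sum(math.dist(a, b) for a, b in zip(stops, stops[1:]))


def nearest_neighbor_route(depot, cities):
    """Order cities greedily by proximity (simple stand-in for a TSP solver)."""
    remaining, route, here = list(cities), [], depot
    while remaining:
        nxt = min(remaining, key=lambda c: math.dist(here, c))
        remaining.remove(nxt)
        route.append(nxt)
        here = nxt
    return route


def makespan(depot, allocation):
    """MinMax objective: the longest tour among all agents."""
    return max(tour_length(depot, nearest_neighbor_route(depot, group))
               for group in allocation)


depot = (0.0, 0.0)
# Two hypothetical allocations of the same four cities to two agents.
balanced = [[(1.0, 0.0), (1.0, 1.0)], [(-1.0, 0.0), (-1.0, -1.0)]]
lopsided = [[(1.0, 0.0), (1.0, 1.0), (-1.0, 0.0), (-1.0, -1.0)], []]
print(makespan(depot, balanced), makespan(depot, lopsided))
```

With the same routing procedure applied to both allocations, the balanced split yields the smaller makespan, so any difference in the MinMax objective comes from how the cities were distributed among the agents.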
VIII Conclusion
This work introduced DAN, a decentralized attention-based neural network to solve the MinMax multiple traveling salesman problem. We consider mTSP as a cooperative task and formulate it as a sequential decision making problem, where agents iteratively and asynchronously build a collaborative mTSP solution. In doing so, our attention-based neural model allows agents to achieve implicit coordination to solve the mTSP instance together in a decentralized manner. Through our results, we showed that our model exhibits excellent performance for small- to large-scale mTSP instances, which involve 50 to 1000 cities and 5 to 20 agents. Compared to a state-of-the-art conventional baseline, our model achieves better performance both in terms of solution quality and computing time in large-scale mTSP instances, while achieving comparable performance in small-scale mTSP instances. When compared to two state-of-the-art decentralized dRL methods, our model exhibits better collaboration between agents, which leads to improved solutions. We believe that the developments made in the design of DAN can extend to more general robotic problems where agent allocation/distribution is key, such as multi-robot patrolling, distributed search/coverage, or collaborative manufacturing.
In addition to extending our approach to these robotic tasks, future work will focus on adapting DAN to dynamic mTSP, where new cities might appear and/or existing cities might move/disappear. Such problems are common in certain multi-robot applications, e.g., frontier-based exploration, yet few methods have been developed to date. There, we hypothesize that the reactive nature of our approach might have an edge over centralized planners.