Attention based model for learning to solve the Travelling Salesman Problem
We propose a framework for solving combinatorial optimization problems of which the output can be represented as a sequence of input elements. As an alternative to the Pointer Network, we parameterize a policy by a model based entirely on (graph) attention layers, and train it efficiently using REINFORCE with a simple and robust baseline based on a deterministic (greedy) rollout of the best policy found during training. We significantly improve over state-of-the-art results for learning algorithms for the 2D Euclidean TSP, reducing the optimality gap for a single tour construction by more than 75 0.33
Imagine yourself travelling to a scientific conference. The field is popular, and surely you do not want to miss out on anything. You have selected several posters you want to visit, and naturally you must return to the place where you are now: the coffee corner. In which order should you visit the posters, to minimize your time walking around? This is the Travelling Scientist Problem (TSP).
You realize that your problem is equivalent to the Travelling Salesman Problem (conveniently also TSP). This seems discouraging as you know the problem is (NP-)hard (Garey & Johnson, 1979). Fortunately, complexity theory analyzes the worst case, and your Bayesian view considers this unlikely. In particular, you have a strong prior: the posters will probably be laid out regularly. You want a special algorithm that solves not any, but this type of problem instance. You have some months left to prepare. As a machine learner, you wonder whether your algorithm can be learned?
Machine learning algorithms have replaced humans as the engineers of algorithms to solve various tasks. A decade ago, computer vision algorithms used handcrafted features but today they are learned end-to-end by Deep Neural Networks (DNNs). DNNs have outperformed classic approaches in speech recognition, machine translation, image captioning and other problems, by learning from data (LeCun et al., 2015). While DNNs are mainly used to make predictions, Reinforcement Learning (RL) has enabled algorithms to learn to make decisions, either by interacting with an environment, e.g. to learn to play Atari games (Mnih et al., 2015), or by inducing knowledge through look-ahead search: this was used to master the game of Go (Silver et al., 2017).
The world is not a game, and we desire to train models that make decisions to solve real problems. These models must learn to select good solutions for a problem from a combinatorially large set of potential solutions. Classically, approaches to this problem of combinatorial optimization can be divided into exact methods, that guarantee finding optimal solutions, and heuristics, that trade off optimality for computational cost, although exact methods can use heuristics internally and vice versa. Heuristics are typically expressed in the form of rules, which can be interpreted as policies to make decisions. We believe that these policies can be parameterized using DNNs, and be trained to obtain new and stronger algorithms for many different combinatorial optimization problems, similar to the way DNNs have boosted performance in the applications mentioned before. In this paper, we focus on routing problems: an important class of practical combinatorial optimization problems.
The promising idea to learn heuristics has been tested on TSP (Bello et al., 2016). In order to push this idea, we need better models and better ways of training. Therefore, we propose to use a powerful model based on attention and we propose to train this model using REINFORCE with a simple but effective greedy rollout baseline. The goal of our method is not to outperform a non-learned, specialized TSP algorithm such as Concorde (Applegate et al., 2006). Rather, we show the flexibility of our approach on multiple (routing) problems of reasonable size, with a single set of hyperparameters. This is important progress towards the situation where we can learn strong heuristics to solve a wide range of different practical problems for which no good heuristics exist.
The application of Neural Networks (NNs) for optimizing decisions in combinatorial optimization problems dates back to Hopfield & Tank (1985), who applied a Hopfield network for solving small TSP instances. NNs have been applied to many related problems (Smith, 1999), although in most cases in an online manner, starting ‘from scratch’ and ‘learning’ a solution for every instance. More recently, (D)NNs have also been used offline to learn about an entire class of problem instances.
Vinyals et al. (2015) introduce the Pointer Network (PN) as a model that uses attention to output a permutation of the input, and train this model offline to solve the (Euclidean) TSP, supervised by example solutions. At test time, their beam search procedure filters invalid tours. Bello et al. (2016) introduce an Actor-Critic algorithm to train the PN without supervised solutions. They consider each instance as a training sample and use the cost (tour length) of a sampled solution for an unbiased Monte Carlo estimate of the policy gradient. They introduce extra model depth in the decoder by an additional glimpse (Vinyals et al., 2016) at the embeddings, masking nodes already visited. For small instances ($n = 20$), they get close to the results by Vinyals et al. (2015), they improve for $n = 50$ and additionally include results for $n = 100$. Nazari et al. (2018) replace the LSTM encoder of the PN by element-wise projections, such that the updated embeddings after state changes can be effectively computed. They apply this model on the Vehicle Routing Problem (VRP) with split deliveries and a stochastic variant.
Dai et al. (2017) do not use a separate encoder and decoder, but a single model based on graph embeddings. They train the model to output the order in which nodes are inserted into a partial tour, using a helper function to insert at the best possible location. Their 1-step DQN (Mnih et al., 2015) training method trains the algorithm per step, and the incremental rewards provided to the agent at every step effectively encourage greedy behavior. As mentioned in their appendix, they use the negative of the reward, which combined with discounting encourages the agent to insert the farthest nodes first, which is known to be an effective heuristic (Rosenkrantz et al., 2009).
Nowak et al. (2017) train a Graph Neural Network in a supervised manner to directly output a tour as an adjacency matrix, which is converted into a feasible solution by a beam search. The model is non-autoregressive, so it cannot condition its output on the partial tour, and the authors report an optimality gap of 2.7% for $n = 20$, worse than autoregressive approaches mentioned in this section. Kaempfer & Wolf (2018) train a model based on the Transformer architecture (Vaswani et al., 2017) that outputs a fractional solution to the multiple TSP (mTSP). The result can be seen as a solution to the linear relaxation of the problem and they use a beam search to obtain a feasible integer solution.
Independently of our work, Deudon et al. (2018) presented a model for TSP using attention in the OR community. They show performance can improve using 2OPT local search, but do not show benefit of their model in direct comparison to the PN. We use a different decoder and improved training algorithm, both contributing to significantly improved results, without 2OPT and additionally show application to different problems. For a full discussion of the differences, we refer to Appendix B.4.
We define the Attention Model in terms of the TSP. For other problems, the model is the same but the input, mask and decoder context need to be defined accordingly, which is discussed in the Appendix. We define a problem instance $s$ as a graph with $n$ nodes, where node $i \in \{1, \ldots, n\}$ is represented by features $\mathbf{x}_i$. For TSP, $\mathbf{x}_i$ is the coordinate of node $i$ and the graph is fully connected (with self-connections) but in general, the model can be considered a Graph Attention Network (Velickovic et al., 2018) and take graph structure into account by a masking procedure (see Appendix A). We define a solution (tour) $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_n)$ as a permutation of the nodes, so $\pi_t \in \{1, \ldots, n\}$ and $\pi_t \neq \pi_{t'} \ \forall t \neq t'$. Our attention based encoder-decoder model defines a stochastic policy $p(\boldsymbol{\pi} | s)$ for selecting a solution $\boldsymbol{\pi}$ given a problem instance $s$. It is factorized and parameterized by $\theta$ as
$$p_\theta(\boldsymbol{\pi} | s) = \prod_{t=1}^{n} p_\theta(\pi_t | s, \boldsymbol{\pi}_{1:t-1}). \tag{1}$$
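As a small illustration of the definitions above, the sketch below computes the TSP cost of a tour and the log-probability of a tour under the chain-rule factorization of equation 1. The function names and the per-step probabilities are made up for the example; in the actual model the per-step probabilities would come from the decoder.

```python
import math

def tour_length(coords, tour):
    """Length of the closed tour that visits coords in the order given by tour."""
    total = 0.0
    for t in range(len(tour)):
        x1, y1 = coords[tour[t]]
        x2, y2 = coords[tour[(t + 1) % len(tour)]]  # wrap around to the first node
        total += math.hypot(x2 - x1, y2 - y1)
    return total

def tour_log_prob(step_probs):
    """log p(pi | s) as the sum of the per-step log-probabilities (chain rule)."""
    return sum(math.log(p) for p in step_probs)

# Unit square, visited along its boundary.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
length = tour_length(square, [0, 1, 2, 3])

# Hypothetical probabilities assigned to the four successive choices.
log_p = tour_log_prob([0.25, 0.5, 0.5, 1.0])
```

Note that the last step always has probability 1 once masking is applied, since only one unvisited node remains.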
The encoder produces embeddings of all input nodes. The decoder produces the sequence of input nodes, one node at a time. It takes as input the encoder embeddings and a problem specific mask and context. For TSP, when a partial tour has been constructed, it cannot be changed and the ‘remaining’ problem is to find a path from the last node, through all unvisited nodes, to the first node. The order and coordinates of other nodes already visited are irrelevant. To know the first and last node, the decoder context consists (next to the graph embedding) of embeddings of the first and last node. Similar to Bello et al. (2016), the decoder observes a mask to know which nodes have been visited.
The encoder that we use (Figure 1) is similar to the encoder used in the Transformer architecture by Vaswani et al. (2017), but we do not use positional encoding such that the resulting node embeddings are invariant to the input order. From the $d_x$-dimensional input features $\mathbf{x}_i$ (for TSP $d_x = 2$), the encoder computes initial $d_h$-dimensional node embeddings $\mathbf{h}_i^{(0)}$ (we use $d_h = 128$) through a learned linear projection with parameters $W^x$ and $\mathbf{b}^x$: $\mathbf{h}_i^{(0)} = W^x \mathbf{x}_i + \mathbf{b}^x$. The embeddings are updated using $N$ attention layers, each consisting of two sublayers. We denote with $\mathbf{h}_i^{(\ell)}$ the node embeddings produced by layer $\ell \in \{1, \ldots, N\}$. The encoder computes an aggregated embedding $\bar{\mathbf{h}}^{(N)}$ of the input graph as the mean of the final node embeddings $\mathbf{h}_i^{(N)}$: $\bar{\mathbf{h}}^{(N)} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{h}_i^{(N)}$. Both the node embeddings $\mathbf{h}_i^{(N)}$ and the graph embedding $\bar{\mathbf{h}}^{(N)}$ are used as input to the decoder.
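Following the Transformer architecture (Vaswani et al., 2017), each attention layer consists of two sublayers: a multi-head attention (MHA) layer that executes message passing between the nodes and a node-wise fully connected feed-forward (FF) layer. Each sublayer adds a skip-connection (He et al., 2016) and batch normalization (BN) (Ioffe & Szegedy, 2015) (which we found to work better than layer normalization (Ba et al., 2016)):
$$\hat{\mathbf{h}}_i = \mathrm{BN}^{\ell}\left(\mathbf{h}_i^{(\ell-1)} + \mathrm{MHA}_i^{\ell}\left(\mathbf{h}_1^{(\ell-1)}, \ldots, \mathbf{h}_n^{(\ell-1)}\right)\right) \tag{2}$$
$$\mathbf{h}_i^{(\ell)} = \mathrm{BN}^{\ell}\left(\hat{\mathbf{h}}_i + \mathrm{FF}^{\ell}(\hat{\mathbf{h}}_i)\right) \tag{3}$$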
The layer index $\ell$ indicates that the layers do not share parameters. The MHA sublayer uses $M = 8$ heads with dimensionality $\frac{d_h}{M} = 16$, and the FF sublayer has one hidden (sub)sublayer with dimension 512 and ReLu activation. See Appendix A for details.
Decoding happens sequentially, and at timestep $t \in \{1, \ldots, n\}$, the decoder outputs the node $\pi_t$ based on the embeddings from the encoder and the outputs $\pi_{t'}$ generated at time $t' < t$. During decoding, we augment the graph with a special context node $(c)$ to represent the decoding context. The decoder computes an attention (sub)layer on top of the encoder, but with messages only to the context node for efficiency (footnote 1: $n \times n$ attention between all nodes is expensive to compute in every step of the decoding process). The final probabilities are computed using a single-head attention mechanism. See Figure 2 for an illustration of the decoding process.
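The context of the decoder at time $t$ comes from the encoder and the output up to time $t$. As mentioned, for the TSP it consists of the embedding of the graph, the previous (last) node $\pi_{t-1}$ and the first node $\pi_1$. For $t = 1$ we use learned $d_h$-dimensional parameters $\mathbf{v}^l$ and $\mathbf{v}^f$ as input placeholders:
$$\mathbf{h}_{(c)}^{(N)} = \begin{cases} \left[\bar{\mathbf{h}}^{(N)}, \mathbf{h}_{\pi_{t-1}}^{(N)}, \mathbf{h}_{\pi_1}^{(N)}\right] & t > 1 \\ \left[\bar{\mathbf{h}}^{(N)}, \mathbf{v}^l, \mathbf{v}^f\right] & t = 1 \end{cases} \tag{4}$$
Here $[\cdot, \cdot, \cdot]$ is the horizontal concatenation operator and we write the $(3 \cdot d_h)$-dimensional result vector as $\mathbf{h}_{(c)}^{(N)}$ to indicate we interpret it as the embedding of the special context node $(c)$ and use the superscript $(N)$ to align with the node embeddings $\mathbf{h}_i^{(N)}$. We could project the embedding back to $d_h$ dimensions, but we absorb this transformation in the parameter $W^Q$ in equation 5.
Now we compute a new context node embedding $\mathbf{h}_{(c)}^{(N+1)}$ using the ($M$-head) attention mechanism described in Appendix A. The keys and values come from the node embeddings $\mathbf{h}_i^{(N)}$, but we only compute a single query $\mathbf{q}_{(c)}$ (per head) from the context node (we omit the $(N)$ for readability):
$$\mathbf{q}_{(c)} = W^Q \mathbf{h}_{(c)}, \quad \mathbf{k}_i = W^K \mathbf{h}_i, \quad \mathbf{v}_i = W^V \mathbf{h}_i \tag{5}$$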
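We compute the compatibility of the query with all nodes, and mask (set $u_{(c)j} = -\infty$) nodes which cannot be visited at time $t$. For TSP, this simply means we mask the nodes already visited:
$$u_{(c)j} = \begin{cases} \dfrac{\mathbf{q}_{(c)}^T \mathbf{k}_j}{\sqrt{d_k}} & \text{if } j \neq \pi_{t'} \ \forall t' < t \\ -\infty & \text{otherwise} \end{cases} \tag{6}$$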
Here $d_k = \frac{d_h}{M}$ is the query/key dimensionality (see Appendix A). Again, we compute $u_{(c)j}$ and $\mathbf{v}_i$ for $M = 8$ heads and compute the final multi-head attention value for the context node using equations 12–14 from Appendix A, but with $(c)$ instead of $i$. This mechanism is similar to our encoder, but does not use skip-connections, batch normalization or the feed-forward sublayer for maximal efficiency. The result $\mathbf{h}_{(c)}^{(N+1)}$ is similar to the glimpse described by Bello et al. (2016).
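The decoder context and the masked compatibilities can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: embeddings are short lists, the learned projections are left out (the query and keys are passed in directly), and all names are hypothetical rather than taken from the authors' code.

```python
import math

def decoder_context(graph_emb, node_embs, tour, v_last, v_first):
    """Concatenate [graph embedding, last node, first node] as in equation 4."""
    if not tour:  # first step: no nodes chosen yet, use the learned placeholders
        return graph_emb + v_last + v_first
    return graph_emb + node_embs[tour[-1]] + node_embs[tour[0]]

def masked_compatibilities(query, keys, visited):
    """Scaled dot-product of one query with every key; visited nodes get -inf."""
    d_k = len(query)
    scores = []
    for j, key in enumerate(keys):
        if j in visited:
            scores.append(-math.inf)
        else:
            dot = sum(q * k for q, k in zip(query, key))
            scores.append(dot / math.sqrt(d_k))
    return scores

node_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
graph_emb = [sum(col) / len(node_embs) for col in zip(*node_embs)]  # mean embedding

# Partial tour that started at node 2 and is currently at node 0.
context = decoder_context(graph_emb, node_embs, [2, 0], [0.0, 0.0], [0.0, 0.0])
first_step = decoder_context(graph_emb, node_embs, [], [0.5, 0.5], [-0.5, -0.5])
scores = masked_compatibilities([2.0, 2.0], node_embs, visited={0, 2})
```

The context has three times the embedding dimension, and only node 1 keeps a finite score because nodes 0 and 2 are masked.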
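To compute output probabilities $p_\theta(\pi_t | s, \boldsymbol{\pi}_{1:t-1})$ in equation 1, we add one final decoder layer with a single attention head ($M = 1$ so $d_k = d_h$). For this layer, we only compute the compatibilities $u_{(c)j}$ using equation 6, but following Bello et al. (2016) we clip the result (before masking!) within $[-C, C]$ ($C = 10$) using $\tanh$:
$$u_{(c)j} = \begin{cases} C \cdot \tanh\left(\dfrac{\mathbf{q}_{(c)}^T \mathbf{k}_j}{\sqrt{d_k}}\right) & \text{if } j \neq \pi_{t'} \ \forall t' < t \\ -\infty & \text{otherwise} \end{cases} \tag{7}$$
We interpret these compatibilities as unnormalized log-probabilities (logits) and compute the final output probability vector $\mathbf{p}$ using a softmax (similar to equation 12 in Appendix A):
$$p_i = p_\theta(\pi_t = i | s, \boldsymbol{\pi}_{1:t-1}) = \frac{e^{u_{(c)i}}}{\sum_j e^{u_{(c)j}}} \tag{8}$$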
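A minimal sketch of this final step, assuming the scaled dot-products have already been computed: scores are clipped with $C \cdot \tanh$, visited nodes are masked, and a softmax turns the result into probabilities. The function name and the example scores are illustrative, and the sketch assumes at least one node is still unvisited.

```python
import math

def output_probabilities(raw_scores, visited, clip=10.0):
    """Clip scores with C * tanh, mask visited nodes, then apply a softmax."""
    logits = [
        -math.inf if j in visited else clip * math.tanh(u)
        for j, u in enumerate(raw_scores)
    ]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(u - top) for u in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Node 3 has the highest raw score but has already been visited.
probs = output_probabilities([0.3, -1.2, 0.3, 5.0], visited={3})
```

Clipping happens before masking, so the masked entries stay at $-\infty$ and receive exactly zero probability.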
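Section 3 defined our model that given an instance $s$ defines a probability distribution $p_\theta(\boldsymbol{\pi} | s)$, from which we can sample to obtain a solution (tour) $\boldsymbol{\pi} | s$. In order to train our model, we define the loss $\mathcal{L}(\theta | s) = \mathbb{E}_{p_\theta(\boldsymbol{\pi} | s)}\left[L(\boldsymbol{\pi})\right]$: the expectation of the cost $L(\boldsymbol{\pi})$ (tour length for TSP). We optimize $\mathcal{L}$ by gradient descent, using the REINFORCE (Williams, 1992) gradient estimator with baseline $b(s)$:
$$\nabla \mathcal{L}(\theta | s) = \mathbb{E}_{p_\theta(\boldsymbol{\pi} | s)}\left[\left(L(\boldsymbol{\pi}) - b(s)\right) \nabla \log p_\theta(\boldsymbol{\pi} | s)\right] \tag{9}$$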
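To see why subtracting a baseline is allowed, the toy example below evaluates the expectation in equation 9 exactly for a deliberately tiny policy: a softmax over three candidate solutions with fixed costs. This stand-in policy is an assumption made for illustration and is not the Attention Model; the point is only that the expected gradient does not depend on the baseline value.

```python
import math

def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_reinforce_gradient(logits, costs, baseline):
    """Exact E[(L - b) * grad log p] for a softmax policy over a few solutions."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for k, (p_k, cost) in enumerate(zip(probs, costs)):
        for i in range(len(logits)):
            # For a softmax: d log p_k / d logit_i = 1[i == k] - p_i
            dlogp = (1.0 if i == k else 0.0) - probs[i]
            grad[i] += p_k * (cost - baseline) * dlogp
    return grad

logits = [0.0, 0.5, -0.5]
costs = [4.0, 3.0, 5.0]  # hypothetical tour lengths of the three solutions
grad_no_baseline = expected_reinforce_gradient(logits, costs, baseline=0.0)
grad_with_baseline = expected_reinforce_gradient(logits, costs, baseline=4.0)
```

Both gradients coincide, and the component for the cheapest solution is negative, so a gradient descent step raises its logit. With sampled rather than exact expectations the baseline changes the variance of the estimate, not its mean.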
A good baseline $b(s)$ reduces gradient variance and therefore increases speed of learning. A simple example is an exponential moving average $b(s) = M$ with decay $\beta$. Here $M = L(\boldsymbol{\pi})$ in the first iteration and gets updated as $M \leftarrow \beta M + (1 - \beta) L(\boldsymbol{\pi})$ in subsequent iterations. A popular alternative is the use of a learned value function (critic) $\hat{v}(s, \mathbf{w})$, where the parameters $\mathbf{w}$ are learned from the observations $(s, L(\boldsymbol{\pi}))$. However, getting such actor-critic algorithms to work is non-trivial.
We propose to use a rollout baseline in a way that is similar to self-critical training by Rennie et al. (2017), but with periodic updates of the baseline policy. It is defined as follows: $b(s)$ is the cost of a solution from a deterministic greedy rollout of the policy defined by the best model so far.
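The exponential moving average baseline described above is simple enough to sketch directly. The class name is hypothetical, the decay of 0.8 is an illustrative choice, and feeding in one observed cost per iteration (for example a batch average) is an assumption of this sketch.

```python
class ExponentialBaseline:
    """Exponential moving average of observed costs, used as b(s)."""

    def __init__(self, beta):
        self.beta = beta
        self.value = None  # no observation yet

    def update(self, observed_cost):
        if self.value is None:
            # First iteration: the baseline equals the observed cost.
            self.value = observed_cost
        else:
            self.value = self.beta * self.value + (1 - self.beta) * observed_cost
        return self.value

baseline = ExponentialBaseline(beta=0.8)
history = [baseline.update(cost) for cost in [10.0, 5.0, 5.0]]
```

The baseline moves only part of the way towards each new observation, so a sudden drop in cost from 10 to 5 is absorbed gradually. Unlike the rollout baseline, this value is the same for every instance and therefore cannot reflect how difficult an individual instance is.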
The goal of a baseline is to estimate the difficulty of the instance $s$, such that it can relate to the cost $L(\boldsymbol{\pi})$ to estimate the advantage of the solution $\boldsymbol{\pi}$ selected by the model. We make the following key observation: The difficulty of an instance can (on average) be estimated by the performance of an algorithm applied to it. This follows from the assumption that (on average) an algorithm will have a higher cost on instances that are more difficult. Therefore we form a baseline by applying (rolling out) the algorithm defined by our model during training. To eliminate variance we force the result to be deterministic by selecting greedily the action with maximum probability.
As the model changes during training, we stabilize the baseline by freezing the greedy rollout policy $p_{\theta^{\text{BL}}}$ for a fixed number of steps (every epoch), similar to freezing of the target Q-network in DQN (Mnih et al., 2015). A stronger algorithm defines a stronger baseline, so we compare (with greedy decoding) the current training policy with the baseline policy at the end of every epoch, and replace the parameters $\theta^{\text{BL}}$ of the baseline policy only if the improvement is significant according to a paired t-test ($\alpha = 5\%$), on 10000 separate (evaluation) instances. If the baseline policy is updated, we sample new evaluation instances to prevent overfitting.
With the greedy rollout as baseline $b(s)$, the function $L(\boldsymbol{\pi}) - b(s)$ is negative if the sampled solution $\boldsymbol{\pi}$ is better than the greedy rollout, causing actions to be reinforced, and vice versa. This way the model is trained to improve over its (greedy) self. We see similarities with self-play improvement (Silver et al., 2017): sampling replaces tree search for exploration and the model is rewarded if it yields improvement (‘wins’) compared to the best model. Similar to AlphaGo, the evaluation at the end of each epoch ensures that we are always challenged by the best model.
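The end-of-epoch comparison can be sketched as a paired test on per-instance costs. Two choices here are assumptions of the sketch, not statements about the authors' implementation: the test is one-sided, and the t-distribution is approximated by a standard normal, which is reasonable only for large evaluation sets. The function name and the synthetic costs are illustrative.

```python
import math
from statistics import NormalDist, mean, stdev

def should_replace_baseline(candidate_costs, baseline_costs, alpha=0.05):
    """One-sided paired test: is the candidate policy significantly cheaper?"""
    diffs = [c - b for c, b in zip(candidate_costs, baseline_costs)]
    avg = mean(diffs)
    if avg >= 0:  # candidate is not better on average
        return False
    std_err = stdev(diffs) / math.sqrt(len(diffs))
    if std_err == 0:  # the same improvement on every instance
        return True
    t_stat = avg / std_err
    # Normal approximation to the t-distribution (large-sample assumption).
    p_value = NormalDist().cdf(t_stat)
    return p_value < alpha

# Synthetic greedy costs on 1000 evaluation instances.
baseline_costs = [4.0 + 0.01 * (i % 7) for i in range(1000)]
candidate_costs = [b - 0.05 + 0.02 * ((i % 5) - 2) for i, b in enumerate(baseline_costs)]

improved = should_replace_baseline(candidate_costs, baseline_costs)
unchanged = should_replace_baseline(baseline_costs, baseline_costs)
worse = should_replace_baseline(baseline_costs, candidate_costs)
```

Pairing matters because both policies are evaluated on the same instances: the large variation in cost between instances cancels in the differences, which makes small improvements detectable.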
Each rollout constitutes an additional forward pass, increasing computation by 50%. However, as the baseline policy is fixed for an epoch, we can sample the data and compute baselines per epoch using larger batch sizes, allowed by the reduced memory requirement as the computations can run in pure inference mode. Empirically we find that it adds only 25% (see Appendix B.5), taking up 20% of total time. If desired, the baseline rollout can be computed in parallel such that there is no increase in time per iteration, as an easy way to benefit from an additional GPU.
We focus on routing problems: we consider the TSP, two variants of the VRP, the Orienteering Problem and the (Stochastic) Prize Collecting TSP. These provide a range of different challenges, constraints and objectives and are traditionally solved by different algorithms. For the Attention Model (AM), we adjust the input, mask, decoder context and objective function for each problem (see Appendix for details and data generation) and train on problem instances of $n = 20$, 50 and 100 nodes. For all problems, we use the same hyperparameters: those we found to work well on TSP.
We initialize parameters $\text{Uniform}(-1/\sqrt{d}, 1/\sqrt{d})$, with $d$ the input dimension. Every epoch we process 2500 batches of 512 instances (except for VRP with $n = 100$, where we use 2500 × 256 for memory constraints). For TSP, an epoch takes 5:30 minutes for $n = 20$, 16:20 for $n = 50$ (single GPU 1080Ti) and 27:30 for $n = 100$ (on 2 1080Ti’s). We train for 100 epochs using training data generated on the fly. We found training to be stable and results to be robust against different seeds, where only in one case (PCTSP with $n = 20$) we had to restart training with a different seed because the run diverged. We use $N = 3$ layers in the encoder, which we found is a good trade-off between quality of the results and computational complexity. We use a constant learning rate $\eta = 10^{-4}$. Training with a higher learning rate is possible and speeds up initial learning, but requires decay (0.96 per epoch) to converge and may be a bit more unstable. See Appendix B.5. With the rollout baseline, we use an exponential baseline ($\beta = 0.8$) during the first epoch, to stabilize initial learning, although in many cases learning also succeeds without this ‘warmup’. Our code in PyTorch (Paszke et al., 2017) is publicly available (footnote 2: https://github.com/wouterkool/attention-learn-to-route).
For each problem, we report performance on 10000 test instances. At test time we use greedy decoding, where we select the best action (according to the model) at each step, or sampling, where we sample 1280 solutions (in $< 1$s on a single GPU) and report the best. More sampling improves solution quality at increased computation. In Table 1 we compare greedy decoding against baselines that also construct a single solution, and compare sampling against baselines that also consider multiple solutions, either via sampling or (local) search. For each problem, we also report the ‘best possible solution’: either optimal via Gurobi (2018) (intractable for $n > 20$ except for TSP) or a problem specific state-of-the-art algorithm.
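The two test-time decoding strategies can be illustrated with a stand-in policy. In the sketch below, `toy_policy` is a hypothetical hand-written rule that prefers nearby unvisited nodes; it only plays the role of the trained model so that the example runs on its own. Greedy decoding takes the most probable node at every step, while sampling draws several tours and keeps the shortest.

```python
import math
import random

def tour_length(coords, tour):
    return sum(math.dist(coords[tour[t]], coords[tour[(t + 1) % len(tour)]])
               for t in range(len(tour)))

def toy_policy(coords, tour):
    """Hypothetical stand-in for the model: prefers nearby unvisited nodes."""
    weights = []
    for j, point in enumerate(coords):
        if j in tour:
            weights.append(0.0)  # masked: already visited
        elif not tour:
            weights.append(1.0)  # uniform choice of the start node
        else:
            weights.append(math.exp(-5.0 * math.dist(coords[tour[-1]], point)))
    total = sum(weights)
    return [w / total for w in weights]

def decode(coords, policy, rng=None):
    """Greedy decoding if rng is None, otherwise sample every step."""
    tour = []
    for _ in coords:
        probs = policy(coords, tour)
        if rng is None:
            nxt = max(range(len(probs)), key=probs.__getitem__)
        else:
            nxt = rng.choices(range(len(probs)), weights=probs)[0]
        tour.append(nxt)
    return tour

rng = random.Random(0)
coords = [(rng.random(), rng.random()) for _ in range(10)]
greedy_tour = decode(coords, toy_policy)
samples = [decode(coords, toy_policy, rng) for _ in range(64)]
best_sample = min(samples, key=lambda tour: tour_length(coords, tour))
```

Because masked nodes have zero weight, both strategies always return a valid permutation; the cost of sampling grows linearly with the number of tours drawn.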
Run times are important but hard to compare: they can vary by two orders of magnitude as a result of implementation (Python vs C++) and hardware (GPU vs CPU). We take a practical view and report the time it takes to solve the test set of 10000 instances, either on a single GPU (1080Ti) or 32 instances in parallel on a 32 virtual CPU system (2 × Xeon E5-2630). This is conservative: our model is parallelizable while most of the baselines are single thread CPU implementations which cannot parallelize when running individually. Also we note that after training our run time can likely be reduced by model compression (Hinton et al., 2015). In Table 1 we do not report running times for the results which were reported by others as they are not directly comparable but we note that in general our model and implementation is fast: for instance Bello et al. (2016) report 10.3s for sampling 1280 TSP solutions (K80 GPU) which we do in less than one second (on a 1080Ti). For most algorithms it is possible to trade off runtime for performance. As reporting full trade-off curves is impractical we tried to pick reasonable spots, reporting the fastest if results were similar or reporting results with different time limits (for example we use Gurobi with time limits as heuristic).
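| Problem | Method | Obj. (n = 20) | Gap (n = 20) | Time (n = 20) | Obj. (n = 50) | Gap (n = 50) | Time (n = 50) | Obj. (n = 100) | Gap (n = 100) | Time (n = 100) |
|---|---|---|---|---|---|---|---|---|---|---|
| TSP | Concorde | | | (1m) | | | (2m) | | | (3m) |
| | LKH3 | | | (18s) | | | (5m) | | | (21m) |
| | Gurobi | | | (7s) | | | (2m) | | | (17m) |
| | Gurobi (1s) | | | (8s) | | | (2m) | | | |
| | Nearest Insertion | | | (1s) | | | (2s) | | | (6s) |
| | Random Insertion | | | (0s) | | | (1s) | | | (3s) |
| | Farthest Insertion | | | (1s) | | | (2s) | | | (7s) |
| | Nearest Neighbor | | | (0s) | | | (0s) | | | (0s) |
| | Vinyals et al. (gr.) | | | | | | | | | |
| | Bello et al. (gr.) | | | | | | | | | |
| | Dai et al. | | | | | | | | | |
| | Nowak et al. | | | | | | | | | |
| | EAN (greedy) | | | (2m) | | | (5m) | | | (8m) |
| | AM (greedy) | | | (0s) | | | (2s) | | | (6s) |
| | OR Tools | | | | | | | | | |
| | Chr.f. + 2OPT | | | | | | | | | |
| | Bello et al. (s.) | | | | | | | | | |
| | EAN (gr. + 2OPT) | | | (4m) | | | (26m) | | | (3h) |
| | EAN (sampling) | | | (5m) | | | (17m) | | | (56m) |
| | EAN (s. + 2OPT) | | | (6m) | | | (32m) | | | (5h) |
| | AM (sampling) | | | (5m) | | | (24m) | | | (1h) |
| CVRP | Gurobi | | | | | | | | | |
| | LKH3 | | | (2h) | | | (7h) | | | (13h) |
| | RL (greedy) | | | | | | | | | |
| | AM (greedy) | | | (1s) | | | (3s) | | | (8s) |
| | RL (beam 10) | | | | | | | | | |
| | Random CW | | | | | | | | | |
| | Random Sweep | | | | | | | | | |
| | OR Tools | | | | | | | | | |
| | AM (sampling) | | | (6m) | | | (28m) | | | (2h) |
| SDVRP | RL (greedy) | | | | | | | | | |
| | AM (greedy) | | | (1s) | | | (4s) | | | (11s) |
| | RL (beam 10) | | | | | | | | | |
| | AM (sampling) | | | (9m) | | | (42m) | | | (3h) |
| OP (distance) | Gurobi | | | (16m) | | | | | | |
| | Gurobi (1s) | | | (4m) | | | (6m) | | | (7m) |
| | Gurobi (10s) | | | (12m) | | | (51m) | | | (53m) |
| | Gurobi (30s) | | | (14m) | | | (2h) | | | (3h) |
| | Compass | | | (2m) | | | (5m) | | | (15m) |
| | Tsili (greedy) | | | (4s) | | | (4s) | | | (5s) |
| | AM (greedy) | | | (0s) | | | (1s) | | | (5s) |
| | GA (Python) | | | (10m) | | | (1h) | | | (5h) |
| | OR Tools (10s) | | | (52m) | | | | | | |
| | Tsili (sampling) | | | (28s) | | | (2m) | | | (6m) |
| | AM (sampling) | | | (4m) | | | (16m) | | | (53m) |
| PCTSP | Gurobi | | | (2m) | | | | | | |
| | Gurobi (1s) | | | (1m) | | | | | | |
| | Gurobi (10s) | | | (2m) | | | (32m) | | | |
| | Gurobi (30s) | | | (2m) | | | (54m) | | | |
| | AM (greedy) | | | (0s) | | | (2s) | | | (5s) |
| | ILS (C++) | | | (16m) | | | (2h) | | | (12h) |
| | OR Tools (10s) | | | (52m) | | | (52m) | | | (52m) |
| | OR Tools (60s) | | | (5h) | | | (5h) | | | (5h) |
| | ILS (Python 10x) | | | (4m) | | | (3m) | | | (3m) |
| | AM (sampling) | | | (5m) | | | (19m) | | | (1h) |
| SPCTSP | REOPT (all) | | | (17m) | | | (2h) | | | (12h) |
| | REOPT (half) | | | (25m) | | | (3h) | | | (16h) |
| | REOPT (first) | | | (1h) | | | (22h) | | | |
| | AM (greedy) | | | (0s) | | | (2s) | | | (5s) |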
For the TSP, we report optimal results by Gurobi, as well as by Concorde (Applegate et al., 2006) (faster than Gurobi as it is specialized for TSP) and LKH3 (Helsgaun, 2017), a state-of-the-art heuristic solver that empirically also finds optimal solutions in time comparable to Gurobi. We compare against Nearest, Random and Farthest Insertion, as well as Nearest Neighbor, which is the only non-learned baseline algorithm that also constructs a tour directly in order (i.e. is structurally similar to our model). For details, see Appendix B.3. Additionally we compare against the learned heuristics in Section 2, most importantly Bello et al. (2016), as well as OR Tools reported by Bello et al. (2016) and Christofides + 2OPT local search reported by Vinyals et al. (2015). Results for Dai et al. (2017) are (optimistically) computed from the optimality gaps they report on 15-20, 40-50 and 50-100 node graphs, respectively. Using a single greedy construction we outperform traditional baselines and we are able to achieve significantly closer to optimal results than previous learned heuristics (from around 1.5% to 0.3% above optimal for $n = 20$). Naturally, the difference with Bello et al. (2016) gets diluted when sampling many solutions (as with many samples even a random policy performs well), but we still obtain significantly better results, without tuning the softmax temperature. For completeness, we also report results from running the Encode-Attend-Navigate (EAN) code (footnote 3: https://github.com/MichelDeudon/encode-attend-navigate), which is concurrent work by Deudon et al. (2018) (for details see Appendix B.4). Our model outperforms EAN, even if EAN is improved with 2OPT local search. Appendix B.5 presents the results visually, including generalization results for different $n$.
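For reference, the Nearest Neighbor baseline and the optimality gap used to compare methods can be sketched as follows. This is a generic textbook version, not the implementation from Appendix B.3; the tie-breaking by node index and the example numbers are choices made for this sketch.

```python
import math

def nearest_neighbor_tour(coords, start=0):
    """Construct a tour in order by always moving to the closest unvisited node."""
    tour = [start]
    unvisited = set(range(len(coords))) - {start}
    while unvisited:
        last = coords[tour[-1]]
        # Ties are broken by the lower node index.
        nxt = min(unvisited, key=lambda j: (math.dist(last, coords[j]), j))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

def optimality_gap(cost, optimal_cost):
    """Relative excess over the optimal cost (0.05 means 5% above optimal)."""
    return (cost - optimal_cost) / optimal_cost

# Corners of a 2 x 1 rectangle.
rectangle = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0)]
nn_tour = nearest_neighbor_tour(rectangle)
gap = optimality_gap(6.3, 6.0)  # a hypothetical cost of 6.3 against an optimum of 6.0
```

Like the learned policy, Nearest Neighbor builds the tour one node at a time in visiting order, which is why the text calls it structurally similar to the model.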
In the Capacitated VRP (CVRP) (Toth & Vigo, 2014), each node has a demand and multiple routes should be constructed (starting and ending at the depot), such that the total demand of the nodes in each route does not exceed the vehicle capacity. We also consider the Split Delivery VRP (SDVRP), which allows customer demands to be split over multiple routes. We implement the datasets described by Nazari et al. (2018) and compare against their Reinforcement Learning (RL) framework and the strongest baselines they report. Comparing greedy decoding, we obtain significantly better results. We cannot directly compare our sampling (1280 samples) to their beam search with size 10 (they do not report sampling or larger beam sizes), but note that our greedy method also outperforms their beam search in most (larger) cases, getting (in 1 second/instance) much closer to LKH3 (Helsgaun, 2017), a state-of-the-art algorithm which found best known solutions to CVRP benchmarks. See Appendix C.4 for greedy example solution plots.
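The CVRP constraint can be made concrete with a small feasibility check. The sketch assumes the non-split variant, where every customer appears in exactly one route, and represents each route as the list of customers visited between two depot visits; the demands and capacity are made-up numbers.

```python
def routes_are_feasible(routes, demands, capacity):
    """Every customer is served exactly once and no route exceeds the capacity."""
    served = sorted(node for route in routes for node in route)
    if served != sorted(demands):
        return False
    return all(sum(demands[node] for node in route) <= capacity for route in routes)

# Hypothetical customers 1..4 with their demands; node 0 would be the depot.
demands = {1: 4, 2: 3, 3: 5, 4: 2}

ok = routes_are_feasible([[1, 2], [3, 4]], demands, capacity=7)
overloaded = routes_are_feasible([[1, 2, 4], [3]], demands, capacity=7)
incomplete = routes_are_feasible([[1, 2], [3]], demands, capacity=7)
```

In a constructive model the same condition would be enforced step by step through the mask, by disallowing customers whose demand exceeds the remaining capacity; for the SDVRP the exactly-once requirement is relaxed.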
The OP (Golden et al., 1987) is an important problem used to model many real world problems. Each node has an associated prize, and the goal is to construct a single tour (starting and ending at the depot) that maximizes the sum of prizes of nodes visited while being shorter than a maximum (given) length. We consider the prize distributions proposed in Fischetti et al. (1998): constant, uniform (in Appendix D.4), and increasing with the distance to the depot, which we report here as this is the hardest problem. As ‘best possible solution’ we report Gurobi (intractable for $n > 20$) and Compass, the recent state-of-the-art Genetic Algorithm (GA) by Kobeaga et al. (2018), which is only 2% better than sampling 1280 solutions with our method (objective is maximization). We outperform a Python GA (footnote 4: https://github.com/mc-ride/orienteering), which seems not to scale, as well as the construction phase of the heuristic by Tsiligirides (1984) (comparing greedy or 1280 samples) which is structurally similar to the one learned by our model. OR Tools fails to find feasible solutions in a few percent of the cases for $n > 20$.
In the PCTSP (Balas, 1989), each node has not only an associated prize, but also an associated penalty. The goal is to collect at least a minimum total prize, while minimizing the total tour length plus the sum of penalties of unvisited nodes. This problem is difficult as an algorithm has to trade off the penalty for not visiting a node with the marginal cost/tour length of visiting (which depends on the other nodes visited), while also satisfying the minimum total prize constraint. We compare against OR Tools with 10 or 60 seconds of local search, as well as open source C++ (footnote 5: https://github.com/jordanamecler/PCTSP) and Python (footnote 6: https://github.com/rafael2reis/salesman) implementations of Iterated Local Search (ILS). Although the Attention Model does not find better solutions than OR Tools with 60s of local search, it finds almost equally good results in significantly less time. The results are also within 2% of the C++ ILS algorithm (but obtained much faster), which was the best open-source algorithm for PCTSP we could find.
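The PCTSP objective described above can be written down directly. The sketch assumes, as in the OP, that the tour starts and ends at a depot (node 0 here), and returns `None` when the minimum prize constraint is violated; the function name, the data layout and all numbers are hypothetical.

```python
import math

def pctsp_cost(coords, prizes, penalties, tour, min_prize, depot=0):
    """Tour length plus penalties of unvisited nodes; None if too little prize."""
    if sum(prizes[j] for j in tour) < min_prize:
        return None  # infeasible: minimum total prize not collected
    path = [depot] + list(tour) + [depot]
    length = sum(math.dist(coords[a], coords[b]) for a, b in zip(path, path[1:]))
    skipped = set(prizes) - set(tour)
    return length + sum(penalties[j] for j in skipped)

coords = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (1.0, 1.0), 3: (5.0, 5.0)}
prizes = {1: 0.5, 2: 0.5, 3: 0.2}
penalties = {1: 1.0, 2: 1.0, 3: 0.3}

# Visit the two nearby nodes and accept the penalty for the distant one.
cost = pctsp_cost(coords, prizes, penalties, [1, 2], min_prize=1.0)
infeasible = pctsp_cost(coords, prizes, penalties, [3], min_prize=1.0)
```

The example shows the trade-off from the text: skipping node 3 costs a penalty of 0.3, far less than the detour needed to reach it.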
The Stochastic variant of the PCTSP (SPCTSP) we consider shows how our model can deal with uncertainty naturally. In the SPCTSP, the expected node prize is known upfront, but the real collected prize only becomes known upon visitation. With penalties, this problem is a generalization of the stochastic k-TSP (Ene et al., 2018). Since our model constructs a tour one node at a time, we only need to use the real prizes to compute the remaining prize constraint. By contrast, any algorithm that selects a fixed tour may fail to satisfy the prize constraint so an algorithm must be adaptive. As a baseline, we implement an algorithm that plans a tour, executes part of it and then re-optimizes using the C++ ILS algorithm. We either execute all node visits (so planning additional nodes if the result does not satisfy the prize constraint), half of the planned node visits (for $O(\log n)$ replanning iterations) or only the first node visit, for maximum adaptivity. We observe that our model outperforms all baselines for $n = 20$. We think that failure to account for uncertainty (by the baselines) in the prize might result in the need to visit one or two additional nodes, which is relatively costly for small instances but relatively cheap for larger $n$. Still, our method is beneficial as it provides competitive solutions at a fraction of the computational cost, which is important in online settings.
Figure 3 compares the performance of the TSP20 Attention Model (AM) and our implementation of the Pointer Network (PN) during training. We use a validation set of size 10000 with greedy decoding, and compare to using an exponential ($\beta = 0.8$) and a critic (see Appendix B.1) baseline. We used two random seeds and a decaying learning rate of $\eta = 10^{-3} \times 0.96^{\text{epoch}}$. This performs best for the PN, while for the AM results are similar to using $\eta = 10^{-4}$ (see Appendix B.5). This clearly illustrates how the improvement we obtain is the result of both the AM and the rollout baseline: the AM outperforms the PN using any baseline and the rollout baseline improves the quality and convergence speed for both AM and PN. For the PN with critic baseline, we are unable to reproduce the 1.5% reported by Bello et al. (2016) (also when using an LSTM based critic), but our reproduction is closer than others have reported (Dai et al., 2017; Nazari et al., 2018). In Table 1 we compare against the original results. Compared to the rollout baseline, the exponential baseline is around 20% faster per epoch, whereas the critic baseline is around 13% slower (see Appendix B.5), so the picture does not change significantly if time is used as x-axis.
In this work we have introduced a model and training method which both contribute to significantly improved results on learned heuristics for TSP and additionally learned strong (single construction) heuristics for multiple routing problems, which are traditionally solved by problem-specific approaches. We believe that our method is a powerful starting point for learning heuristics for other combinatorial optimization problems defined on graphs, if their solutions can be described as sequential decisions. In practice, operational constraints often lead to many variants of problems for which no good (human-designed) heuristics are available such that the ability to learn heuristics could be of great practical value.
Compared to previous works, by using attention instead of recurrence (LSTMs) we introduce invariance to the input order of the nodes, increasing learning efficiency. Also this enables parallelization, for increased computational efficiency. The multi-head attention mechanism can be seen as a message passing algorithm that allows nodes to communicate relevant information over different channels, such that the node embeddings from the encoder can learn to include valuable information about the node in the context of the graph. This information is important in our setting where decisions relate directly to the nodes in a graph. Being a graph based method, our model has increased scaling potential (compared to LSTMs) as it can be applied on a sparse graph and operate locally.
Scaling to larger problem instances is an important direction for future research, where we think we have made an important first step by using a graph based method, which can be sparsified for improved computational efficiency. Another challenge is that many problems of practical importance have feasibility constraints that cannot be satisfied by a simple masking procedure, and we think it is promising to investigate if these problems can be addressed by a combination of heuristic learning and backtracking. This would unleash the potential of our method, already highly competitive to the popular Google OR Tools project, to an even larger class of difficult practical problems.
This research was funded by ORTEC Optimization Technology. We thank Thomas Kipf for helpful discussions and anonymous reviewers for comments that helped improve the paper. We thank DAS5 (Bal et al., 2016) for computational resources and we thank SURFsara (www.surfsara.nl) for the support in using the Lisa Compute Cluster.
We interpret the attention mechanism by Vaswani et al. (2017) as a weighted message passing algorithm between nodes in a graph. The weight of the message value that a node receives from a neighbor depends on the compatibility of its query with the key of the neighbor, as illustrated in Figure 4. Formally, we define dimensions $d_k$ and $d_v$ and compute the key $k_i \in \mathbb{R}^{d_k}$, value $v_i \in \mathbb{R}^{d_v}$ and query $q_i \in \mathbb{R}^{d_k}$ for each node by projecting the embedding $h_i$:
$$q_i = W^Q h_i, \quad k_i = W^K h_i, \quad v_i = W^V h_i. \tag{10}$$
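Here parameters $W^Q$ and $W^K$ are $(d_k \times d_h)$ matrices and $W^V$ has size $(d_v \times d_h)$. From the queries and keys, we compute the compatibility $u_{ij} \in \mathbb{R}$ of the query $q_i$ of node $i$ with the key $k_j$ of node $j$ as the (scaled, see Vaswani et al. (2017)) dot-product:
$$u_{ij} = \begin{cases} \dfrac{q_i^T k_j}{\sqrt{d_k}} & \text{if } i \text{ adjacent to } j \\ -\infty & \text{otherwise.} \end{cases} \tag{11}$$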
In a general graph, defining the compatibility of non-adjacent nodes as $-\infty$ prevents message passing between these nodes. From the compatibilities $u_{ij}$, we compute the attention weights $a_{ij} \in [0, 1]$ using a softmax:
$$a_{ij} = \frac{e^{u_{ij}}}{\sum_{j'} e^{u_{ij'}}}. \tag{12}$$
Finally, the vector $h'_i$ that is received by node $i$ is the convex combination of messages $v_j$:
$$h'_i = \sum_j a_{ij} v_j. \tag{13}$$
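The message passing view above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names, the list-of-lists matrix layout and the optional `adjacent` argument are assumptions made here, and it assumes every node is adjacent to at least one node so that the softmax is well defined.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attention_messages(h, W_q, W_k, W_v, adjacent=None):
    """Return the vector received by every node (single attention head)."""
    n = len(h)
    q = [matvec(W_q, hi) for hi in h]  # queries
    k = [matvec(W_k, hi) for hi in h]  # keys
    v = [matvec(W_v, hi) for hi in h]  # values (the messages)
    d_k = len(k[0])
    received = []
    for i in range(n):
        # Compatibility: scaled dot product, -inf blocks non-adjacent nodes.
        u = [
            sum(x * y for x, y in zip(q[i], k[j])) / math.sqrt(d_k)
            if adjacent is None or adjacent[i][j]
            else -math.inf
            for j in range(n)
        ]
        # Softmax, shifted by the maximum for numerical stability.
        u_max = max(u)
        e = [math.exp(x - u_max) for x in u]
        z = sum(e)
        a = [x / z for x in e]
        # Convex combination of the messages.
        received.append(
            [sum(a[j] * v[j][c] for j in range(n)) for c in range(len(v[0]))]
        )
    return received
```

With `adjacent=None` the graph is treated as fully connected, which corresponds to the dense setting used for the TSP; passing a boolean adjacency matrix restricts message passing to neighbors.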
As was noted by Vaswani et al. (2017) and Velickovic et al. (2018), it is beneficial to have multiple attention heads. This allows nodes to receive different types of messages from different neighbors. Specifically, we compute the value in equation 13 $M$ times with different parameters, using $d_k = d_v = \frac{d_h}{M}$. We denote the result vectors by $h'_{im}$ for $m \in \{1, \ldots, M\}$. These are projected back to a single $d_h$-dimensional vector using $(d_h \times d_v)$ parameter matrices $W_m^O$. The final multi-head attention value for node $i$ is a function of $h_1, \ldots, h_n$ through $h'_{im}$:
$$\text{MHA}_i(h_1, \ldots, h_n) = \sum_{m=1}^{M} W_m^O h'_{im}. \tag{14}$$
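The feed-forward sublayer computes node-wise projections using a hidden (sub)sublayer with dimension $d_{\text{ff}}$ and a ReLu activation:
$$\text{FF}(\hat{h}_i) = W^{\text{ff},1} \cdot \text{ReLu}(W^{\text{ff},0} \hat{h}_i + b^{\text{ff},0}) + b^{\text{ff},1}. \tag{15}$$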
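We use batch normalization with learnable $d_h$-dimensional affine parameters $w^{\text{bn}}$ and $b^{\text{bn}}$:
$$\text{BN}(h_i) = w^{\text{bn}} \odot \overline{\text{BN}}(h_i) + b^{\text{bn}}. \tag{16}$$
Here $\odot$ denotes the element-wise product and $\overline{\text{BN}}$ refers to batch normalization without affine transformation.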
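As a rough illustration of batch normalization followed by a learnable affine transformation, the sketch below normalizes each embedding dimension over a batch and then applies an element-wise scale and shift. The function name, the flattening of all node embeddings into one list, and the `eps` constant are assumptions made here for illustration; running statistics as used at test time are omitted.

```python
import math


def batch_norm_affine(batch, w, b, eps=1e-5):
    """Normalize each dimension over the batch, then scale by w and shift by b.

    `batch` is a list of embedding vectors (assumed here to contain all node
    embeddings of all instances in the batch, flattened together).
    """
    m = len(batch)
    d = len(batch[0])
    mean = [sum(h[c] for h in batch) / m for c in range(d)]
    var = [sum((h[c] - mean[c]) ** 2 for h in batch) / m for c in range(d)]
    return [
        [
            w[c] * (h[c] - mean[c]) / math.sqrt(var[c] + eps) + b[c]
            for c in range(d)
        ]
        for h in batch
    ]
```

After this transformation, the mean of each output dimension over the batch equals the corresponding entry of `b`.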
The critic network architecture uses 3 attention layers similar to our encoder, after which the node embeddings are averaged and processed by an MLP with one hidden layer with 128 neurons and ReLu activation and a single output. We used the same learning rate as for the AM/PN in all experiments.
For all TSP instances, the node locations are sampled uniformly at random in the unit square. This distribution is chosen to be neither easy nor artificially hard and to be able to compare to other learned heuristics.
This section describes details of the heuristics implemented for the TSP. All of the heuristics construct a single tour in a single pass, by extending a partial solution one node at a time.
The nearest neighbor heuristic represents the partial solution as a path with a start and end node. The initial path is formed by a single node, selected randomly, which becomes the start node but also the end node of the initial path. In each iteration, the next node is selected as the node nearest to the end node of the partial path. This node is added to the path and becomes the new end node. Finally, after all nodes are added this way, the end node is connected with the start node to form a tour. In our implementation, for deterministic results we always start with the first node in the input, which can be considered random as the instances are generated randomly.
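The nearest neighbor procedure described above might be sketched as follows. The function name and the tie-breaking rule (lowest index first) are assumptions introduced here; the deterministic start at the first input node follows the description in the text.

```python
import math


def nearest_neighbor_tour(coords, start=0):
    """Construct a tour by always moving to the nearest unvisited node."""
    unvisited = set(range(len(coords))) - {start}
    tour = [start]  # the path initially consists of the start node only
    while unvisited:
        end = coords[tour[-1]]
        # Ties are broken by the lowest index so results are deterministic.
        nxt = min(unvisited, key=lambda j: (math.dist(end, coords[j]), j))
        unvisited.remove(nxt)
        tour.append(nxt)  # the selected node becomes the new end node
    # The tour is closed implicitly by returning from tour[-1] to tour[0].
    return tour
```

For example, on the four corners of the unit square given in counter-clockwise order, this visits the corners in input order.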
The insertion heuristics represent a partial solution as a tour, and extend it by inserting nodes one at a time. In our implementation, we always insert the node at the position with the cheapest insertion cost. This means that when node $i$ is inserted, the place of insertion (between adjacent nodes $j$ and $k$ in the tour) is selected such that it minimizes the insertion costs $d_{ij} + d_{ik} - d_{jk}$, where $d_{ij}$, $d_{ik}$ and $d_{jk}$ represent the distances from node $i$ to $j$, $i$ to $k$ and $j$ to $k$, respectively.
The different variants of the insertion heuristic vary in the way in which the node to be inserted is selected. Let $S$ be the set of nodes in the partial tour. Nearest insertion inserts the node $i$ that is nearest to (any node in) the tour:
$$i^* = \arg\min_{i \notin S} \min_{j \in S} d_{ij}. \tag{17}$$
Farthest insertion inserts the node $i$ such that the distance to the tour (i.e. the distance from $i$ to the nearest node $j$ in the tour) is maximized:
$$i^* = \arg\max_{i \notin S} \min_{j \in S} d_{ij}. \tag{18}$$
Random insertion inserts a random node. Similar to nearest neighbor, we consider the input order random so we simply insert the nodes in this order.
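The three insertion variants share the same cheapest-position insertion step and differ only in node selection, which the following sketch tries to make explicit. The function names and the choice to start the partial tour from the first input node are assumptions made for this illustration.

```python
import math


def tour_length(coords, tour):
    """Length of the closed tour visiting `tour` in order."""
    return sum(
        math.dist(coords[tour[p]], coords[tour[(p + 1) % len(tour)]])
        for p in range(len(tour))
    )


def insertion_tour(coords, variant="farthest"):
    """Build a tour by repeated insertion at the cheapest position.

    variant: "nearest", "farthest" or "random" (input order).
    """
    if variant not in ("nearest", "farthest", "random"):
        raise ValueError(f"unknown variant: {variant}")

    def d(a, b):
        return math.dist(coords[a], coords[b])

    tour = [0]  # assumption: the partial tour starts with the first node
    remaining = list(range(1, len(coords)))
    while remaining:
        if variant == "random":
            node = remaining[0]  # input order is treated as random
        else:
            select = min if variant == "nearest" else max
            # Distance to the tour = distance to the nearest node in the tour.
            node = select(remaining, key=lambda i: min(d(i, j) for j in tour))
        remaining.remove(node)

        def cost(p):
            # Extra length when inserting `node` between tour[p] and its successor.
            j, k = tour[p], tour[(p + 1) % len(tour)]
            return d(node, j) + d(node, k) - d(j, k)

        best = min(range(len(tour)), key=cost)
        tour.insert(best + 1, node)
    return tour
```

Each call to `cost` scans one candidate position, so this naive version takes cubic time in the number of nodes for the nearest and farthest variants; it is meant to clarify the logic rather than to be efficient.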
Independently of our work, Deudon et al. (2018) also developed a model for TSP based on the Transformer (Vaswani et al., 2017). There are important differences to this paper:
As ‘context’ for the decoder, Deudon et al. (2018) use the embeddings of the last $K = 3$ visited nodes. We use only the last (i.e. $K = 1$) node but add the first visited node (as well as the graph embedding), since the first node is important (it is the destination) while the order of the other nodes is irrelevant as we explain in Section 3.
By adding 2OPT on top of the best sampled solution, Deudon et al. (2018) show that the model does not produce a local optimum and results can improve by using a ‘hybrid’ approach of a learned algorithm with local search. This is a nice example of combining learned and traditional heuristics, but it is not compared against using the Pointer Network (Bello et al., 2016) with 2OPT.
The model of Deudon et al. (2018) uses a higher dimensionality internally in the decoder (for details see their paper). Training is done with 20000 steps with a batch size of 256.
Deudon et al. (2018) apply Principal Component Analysis (PCA) on the input coordinates to eliminate rotation symmetry whereas we directly input node coordinates.
In addition to TSP, we also consider two variants of VRP, the OP with different prize distributions and the (stochastic) PCTSP.
We want to emphasize that this is independent work, but for completeness we include a full empirical comparison of performance. Since the results presented in the paper by Deudon et al. (2018) are not directly comparable, we ran their code (https://github.com/MichelDeudon/encodeattendnavigate) and report results under the same circumstances: using greedy decoding and sampling 1280 solutions on our test dataset (which has exactly the same generative procedure, i.e. uniform in the unit square). Additionally, we include results of their model with 2OPT, showing that (even without 2OPT) final performance of our model is better. We use the hyperparameters in their code, but increase the batch size to 512 and increase the number of training steps for a fair comparison (this increased the performance of their model). As training with $n = 100$ gave out-of-memory errors, we train only on $n = 20$ and $n = 50$ and (following Deudon et al. (2018)) report results for $n = 100$ using the model trained for $n = 50$. The training time as well as test run times are comparable.
We found in general that using a larger learning rate works better with decay but may be unstable in some cases. A smaller learning rate is more stable and does not require decay. This is illustrated in Figure 6, which shows validation results over time using both learning rates, with and without decay, for TSP20 and TSP50 (2 seeds). As can be seen, without decay the method has not yet fully converged after 100 epochs and results may improve even further with longer training.
Table 2 shows the results in absolute terms as well as the relative optimality gap compared to Gurobi, for all runs using seeds 1234 and 1235 and with the two different learning rate schedules. We did not run final experiments for $n = 100$ with the larger learning rate as we found training with the smaller learning rate to be more stable. It can be seen that in most cases the end results with different learning rate schedules are similar, except for the larger models ($N = 5$, $N = 8$) where some of the runs diverged using the larger learning rate. Experiments with different numbers of layers $N$ show that $N = 3$ and $N = 5$ achieve best performance, and we find $N = 3$ is a good trade-off between quality of the results and computational complexity (runtime) of the model.
  epoch time  seed = 1234  seed = 1235  seed = 1234  seed = 1235  
TSP20  5:30  
TSP50  16:20  
TSP100 (2GPUs)  27:30      
N = 0  3:10  
N = 1  3:50  
N = 2  5:00  
N = 3  5:30  
N = 5  7:00  
N = 8  10:10  
AM / Exponential  4:20  
AM / Critic  6:10  
AM / Rollout  5:30  
PN / Exponential  5:10  
PN / Critic  7:30  
PN / Rollout  6:40 
We test generalization performance on different $n$ than trained for, which we plot in Figure 5 in terms of the relative optimality gap compared to Gurobi. The train sizes are indicated with vertical marker bars. The models generalize when tested on different sizes, although quality degrades as the difference becomes bigger, which can be expected as there is no free lunch (Wolpert & Macready, 1997). Since the architectures are the same, these differences mean the models learn to specialize on the problem sizes trained for. We can make a strong overall algorithm by selecting the trained model with highest validation performance for each instance size (marked in Figure 5 by the red bar). For reference, we also include the baselines. For the methods that perform search or sampling, we do not connect the dots, both to prevent clutter and to make clear the distinction with methods that consider only a single solution.


The Capacitated Vehicle Routing Problem (CVRP) is a generalization of the TSP in which there is a depot and multiple routes should be created, each starting and ending at the depot. In our graph-based formulation, we add a special depot node with index 0 and coordinates $x_0$. A vehicle (route) has capacity $D > 0$ and each (regular) node $i \in \{1, \ldots, n\}$ has a demand $0 < \delta_i \leq D$. Each route starts and ends at the depot and the total demand in each route should not exceed the capacity, so $\sum_{i \in R_j} \delta_i \leq D$, where $R_j$ is the set of node indices assigned to route $j$. Without loss of generality, we assume a normalized $\hat{D} = 1$ as we can use normalized demands $\hat{\delta}_i = \frac{\delta_i}{D}$.
The Split Delivery VRP (SDVRP) is a generalization of CVRP in which every node can be visited multiple times, and only a subset of the demand has to be delivered at each visit. Instances for both CVRP and SDVRP are specified in the same way: an instance with size $n$ consists of a depot location $x_0$, $n$ node locations $x_i$, $i = 1, \ldots, n$, and (normalized) demands $0 < \hat{\delta}_i \leq 1$, $i = 1, \ldots, n$.
We follow Nazari et al. (2018) in the generation of instances for $n = 20, 50, 100$, but normalize the demands by the capacities. The depot location as well as $n$ node locations are sampled uniformly at random in the unit square. The demands are defined as $\hat{\delta}_i = \frac{\bar{\delta}_i}{D^n}$ where $\bar{\delta}_i$ is discrete and sampled uniformly from $\{1, \ldots, 9\}$ and $D^{20} = 30$, $D^{50} = 40$ and $D^{100} = 50$.
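The instance generation procedure could look roughly like the sketch below. The function name, the `CAPACITY` table and the use of Python's `random` module are choices made for this illustration; in particular, the integer demand range 1 to 9 and the capacities for sizes 20 and 50 are assumptions based on our reading of the setup of Nazari et al. (2018).

```python
import random

# Vehicle capacity per problem size (the entries for 20 and 50 are assumptions).
CAPACITY = {20: 30, 50: 40, 100: 50}


def generate_cvrp_instance(n, rng=None):
    """Sample a depot, n node locations and n normalized demands."""
    rng = rng or random.Random()
    depot = (rng.random(), rng.random())  # uniform in the unit square
    nodes = [(rng.random(), rng.random()) for _ in range(n)]
    # Integer demands (assumed range 1..9), normalized by the capacity.
    demands = [rng.randint(1, 9) / CAPACITY[n] for _ in range(n)]
    return depot, nodes, demands
```

Passing a seeded `random.Random` instance makes the generated instances reproducible, which is convenient when building fixed validation or test sets.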
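In order to allow our Attention Model to distinguish the depot node from the regular nodes, we use separate parameters $W_0^x$ and $b_0^x$ to compute the initial embedding of the depot node. Additionally, we provide the normalized demand $\hat{\delta}_i$ as input feature (and adjust the size of parameter $W^x$ accordingly):
$$h_i^{(0)} = \begin{cases} W_0^x x_i + b_0^x & i = 0 \\ W^x [x_i, \hat{\delta}_i] + b^x & i = 1, \ldots, n. \end{cases} \tag{19}$$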
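To facilitate the capacity constraints, we keep track of the remaining demands $\hat{\delta}_{i,t}$ for the nodes $i \in \{1, \ldots, n\}$ and remaining vehicle capacity $\hat{D}_t$ at time $t$. At $t = 1$, these are initialized as $\hat{\delta}_{i,t=1} = \hat{\delta}_i$ and $\hat{D}_{t=1} = 1$, after which they are updated as follows (recall that $\pi_t$ is the index of the node selected at decoding step $t$):
$$\hat{\delta}_{i,t+1} = \begin{cases} \max(0, \hat{\delta}_{i,t} - \hat{D}_t) & \pi_t = i \\ \hat{\delta}_{i,t} & \pi_t \neq i \end{cases} \tag{20}$$
$$\hat{D}_{t+1} = \begin{cases} \max(\hat{D}_t - \hat{\delta}_{\pi_t,t}, 0) & \pi_t \neq 0 \\ 1 & \pi_t = 0. \end{cases} \tag{21}$$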
If we do not allow split deliveries, $\hat{\delta}_{i,t}$ will be either 0 or $\hat{\delta}_i$ for all $t$.
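The bookkeeping of remaining demands and remaining vehicle capacity can be sketched as a single state-transition function. The function name and the convention of storing the depot at index 0 with demand 0 are assumptions made for this illustration.

```python
def step_vrp_state(remaining_demand, remaining_capacity, selected):
    """Return the updated (remaining demands, remaining capacity).

    remaining_demand[0] belongs to the depot and is assumed to be 0.
    Capacities and demands are normalized so that a full vehicle has capacity 1.
    """
    demand = list(remaining_demand)
    if selected == 0:
        # Returning to the depot refills the vehicle.
        return demand, 1.0
    before = demand[selected]
    # Deliver as much as the vehicle still carries; the rest stays open.
    demand[selected] = max(0.0, before - remaining_capacity)
    return demand, max(remaining_capacity - before, 0.0)
```

When a node's demand fits in the remaining capacity it drops to zero in one visit; only when split deliveries are allowed (and the demand exceeds the remaining capacity) does a positive remainder stay behind for a later visit.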
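The context for the decoder for the VRP at time $t$ is the current/last location $\pi_{t-1}$ and the remaining capacity $\hat{D}_t$. Compared to TSP, we do not need placeholders if $t = 1$ as the route starts at the depot and we do not need to provide information about the first node as the route should end at the depot:
$$h_{(c)}^{(N)} = \begin{cases} \left[\bar{h}^{(N)}, h_{\pi_{t-1}}^{(N)}, \hat{D}_t\right] & t > 1 \\ \left[\bar{h}^{(N)}, h_0^{(N)}, \hat{D}_t\right] & t = 1. \end{cases} \tag{22}$$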
The depot can be visited multiple times, but we do not allow it to be visited at two subsequent timesteps. Therefore, in both layers of the decoder, we change the masking for the depot $j = 0$ and define $u_{(c)0} = -\infty$ if (and only if) $t = 1$ or $\pi_{t-1} = 0$. The masking for the nodes depends on whether we allow split deliveries. Without split deliveries, we do not allow nodes to be visited if their remaining demand is 0 (if the node was already visited) or exceeds the remaining capacity, so for $j \neq 0$ we define $u_{(c)j} = -\infty$ if (and only if) either $\hat{\delta}_{j,t} = 0$ or $\hat{\delta}_{j,t} > \hat{D}_t$. With split deliveries, we only forbid delivery when the remaining demand is 0, so we define $u_{(c)j} = -\infty$ if (and only if) $\hat{\delta}_{j,t} = 0$.
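The masking rules can be written down directly as a boolean mask, where a masked entry corresponds to a compatibility of minus infinity in the decoder. This is a hedged sketch: the function name, argument names and the boolean representation are assumptions, and it follows the rules exactly as stated in the text.

```python
def vrp_mask(remaining_demand, remaining_capacity, prev_node, t,
             split_delivery=False):
    """Return mask[j] == True if node j may NOT be selected at step t.

    Index 0 is the depot; remaining_demand[0] is unused.
    """
    n_total = len(remaining_demand)
    mask = [False] * n_total
    # The depot is masked at the first step and directly after a depot visit.
    mask[0] = (t == 1) or (prev_node == 0)
    for j in range(1, n_total):
        served = remaining_demand[j] == 0
        if split_delivery:
            mask[j] = served
        else:
            mask[j] = served or remaining_demand[j] > remaining_capacity
    return mask
```

In a neural implementation, such a mask would typically be applied by setting the corresponding logits to a very large negative value before the softmax.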
Without split deliveries, the remaining demand $\hat{\delta}_{i,t}$ is either 0 or $\hat{\delta}_i$, corresponding to whether the node has been visited or not, and this information is conveyed to the model via the masking of the nodes already visited. However, when split deliveries are allowed, the remaining demand $\hat{\delta}_{i,t}$ can take any value $0 \leq \hat{\delta}_{i,t} \leq \hat{\delta}_i$. This information cannot be included in the context node as it corresponds to individual nodes. Therefore we include it in the computation of the keys and values in both the attention layer (glimpse) and the output layer of the decoder, such that we compute queries, keys and values using:
$$q_{(c)} = W^Q h_{(c)}, \quad k_i = W^K h_i + W_d^K \hat{\delta}_{i,t}, \quad v_i = W^V h_i + W_d^V \hat{\delta}_{i,t}. \tag{23}$$
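Here $W_d^K$ and $W_d^V$ are $(d_k \times 1)$ and $(d_v \times 1)$ parameter matrices and we define $\hat{\delta}_{i,t} = 0$ for the depot $i = 0$. Summing the projection of both $h_i$ and $\hat{\delta}_{i,t}$ is equivalent to projecting the concatenation $[h_i, \hat{\delta}_{i,t}]$ with a single $(d_k \times (d_h + 1))$ matrix. However, using this formulation we only need to compute the first term once (instead of for every $t$) and by the weight initialization this puts more importance on $\hat{\delta}_{i,t}$ initially (which is otherwise just 1 of $d_h + 1$ input values).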
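The claimed equivalence between summing two projections and projecting a concatenation is easy to check numerically. The sketch below uses hypothetical helper names and tiny hand-picked matrices; it shows that the embedding term can be computed once and reused while only the scalar demand term changes per step.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def key_split(W_k, w_d, h, delta):
    """Key as a sum: projected embedding plus demand times a single column."""
    base = matvec(W_k, h)  # independent of the step, so it can be cached
    return [b + wd * delta for b, wd in zip(base, w_d)]


def key_concat(W_k, w_d, h, delta):
    """Key from one matrix applied to the concatenation [h, delta]."""
    W_full = [row + [wd] for row, wd in zip(W_k, w_d)]  # append the column
    return matvec(W_full, h + [delta])
```

Both functions return the same key; the split form avoids recomputing the matrix-vector product with the (much larger) embedding at every decoding step.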
For the VRP, the length of the output of the model depends on the number of times the depot is visited. In general, the depot is visited multiple times, and in the case of SDVRP also some regular nodes are visited twice. Therefore the length of the solution is larger than $n$, which requires more memory such that we find it necessary to limit the batch size to 256 for $n = 100$ (on 2 GPUs). To keep training times tractable and the total number of parameter updates equal, we still process 2500 batches per epoch, for a total of 0.64M training instances per epoch.
For LKH3 (http://akira.ruc.dk/~keld/research/LKH3/) by Helsgaun (2017) we build and run their code with the SPECIAL parameter as specified in their CVRP runscript (run_CVRP in http://akira.ruc.dk/~keld/research/LKH3/BENCHMARKS/CVRP.tgz). We perform 1 run with a maximum of 10000 trials, as we found performing 10 runs only marginally improves the quality of the results while taking much more time.
Figure 7 shows example solutions for the CVRP with $n = 100$ that were obtained by a single construction using the model with greedy decoding. These visualizations give insight into the heuristic that the model has learned. In general we see that the model constructs the routes from the bottom to the top, starting below the depot. Most routes are densely packed, except for the last route that has to serve some remaining (close to each other) customers. In most cases, the node in the route that is farthest from the depot is somewhere in the middle of the route, such that customers are served on the way to and from the farthest nodes. In some cases, we see that the order of stops within some individual routes is suboptimal, which means that the method will likely benefit from simple further optimizations on top, such as a beam search, a post-processing procedure based on local search (e.g. 2OPT) or solving the individual routes using a TSP solver.






In the Orienteering Problem (OP) each node has a prize $\rho_i$ and the goal is to maximize the total prize of nodes visited, while keeping the total length of the route below a maximum length $T$. This problem is different from the TSP and the VRP because visiting each node is optional. Similar to the VRP, we add a special depot node with index 0 and coordinates $x_0$. If the model selects the depot, we consider the route to be finished. In order to prevent infeasible solutions, we only allow a node to be visited if after visiting that node a return to the depot is still possible within the maximum length constraint. Note that it is always suboptimal to visit the depot if additional nodes can be visited, but we do not enforce this knowledge.
The depot location as well as node locations are sampled uniformly at random in the unit square. For the distribution of the prizes, we consider three different variants described by Fischetti et al. (1998), but we normalize the prizes such that the normalized prizes are between 0 and 1.
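Constant: $\rho_i = 1$. Every node has the same prize so the goal becomes to visit as many nodes as possible within the length constraint.
Uniform: $\rho_i = \frac{\hat{\rho}_i}{100}$, $\hat{\rho}_i \sim \text{DiscreteUniform}(1, 100)$. Every node has a prize that is (discretized) uniform.
Distance: $\rho_i = \frac{\hat{\rho}_i}{100}$, $\hat{\rho}_i = 1 + \left\lfloor 99 \cdot \frac{d_{0i}}{\max_{j=1}^{n} d_{0j}} \right\rfloor$, where $d_{0i}$ is the distance from the depot to node $i$. Every node has a (discretized) prize that is proportional to the distance to the depot. This is designed to be challenging as the largest prizes are furthest away from the depot (Fischetti et al., 1998).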
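The three prize distributions might be generated as in the sketch below. The function name and the variant labels are hypothetical, and the discretization into 100 levels (integer prizes from 1 to 100, divided by 100) is an assumption made for this illustration; the sketch also assumes at least one node does not coincide with the depot.

```python
import math
import random


def op_prizes(depot, nodes, variant, rng=None):
    """Return normalized prizes in (0, 1] for the given variant."""
    rng = rng or random.Random()
    if variant == "constant":
        # Every node has the same prize.
        return [1.0 for _ in nodes]
    if variant == "uniform":
        # Discretized uniform prizes (assumed: integers 1..100, then scaled).
        return [rng.randint(1, 100) / 100 for _ in nodes]
    if variant == "distance":
        # Prize grows with the distance to the depot (assumed 100 levels).
        dists = [math.dist(depot, x) for x in nodes]
        d_max = max(dists)
        return [(1 + math.floor(99 * d / d_max)) / 100 for d in dists]
    raise ValueError(f"unknown variant: {variant}")
```

In the distance variant the node farthest from the depot always receives the maximum prize of 1, which is what makes these instances hard: the most valuable nodes are the most expensive to reach.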
The maximum length $T^n$ for instances with $n$ nodes (and a depot) is chosen to be (on average) approximately half of the length of the average TSP tour for uniform TSP instances with $n$ nodes (the average length of the optimal TSP tour is 3.84, 5.70 and 7.76 for $n = 20, 50, 100$). The idea is that this way approximately (a little more than) half of the nodes can be visited, which results in the most difficult problem instances (Vansteenwegen et al., 2011). This is because the number of possible node selections $\binom{n}{k}$ is maximized if $k = \frac{n}{2}$ and additionally determining the actual path is harder with more nodes selected. We set fixed maximum lengths $T^{20} = 2$, $T^{50} = 3$ and $T^{100} = 4$ instead of adjusting the constraint per instance, such that for some instances more or fewer nodes can be visited. Note that $T$ has the same unit as the node coordinates $x_i$, so we do not normalize them.
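Similar to the VRP, we use separate parameters for the depot node embedding. Additionally, we provide the node prize $\rho_i$ as input feature:
$$h_i^{(0)} = \begin{cases} W_0^x x_i + b_0^x & i = 0 \\ W^x [x_i, \rho_i] + b^x & i = 1, \ldots, n. \end{cases} \tag{24}$$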
In order to satisfy the max length constraint, we keep track of the remaining max length $T_t$ at time $t$. Starting at $t = 1$, $T_1 = T$. Then for $t > 0$, $T$ is updated as
$$T_{t+1} = T_t - d_{\pi_{t-1}, \pi_t}. \tag{25}$$
Here $d_{i,j}$ is the distance from node $i$ to $j$ and we conveniently define $\pi_0 = 0$ as we start at the depot.
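The remaining-length bookkeeping, together with the feasibility rule stated earlier for the OP (a node may only be visited if the depot can still be reached afterwards), can be sketched as follows. The function names and the representation of the visited set are assumptions made for this illustration; index 0 is taken to be the depot.

```python
import math


def op_step(coords, remaining_length, prev_node, selected):
    """Subtract the travelled distance from the remaining maximum length."""
    return remaining_length - math.dist(coords[prev_node], coords[selected])


def op_feasible_nodes(coords, remaining_length, prev_node, visited):
    """List the nodes that may be selected next (the depot is always allowed)."""
    feasible = [0]
    for j in range(1, len(coords)):
        if j in visited:
            continue
        # Going to j and then back to the depot must fit in the remaining length.
        needed = math.dist(coords[prev_node], coords[j]) + math.dist(coords[j], coords[0])
        if needed <= remaining_length:
            feasible.append(j)
    return feasible
```

In practice a small tolerance may be needed in the comparison to guard against floating-point rounding when a route uses the length budget almost exactly.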
The context for the decoder for the OP at time $t$ is the current/last location $\pi_{t-1}$ and the remaining max length $T_t$. Similar to VRP, we do not need placeholders if $t = 1$ as the route starts at the depot and we do not need to provide information about the first node as the route should end at the depot. We do not need to provide information on the prizes gathered as this is irrelevant for the remaining decisions. The context is defined as:
$$h_{(c)}^{(N)} = \begin{cases} \left[\bar{h}^{(N)}, h_{\pi_{t-1}}^{(N)}, T_t\right] & t > 1 \\ \left[\bar{h}^{(N)}, h_0^{(N)}, T_t\right] & t = 1. \end{cases} \tag{26}$$
In the OP, the depot node can always be visited so is never masked. Regular nodes are masked (i.e. cannot be visited) if either they are already visited or if they cannot be visited within the remaining length constraint: