1 Introduction
The vehicle routing problem (VRP) [12] is a well-known combinatorial optimization problem in which the objective is to find a set of routes with minimal total cost. For every route, the total demand cannot exceed the capacity of the vehicle. In the literature, algorithms for solving VRP can be divided into exact and heuristic algorithms. Exact algorithms provide solutions with optimality guarantees but are infeasible for large-scale instances due to their high computational complexity, while heuristic algorithms are often fast but come without theoretical guarantees. Considering the trade-off between optimality and computational cost, heuristic algorithms can find a suboptimal solution within an acceptable running time for large-scale instances. However, it is nontrivial to design a good heuristic algorithm, since doing so requires substantial problem-specific expert knowledge and handcrafted features. Designing a heuristic algorithm is a tedious process; can we learn a heuristic automatically without human intervention?
Motivated by recent advancements in machine learning, especially deep learning, there have been some works [3, 2, 15, 8, 5, 10] on using end-to-end neural networks to learn heuristics directly from data, without any hand-engineered reasoning. Specifically, taking VRP as an example, as shown in Fig. 1, an instance is a set of nodes, and the optimal solution is a permutation of these nodes, which can be seen as a sequence of decisions. Therefore, VRP can be viewed as a decision-making problem that can be solved by reinforcement learning. From the perspective of reinforcement learning, typically, the state is the partial solution of the instance together with the features of each node, the action is the choice of the next node to visit, the reward is the negative tour length, and the policy corresponds to a heuristic strategy parameterized by a neural network. The policy is then trained to make decisions that maximize the reward. From the perspective of learning heuristics, given instances drawn from some distribution, a heuristic is learned to solve unseen instances from the same distribution.

Recently, an attention model (AM) [10] was proposed to solve routing problems. In AM, an instance is viewed as a graph, and node features are extracted to represent the complex graph structure, capturing the properties of each node in the context of its graph neighborhood. Based on these node features, the solution is constructed incrementally. In AM, the node features are encoded as an embedding that is fixed over time. However, at different construction steps, the state of the instance changes according to the decisions the model has made, and the node features should be updated correspondingly.
This paper proposes a dynamic attention model (AMD) with a dynamic encoder-decoder architecture. The key improvement is to characterize each node dynamically in the context of the graph, which allows the model to explore and exploit hidden structure information effectively at different construction steps. To demonstrate the effectiveness of the proposed method, AMD is applied to a challenging combinatorial optimization problem, the vehicle routing problem. The numerical experiments indicate that AMD performs significantly better than AM and clearly decreases the optimality gap.
2 Related Work
Learning-based heuristic methods proposed in the last several years can be divided into two categories in terms of the types of problems solved. The first category focuses on permutation-based combinatorial optimization problems, such as VRP and TSP. The second category addresses 0-1 based combinatorial optimization problems, such as SAT and the knapsack problem.
For the first category, the pointer network (PN) is introduced in [23]; it treats combinatorial optimization problems as sequence-to-sequence problems, where the input is a sequence of nodes and the output is a permutation of the input. PN overcomes the limitation that the output length depends on the input by means of a “pointer”, a variant of the attention mechanism [1]. This sequence-to-sequence model [20] is trained in a supervised manner, with labels given by an approximate solver.
However, PN is sensitive to the quality of the labels, and optimal solutions are expensive to obtain. In [2], the neural combinatorial optimization framework is proposed to solve combinatorial optimization problems, and the REINFORCE algorithm [28] is used to train a policy modeled by PN without supervised signals. In [15], the LSTM encoder of PN is replaced by element-wise projections, which are invariant to the input order and do not introduce redundant sequential information.
In [8], combinatorial optimization is treated as a graph problem, and graph embedding [4] is used to capture combinatorial structure information between nodes. The model is trained by 1-step DQN [14], which is data-efficient, and the solution is constructed with a helper function.
In [5] and [10], a graph attention network [22] is used to extract the features of each node in the graph structure. In [5], an explicit forgetting mechanism is introduced to construct a solution, requiring only the last three selected nodes per step. The constructed solution is then improved by 2-OPT local search [7]. In [10], a context vector is introduced to represent the decoding context, and the model is trained by the REINFORCE algorithm with a deterministic greedy rollout baseline.
For the second category, in [11], a graph convolutional network [17, 9] is trained to estimate, for each node in the instance, the likelihood that the node is part of the optimal solution. In addition, a tree search is used to construct a large number of candidate solutions. In [13], GCOMB is proposed to solve combinatorial optimization problems over large graphs, based on a graph convolutional network and Q-learning. In [18] and [16], the model is treated as a classifier. In [18], a message passing neural network [6] is trained to predict satisfiability on SAT problems. In [16], a graph neural network is used to solve the decision variant of TSP.

Since this study is targeted at solving VRP, [15, 10] are the works most closely related to this paper. AM, proposed recently in [10] for VRP, is introduced as follows.
3 Attention Model for VRP
3.1 Problem Formulation and Preliminaries
This paper focuses on VRP. In its simplest form, a single capacitated vehicle is responsible for delivering items to multiple customer nodes, and the vehicle must return to the depot to pick up additional items when it runs out of load. The solution can be seen as a set of routes, each of which begins and ends at the depot.

Specifically, a VRP instance is a set of nodes $X = \{x_0, x_1, \dots, x_n\}$, where $x_0$ is the depot. Each node consists of two elements, $x_i = (c_i, d_i)$, where $c_i$ is the 2-dimensional coordinate of node $i$ in Euclidean space and $d_i$ is its demand ($d_0 = 0$). The solution is a sequence $\pi = (\pi_1, \dots, \pi_T)$, in which each customer node is visited exactly once while the depot can be visited multiple times. $T$ is the length of the sequence, which may vary across solutions.
VRP can be viewed as a sequential decision-making problem, and the encoder-decoder architecture [20] is an effective framework for solving such problems. Taking neural machine translation (NMT) as an example, as shown in Fig. 2, the encoder extracts syntactic structure and semantic information from the source-language text; the decoder then constructs the target-language text from the features given by the encoder. Fig. 2 shows that the encoder-decoder architecture can also be applied to VRP. Firstly, the structural features of the input instance are extracted by the encoder. Then the solution is constructed incrementally by the decoder. Specifically, at each construction step, the decoder predicts a distribution over nodes, and one node is selected and appended to the end of the partial solution. Hence, given the parameters $\theta$ and an input instance $s$, the probability of a solution $\pi$ can be decomposed by the chain rule as:
$p_\theta(\pi \mid s) = \prod_{t=1}^{T} p_\theta(\pi_t \mid s, \pi_{1:t-1})$ (1)
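As a small illustration of Eq. (1), the probability of a full solution factorizes into per-step selection probabilities, which in practice are accumulated in log space. A minimal sketch (the function name and the toy probabilities are ours):

```python
import math

def solution_log_prob(step_probs):
    """Chain-rule decomposition of Eq. (1): the log-probability of a full
    solution is the sum of the log-probabilities of its construction steps."""
    return sum(math.log(p) for p in step_probs)

# Three construction steps with per-step selection probabilities.
log_p = solution_log_prob([0.5, 0.4, 0.25])  # equals log(0.5 * 0.4 * 0.25)
```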
3.2 Encoder
In the encoder, a graph attention network is used to encode the node features into embeddings in the context of the graph. It is similar to the encoder in the Transformer architecture [21]. Firstly, for each $d_x$-dimensional input node $x_i$ (for VRP, $d_x = 3$: the coordinate and the demand), the $d_h$-dimensional ($d_h = 128$) initial node embedding $h_i^{(0)}$ is computed through a linear transformation with learnable parameters $W^x$ and $b^x$; separate parameters $W_0^x$ and $b_0^x$ are used for the depot:

$h_i^{(0)} = \begin{cases} W_0^x x_i + b_0^x & i = 0 \\ W^x x_i + b^x & i = 1, \dots, n \end{cases}$ (2)
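The initial embedding of Eq. (2) is just an affine map with depot-specific weights. A minimal pure-Python sketch (the parameter shapes and the function name are illustrative; real implementations use batched matrix multiplication):

```python
def initial_embedding(x, W, b, W0, b0, is_depot):
    """Eq. (2): h_i^(0) = W x_i + b, with separate parameters (W0, b0)
    when node i is the depot. W is a list of weight rows; x is the raw
    feature vector (coordinate and demand for VRP)."""
    Wm, bm = (W0, b0) if is_depot else (W, b)
    return [sum(w * v for w, v in zip(row, x)) + bias
            for row, bias in zip(Wm, bm)]
```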
These initial node embeddings are fed into the first layer of the graph attention network and updated $N$ times through $N$ attention layers. Each layer consists of two sublayers: a multi-head attention (MHA) sublayer and a fully connected feed-forward (FF) sublayer.
3.2.1 MultiHead Attention Sublayer
As in [21], multi-head attention is used to extract different types of information. In layer $\ell \in \{1, \dots, N\}$, $h_i^{(\ell-1)}$ denotes the embedding of node $i$, and the output of layer $\ell-1$ is the input of layer $\ell$. The multi-head attention vector of each node $i$ can be computed as:
$q_i^m = W_m^Q h_i^{(\ell-1)}, \quad k_i^m = W_m^K h_i^{(\ell-1)}, \quad v_i^m = W_m^V h_i^{(\ell-1)}$ (3)

$u_{ij}^m = \frac{(q_i^m)^T k_j^m}{\sqrt{d_k}}$ (4)

$a_{ij}^m = \frac{e^{u_{ij}^m}}{\sum_{j'=0}^{n} e^{u_{ij'}^m}}$ (5)

$h_i'^m = \sum_{j=0}^{n} a_{ij}^m v_j^m$ (6)

$\mathrm{MHA}_i\big(h_0^{(\ell-1)}, \dots, h_n^{(\ell-1)}\big) = \sum_{m=1}^{M} W_m^O h_i'^m$ (7)
Here, the number of heads is set to $M = 8$. In each attention head $m \in \{1, \dots, M\}$, the query vector $q_i^m$, key vector $k_i^m$ and value vector $v_i^m$ are computed with parameters $W_m^Q$, $W_m^K$ and $W_m^V$ respectively, and the final vector is computed with $W_m^O$ ($d_k = d_h / M = 16$).

Remark: the parameters $W_m^Q$, $W_m^K$, $W_m^V$ and $W_m^O$ are not shared between layers, and the layer superscript is omitted for readability.
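A single attention head of Eqs. (4)-(6) can be sketched in a few lines of pure Python (helper names are ours; real implementations batch all heads and nodes with matrix operations):

```python
import math

def softmax(u):
    """Numerically stable softmax over a list of compatibilities (Eq. (5))."""
    m = max(u)
    e = [math.exp(x - m) for x in u]
    s = sum(e)
    return [x / s for x in e]

def attention_head(q, K, V, d_k):
    """One head: scaled dot-product compatibilities (Eq. (4)), softmax
    attention weights (Eq. (5)), and the weighted sum of values (Eq. (6))."""
    u = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
    a = softmax(u)
    return [sum(ai * v[j] for ai, v in zip(a, V)) for j in range(len(V[0]))]
```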
3.2.2 FeedForward Sublayer
In this sublayer, for each node $i$, the node embedding $h_i^{(\ell)}$ is computed from the multi-head attention vector by a skip connection and a fully connected feed-forward (FF) network. For each node $i$:
$\hat{h}_i = \mathrm{BN}\big(h_i^{(\ell-1)} + \mathrm{MHA}_i(h_0^{(\ell-1)}, \dots, h_n^{(\ell-1)})\big)$ (8)

$\mathrm{FF}(\hat{h}_i) = W_1^F\, \mathrm{ReLU}\big(W_0^F \hat{h}_i + b_0^F\big) + b_1^F$ (9)

$h_i^{(\ell)} = \mathrm{BN}\big(\hat{h}_i + \mathrm{FF}(\hat{h}_i)\big)$ (10)
where $\mathrm{FF}(\hat{h}_i)$ is calculated with parameters $W_0^F$, $b_0^F$, $W_1^F$ and $b_1^F$ (the dimension of the hidden layer is 512), and $\mathrm{BN}$ denotes batch normalization.
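The residual structure of Eqs. (8)-(10) can be sketched as follows, with batch normalization omitted for brevity; `linear1` and `linear2` are stand-ins for the affine maps with parameters $W_0^F, b_0^F$ and $W_1^F, b_1^F$:

```python
def ff_sublayer(h, mha, linear1, linear2):
    """Sketch of Eqs. (8)-(10) without normalization: a skip connection
    around the attention output, then a ReLU feed-forward network with a
    second skip connection."""
    h_hat = [a + b for a, b in zip(h, mha)]              # Eq. (8), no BN
    ff = linear2([max(0.0, v) for v in linear1(h_hat)])  # Eq. (9)
    return [a + b for a, b in zip(h_hat, ff)]            # Eq. (10), no BN
```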
After $N$ attention layers, for each node $i$, the final node embedding is:

$h_i = h_i^{(N)}$ (11)
Fig. 3 illustrates the flow of messages between nodes. By aggregating the messages from all nodes, the embedding of each node is updated according to the attention mechanism.
3.3 Decoder
In the decoder, at each construction step $t$, one node is selected to visit based on the partial solution and the embedding of each node. As in [10], the context vector is computed by an $M$-head attention mechanism. Firstly, for VRP, a new vector $h_c$ is constructed as:
$h_c = [\bar{h};\, h_{\pi_{t-1}};\, D_t]$ (12)
where $[\,\cdot\,;\,\cdot\,]$ is the concatenation operator, $h_{\pi_{t-1}}$ is the embedding of the node selected at construction step $t-1$, $D_t$ is the remaining capacity of the vehicle ($D_1 = D$), and $\bar{h}$ is the graph embedding, i.e., the mean of the embeddings over the nodes that have not been visited (including the depot) at construction step $t$. Similar to the encoder, the context vector $h_c'$ is computed with a single $M$-head attention layer, except that only a single query $q_c$ (per head) is computed (the parameters are not shared with the encoder):
$q_c^m = W_m^Q h_c, \quad k_i^m = W_m^K h_i, \quad v_i^m = W_m^V h_i$ (13)

$u_{ci}^m = \begin{cases} \frac{(q_c^m)^T k_i^m}{\sqrt{d_k}} & \text{if node } i \text{ is not masked} \\ -\infty & \text{otherwise} \end{cases}$ (14)

$a_{ci}^m = \frac{e^{u_{ci}^m}}{\sum_{j} e^{u_{cj}^m}}$ (15)

$h_c^m = \sum_{i} a_{ci}^m v_i^m$ (16)

$h_c' = \sum_{m=1}^{M} W_m^O h_c^m$ (17)
As shown in Eq. (14), in order to construct a feasible solution, any node that violates the constraints is masked. For VRP, the following masking conditions are used. First, a customer node whose demand is greater than the remaining capacity of the vehicle is masked. Second, a customer node that has already been visited is masked.

Remark: the depot can be visited multiple times; it is masked only when the node selected at the previous construction step is the depot, i.e., $\pi_{t-1} = 0$.
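The masking rules above can be written out directly. A minimal sketch (function and argument names are ours; `True` means the node is masked):

```python
def feasibility_mask(demands, remaining, visited, prev):
    """VRP decoding mask: a customer node is masked if it has already been
    visited or its demand exceeds the remaining capacity; the depot
    (index 0) is masked only directly after a depot visit."""
    mask = [prev == 0]  # depot entry: no consecutive depot visits
    for i, d in enumerate(demands[1:], start=1):
        mask.append(i in visited or d > remaining)
    return mask
```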
Finally, the probability $p_\theta(\pi_t \mid s, \pi_{1:t-1})$ is computed with a single-head attention layer:
$q = W^Q h_c', \quad k_i = W^K h_i$ (18)

$u_{ci} = \begin{cases} C \cdot \tanh\!\left(\frac{q^T k_i}{\sqrt{d_k}}\right) & \text{if node } i \text{ is not masked} \\ -\infty & \text{otherwise} \end{cases}$ (19)

$p_\theta(\pi_t = i \mid s, \pi_{1:t-1}) = \frac{e^{u_{ci}}}{\sum_{j} e^{u_{cj}}}$ (20)
where $\tanh$ is used to clip the result within $[-C, C]$ ($C = 10$). If node $\pi_t$ is selected to visit at construction step $t$, the remaining capacity is updated as:

$D_{t+1} = \begin{cases} D_t - d_{\pi_t} & \pi_t \neq 0 \\ D & \pi_t = 0 \end{cases}$ (21)
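The clipped output distribution and the capacity update can be sketched together (function names are ours; masked nodes receive probability zero through the $-\infty$ compatibility):

```python
import math

def step_distribution(u, mask, C=10.0):
    """Eqs. (19)-(20): clip compatibilities to [-C, C] with tanh, assign
    -inf to masked nodes, and normalize with a softmax."""
    clipped = [-math.inf if m else C * math.tanh(x) for x, m in zip(u, mask)]
    mx = max(clipped)
    e = [math.exp(x - mx) for x in clipped]
    s = sum(e)
    return [x / s for x in e]

def update_capacity(remaining, full, node, demand):
    """Eq. (21): the capacity is refilled at the depot (node 0), otherwise
    reduced by the demand of the selected customer."""
    return full if node == 0 else remaining - demand
```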
Fig. 4 illustrates the details of the decoder at construction step $t$. According to the partial solution and the node embeddings, the context vector is computed by the attention mechanism. Based on the context vector and the embeddings of the remaining nodes, the decoder predicts a distribution over these nodes and selects one to visit.
4 Dynamic Attention Model for VRP
As mentioned in Section 3, the solution is constructed incrementally by the decoder. At different construction steps, the state of the instance changes, and the feature embedding of each node should be updated. As shown in Fig. 5, once the model has constructed a partial solution, the remaining nodes, which are not yet included in the partial solution, can be seen as a new instance. Constructing the remaining solution is equivalent to solving this new instance. Since some nodes have already been visited, the structure of this new instance differs from that of the original instance. Therefore, the structure information changes and the node features should be updated accordingly. But in the vanilla encoder-decoder architecture of AM for VRP, as shown in Fig. 6, the feature embedding of each node is computed only once, corresponding to the initial state of the instance. This paper proposes a dynamic encoder-decoder architecture that characterizes the feature embedding of each node dynamically at different construction steps.
The dynamic encoder-decoder architecture, as shown in Fig. 6, is similar to the vanilla encoder-decoder architecture. The key difference is that the embedding of each node is immediately recomputed whenever the vehicle returns to the depot. Specifically, for each node $i$, the embedding is updated at construction step $t$ as:

$h_i^t = \begin{cases} \mathrm{Encoder}\big(x_i \mid \pi_{1:t-1}\big) & \text{if } \pi_{t-1} = 0 \\ h_i^{t-1} & \text{otherwise} \end{cases}$ (22)

where $h_i^t$ is the embedding of node $i$ at construction step $t$, and the layer number is omitted. $h_i^t$ is computed with $N$ attention layers, similarly to Eq. (11). The only difference is that Eq. (4) is modified. In order to reflect that the structure of the instance has changed, the nodes that have already been visited are masked, and Eq. (4) becomes:

$u_{ij}^m = \begin{cases} \frac{(q_i^m)^T k_j^m}{\sqrt{d_k}} & \text{if node } j \text{ has not been visited} \\ -\infty & \text{otherwise} \end{cases}$ (23)
During decoding, at each step $t$, the computation of Eqs. (13)-(18) is based on the latest embedding $h_i^t$ of each node (the layer number is omitted, and $t$ is the construction step). As shown in Fig. 6, the entire architecture uses the encoder and decoder alternately to re-encode the node embeddings and construct a partial solution.
Given a distribution over nodes, there are two strategies to select the next node to visit. One is sample rollout, which selects a node by sampling from the distribution. The other is greedy rollout, which selects the node with maximum probability. The former is a stochastic policy and the latter is a deterministic policy.
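The alternation between encoder and decoder described above can be sketched as a small control loop. Here `encode` and `decode` are hypothetical stand-ins for the attention encoder (which masks visited nodes, Eq. (23)) and the per-step decoder (greedy or sampling); only the re-encoding logic is real:

```python
def construct_solution(encode, decode, n_customers):
    """Sketch of the dynamic encoder-decoder loop: node embeddings are
    recomputed each time the vehicle returns to the depot, and the decoder
    always uses the latest embeddings."""
    visited, tour, prev = set(), [], 0
    embeddings = encode(visited)
    while len(visited) < n_customers:          # until all customers served
        node = decode(embeddings, prev, visited)
        tour.append(node)
        if node == 0:
            embeddings = encode(visited)       # re-encode on depot return
        else:
            visited.add(node)
        prev = node
    return tour
```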
5 Model Training
In this paper, solving a combinatorial optimization problem is treated as a Markov decision process (MDP), and AMD is trained by policy gradient using the REINFORCE algorithm [28]. Given an instance $s$, the training objective is the expected tour length of the solution $\pi$. Hence, based on instance $s$, the gradient of the parameters $\theta$ is defined as:

$\nabla_\theta \mathcal{L}(\theta \mid s) = \mathbb{E}_{p_\theta(\pi \mid s)}\big[(L(\pi) - b(s))\, \nabla_\theta \log p_\theta(\pi \mid s)\big]$ (24)
where $L(\pi)$ is the tour length of solution $\pi$, and $b(s)$ is a baseline function estimating the expected tour length of instance $s$, which can reduce the variance of the gradients and accelerate convergence effectively. In this paper, as in [10], the tour length of the greedy solution, which is constructed by greedy rollout, is taken as $b(s)$.

During training, the instances are drawn from the same distribution. The gradient of the parameters is approximated by Monte Carlo sampling as:

$\nabla_\theta \mathcal{L}(\theta) \approx \frac{1}{B} \sum_{i=1}^{B} \big(L(\pi_i^s) - L(\pi_i^g)\big)\, \nabla_\theta \log p_\theta(\pi_i^s \mid s_i)$ (25)
where $B$ is the batch size, and $\pi_i^s$ and $\pi_i^g$ are the solutions of instance $s_i$ constructed by sample rollout and greedy rollout respectively. The training algorithm is described in Algorithm 1.
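The advantage term in Eq. (25) is simple enough to spell out: the greedy-rollout length acts as a per-instance baseline, so a sampled solution that beats the greedy one receives a negative weight and its log-probability is pushed up during gradient descent. A minimal sketch (the function name is ours):

```python
def reinforce_weights(sample_lengths, greedy_lengths):
    """Per-instance weights (L(pi_i^s) - L(pi_i^g)) from Eq. (25), using
    the greedy-rollout tour length as the baseline b(s_i)."""
    return [ls - lg for ls, lg in zip(sample_lengths, greedy_lengths)]
```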
6 Experiments
Experiments are conducted to investigate the performance of AMD on VRP with node sizes 20, 50 and 100. AMD consists of two phases: a training phase and a testing phase. For each problem, in the training phase, the model is trained for 30 epochs, with 10000 batches processed in each epoch. In the testing phase, the performance on 10000 test instances is reported, where the solution is constructed by greedy rollout, and the final result is the average tour length over all test instances.
6.1 Instances and Hyperparameters
As in [15] and [10], the instances are generated from a fixed distribution. For each node, the location is chosen uniformly at random from the unit square $[0,1] \times [0,1]$, and the demand is a discrete number in $\{1, \dots, 9\}$ chosen uniformly at random (the demand of the depot is 0). The capacity of the vehicle is $D = 30$ for VRP with 20 customer nodes (denoted as VRP20), $D = 40$ for VRP50, and $D = 50$ for VRP100, and the vehicle is located at the depot at $t = 1$. The same batch size and learning rate are used for VRP20 and VRP50, with a different setting for VRP100. Finally, for each problem, the experiment is conducted on GPU (a single 1080Ti for VRP20 and VRP50, and 3×1080Ti for VRP100).
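The instance distribution above is easy to reproduce. A sketch of the generator, assuming the demand range $\{1, \dots, 9\}$ used by [15, 10] (the function and key names are ours):

```python
import random

def generate_instance(n_customers, capacity, seed=None):
    """Sample one VRP instance: node 0 is the depot with zero demand,
    coordinates are uniform on the unit square, and customer demands are
    uniform over {1, ..., 9}."""
    rng = random.Random(seed)
    coords = [(rng.random(), rng.random()) for _ in range(n_customers + 1)]
    demands = [0] + [rng.randint(1, 9) for _ in range(n_customers)]
    return {"coords": coords, "demands": demands, "capacity": capacity}
```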
TABLE 1: Comparison results on VRP.

Method              VRP20, Cap30       VRP50, Cap40       VRP100, Cap50
                    Len      Gap       Len      Gap       Len      Gap
Gurobi              6.10     0.00%     --       --        --       --
LKH3                6.14     0.58%     10.38    0.00%     15.65    0.00%
RL (greedy) [15]    6.59     8.03%     11.39    9.78%     17.23    10.12%
AM (greedy) [10]    6.40     4.97%     10.98    5.86%     16.80    7.34%
AMD (greedy)        6.28     2.95%     10.78    3.85%     16.40    4.79%
AMD (2-OPT)         6.25     2.46%     10.73    3.37%     16.27    3.96%
AMD (n=20)          --       --        11.00    5.97%     17.37    10.99%
AMD (n=50)          6.48     6.23%     --       --        16.55    5.75%
AMD (n=100)         6.65     9.02%     11.04    6.36%     --       --
TABLE 2: Training and testing time of AMD.

                         VRP20     VRP50     VRP100
Training time            14h       58h       250h
Testing time (greedy)    0.29ms    2.51ms    15.92ms
Testing time (2-OPT)     0.05s     0.34s     2.21s
6.2 Results and Discussions
6.2.1 Comparison Results
TABLE 1 shows the results on VRP. Compared with AM, the performance of AMD is notably improved for VRP20 (by 2.02%), VRP50 (2.01%) and VRP100 (2.55%). AMD significantly outperforms the other baseline models as well.
The numerical experiments indicate that AMD performs better than AM and other baseline methods. AMD introduces a dynamic encoderdecoder architecture to explore structure features dynamically and exploit hidden structure information effectively at different construction steps. Hence, more hidden and useful structure information is taken into account, thereby leading to a better solution.
6.2.2 Generalization to Larger or Smaller Instances
How well do the learned heuristics generalize to test instances with larger or smaller customer node sizes? Experiments are conducted to investigate the generalization performance of AMD. Specifically, the models trained on instances with 20, 50 and 100 customer nodes are denoted as AMD (n=20), AMD (n=50) and AMD (n=100), respectively. AMD (n=20) is tested on instances with 50 and 100 customer nodes, AMD (n=50) on instances with 20 and 100 customer nodes, and AMD (n=100) on instances with 20 and 50 customer nodes.
The results are shown in the last three rows of TABLE 1. On the one hand, the model trained on small instances (n=20) performs well on large instances (n=50, n=100), and its results are even better than those of some baseline methods. On the other hand, the model trained on large instances (n=100) performs well on small instances (n=20, n=50) as well. The reason why AMD generalizes well may be as follows. AMD constructs the solution incrementally, and this process can be divided into many stages. At each stage, only a partial solution is constructed, and thus the instance is transformed into a smaller one, which is easier to solve.
6.2.3 Combination With Local Search
Local search is applied to further improve the results, as in [5]. Firstly, for each instance, a solution is constructed by AMD (greedy); then the 2-OPT local search algorithm is applied to improve this solution. The resulting method is named AMD (2-OPT), and the results are shown in TABLE 1. The runtimes of AMD (greedy) and AMD (2-OPT) are also given in TABLE 2. The results indicate that the quality of the solution is improved by integrating local search, but local search brings additional computational cost.
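A basic first-improvement 2-OPT pass over a single route can be sketched as follows (a generic textbook version, not the exact implementation used in the experiments):

```python
def route_length(route, dist):
    """Total length of a route given a symmetric distance matrix."""
    return sum(dist[a][b] for a, b in zip(route, route[1:]))

def two_opt(route, dist):
    """Plain 2-OPT local search on one route (depot fixed at both ends):
    repeatedly reverse the segment between two positions whenever the
    reversal shortens the route, until no improving move remains."""
    best = route[:]
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 2):
            for j in range(i + 1, len(best) - 1):
                cand = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                if route_length(cand, dist) < route_length(best, dist) - 1e-12:
                    best, improved = cand, True
    return best
```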
6.2.4 Discussions
Machine learning and optimization are closely related; machine learning is often used as an assistant or helper component to improve solution quality or reduce computational cost in many optimization algorithms [19]. In contrast, AMD aims to learn heuristics from data directly, without human intervention. This means that knowledge or features can be extracted from the given problem instances automatically. Specifically, given an optimization problem and its instances generated from some distribution, AMD can learn an approximation or heuristic algorithm and solve the problem on unseen instances generated from the same distribution.
AMD can be divided into training and testing phases, like most machine learning algorithms. The elapsed times for training and testing are shown in TABLE 2. Although the training process is time-consuming, it is an upfront, offline computation and can be seen as a search in algorithm space. The trained model can then be used directly to solve unseen instances without retraining from scratch, which is an online, even real-time, computation. Taking VRP20 as an example, the training phase takes 14 hours, but this process is one-time. Once the model has been trained, it only takes 0.29 milliseconds to solve each instance without retraining. Thus, AMD is different from classic heuristics, which search the solution space iteratively from scratch for each instance.
Since the training phase of AMD is time-consuming, the model is trained only on problem instances of small and medium size, due to the limitation of computing resources. Adopting existing parallel computing techniques to improve computational efficiency and scale to larger problem instances is a promising direction for future work.
7 Conclusion
This paper presents a dynamic attention model (AMD) with a dynamic encoder-decoder architecture for VRP. The key improvement is that the structural features of an instance are explored dynamically, and hidden structure information is exploited effectively, at different construction steps. Hence, more hidden and useful structure information is taken into account, leading to a better solution. AMD is tested on a challenging NP-hard problem, VRP. The results show that AMD performs better than AM and the other baseline models. In addition, AMD shows good generalization performance across different problem scales.
In the future, the proposed learning-based heuristic method, AMD, can be extended to solve some real-world complex VRP variants [29, 26, 24, 25, 27, 30], such as VRP with time windows, by hybridizing it with operations research methods, which may open a new era for combinatorial optimization algorithms [3].
Acknowledgment
This work is supported by the National Key R&D Program of China
(2018AAA0101203), and the National Natural Science Foundation of China (61673403, U1611262).
References
 [1] (2015) Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.
 [2] (2016) Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940.
 [3] (2018) Machine learning for combinatorial optimization: a methodological tour d'horizon. arXiv preprint arXiv:1811.06128.
 [4] (2016) Discriminative embeddings of latent variable models for structured data. In International Conference on Machine Learning, pp. 2702–2711.
 [5] (2018) Learning heuristics for the TSP by policy gradient. In International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research, pp. 170–181.
 [6] (2017) Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1263–1272.
 [7] (1990) Local optimization and the traveling salesman problem. In International Colloquium on Automata, Languages, and Programming, pp. 446–461.
 [8] (2017) Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, pp. 6348–6358.
 [9] (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations.
 [10] (2019) Attention, learn to solve routing problems! In International Conference on Learning Representations.
 [11] (2018) Combinatorial optimization with graph convolutional networks and guided tree search. In Advances in Neural Information Processing Systems, pp. 539–548.
 [12] (2019) New shades of the vehicle routing problem: emerging problem formulations and computational intelligence solution methods. IEEE Transactions on Emerging Topics in Computational Intelligence 3 (3), pp. 230–244.
 [13] (2019) Learning heuristics over large graphs via deep reinforcement learning. arXiv preprint arXiv:1903.03332.
 [14] (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
 [15] (2018) Reinforcement learning for solving the vehicle routing problem. In Advances in Neural Information Processing Systems, pp. 9839–9849.
 [16] (2018) Learning to solve NP-complete problems: a graph neural network for the decision TSP. arXiv preprint arXiv:1809.02721.
 [17] (2009) The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80.
 [18] (2019) Learning a SAT solver from single-bit supervision. In International Conference on Learning Representations.
 [19] (2019) A review on the self and dual interactions between machine learning and optimisation. Progress in Artificial Intelligence 8 (2), pp. 143–165.
 [20] (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112.
 [21] (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
 [22] (2018) Graph attention networks. In International Conference on Learning Representations.
 [23] (2015) Pointer networks. In Advances in Neural Information Processing Systems, pp. 2692–2700.
 [24] (2018) A hybrid multiobjective memetic algorithm for multiobjective periodic vehicle routing problem with time windows. IEEE Transactions on Systems, Man, and Cybernetics: Systems, pp. 1–14.
 [25] (2019) A two-stage multiobjective evolutionary algorithm for multiobjective multidepot vehicle routing problem with time windows. IEEE Transactions on Cybernetics 49 (7), pp. 2467–2478.
 [26] (2019) Multiobjective multiple neighborhood search algorithms for multiobjective fleet size and mix location-routing problem with time windows. IEEE Transactions on Systems, Man, and Cybernetics: Systems, pp. 1–15.
 [27] (2016) Multiobjective vehicle routing problems with simultaneous delivery and pickup and time windows: formulation, instances, and algorithms. IEEE Transactions on Cybernetics 46 (3), pp. 582–594.
 [28] (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3–4), pp. 229–256.
 [29] (2019) A graph-based fuzzy evolutionary algorithm for solving two-echelon vehicle routing problems. IEEE Transactions on Evolutionary Computation, pp. 1–1.
 [30] (2015) A local search-based multiobjective optimization algorithm for multiobjective vehicle routing problem with time windows. IEEE Systems Journal 9 (3), pp. 1100–1113.