The vehicle routing problem (VRP) is a well-known combinatorial optimization problem in which the objective is to find a set of routes with minimal total cost. For every route, the total demand cannot exceed the capacity of the vehicle. In the literature, algorithms for solving VRP can be divided into exact and heuristic algorithms. Exact algorithms provide solutions that are guaranteed to be optimal, but their high computational complexity makes them impractical for large-scale instances; heuristic algorithms are often fast but come without theoretical guarantees. Given this trade-off between optimality and computational cost, heuristic algorithms are the practical choice for large-scale instances, since they can find a suboptimal solution within an acceptable running time. However, it is non-trivial to design a good heuristic algorithm, since it requires substantial problem-specific expert knowledge and hand-crafted features. Since designing a heuristic algorithm is such a tedious process, can we learn a heuristic automatically without human intervention?
Motivated by recent advances in machine learning, especially deep learning, several works [3, 2, 15, 8, 5, 10] use end-to-end neural networks to learn heuristics directly from data without any hand-engineered reasoning. Taking VRP as an example, as shown in Fig. 1, the instance is a set of nodes, and the optimal solution is a permutation of these nodes, which can be seen as a sequence of decisions. Therefore, VRP can be viewed as a decision making problem that can be solved by reinforcement learning. From the perspective of reinforcement learning, the state is typically the partial solution of the instance together with the features of each node, the action is the choice of the next node to visit, the reward is the negative tour length, and the policy corresponds to a heuristic strategy parameterized by a neural network. The policy is then trained to make decisions that maximize the reward. From the perspective of learning heuristics, given instances drawn from a distribution, a heuristic is learned to solve unseen instances from the same distribution.
Recently, an attention model (AM) was proposed to solve routing problems. In AM, an instance is viewed as a graph, and node features are extracted to represent this complex graph structure, capturing the properties of each node in the context of its graph neighborhood. Based on these node features, the solution is constructed incrementally. In AM, the node features are encoded as an embedding that is fixed over time. However, at different construction steps, the state of the instance changes according to the decisions the model has made, and the node features should be updated correspondingly.
This paper proposes a dynamic attention model (AM-D) with a dynamic encoder-decoder architecture. The key improvement is to characterize each node dynamically in the context of the graph, which allows hidden structure information to be explored and exploited effectively at different construction steps. To demonstrate the effectiveness of the proposed method, AM-D is applied to a challenging combinatorial optimization problem, the vehicle routing problem. The numerical experiments indicate that AM-D performs significantly better than AM and clearly reduces the optimality gap.
2 Related Work
Learning-based heuristic methods proposed in the last several years can be divided into two categories according to the type of problem solved. The first category focuses on permutation-based combinatorial optimization problems, such as VRP and TSP. The second category solves 0-1 based combinatorial optimization problems, such as SAT and the knapsack problem.
For the first category, the pointer network (PN) treats combinatorial optimization problems as sequence-to-sequence problems, where the input is a sequence of nodes and the output is a permutation of the input. PN overcomes the limitation that the output length depends on the input by using a “pointer”, which is a variant of the attention mechanism. This sequence-to-sequence model is trained in a supervised manner, with labels given by an approximate solver.
However, PN is sensitive to the quality of the labels, and optimal solutions are expensive to obtain. To avoid this dependence, the neural combinatorial optimization framework was proposed, in which the REINFORCE algorithm is used to train a policy modeled by PN without supervised signals. In subsequent work, the LSTM encoder of PN is replaced by element-wise projections, which are invariant to the input order and do not introduce redundant sequential information.
In another line of work, combinatorial optimization is treated as a graph problem, and graph embedding is used to capture the combinatorial structure information between nodes. The model is trained by 1-step DQN, which is data-efficient, and the solution is constructed by a helper function.
Two further models use a graph attention network to extract the features of each node in the graph structure. In the first, an explicit forgetting mechanism is introduced to construct a solution, which requires only the last three selected nodes per step; the constructed solution is then improved by 2OPT local search. In the second, a context vector is introduced to represent the decoding context, and the model is trained by the REINFORCE algorithm with a deterministic greedy rollout baseline.
For the second category, a graph convolutional network is trained to estimate, for each node in the instance, the likelihood that this node is part of the optimal solution. In addition, tree search is used to construct a large number of candidate solutions. GCOMB has been proposed to solve combinatorial optimization problems over large graphs based on a graph convolutional network and Q-learning. In two other works, the model is used as a classifier: a message passing neural network is trained to predict satisfiability on SAT problems, and a graph neural network is used to solve the decision TSP.
3 Attention Model for VRP
3.1 Problem Formulation and Preliminaries
This paper focuses on VRP. In the simplest form of the VRP, a single capacitated vehicle is responsible for delivering items to multiple customer nodes, and the vehicle must return to the depot to pick up additional items when it runs out of load. The solution can be seen as a set of routes, each of which begins and ends at the depot.
Specifically, for a VRP instance, the input $X = \{x_0, x_1, \ldots, x_n\}$ is a set of nodes, and $x_0$ is the depot. Each node $x_i = (s_i, d_i)$ consists of two elements, where $s_i$ is the 2-dimensional coordinate of node $i$ in Euclidean space and $d_i$ is its demand ($d_0 = 0$). The solution $\pi = (\pi_1, \ldots, \pi_T)$ is a sequence in which each customer node is visited exactly once and the depot can be visited multiple times. $T$ is the length of the sequence, which may vary between solutions.
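The definition of a solution above can be made concrete with a small sketch. The function names and the convention that index 0 is the depot are assumptions made for illustration; the sketch only checks the two conditions stated in the text (each customer visited exactly once, route demand within capacity) and computes the tour length.

```python
import math

def tour_length(coords, solution):
    """Euclidean length of the visiting sequence, starting and ending at the depot (index 0)."""
    path = [0] + list(solution) + [0]
    return sum(math.dist(coords[a], coords[b]) for a, b in zip(path, path[1:]))

def is_feasible(demands, capacity, solution):
    """True if every customer appears exactly once and no route exceeds the capacity."""
    customers = [v for v in solution if v != 0]
    if sorted(customers) != list(range(1, len(demands))):
        return False
    load = 0
    for v in solution:
        if v == 0:
            load = 0  # a visit to the depot starts a new route
        else:
            load += demands[v]
            if load > capacity:
                return False
    return True

# Toy instance: depot at the origin and three customers on the unit square.
coords = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
demands = [0, 2, 2, 3]
solution = [1, 2, 0, 3]  # two routes: depot-1-2-depot and depot-3-depot
```

With a capacity of 4 the toy solution is feasible; with a capacity of 3 the first route is overloaded and the check fails.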
VRP can be viewed as a sequential decision making problem, and the encoder-decoder architecture is an effective framework for solving this kind of problem. Taking neural machine translation (NMT) as an example, as shown in Fig. 2, the encoder extracts syntactic structure and semantic information from the source language text, and the decoder then constructs the target language text from the features given by the encoder. Fig. 2 shows that the encoder-decoder architecture can also be applied to VRP. First, the structural features of the input instance are extracted by the encoder. Then the solution is constructed incrementally by the decoder. Specifically, at each construction step, the decoder predicts a distribution over nodes, and one node is selected and appended to the end of the partial solution. Hence, given the parameters $\theta$ and the input instance $X$, the probability of a solution $\pi = (\pi_1, \ldots, \pi_T)$ can be decomposed by the chain rule as:
$$p_\theta(\pi \mid X) = \prod_{t=1}^{T} p_\theta(\pi_t \mid X, \pi_{1:t-1}).$$
In the encoder, a graph attention network is used to encode the node features into an embedding in the context of the graph. It is similar to the encoder in the transformer architecture. First, for each $d_x$-dimensional input node $x_i$ (for VRP, $d_x = 3$: the coordinate and the demand), the $d_h$-dimensional initial node embedding $h_i^{(0)}$ is computed through a linear transformation with learnable parameters $W$ and $b$; separate parameters $W_0$ and $b_0$ are used for the depot:
$$h_i^{(0)} = \begin{cases} W_0 x_i + b_0, & i = 0, \\ W x_i + b, & i = 1, \ldots, n. \end{cases}$$
These initial node embeddings are fed into the first layer of the graph attention network and updated $N$ times by $N$ attention layers. Each layer consists of two sublayers: a multi-head attention (MHA) sublayer and a fully connected feed-forward (FF) sublayer.
3.2.1 Multi-Head Attention Sublayer
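As in the transformer, multi-head attention is used to extract different types of information. In layer $l$, $h_i^{(l)}$ denotes the embedding of node $i$, and the output of layer $l-1$ is the input of layer $l$. The multi-head attention vector of each node is computed from the embeddings of all nodes. With $M$ attention heads, in each head $m$ the query vector, key vector and value vector of a node are computed from its embedding with parameters $W^Q_m$, $W^K_m$ and $W^V_m$ respectively. The compatibility between the query of node $i$ and the key of node $j$ is normalized by a softmax into an attention weight, and the head output of node $i$ is the attention-weighted sum of the value vectors. The final multi-head attention vector is obtained by combining the head outputs with parameters $W^O_m$.
Remark: the parameters $W^Q_m$, $W^K_m$ and $W^V_m$ are not shared between layers, and the layer superscript is omitted for readability.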
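For reference, a standard form of multi-head attention that matches this description can be written as follows. The notation ($M$ heads, key dimension $d_k$, head index $m$) and the $1/\sqrt{d_k}$ scaling follow the transformer and are assumptions here, not a transcription of the paper's own equations.

```latex
\begin{aligned}
q_{im} &= W^Q_m h_i^{(l-1)}, \qquad k_{im} = W^K_m h_i^{(l-1)}, \qquad v_{im} = W^V_m h_i^{(l-1)}, \\
u_{ijm} &= \frac{q_{im}^{\top} k_{jm}}{\sqrt{d_k}}, \\
a_{ijm} &= \frac{\exp(u_{ijm})}{\sum_{j'} \exp(u_{ij'm})}, \\
h'_{im} &= \sum_{j} a_{ijm}\, v_{jm}, \\
\mathrm{MHA}_i\big(h_0^{(l-1)}, \ldots, h_n^{(l-1)}\big) &= \sum_{m=1}^{M} W^O_m h'_{im}.
\end{aligned}
```

Setting a compatibility $u_{ijm}$ to $-\infty$ gives node $j$ zero attention weight, which is how masking is usually expressed in this formulation.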
3.2.2 Feed-Forward Sublayer
In this sublayer, for each node $i$, the layer output $h_i^{(l)}$ is computed from the multi-head attention vector through skip-connections and a fully connected feed-forward (FF) network, where FF is calculated with two learnable weight matrices and two learnable bias vectors.
After the last attention layer, the final node embedding of each node $i$ is taken to be the output of that layer (Eq. (11)).
Fig. 3 illustrates the flow of messages between nodes. By aggregating the messages of the other nodes, the embedding of each node is updated according to the attention mechanism.
In the decoder, at each construction step $t$, one node is selected to visit based on the partial solution and the embedding of each node. As in AM, the context vector is computed by an $M$-head attention mechanism. First, for VRP, a new vector is constructed by concatenating three quantities: the embedding of the node selected at the previous construction step, the remaining capacity of the vehicle, and the graph embedding, which is the mean of the embeddings over the nodes that have not been visited (including the depot) at construction step $t$. Similar to the encoder, the context vector is then computed with a single $M$-head attention layer, in which only a single query (per head) is computed from this new vector (the parameters are not shared with the encoder).
As shown in Eq. (14), in order to construct a feasible solution, any node that would violate the constraints is masked. For VRP, the following masking conditions are used. First, a customer node whose demand is greater than the remaining capacity of the vehicle is masked. Second, a customer node that has already been visited is masked.
Remark: the depot node can be visited multiple times and it will be masked only when .
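The two customer masking conditions can be sketched as below. The function name and argument layout are hypothetical, and the depot rule in the last step is an assumption made for illustration only (the depot is forbidden while the vehicle is already there and some customer is still selectable); it is not taken from the text.

```python
def feasibility_mask(demands, visited, remaining_capacity, at_depot):
    """Return mask[i] == True when node i must not be selected at this step."""
    n = len(demands)
    mask = [False] * n
    for i in range(1, n):
        too_large = demands[i] > remaining_capacity  # condition 1: demand exceeds remaining capacity
        mask[i] = visited[i] or too_large            # condition 2: customer already visited
    # ASSUMED depot rule: do not stay at the depot while a customer can still be served.
    mask[0] = at_depot and not all(mask[1:])
    return mask

demands = [0, 3, 5, 2]
visited = [False, True, False, False]
mask_on_route = feasibility_mask(demands, visited, remaining_capacity=4, at_depot=False)
mask_at_depot = feasibility_mask(demands, visited, remaining_capacity=10, at_depot=True)
```

In the first call, customer 1 is masked because it was visited and customer 2 because its demand exceeds the remaining capacity, so only customer 3 and the depot remain selectable.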
Finally, the probability of selecting each node is computed with a single-head attention layer, where a constant $C$ is used to clip the compatibilities within $(-C, C)$ before the softmax. If a customer node is selected at construction step $t$, the remaining capacity of the vehicle is reduced by the demand served at that node; when the vehicle returns to the depot, it picks up additional items.
Fig. 4 illustrates the details of the decoder at one construction step. According to the partial solution and the node embeddings, the context vector is computed by the attention mechanism. Based on the context vector and the embeddings of the remaining nodes, the decoder predicts a distribution over these nodes and selects one to visit.
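A minimal sketch of the final selection step is given below. It assumes tanh-based clipping with a default constant of 10, as is common in related attention models, and a simple capacity update (refill at the depot, subtract the served demand otherwise); both choices and all names are assumptions for illustration rather than the paper's exact equations.

```python
import math

def selection_probabilities(scores, mask, clip=10.0):
    """Softmax over clipped compatibilities; masked nodes receive probability 0.

    `clip` stands in for the clipping constant; its default value is an assumption.
    At least one node must be unmasked.
    """
    logits = [-math.inf if m else clip * math.tanh(s) for s, m in zip(scores, mask)]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def update_capacity(remaining, demand, node, full_capacity):
    """ASSUMED update rule: refill at the depot (node 0), otherwise subtract the demand."""
    if node == 0:
        return full_capacity
    return max(remaining - demand, 0)

probs = selection_probabilities([0.0, 0.0, 5.0], [False, False, True])
```

Here the third node is masked, so the two remaining nodes with equal compatibility share the probability mass equally.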
4 Dynamic Attention Model for VRP
As mentioned in Section 3, the solution is constructed incrementally by the decoder. At different construction steps, the state of the instance changes, and the feature embedding of each node should be updated. As shown in Fig. 5, once the model has constructed a partial solution, the remaining nodes, which are not yet included in the partial solution, can be seen as a new instance. Constructing the remaining solution is equivalent to solving this new instance. Since some nodes have already been visited, the structure of this new instance differs from that of the original instance. Therefore, the structure information has changed and the node features should be updated accordingly. However, in the vanilla encoder-decoder architecture of AM for VRP, as shown in Fig. 6, the feature embedding of each node is computed only once, which corresponds to the initial state of the instance. This paper proposes a dynamic encoder-decoder architecture that characterizes the feature embedding of each node dynamically at different construction steps.
The dynamic encoder-decoder architecture, as shown in Fig. 6, is similar to the vanilla encoder-decoder architecture. The key difference is that the embedding of each node is recomputed immediately whenever the vehicle returns to the depot. Specifically, let $h_i^t$ denote the embedding of node $i$ at construction step $t$ (the layer number is omitted). If the vehicle has just returned to the depot, $h_i^t$ is recomputed by the encoder; otherwise it is carried over unchanged from the previous step. The recomputation is similar to Eq. (11) and uses the same stack of multi-head attention layers. The only difference is that Eq. (4), the compatibility used in the encoder attention, is modified: in order to reflect that the structure of the instance has changed, the nodes that have already been visited are masked, so that they no longer contribute to the embeddings of the remaining nodes.
During decoding, at each step $t$, the computation of Eqs. (13)-(18) is based on the latest embedding $h_i^t$ of each node. As shown in Fig. 6, the entire architecture uses the encoder and the decoder alternately to re-encode the node embeddings and construct a partial solution.
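The alternation between encoder and decoder can be sketched as a control loop. The callables `encode` and `decode_step` are hypothetical stand-ins for the attention encoder and decoder; the toy versions below only make the loop runnable and do not model the neural networks.

```python
def construct_solution(demands, capacity, encode, decode_step):
    """Build a solution, re-encoding the nodes each time the vehicle returns to the depot."""
    visited = [False] * len(demands)
    remaining = capacity
    embeddings = encode(visited)  # initial embedding, as in the vanilla architecture
    n_encodings = 1
    solution = []
    while not all(visited[1:]):
        node = decode_step(embeddings, visited, remaining)
        solution.append(node)
        if node == 0:
            remaining = capacity
            embeddings = encode(visited)  # dynamic step: visited nodes are masked in the encoder
            n_encodings += 1
        else:
            visited[node] = True
            remaining -= demands[node]
    return solution, n_encodings

demands = [0, 2, 2, 3]

def toy_encode(visited):
    # Placeholder embedding: 1.0 for unvisited nodes, 0.0 for visited ones.
    return [0.0 if seen else 1.0 for seen in visited]

def toy_decode_step(embeddings, visited, remaining):
    # Placeholder policy: first unvisited customer that fits, otherwise return to the depot.
    for i in range(1, len(visited)):
        if not visited[i] and demands[i] <= remaining:
            return i
    return 0

solution, n_encodings = construct_solution(demands, 4, toy_encode, toy_decode_step)
```

On this toy instance the vehicle serves customers 1 and 2, returns to the depot (which triggers one re-encoding), and then serves customer 3.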
Given a distribution over nodes, there are two strategies for selecting the next node to visit. The first is sample rollout, which selects a node by sampling from the distribution. The second is greedy rollout, which selects the node with maximum probability. The former is a stochastic policy and the latter is a deterministic policy.
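The two selection strategies can be illustrated with the standard library alone. The function names are chosen for this sketch and are not from the paper.

```python
import random

def greedy_select(probs):
    """Deterministic policy: index of the node with maximum probability."""
    return max(range(len(probs)), key=probs.__getitem__)

def sample_select(probs, rng=random):
    """Stochastic policy: draw one node index according to the distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = [0.1, 0.7, 0.2]
rng = random.Random(0)
greedy_choice = greedy_select(probs)
sampled_choices = [sample_select(probs, rng) for _ in range(5)]
```

Greedy rollout always returns the same node for a given distribution, whereas repeated sampling can return different nodes in proportion to their probabilities.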
5 Model Training
As in prior work, solving a combinatorial optimization problem is modeled as a Markov decision process (MDP), and AM-D is trained by policy gradient using the REINFORCE algorithm. Given an instance $X$, the training objective is the tour length of the solution $\pi$. Hence, for instance $X$, the gradient with respect to the parameters $\theta$ is defined as:
$$\nabla_\theta \mathcal{L}(\theta \mid X) = \mathbb{E}_{p_\theta(\pi \mid X)}\big[\big(L(\pi) - b(X)\big)\,\nabla_\theta \log p_\theta(\pi \mid X)\big],$$
where $L(\pi)$ is the tour length of solution $\pi$ and $b(X)$ is a baseline function that estimates the expected tour length of instance $X$, which reduces the variance of the gradients and accelerates convergence. In this paper, as in AM, the tour length of the greedy solution, which is constructed by greedy rollout, is taken as $b(X)$.
During training, the instances are drawn from the same distribution. The gradient is approximated by Monte Carlo sampling as:
$$\nabla_\theta \mathcal{L}(\theta) \approx \frac{1}{B} \sum_{i=1}^{B} \big(L(\pi_i^{s}) - L(\pi_i^{g})\big)\,\nabla_\theta \log p_\theta(\pi_i^{s} \mid X_i),$$
where $B$ is the batch size, and $\pi_i^{s}$ and $\pi_i^{g}$ are the solutions of instance $X_i$ constructed by sample rollout and greedy rollout respectively. The training algorithm is described in Algorithm 1.
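The Monte Carlo estimate is usually implemented through a surrogate loss whose gradient equals the estimator. The sketch below computes only the scalar surrogate from precomputed quantities; the function name is hypothetical, and in practice the log-probabilities would come from an automatic differentiation framework so that the gradient can be taken.

```python
def reinforce_surrogate(sample_lengths, greedy_lengths, sample_log_probs):
    """Batch mean of advantage times log-probability.

    Differentiating this value with respect to the policy parameters, while
    treating the advantages as constants, yields the Monte Carlo gradient estimate.
    """
    batch_size = len(sample_lengths)
    advantages = [ls - lg for ls, lg in zip(sample_lengths, greedy_lengths)]
    loss = sum(a * lp for a, lp in zip(advantages, sample_log_probs)) / batch_size
    return loss, advantages

# Two instances: tour lengths from sample rollout and greedy rollout, plus the
# log-probability of each sampled solution.
loss, advantages = reinforce_surrogate([5.0, 4.0], [4.5, 4.5], [-2.0, -1.0])
```

A sampled tour that is longer than the greedy tour has a positive advantage, so minimizing the surrogate lowers its probability; a shorter sampled tour has a negative advantage and its probability is raised.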
Experiments are conducted to investigate the performance of AM-D on VRP with 20, 50 and 100 customer nodes. The procedure consists of two phases: a training phase and a testing phase. For each problem, in the training phase, the model is trained for 30 epochs, and 10000 batches are processed in each epoch. In the testing phase, the performance on 10000 test instances is reported, where each solution is constructed by greedy rollout, and the final result is the average tour length over all test instances.
6.1 Instances and Hyperparameters
As in prior work, the instances are generated from a fixed distribution. For each node, the location is chosen uniformly at random from the unit square $[0, 1] \times [0, 1]$, and the demand of each customer is a discrete number chosen uniformly at random (the demand of the depot is 0). The capacity of the vehicle is 30 for VRP with 20 customer nodes (denoted as VRP20), 40 for VRP50, and 50 for VRP100, and the vehicle is located at the depot at the start of construction. One setting of the batch size and learning rate is used for both VRP20 and VRP50, and another for VRP100. Finally, for each problem, the experiment is run on GPU (a single 1080Ti for VRP20 and VRP50, and 3×1080Ti for VRP100).
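A sketch of such an instance generator follows. The capacities are read from the column headers of the timing table (Cap30, Cap40, Cap50); the demand range, exposed as the `max_demand` parameter with a default of 9, is an assumption, and the function name is hypothetical.

```python
import random

# Capacity per problem size, taken from the table headers (VRP20/Cap30, VRP50/Cap40, VRP100/Cap50).
CAPACITY = {20: 30, 50: 40, 100: 50}

def generate_instance(n_customers, rng, max_demand=9):
    """Sample one VRP instance; node 0 is the depot.

    `max_demand` is an ASSUMED upper bound for the discrete uniform demand.
    """
    coords = [(rng.random(), rng.random()) for _ in range(n_customers + 1)]
    demands = [0] + [rng.randint(1, max_demand) for _ in range(n_customers)]
    return coords, demands, CAPACITY[n_customers]

rng = random.Random(1234)
coords, demands, capacity = generate_instance(20, rng)
```

Passing a seeded `random.Random` instance keeps the generated training and test sets reproducible.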
| | VRP20, Cap30 | VRP50, Cap40 | VRP100, Cap50 |
|---|---|---|---|
| Testing time (greedy) | 0.29 ms | 2.51 ms | 15.92 ms |
| Testing time (2OPT) | 0.05 s | 0.34 s | 2.21 s |
6.2 Results and Discussions
6.2.1 Comparison Results
TABLE 1 shows the results on VRP. Compared with AM, the performance of AM-D is notably improved on VRP20 (2.02%), VRP50 (2.01%) and VRP100 (2.55%). AM-D also significantly outperforms the other baseline models.
The numerical experiments indicate that AM-D performs better than AM and other baseline methods. AM-D introduces a dynamic encoder-decoder architecture to explore structure features dynamically and exploit hidden structure information effectively at different construction steps. Hence, more hidden and useful structure information is taken into account, thereby leading to a better solution.
6.2.2 Generalization to Larger or Smaller Instances
How does the performance of the learned heuristics generalize to test instances with more or fewer customer nodes? Experiments are conducted to investigate the generalization performance of AM-D. Specifically, the models trained on instances with 20, 50 and 100 customer nodes are denoted as AM-D (20), AM-D (50) and AM-D (100), respectively. AM-D (20) is tested on instances with 50 and 100 customer nodes, AM-D (50) is tested on instances with 20 and 100 customer nodes, and AM-D (100) is tested on instances with 20 and 50 customer nodes.
The results are shown in the last three rows of TABLE 1. On the one hand, the model trained on small instances (VRP20) performs well on larger instances (VRP50, VRP100), and its results are even better than those of some baseline methods. On the other hand, the model trained on large instances (VRP100) also performs well on smaller instances (VRP20, VRP50). A possible reason for this good generalization performance is as follows. AM-D constructs the solution incrementally, and this process can be divided into many stages. At each stage, only a partial solution is constructed, and thus the instance is transformed into a smaller one that is easier to solve.
6.2.3 Combination With Local Search
Local search is applied to further improve the results, as in prior work. First, for each instance, a solution is constructed by AM-D (greedy); then the 2OPT local search algorithm is applied to improve this solution. The resulting method is named AM-D (2OPT), and the results are shown in TABLE 1. The runtimes of AM-D (greedy) and AM-D (2OPT) are given in TABLE 2. The results indicate that the quality of the solution is improved by integrating local search, but the local search brings additional computational cost.
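As an illustration of the local search step, the sketch below applies 2-opt to a single route. How the paper applies 2OPT to a multi-route VRP solution is not stated; treating each route separately, as well as the function names, are assumptions made here.

```python
import math

def route_length(coords, route):
    """Length of one route that starts and ends at the depot (index 0)."""
    path = [0] + list(route) + [0]
    return sum(math.dist(coords[a], coords[b]) for a, b in zip(path, path[1:]))

def two_opt(coords, route):
    """Reverse segments of the route for as long as a reversal shortens it."""
    best = list(route)
    improved = True
    while improved:
        improved = False
        for i in range(len(best) - 1):
            for j in range(i + 1, len(best)):
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                if route_length(coords, candidate) < route_length(coords, best) - 1e-12:
                    best, improved = candidate, True
    return best

# Depot at the origin and three customers on the corners of the unit square.
coords = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
improved_route = two_opt(coords, [1, 3, 2])
```

The initial route crosses the square diagonally twice; reversing the last two customers removes the crossing and yields the perimeter tour of length 4. Because only the visiting order within a route changes, the capacity constraint stays satisfied.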
Machine learning and optimization are closely related: in many optimization algorithms, machine learning is used as an assistant or helper component to improve solution quality or reduce computational cost. In contrast to these methods, AM-D aims to learn heuristics directly from data without human intervention. This means that knowledge or features can be extracted automatically from the given problem instances. Specifically, given an optimization problem and instances generated from a distribution, AM-D can learn an approximation or heuristic algorithm and solve the problem on unseen instances generated from the same distribution.
Like most machine learning algorithms, AM-D can be divided into training and testing phases. The elapsed times of training and testing are shown in TABLE 2. Although training is time-consuming, it is an upfront, offline computation and can be seen as a search in algorithm space. The trained model can then be used directly to solve unseen instances without retraining from scratch, which is an online, even real-time, computation. Taking VRP20 as an example, the training phase takes 14 hours, but this cost is paid only once. Once the model has been trained, it spends only 0.29 milliseconds on each instance without retraining. Thus, AM-D differs from classic heuristics, which search the solution space iteratively from scratch for each instance.
Because the training phase of AM-D is time-consuming and computing resources are limited, the model is trained only on small and medium-sized problem instances. Adopting existing parallel computing techniques to improve computational efficiency is a promising way to scale to larger problem instances in the future.
This paper presents a dynamic attention model with a dynamic encoder-decoder architecture for VRP. The key improvement is that the structure features of instances are explored dynamically, and hidden structure information is exploited effectively at different construction steps. Hence, more hidden and useful structure information is taken into account when constructing a solution. AM-D is tested on a challenging NP-hard problem, VRP. The results show that AM-D performs better than AM and the other baseline models on all tested problem sizes. In addition, AM-D shows good generalization performance across different problem scales.
In the future, the proposed learning-based heuristic method, AM-D, can be extended to real-world complex VRP variants [29, 26, 24, 25, 27, 30], such as VRP with time windows, by hybridizing it with operations research methods, which may open new directions for combinatorial optimization algorithms.
This work is supported by the National Key R&D Program of China (2018AAA0101203), and the National Natural Science Foundation of China (61673403, U1611262).
[1] (2015) Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.
[2] (2016) Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940.
[3] (2018) Machine learning for combinatorial optimization: a methodological tour d'horizon. arXiv preprint arXiv:1811.06128.
[4] (2016) Discriminative embeddings of latent variable models for structured data. In International Conference on Machine Learning, pp. 2702–2711.
[5] Learning heuristics for the TSP by policy gradient. In International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research, pp. 170–181.
[6] (2017) Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1263–1272.
[7] (1990) Local optimization and the traveling salesman problem. In International Colloquium on Automata, Languages, and Programming, pp. 446–461.
[8] (2017) Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, pp. 6348–6358.
[9] (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations.
[10] (2019) Attention, learn to solve routing problems!. In International Conference on Learning Representations.
[11] (2018) Combinatorial optimization with graph convolutional networks and guided tree search. In Advances in Neural Information Processing Systems, pp. 539–548.
[12] (2019) New shades of the vehicle routing problem: emerging problem formulations and computational intelligence solution methods. IEEE Transactions on Emerging Topics in Computational Intelligence 3 (3), pp. 230–244.
[13] (2019) Learning heuristics over large graphs via deep reinforcement learning. arXiv preprint arXiv:1903.03332.
[14] (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529.
[15] (2018) Reinforcement learning for solving the vehicle routing problem. In Advances in Neural Information Processing Systems, pp. 9839–9849.
[16] (2018) Learning to solve NP-complete problems: a graph neural network for the decision TSP. arXiv preprint arXiv:1809.02721.
[17] (2009) The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80.
[18] (2019) Learning a SAT solver from single-bit supervision. In International Conference on Learning Representations.
[19] (2019) A review on the self and dual interactions between machine learning and optimisation. Progress in Artificial Intelligence 8 (2), pp. 143–165.
[20] (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112.
[21] (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
[22] (2018) Graph attention networks. In International Conference on Learning Representations.
[23] (2015) Pointer networks. In Advances in Neural Information Processing Systems, pp. 2692–2700.
[24] (2018) A hybrid multiobjective memetic algorithm for multiobjective periodic vehicle routing problem with time windows. IEEE Transactions on Systems, Man, and Cybernetics: Systems, pp. 1–14.
[25] A two-stage multiobjective evolutionary algorithm for multiobjective multidepot vehicle routing problem with time windows. IEEE Transactions on Cybernetics 49 (7), pp. 2467–2478.
[26] (2019) Multiobjective multiple neighborhood search algorithms for multiobjective fleet size and mix location-routing problem with time windows. IEEE Transactions on Systems, Man, and Cybernetics: Systems, pp. 1–15.
[27] (2016) Multiobjective vehicle routing problems with simultaneous delivery and pickup and time windows: formulation, instances, and algorithms. IEEE Transactions on Cybernetics 46 (3), pp. 582–594.
[28] (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3-4), pp. 229–256.
[29] A graph-based fuzzy evolutionary algorithm for solving two-echelon vehicle routing problems. IEEE Transactions on Evolutionary Computation, pp. 1–1.
[30] (2015) A local search-based multiobjective optimization algorithm for multiobjective vehicle routing problem with time windows. IEEE Systems Journal 9 (3), pp. 1100–1113.