Deep Reinforcement Learning for Solving the Heterogeneous Capacitated Vehicle Routing Problem

10/06/2021 ∙ by Jingwen Li, et al. ∙ National University of Singapore ∙ Microsoft ∙ Nanyang Technological University ∙ The Chinese University of Hong Kong

Existing deep reinforcement learning (DRL) based methods for solving the capacitated vehicle routing problem (CVRP) intrinsically cope with a homogeneous vehicle fleet, in which the fleet is assumed to be repetitions of a single vehicle. Hence, the key to constructing a solution solely lies in the selection of the next node (customer) to visit, excluding the selection of a vehicle. However, vehicles in real-world scenarios are likely to be heterogeneous, with different characteristics that affect their capacity (or travel speed), rendering existing DRL methods less effective. In this paper, we tackle heterogeneous CVRP (HCVRP), where vehicles are mainly characterized by different capacities. We consider both min-max and min-sum objectives for HCVRP, which aim to minimize the longest or total travel time of the vehicle(s) in the fleet, respectively. To solve these problems, we propose a DRL method based on the attention mechanism, with a vehicle selection decoder accounting for the heterogeneous fleet constraint and a node selection decoder accounting for the route construction, which learns to construct a solution by automatically selecting both a vehicle and a node for that vehicle at each step. Experimental results on randomly generated instances show that, with desirable generalization to various problem sizes, our method outperforms the state-of-the-art DRL method and most of the conventional heuristics, and also delivers competitive performance against the state-of-the-art heuristic method, i.e., SISR. Additionally, extended experiments demonstrate that our method is also able to solve CVRPLib instances with satisfactory performance.




I Introduction

The Capacitated Vehicle Routing Problem (CVRP) is a classical combinatorial optimization problem, which aims to optimize the routes of a fleet of capacity-constrained vehicles serving a set of customers with demands. Compared with the assumption of multiple identical vehicles in homogeneous CVRP, settings with vehicles of different capacities (or speeds) are more in line with real-world practice, which leads to the heterogeneous CVRP (HCVRP) [17, 23]. According to the objective, CVRP can further be classified into min-max and min-sum variants. The former requires that the longest (worst-case) travel time (or distance) of any vehicle in the fleet be as short as possible, since fairness is crucial in many real-world applications [53, 14, 6, 31, 1, 20, 29], while the latter aims to minimize the total travel time (or distance) incurred by the whole fleet [19, 30, 45, 54]. In this paper, we study HCVRP with both min-max and min-sum objectives, i.e., MM-HCVRP and MS-HCVRP.
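To make the two objectives concrete, here is a small, self-contained Python sketch (the coordinates, routes, and speeds are invented for illustration, not taken from the paper) that evaluates a fixed 3-vehicle solution under both criteria:

```python
import math

# Hypothetical instance: node 0 is the depot; each route starts and ends there.
coords = {0: (0.0, 0.0), 1: (3.0, 4.0), 2: (6.0, 0.0), 3: (0.0, 5.0)}
routes = {"v1": [0, 1, 0], "v2": [0, 2, 0], "v3": [0, 3, 0]}
speed = {"v1": 1.0, "v2": 1.0, "v3": 0.5}  # heterogeneous speeds

def travel_time(route, v):
    """Total travel time of one vehicle over its route."""
    dist = sum(math.dist(coords[a], coords[b]) for a, b in zip(route, route[1:]))
    return dist / speed[v]

times = {v: travel_time(r, v) for v, r in routes.items()}
min_max = max(times.values())  # MM-HCVRP: longest single-vehicle travel time
min_sum = sum(times.values())  # MS-HCVRP: total travel time of the fleet
```

A min-max solver would try to rebalance work away from the slow vehicle v3, while a min-sum solver only cares about the fleet total.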

Conventional methods for solving HCVRP include exact and heuristic ones. Exact methods usually adopt branch-and-bound or its variants as the framework and perform well on small-scale problems [3, 30, 16, 19], but may consume a prohibitively long time on large-scale ones given the exponential computational complexity. Heuristic methods usually exploit hand-engineered search rules to guide the solving process; they often run much faster and are more desirable for large-scale problems in reality [34, 27, 45, 61]. However, such hand-engineered rules rely largely on human experience and domain knowledge, and thus might be incapable of engendering high-quality solutions. Moreover, both conventional exact and heuristic methods solve problem instances independently, and fail to exploit the patterns potentially shared among the instances.

Recently, researchers have applied deep reinforcement learning (DRL) to automatically learn the search rules of heuristic methods for solving routing problems including CVRP and TSP [5, 59, 37, 24, 9, 26], by discovering the underlying patterns from a large number of instances. Generally, those DRL models fall into two classes, i.e., construction and improvement methods. Starting with an empty solution, the former constructs a solution by sequentially assigning each customer to a vehicle until all customers are served. Starting with a complete initial solution, the latter selects candidate nodes (customers or the depot), heuristic operators, or both, to improve and update the solution at each step, repeating until termination. By further leveraging advanced deep learning architectures like the attention mechanism to guide the selection, those DRL models are able to efficiently generate solutions of much higher quality than conventional heuristics. However, existing works only focus on solving homogeneous CVRP, which intrinsically copes with vehicles of the same characteristics, in the sense that the complete route of the fleet could be derived by repeatedly dispatching a single vehicle. Consequently, the key in those works is solely selecting the next node to visit, excluding the selection of vehicles, since there is essentially only one vehicle. Evidently, those works would be far less effective when applied to the more practical HCVRP, given the following issues: 1) the assumption of homogeneous vehicles is unable to capture the discrepancies among vehicles; 2) vehicle selection is not explicitly considered, although it should be of equal importance to node selection in HCVRP; 3) the contextual information in the attention scheme is insufficient, as it lacks the states of other vehicles and the (partially) constructed routes, which may render it incapable of engendering high-quality solutions in view of the complexity of HCVRP.

In this paper, we aim to solve the HCVRP with both min-sum and min-max objectives while addressing the aforementioned issues. We propose a novel neural architecture integrated with the attention mechanism to improve the DRL based construction method, which combines the decision-making for vehicle selection and node selection to engender solutions of higher quality. Different from existing works that construct the routes for each vehicle of a homogeneous fleet in sequence, our policy network is able to automatically and flexibly select a vehicle from the heterogeneous fleet at each step. Specifically, our policy network adopts a Transformer-style [48] encoder-decoder structure, where the decoder consists of two parts, i.e., a vehicle selection decoder and a node selection decoder. With the problem features (i.e., customer locations, customer demands, and vehicle capacities) processed by the encoder for better representation, the policy network first selects a vehicle from the fleet using the vehicle selection decoder based on the states of all vehicles and partial routes, and then selects a node for this vehicle using the node selection decoder at each decoding step. This process is repeated until all customers are served.

Accordingly, the major contribution of this paper is that we present a deep reinforcement learning method to solve CVRP with multiple heterogeneous vehicles, which is intrinsically different from the homogeneous case in existing works, as the latter lacks the selection of vehicles from a fleet. Specifically, we propose an effective neural architecture that integrates vehicle selection and node selection, with rich contextual information for selecting among the heterogeneous vehicles, where every vehicle in the fleet has the chance to be selected at each step. We test both min-max and min-sum objectives with various numbers of vehicles and customers. Results show that our method is superior to most of the conventional heuristics and competitive with the state-of-the-art heuristic (i.e., SISR) while requiring much shorter computation time. With comparable computation time, our method achieves much better solution quality than the other DRL method. In addition, our method generalizes well to problems with larger customer sizes.

The remainder of the paper is organized as follows. Section II briefly reviews conventional methods and deep models for routing problems. Section III introduces the mathematical formulation of MM-HCVRP and MS-HCVRP and their reformulation in the RL (reinforcement learning) manner. Section IV elaborates our DRL framework. Section V provides the computational experiments and analysis. Finally, Section VI concludes the paper and presents future work.

II Related Works

In this section, we briefly review the conventional methods for solving HCVRP with different objective functions, and deep models for solving the general VRPs.

The heterogeneous CVRP (HCVRP) was first studied in [17], where the Clarke and Wright procedure and partition algorithms were applied to generate a lower bound and estimate the optimal solution. An efficient constructive heuristic was adopted to solve HCVRP in [41] by merging the small initial trips of individual customers into complete ones, which was also capable of handling multi-trip cases. Baldacci and Mingozzi [3] presented a unified exact method to solve HCVRP, reducing the number of variables by using three bounding procedures. Feng et al. [15] proposed a novel evolutionary multitasking algorithm to tackle the HCVRP with time windows and occasional drivers, which can also solve multiple optimization tasks simultaneously.

The CVRP with min-sum objective was first proposed by Dantzig and Ramser [13] as a generalization of the Travelling Salesman Problem (TSP) with capacity constraints. To address large-scale multi-objective optimization problems (MOPs), a competitive swarm optimizer (CSO) based search method was proposed in [46, 11, 10], which conceived a new particle updating strategy to improve search accuracy and efficiency. By transforming the large-scale CVRP (LSCVRP) into a large-scale MOP, an evolutionary multi-objective route grouping method was introduced in [57], which employed a multi-objective evolutionary algorithm to decompose the LSCVRP into small tractable sub-components. The min-max objective was considered in a multiple Travelling Salesman Problem (mTSP) [16], which was solved by a tabu search heuristic and two exact search schemes. An ant colony optimization method was proposed to address the min-max Single Depot CVRP (SDCVRP) [35]. The problem was further extended to the min-max multi-depot CVRP [36], which could be reduced to SDCVRP using an equitable region partitioning approach. A swarm intelligence based heuristic algorithm was presented to address the rich min-max CVRP [61]. The min-max cumulative capacitated vehicle routing problem, which aims to minimize the last arrival time at customers, was first studied in [44, 18], where a two-stage adaptive variable neighbourhood search (AVNS) algorithm was introduced and also tested on the min-sum objective to verify its generalization.

The first deep model for routing problems is the Pointer Network, which used supervised learning to solve TSP [51] and was later trained with reinforcement learning [5]. Afterwards, the Pointer Network was adopted to solve CVRP in [37], where the Recurrent Neural Network architecture in the encoder was removed to reduce computational complexity without degrading solution quality. To further improve performance, a Transformer based architecture was incorporated by integrating self-attention in both the encoder and decoder [24]. Different from the above methods, which learn constructive heuristics, NeuRewriter was proposed to learn how to pick the next solution within a local search framework [9]. Despite their promising results, these methods are less effective for tackling the heterogeneous fleet in HCVRP. Recently, some learning based methods have been proposed to solve HCVRP. Inspired by multi-agent RL, Vera and Abad [49] made the first attempt to solve the min-sum HCVRP through cooperative actions of multiple agents for route construction. Qin et al. [42] proposed a reinforcement learning based controller to select among several meta-heuristics with different characteristics to solve the min-sum HCVRP. Although yielding better performance than conventional heuristics, these methods are unable to handle either the min-max objective or heterogeneous vehicle speeds well.

III Problem Formulation

In this section, we first introduce the mathematical formulation of HCVRP with both min-max and min-sum objectives, and then reformulate it in the form of reinforcement learning.

III-A Mathematical Formulation of HCVRP

Particularly, with the nodes (customers and depot) represented as $X = \{x_0, x_1, \dots, x_n\}$ and node $x_0$ denoting the depot, the customer set is $X' = X \setminus \{x_0\}$. Each node $x_i \in X$ is defined as $x_i = (s_i, d_i)$, where $s_i$ contains the 2-dim location coordinates of node $x_i$, and $d_i$ refers to its demand (the demand of the depot is 0). Here, we take heterogeneous vehicles with different capacities into account, which respects real-world situations. Accordingly, let $V = \{v_1, v_2, \dots, v_m\}$ represent the heterogeneous fleet of $m$ vehicles, where each element is characterized by $Q_j$, i.e., the capacity of vehicle $v_j$. The HCVRP describes a process in which all fully loaded vehicles start from the depot and sequentially visit customer locations to satisfy their demands, under the constraints that each customer is visited exactly once and the loading amount of a vehicle during a single trip never exceeds its capacity.

Let $D(x_i, x_j)$ denote the Euclidean distance between $x_i$ and $x_j$. Let $y_{ij}^{v}$ be a binary variable that equals 1 if vehicle $v$ travels directly from node $x_i$ to node $x_j$, and 0 otherwise. Let $l_{ij}^{v}$ be the remaining capacity of vehicle $v$ before travelling from node $x_i$ to node $x_j$. For simplicity, we assume that all vehicles have the same speed $f$, which could easily be extended to take different values. Then, the MM-HCVRP is naturally defined as follows,

$$\min \; \max_{v \in V} \; \frac{1}{f} \sum_{i=0}^{n} \sum_{j=0}^{n} D(x_i, x_j)\, y_{ij}^{v}, \tag{1}$$

subject to the following six constraints,

$$\sum_{v \in V} \sum_{i=0}^{n} y_{ij}^{v} = 1, \quad \forall j \in \{1, \dots, n\}, \tag{2}$$
$$\sum_{i=0}^{n} y_{ij}^{v} = \sum_{k=0}^{n} y_{jk}^{v}, \quad \forall j \in \{0, \dots, n\}, \; \forall v \in V, \tag{3}$$
$$\sum_{i=0}^{n} l_{ij}^{v}\, y_{ij}^{v} - \sum_{k=0}^{n} l_{jk}^{v}\, y_{jk}^{v} = d_j \sum_{i=0}^{n} y_{ij}^{v}, \quad \forall j \in \{1, \dots, n\}, \; \forall v \in V, \tag{4}$$
$$d_j\, y_{ij}^{v} \le l_{ij}^{v} \le Q_v\, y_{ij}^{v}, \quad \forall i, j \in \{0, \dots, n\}, \; \forall v \in V, \tag{5}$$
$$y_{ij}^{v} \in \{0, 1\}, \quad \forall i, j \in \{0, \dots, n\}, \; \forall v \in V, \tag{6}$$
$$l_{ij}^{v} \ge 0, \quad \forall i, j \in \{0, \dots, n\}, \; \forall v \in V. \tag{7}$$

The objective of the formulation is to minimize the maximum travel time among all vehicles. Constraints (2) and (3) ensure that each customer is visited exactly once and that each route is completed by the same vehicle. Constraint (4) guarantees that the difference between the amounts of goods loaded by a vehicle before and after serving a customer equals the demand of that customer. Constraint (5) enforces that the amount of goods carried by any vehicle is able to meet the demands of the corresponding customers and never exceeds its capacity. Constraint (6) defines the binary variables, and constraint (7) imposes the non-negativity of the continuous variables.
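The feasibility conditions above can be checked mechanically. The following Python sketch (function and variable names are our own, not from the paper) verifies the two core conditions, i.e., each customer served exactly once and no single trip exceeding the vehicle's capacity:

```python
def is_feasible(routes, demands, capacities):
    """routes: vehicle -> node list (0 = depot); demands: customer -> demand."""
    visited = [c for r in routes.values() for c in r if c != 0]
    if sorted(visited) != sorted(demands):  # each customer exactly once
        return False
    for v, route in routes.items():
        load = 0
        for c in route:
            # Returning to the depot resets the load (a new fully loaded trip).
            load = 0 if c == 0 else load + demands[c]
            if load > capacities[v]:        # capacity constraint per trip
                return False
    return True
```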

The MS-HCVRP shares the same constraints as the MM-HCVRP, while the objective is formulated as follows,

$$\min \; \sum_{v \in V} \frac{1}{f^{v}} \sum_{i=0}^{n} \sum_{j=0}^{n} D(x_i, x_j)\, y_{ij}^{v}, \tag{8}$$

where $f^{v}$ represents the speed of vehicle $v$, which may vary across vehicles. Thereby, it actually minimizes the total travel time incurred by the whole fleet.

III-B Reformulation as RL Form

Reinforcement learning (RL) was originally proposed for sequential decision-making problems, such as self-driving cars, robotics, and games [28, 33, 52, 2, 55, 38]. The step-by-step construction of routes for HCVRP can also be deemed a sequential decision-making problem. In our work, we model this process as a Markov Decision Process (MDP) [4] defined by the 4-tuple $M = \{S, A, \tau, r\}$ (an example of the MDP is illustrated in the supplementary material). The detailed definitions of the state space $S$, the action space $A$, the state transition rule $\tau$, and the reward function $r$ are introduced as follows.

State: In our MDP, each state $s_t \in S$ is composed of two parts. The first part is the vehicle state $V_t = \{v_t^1, v_t^2, \dots, v_t^m\}$ with $v_t^j = (o_t^j, T_t^j, G_t^j)$, where $o_t^j$ and $T_t^j$ represent the remaining capacity and the accumulated travel time of vehicle $v_j$ at step $t$, respectively, and $G_t^j$ represents the partial route of vehicle $v_j$ at step $t$, whose elements are the nodes visited by this vehicle up to step $t$. Note that the dimension of the partial routes (the number of nodes in a route) is kept the same for all vehicles, i.e., if vehicle $v_j$ is selected to serve a node at step $t$, the other vehicles simply repeat their last served nodes. Upon departure from the depot (i.e., $t = 0$), the initial vehicle state is set to $v_0^j = (Q_j, 0, (x_0))$, where $Q_j$ is the maximum capacity of vehicle $v_j$. The second part is the node state $X_t = \{x_t^0, x_t^1, \dots, x_t^n\}$ with $x_t^i = (s_i, d_t^i)$, where $s_i$ is a 2-dim vector representing the location of node $x_i$, and $d_t^i$ is a scalar representing the demand of node $x_i$ at step $t$ ($d_t^i$ becomes 0 once that node has been served). Here, we do not consider demand splitting, and only nodes with $d_t^i > 0$ need to be served.

Action: The action in our method is defined as selecting a vehicle and a node (a customer or the depot) to visit. Specifically, the action $a_t$ is represented as $(v_j, x_i)_t$, i.e., the selected node $x_i$ will be served (or visited) by the selected vehicle $v_j$ at step $t$. Note that only one vehicle is selected at each step.

Transition: The transition rule $\tau$ transits the previous state $s_t$ to the next state $s_{t+1}$ based on the performed action $a_t$, i.e., $s_{t+1} = \tau(s_t, a_t)$. The elements of the vehicle state are updated as follows,

$$o_{t+1}^{j} = \begin{cases} Q_j, & \text{if } v_j \text{ is selected and } x_i = x_0, \\ o_t^{j} - d_t^{i}, & \text{if } v_j \text{ is selected and } x_i \neq x_0, \\ o_t^{j}, & \text{otherwise}, \end{cases}$$

$$T_{t+1}^{j} = \begin{cases} T_t^{j} + D(x', x_i)/f^{j}, & \text{if } v_j \text{ is selected}, \\ T_t^{j}, & \text{otherwise}, \end{cases} \qquad G_{t+1}^{j} = \begin{cases} G_t^{j} \oplus x_i, & \text{if } v_j \text{ is selected}, \\ G_t^{j} \oplus x', & \text{otherwise}, \end{cases}$$

where $x'$ is the last element of $G_t^{j}$, i.e., the node last visited by vehicle $v_j$ at step $t$, and $\oplus$ is the concatenation operator. The element of the node state is updated as follows,

$$d_{t+1}^{i} = \begin{cases} 0, & \text{if node } x_i \text{ is selected}, \\ d_t^{i}, & \text{otherwise}, \end{cases}$$

where each demand retains 0 after its node has been visited.

Reward: For the MM-HCVRP, to minimize the maximum travel time over all vehicles, the reward is defined as the negative of this maximum, where the travel time of each vehicle is accumulated over its multiple trips. Accordingly, the reward is represented as $r = -\max_{j} \sum_{t} r_t^{j}$, where $r_t$ is the vector of incremental travel times of all vehicles at step $t$. Similarly, for the MS-HCVRP, the reward is defined as the negative of the total travel time of all vehicles, i.e., $r = -\sum_{j} \sum_{t} r_t^{j}$. Particularly, assume that nodes $x_i$ and $x_{i'}$ are selected at steps $t$ and $t+1$, respectively, and both are served by vehicle $v_j$; then the reward at step $t+1$ is expressed as an $m$-dim vector as follows,

$$r_{t+1} = \big(0, \dots, D(x_i, x_{i'})/f^{j}, \dots, 0\big),$$

where the $j$-th element $D(x_i, x_{i'})/f^{j}$ is the time consumed by vehicle $v_j$ for travelling from node $x_i$ to $x_{i'}$, with all other elements of $r_{t+1}$ equal to 0.
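The MDP above can be condensed into a toy environment. The sketch below (class and attribute names are our own assumptions, with simplified bookkeeping) implements the transition rule and returns the per-step incremental travel time:

```python
import math

class HCVRPEnv:
    """Minimal HCVRP environment: tracks capacity, time, and partial routes."""
    def __init__(self, coords, demands, capacities, speeds):
        self.coords = coords
        self.demands = dict(demands)
        self.max_cap = dict(capacities)
        self.cap = dict(capacities)            # remaining capacity per vehicle
        self.time = {v: 0.0 for v in speeds}   # accumulated travel time
        self.routes = {v: [0] for v in speeds} # partial routes start at depot 0
        self.speeds = speeds

    def step(self, vehicle, node):
        """Action (vehicle, node): move the vehicle and update the state."""
        last = self.routes[vehicle][-1]
        dt = math.dist(self.coords[last], self.coords[node]) / self.speeds[vehicle]
        self.time[vehicle] += dt
        if node == 0:
            self.cap[vehicle] = self.max_cap[vehicle]  # replenish at depot
        else:
            self.cap[vehicle] -= self.demands[node]
            self.demands[node] = 0                     # demand retains 0
        self.routes[vehicle].append(node)
        return dt  # this vehicle's entry of the incremental reward vector
```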

IV Methodology

In this section, we introduce our deep reinforcement learning (DRL) based approach for solving HCVRP with both min-max and min-sum objectives. We first propose a novel attention-based deep neural network to represent the policy, which enables both vehicle selection and node selection at each decision step. Then we describe the procedure for training our policy network.

Fig. 1: The framework of our policy network. With the raw features of the instance processed by the encoder, our policy network first selects a vehicle ($v_j$) using the vehicle selection decoder and then a node ($x_i$) using the node selection decoder for this vehicle to visit at each route construction step $t$. The selected vehicle and node together constitute the action at that step, i.e., $a_t = (v_j, x_i)_t$, and the partial solution and state are updated accordingly. For a single instance, the encoder is executed once, while the vehicle and node selection decoders are executed multiple times to construct the solution.
Fig. 2: Architecture of our policy network with $m$ heterogeneous vehicles and $n$ customers. It is worth noting that our vehicle selection decoder leverages the vehicle features (last node location and accumulated travel time), the route features (max-pooling of the routes of all vehicles), and their combinations to compute the probability of selecting each vehicle.

IV-A Framework of Our Policy Network

In our approach, we focus on learning a stochastic policy $\pi_\theta$ represented by a deep neural network with trainable parameters $\theta$. Starting from the initial state $s_0$, i.e., an empty solution, we follow the policy to construct the solution by complying with the MDP in Section III-B until the terminal state $s_T$ is reached, i.e., all customers are served by the whole fleet of vehicles. Note that $T$ is possibly larger than $n$ because vehicles sometimes need to return to the depot for replenishment. Accordingly, the joint probability of this process is factorized based on the chain rule as follows,

$$P(s_T \,|\, s_0) = \prod_{t=0}^{T-1} \pi_\theta(a_t \,|\, s_t)\, P(s_{t+1} \,|\, s_t, a_t),$$

where $P(s_{t+1} \,|\, s_t, a_t) = 1$ always holds since we adopt a deterministic state transition rule.

As illustrated in Fig. 1, our policy network is composed of an encoder, a vehicle selection decoder and a node selection decoder. Since a given problem instance itself remains unchanged throughout the decision process, the encoder is executed only once at the first step ($t = 0$) to simplify the computation, while its outputs are reused in subsequent steps ($t > 0$) for route construction. To solve the instance, with raw features processed by the encoder for better representation, our policy network first selects a vehicle ($v_j$) from the whole fleet via the vehicle selection decoder, and then selects a node ($x_i$) for this vehicle to visit via the node selection decoder at each route construction step. The selected vehicle and node constitute the action for that step, which is further used to update the states. This process is repeated until all customers are served.

IV-B Architecture of Our Policy Network

Originating from the field of natural language processing [48], the Transformer model has been successfully extended to many other domains, such as image processing [25, 62], recommendation systems [43, 8] and vehicle routing problems [24, 58], owing to its desirable capability to handle sequential data. Rather than relying on sequential recurrent or convolutional structures, the Transformer mainly hinges on the self-attention mechanism to learn the relations between any two elements of a sequence, which allows more efficient parallelization and better feature extraction without the limitation of sequence-aligned recurrence. Regarding general vehicle routing problems, the input is a sequence of customers characterized by locations and demands, and route construction can be deemed a sequential decision-making process, where the Transformer has desirable potential to engender high-quality solutions within short computation time. Specifically, Transformer-style models [24, 58] adopt an encoder-decoder structure, where the encoder computes a representation of the input sequence based on the multi-head attention mechanism for better feature extraction, and the decoder sequentially outputs a customer at each step based on problem-related contextual information until all customers are visited. To solve HCVRP with both min-max and min-sum objectives, we also propose a Transformer-style model as our policy network, which is designed as follows.

As depicted in Fig. 2, our policy network adopts an encoder-decoder structure, and the decoder consists of two parts, i.e., a vehicle selection decoder and a node selection decoder. Based on the stipulation that any vehicle has the opportunity to be selected at each step, our policy network is able to search a more rational and broader action space given the characteristics of HCVRP. Moreover, we enrich the contextual information for the vehicle selection decoder by adding features extracted from all vehicles and the existing (partial) routes. In doing so, the policy network can capture the heterogeneous roles of vehicles, so that decisions are made more effectively from a global perspective. To better illustrate our method, an example of two instances with seven nodes and three vehicles is presented in Fig. 3. Next, we introduce the details of our encoder, vehicle selection decoder, and node selection decoder, respectively.

Encoder. The encoder embeds the raw features of a problem instance (i.e., customer location, customer demand, and vehicle capacity) into a higher-dimensional space, and then processes them through attention layers for better feature extraction. We normalize the demand of customer $x_i$ by dividing it by the capacity of each vehicle to reflect the differences among vehicles in the heterogeneous fleet, i.e., $\tilde{d}_i = (d_i / Q_1, \dots, d_i / Q_m)$. Similar to the encoder of the Transformer in [48, 24], the enhanced node feature $\hat{x}_i = (s_i, \tilde{d}_i)$ is then linearly projected to $h_i^0$ in a high-dimensional space with dimension $d_h$. Afterwards, $h_i^0$ is further transformed to $h_i^N$ through $N$ attention layers for better feature representation, each of which consists of a multi-head attention (MHA) sublayer and a feed-forward (FF) sublayer.

The $\ell$-th MHA sublayer uses a multi-head self-attention network to process the node embeddings $h^{\ell-1}$. We stipulate that $d_k$ is the query/key dimension, $d_v$ is the value dimension, and $Y$ is the number of attention heads. The $\ell$-th MHA sublayer first calculates the attention value of each head, then concatenates all heads and projects the result into a new feature space with the same dimension as the input $h^{\ell-1}$. Concretely, these steps are as follows,

$$Q_y^{\ell} = h^{\ell-1} W_y^{Q}, \quad K_y^{\ell} = h^{\ell-1} W_y^{K}, \quad V_y^{\ell} = h^{\ell-1} W_y^{V},$$
$$Z_y^{\ell} = \mathrm{softmax}\!\left(\frac{Q_y^{\ell} (K_y^{\ell})^{T}}{\sqrt{d_k}}\right) V_y^{\ell}, \quad y = 1, \dots, Y,$$
$$\mathrm{MHA}(h^{\ell-1}) = \mathrm{Concat}(Z_1^{\ell}, \dots, Z_Y^{\ell})\, W^{O},$$

where $W_y^{Q}, W_y^{K} \in \mathbb{R}^{d_h \times d_k}$, $W_y^{V} \in \mathbb{R}^{d_h \times d_v}$ and $W^{O} \in \mathbb{R}^{Y d_v \times d_h}$ are trainable parameters in layer $\ell$ and are independent across different attention layers.

Afterwards, the output of the $\ell$-th MHA sublayer is fed to the $\ell$-th FF sublayer with ReLU activation function to get the next embeddings $h^{\ell}$. Here, a skip-connection [21] and a batch normalization (BN) layer [22] are used for both the MHA and FF sublayers, which are summarised as follows,

$$\hat{h}_i^{\ell} = \mathrm{BN}\big(h_i^{\ell-1} + \mathrm{MHA}_i(h^{\ell-1})\big),$$
$$h_i^{\ell} = \mathrm{BN}\big(\hat{h}_i^{\ell} + \mathrm{FF}(\hat{h}_i^{\ell})\big).$$

Finally, we define the final output of the encoder, i.e., $h_i^N$, as the node embeddings of the problem instance, and the mean of the node embeddings, i.e., $\bar{h}^N = \frac{1}{n+1}\sum_{i=0}^{n} h_i^N$, as the graph embedding of the problem instance, both of which will be reused multiple times in the decoders.
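The MHA sublayer described above can be sketched in a few lines of numpy (the head count and weight shapes here are generic Transformer conventions, not the paper's exact hyperparameters):

```python
import numpy as np

def mha(h, Wq, Wk, Wv, Wo, n_heads=8):
    """Multi-head self-attention over node embeddings h of shape (n, d)."""
    n, d = h.shape
    dk = d // n_heads
    heads = []
    for y in range(n_heads):
        q, k, v = h @ Wq[y], h @ Wk[y], h @ Wv[y]    # (n, dk) each
        scores = q @ k.T / np.sqrt(dk)
        scores -= scores.max(axis=1, keepdims=True)  # numerical stability
        att = np.exp(scores)
        att /= att.sum(axis=1, keepdims=True)        # row-wise softmax
        heads.append(att @ v)
    # Concatenate the heads and project back to dimension d.
    return np.concatenate(heads, axis=1) @ Wo
```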

Fig. 3: An illustration of our policy network for two instances with seven nodes and three vehicles, where the red frame indicates the two stacked instances with the same data structure. Given the current state $s_t$, the features of nodes and vehicles are processed through the encoder to compute the node embeddings and the graph embedding. In the vehicle selection decoder, the node embeddings of the three tours of the three vehicles in the current state are processed for route feature extraction, and the current locations and accumulated travel times of the three vehicles are processed for vehicle feature extraction; these are then concatenated to compute the probability of selecting a vehicle. With the vehicle selected in this example, the current node embedding and the current loading ability of this vehicle are first concatenated and linearly projected, then added to the graph embedding, and further used to compute the probability of selecting a node with a masked softmax. With the node selected in this example, the action $a_t$ is formed and the state is updated and transited to $s_{t+1}$.

Vehicle Selection Decoder. The vehicle selection decoder outputs a probability distribution for selecting a particular vehicle, which mainly leverages two embeddings, i.e., the vehicle feature embedding and the route feature embedding.

1) Vehicle Feature Embedding: To capture the state of each vehicle at the current step, we define the vehicle feature context $C_t^{v}$ at step $t$ as follows,

$$C_t^{v} = \big(\hat{s}_t^{1}, T_t^{1}, \hat{s}_t^{2}, T_t^{2}, \dots, \hat{s}_t^{m}, T_t^{m}\big),$$

where $\hat{s}_t^{j}$ denotes the 2-dim location of the last node in the partial route of vehicle $v_j$ at step $t$, and $T_t^{j}$ is the accumulated travel time of vehicle $v_j$ up to step $t$. Afterwards, the vehicle feature context is linearly projected with trainable parameters $W_1$ and $b_1$, and further processed by a 512-dim FF layer with ReLU activation function to engender the vehicle feature embedding $H_t^{v}$ at step $t$ as follows,

$$H_t^{v} = \mathrm{FF}\big(C_t^{v} W_1 + b_1\big).$$
2) Route Feature Embedding: The route feature embedding extracts information from the existing partial routes of all vehicles, which helps the policy network intrinsically learn from the nodes visited in previous steps, instead of simply masking them as in previous works [5, 37, 51, 24]. For each vehicle $v_j$ at step $t$, we define its route feature context $C_t^{r_j}$ as an arrangement of the node embeddings corresponding to the nodes in its partial route $G_t^{j}$. Specifically, the route feature context for each vehicle $v_j$ is defined as follows,

$$C_t^{r_j} = \big(\tilde{h}_1^{j}, \tilde{h}_2^{j}, \dots, \tilde{h}_t^{j}\big),$$

where $C_t^{r_j} \in \mathbb{R}^{t \times d_h}$ (the first dimension is of size $t$ since $G_t^{j}$ has $t$ elements at step $t$) and $\tilde{h}_k^{j}$ represents the node embedding in $h^N$ corresponding to the $k$-th node in the partial route of vehicle $v_j$. For example, assume $t = 3$ and the partial route of vehicle $v_1$ is $(x_0, x_2, x_2)$; then the route feature context of this vehicle at step 3 would be $(h_0^N, h_2^N, h_2^N)$. Afterwards, the route feature context of each vehicle is aggregated by max-pooling, and the results of all vehicles are concatenated to yield the route context for the whole fleet, which is further processed by a linear projection with trainable parameters $W_2$ and $b_2$ and a 512-dim FF layer to engender the route feature embedding $H_t^{r}$ at step $t$ as follows,

$$\tilde{H}_t^{j} = \max\big(C_t^{r_j}\big), \qquad H_t^{r} = \mathrm{FF}\big(\mathrm{Concat}(\tilde{H}_t^{1}, \dots, \tilde{H}_t^{m})\, W_2 + b_2\big).$$
Finally, the vehicle feature embedding and the route feature embedding are concatenated and linearly projected with parameters $W_3$ and $b_3$, and the result is processed by a softmax function to compute the probability vector as follows,

$$p_t^{v} = \mathrm{softmax}\big(\mathrm{Concat}(H_t^{v}, H_t^{r})\, W_3 + b_3\big),$$

where $p_t^{v} \in \mathbb{R}^{m}$ and its $j$-th element represents the probability of selecting vehicle $v_j$ at time step $t$. Depending on the decoding strategy, the vehicle can be selected either by greedily retrieving the maximum probability, or by sampling according to the vector $p_t^{v}$. The selected vehicle is then used as input to the node selection decoder.
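A simplified per-vehicle variant of this computation can be sketched as follows (the shapes and single shared projection are our own assumptions; the decoder described above concatenates the pooled routes of all vehicles before projecting):

```python
import numpy as np

def vehicle_probs(route_ctx, veh_feat, W, b):
    """route_ctx: list of (t, d) arrays, one per vehicle; veh_feat: (m, f)."""
    pooled = np.stack([r.max(axis=0) for r in route_ctx])  # max-pool each route
    ctx = np.concatenate([pooled, veh_feat], axis=1)       # (m, d + f)
    logits = (ctx @ W + b).ravel()                         # one score per vehicle
    e = np.exp(logits - logits.max())
    return e / e.sum()                                     # softmax over vehicles
```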

Node Selection Decoder. Given the node embeddings from the encoder and the vehicle $v_j$ selected by the vehicle selection decoder, the node selection decoder outputs a probability distribution over all unserved nodes (the nodes served in previous steps are masked), which is used to identify a node for the selected vehicle to visit. Similar to [24], we first define a context vector $C_t^{n}$ consisting of the graph embedding $\bar{h}^N$, the node embedding of the last (previous) node visited by the selected vehicle, and the remaining capacity of this vehicle,

$$C_t^{n} = \mathrm{Concat}\big(\bar{h}^N, \tilde{h}_t^{j}, o_t^{j}\big),$$

where the second element has the same meaning as the one defined in Eq. (22) and is replaced with trainable parameters for $t = 0$. The designed context vector highlights the features of the selected vehicle at the current decision step, with the graph embedding providing a global view of the instance. The context vector and the node embeddings are then fed into a multi-head attention (MHA) layer to synthesize a new context vector $\hat{C}_t^{n}$ as a glimpse of the node embeddings [50]. Different from the self-attention in the encoder, the query of this attention comes from the context vector, while the key/value of the attention come from the node embeddings, as shown below,

$$Q = C_t^{n} W^{Q}, \quad K = h^N W^{K}, \quad V = h^N W^{V}, \qquad \hat{C}_t^{n} = \mathrm{MHA}(Q, K, V),$$

where $W^{Q}$, $W^{K}$ and $W^{V}$ are trainable parameters similar to Eq. (17). We then generate the probability distribution by comparing the relation between the enhanced context $\hat{C}_t^{n}$ and the node embeddings through a compatibility layer. The compatibility $u_t$ of all nodes with the context at step $t$ is computed as follows,

$$u_t^{i} = C \cdot \tanh\!\left(\frac{\hat{C}_t^{n} W_Q \big(h_i^N W_K\big)^{T}}{\sqrt{d_k}}\right),$$

where $W_Q$ and $W_K$ are trainable parameters, and $C$ is set to 10 to control the entropy of $u_t$. Finally, the probability vector $p_t^{n}$ is computed in Eq. (31), where all nodes visited in previous steps are masked (i.e., $u_t^{i} = -\infty$) for feasibility, and element $p_t^{n,i}$ represents the probability of node $x_i$ being served by the selected vehicle at step $t$,

$$p_t^{n} = \mathrm{softmax}(u_t). \tag{31}$$

Similar to the decoding strategy for vehicle selection, the node can be selected either by always retrieving the maximum of $p_t^{n}$, or by sampling according to the vector $p_t^{n}$ in a less greedy manner.
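The compatibility layer with tanh clipping and masking can be sketched as follows (shapes are assumed, and the trainable projections are folded into the inputs for brevity):

```python
import numpy as np

def node_probs(context, node_emb, visited, C=10.0):
    """context: (d,) enhanced context; node_emb: (n, d); visited: (n,) bool."""
    scores = C * np.tanh(node_emb @ context / np.sqrt(len(context)))
    scores = np.where(visited, -np.inf, scores)  # mask served nodes
    e = np.exp(scores - scores[~visited].max())  # exp(-inf) -> 0 for masked
    return e / e.sum()
```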

input : Initial parameters θ for the policy network π_θ;
initial parameters θ' for the baseline network π_θ';
number of iterations E; iteration size I;
number of batches B; maximum training steps T;
significance α.
1  foreach iteration = 1, 2, ..., E do
2      Sample I problem instances randomly;
3      foreach batch = 1, 2, ..., B do
4          Retrieve the current batch;
5          foreach step t = 0, 1, ..., T-1 do
6              Pick an action a_t according to π_θ(a_t | s_t);
7              Observe reward r_t and next state s_{t+1};
8          end foreach
9          Compute the reward R of the constructed solutions;
10         GreedyRollout with baseline π_θ' and compute its reward R';
11         Compute the baseline-adjusted loss L(θ);
12         Compute the gradient ∇L(θ);
13         Update θ;
14     end foreach
15     if OneSidedPairedTTest(π_θ, π_θ') < α then
16         θ' ← θ;
17     end if
18 end foreach
ALGORITHM 1 Deep Reinforcement Learning Algorithm

IV-C Training Algorithm

The proposed deep reinforcement learning method is summarized in Algorithm 1, where we adopt the policy gradient with a rollout baseline to train the policy of vehicle selection and node selection for route construction. The policy gradient is characterized by two networks: 1) the policy network, i.e., the network described above, which picks an action and generates probability vectors for both vehicles and nodes at each decoding step; 2) the baseline network, a greedy rollout baseline with a similar structure to the policy network, which computes its reward by always selecting the vehicle and node with the maximum probability. A Monte Carlo method is applied to update the parameters and improve the policy iteratively. At each iteration, we construct routes for each problem instance and calculate the reward of the resulting solution in line 9, and the parameters of the policy network are updated in line 13. In addition, the expected reward of the baseline network comes from a greedy rollout of the policy in line 10. The parameters of the baseline network are replaced by those of the latest policy network if the latter significantly outperforms the former according to a paired t-test on several instances in line 15. By updating the two networks, the policy is improved iteratively towards finding higher-quality solutions.
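The parameter update described above can be sketched as follows, where `rollout` and `greedy_rollout` are hypothetical callables returning, respectively, a sampled solution's reward together with its log-probability gradient, and the baseline's greedy reward (a minimal illustration of policy gradient with a rollout baseline, not the actual implementation):

```python
def reinforce_with_baseline_step(theta, instances, rollout, greedy_rollout, lr=0.01):
    """One batch update of policy gradient with a greedy rollout baseline.
    For each instance, the baseline reward is subtracted from the sampled
    reward to form an advantage, which weights the log-probability gradient;
    the averaged gradient then updates the policy parameters."""
    grad = [0.0] * len(theta)
    for inst in instances:
        reward, grad_logp = rollout(theta, inst)    # sample a full solution
        baseline = greedy_rollout(theta, inst)      # greedy baseline reward
        advantage = reward - baseline               # variance-reduced signal
        for k in range(len(theta)):
            grad[k] += advantage * grad_logp[k]
    # gradient ascent on the expected reward, averaged over the batch
    return [t + lr * g / len(instances) for t, g in zip(theta, grad)]
```

When the sampled solution beats the baseline, the probability of the actions that produced it is pushed up; otherwise it is pushed down, which is exactly the role of the greedy rollout baseline.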

  • The iterations are increased as the problem size scales up, following the original paper.

  • The original papers use the same number of iterations for all problem sizes; we linearly increase the iterations as the problem size scales up.

TABLE I: Parameter Settings of Heuristic Methods.

V Computational Experiments

In this section, we conduct experiments to evaluate our DRL method. In particular, a heterogeneous fleet of fully loaded vehicles with different capacities starts at a depot node and departs to satisfy the demands of all customers by following certain routes, with the objective of minimizing the longest (min-max) or total (min-sum) travel time incurred by the vehicles. Moreover, we further verify our method by extending the experiments to benchmark instances from the CVRPLib [47]. Note that HCVRP with the min-max and min-sum objectives are both NP-hard problems, and the theoretical computational complexity grows exponentially as the problem size scales up.
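Given a candidate solution, the two objectives can be evaluated directly from the routes. The sketch below assumes Euclidean coordinates, the depot as node 0, and given per-vehicle speeds (all names are illustrative):

```python
import math

def route_time(route, coords, speed):
    """Travel time of one vehicle's route: depot -> customers -> depot."""
    tour = [0] + route + [0]                 # node 0 is the depot
    dist = sum(math.dist(coords[a], coords[b]) for a, b in zip(tour, tour[1:]))
    return dist / speed

def hcvrp_objectives(routes, coords, speeds):
    """Min-max (longest travel time) and min-sum (total travel time)
    objectives for a heterogeneous fleet, one route per vehicle."""
    times = [route_time(r, coords, s) for r, s in zip(routes, speeds)]
    return max(times), sum(times)
```

For example, with the depot at the origin and two single-customer routes, the min-max objective is the slower vehicle's round trip while the min-sum objective is the fleet total.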

V3-C40 V3-C60 V3-C80 V3-C100 V3-C120
Method Obj. Gap Time Obj. Gap Time Obj. Gap Time Obj. Gap Time Obj. Gap Time


SISR 4.00 0% 245s 5.58 0% 468s 7.27 0% 752s 8.89 0% 1135s 10.42 0% 1657s
VNS 4.17 4.25% 115s 5.80 3.94% 294s 7.57 4.13% 612s 9.20 3.49% 927s 10.81 3.74% 1378s
ACO 4.31 7.75% 209s 6.18 10.75% 317s 8.14 11.97% 601s 10.05 13.05% 878s 11.79 13.15% 1242s
FA 4.49 12.25% 168s 6.30 12.90% 285s 8.32 14.44% 397s 10.11 13.72% 522s 11.98 14.97% 667s
AM(Greedy) 4.85 21.25% 0.37s 6.57 17.74% 0.54s 8.32 14.44% 0.82s 9.98 12.26% 1.07s 11.63 11.61% 1.28s
AM(Sample1280) 4.36 9.00% 0.88s 5.99 7.39% 1.19s 7.73 6.33% 1.81s 9.36 5.29% 2.51s 10.94 4.99% 3.37s
AM(Sample12800) 4.31 7.75% 1.35s 5.92 6.09% 2.46s 7.66 5.36% 3.67s 9.28 4.39% 5.17s 10.85 4.13% 6.93s
DRL(Greedy) 4.45 11.25% 0.70s 6.08 8.96% 0.82s 7.82 7.57% 1.11s 9.42 5.96% 1.44s 10.98 5.37% 1.94s
DRL(Sample1280) 4.17 4.25% 1.25s 5.77 3.41% 1.43s 7.48 2.89% 2.25s 9.07 2.02% 3.42s 10.62 1.92% 4.52s
DRL(Sample12800) 4.14 3.50% 1.64s 5.73 2.69% 2.97s 7.44 2.34% 4.56s 9.02 1.46% 6.65s 10.56 1.34% 8.78s


Exact-solver 55.43* 0% 71s 78.47* 0% 214s 102.42* 0% 793s 124.61* 0% 2512s - - -
SISR 55.79 0.65% (254s) 79.12 0.83% (478s) 103.41 0.97% (763s) 126.19 1.27% (1140s) 149.10 0% (1667s)
VNS 57.54 3.81% 109s 81.44 3.78% 291s 106.18 3.67% 547s 129.32 3.78% 828s 152.56 2.32% 1217s
ACO 60.11 8.44% 196s 86.05 9.66% 302s 113.75 11.06% 593s 140.61 12.84% 859s 166.50 11.67% 1189s
FA 59.94 8.14% 164s 85.36 8.78% 272s 112.81 10.14% 388s 138.92 11.48% 518s 164.53 10.35% 653s
AM(Greedy) 66.54 20.04% 0.49s 91.19 16.21% 0.83s 117.22 14.45% 1.01s 141.14 13.27% 1.23s 164.57 10.38% 1.41s
AM(Sample1280) 60.95 9.96% 0.92s 85.74 9.26% 1.17s 111.78 9.14% 1.79s 135.61 8.83% 2.49s 159.11 6.71% 3.30s
AM(Sample12800) 60.26 8.71% 1.35s 84.96 8.27% 2.31s 110.94 8.32% 3.61s 134.72 8.11% 5.19s 158.19 6.10% 6.86s
DRL(Greedy) 58.99 6.42% 0.61s 83.06 5.85% 1.02s 108.44 5.88% 1.11s 131.75 5.73% 1.56s 154.56 3.66% 1.96s
DRL(Sample1280) 57.05 2.92% 1.18s 80.46 2.54% 1.49s 105.29 2.80% 2.34s 128.63 3.23% 3.38s 151.23 1.43% 4.61s
DRL(Sample12800) 56.84 2.54% 1.65s 79.92 1.85% 2.99s 104.63 2.16% 4.63s 128.19 2.87% 6.74s 150.73 1.09% 9.11s
  • The mark () indicates that the time is computed based on a JAVA implementation which is publicly available. For VNS, ACO and FA, we re-implement them in Python since their original code is not available, whereas C++ or JAVA was used in the original papers, so the reported times may differ from those in the original papers.

  • The mark * indicates that all instances are solved optimally.

TABLE II: DRL Method v.s. Baselines for Three Vehicles (V3).
V5-C80 V5-C100 V5-C120 V5-C140 V5-C160
Method Obj. Gap Time Obj. Gap Time Obj. Gap Time Obj. Gap Time Obj. Gap Time


SISR 3.90 0% (727s) 4.72 0% (1091s) 5.48 0% (1572s) 6.33 0% (1863s) 7.16 0% (2521s)
VNS 4.15 6.41% 725s 4.98 7.19% 1046s 5.81 6.02% 1454s 6.67 5.37% 2213s 7.53 5.17% 3321s
ACO 4.50 15.38% 612s 5.56 17.80% 890s 6.47 18.07% 1285s 7.52 18.80% 2081s 8.51 18.85% 2898s
FA 4.61 18.21% 412s 5.62 19.07% 541s 6.58 20.07% 682s 7.60 20.06% 822s 8.64 20.67% 964s
AM(Greedy) 4.84 24.10% 1.08s 5.70 20.76% 1.31s 6.57 19.89% 1.74s 7.49 18.33% 1.93s 8.34 16.48% 2.15s
AM(Sample1280) 4.32 10.77% 1.88s 5.18 8.75% 2.64s 6.03 10.04% 3.38s 6.93 9.48% 4.47s 7.75 8.24% 5.73s
AM(Sample12800) 4.25 8.97% 3.71s 5.11 8.26% 5.19s 5.95 8.58% 6.94s 6.86 8.37% 8.73s 7.69 7.40% 10.69s
DRL(Greedy) 4.36 11.79% 1.29s 5.20 10.17% 1.64s 5.94 8.39% 2.38s 6.78 7.11% 2.43s 7.61 6.28% 3.02s
DRL(Sample1280) 4.08 4.62% 2.66s 4.91 4.03% 3.66s 5.66 3.28% 5.08s 6.51 2.84% 6.48s 7.34 2.51% 8.52s
DRL(Sample12800) 4.04 3.59% 5.06s 4.87 3.18% 7.20s 5.62 2.55% 9.65s 6.47 2.21% 10.93s 7.30 1.96% 13.76s


Exact-solver 102.42* 0% 1787s 124.63* 0% 6085s - - - - - - - - -
SISR 103.49 1.04% (735s) 126.35 1.38% (1107s) 149.18 0% (1580s) 172.88 0% (1881s) 196.51 0% (2539s)
VNS 109.91 7.31% 538s 133.28 6.94% 811s 156.37 4.82% 1386s 180.08 4.16% 2080s 203.95 3.79% 2896s
ACO 118.58 15.78% 608s 146.51 17.56% 865s 171.82 15.18% 1269s 200.73 16.11% 1922s 229.64 16.86% 2803s
FA 116.13 13.39% 401s 142.39 14.25% 532s 167.87 12.53% 677s 196.48 13.65% 801s 223.49 13.73% 955s
AM(Greedy) 128.31 25.28% 0.82s 152.91 22.69% 1.28s 177.39 18.91% 1.45s 201.85 16.76% 1.69s 227.10 15.57% 1.81s
AM(Sample1280) 119.41 16.59% 1.83s 144.23 15.73% 2.66s 168.95 13.25% 3.63s 193.65 12.01% 4.68s 218.67 11.28% 5.49s
AM(Sample12800) 118.04 15.25% 3.74s 142.79 14.57% 5.20s 167.45 12.25% 7.02s 192.13 11.13% 8.93s 217.14 10.50% 11.01s
DRL(Greedy) 108.43 5.87% 1.26s 131.90 5.83% 1.73s 154.71 3.71% 2.11s 178.78 3.41% 3.06s 202.87 3.24% 3.60s
DRL(Sample1280) 105.54 3.05% 2.70s 128.63 3.21% 4.15s 151.39 1.48% 5.37s 175.29 1.39% 6.83s 199.16 1.35% 8.68s
DRL(Sample12800) 104.88 2.40% 5.38s 128.17 2.84% 7.61s 150.86 1.26% 9.24s 174.80 1.11% 11.30s 198.66 1.09% 13.83s
TABLE III: DRL Method v.s. Baselines for Five Vehicles (V5).

V-a Experiment Settings for HCVRP

We describe the settings and the data generation method for our experiments, where we mainly follow the classic ways in [5, 37, 51, 24]. Pertaining to MM-HCVRP, the coordinates of the depot and customers are randomly sampled within the unit square using the uniform distribution. The demands of customers are discrete numbers randomly chosen from a given set (the demand of the depot is 0). To comprehensively verify the performance, we consider two settings of heterogeneous fleets. The first fleet consists of three heterogeneous vehicles (named V3), whose capacities are set to 20, 25, and 30, respectively. The second fleet consists of five heterogeneous vehicles (named V5), whose capacities are set to 20, 25, 30, 35, and 40, respectively. Our method is evaluated with different customer sizes for the two fleets, where we consider 40, 60, 80, 100 and 120 for V3, and 80, 100, 120, 140 and 160 for V5. In MM-HCVRP, we set the speed of all vehicles to 1.0 for simplicity; however, our method is capable of coping with different speeds, which is verified in MS-HCVRP. Pertaining to MS-HCVRP, most of the settings are the same as for MM-HCVRP except for the vehicle speeds, which are inversely proportional to the capacities. In doing so, it prevents the vehicle with the largest capacity from being selected to serve all customers so as to minimize the total travel time. In particular, the speeds are set inversely proportional to the respective capacities of V3 and V5.
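The instance generation described above can be sketched as follows. The exact demand set and the speed normalization are elided in the text, so `demand_choices` and the `min(capacities)/c` scaling are assumptions for illustration:

```python
import random

def generate_instance(n_customers, capacities, demand_choices=range(1, 10), seed=None):
    """Random HCVRP instance following the paper's setup: depot and customer
    coordinates uniform in the unit square, discrete customer demands (the
    demand set is an assumption here), depot demand 0, and speeds inversely
    proportional to capacity for the min-sum variant (normalization assumed)."""
    rng = random.Random(seed)
    # node 0 is the depot, nodes 1..n are customers
    coords = [(rng.random(), rng.random()) for _ in range(n_customers + 1)]
    demands = [0] + [rng.choice(list(demand_choices)) for _ in range(n_customers)]
    speeds = [min(capacities) / c for c in capacities]  # larger capacity -> slower
    return coords, demands, speeds
```

With capacities (20, 25, 30) this yields speeds proportional to (1/20, 1/25, 1/30), discouraging the largest vehicle from serving every customer under the min-sum objective.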

The hyperparameters are shared to train the policy for all problem sizes. Similar to [24], the training instances are randomly generated on the fly with an iteration size of 1,280,000 and are split into 2,500 batches for each iteration. Pertaining to the number of iterations, more iterations normally lead to better performance. However, if after a certain number of iterations the improvement is no longer significant, we can stop the training before full convergence, which still delivers competitive, although not the best, performance. For example, regarding the model of V5-C160 (5 vehicles and 160 customers) with the min-max objective trained for 50 iterations, 5 more iterations reduce the gap by less than 0.03%, so we stop the training. In our experiments, we use 50 iterations for all problem sizes to demonstrate the effectiveness of our method, while more iterations could be adopted for better performance in practice. The features of nodes and vehicles are embedded into a 128-dimensional space before being fed into the vehicle selection and node selection decoders, and we set the dimension of the hidden layers in the decoders to 128 [5, 37, 24]. In addition, the Adam optimizer is employed to train the policy parameters, with the initial learning rate decayed by a factor of 0.995 per iteration for convergence. The norm of all gradient vectors is clipped to be within 3.0, and the significance level in Section IV-C is set to 0.05. Each iteration consumes an average training time of 31.61m (minutes) and 70.52m (with a single 2080Ti GPU), 93.02m and 143.62m (with two GPUs), and 170.14m (with three GPUs) for problem sizes 40, 60, 80, 100 and 120 of V3, and 105.25m and 135.49m (with two GPUs), and 189.15m, 264.45m and 346.52m (with three GPUs) for problem sizes 80, 100, 120, 140 and 160 of V5. Pertaining to testing, 1,280 instances are randomly generated from the uniform distribution for each problem size, and are fixed for our method and the baselines. Our DRL code in PyTorch is publicly available.
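The gradient clipping and per-iteration learning-rate decay mentioned above amount to the following plain-Python sketch (the base learning rate is illustrative, since its value is elided in the text):

```python
def clip_grad_norm(grads, max_norm=3.0):
    """Rescale the gradient vector so its L2 norm is at most max_norm,
    matching the clipping threshold of 3.0 used in training."""
    norm = sum(g * g for g in grads) ** 0.5
    if norm > max_norm:
        grads = [g * max_norm / norm for g in grads]
    return grads

def decayed_lr(base_lr, iteration, decay=0.995):
    """Learning rate after exponential decay of 0.995 per iteration."""
    return base_lr * decay ** iteration
```

In a PyTorch implementation the same effect would typically be obtained with a gradient-clipping utility and an exponential learning-rate scheduler; the functions above only show the arithmetic.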


V-B Comparison Analysis of HCVRP

For the MM-HCVRP, it is prohibitively time-consuming to find optimal solutions, especially for large problem sizes. Therefore, we adopt a variety of improved classical heuristic methods as baselines, which include: 1) Slack Induction by String Removals (SISR) [12], a state-of-the-art heuristic method for CVRP and its variants; 2) Variable Neighborhood Search (VNS), an efficient heuristic method for solving the consistent VRP [60]; 3) Ant Colony Optimization (ACO), an improved version of the ant colony system for solving HCVRP with time windows [39], where we run the solution construction for all ants in parallel to reduce computation time; 4) Firefly Algorithm (FA), an improved version of the standard FA method for solving the heterogeneous fixed fleet vehicle routing problem [32]; and 5) the state-of-the-art DRL based attention model (AM) [24], which learns a policy of node selection to construct a solution for TSP and CVRP. We adapt the objectives and relevant settings of all baselines so that they share the same objective as MM-HCVRP. We have fine-tuned the parameters of the conventional heuristic methods using grid search [7] over the adjustable parameters in their original works, such as the number of shifted points in the shaking process, the discounting rate of the pheromones and the scale of the population, and report the best ones in Table I. Regarding the iterations, we linearly increase the original ones for VNS, ACO and FA as the problem size scales up for better performance, while the original settings adopt identical iterations across all problem sizes. For SISR, we follow its original setting, where the iterations are increased as the problem size grows. To fairly compare with AM, we tentatively leverage two external strategies to select a vehicle at each decoding step for AM, i.e., by turns and randomly, since it does not cope with vehicle selection originally. The results indicate that vehicle selection by turns is better for AM, which is thereby adopted for both min-max and min-sum objectives in our experiments. Note that we do not compare with OR-Tools as there is no built-in library or function that can directly solve MM-HCVRP. Moreover, we do not compare with Gurobi or CPLEX either, as our experience shows that they consume days to optimally solve a MM-HCVRP instance even with 3 vehicles and 15 customers. For the MS-HCVRP, with the same heuristic baselines and AM as used for the MM-HCVRP, we additionally adopt a generic exact solver for vehicle routing problems with the min-sum objective [40]. The baselines VNS, ACO, and FA are implemented in Python, while for SISR we adopt a publicly available JAVA implementation.
Note that the running efficiency of the same algorithm implemented in C++, JAVA, and Python can differ considerably, which will also be analyzed later for the running time comparison: a program implemented in C/C++ might be 20 to 50 times faster than its Python counterpart, especially for large-scale problem instances, while the efficiency of JAVA can be comparable to C/C++ with highly optimized code but is slightly slower in general. All these baselines are executed on CPU servers equipped with the Intel i9-10940X CPU at 3.30 GHz. For those which consume much longer running time, we deploy them on multiple identical servers.
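The "by turns" strategy used to adapt AM, which has no vehicle-selection decoder of its own, can be sketched as a capacity-aware round robin (our illustrative reading, not AM's code; all names are hypothetical):

```python
def vehicle_by_turns(step, loads, capacities):
    """Pick the next vehicle cyclically, skipping vehicles that are
    already full; node selection is then delegated to AM's own decoder."""
    m = len(capacities)
    for k in range(m):
        v = (step + k) % m
        if loads[v] < capacities[v]:
            return v
    return step % m  # all vehicles full: caller should route back to depot
```

A random strategy would instead draw a vehicle uniformly at each step; the experiments above found the cyclic variant to perform better for AM.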

Regarding our DRL method and AM, we apply two types of decoding during testing: 1) Greedy, which always selects the vehicle and node with the maximum probability at each decoding step; 2) Sampling, which generates solutions by sampling according to the probabilities computed in Eq. (27) and Eq. (31), and then retrieves the best one. We set the number of sampled solutions to 1280 and 12800, and term them Sample1280 and Sample12800, respectively. We then record the performance of our methods and baselines on all sizes of MM-HCVRP and MS-HCVRP instances for both three vehicles and five vehicles in Table II and Table III, respectively, including the average objective value, gap and computation time per instance. Given that it is prohibitively time-consuming to optimally solve MM-HCVRP, the gap is calculated by comparing the objective value of a method with the best one found among all methods.
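The sampling decoder amounts to a best-of-N search over stochastic rollouts, sketched below with a hypothetical `decode` callable that returns an (objective, solution) pair per rollout:

```python
import random

def sample_best(decode, n_samples=1280, seed=0):
    """Sampling strategy: draw n_samples complete solutions from the
    stochastic policy and keep the one with the lowest objective value."""
    rng = random.Random(seed)
    return min(decode(rng) for _ in range(n_samples))
```

Increasing `n_samples` from 1280 to 12800 can only improve (never worsen) the retrieved solution at the cost of roughly tenfold decoding time, which matches the trend between Sample1280 and Sample12800 in the tables.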

Fig. 4: The convergence curve of the DRL method and SISR (V3).
Fig. 5: The convergence curve of the DRL method and SISR (V5).

From Table II, we observe that for the MS-HCVRP with three vehicles, the Exact-solver achieves the smallest objective value and gap, and consumes shorter computation time than the heuristic methods on V3-C40 and V3-C60. However, its computation time grows exponentially as the problem size scales up. We do not show the results of the Exact-solver on instances with more than 100 customers for either three or five vehicles, as it consumes prohibitively long time. Among the three variants of our DRL method, DRL(Greedy) outperforms FA and AM(Greedy) in terms of objective value and gap. Although with slightly longer computation time, both Sample1280 and Sample12800 achieve smaller objective values and gaps than Greedy, which demonstrates the effectiveness of the sampling strategy in improving solution quality. Specifically, our DRL(Sample1280) outstrips all AM variants and ACO. It also outperforms VNS in most cases except for V3-C40, where DRL(Sample1280) achieves the same gap as VNS. With Sample12800, our DRL further outperforms VNS in terms of objective value and gap, and is only slightly inferior to the state-of-the-art heuristic SISR and the Exact-solver. Pertaining to running efficiency, although the computation time of our DRL method is slightly longer than that of AM, it is significantly shorter (at least an order of magnitude faster) than that of the conventional methods, even if we eliminate the impact of different programming languages by roughly dividing the reported running time by a constant (e.g., 30), especially for large problem sizes. Regarding the MM-HCVRP, our DRL method similarly outperforms VNS, ACO, FA and all AM variants, and performs slightly inferior to SISR while consuming much shorter running time.
Among all heuristic and learning based methods, the state-of-the-art SISR achieves the lowest objective value and gap; however, its computation time grows almost exponentially as the problem scale increases, whereas that of our DRL(Sample12800) grows almost linearly, which is more obvious for large-scale problems.

In Table III, similar patterns can be observed as with three vehicles, where the superiority of DRL(Sample12800) over VNS, ACO, FA and AM becomes more obvious. Meanwhile, our DRL method remains competitive with the state-of-the-art SISR on larger problem sizes in comparison with Table II. Combining both tables, our DRL method with Sample12800 achieves better overall performance than the conventional heuristics and AM on both the MM-HCVRP and MS-HCVRP, and also performs competitively against SISR, with satisfactory computation time.

To further investigate the efficiency of our DRL method against SISR, we evaluate their performance with a bounded time budget, i.e., 500 seconds, which is much longer than the computation time of our method, given that our DRL constructs a solution rather than improving one iteratively as SISR does. In Fig. 4, we record the performance of our DRL method and SISR for MS-HCVRP with three vehicles on the same instances as in Table II and Table III, where the horizontal axis refers to the computation time and the vertical axis to the objective value. We depict the computation time of our DRL method using a hollow circle, and extend it horizontally for better comparison. We plot the curve of SISR over time since it improves an initial yet complete solution iteratively, and mark the time when SISR achieves the same objective value as our method using a filled circle. We can observe that SISR needs longer computation time to catch up with the DRL method as the problem size scales up. When the computation time reaches 500 seconds for SISR, our DRL method achieves only slightly inferior objective values with much shorter computation time. For example, the DRL method needs only 9.1 seconds to solve a V3-C120 instance, while SISR needs about 453 seconds to achieve the same objective value. In Fig. 5, we record the results of our DRL method and SISR for five vehicles, also with 500 seconds. Similar patterns to those of three vehicles can be observed, where the superiority of our DRL method is more obvious, especially for large-scale problems. For example, on V5-C140 and V5-C160, our DRL method with 11.3 and 13.84 seconds even outperforms SISR with 500 seconds, respectively. Combining all results above, we conclude that within a relatively short time limit, our DRL method tends to achieve better performance than the state-of-the-art SISR, and its superiority is more obvious for larger-scale problems.
Even with a time limit much longer than 500 seconds, our DRL method still achieves competitive performance against SISR.

Fig. 6: Generalization performance for MM-HCVRP (V3).
Fig. 7: Generalization performance for MS-HCVRP (V3).
Fig. 8: Generalization performance for MM-HCVRP (V5).
Fig. 9: Generalization performance for MS-HCVRP (V5).

V-C Generalization Analysis of HCVRP

To verify the generalization of our method, we conduct experiments that apply the policy learnt for one customer size to larger ones, since generalizing to larger customer sizes is more meaningful in real-life situations. We mainly focus on Sample12800 in our method.

In Fig. 6, we record the results of the MM-HCVRP for the fleet of three vehicles, where the horizontal axis refers to the problems to be solved and the vertical axis to the average objective values of different methods. We observe that for each customer size, the corresponding policy achieves the smallest objective values in comparison with those learnt for other sizes. Nevertheless, the latter policies still outperform AM and the classical heuristic methods, except that the policy learnt for V3-C40 is only comparable with the best performing baseline (i.e., VNS) on problem sizes larger than or equal to 80, if we refer to Table II. Moreover, we also notice that policies learnt for proximal customer sizes tend to perform better than those for more distant ones, e.g., the policies for V3-C80 and V3-C100 perform better than those for V3-C40 and V3-C60 in solving V3-C120. The rationale behind this observation might be that proximal customer sizes lead to similar distributions of customer locations. In Fig. 7, we record the results of the MS-HCVRP for the fleet of three vehicles, where similar patterns can be found as for the MM-HCVRP, and the policies learnt for other customer sizes outperform all the classical heuristic methods and AM.

In Fig. 8 and Fig. 9, we record the generalization performance of the MM-HCVRP and MS-HCVRP for five vehicles, respectively. Similar patterns to the three-vehicle case can be observed, where for each customer size the corresponding policy achieves smaller objective values than those learnt for other sizes. Nevertheless, the latter policies still outperform most classical heuristic methods and AM in all cases, and are only slightly inferior to the corresponding policies.

Instance Opt. Ours VNS SISR

MM-HCVRP, uniform distribution
P-n60-k10 - 306 308 293
A-n61-k9 - 319 307 299
E-n76-k7 - 372 375 362
A-n80-k10 - 795 813 776
E-n101-k8 - 446 455 428
Avg. Gap - 4.11% 4.49% 0%

MM-HCVRP, non-uniform distribution
B-n41-k6 - 385 371 359
B-n51-k7 - 392 378 369
B-n63-k10 - 564 558 540
M-n101-k10 - 419 401 391
CMT11 - 878 869 858
Avg. Gap - 5.48% 2.59% 0%

MS-HCVRP, uniform distribution
P-n60-k10 4009* 4045 4265 4013
A-n61-k9 3984* 4041 4252 3995
E-n76-k7 4740* 5035 5222 4947
A-n80-k10 11149* 11454 11466 11186
E-n101-k8 5653* 5972 6114 5727
Avg. Gap 0% 2.08% 5.51% 1.28%

MS-HCVRP, non-uniform distribution
B-n41-k6 4948* 5327 5015 4948
B-n51-k7 5235* 5434 5363 5236
B-n63-k10 7706* 7805 7825 7727
M-n101-k10 5443* 5707 5687 5507
CMT11 - 12526 12183 11910
Avg. Gap 0% 4.55% 2.42% 0.29%
  • CMT11 has 1 depot and 120 customers (121 nodes).
  • The mark * indicates optimal objective values obtained by the Exact-solver.

TABLE IV: Our Method v.s. Baselines on CVRPLib.

V-D Discussion

To comprehensively evaluate the performance of our DRL method, we further apply our trained models to solve instances randomly selected from the well-known CVRPLib benchmark, a widely used online repository of VRP instances for algorithm comparison in the VRP literature (more details can be found in [47]). We select 10 instances and adapt them to our MM-HCVRP and MS-HCVRP settings by adopting their customer locations and demands, half of which follow the uniform distribution regarding the customer locations, while the remaining half do not.

In Table IV, we record the comparison results on CVRPLib, where the Exact-solver is used to optimally solve the MS-HCVRP. Regarding the DRL method, we directly exploit the trained models from Table II and Table III to solve the CVRPLib instances, where the model with the closest size to the instance is adopted. For example, we use the model trained for V3-C60 to solve B-n63-k10. We select SISR and VNS as baselines for both MM-HCVRP and MS-HCVRP, as they perform better than the other heuristics in the previous experiments. Each reported objective value is averaged over 10 independent runs with different random seeds. Note that it takes prohibitively long for the Exact-solver to solve MS-HCVRP instances with more than 100 customers (i.e., CMT11). From Table IV, we observe that our DRL method tends to perform better than VNS on uniformly distributed instances, and slightly inferior on instances of non-uniform distribution for both MM-HCVRP and MS-HCVRP. Although inferior to SISR, our DRL method is able to engender solutions of comparable quality with much shorter computation time. For example, SISR consumes 1598 seconds to solve CMT11, while our DRL method needs only 9.0 seconds. We also notice that our DRL method tends to perform better on uniformly distributed instances than on non-uniform ones, if we refer to the gap between our method and the exact method for MS-HCVRP and SISR for MM-HCVRP.

This observation about different distributions makes sense given that, as described in Section V-A, the customer locations in all training instances follow the uniform distribution, a setting widely adopted in this line of research (e.g., [5, 37, 24, 9, 58]). Since our DRL model is a learning method in nature, it has favorable potential to deliver superior performance when both the training and testing instances come from the same (or a similar) uniform distribution. This also explains why our DRL method outperforms most of the conventional heuristic methods in Table II and Table III. When it comes to non-uniform distributions at test time, this superiority does not necessarily hold, as indicated by the results in Table IV. However, this is a fundamental out-of-distribution challenge for all learning methods, including our DRL method. The purpose of Table IV is to reveal when our DRL method may perform inferior to others. Considering that addressing the out-of-distribution challenge is beyond the scope of this paper, we will investigate it in future work.

Vi Conclusion and Future Work

In this paper, we cope with the heterogeneous CVRP for both min-max and min-sum objectives. To solve this problem, we propose a learning based constructive heuristic, which integrates deep reinforcement learning and an attention mechanism to learn a policy for route construction. Specifically, the policy network leverages an encoder, a vehicle selection decoder and a node selection decoder to pick a vehicle and a node for this vehicle at each step. Experimental results show that the overall performance of our method is superior to most of the conventional heuristics and competitive with the state-of-the-art heuristic method, i.e., SISR, with much shorter computation time. With comparable computation time, our method also significantly outperforms the other learning based method. Moreover, the proposed method generalizes well to problems with larger numbers of customers for both MM-HCVRP and MS-HCVRP.

One major purpose of our work is to nourish the development of deep reinforcement learning (DRL) based methods for solving vehicle routing problems, which have emerged lately. Following the same setting adopted in this line of works [5, 37, 24, 9, 51], we randomly generate the locations within a square of [0,1] for training and testing. The proposed method works well for HCVRP with both min-max and min-sum objectives, but may perform inferior for other types of VRPs, such as VRP with time window constraints and dynamic customer requests. Taking into account the above concerns and other potential limitations of our method, in future work we will study the following aspects: 1) time window constraints, and dynamic customer requests or stochastic traffic conditions; 2) generalization to different numbers of vehicles; 3) evaluation on other classical or realistic benchmark datasets with instances of different distributions; and 4) improvement over SISR by integrating with active search [5] or other improvement approaches (e.g., [56]).

Vii Acknowledgement

This work is supported by the National Natural Science Foundation of China (Grant No. 61803104, 62102228), and Young Scholar Future Plan of Shandong University (Grant No. 62420089964188).


  • [1] C. Alabas-Uslu (2008) A self-tuning heuristic for a multi-objective vehicle routing problem. Journal of the Operational Research Society 59 (7), pp. 988–996. Cited by: §I.
  • [2] W. Bai, Q. Zhou, T. Li, and H. Li (2019) Adaptive reinforcement learning neural network control for uncertain nonlinear system with input saturation. IEEE Transactions on Cybernetics 50 (8), pp. 3433–3443. Cited by: §III-B.
  • [3] R. Baldacci and A. Mingozzi (2009) A unified exact method for solving different classes of vehicle routing problems. Mathematical Programming 120 (2), pp. 347–380. Cited by: §I, §II.
  • [4] R. Bellman (1957) A markovian decision process. Journal of Mathematics and Mechanics, pp. 679–684. Cited by: §III-B.
  • [5] I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio (2017) Neural combinatorial optimization with reinforcement learning. In International Conference on Learning Representations, Cited by: §I, §II, §IV-B, §V-A, §V-A, §V-D, §VI.
  • [6] L. Bertazzi, B. Golden, and X. Wang (2015) Min–max vs. min–sum vehicle routing: a worst-case analysis. European Journal of Operational Research 240 (2), pp. 372–381. Cited by: §I.
  • [7] J. A. Brito, F. E. McNeill, C. E. Webber, and D. R. Chettle (2005) Grid search: an innovative method for the estimation of the rates of lead exchange between body compartments. Journal of Environmental Monitoring 7 (3), pp. 241–247. Cited by: §V-B.
  • [8] Q. Chen, H. Zhao, W. Li, P. Huang, and W. Ou (2019) Behavior sequence transformer for e-commerce recommendation in alibaba. In Proceedings of the 1st International Workshop on Deep Learning Practice for High-Dimensional Sparse Data, pp. 1–4. Cited by: §IV-B.
  • [9] X. Chen and Y. Tian (2019) Learning to perform local rewriting for combinatorial optimization. In Advances in Neural Information Processing Systems, pp. 6278–6289. Cited by: §I, §II, §V-D, §VI.
  • [10] R. Cheng, Y. Jin, M. Olhofer, et al. (2016) Test problems for large-scale multiobjective and many-objective optimization. IEEE Transactions on Cybernetics 47 (12), pp. 4108–4121. Cited by: §II.
  • [11] R. Cheng and Y. Jin (2014) A competitive swarm optimizer for large scale optimization. IEEE Transactions on Cybernetics 45 (2), pp. 191–204. Cited by: §II.
  • [12] J. Christiaens and G. Vanden Berghe (2020) Slack induction by string removals for vehicle routing problems. Transportation Science 54 (2), pp. 417–433. Cited by: §V-B.
  • [13] G. B. Dantzig and J. H. Ramser (1959) The truck dispatching problem. Management Science 6 (1), pp. 80–91. Cited by: §II.
  • [14] S. Duran, M. A. Gutierrez, and P. Keskinocak (2011) Pre-positioning of emergency items for care international. Interfaces 41 (3), pp. 223–237. Cited by: §I.
  • [15] L. Feng, L. Zhou, A. Gupta, J. Zhong, Z. Zhu, K. Tan, and K. Qin (2019) Solving generalized vehicle routing problem with occasional drivers via evolutionary multitasking. IEEE Transactions on Cybernetics, pp. 1–14. Cited by: §II.
  • [16] P. M. França, M. Gendreau, G. Laporte, and F. M. Müller (1995) The m-traveling salesman problem with minmax objective. Transportation Science 29 (3), pp. 267–275. Cited by: §I, §II.
  • [17] B. Golden, A. Assad, L. Levy, and F. Gheysens (1984) The fleet size and mix vehicle routing problem. Computers & Operations Research 11 (1), pp. 49–66. Cited by: §I, §II.
  • [18] B. L. Golden, G. Laporte, and É. D. Taillard (1997) An adaptive memory heuristic for a class of vehicle routing problems with minmax objective. Computers & Operations Research 24 (5), pp. 445–452. Cited by: §II.
  • [19] M. Haimovich and A. H. Rinnooy Kan (1985) Bounds and heuristics for capacitated routing problems. Mathematics of Operations Research 10 (4), pp. 527–542. Cited by: §I, §I.
  • [20] E. K. Hashi, M. R. Hasan, and M. S. U. Zaman (2016) GIS based heuristic solution of the vehicle routing problem to optimize the school bus routing and scheduling. In International Conference on Computer and Information Technology (ICCIT), pp. 56–60. Cited by: §I.
  • [21] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §IV-B.
  • [22] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456. Cited by: §IV-B.
  • [23] Ç. Koç, T. Bektaş, O. Jabali, and G. Laporte (2015) A hybrid evolutionary algorithm for heterogeneous fleet vehicle routing problems with time windows. Computers & Operations Research 64, pp. 11–27. Cited by: §I.
  • [24] W. Kool, H. van Hoof, and M. Welling (2018) Attention, learn to solve routing problems!. In International Conference on Learning Representations, Cited by: §I, §II, §IV-B, §IV-B, §IV-B, §IV-B, §V-A, §V-A, §V-B, §V-D, §VI.
  • [25] G. Li, L. Zhu, P. Liu, and Y. Yang (2019) Entangled transformer for image captioning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8928–8937. Cited by: §IV-B.
  • [26] J. Li, L. Xin, Z. Cao, A. Lim, W. Song, and J. Zhang (2021) Heterogeneous attentions for solving pickup and delivery problem via deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems. Cited by: §I.
  • [27] X. Li, P. Tian, and Y. Aneja (2010) An adaptive memory programming metaheuristic for the heterogeneous fixed fleet vehicle routing problem. Transportation Research Part E: Logistics and Transportation Review 46 (6), pp. 1111–1127. Cited by: §I.
  • [28] D. Liu, X. Yang, D. Wang, and Q. Wei (2015) Reinforcement-learning-based robust controller design for continuous-time uncertain nonlinear systems subject to input constraints. IEEE Transactions on Cybernetics 45 (7), pp. 1372–1385. Cited by: §III-B.
  • [29] X. Liu, H. Qi, and Y. Chen (2006) Optimization of special vehicle routing problem based on ant colony system. In International Conference on Intelligent Computing, pp. 1228–1233. Cited by: §I.
  • [30] J. Lysgaard, A. N. Letchford, and R. W. Eglese (2004) A new branch-and-cut algorithm for the capacitated vehicle routing problem. Mathematical Programming 100 (2), pp. 423–445. Cited by: §I, §I.
  • [31] X. Ma, Y. Song, and J. Huang (2010) Min-max robust optimization for the wounded transfer problem in large-scale emergencies. In Chinese Control and Decision Conference, pp. 901–904. Cited by: §I.
  • [32] P. Matthopoulos and S. Sofianopoulou (2019) A firefly algorithm for the heterogeneous fixed fleet vehicle routing problem. International Journal of Industrial and Systems Engineering 33 (2), pp. 204–224. Cited by: §V-B.
  • [33] H. Modares, I. Ranatunga, F. L. Lewis, and D. O. Popa (2015) Optimized assistive human–robot interaction using reinforcement learning. IEEE Transactions on Cybernetics 46 (3), pp. 655–667. Cited by: §III-B.
  • [34] N. Mostafa and A. Eltawil (2017) Solving the heterogeneous capacitated vehicle routing problem using k-means clustering and valid inequalities. In Proceedings of the International Conference on Industrial Engineering and Operations Management, Cited by: §I.
  • [35] K. S. V. Narasimha and M. Kumar (2011) Ant colony optimization technique to solve the min-max single depot vehicle routing problem. In Proceedings of the American Control Conference, pp. 3257–3262. Cited by: §II.
  • [36] K. V. Narasimha, E. Kivelevitch, B. Sharma, and M. Kumar (2013) An ant colony optimization technique for solving min–max multi-depot vehicle routing problem. Swarm and Evolutionary Computation 13, pp. 63–73. Cited by: §II.
  • [37] M. Nazari, A. Oroojlooy, L. Snyder, and M. Takác (2018) Reinforcement learning for solving the vehicle routing problem. In Advances in Neural Information Processing Systems, pp. 9839–9849. Cited by: §I, §II, §IV-B, §V-A, §V-A, §V-D, §VI.
  • [38] T. T. Nguyen, N. D. Nguyen, and S. Nahavandi (2020) Deep reinforcement learning for multiagent systems: a review of challenges, solutions, and applications. IEEE Transactions on Cybernetics 50 (9), pp. 3826–3839. Cited by: §III-B.
  • [39] A. Palma-Blanco, E. R. González, and C. D. Paternina-Arboleda (2019) A two-pheromone trail ant colony system approach for the heterogeneous vehicle routing problem with time windows, multiple products and product incompatibility. In International Conference on Computational Logistics, pp. 248–264. Cited by: §V-B.
  • [40] A. Pessoa, R. Sadykov, E. Uchoa, and F. Vanderbeck (2020) A generic exact solver for vehicle routing and related problems. Mathematical Programming 183 (1), pp. 483–523. Cited by: §V-B.
  • [41] C. Prins (2002) Efficient heuristics for the heterogeneous fleet multitrip vrp with application to a large-scale real case. Journal of Mathematical Modelling and Algorithms 1 (2), pp. 135–150. Cited by: §II.
  • [42] W. Qin, Z. Zhuang, Z. Huang, and H. Huang (2021) A novel reinforcement learning-based hyper-heuristic for heterogeneous vehicle routing problem. Computers & Industrial Engineering 156. Cited by: §II.
  • [43] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, and P. Jiang (2019) BERT4Rec: sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1441–1450. Cited by: §IV-B.
  • [44] J. F. Sze, S. Salhi, and N. Wassan (2017) The cumulative capacitated vehicle routing problem with min-sum and min-max objectives: an effective hybridisation of adaptive variable neighbourhood search and large neighbourhood search. Transportation Research Part B: Methodological 101, pp. 162–184. Cited by: §II.
  • [45] W. Y. Szeto, Y. Wu, and S. C. Ho (2011) An artificial bee colony algorithm for the capacitated vehicle routing problem. European Journal of Operational Research 215 (1), pp. 126–135. Cited by: §I, §I.
  • [46] Y. Tian, X. Zheng, X. Zhang, and Y. Jin (2019) Efficient large-scale multiobjective optimization based on a competitive swarm optimizer. IEEE Transactions on Cybernetics 50 (8), pp. 3696–3708. Cited by: §II.
  • [47] E. Uchoa, D. Pecin, A. Pessoa, M. Poggi, T. Vidal, and A. Subramanian (2017) New benchmark instances for the capacitated vehicle routing problem. European Journal of Operational Research 257 (3), pp. 845–858. Cited by: §V, footnote 4.
  • [48] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §I, §IV-B, §IV-B.
  • [49] J. M. Vera and A. G. Abad (2019) Deep reinforcement learning for routing a heterogeneous fleet of vehicles. In IEEE Latin American Conference on Computational Intelligence (LA-CCI), pp. 1–6. Cited by: §II.
  • [50] O. Vinyals, S. Bengio, and M. Kudlur (2015) Order matters: sequence to sequence for sets. arXiv preprint arXiv:1511.06391. Cited by: §IV-B.
  • [51] O. Vinyals, M. Fortunato, and N. Jaitly (2015) Pointer networks. In Advances in Neural Information Processing Systems, pp. 2692–2700. Cited by: §II, §IV-B, §V-A, §VI.
  • [52] H. Wang, T. Huang, X. Liao, H. Abu-Rub, and G. Chen (2016) Reinforcement learning for constrained energy trading games with incomplete information. IEEE Transactions on Cybernetics 47 (10), pp. 3404–3416. Cited by: §III-B.
  • [53] X. Wang, B. Golden, E. Wasil, and R. Zhang (2016) The min–max split delivery multi-depot vehicle routing problem with minimum service time requirement. Computers & Operations Research 71, pp. 110–126. Cited by: §I.
  • [54] X. Wang, S. Poikonen, and B. Golden (2017) The vehicle routing problem with drones: several worst-case results. Optimization Letters 11 (4), pp. 679–697. Cited by: §I.
  • [55] Y. Wen, J. Si, A. Brandt, X. Gao, and H. H. Huang (2019) Online reinforcement learning control for the personalization of a robotic knee prosthesis. IEEE Transactions on Cybernetics 50 (6), pp. 2346–2356. Cited by: §III-B.
  • [56] Y. Wu, W. Song, Z. Cao, J. Zhang, and A. Lim (2021) Learning improvement heuristics for solving routing problems. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §VI.
  • [57] J. Xiao, T. Zhang, J. Du, and X. Zhang (2019) An evolutionary multiobjective route grouping-based heuristic algorithm for large-scale capacitated vehicle routing problems. IEEE Transactions on Cybernetics, pp. 1–14. Cited by: §II.
  • [58] L. Xin, W. Song, Z. Cao, and J. Zhang (2020) Step-wise deep learning models for solving routing problems. IEEE Transactions on Industrial Informatics 17 (7), pp. 4861–4871. Cited by: §IV-B, §V-D.
  • [59] L. Xin, W. Song, Z. Cao, and J. Zhang (2021) Multi-decoder attention model with embedding glimpse for solving vehicle routing problems. In Proceedings of 35th AAAI Conference on Artificial Intelligence, Cited by: §I.
  • [60] Z. Xu and Y. Cai (2018) Variable neighborhood search for consistent vehicle routing problem. Expert Systems with Applications 113, pp. 66–76. Cited by: §V-B.
  • [61] E. Yakıcı (2017) A heuristic approach for solving a rich min-max vehicle routing problem with mixed fleet and mixed demand. Computers & Industrial Engineering 109, pp. 288–294. Cited by: §I, §II.
  • [62] J. Yu, J. Li, Z. Yu, and Q. Huang (2019) Multimodal transformer with multi-view visual representation for image captioning. IEEE Transactions on Circuits and Systems for Video Technology 30 (12), pp. 4467–4480. Cited by: §IV-B.